Electronic device and method for controlling spatial design based on audio

The method employs a head-mounted device with sensors and AI to efficiently collect user preferences and generate virtual objects, addressing inefficiencies in spatial design processes by reducing time and repetition.

WO2026127587A1 | PCT designated stage | Publication Date: 2026-06-18 | SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
Filing Date: 2025-12-09
Publication Date: 2026-06-18

IPC: G06F3/04815; G06F3/0484; G02B27/01; G06F3/00; G06F3/16; G10L15/26; G06F3/01

AI Tagging

Application Domain

Input/output for user-computer interaction; Sound input/output

Abstract

An electronic device of the present disclosure comprises: a frame on which glass including a display is mounted; a wearing structure coupled to the frame to allow the frame to be seated on the head of a user; at least one sensor; a camera including a depth camera; a microphone; a speaker; at least one processor; and a memory for storing instructions, wherein the instructions stored in the memory, when executed individually or collectively by the at least one processor, may instruct the electronic device to receive a voice input related to a space through the microphone, obtain an image and depth information related to the space through the depth camera on the basis of the voice input, convert voice collected through the microphone into text, generate a virtual object to be disposed in the space on the basis of the converted text, and display the virtual object on the display.

Description

Electronic device and method for controlling spatial design based on audio

[0001] The present disclosure relates to an electronic device and a method for controlling spatial design based on audio.

[0002] Spatial design refers to the work of designing and organizing physical spaces, such as interiors, architecture, and landscaping, based on aesthetics, functionality, and user experience.

[0003] Until a spatial design is finalized, time is wasted on repeated design and review cycles between users and designers, beginning with the collection of user preferences.

[0004] The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing may be applied as prior art related to the present disclosure.

[0005] The electronic device of the present disclosure may include a frame mounted with glass including a display.

[0006] The electronic device of the present disclosure may include a wearing structure coupled to the frame so that the frame is seated on the user's head.

[0007] The electronic device of the present disclosure may include at least one sensor.

[0008] The electronic device of the present disclosure may include a camera including a depth camera.

[0009] The electronic device of the present disclosure may include a microphone.

[0010] The electronic device of the present disclosure may include a speaker.

[0011] The electronic device of the present disclosure may include at least one processor.

[0012] The instructions stored in the memory of the present disclosure, when executed individually or collectively by the at least one processor, can cause the electronic device to receive a voice input related to space through the microphone.

[0013] The instructions stored in the memory of the present disclosure, when executed individually or collectively by the at least one processor, can cause the electronic device to acquire an image and depth information related to the space through the depth camera based on the voice input.

[0014] The instructions stored in the memory of the present disclosure, when executed individually or collectively by the at least one processor, can cause the electronic device to convert voice collected through the microphone into text.

[0015] When the instructions stored in the memory of the present disclosure are executed individually or collectively by the at least one processor, the electronic device may generate a virtual object to be placed in space based on the converted text.

[0016] The instructions stored in the memory of the present disclosure, when executed individually or collectively by the at least one processor, can cause the electronic device to display the virtual object on the display.

[0017] A spatial design method using an electronic device of the present disclosure may include the operation of receiving voice input regarding spatial design through a microphone.

[0018] A spatial design method using the electronic device of the present disclosure may include the operation of receiving a voice input related to the space through a microphone.

[0019] A spatial design method using an electronic device of the present disclosure may include an operation of acquiring an image and depth information related to the space through a depth camera based on the voice input.

[0020] A spatial design method using the electronic device of the present disclosure may include an operation of converting voice collected through the microphone into text.

[0021] A spatial design method using an electronic device of the present disclosure may include the operation of creating a virtual object to be placed in space based on the converted text.

[0022] A spatial design method using the electronic device of the present disclosure may include an operation of displaying the virtual object on a display.
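The operations in paragraphs [0017] to [0022] can be read as a linear pipeline. The following is a minimal sketch of that ordering only, not the disclosed implementation: the names `SpaceCapture`, `VirtualObject`, and `run_spatial_design`, and the stub callables passed in, are hypothetical placeholders for the depth camera, speech-to-text, object generation, and display stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SpaceCapture:
    """Hypothetical container for what the depth camera returns."""
    image: List[List[int]]      # placeholder image pixels
    depth: List[List[float]]    # placeholder per-pixel depth values


@dataclass
class VirtualObject:
    """Hypothetical virtual object to be placed in the space."""
    description: str
    position: Tuple[float, float, float]


def run_spatial_design(
    audio: bytes,
    capture_space: Callable[[], SpaceCapture],
    speech_to_text: Callable[[bytes], str],
    generate_object: Callable[[str, SpaceCapture], VirtualObject],
    show: Callable[[VirtualObject], None],
) -> VirtualObject:
    """Run the stages in the order the description lists them."""
    capture = capture_space()             # image + depth, triggered by the voice input
    text = speech_to_text(audio)          # convert the collected voice into text
    obj = generate_object(text, capture)  # virtual object based on the converted text
    show(obj)                             # display the virtual object
    return obj


# Usage with stub stages (all stubs are assumptions for illustration).
displayed: List[VirtualObject] = []
result = run_spatial_design(
    audio=b"place a sofa by the window",
    capture_space=lambda: SpaceCapture(image=[[0]], depth=[[2.5]]),
    speech_to_text=lambda a: a.decode("utf-8"),
    generate_object=lambda text, cap: VirtualObject(text, (0.0, 0.0, cap.depth[0][0])),
    show=displayed.append,
)
```

Passing each stage in as a callable keeps the sketch independent of any particular speech recognizer, generative model, or display driver, none of which the text above specifies.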

[0023] In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components.

[0024] FIG. 1 is a block diagram of an electronic device in a network environment according to one embodiment of the present disclosure.

[0025] FIG. 2a is a drawing showing the front view of an electronic device according to one embodiment of the present disclosure.

[0026] FIG. 2b is a drawing showing the back side of an electronic device according to one embodiment of the present disclosure.

[0027] FIG. 3 is a drawing showing an electronic device according to one embodiment of the present disclosure.

[0028] FIG. 4 is a drawing showing a display, an eye-tracking camera, and a waveguide according to one embodiment of the present disclosure.

[0029] FIG. 5 is a flowchart illustrating a method for controlling an audio-based spatial design of an electronic device according to one embodiment of the present disclosure.

[0030] FIG. 6 is a flowchart specifically illustrating the operation of generating candidate images according to one embodiment of the present disclosure.

[0031] FIG. 8 is a diagram illustrating a method for controlling an audio-based spatial design of an electronic device according to one embodiment of the present disclosure.

[0032] FIG. 9 is a diagram illustrating a method for controlling an audio-based spatial design of an electronic device according to one embodiment of the present disclosure.

[0033] FIG. 10 is a drawing illustrating a method for controlling an audio-based spatial design of an electronic device according to one embodiment of the present disclosure.

[0034] FIGS. 11a, 11b, and 11c are drawings illustrating a method for controlling an audio-based spatial design of an electronic device according to one embodiment of the present disclosure.

[0035] FIGS. 12a, 12b, 12c, 12d, 12e, and 12f are drawings illustrating a method for controlling an audio-based spatial design of an electronic device according to one embodiment of the present disclosure.

[0036] FIG. 13 is a flowchart illustrating a method for controlling an audio-based spatial design of an electronic device according to one embodiment of the present disclosure.

[0037] The electronic device and the audio-based method for controlling spatial design of the present disclosure can iteratively design a space according to the user's requirements and preferences, using a head-mounted display device and artificial intelligence (AI).

[0038] The electronic device and the audio-based method for controlling spatial design of the present disclosure can reduce the time required for spatial design by iteratively designing the space according to the user's requirements and preferences.

[0039] FIG. 1 is a block diagram of an electronic device (100) in a network environment according to one embodiment of the present disclosure.

[0040] Referring to FIG. 1, in a network environment, an electronic device (100) may communicate with an electronic device (102) through a first network (198) (e.g., a short-range wireless communication network) or with at least one of an electronic device (104) or a server (108) through a second network (199) (e.g., a long-range wireless communication network).

[0041] In one embodiment, the electronic device (100) can communicate with the electronic device (104) through the server (108).

[0042] In one embodiment, the electronic device (100) may include a processor (120), memory (130), input circuit (150), sound output circuit (155), display (160), audio circuit (170), sensor (176), interface (177), connection terminal (178), haptic circuit (179), camera (180), power management circuit (188), battery (189), communication circuit (190), subscriber identification circuit (196), or antenna (197).

[0043] In one embodiment, the processor (120) may include at least one processor. The processor (120) may include a processing circuit.

[0044] In one embodiment, the electronic device (100) may have at least one of these components (e.g., connection terminal (178)) omitted, or one or more other components added.

[0045] In one embodiment, some of the components of the electronic device (100) (e.g., sensor (176), camera (180), or antenna (197)) may be integrated into a single component (e.g., display (160)).

[0046] In one embodiment, the processor (120) can control at least one other component (e.g., hardware or software component) of the electronic device (100) connected to the processor (120) by executing software (e.g., program (140)), and can perform various data processing or operations.

[0047] In one embodiment, as at least part of the data processing or operation, the processor (120) may store commands or data received from other components (e.g., a sensor (176) or a communication circuit (190)) in a volatile memory (132), process the commands or data stored in the volatile memory (132), and store the resulting data in a non-volatile memory (134).

[0048] In one embodiment, the processor (120) may include a main processor (121) (e.g., a central processing unit or an application processor) or an auxiliary processor (123) that can operate independently or together with it (e.g., a graphics processing unit, a neural processing unit (NPU), an image signal processor, a sensor hub processor, or a communication processor). For example, if the electronic device (100) includes a main processor (121) and an auxiliary processor (123), the auxiliary processor (123) may be configured to use less power than the main processor (121) or to be specialized for a designated function. The auxiliary processor (123) may be implemented separately from the main processor (121) or as part thereof.

[0049] In one embodiment, the auxiliary processor (123) can control at least some of the functions or states associated with at least one component of the electronic device (100) (e.g., display (160), sensor (176), or communication circuit (190)) on behalf of the main processor (121) while the main processor (121) is in an inactive (e.g., sleep) state, or together with the main processor (121) while the main processor (121) is in an active (e.g., application execution) state.

[0050] In one embodiment, the auxiliary processor (123) (e.g., image signal processor or communication processor) may be implemented as part of other functionally related components (e.g., camera (180) or communication circuit (190)).

[0051] In one embodiment, the auxiliary processor (123) (e.g., a neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be generated through machine learning. Such learning may be performed, for example, on the electronic device (100) itself where the artificial intelligence model is executed, or through a separate server (e.g., a server (108)). The learning algorithm may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but is not limited to the examples described above. The artificial intelligence model may include a plurality of artificial neural network layers. An artificial neural network may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of the above, but is not limited to the examples described above. The artificial intelligence model may, additionally or alternatively, include a software structure in addition to the hardware structure.

[0052] In one embodiment, the memory (130) may store various data used by at least one component of the electronic device (100) (e.g., processor (120) or sensor (176)). The data may include, for example, input data or output data for software (e.g., program (140)) and related instructions. The memory (130) may include volatile memory (132) or non-volatile memory (134).

[0053] In one embodiment, the program (140) may be stored as software in memory (130) and may include, for example, an operating system (142), middleware (144), or an application (146).

[0054] In one embodiment, the input circuit (150) may receive commands or data to be used for a component of the electronic device (100) (e.g., processor (120)) from outside the electronic device (100) (e.g., user). The input circuit (150) may include, for example, a microphone, a mouse, a keyboard, a key (e.g., a button), or a digital pen (e.g., a stylus pen).

[0055] In one embodiment, the sound output circuit (155) may output a sound signal to the outside of the electronic device (100). The sound output circuit (155) may include, for example, a speaker or a receiver. The speaker may be used for general purposes, such as multimedia playback or recording playback. The receiver may be used to receive incoming calls. In one embodiment, the receiver may be implemented separately from the speaker or as part thereof.

[0056] In one embodiment, the display (160) can visually provide information to the outside (e.g., a user) of the electronic device (100). The display (160) may include, for example, a display, a holographic device, or a projector, and a control circuit for controlling the corresponding device.

[0057] In one embodiment, the display (160) may include a touch sensor set to detect a touch, or a pressure sensor set to measure the intensity of the force generated by the touch.

[0058] In one embodiment, the audio circuit (170) can convert sound into an electrical signal or, conversely, convert an electrical signal into sound.

[0059] In one embodiment, the audio circuit (170) may acquire sound through the input circuit (150) or output sound through the sound output circuit (155) or an external electronic device (e.g., electronic device (102)) (e.g., speaker, headphones, case, or phone) that is directly or wirelessly connected to the electronic device (100).

[0060] In one embodiment, the sensor (176) may detect the operating state of the electronic device (100) (e.g., power or temperature) or the external environmental state (e.g., user state) and generate an electrical signal or data value corresponding to the detected state. In one embodiment, the sensor (176) may include, for example, a gesture sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an accelerometer sensor, a grip sensor, a proximity sensor, a color sensor, an IR (infrared) sensor, a biosensor, a temperature sensor, a humidity sensor, or an illuminance sensor.

[0061] In one embodiment, the sensor (176) may include at least one of an IR sensor, an RGB (red green blue) sensor, or an image sensor.

[0062] In one embodiment, the interface (177) may support one or more specified protocols that can be used for the electronic device (100) to be connected directly or wirelessly to an external electronic device (e.g., electronic device (102)).

[0063] In one embodiment, the interface (177) may include, for example, an HDMI (high definition multimedia interface), a USB (universal serial bus) interface, an SD card interface, or an audio interface.

[0064] In one embodiment, the electronic device (102) may be the same or a different type of device as the electronic device (100).

[0065] In one embodiment, the electronic device (102) may include at least some of the components included in the electronic device (100). The electronic device (102) may include, for example, memory, a processor, a battery, or a power management circuit. The memory included in the electronic device (102) may store instructions, data, or programs.

[0066] In one embodiment, all or part of the operations performed in the electronic device (100) may be performed in the electronic device (102). For example, when the electronic device (100) needs to perform a function or service automatically or in response to a request from a user or another device, the electronic device (100) may request one or more external electronic devices (e.g., electronic device (102)) to perform at least part of the function or service instead of, or in addition to, performing the function or service itself. Upon receiving the request, the one or more external electronic devices (e.g., electronic device (102)) may perform at least part of the requested function or service, or additional functions or services related to the request, and transmit the result of the execution to the electronic device (100). The electronic device (100) may provide the result, as is or after additional processing, as at least part of the response to the request. For example, the electronic device (102) may render content data executed in an application and transmit it to the electronic device (100), and the electronic device (100) that receives the data can output the content data to the display (160). If the electronic device (100) detects user movement through an IMU sensor or the like, the processor (120) of the electronic device (100) can correct the rendering data received from the electronic device (102) based on the movement information and output it to the display (160). Alternatively, the electronic device (100) can transmit the movement information to the electronic device (102) and request rendering so that the screen data is updated accordingly.
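The movement-based correction of received rendering data is not detailed in the text above. As one possible illustration, the sketch below assumes a deliberately simple model: a yaw change reported by the IMU is converted into a horizontal pixel shift and applied to the received frame. The functions `yaw_to_pixel_shift` and `correct_frame`, the linear angle-to-pixel mapping, and the fill value are all assumptions, not the disclosed method.

```python
from typing import List

Frame = List[List[int]]  # hypothetical frame: rows of pixel values


def yaw_to_pixel_shift(delta_yaw_deg: float, horizontal_fov_deg: float, width: int) -> int:
    """Map a yaw change to a pixel offset, assuming pixels are spread linearly over the FOV."""
    return round(delta_yaw_deg / horizontal_fov_deg * width)


def correct_frame(frame: Frame, shift_px: int, fill: int = 0) -> Frame:
    """Shift each row horizontally.

    A positive shift (head turned right) moves content to the left and pads
    the right edge with `fill`; a negative shift does the opposite.
    """
    corrected: Frame = []
    for row in frame:
        width = len(row)
        if shift_px >= 0:
            k = min(shift_px, width)
            corrected.append(row[k:] + [fill] * k)
        else:
            k = min(-shift_px, width)
            corrected.append([fill] * k + row[:width - k])
    return corrected


# Usage: a 4-pixel-wide frame, 90 degree FOV, head turned 22.5 degrees to the right.
received = [[1, 2, 3, 4]]
shift = yaw_to_pixel_shift(22.5, 90.0, 4)
shown = correct_frame(received, shift)
```

In this toy model the padded edge is simply filled with a constant; the alternative described above, sending the movement information back and requesting a fresh render, avoids that gap at the cost of a round trip.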

[0067] In one embodiment, the electronic device (102) may be a device of various forms, such as a case device capable of storing and charging the electronic device (100).

[0068] In one embodiment, the connection terminal (178) may include a connector through which the electronic device (100) can be physically connected to an external electronic device (e.g., electronic device (102)).

[0069] In one embodiment, the connection terminal (178) may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).

[0070] In one embodiment, the haptic circuit (179) can convert an electrical signal into a mechanical stimulus (e.g., vibration or movement) or an electrical stimulus that a user can perceive through tactile or kinesthetic senses. In one embodiment, the haptic circuit (179) may include, for example, a motor, a piezoelectric element, or an electric stimulation device.

[0071] In one embodiment, the camera (180) can capture still images and video. In one embodiment, the camera (180) may include one or more lenses, image sensors, image signal processors, or flashes.

[0072] In one embodiment, the electronic device (100) may include at least one camera (180). For example, the at least one camera (180) included in the electronic device (100) may include a camera that acquires real-world images facing outward from the electronic device (100) and a camera that tracks the eyeball of the wearer of the electronic device (100).

[0073] In one embodiment, the power management circuit (188) can manage power supplied to the electronic device (100). The power management circuit (188) can be implemented, for example, as at least part of a power management integrated circuit (PMIC).

[0074] In one embodiment, the battery (189) can supply power to at least one component of the electronic device (100). In one embodiment, the battery (189) may include, for example, a non-rechargeable primary battery, a rechargeable secondary battery, or a fuel cell.

[0075] In one embodiment, the communication circuit (190) may support the establishment of a direct (e.g., wired) communication channel or a wireless communication channel between an electronic device (100) and an external electronic device (e.g., electronic device (102), electronic device (104), or server (108)), and the performance of communication through the established communication channel. The communication circuit (190) may include one or more communication processors that operate independently of the processor (120) (e.g., application processor) and support direct (e.g., wired) communication or wireless communication.

[0076] In one embodiment, the communication circuit (190) may include a wireless communication circuit (192) (e.g., a cellular communication circuit, a short-range wireless communication circuit, or a GNSS (global navigation satellite system) communication circuit) or a wired communication circuit (194) (e.g., a LAN (local area network) communication circuit, or a power line communication circuit). The corresponding communication circuit among these communication circuits may communicate with an external electronic device (104) via a first network (198) (e.g., a short-range communication network such as Bluetooth, WiFi (wireless fidelity) direct, or IrDA (infrared data association)) or a second network (199) (e.g., a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or WAN)). These various types of communication circuits may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips). The wireless communication circuit (192) can identify or authenticate an electronic device (100) within a communication network, such as a first network (198) or a second network (199), using subscriber information (e.g., International Mobile Subscriber Identifier (IMSI)) stored in a subscriber identification circuit (196).

[0077] In one embodiment, the wireless communication circuit (192) can support a 5G network following a 4G network and next-generation communication technology, for example, new radio (NR) access technology. The NR access technology can support high-speed transmission of high-capacity data (enhanced mobile broadband (eMBB)), minimization of terminal power and connection of multiple terminals (massive machine type communications (mMTC)), or high reliability and low latency (ultra-reliable and low-latency communications (URLLC)). The wireless communication circuit (192) can support a high-frequency band (e.g., mmWave band) to achieve a high data transmission rate, for example. The wireless communication circuit (192) can support various technologies for securing performance in the high-frequency band, such as beamforming, massive MIMO (multiple-input and multiple-output), full-dimensional MIMO (FD-MIMO), array antenna, analog beam-forming, or large-scale antenna. The wireless communication circuit (192) can support various requirements specified in the electronic device (100), external electronic device (e.g., electronic device (104)), or network system (e.g., second network (199)).

[0078] In one embodiment, the wireless communication circuit (192) may support a peak data rate (e.g., 20 Gbps or more) for eMBB realization, loss coverage (e.g., 164 dB or less) for mMTC realization, or U-plane latency (e.g., downlink (DL) and uplink (UL) each 0.5 ms or less, or round trip 1 ms or less) for URLLC realization.
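The example thresholds in paragraph [0078] can be written as simple predicates. This is only a sketch for reading the numbers: the `LinkMetrics` type and the three function names are hypothetical, and the thresholds are the example values quoted above rather than a complete statement of any standard.

```python
from dataclasses import dataclass


@dataclass
class LinkMetrics:
    """Hypothetical measurements for one link."""
    peak_data_rate_gbps: float
    loss_coverage_db: float
    dl_latency_ms: float
    ul_latency_ms: float
    round_trip_ms: float


def meets_embb(m: LinkMetrics) -> bool:
    # Peak data rate of 20 Gbps or more.
    return m.peak_data_rate_gbps >= 20.0


def meets_mmtc(m: LinkMetrics) -> bool:
    # Loss coverage of 164 dB or less.
    return m.loss_coverage_db <= 164.0


def meets_urllc(m: LinkMetrics) -> bool:
    # DL and UL each 0.5 ms or less, or round trip 1 ms or less.
    each_ok = m.dl_latency_ms <= 0.5 and m.ul_latency_ms <= 0.5
    return each_ok or m.round_trip_ms <= 1.0


# Usage with made-up measurements.
sample = LinkMetrics(
    peak_data_rate_gbps=21.0,
    loss_coverage_db=170.0,
    dl_latency_ms=0.6,
    ul_latency_ms=0.3,
    round_trip_ms=0.9,
)
summary = (meets_embb(sample), meets_mmtc(sample), meets_urllc(sample))
```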

[0079] In one embodiment, the antenna (197) can transmit a signal or power to the outside (e.g., an external electronic device) or receive a signal or power from the outside.

[0080] In one embodiment, the antenna (197) may include an antenna comprising a radiator made of a conductor or a conductive pattern formed on a substrate (e.g., PCB).

[0081] In one embodiment, the antenna (197) may include a plurality of antennas (e.g., array antennas). In this case, at least one antenna suitable for a communication method used in a communication network such as a first network (198) or a second network (199) may be selected from the plurality of antennas, for example, by a communication circuit (190). A signal or power may be transmitted or received between the communication circuit (190) and an external electronic device through the selected at least one antenna.
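The selection step in paragraph [0081] amounts to filtering the available antennas by the communication method in use. A minimal sketch follows, assuming each antenna advertises the methods it supports; the `Antenna` type, the method labels, and `select_antennas` are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Antenna:
    """Hypothetical antenna descriptor."""
    name: str
    supported_methods: FrozenSet[str]


def select_antennas(antennas: List[Antenna], method: str) -> List[Antenna]:
    """Return every antenna suitable for the given communication method."""
    return [a for a in antennas if method in a.supported_methods]


# Usage with made-up antennas and method labels.
available = [
    Antenna("ant0", frozenset({"bluetooth", "wifi-direct"})),
    Antenna("ant1", frozenset({"5g-mmwave"})),
    Antenna("ant2", frozenset({"5g-mmwave", "legacy-cellular"})),
]
chosen = [a.name for a in select_antennas(available, "5g-mmwave")]
```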

[0082] In one embodiment, in addition to the radiator, other components (e.g., RFIC (radio frequency integrated circuit)) may be additionally formed as part of the antenna (197).

[0083] In one embodiment, the antenna (197) can form a mmWave antenna circuit.

[0084] In one embodiment, the mmWave antenna circuit may include a printed circuit board, an RFIC disposed on or adjacent to a first surface (e.g., bottom surface) of the printed circuit board and capable of supporting a specified high frequency band (e.g., mmWave band), and a plurality of antennas (e.g., array antennas) disposed on or adjacent to a second surface (e.g., top surface or side surface) of the printed circuit board and capable of transmitting or receiving a signal of the specified high frequency band.

[0085] In one embodiment, at least some of the components included in the electronic device (100) may be connected to each other via a communication method between peripheral devices (e.g., bus, GPIO (general purpose input and output), SPI (serial peripheral interface), or MIPI (mobile industry processor interface)) and may exchange signals (e.g., commands or data) with each other.

[0086] In one embodiment, commands or data may be transmitted or received between the electronic device (100) and an external electronic device (104) through a server (108) connected to a second network (199). Each of the external electronic devices (102 or 104) may be the same or a different type of device as the electronic device (100).

[0087] In one embodiment, all or part of the operations performed on the electronic device (100) may be performed on one or more external electronic devices (102, 104, or 108). For example, when the electronic device (100) needs to perform a function or service automatically or in response to a request from a user or another device, the electronic device (100) may request one or more external electronic devices to perform at least part of the function or service instead of, or in addition to, performing the function or service itself. Upon receiving the request, the one or more external electronic devices may perform at least part of the requested function or service, or additional functions or services related to the request, and transmit the result of the execution to the electronic device (100). The electronic device (100) may provide the result, as is or after additional processing, as at least part of the response to the request. For this purpose, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used. The electronic device (100) can provide ultra-low latency services, for example, by using distributed computing or mobile edge computing. In another embodiment, the external electronic device (104) may include an Internet of Things (IoT) device. The server (108) may be an intelligent server using machine learning and/or neural networks.

[0088] In one embodiment, an external electronic device (104) or server (108) may be included within the second network (199). The electronic device (100) may be applied to intelligent services (e.g., smart home, smart city, smart car, or healthcare) based on 5G communication technology and IoT-related technology.

[0089] FIG. 2a is a drawing showing the front view of an electronic device (100) according to one embodiment of the present disclosure.

[0090] FIG. 2b is a drawing showing the back side of an electronic device (100) according to one embodiment of the present disclosure.

[0091] Referring to FIG. 2a and FIG. 2b, the electronic device (100) can be worn on a part of the user's body to provide a user interface.

[0092] In one embodiment, the electronic device (100) may output photos and / or images to the user. Alternatively, the electronic device (100) may provide images related to augmented reality services and / or virtual reality services. For example, the electronic device (100) may provide the user with an experience of augmented reality, virtual reality, mixed reality, and / or extended reality.

[0093] For example, the electronic device (100) can provide augmented reality to the user. The electronic device (100) can transmit a virtual object image output from a display (160) toward the user's eyes, and the virtual object image can utilize data on a real world image captured through a plurality of cameras (230a, 230b, 230c).

[0094] In one embodiment, the electronic device (100) may be, for example, a head-mounted display (HMD) or a face-mounted display (FMD), or may be smart glasses or a headset that provide extended reality such as augmented reality (AR), virtual reality (VR), or mixed reality, but is not limited thereto.

[0095] In one embodiment, the electronic device (100) may include at least some of a housing (201), a plurality of cameras (230a, 230b, 230c) and a display (160).

[0096] In one embodiment, the electronic device (100) may include a housing (201). The housing (201) may be configured to accommodate at least one component. The housing (201) may include a first surface (211a) (e.g., front), a second surface (211b) opposite to the first surface (211a) (e.g., rear or wearing surface), and a third surface (211c) (e.g., side surface) between the first surface (211a) and the second surface (211b).

[0097] In one embodiment, the housing (201) may include a bridge (214). The bridge (214) may be configured to face a part of the user's body (e.g., nose). For example, the bridge (214) may be supported by the user's nose.

[0098] In one embodiment, the housing (201) may correspond to the main body of the electronic device (100). The housing (201) may be identical to the main body of the electronic device (100). The housing (201) may include the main body of the electronic device (100).

[0099] In one embodiment, the housing (201) can be mounted on the user's head by means of a wearing structure such as temples or a strap.

[0100] In one embodiment, the electronic device (100) may include a lens structure (210, 220). The lens structure (210, 220) may include a plurality of lenses configured to adjust the focus of an image provided to a user. For example, the plurality of lenses may be configured to adjust the focus of an image output by a display (160). The plurality of lenses may be positioned at a location corresponding to the position of the display (160). The plurality of lenses may include, for example, a Fresnel lens, a pancake lens, a multichannel lens, and / or any other suitable lens.

[0101] In one embodiment, the display (160) may be positioned at a location corresponding to the lens structure (210, 220).

[0102] In one embodiment, the electronic device (100) may include a display (160). The display (160) may be configured to provide an image (e.g., a virtual image) to a user. For example, the display (160) may include a liquid crystal display (LCD), a digital micromirror device (DMD), a liquid crystal on silicon (LCoS), an organic light emitting diode (OLED), and / or a micro light emitting diode (micro LED).

[0103] In one embodiment, when the display (160) includes at least one of a liquid crystal display (LCD), a digital micromirror device (DMD), or a liquid crystal on silicon (LCoS) device, the electronic device (100) may include a light source that irradiates light to a screen output area of the display (160).

[0104] In one embodiment, if the display (160) can generate light on its own, for example, if the display (160) includes at least one of an organic light-emitting diode or a micro LED, the electronic device (100) can provide a good quality virtual image to the user without including a separate light source.

[0105] In one embodiment, if the display (160) includes an organic light-emitting diode or a micro LED, a light source is unnecessary, so the electronic device (100) can be made lighter. The electronic device (100) may include a display (160) and at least one transparent member. A user may use the electronic device (100) while wearing it on their face. At least one transparent member may be formed of a glass plate, a plastic plate, or a polymer and may be made transparent or translucent.

[0106] In one embodiment, at least one transparent member may be positioned facing the user's right or left eye.

[0107] In one embodiment, when the display (160) is transparent, it can be positioned to face the user's eyes to form a screen display.

[0108] In one embodiment, the display (160) may include a light source (not shown) configured to transmit a light signal to an area where an image is output.

[0109] In one embodiment, the display (160) can provide an image to the user by generating an optical signal itself.

[0110] In one embodiment, the display (160) may be positioned on a second surface (211b) of the housing (201). For example, one surface of a pair of lenses of the display (160) may be positioned so as to be exposed to the outside through the second surface (211b).

[0111] In one embodiment, the display (160) may be composed of organic light emitting diodes (OLEDs). For example, the OLED can express red (R), green (G), and blue (B) through the self-luminescence of the organic material. However, it is not limited thereto, and a single pixel may include R, G, and B, and a single chip may be implemented with multiple pixels including R, G, and B.

[0112] In one embodiment, the display (160) can display various images. Here, the image is a concept that includes still images and video, and the display (160) can display various images such as broadcast content, multimedia content, etc. Additionally, the display (160) may display a user interface (UI) and icons.

[0113] In one embodiment, the display (160) includes a separate IC chip, and the IC chip can display an image based on an image signal received from the processor (120). In one embodiment, the IC chip can display an image by generating a driving signal for a plurality of light-emitting elements based on an image signal received from the processor (120) and controlling the light emission of a plurality of pixels included in the display panel based on the driving signal.

[0114] In one embodiment, the display (160) may include a plurality of pixels for displaying a virtual image. The display (160) may further include infrared pixels that emit infrared light.

[0115] In one embodiment, the display (160) may further include a light-receiving pixel (e.g., a photo sensor pixel) disposed between pixels, which receives light reflected from the user's eye, converts it into electrical energy, and outputs it. The light-receiving pixel may be referred to as an 'eye-tracking sensor'. The eye-tracking sensor can detect infrared light reflected by the user's eye, which is light emitted by an infrared pixel included in the display (160).

[0116] In one embodiment, the electronic device (100) can detect the direction of the user's gaze (e.g., eye movement) through light-receiving pixels.

[0117] In one embodiment, the electronic device (100) may determine the location of the center of the virtual image according to the gaze direction of the user's left and right eyes detected through one or more light-receiving pixels (e.g., the direction in which the pupils of the user's left and right eyes gaze).
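
One possible way to derive the center of the virtual image from the two gaze directions is sketched below in Python. The disclosure does not specify the computation; this sketch assumes, purely for illustration, that each eye has a known position and gaze direction in a device coordinate system where +z points away from the user, that the virtual image lies on a plane at a fixed depth `plane_z`, and that the center is the midpoint of the two points where the gaze rays meet that plane. All function and parameter names are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def gaze_point_on_plane(eye_pos: Vec3, gaze_dir: Vec3, plane_z: float) -> Vec2:
    """Intersect one gaze ray with the plane z = plane_z."""
    ex, ey, ez = eye_pos
    dx, dy, dz = gaze_dir
    if dz <= 0:
        raise ValueError("gaze direction must point toward the image plane (+z)")
    t = (plane_z - ez) / dz  # ray parameter at the plane
    return (ex + t * dx, ey + t * dy)


def virtual_image_center(
    left_eye: Vec3, left_dir: Vec3, right_eye: Vec3, right_dir: Vec3, plane_z: float
) -> Vec2:
    """Midpoint of the left-eye and right-eye gaze points on the image plane."""
    lx, ly = gaze_point_on_plane(left_eye, left_dir, plane_z)
    rx, ry = gaze_point_on_plane(right_eye, right_dir, plane_z)
    return ((lx + rx) / 2.0, (ly + ry) / 2.0)
```

With both eyes looking straight ahead, the computed center lies midway between the eyes on the image plane; when both gaze directions tilt to one side, the center shifts in the same direction.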

[0118] In one embodiment, the electronic device (100) may include at least one display (160).

[0119] In one embodiment, the display (160) may include a light-collecting lens and / or a transparent waveguide. For example, the transparent waveguide may be located at least partially in a part of the glass.

[0120] In one embodiment, light emitted from the display (160) may be received at one end of the glass, and the received light may be transmitted to the user through an optical waveguide formed within the glass. The waveguide may be made of glass, plastic, or polymer and may include a nano-pattern formed on one surface, for example, a polygonal or curved grating structure.

[0121] In one embodiment, the incoming light can be propagated or reflected inside the waveguide by the nano pattern and provided to the user.

[0122] In one embodiment, the waveguide may include at least one diffractive element (e.g., DOE (diffractive optical element), HOE (holographic optical element)) or a reflective element (e.g., a reflective mirror).

[0123] In one embodiment, the waveguide can guide display light emitted from the light source to the user's eye using at least one diffraction element or reflection element.

[0124] In one embodiment, the waveguide serves to transmit light generated by the display to the user's eye.

[0125] In one embodiment, the waveguide may be made of glass, plastic, or polymer and may include a nano pattern formed on some internal or external surface, for example, a polygonal or curved grating structure.

[0126] In one embodiment, light incident on one end of a waveguide can be propagated within the optical waveguide of a display (160) by a nano-pattern and provided to a user. Additionally, an optical waveguide composed of a free-form prism can provide the incident light to a user through a reflective mirror. The optical waveguide may include at least one diffractive element (e.g., DOE (Diffractive Optical Element), HOE (Holographic Optical Element)) or a reflective element (e.g., a reflective mirror). The optical waveguide can guide display light emitted from a light source to the user's eyes using at least one diffractive element or reflective element included in the optical waveguide.

[0127] In one embodiment, the diffraction element may include an input optical member and an output optical member (not shown). For example, the input optical member may refer to an input grating area, and the output optical member (not shown) may refer to an output grating area. The input grating area may serve as an input terminal that diffracts (or reflects) light output from a light source (e.g., a micro LED) to transmit the light to a transparent member of the screen display (e.g., a first transparent member, a second transparent member). The output grating area may serve as an output terminal that diffracts (or reflects) light transmitted to a transparent member of the waveguide (e.g., a first transparent member, a second transparent member) toward the user's eye.

[0128] In one embodiment, the reflection element may include a total internal reflection optical element or a total internal reflection waveguide for total internal reflection (TIR). For example, total internal reflection is a method of guiding light, which may mean creating an angle of incidence such that light (e.g., a virtual image) input through an input grating area is 100% reflected from one side (e.g., a specific side) of the waveguide and is transmitted 100% to an output grating area.
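
The angle of incidence needed for total internal reflection follows from Snell's law: light inside a medium of refractive index n1 is totally reflected at the boundary with a medium of lower index n2 when the angle of incidence exceeds the critical angle arcsin(n2 / n1). The Python sketch below illustrates this relation. The refractive index of 1.5 is an assumed typical value for glass and is not taken from the disclosure, and the function names are hypothetical.

```python
import math


def critical_angle_deg(n_waveguide: float, n_outside: float = 1.0) -> float:
    """Critical angle (degrees from the surface normal) for total internal reflection."""
    if n_waveguide <= n_outside:
        raise ValueError("TIR requires the waveguide index to exceed the outside index")
    return math.degrees(math.asin(n_outside / n_waveguide))


def is_totally_reflected(incidence_deg: float, n_waveguide: float, n_outside: float = 1.0) -> bool:
    """True when light hitting the waveguide surface from inside stays in the waveguide."""
    return incidence_deg > critical_angle_deg(n_waveguide, n_outside)
```

For an assumed glass waveguide (n = 1.5) in air, the critical angle is about 41.8 degrees, so light striking the inner surface at 50 degrees would be totally reflected while light at 30 degrees would partly escape.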

[0129] In one embodiment, light emitted from the display (160) may be guided along a light path to a waveguide through an input optical member. Light traveling within the waveguide may be guided toward the user's eye through an output optical member. The screen display may be determined based on the light emitted toward the eye.

[0130] In one embodiment, the electronic device (100) may include a sensor (176). The sensor (176) may be configured to detect the depth of a subject. The sensor (176) may be configured to transmit a signal toward the subject and / or receive a signal from the subject. For example, the transmitted signal may include near-infrared, ultrasonic, and / or laser. The sensor (176) may be configured to measure the time of flight (ToF) of the signal to measure the distance between the electronic device (100) and the subject. The sensor (176) may be placed on a first surface (211a) of the housing (201).

[0131] In one embodiment, the sensor (176) may include a depth sensor. The depth sensor may be used to determine the distance to an object. The depth sensor (e.g., the depth sensor (235) of FIG. 2a) may use time-of-flight (ToF) technology. ToF technology measures the distance to an object using a signal (e.g., near-infrared, ultrasonic, or laser). In ToF technology, a transmitter emits the signal, a receiver detects the signal returning from the object, and the flight time of the signal is measured.
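
The distance computation behind a ToF measurement can be sketched as follows. Because the signal travels to the subject and back, the distance is the propagation speed multiplied by the measured flight time, divided by two. The function name and the speed-of-sound constant (an approximate value for air at about 20 °C, relevant to the ultrasonic case) are assumptions for illustration only.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # near-infrared and laser signals
SPEED_OF_SOUND_M_S = 343.0          # assumed approximate value for ultrasonic signals in air


def tof_distance_m(round_trip_s: float, speed_m_s: float = SPEED_OF_LIGHT_M_S) -> float:
    """Distance to the subject from the round-trip flight time of the signal."""
    if round_trip_s < 0:
        raise ValueError("flight time cannot be negative")
    # The signal covers the distance twice (out and back), hence the division by 2.
    return speed_m_s * round_trip_s / 2.0
```

For example, a round-trip time of 10 nanoseconds for a near-infrared pulse corresponds to a distance of roughly 1.5 m, which indicates the timing resolution such a sensor needs.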

[0132] In one embodiment, the camera (180) of FIG. 1 may include a plurality of cameras (230a, 230b, 230c).

[0133] In one embodiment, the plurality of cameras (230a, 230b, 230c) may include at least some of the first camera (230a), the second camera (230b), or the third camera (230c). The plurality of cameras (230a, 230b, 230c) may photograph the outside of the housing (201), for example, a user and / or other subjects. For example, the plurality of cameras (230a, 230b, 230c) may convert optical signals into input data and provide them to the processor (120). In one embodiment, the processor (120) may receive the input data and transmit output data to the display (160). The processor (120) may combine the data received from each of the plurality of cameras (230a, 230b, 230c), process the combined data, and control the display (160).

[0134] In one embodiment, a first camera (230a) including at least one camera for shooting and a second camera (230b) including at least one camera for recognition are spaced apart from each other on a first surface (211a) of the housing (201) so as to be able to photograph in the direction in which the first surface (211a) of the housing (201) faces.

[0135] In one embodiment, the camera (180) of FIG. 1 may include at least some of the first camera (230a), the second camera (230b), or the third camera (230c).

[0136] In one embodiment, the first camera (230a) and the second camera (230b) may be spaced apart from each other on the first surface (211a) of the housing (201). The first camera (230a) and the second camera (230b) may be positioned to face different directions to capture various directions, such as the first surface (211a) or the third surface (211c).

[0137] In one embodiment, the first camera (230a) may be configured to acquire an image from a subject. The first camera (230a) may be formed in plurality, and one of the first cameras (230a) may be placed in a portion of the first surface (211a) of the housing (201), and another first camera (230a) may be placed in a different portion of the first surface (211a) of the housing (201) and a different portion of the housing (201).

[0138] In one embodiment, a plurality of first cameras (230a) may be positioned on each side of the depth sensor (235). The plurality of first cameras (230a) may include an image stabilizer actuator (not shown) and / or an autofocus actuator (not shown). For example, the plurality of first cameras (230a) may include at least one camera configured to acquire a color image, a global shutter camera, or a rolling shutter camera, or a combination thereof.

[0139] In one embodiment, the second camera (230b) may be configured to recognize a subject. The second camera (230b) may be formed in plurality, and the plurality of second cameras (230b) may be configured to detect and / or track objects (e.g., a human head or hand) or spaces with 3 degrees of freedom or 6 degrees of freedom. For example, the plurality of second cameras (230b) may include a global shutter camera. The plurality of second cameras (230b) may be configured to perform simultaneous localization and mapping (SLAM) using depth information of the subject. The plurality of second cameras (230b) may be configured to recognize gestures of the subject.

[0140] In one embodiment, a plurality of second cameras (230b) may be placed on the first surface (211a) of the housing (201).

[0141] In one embodiment, the first camera (230a) and the second camera (230b) may be cameras for capturing images, may be referred to as HR (high resolution) or PV (photo video), and may include high-resolution cameras. The first camera (230a) and the second camera (230b) may include color cameras equipped with functions for obtaining high-quality images, such as AF (auto focus) and optical image stabilizer (OIS). Not limited thereto, the first camera (230a) and the second camera (230b) may include a global shutter (GS) camera or a rolling shutter (RS) camera.

[0142] In one embodiment, the electronic device (100) may include a plurality of third cameras (230c). The plurality of third cameras (230c) may be configured to recognize a user's face. For example, the plurality of third cameras (230c) may be configured to detect and track a user's facial expression.

[0143] In one embodiment, the third camera (230c) may include at least one facial recognition camera or at least one eye tracking camera.

[0144] In one embodiment, the electronic device (100) may further include an eye-tracking camera in at least some of the plurality of third cameras (230c). The eye-tracking camera may be used to detect and track the pupil.

[0145] In one embodiment, the third camera (230c) can detect and track the pupil. The third camera (230c) may include a plurality of cameras corresponding to the left eye and the right eye.

[0146] In one embodiment, at least one of the plurality of cameras (230a, 230b, 230c) may include a camera used for head tracking with 3 degrees of freedom (3DoF) or 6 degrees of freedom (6DoF), hand detection and tracking, gesture recognition, and / or spatial recognition.

[0147] In one embodiment, at least one of the plurality of cameras (230a, 230b, 230c) may include a global shutter (GS) camera to detect and track the movement of the head and hand. For example, two global shutter (GS) cameras of the same specifications and performance may be used for head tracking and spatial recognition, and a rolling shutter (RS) camera may be used to detect and track fast hand movements and fine movements such as those of fingers.

[0148] In one embodiment, at least one of the plurality of cameras (230a, 230b, 230c) may primarily be a global shutter (GS) camera, which has superior performance (e.g., less image drag) relative to a rolling shutter (RS) camera, but is not necessarily limited thereto, and, for example, a rolling shutter (RS) camera may be used. At least one of the plurality of cameras (230a, 230b, 230c) may perform spatial recognition for 6 degrees of freedom (DoF) and simultaneous localization and mapping (SLAM) functions through depth capture. At least one of the plurality of cameras (230a, 230b, 230c) may also perform user gesture recognition functions.

[0149] In one embodiment, the electronic device (100) may include an inertial measurement unit (IMU) sensor. The IMU sensor may include at least one of an accelerometer, a gyroscope, or a magnetometer. The electronic device (100) may detect the movement of a user based on the IMU sensor.
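
A simple way to detect user movement from IMU samples is sketched below. The disclosure does not describe the detection method; this sketch assumes that the device is considered to be moving when the accelerometer magnitude departs from standard gravity or when the gyroscope reports a noticeable angular rate. The function name and both thresholds are hypothetical values chosen for illustration.

```python
import math
from typing import Sequence

STANDARD_GRAVITY_M_S2 = 9.80665


def is_moving(
    accel_m_s2: Sequence[float],
    gyro_rad_s: Sequence[float],
    accel_tolerance: float = 0.5,  # assumed threshold, m/s^2
    gyro_tolerance: float = 0.1,   # assumed threshold, rad/s
) -> bool:
    """Flag movement from one accelerometer sample and one gyroscope sample."""
    accel_mag = math.sqrt(sum(a * a for a in accel_m_s2))
    gyro_mag = math.sqrt(sum(g * g for g in gyro_rad_s))
    # At rest the accelerometer measures only gravity and the gyroscope reads near zero.
    linear_motion = abs(accel_mag - STANDARD_GRAVITY_M_S2) > accel_tolerance
    rotation = gyro_mag > gyro_tolerance
    return linear_motion or rotation
```

A practical implementation would typically filter several samples over time rather than decide from a single sample; the single-sample check is kept here only to show the principle.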

[0150] In one embodiment, although not shown in the drawings, the electronic device (100) may include at least some of a sensor (not shown), a lighting unit (not shown), a plurality of microphones (not shown), a plurality of speakers (not shown), a battery (not shown), and a printed circuit board (not shown).

[0151] In one embodiment, one or more sensors (not shown) may be provided for various purposes (e.g., a gyroscope sensor, an accelerometer, a geomagnetic sensor, and / or a gesture sensor). For example, the sensor (not shown) may perform at least one of head tracking for 6 degrees of freedom (DoF), pose estimation and prediction, gesture and / or spatial recognition, and / or a SLAM function through depth capture.

[0152] In one embodiment, the lighting unit (not shown) may have various uses depending on the location where it is attached. For example, the lighting unit (not shown) may be attached around the second surface (211b) of the electronic device (100). The lighting unit (not shown) may be used as an auxiliary means to facilitate eye gaze detection when the eye tracking camera (not shown) photographs the pupil. The lighting unit (not shown) may use an LED of a visible light wavelength or an infrared (IR) LED of an infrared wavelength.

[0153] For example, a lighting unit (not shown) may be attached to or around the first surface (211a) (e.g., front) of the electronic device (100). The lighting unit (not shown) may be used as a means to supplement ambient brightness when multiple front cameras (230a, 230b) are shooting. The lighting unit (not shown) may be used when it is difficult to detect the subject to be shot, especially in a dark environment or due to the mixing of multiple light sources and reflected light.

[0154] In one embodiment, a lighting unit (not shown) may be omitted. The lighting unit (not shown) may be replaced by an infrared pixel included in the display (160). The lighting unit (not shown) may be included in the electronic device (100) to assist the infrared pixel included in the display (160).

[0155] In one embodiment, a plurality of microphones (not shown) can process external acoustic signals into electrical voice data. The processed voice data can be utilized in various ways depending on the function (or application running) being performed on the electronic device (100).

[0156] In one embodiment, a plurality of speakers (not shown) can output audio data received from a communication circuit or stored in a memory (120).

[0157] In one embodiment, one or more batteries (not shown) may be included in the electronic device (100) and may supply power to the components constituting the electronic device (100).

[0158] In one embodiment, a printed circuit board (not shown) can transmit electrical signals to each circuit (e.g., camera, display, audio, or sensor) and other printed circuit boards through a flexible printed circuit board (FPCB).

[0159] In one embodiment, a control circuit (not shown) that controls a component constituting an electronic device (100) may be located on a printed circuit board (not shown).

[0160] FIG. 3 is a drawing showing an electronic device (100) according to one embodiment of the present invention.

[0161] The electronic device (100) of FIGS. 2a and 2b may include an immersive head-mounted display device, and the electronic device (100) of FIG. 3 may include a glass-type head-mounted display device.

[0162] The electronic device (100) of FIG. 3 can provide information to the user through a display (314-1, 314-2) while the user sees the external environment through a glass (e.g., a first glass (320) and a second glass (330)).

[0163] In one embodiment, the electronic device (100) may include a frame (323) comprising a display (314-1, 314-2) (e.g., the display (160) of FIG. 1), glass (320, 330), a camera (e.g., a camera for taking pictures (313), a camera for eye tracking (312), a camera for recognition (311-1, 311-2)), a microphone (341-1, 341-2) and / or an ambient light sensor (342-1, 342-2), a printed circuit board (331-1, 331-2), an audio circuit (332-1, 332-2), and / or a battery (333-1, 333-2), and a temple (e.g., a first temple (321), and / or a second temple (322)) operatively connected to the frame through a hinge portion (340-1, 340-2).

[0164] In one embodiment, the display (314-1, 314-2) may provide visual information to the user through glass (e.g., a first glass (320) and a second glass (330)). The electronic device (100) may include a first glass (320) corresponding to the left eye and / or a second glass (330) corresponding to the right eye.

[0165] In one embodiment, the glass (e.g., the first glass (320) and the second glass (330)) may include a display (314-1, 314-2).

[0166] In one embodiment, the display (314-1, 314-2) may include a display panel and / or a lens. For example, the display panel may include a transparent material such as glass or plastic.

[0167] In one embodiment, the display (314-1, 314-2) may include, for example, a liquid crystal display (LCD), a digital micromirror device (DMD), a liquid crystal on silicon (LCoS), an organic light emitting diode (OLED), or a micro light emitting diode (micro LED).

[0168] Although not illustrated, in one embodiment, when the display (314-1, 314-2) is made of one of a liquid crystal display (LCD), a digital micromirror device (DMD), or a liquid crystal on silicon (LCoS) device, the electronic device (100) may include a light source that irradiates light onto a screen output area of the display (314-1, 314-2).

[0169] In one embodiment, if the display (314-1, 314-2) can generate light on its own, for example, one of an organic light-emitting diode or a micro LED, the electronic device (100) can provide a good quality virtual image to the user without including a separate light source.

[0170] In one embodiment, if the display (314-1, 314-2) is implemented as an organic light-emitting diode or a micro LED, a light source is unnecessary, so the electronic device (100) can be made lighter.

[0171] In one embodiment, the electronic device (100) may include a display (314-1, 314-2) and a glass (e.g., a first glass (320) and a second glass (330)), and the user may use the electronic device while wearing it on their face.

[0172] In one embodiment, the glass (e.g., the first glass (320) and the second glass (330)) may be formed from a glass plate, a plastic plate, or a polymer, and may be made transparent or translucent.

[0173] In one embodiment, the display (314-1, 314-2) may include a condensing lens and / or a transparent waveguide located in a portion of the glass (e.g., the first glass (320) and the second glass (330)). For example, the transparent waveguide may be located at least partially in a portion of the glass.

[0174] In one embodiment, light emitted from a display (314-1, 314-2) may be received at one end of the glass (e.g., the first glass (320) and the second glass (330)), and the received light may be transmitted to a user through an optical waveguide formed within the glass. The waveguide may be made of glass, plastic, or polymer and may include a nano-pattern formed on one surface, for example, a polygonal or curved grating structure.

[0175] In one embodiment, the incoming light can be propagated or reflected inside the waveguide by the nano pattern and provided to the user.

[0176] In one embodiment, the waveguide may include at least one diffractive element (e.g., DOE (diffractive optical element), HOE (holographic optical element)) or a reflective element (e.g., a reflective mirror).

[0177] In one embodiment, the waveguide can guide display light emitted from the light source to the user's eye using at least one diffraction element or reflection element.

[0178] According to one embodiment, a virtual object output through a display (314-1, 314-2) may include information related to an application program running on an electronic device (100) and / or information related to an external object located in real space corresponding to an area determined to be the user's field of view (FoV). For example, the electronic device (100) may identify an external object included in at least a portion of the image information related to real space acquired through the camera of the electronic device (100) (e.g., a shooting camera (313) or a depth camera) that corresponds to an area determined to be the user's field of view (FoV). The electronic device (100) may output (or display) a virtual object related to the external object identified in at least a portion through an area determined to be the user's field of view among the display areas of the electronic device (100). The external object may include an object existing in real space.
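
The selection of external objects that fall within the area determined to be the user's field of view can be illustrated with the following Python sketch. It assumes, for illustration only, that object positions are already known in a device coordinate system (x to the right, y up, +z forward) and that the field of view is described by horizontal and vertical angles. The function names and the default angles are hypothetical and are not taken from the disclosure.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def in_field_of_view(obj_pos: Vec3, h_fov_deg: float, v_fov_deg: float) -> bool:
    """True when the object direction lies inside the horizontal and vertical FoV."""
    x, y, z = obj_pos
    if z <= 0:
        return False  # behind the user
    h_angle = math.degrees(math.atan2(x, z))
    v_angle = math.degrees(math.atan2(y, z))
    return abs(h_angle) <= h_fov_deg / 2.0 and abs(v_angle) <= v_fov_deg / 2.0


def objects_to_annotate(
    objects: Dict[str, Vec3], h_fov_deg: float = 90.0, v_fov_deg: float = 60.0
) -> List[str]:
    """Names of external objects for which a virtual object would be displayed."""
    return [name for name, pos in objects.items() if in_field_of_view(pos, h_fov_deg, v_fov_deg)]
```

In this sketch, only the objects returned by `objects_to_annotate` would receive a related virtual object in the display area; objects outside the assumed angular range or behind the user are skipped.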

[0179] In one embodiment, the display area where the electronic device (100) displays a virtual object may include a part of the display (314-1, 314-2) (e.g., at least a part of the display panel).

[0180] In one embodiment, the display area may be located on a part of the first glass (320) and / or the second glass (330).

[0181] In one embodiment, the electronic device (100) is worn on the user's head and can provide the user with images related to an augmented reality service.

[0182] In one embodiment, the electronic device (100) may provide an augmented reality service that outputs at least one virtual object overlaid in an area determined to be the user's field of view (FoV). For example, the area determined to be the user's field of view is an area determined to be perceptible through the electronic device (100) by a user wearing the electronic device (100), and may include all or at least part of the display (314-1, 314-2) of the electronic device (100).

[0183] In one embodiment, the electronic device (100) may include a plurality of glasses (e.g., a first glass (320) and / or a second glass (330)) corresponding to each of the user's two eyes (e.g., a left eye and / or a right eye). The plurality of glasses may include at least a portion of a display (314-1, 314-2). For example, the first glass (320) corresponding to the user's left eye may include a first display (314-1), and the second glass (330) corresponding to the user's right eye may include a second display (314-2). For example, the electronic device (100) may be configured in the form of at least one of glasses, goggles, a helmet, or a hat, but is not limited thereto.

[0184] In one embodiment, if the display (314-1, 314-2) is a transparent micro LED (µLED), the waveguide configuration within the glass (e.g., the first glass (320) and the second glass (330)) may be omitted. According to another embodiment, the display (314-1, 314-2) may be composed of a transparent element, and a user may perceive the actual space behind the display (314-1, 314-2) by seeing through the display (314-1, 314-2). The display (314-1, 314-2) may display a virtual object in at least a portion of the transparent element so that the virtual object appears to the user to be superimposed on at least a portion of the actual space.

[0185] In one embodiment, the electronic device (100) may include a virtual reality (VR) device. If the electronic device (100) is a VR device, the first glass (320) may be a first display (314-1) and the second glass (330) may be a second display (314-2).

[0186] According to one embodiment, the electronic device (100) can operate the first display panel included in the first glass (320) and the second display panel included in the second glass (330) as independent components, respectively. For example, the electronic device (100) can determine the display performance of the first display panel based on first setting information and determine the display performance of the second display panel based on second setting information.
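
Operating the two display panels as independent components can be sketched as below. The fields of `PanelSettings` borrow the examples of setting information mentioned elsewhere in this description (resolution, frame rate, brightness); the class names, the 0.0 to 1.0 brightness range, and the concrete values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PanelSettings:
    """Hypothetical setting information for one display panel."""
    resolution: Tuple[int, int]  # width, height in pixels
    frame_rate_hz: int
    brightness: float            # assumed range 0.0 to 1.0


class DisplayPanel:
    """One independently controlled panel (e.g., in the first or second glass)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.settings: Optional[PanelSettings] = None

    def apply(self, settings: PanelSettings) -> None:
        if not 0.0 <= settings.brightness <= 1.0:
            raise ValueError("brightness must be between 0.0 and 1.0")
        # Each panel keeps its own settings; nothing is shared between panels.
        self.settings = settings
```

Applying a first `PanelSettings` instance to one panel and a different instance to the other leaves each panel with its own display performance, which mirrors the first and second setting information described above.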

[0187] According to one embodiment, at least one camera may include a camera (313) for capturing an image corresponding to the user's field of view (FoV) and / or measuring the distance to an object, an eye tracking camera (312) for checking the direction of the user's gaze, and / or a gesture camera (311-1, 311-2) for recognizing a certain space.

[0188] In one embodiment, the electronic device (100) may include a camera (313) (e.g., RGB camera) for capturing an image corresponding to the user's field of view (FoV) and / or measuring the distance to an object, an eye tracking camera (312) for checking the direction of the user's gaze, and / or a recognition camera (311-1, 311-2) (e.g., gesture camera) for recognizing a certain space.

[0189] In one embodiment, the electronic device (100) can measure the distance to an object located in the user's front direction (e.g., direction A) using a camera (313).

[0190] In one embodiment, the electronic device (100) may have a plurality of eye-tracking cameras (312) positioned corresponding to the user's eyes. The eye-tracking cameras (312) can detect the user's gaze direction (e.g., eye movement). For example, the eye-tracking cameras (312) may include a first eye-tracking camera (312-1) for tracking the gaze direction of the user's left eye and a second eye-tracking camera (312-2) for tracking the gaze direction of the user's right eye.

[0191] In one embodiment, the electronic device (100) can detect a user gesture within a preset distance (e.g., a certain space) using a recognition camera (311-1, 311-2). For example, the recognition camera (311-1, 311-2) may be composed of multiple units and may be positioned on both sides of the electronic device (100). The electronic device (100) can identify which of the user's eyes is the dominant eye and which is the auxiliary eye using at least one camera. For example, the dominant eye and the auxiliary eye may be identified based on the direction of the user's gaze toward an external object or a virtual object.

[0192] In one embodiment, the camera (313) for shooting may include a high-resolution camera such as an HR (high resolution) camera and a PV (photo video) camera. According to one embodiment, the eye tracking camera (312) can detect the user's pupils to track the direction of gaze, and can be utilized so that the center of the virtual image moves in correspondence with the direction of gaze. For example, the eye tracking camera (312) may be divided into a first eye tracking camera (312-1) corresponding to the left eye and a second eye tracking camera (312-2) corresponding to the right eye, and the performance and specifications of the cameras may be substantially the same.

[0193] In one embodiment, the recognition camera (311-1, 311-2) may be used for detecting a user's hand (gesture) and for spatial recognition, and may include a GS (global shutter) camera. For example, the recognition camera (311-1, 311-2) may include a GS camera, which exhibits less image drag than an RS (rolling shutter) camera, to detect and track fast hand movements and fine movements such as those of the fingers.

[0194] In one embodiment, the electronic device (100) can display virtual objects related to an augmented reality service together based on image information related to a real space acquired through the camera of the electronic device (100).

[0195] In one embodiment, the electronic device (100) can display virtual objects based on displays (314-1, 314-2) positioned corresponding to the user's eyes.

[0196] In one embodiment, the electronic device (100) can display a virtual object based on preset setting information (e.g., resolution, frame rate, brightness, and / or display area).

[0197] The number and location of at least one camera (e.g., RGB camera (313), eye tracking camera (312) and / or gesture camera (311-1, 311-2)) included in the electronic device (100) illustrated in FIG. 3 may not be limited. For example, the number and location of at least one camera (e.g., RGB camera (313), eye tracking camera (312) and / or gesture camera (311-1, 311-2)) may vary depending on the form (e.g., shape or size) of the electronic device (100).

[0198] In one embodiment, the electronic device (100) may include a microphone (341-1, 341-2) for receiving the user's voice and ambient sound. For example, the microphone (341-1, 341-2) may include an audio circuit. The electronic device (100) may include an illuminance sensor (342-1, 342-2) for checking ambient brightness. For example, the illuminance sensor (342-1, 342-2) may be included in the sensor (170) of FIG. 1.

[0199] In one embodiment, the first temple (321) and / or the second temple (322) may include a printed circuit board (331-1, 331-2) for transmitting an electrical signal to each component of the electronic device (100), a speaker (332-1, 332-2) for outputting an audio signal, a battery (333-1, 333-2), and / or a hinge portion (340-1, 340-2) for at least partially connecting to the frame (323) of the electronic device (100). According to one embodiment, the speaker (332-1, 332-2) may include a first speaker (332-1) for transmitting an audio signal to the user's left ear and a second speaker (332-2) for transmitting an audio signal to the user's right ear. The speaker (332-1, 332-2) may include an audio circuit.

[0200] In one embodiment, the electronic device (100) may be equipped with a plurality of batteries (333-1, 333-2) and may supply power to printed circuit boards (331-1, 331-2) through a power management module.

[0201] In one embodiment, the first temple (321) and / or the second temple (322) may include a printed circuit board (PCB) (331-1, 331-2), a speaker (332-1, 332-2), and / or a battery (333-1, 333-2). In one embodiment, the first temple (321) and / or the second temple (322) are support members of the electronic device (100), and the first temple (321) and / or the second temple (322) may support a frame (323) to allow the electronic device (100) to be seated on the user's body when worn.

[0202] FIG. 4 is a drawing showing a display (160), an eye-tracking camera (410), and a waveguide (430) according to one embodiment of the present disclosure.

[0203] In one embodiment, the display (160) may include a waveguide (430). Light emitted from the display (160) may be transmitted to the user through the waveguide (430).

[0204] In one embodiment, the waveguide (430) may be made of glass, plastic, or polymer and may include a nano pattern formed on an inner or outer surface, for example, a polygonal or curved grating structure. The waveguide (430) may include an optical waveguide.

[0205] In one embodiment, light emitted from a display (160) is transmitted to a waveguide (430) through an input optical structure (451), and light input to the waveguide (430) can be propagated or reflected within the waveguide (430) and provided to a user through an output optical structure (452).

[0206] In one embodiment, the waveguide (430) may include at least one diffractive element (e.g., DOE (diffractive optical element), HOE (holographic optical element)) or reflective element (e.g., reflective mirror).

[0207] In one embodiment, the waveguide (430) can guide light from the light source unit of the display (160) to the user's eye using at least one diffraction element or reflection element.

[0208] In one embodiment, the waveguide (430) serves to transmit a light source generated by the display (160) to the user's eye.

[0209] In one embodiment, the waveguide (430) may be made of glass, plastic, or polymer and may include a nano pattern formed on some of the inner or outer surfaces, for example, a polygonal or curved grating structure.

[0210] In one embodiment, light incident on one end of the waveguide (430) can be propagated within the waveguide by a nano-pattern and provided to the user. Additionally, the waveguide composed of a free-form prism can provide the incident light to the user through a reflective mirror (e.g., output optical structure (452)).

[0211] In one embodiment, the diffraction element may include an input optical structure (451) and an output optical structure (452). For example, the input optical structure (451) may include an input grating area. The output optical structure (452) may include an output grating area.

[0212] In one embodiment, the input grating area may serve as an input terminal that diffracts (or reflects) light output from a light source (e.g., a Micro LED) to transmit the light to a transparent member of the screen display.

[0213] In one embodiment, the output grating area may serve as an outlet that diffracts (or reflects) light transmitted to the transparent member of the waveguide (430) into the user's eye.

[0214] In one embodiment, the reflection element may include a total internal reflection optical element or a total internal reflection waveguide for total internal reflection (TIR). For example, total internal reflection is a method of guiding light, which may mean creating an angle of incidence such that light (e.g., a virtual image) input through the input grating area is reflected 100% from one side (e.g., a specific side) of the waveguide (430) and transmitted 100% to the output grating area.
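
The angle-of-incidence condition for total internal reflection can be illustrated with Snell's law. The following is a minimal sketch only; the refractive indices (1.5 for a glass waveguide, 1.0 for air) and the function names are illustrative assumptions and are not taken from the present disclosure.

```python
import math


def critical_angle_deg(n_waveguide: float, n_outside: float = 1.0) -> float:
    """Critical angle in degrees, measured from the surface normal.

    Light travelling inside the waveguide is totally reflected at the
    boundary when its angle of incidence exceeds this value.
    """
    if n_waveguide <= n_outside:
        raise ValueError("TIR requires n_waveguide > n_outside")
    return math.degrees(math.asin(n_outside / n_waveguide))


def is_totally_reflected(incidence_deg: float, n_waveguide: float,
                         n_outside: float = 1.0) -> bool:
    """True if a ray at the given incidence angle stays inside the waveguide."""
    return incidence_deg > critical_angle_deg(n_waveguide, n_outside)


# Assumed example values: glass (n = 1.5) surrounded by air (n = 1.0).
theta_c = critical_angle_deg(1.5)  # about 41.8 degrees
```

Under these assumed indices, a ray striking the waveguide surface at 50 degrees would remain guided, while a ray at 30 degrees would partly escape.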

[0215] In one embodiment, light emitted from the display (160) can be guided along a light path to a waveguide (430) through an input optical member. Light traveling inside the waveguide (430) can be guided toward the user's eye through an output optical structure (452).

[0216] In one embodiment, the eye-tracking camera (410) may include the third camera (230c) of FIG. 3.

[0217] In one embodiment, the eye tracking camera (410) may be used to detect and track the pupil. The eye tracking camera (410) may include an eye tracking sensor (411) (e.g., an infrared sensor). The eye tracking camera (410) can detect the user's pupil and track rapid pupil movements through the eye tracking sensor (411). Light reflected from the user's eye can enter through an input structure (460) and be transmitted to the eye tracking camera (410) via the waveguide (440) for the eye tracking camera.

[0218] FIG. 5 is a flowchart illustrating a method for controlling an audio-based spatial design of an electronic device (100) according to one embodiment of the present disclosure.

[0219] In one embodiment, in operation 501, instructions stored in memory (130) can cause the electronic device (100) to obtain voice input regarding space when executed individually or collectively by at least one processor (120). For example, the voice input may include text or voice regarding space design.

[0220] In one embodiment, in operation 501, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to receive voice input regarding spatial design through microphones (341-1, 341-2).

[0221] In one embodiment, in operation 501, instructions stored in memory (130) can enable the electronic device (100) to obtain character input regarding space when executed individually or collectively by at least one processor (120).

[0222] In one embodiment, in operation 501, instructions stored in memory (130) can enable the electronic device (100) to acquire gesture input regarding space when executed individually or collectively by at least one processor (120).

[0223] In one embodiment, in operation 503, instructions stored in memory (130) can cause the electronic device (100) to acquire an image of the external environment based on a camera (e.g., a camera for taking pictures (313) or a depth camera) when executed individually or collectively by at least one processor (120).

[0224] In one embodiment, the external environment may include a real world corresponding to the field of view of a user wearing the electronic device (100).

[0225] In one embodiment, the external environment may include a space for the real world.

[0226] In one embodiment, in operation 503, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120) to cause the electronic device (100) to turn on a camera (e.g., a camera for shooting (313) or a depth camera) to acquire an image of the external environment when it receives voice input regarding the spatial design.

[0227] In one embodiment, in operation 503, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), so that when the electronic device (100) receives at least one of voice input, text input, or gesture input for a spatial design, it can turn on a camera (e.g., a camera for shooting (313) or a depth camera) to acquire an image of the external environment.

[0228] In one embodiment, the electronic device (100) may further include a depth camera capable of acquiring spatial depth information. The depth camera may acquire depth information based on time-of-flight (ToF) technology, for example using a depth sensor (e.g., the depth sensor (235) of FIG. 2). The electronic device (100) may also acquire depth information based on stereo vision technology of a camera (313). The camera (180) may include at least one of the camera (313) or the depth camera.
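
The two depth acquisition approaches mentioned above can be sketched with their textbook relations: distance from round-trip time for ToF, and distance from disparity for stereo vision. This is a minimal sketch under assumptions; the function names, the pinhole stereo model, and the example numbers are hypothetical and are not specified in the present disclosure.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_depth_m(round_trip_time_s: float) -> float:
    """Depth from a time-of-flight measurement: the light travels out and back."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


def stereo_depth_m(focal_length_px: float, baseline_m: float,
                   disparity_px: float) -> float:
    """Depth from stereo disparity under an assumed rectified pinhole model."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Assumed example values: a 20 ns round trip, and a 6 cm stereo baseline.
depth_from_tof = tof_depth_m(20e-9)                  # about 3.0 m
depth_from_stereo = stereo_depth_m(500.0, 0.06, 15)  # about 2.0 m
```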

[0229] In one embodiment, in operation 505, instructions stored in memory (130) can cause the electronic device (100) to obtain a first feature from an image when executed individually or collectively by at least one processor (120).

[0230] In one embodiment, the first feature may include an image feature. The first feature may include a vector value or a feature vector (or image feature vector) for the first feature (or image feature).

[0231] In one embodiment, instructions stored in memory (130) can cause an electronic device (100) to obtain a first feature from an image based on an encoder, as in Equation 1, when executed individually or collectively by at least one processor (120).

[0232] $x_{img} = \mathrm{Enc}(img_t) \in \mathbb{R}^{D}$ (Equation 1)

[0233] In one embodiment, $\mathbb{R}^{D}$ can represent a D-dimensional space composed of real numbers. $x_{img}$ can represent a set of image feature vectors. $img_t$ can represent an image acquired at a specific time point or at time t. Enc is an encoder that can extract image features and / or image feature vectors from the image acquired at time t.

[0234] For example, the encoder may be included in the text-visual multimodal generation model (701) or text-visual fusion module (710) of FIG. 7.

[0235] For example, the encoder may include a CNN (convolutional neural network).

[0236] In one embodiment, the electronic device (100) may store a computer program, such as a convolutional neural network (CNN), in memory (130).

[0237] In one embodiment, the image features and / or image feature vectors from the image acquired at time t may include at least one of coordinate information for space, depth information for space, coordinate information for an object included in space, depth information for an object included in space, lighting, or texture information.

[0238] In one embodiment, in operation 507, instructions stored in memory (130) can cause the electronic device (100) to convert voice input into text when executed individually or collectively by at least one processor (120).

[0239] In one embodiment, in operation 507, instructions stored in memory (130) can cause the electronic device (100) to convert at least one of voice input, character input, or gesture input into text when executed individually or collectively by at least one processor (120).

[0240] In one embodiment, the electronic device (100) may store a computer program or application for automatic speech recognition in memory (130).

[0241] In one embodiment, in operation 509, instructions stored in memory (130) can cause the electronic device (100) to obtain a second feature from text when executed individually or collectively by at least one processor (120).

[0242] In one embodiment, the second feature may include a text feature. The second feature may include a vector value or a feature vector (or text feature vector) for the second feature (or text feature).

[0243] In one embodiment, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to obtain a second feature from text based on an encoder, as in Equation 2.

[0244] $x_{txt} = \mathrm{Enc}(s_t) \in \mathbb{R}^{D}$ (Equation 2)

[0245] In one embodiment, $\mathbb{R}^{D}$ can represent a D-dimensional space composed of real numbers. $x_{txt}$ can represent a set of text feature vectors. $s_t$ can represent a voice input acquired at a specific time point or at time t. Enc is an encoder that can extract text features and / or text feature vectors from the voice input or text acquired at time t.

[0246] For example, the encoder may be included in the text-visual multimodal generation model (701) or text-visual fusion module (710) of FIG. 7.

[0247] For example, the encoder may include a CNN (convolutional neural network).

[0248] In one embodiment, text features and / or text feature vectors obtained from text at time t may include at least one of coordinate information about space, depth information about space, coordinate information about an object included in space, depth information about an object included in space, lighting, texture information, movement information about an object, or information about an object.

[0249] In one embodiment, in operation 511, instructions stored in memory (130) can cause the electronic device (100) to identify a third feature in which an image and text are fused when executed individually or collectively by at least one processor (120).

[0250] For example, the third feature may include a feature in which image features and text features are fused. The third feature may include a fused feature. The third feature may include a vector value or a feature vector (or fused feature vector) for the third feature (or fused feature).

[0251] In one embodiment, in operation 511, instructions stored in memory (130) can cause the electronic device (100) to acquire a third feature in which an image and text are fused when executed individually or collectively by at least one processor (120).

[0252] In one embodiment, instructions stored in memory (130) can cause the electronic device (100) to acquire a third feature in which an image and text are fused, as in Equation 3, when executed individually or collectively by at least one processor (120).

[0253] $x_t = W(x_{txt} \oplus x_{img}), \quad W \in \mathbb{R}^{D \times 2D}$ (Equation 3)

[0254] In one embodiment, the set of fused feature vectors ($x_t$) can be expressed as a linear transformation ($W$) of the direct sum ($\oplus$) of the set of text feature vectors ($x_{txt}$) and the set of image feature vectors ($x_{img}$).

[0255] In one embodiment, when instructions stored in memory (130) are executed individually or collectively by at least one processor (120), they can cause the electronic device (100) to concatenate the set of text feature vectors ($x_{txt}$) and the set of image feature vectors ($x_{img}$) and linearly transform the result to obtain the set of fused feature vectors ($x_t$).

[0256] For example, the direct sum ($\oplus$) includes a concatenation operation, and $W$ may include a linear projection operator belonging to $\mathbb{R}^{D \times 2D}$.
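
The concatenation followed by linear projection described above can be sketched as follows. This is a minimal illustration, assuming D-dimensional text and image feature vectors and a D x 2D weight matrix; the helper names and the toy weights are hypothetical, and in practice the weights would be learned.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def concat(x_txt: Vector, x_img: Vector) -> Vector:
    """Direct sum as concatenation: a 2D-dimensional vector, text part first."""
    return list(x_txt) + list(x_img)


def linear_project(W: Matrix, v: Vector) -> Vector:
    """Matrix-vector product W v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def fuse(W: Matrix, x_txt: Vector, x_img: Vector) -> Vector:
    """Fused feature x_t = W (x_txt concatenated with x_img), with W of shape D x 2D."""
    d = len(x_txt)
    if len(x_img) != d:
        raise ValueError("text and image features must share dimension D")
    if len(W) != d or any(len(row) != 2 * d for row in W):
        raise ValueError("W must have shape D x 2D")
    return linear_project(W, concat(x_txt, x_img))


# Toy weights (assumed): each output sums the matching text and image entries.
W_toy = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0]]
x_t = fuse(W_toy, [1.0, 2.0], [10.0, 20.0])  # [11.0, 22.0]
```

The projection maps the 2D-dimensional concatenation back to D dimensions, so the fused feature has the same size as each input feature.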

[0257] In one embodiment, in operation 513, instructions stored in memory (130) can cause the electronic device (100) to apply a history state when executed individually or collectively by at least one processor (120).

[0258] In one embodiment, in operation 513, instructions stored in memory (130) can cause the electronic device (100) to update the history state when executed individually or collectively by at least one processor (120).

[0259] In one embodiment, in operation 513, the instructions stored in memory (130), when executed individually or collectively by at least one processor (120), can cause the electronic device (100) to combine the set of fused feature vectors ($x_t$) with a previous fused feature vector based on a previous voice input and a previous image.

[0260] In one embodiment, in operation 513, instructions stored in memory (130) can cause the electronic device (100) to continuously aggregate records of user utterances and update the history when executed individually or collectively by at least one processor (120).

[0261] In one embodiment, in operation 513, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to update the history by combining the set of fused feature vectors ($x_t$) with a previous fused feature vector using a GRU (gated recurrent unit).

[0262] In one embodiment, in operation 513, instructions stored in memory (130), when executed individually or collectively by at least one processor (120), can cause the electronic device (100) to combine the set of fused feature vectors ($x_t$) with a previous fused feature vector based on a GRU (gated recurrent unit) and apply the result to the history state.

[0263] In one embodiment, when instructions stored in memory (130) are executed individually or collectively by at least one processor (120), they can cause the electronic device (100) to combine the set of fused feature vectors ($x_t$) with the previous fused feature vector, as in Equations 4 and 5, and apply the result to the history state.

[0264]

[0265]

[0266] In one embodiment, the current history ($h_t$) or history state ($hs_t$) can be determined from the previous history ($h_{t-1}$) and the set of fused feature vectors ($x_t$) according to learnable weights ($W_s$). For example, a GRU (gated recurrent unit) may include a computer program for learning the history.

[0267] In Equations 4 and 5, t can represent time. The GRU may include software capable of accumulating and storing inputs over time. The GRU can use a recurrent input method, similar to the input method of a recurrent neural network (RNN). When the set of fused feature vectors ($x_t$) is input into the GRU at time t, the GRU can output the current history ($h_t$) at that time and an output vector ($g_t$). The current history ($h_t$) may include state information in which the information is compressed at each time interval. Since the output vector ($g_t$) stores information about all frames and feature dimensions, it can be output in the form (number of frames, dimension). Since the current history ($h_t$) only has state information about time, it can only have the form of a hidden state, (1, dimension).
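
The recurrent history update can be illustrated with a small GRU cell. The sketch below does not reproduce Equations 4 and 5 of the present disclosure; it uses one commonly used GRU gate formulation as an assumption, omits bias terms for brevity, and all parameter names (Wz, Uz, Wr, Ur, Wn, Un) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def _sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def _matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x_t: Vector, h_prev: Vector, p: Dict[str, Matrix]) -> Vector:
    """One GRU update: combine fused feature x_t with the previous history."""
    # Update gate z and reset gate r, each in (0, 1).
    z = [_sigmoid(a + b) for a, b in
         zip(_matvec(p["Wz"], x_t), _matvec(p["Uz"], h_prev))]
    r = [_sigmoid(a + b) for a, b in
         zip(_matvec(p["Wr"], x_t), _matvec(p["Ur"], h_prev))]
    # Candidate state computed from x_t and the reset-scaled previous history.
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    n = [math.tanh(a + b) for a, b in
         zip(_matvec(p["Wn"], x_t), _matvec(p["Un"], reset_h))]
    # Blend the candidate state with the previous history.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]


def run_history(xs: List[Vector], h0: Vector,
                p: Dict[str, Matrix]) -> Tuple[List[Vector], Vector]:
    """Return (g, h): per-step outputs of shape (frames, dim) and the final state."""
    g: List[Vector] = []
    h = h0
    for x_t in xs:
        h = gru_step(x_t, h, p)
        g.append(h)
    return g, h


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


# Toy parameters (assumed): all-zero weights, so both gates equal 0.5 and the
# candidate state is 0; each step then halves the previous history.
params = {k: zeros(2, 2) for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
g, h = run_history([[1.0, 2.0], [3.0, 4.0]], [1.0, -2.0], params)
```

Here `g` collects one output per input step, matching the (number of frames, dimension) form, while `h` is the single compressed state carried to the next utterance.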

[0268] In one embodiment, in operation 513, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to sequentially aggregate user feedback information to narrow down the candidate images to be searched.

[0269] In one embodiment, in operation 515, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to generate candidate images based on a third feature (or fused feature).

[0270] In one embodiment, in operation 515, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to generate candidate images or virtual objects based on a third feature (or fused feature).

[0271] FIG. 6 is a flowchart specifically illustrating the operation (515) of generating candidate images according to one embodiment of the present disclosure.

[0272] Referring to FIGS. 5 and 6, in one embodiment, in operation 531, instructions stored in memory (130) can cause the electronic device (100) to determine whether the name of a specific object is included in the fusion feature when executed individually or collectively by at least one processor (120). For example, the name of a specific object may include at least one of the names of furniture, props, or spaces.

[0273] In one embodiment, if the fusion feature includes the name of a specific object, the instructions stored in memory (130) can cause the electronic device (100) to branch from operation 531 to operation 539 when executed individually or collectively by at least one processor (120).

[0274] In one embodiment, if the fusion feature does not have the name of a specific object, the instructions stored in memory (130) can cause the electronic device (100) to branch from operation 531 to operation 533 when executed individually or collectively by at least one processor (120).

[0275] In one embodiment, in operation 533, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to determine whether depth information is included in the fusion feature.

[0276] In one embodiment, in operation 533, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to determine whether depth is presented in the fusion feature.

[0277] In one embodiment, if the fusion feature includes depth information, the instructions stored in memory (130) can cause the electronic device (100) to branch from operation 533 to operation 539 when executed individually or collectively by at least one processor (120).

[0278] In one embodiment, if there is no depth information in the fusion feature, the instructions stored in memory (130) can cause the electronic device (100) to branch from operation 533 to operation 535 when executed individually or collectively by at least one processor (120).

[0279] In one embodiment, in operation 539, instructions stored in memory (130) can cause the electronic device (100) to check depth information on an image in a depth map when executed individually or collectively by at least one processor (120).

[0280] In one embodiment, in operation 539, when the name of a specific object is included in the fusion feature, the instructions stored in memory (130) can cause the electronic device (100) to check depth information on an image in a depth map for an object corresponding to the name of the specific object when executed individually or collectively by at least one processor (120).

[0281] In one embodiment, when the name of a specific object is included in the fusion feature, instructions stored in memory (130) can cause the electronic device (100) to perform a segmentation operation on the objects in the image when executed individually or collectively by at least one processor (120). Instructions stored in memory (130) can cause the electronic device (100) to identify an object corresponding to the name of the specific object among the segmented objects when executed individually or collectively by at least one processor (120). Instructions stored in memory (130) can cause the electronic device (100) to identify depth information about the object corresponding to the name of the specific object when executed individually or collectively by at least one processor (120). Instructions stored in memory (130) can cause the electronic device (100) to check depth information on an image in a depth map for the object corresponding to the name of the specific object when executed individually or collectively by at least one processor (120).

[0282] In one embodiment, in operation 539, if depth information is included in the fusion feature, the instructions stored in memory (130) can cause the electronic device (100) to check the depth information on the image in the depth map when executed individually or collectively by at least one processor (120).

[0283] In one embodiment, in operation 535, instructions stored in memory (130) can cause the electronic device (100) to generate a prompt based on voice input when executed individually or collectively by at least one processor (120).

[0284] In one embodiment, in operation 535, instructions stored in memory (130) can cause the electronic device (100) to generate a prompt based on text converted from voice input when executed individually or collectively by at least one processor (120).

[0285] In one embodiment, the prompt may include commands or questions entered by a user when interacting with an electronic device (100) or a program. For example, the prompt may include text entered when requesting specific information or instructing an artificial intelligence language model to perform a task.

[0286] In one embodiment, in operation 537, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to generate candidate images based on at least one of prompt or depth information.

[0287] In one embodiment, in operation 537, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to generate candidate images or virtual objects based on at least one of prompt or depth information.
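
The branching among operations 531, 533, 535, 537, and 539 described above can be summarized in a short sketch. The data structure and function names are hypothetical. The sketch also assumes that operation 539 is followed by operation 537, which the description implies (candidate generation uses at least one of the prompt or the depth information) but does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FusedFeature:
    """Hypothetical summary of what the fusion feature contains."""
    object_name: Optional[str] = None  # e.g., "sofa"; None if no object is named
    has_depth: bool = False            # True if depth information is present


def plan_candidate_generation(feature: FusedFeature) -> List[int]:
    """Return the sequence of operation numbers visited for this feature."""
    if feature.object_name is not None:
        # Operation 531: an object name is present, so check the depth map (539).
        return [531, 539, 537]
    if feature.has_depth:
        # Operation 533: depth information is present, so check the depth map (539).
        return [531, 533, 539, 537]
    # Neither is present: generate a prompt from the voice input (535).
    return [531, 533, 535, 537]


path_named = plan_candidate_generation(FusedFeature(object_name="sofa"))
path_depth = plan_candidate_generation(FusedFeature(has_depth=True))
path_prompt = plan_candidate_generation(FusedFeature())
```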

[0288] Referring to FIG. 7, candidate images can be generated based on a candidate generation module (730) stored in memory (130). The candidate generation module (730) may include at least one of a prompt generation module, a ControlNet module, or a stable diffusion module.

[0289] In one embodiment, the prompt generation module can generate a prompt provided to the stable diffusion module. The prompt generation module can configure the prompt according to importance based on the current user's request information (e.g., text converted from voice input) and previous caption information.

[0290] In one embodiment, the prompt generation module may represent a sentence as keywords and move important keywords to the front of the sentence. The prompt generation module may replace the connections between comma-separated units and words with a first symbol (e.g., a parallel bar symbol (∥)), wrap an important keyword in a second symbol (e.g., parentheses) as many times as the keyword is repeated, and mark keywords to be removed with a third symbol (e.g., square brackets).

[0291] For example, if the current user's request information (e.g., text converted from voice input) is “Remove the bed and add a wood table to make it a clean style” and the previous caption information is “The prepared design has a large window, a wood table, and lighting...”, the prompt generation module can treat “wood table” as a keyword, move it to the front of the sentence and display it with a second symbol as many times as it is repeated, and treat “bed” as a keyword to be removed and display it with a third symbol. The prompt generation module can display it with symbols as “((wood table)) ∥ large window, [remove bed]”.
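
The symbol-based prompt formatting in the example above can be sketched as follows. This is a minimal illustration that assumes the keywords have already been extracted. The function names are hypothetical, and the formatting rules are inferred only from the single example given, so other inputs may be handled differently by the actual prompt generation module.

```python
from typing import Dict, List


def count_mentions(keyword: str, *texts: str) -> int:
    """Count how often a keyword appears across the request and the caption."""
    key = keyword.lower()
    return sum(text.lower().count(key) for text in texts)


def build_prompt(important: Dict[str, int], others: List[str],
                 removed: List[str]) -> str:
    """Place emphasized keywords first, then other keywords and removals."""
    # Second symbol: one pair of parentheses per repetition of the keyword.
    emphasized = ["(" * max(n, 1) + kw + ")" * max(n, 1)
                  for kw, n in important.items()]
    # Third symbol: square brackets mark keywords to be removed.
    tail = list(others) + [f"[remove {kw}]" for kw in removed]
    # First symbol: the parallel bar separates the prompt sections.
    parts = [" ∥ ".join(emphasized), ", ".join(tail)]
    return " ∥ ".join(part for part in parts if part)


request = "Remove the bed and add a wood table to make it a clean style"
caption = "The prepared design has a large window, a wood table, and lighting..."

prompt = build_prompt(
    {"wood table": count_mentions("wood table", request, caption)},
    ["large window"],
    ["bed"],
)
# prompt == "((wood table)) ∥ large window, [remove bed]"
```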

[0292] In one embodiment, the stable diffusion module can generate candidate images based on a prompt. The ControlNet module can precisely control the image generation process by utilizing additional input data. At least one of the depth, composition, or pose of the candidate images generated by the stable diffusion module can be controlled.

[0293] In one embodiment, in operation 517, instructions stored in memory (130) can cause the electronic device (100) to select a representative image among candidate images when executed individually or collectively by at least one processor (120).

[0294] In one embodiment, in operation 517, instructions stored in memory (130) can cause the electronic device (100) to automatically select a representative image from among candidate images when executed individually or collectively by at least one processor (120).

[0295] In one embodiment, in operation 517, instructions stored in memory (130) can cause the electronic device (100) to select a representative image from among candidate images by user input when executed individually or collectively by at least one processor (120). For example, user input may include at least one of voice input, text input, or gesture input.

[0296] In one embodiment, in operation 517, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to select a representative image based on user preference among candidate images.

[0297] In one embodiment, in operation 517, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to select the representative image based on an image selected or preferred by the user among candidate images.

[0298] In one embodiment, user preferences can be determined based on a K-NN (K-Nearest Neighbors) program stored in memory (130). The K-NN program is a supervised learning algorithm that can be used to classify or predict new data based on similarity between data points. For example, the K-NN program can classify by referring to the labels of K other data points that are close to the data. When new data is input, the K-NN program can select the K data points closest to the new data by comparing the new data with the existing data distribution.
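
The K-NN classification described above can be sketched with a majority vote over the K nearest labeled points. This is a minimal illustration only; the Euclidean distance metric, the two-dimensional toy feature vectors, and the "like" / "dislike" labels are assumptions and are not taken from the present disclosure.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, label)


def knn_predict(samples: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Label the query by majority vote among its k nearest labeled samples."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    # Sort the existing data by Euclidean distance to the new data point.
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy preference history (assumed): image features the user liked or disliked.
history: List[Sample] = [
    ((0.0, 0.0), "dislike"),
    ((0.0, 1.0), "dislike"),
    ((5.0, 5.0), "like"),
    ((6.0, 5.0), "like"),
    ((5.0, 6.0), "like"),
]

near_liked = knn_predict(history, (5.2, 5.1), k=3)     # "like"
near_disliked = knn_predict(history, (0.1, 0.2), k=3)  # "dislike"
```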

[0299] In one embodiment, in operation 519, instructions stored in memory (130) may cause the electronic device (100) to output an image caption as audio through speakers (332-1, 332-2) when executed individually or collectively by at least one processor (120). For example, the image caption may include a description of a representative image.

[0300] In one embodiment, in operation 519, instructions stored in memory (130) can cause the electronic device (100) to display an image caption on a display (314-1, 314-2) when executed individually or collectively by at least one processor (120).

[0301] In one embodiment, in operation 521, instructions stored in memory (130) can cause the electronic device (100) to output a 3D image to a display (314-1, 314-2) when executed individually or collectively by at least one processor (120).

[0302] In one embodiment, in operation 521, instructions stored in memory (130) can cause the electronic device (100) to output at least one of candidate images or representative images to a display (314-1, 314-2) when executed individually or collectively by at least one processor (120).

[0303] In one embodiment, in operation 521, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120) to enable the electronic device (100) to recognize an object in a representative image and convert the recognized object into a three-dimensional (3D) representation to correspond to an external environment.

[0304] In one embodiment, in operation 521, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120) to enable the electronic device (100) to recognize an object in a representative image based on depth information and convert the recognized object into a three-dimensional (3D) representation to correspond to an external environment.
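
One way depth information could be used to place a recognized object in three dimensions is pinhole back-projection of a pixel into camera coordinates. The present disclosure does not specify the conversion method, so the following is only a sketch under that assumption; the function name and the camera intrinsics (fx, fy, cx, cy) are hypothetical example values.

```python
from typing import Tuple


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float
                ) -> Tuple[float, float, float]:
    """Map pixel (u, v) with known depth to a 3D point in camera coordinates.

    Assumes a pinhole camera with focal lengths (fx, fy) in pixels and
    principal point (cx, cy).
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Assumed intrinsics for a 640 x 480 image.
center_point = backproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0)
offset_point = backproject(420, 240, 2.0, 500.0, 500.0, 320.0, 240.0)
```

With these assumed values, a pixel at the image center maps to a point straight ahead of the camera at the measured depth, and a pixel 100 columns to the right maps to a point about 0.4 m to the side at a depth of 2 m.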

[0305] FIG. 7 is a block diagram showing a text-visual multimodal generation model (701) stored in an electronic device (100) according to one embodiment of the present disclosure.

[0306] In one embodiment, the text-visual multimodal generation model (701) may include a text-visual fusion module (710), a history tracking module (720), a candidate generation module (730), and a selection module (740).

[0307] In one embodiment, the text-visual multimodal generation model (701), text-visual fusion module (710), history tracking module (720), candidate generation module (730), and selection module (740) are stored in memory (130) as computer programs, and when they are executed by the processor (120), the electronic device (100) can perform the operations of FIGS. 5 and 6.

[0308] In one embodiment, the text-visual multimodal generation model (701), text-visual fusion module (710), history tracking module (720), candidate generation module (730), and selection module (740) may include circuits embedded in the processor (120).

[0309] In one embodiment, the text-visual fusion module (710) may obtain a first feature from an image obtained through a camera (e.g., a camera for shooting (313) or a depth camera).

[0310] In one embodiment, the first feature may include an image feature. The first feature may include a vector value or a feature vector (or image feature vector) for the first feature (or image feature).

[0311] In one embodiment, the text-visual fusion module (710) can obtain a first feature from an image using a convolutional neural network (CNN) program.

[0312] In one embodiment, the text-visual fusion module (710) can convert voice input into text using a computer program for automatic speech recognition.

[0313] In one embodiment, the text-visual fusion module (710) can obtain a second feature from the text.

[0314] In one embodiment, the second feature may include a text feature. The second feature may include a vector value or a feature vector (or text feature vector) for the second feature (or text feature).

[0315] In one embodiment, the text-visual fusion module (710) can obtain a second feature from the text using a convolutional neural network (CNN) program.

[0316] In one embodiment, the text-visual fusion module (710) may obtain a third feature in which an image and text are fused. In one embodiment, the text-visual fusion module (710) may obtain a fused feature based on image features and text features.

[0317] For example, the third feature may include a feature in which image features and text features are fused. The third feature may include a fused feature. The third feature may include a vector value or a feature vector (or fused feature vector) for the third feature (or fused feature).
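
The fusion of an image feature vector and a text feature vector can be illustrated with a minimal sketch. The disclosure does not state how the two features are fused, so the rule used here (L2-normalize each vector, then concatenate) is an assumption chosen for simplicity, and the names `l2_normalize` and `fuse_features` as well as the sample vectors are hypothetical.

```python
from math import sqrt
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_features(image_feat: Sequence[float], text_feat: Sequence[float]) -> list[float]:
    """Fuse two modalities by normalizing each one and concatenating them.

    This is an assumed fusion rule, not one stated in the disclosure.
    """
    return l2_normalize(image_feat) + l2_normalize(text_feat)


image_feature = [3.0, 4.0]       # stand-in for a CNN image feature vector
text_feature = [0.0, 2.0, 0.0]   # stand-in for a text feature vector
fused_feature = fuse_features(image_feature, text_feature)
print(fused_feature)
```

Normalizing before concatenation keeps either modality from dominating the fused vector only because of its scale; a learned fusion layer could take the place of this rule.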

[0318] In one embodiment, the history tracking module (720) can update the history status.

[0319] In one embodiment, the history tracking module (720) can combine a set of fused feature vectors (x_t) with a previous fused feature vector based on a previous voice input and a previous image.

[0320] In one embodiment, the history tracking module (720) can continuously aggregate records of the user's utterances and update the history.

[0321] In one embodiment, the history tracking module (720) can update the history in which a set of fused feature vectors (x_t) is combined with a previous fused feature vector based on a GRU (gated recurrent unit).
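
A GRU-based history update of this kind can be sketched as follows. This is only an illustration under stated assumptions: the weights are random placeholders rather than trained parameters, bias terms are omitted, and the class name `GRUCell`, the vector sizes, and the sample inputs are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class GRUCell:
    """Minimal GRU cell without bias terms; weights are random placeholders."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.W_z, self.U_z = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.W_r, self.U_r = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.W_h, self.U_h = mat(hidden_size, input_size), mat(hidden_size, hidden_size)

    def step(self, x_t, h_prev):
        """Combine the current fused vector x_t with the previous history state."""
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W_z, x_t), matvec(self.U_z, h_prev))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W_r, x_t), matvec(self.U_r, h_prev))]
        gated = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
        h_cand = [math.tanh(a + b) for a, b in zip(matvec(self.W_h, x_t), matvec(self.U_h, gated))]
        # The update gate z interpolates between the old state and the candidate state.
        return [(1 - z_i) * h_i + z_i * c_i for z_i, h_i, c_i in zip(z, h_prev, h_cand)]


cell = GRUCell(input_size=3, hidden_size=2)
history = [0.0, 0.0]  # empty history before the first utterance
for x_t in ([0.2, 0.1, 0.7], [0.9, 0.0, 0.1]):  # fused vectors from two dialogue turns
    history = cell.step(x_t, history)
print(history)
```

Each call to `step` folds one more turn into the history state, so the state after the last turn reflects the earlier utterances and images as well as the current ones.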

[0322] In one embodiment, the candidate generation module (730) can generate candidate images.

[0323] In one embodiment, the candidate generation module (730) may include at least one of a prompt generation module, a ControlNet module, or a stable diffusion module.

[0324] In one embodiment, the prompt generation module can generate a prompt provided to the stable diffusion module. The prompt generation module can configure the prompt according to importance based on the current user's request information (e.g., text converted from voice input) and previous caption information.

[0325] In one embodiment, the prompt generation module may represent a sentence as keywords and move important keywords to the front of the sentence. The prompt generation module may replace the comma-delimited connections between words with a first symbol (e.g., a parallel bar symbol (∥)), wrap an important keyword in a second symbol (e.g., parentheses) as many times as the keyword is repeated, and mark unimportant keywords to be removed with a third symbol (e.g., square brackets).

[0326] For example, if the current user's request information (e.g., text converted from voice input) is “Remove the bed and add a wood table to make it a clean style” and the previous caption information is “The prepared design has a large window, a wood table, and lighting...”, the prompt generation module can treat “wood table” as a keyword, move it to the front of the sentence and display it with a second symbol as many times as it is repeated, and treat “bed” as a keyword to be removed and display it with a third symbol. The prompt generation module can display it with symbols as “((wood table)) ∥ large window, [remove bed]”.
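
The symbol convention in this example can be reproduced with a short sketch. Keyword extraction is not shown; the lists of keywords to add, keep, and remove are passed in by hand, which is an assumption, and the function names `count_mentions` and `build_prompt` are hypothetical. The repetition count is taken as the number of times a keyword appears across the current request and the previous caption.

```python
def count_mentions(keyword: str, *texts: str) -> int:
    """Count case-insensitive occurrences of a keyword across the given texts."""
    return sum(text.lower().count(keyword.lower()) for text in texts)


def build_prompt(request: str, previous_caption: str,
                 add: list[str], keep: list[str], remove: list[str]) -> str:
    """Format keywords with the parenthesis, parallel-bar, and bracket symbols."""
    counts = {kw: max(count_mentions(kw, request, previous_caption), 1) for kw in add}
    # Important keywords go first, wrapped in one pair of parentheses per mention.
    ordered = sorted(counts, key=lambda kw: -counts[kw])
    emphasized = ["(" * counts[kw] + kw + ")" * counts[kw] for kw in ordered]
    # Remaining keywords follow, with removals marked in square brackets.
    tail = keep + [f"[remove {kw}]" for kw in remove]
    parts = [", ".join(emphasized), ", ".join(tail)]
    return " ∥ ".join(part for part in parts if part)


request = "Remove the bed and add a wood table to make it a clean style"
previous_caption = "The prepared design has a large window, a wood table, and lighting..."
prompt = build_prompt(request, previous_caption,
                      add=["wood table"], keep=["large window"], remove=["bed"])
print(prompt)  # ((wood table)) ∥ large window, [remove bed]
```

Here "wood table" appears once in the request and once in the previous caption, so it is wrapped in two pairs of parentheses, matching the prompt string given in the example above.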

[0327] In one embodiment, the stable diffusion module can generate candidate images based on a prompt. The ControlNet module can precisely control the image generation process by utilizing additional input data. At least one of the depth, composition, or pose of the candidate images generated by the stable diffusion module can be controlled.

[0328] In one embodiment, the selection module (740) can select a representative image from among the candidate images.

[0329] In one embodiment, the selection module (740) can select a representative image based on the user's preference among candidate images.

[0330] In one embodiment, the selection module (740) may include a K-NN (K-Nearest Neighbors) program. In one embodiment, the selection module (740) may verify user preferences based on the K-NN program. The K-NN program is a supervised learning algorithm that can be used to classify or predict new data based on similarity between data points. For example, the K-NN program can classify by referring to the labels of K other data points that are close to the data. When new data is input, the K-NN program can select the K data points closest to the new data by comparing the new data with the existing data distribution.
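
A K-NN-based selection of a representative image can be sketched as follows. The feature vectors, the "liked" and "disliked" labels, and the scoring rule (the fraction of liked images among the K nearest past images) are assumptions made for illustration; the disclosure does not specify them. The names `preference_score` and `select_representative` are hypothetical.

```python
from math import dist


def preference_score(query, labeled_history, k=3):
    """Fraction of 'liked' labels among the k past images closest to the query."""
    nearest = sorted(labeled_history, key=lambda item: dist(query, item[0]))[:k]
    return sum(1 for _, label in nearest if label == "liked") / len(nearest)


def select_representative(candidates, labeled_history, k=3):
    """Return the name of the candidate whose neighbors are most often liked."""
    return max(candidates, key=lambda name: preference_score(candidates[name], labeled_history, k))


# Feature vectors of images the user reacted to earlier (illustrative values).
labeled_history = [
    ([0.9, 0.1], "liked"), ([0.8, 0.2], "liked"), ([0.7, 0.3], "liked"),
    ([0.1, 0.9], "disliked"), ([0.2, 0.8], "disliked"), ([0.3, 0.7], "disliked"),
]
candidates = {
    "candidate_a": [0.15, 0.85],
    "candidate_b": [0.85, 0.15],
    "candidate_c": [0.5, 0.5],
}
representative = select_representative(candidates, labeled_history)
print(representative)  # candidate_b
```

Because K-NN stores labeled examples rather than fitting parameters, new user reactions can be appended to `labeled_history` after every turn without any retraining step.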

[0331] FIG. 8 is a drawing illustrating a method for controlling an audio-based spatial design of an electronic device (100) according to one embodiment of the present disclosure.

[0332] In one embodiment, while wearing the electronic device (100), the user can input voice input (801) into the electronic device (100), such as “Change the style of my room now,” “Remove the bed and put in a wooden table to make it a neat style,” or “Make the table smaller and make it feel like a living room with a comfortable sofa and study.”

[0333] In one embodiment, the electronic device (100) and the user can interact with each other based on voice input and images.

[0334] In one embodiment, when voice input (801) such as “Change my room style now” is obtained from a user, the electronic device (100) may obtain an image related to the actual space (802) in the user’s field of view (FoV) through a camera (e.g., a camera for shooting (313) or a depth camera). The electronic device (100) may obtain image features from the image.

[0335] In one embodiment, when voice input (801) such as “Change my room style now” is obtained from a user, the electronic device (100) can convert the voice input into text to obtain text features.

[0336] In one embodiment, the electronic device (100) can generate fusion features based on text features and image features, and generate candidate images (803) based on the fusion features.

[0337] In one embodiment, the electronic device (100) may select at least one of the candidate images (803) as a representative image (804) based on the user's preference and display it on a display (314-1, 314-2).

[0338] In one embodiment, when the electronic device (100) displays a representative image (804) on a display (314-1, 314-2), it may display it in correspondence with the actual space (or external environment) (802).

[0339] In one embodiment, the electronic device (100) may output an image caption (805) as audio through speakers (332-1, 332-2). For example, the image caption (805) may include a description of the representative image (804). For example, the image caption (805) may include “It is a minimal design with large windows, a comfortable sofa, a small table, arranged books, a cozy sofa, etc.”

[0340] In one embodiment, the electronic device (100) may display an image caption (805) on a display (314-1, 314-2).

[0341] FIG. 9 is a drawing illustrating a method for controlling an audio-based spatial design of an electronic device (100) according to one embodiment of the present disclosure.

[0342] In one embodiment, while wearing the electronic device (100), the user can input voice input (901), such as “I want to buy a new desk,” into the electronic device (100).

[0343] In one embodiment, the electronic device (100) and the user can interact with each other based on voice input (901) and images.

[0344] In one embodiment, when voice input (901) such as “I want to buy a new desk” is obtained from a user, the electronic device (100) may obtain an image related to the actual space (902) in the user’s field of view (FoV) through a camera (e.g., a camera for shooting (313) or a depth camera). The electronic device (100) may obtain image features from the image.

[0345] In one embodiment, when voice input (901) “I want to buy a new desk” is obtained from a user, the electronic device (100) can convert the voice input into text to obtain text features.

[0346] In one embodiment, the electronic device (100) can generate fusion features based on text features and image features, and generate candidate images based on the fusion features.

[0347] In one embodiment, the electronic device (100) can recognize at least one object (911, 912) in the actual space (902) through a segmentation or depth sensing operation. For example, the electronic device (100) can recognize a desk (911) and a monitor (912) as objects in the actual space (902).

[0348] In one embodiment, the electronic device (100) can obtain information related to a desk through an external electronic device (920). The electronic device (100) can obtain information related to a desk through web crawling via the external electronic device (920).

[0349] In one embodiment, the electronic device (100) may select at least one of the candidate images as a representative image (903) based on the user's preference and display it on a display (314-1, 314-2).

[0350] In one embodiment, when the electronic device (100) displays a representative image (903) on a display (314-1, 314-2), it may display it in correspondence with the actual space (or external environment) (902).

[0351] FIG. 10 is a drawing illustrating a method for controlling an audio-based spatial design of an electronic device (100) according to one embodiment of the present disclosure.

[0352] In one embodiment, while wearing the electronic device (100), the user can input voice input (1001) into the electronic device (100), such as “Change the style of my room now,” “Remove the bed and put in a wooden table to make it a neat style,” or “Make the table smaller and make it feel like a living room with a comfortable sofa and study.”

[0353] In one embodiment, the electronic device (100) and the user can interact with each other based on voice input (1001) and images.

[0354] In one embodiment, when voice input (1001) such as “Change my room style now” is obtained from a user, the electronic device (100) may acquire an image related to the actual space (1002) in the user’s field of view (FoV) through a camera (e.g., a camera for shooting (313) or a depth camera). The electronic device (100) may acquire image features from the image.

[0355] In one embodiment, when voice input (1001) such as “Change my room style now” is obtained from a user, the electronic device (100) can convert the voice input into text to obtain text features.

[0356] In one embodiment, the electronic device (100) can generate fusion features based on text features and image features, and generate candidate images (1005) based on the fusion features.

[0357] In one embodiment, the electronic device (100) may select at least one of the candidate images (1005) as a representative image (1006) based on the user's preference and display it on a display (314-1, 314-2).

[0358] In one embodiment, when the electronic device (100) displays a representative image (1006) on a display (314-1, 314-2), it may display it in correspondence with the actual space (or external environment) (1002).

[0359] Referring to screen 1004, the electronic device (100) can recognize at least one object (1014, 1024) in a representative image (1006) and generate a three-dimensional converted image (1024).

[0360] In one embodiment, when the electronic device (100) displays a representative image (1006) on a display (314-1, 314-2), it may display a three-dimensionally converted image (1024) in an area corresponding to an actual object.

[0361] In one embodiment, when the electronic device (100) displays a representative image (1006) on a display (314-1, 314-2), it may display a three-dimensionally converted image (1024) based on depth information in actual space.

[0362] In one embodiment, the electronic device (100) may output an image caption (1007) as audio through speakers (332-1, 332-2). For example, the image caption (1007) may include a description of the representative image (1006). For example, the image caption (1007) may include “It is a minimal design with large windows, a comfortable sofa, a small table, arranged books, a cozy sofa, etc.”

[0363] In one embodiment, the electronic device (100) may display an image caption (1007) on a display (314-1, 314-2).

[0364] FIGS. 11a, 11b, and 11c are drawings illustrating a method for controlling an audio-based spatial design of an electronic device (100) according to an embodiment of the present invention.

[0365] In one embodiment, while wearing the electronic device (100), the user can input a first voice input (1101), such as “Change my room style now,” into the electronic device (100).

[0366] In one embodiment, the electronic device (100) and the user can interact with each other based on voice input and images.

[0367] In one embodiment, when a first voice input (1101) such as “Change my room style now” is received from a user, the electronic device (100) may acquire an image related to the actual space (1102) in the user’s field of view (FoV) through a camera (e.g., a camera for shooting (313) or a depth camera). The electronic device (100) may acquire image features from the image.

[0368] In one embodiment, when a first voice input (1101) such as “Change my room style now” is obtained from a user, the electronic device (100) can convert the voice input into text to obtain text features.

[0369] In one embodiment, the electronic device (100) can generate fusion features based on text features and image features, and generate first candidate images (1103) based on the fusion features.

[0370] In one embodiment, the electronic device (100) may select at least one of the first candidate images (1103) as the first representative image (1104) based on the user's preference and display it on the display (314-1, 314-2).

[0371] In one embodiment, the electronic device (100) can update at least one of the first candidate images (1103), the first representative image (1104), or the fusion feature in the history.

[0372] In one embodiment, the electronic device (100) may output an image caption (1105) as audio through speakers (332-1, 332-2). For example, the image caption (1105) may include a description of the first representative image (1104). For example, the image caption (1105) may include “It is a minimal design with large windows, a comfortable sofa, a small table, arranged books, a cozy sofa, etc.”

[0373] In one embodiment, the electronic device (100) can obtain a second voice input (1106) from a user, such as “remove the bed and put in a wooden table to make it a neat style” with respect to a first representative image (1104).

[0374] In one embodiment, the electronic device (100) can generate fusion features based on text features and image features, and generate second candidate images (1107) based on fusion features.

[0375] In one embodiment, the electronic device (100) may select at least one of the second candidate images (1107) based on the user's preference as the second representative image (1108) and display it on the display (314-1, 314-2).

[0376] In one embodiment, the electronic device (100) can update at least one of the second candidate images (1107), the second representative image (1108), or the fusion feature in the history.

[0377] In one embodiment, the electronic device (100) may output an image caption (1109) as audio through speakers (332-1, 332-2). For example, the image caption (1109) may include a description of the second representative image (1108). For example, the image caption (1109) may include “a design consisting of large windows, a wooden dining table, and arranged books.”

[0378] In one embodiment, the electronic device (100) can obtain a third voice input (1110) from a user regarding a second representative image (1108), such as “make the table small and create a living room feeling like a comfortable sofa and study.”

[0379] In one embodiment, the electronic device (100) can generate fusion features based on text features and image features, and generate third candidate images (1111) based on the fusion features.

[0380] In one embodiment, the electronic device (100) may select at least one of the third candidate images (1111) as the third representative image (1112) based on the user's preference and display it on the display (314-1, 314-2).

[0381] In one embodiment, the electronic device (100) can update at least one of the third candidate images (1111), the third representative image (1112), or the fusion feature in the history.

[0382] In one embodiment, the electronic device (100) may output an image caption (1113) as audio through speakers (332-1, 332-2). For example, the image caption (1113) may include a description of the third representative image (1112). For example, the image caption (1113) may include “It is a minimal design with large windows, a comfortable sofa, a small table, arranged books, a cozy sofa, etc.”

[0383] In one embodiment, the electronic device (100) can obtain a fourth voice input (1114) from the user, such as “Ah, next I’ll just put a bed and a desk in,” regarding the third representative image (1112).

[0384] In one embodiment, the electronic device (100) can generate fusion features based on text features and image features, and generate fourth candidate images (1115) based on the fusion features.

[0385] In one embodiment, the electronic device (100) may select at least one of the fourth candidate images (1115) based on the user's preference as the fourth representative image (1116) and display it on the display (314-1, 314-2).

[0386] In one embodiment, the electronic device (100) can update at least one of the fourth candidate images (1115), the fourth representative image (1116), or the fusion feature in the history.

[0387] In one embodiment, the electronic device (100) may output an image caption (1117) as audio through speakers (332-1, 332-2). For example, the image caption (1117) may include a description of the fourth representative image (1116). For example, the image caption (1117) may include “It is a minimal design with a bed and a desk, a large window, and arranged books.”

[0388] In one embodiment, the electronic device (100) can obtain a fifth voice input (1117) from a user, such as “I want to change my desk, please look for a different gaming desk”, regarding a fourth representative image (1116).

[0389] In one embodiment, the electronic device (100) can generate a fusion feature based on text features and image features, and generate a sixth representative image (1118) based on the fusion feature.

[0390] In one embodiment, the electronic device (100) may display the sixth representative image (1118) on a display (314-1, 314-2).

[0391] In one embodiment, the electronic device (100) can update at least one of the sixth representative image (1118) or the fusion feature in the history.

[0392] In one embodiment, the electronic device (100) may output an image caption (1119) as audio through speakers (332-1, 332-2). For example, the image caption (1119) may include a description of the sixth representative image (1118). For example, the image caption (1119) may include “reconfigured on a desk with a monitor, arranged books, and a section for holding a tablet.”

[0393] In one embodiment, the electronic device (100) can identify at least one object (e.g., a desk (1120), a monitor (1121)) in the sixth representative image (1118), search through an external electronic device (1130), and convert the searched object (or image) into a three-dimensional form (1140).

[0394] In one embodiment, the electronic device (100) may display a three-dimensionally converted object (or image) (1140) in an area corresponding to the actual object.

[0395] In one embodiment, the electronic device (100) can display a three-dimensionally converted object (or image) (1140) based on depth information in actual space.

[0396] FIGS. 12a, 12b, 12c, 12d, 12e, and 12f are drawings illustrating a method for controlling an audio-based spatial design of an electronic device (200) according to one embodiment of the present disclosure.

[0397] The electronic device (200) of FIGS. 12a, 12b, 12c, 12d, 12e, and 12f may include a smartphone (e.g., a bar-type or foldable-type smartphone) or a portable computer device. The electronic device (200) of FIGS. 12a, 12b, 12c, 12d, 12e, and 12f may include components of the electronic device (100) of FIG. 1.

[0398] In FIG. 12a, the user can provide an input (1201) to the electronic device (200) via voice input and/or text input, such as “I would like the room to have books by the large window now, make it a comfortable space with a table and deep, and make the tone a wood tone.”

[0399] In one embodiment, the electronic device (200) and the user can interact with each other based on input and images.

[0400] In one embodiment, the electronic device (200) can interact with the user through a chatbot application (1210).

[0401] Referring to FIG. 12b, when an input (1201) is obtained from a user, such as “I want the room to have books by the large window, make it a comfortable space with a table, and make the tone deep, and generate a wood tone,” the electronic device (200) can obtain an image related to the actual space (1202) through a camera (e.g., camera (180)). The electronic device (200) can obtain image features from the image.

[0402] In one embodiment, when an input (1201) is obtained from a user, such as “I would like the room to have books by the large window now, make it a comfortable space with a table, and make the tone a wood tone,” the electronic device (200) can obtain text features based on the voice input and/or text input.

[0403] In one embodiment, the electronic device (200) can generate fusion features based on text features and image features, and generate candidate images based on the fusion features.

[0404] In one embodiment, the electronic device (200) can select at least one of the candidate images as a representative image (1204) based on the user's preference and display it on the display (160).

[0405] In one embodiment, the electronic device (200) can update at least one of candidate images, representative images, or fusion features in the history.

[0406] In one embodiment, the electronic device (200) may output an image caption (1203) as audio through speakers (332-1, 332-2). For example, the image caption (1203) may include a description of a representative image (1204). For example, the image caption (1203) may include “a design featuring large windows, a comfortable sofa, aligned shelves, and a minimal design.”

[0407] In one embodiment, the electronic device (200) may display an image caption (1203) on the execution screen of a chatbot application (1210) along with the user's voice input or text input.

[0408] In FIG. 12c, when a user inputs voice input or text input such as “show me candidate images” on the execution screen of the chatbot application (1210), the electronic device (200) may stop displaying the representative image (1204) and display candidate images (1206) for selecting the representative image (1204) on the display (160).

[0409] Referring to FIG. 12d, when a user inputs voice input or text input, such as “add a wood table,” on the execution screen of the chatbot application (1210), the electronic device (200) can display a representative image (1208) with a wood table added. The electronic device (200) can output an image caption (1209) as audio through speakers (332-1, 332-2). For example, the image caption (1209) may include a description of the representative image (1208). For example, the image caption (1209) may include “It is a design consisting of a large window, a wood dining table, and arranged books.” The electronic device (200) can display the image caption (1209) on the execution screen of the chatbot application (1210) together with the user’s voice input or text input.

[0410] Referring to FIG. 12e, when a user inputs voice or text input such as “I want to add a desk to the image, what design would be good” (1219) on the execution screen of the chatbot application (1210), the electronic device (200) can display a representative image (1213) with a desk added to the selected image (1212). The electronic device (200) can output an image caption (1214) as audio through speakers (332-1, 332-2). For example, the image caption (1214) may include a description of the representative image (1213). For example, the image caption (1214) may include “This is a design with a desk added.” The electronic device (200) can display the image caption (1214) on the execution screen of the chatbot application (1210) together with the user’s voice or text input.

[0411] Referring to FIG. 12f, when a user inputs voice or text input, such as “Please find a similar desk,” on the execution screen of the chatbot application (1210), the electronic device (200) can identify an object regarding a desk in a representative image (1213) to which a desk has been added, and search for and display a desk image (1215) through an external electronic device. The electronic device (200) can output an image caption (1216) as audio through speakers (332-1, 332-2). For example, the image caption (1216) may include a description of the desk image (1215). For example, the image caption (1216) may include “Price of a height-adjustable wooden computer desk at a supermarket…”. The electronic device (200) can display the image caption (1216) on the execution screen of the chatbot application (1210) together with the user’s voice or text input.

[0412] FIG. 13 is a flowchart illustrating a method for controlling an audio-based spatial design of an electronic device (100) according to one embodiment of the present disclosure.

[0413] In one embodiment, in operation 1301, instructions stored in memory (130) can enable the electronic device (100) to receive user input when executed individually or collectively by at least one processor (120).

[0414] In one embodiment, user input may include voice input regarding space. For example, voice input may include text or voice regarding space design.

[0415] In one embodiment, in operation 1301, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), allowing the electronic device (100) to receive user input through microphones (341-1, 341-2).

[0416] In one embodiment, in operation 1303, instructions stored in memory (130) can enable the electronic device (100) to acquire spatial information about an external environment based on a camera (e.g., a camera for shooting (313) or a depth camera) when executed individually or collectively by at least one processor (120).

[0417] In one embodiment, the external environment may include a real world corresponding to the field of view of a user wearing the electronic device (100).

[0418] In one embodiment, the external environment may include a space for the real world.

[0419] In one embodiment, in operation 1303, instructions stored in memory (130) can be executed individually or collectively by at least one processor (120), so that when the electronic device (100) receives user input, it can turn on a camera (e.g., a camera for shooting (313) or a depth camera) to obtain an image or spatial information of the external environment.

[0420] In one embodiment, in operation 1305, instructions stored in memory (130) can cause the electronic device (100) to place a virtual object in space and display the virtual object on a display (314-1, 314-2) based on user input when executed individually or collectively by at least one processor (120).

[0421] In one embodiment, the electronic device (100) comprises a frame (e.g., frame (323)) equipped with glasses (e.g., glasses (320, 330)) including a display (e.g., display (180), display (314-1, 314-2)), a wearable structure (e.g., first temple (321) and/or second temple (322)) coupled to the frame (e.g., frame (323)) to allow the frame to be seated on a user's head, at least one sensor, a camera (180) including a depth camera, a microphone (e.g., microphone (341-1, 341-2)), a speaker (e.g., speaker (332-1, 332-2)), at least one processor (120), and memory (130) storing instructions. When executed individually or collectively by the at least one processor (120), the instructions may cause the electronic device (100) to receive a voice input related to a space through the microphone, obtain, based on the voice input, information related to the space through the depth camera, convert the voice collected through the microphone into text, create a virtual object to be placed in the space based on the converted text, and display the virtual object on the glass (e.g., glass (320, 330)).

[0422] In one embodiment, when the instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may identify text features in the converted text, identify image features in the acquired information, identify fusion features based on the text features and image features, apply a history state based on the fusion features, generate candidate images based on the fusion features, automatically select a representative image among the candidate images, display the selected representative image on a glass (e.g., glass (320, 330)), and output a caption for the selected representative image as audio through a speaker.
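
The order of operations in this embodiment can be summarized with a small orchestration sketch. Every component is injected as a callable and replaced here by a trivial stub, so the class name `DesignPipeline`, its field names, and the stub behavior are all hypothetical; only the sequence of steps (text features, image features, fusion, history update, candidate generation, selection, caption) follows the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

Vector = list[float]


@dataclass
class DesignPipeline:
    """Chains the described steps; each component is supplied by the caller."""
    text_encoder: Callable[[str], Vector]
    image_encoder: Callable[[Sequence[float]], Vector]
    generate: Callable[[Vector], list[str]]
    select: Callable[[list[str]], str]
    caption: Callable[[str], str]
    history: list[Vector] = field(default_factory=list)

    def run(self, text: str, image: Sequence[float]) -> tuple[str, str]:
        # Fuse image and text features (concatenation is an assumed fusion rule).
        fused = self.image_encoder(image) + self.text_encoder(text)
        self.history.append(fused)                 # apply the fusion feature to the history
        candidates = self.generate(fused)          # candidate images
        representative = self.select(candidates)   # automatic selection
        return representative, self.caption(representative)


# Stub components standing in for the speech, vision, and generation models.
pipeline = DesignPipeline(
    text_encoder=lambda text: [float(len(text.split()))],
    image_encoder=lambda image: [sum(image) / len(image)],
    generate=lambda fused: [f"candidate_{i}" for i in range(3)],
    select=lambda candidates: candidates[0],
    caption=lambda image_id: f"Caption for {image_id}",
)
representative, spoken_caption = pipeline.run("Change my room style now", [0.2, 0.4])
print(representative, "|", spoken_caption)
```

In a real device the returned image would be shown on the display and the caption passed to a text-to-speech component; both of those steps are outside this sketch.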

[0423] In one embodiment, when instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may identify an object for a selected representative image, convert the identified object into three dimensions, and display the three-dimensional converted object on a glass (e.g., glass (320, 330)) at a position corresponding to the space in the user's field of view.

[0424] In one embodiment, when the instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may be able to select a representative image from among candidate images by user input.

[0425] In one embodiment, user input may include at least one of voice input or gesture input.

[0426] In one embodiment, when instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may display a representative image generated based on an image of an object collected from an external electronic device (100) on a glass (e.g., glass (320, 330)).

[0427] In one embodiment, when the instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may generate candidate images by checking depth information on an image of a specific object in a depth map if the name of a specific object is included in the fusion feature.

[0428] In one embodiment, when the instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may generate candidate images by checking depth information on an image of a specific object in a depth map, if depth is included in the fusion feature.
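
The two embodiments above both check depth information for a specific object in a depth map. The document does not say how that check is done; the sketch below assumes the object has already been located as a bounding box and takes the median of the valid depth readings inside it. The names object_depth and detections, the box layout, and the convention that 0.0 marks a missing reading are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

# Assumed box layout: (row_start, col_start, row_end, col_end), end exclusive.
Box = Tuple[int, int, int, int]


def object_depth(
    name: str, detections: Dict[str, Box], depth_map: List[List[float]]
) -> Optional[float]:
    """Return the median depth (metres) inside the named object's box, if any."""
    box = detections.get(name)
    if box is None:
        return None  # the named object was not found in the image
    r0, c0, r1, c1 = box
    values = [
        depth_map[r][c]
        for r in range(r0, r1)
        for c in range(c0, c1)
        if depth_map[r][c] > 0  # skip pixels with no depth reading
    ]
    return median(values) if values else None


depth_map = [
    [3.0, 3.0, 3.0, 3.0],
    [3.0, 1.5, 1.6, 3.0],
    [3.0, 1.4, 0.0, 3.0],  # 0.0 stands for a missing reading
]
detections = {"sofa": (1, 1, 3, 3)}
sofa_depth = object_depth("sofa", detections, depth_map)
```

The median is used here rather than the mean so that a few background pixels inside the box do not skew the estimate.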

[0429] In one embodiment, when instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may generate a prompt based on voice input or text and generate candidate images based on the generated prompt.
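
A minimal sketch of prompt generation from the converted text follows. The template wording, the build_prompt name, and the optional scene notes are hypothetical; the document states only that a prompt is generated from the voice input or text and that candidate images are generated from it.

```python
from typing import Sequence


def build_prompt(request_text: str, scene_notes: Sequence[str] = ()) -> str:
    """Compose an image-generation prompt from the user's request text."""
    # Collapse repeated whitespace and drop a trailing full stop.
    request = " ".join(request_text.split()).rstrip(".")
    parts = [f"Interior design image. User request: {request}."]
    if scene_notes:
        parts.append("Scene context: " + "; ".join(scene_notes) + ".")
    return " ".join(parts)


prompt = build_prompt(
    "  add a  wooden desk near the window. ",
    ["window on the left wall", "room depth about 3 m"],
)
```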

[0430] In one embodiment, when the instructions are executed individually or collectively by at least one processor (120), the electronic device (100) may select a representative image based on an image selected by the user or an image preferred by the user.

[0431] In one embodiment, a spatial design method using an electronic device (100) may include the operation of receiving a voice input related to a space through a microphone, the operation of acquiring information related to the space received through a depth camera based on the voice input, the operation of converting the voice input collected through the microphone into text, the operation of creating a virtual object to be placed in the space based on the converted text, and the operation of displaying the virtual object on a glass (e.g., glass (320, 330)).

[0432] In one embodiment, a spatial design method using an electronic device (100) may include the operation of identifying image features in acquired information, the operation of identifying text features in converted text, the operation of identifying fusion features based on text features and image features, the operation of applying a history state based on fusion features, the operation of generating candidate images based on fusion features and automatically selecting a representative image among the candidate images, the operation of displaying the selected representative image on a glass (e.g., glass (320, 330)), and the operation of outputting a caption for the selected representative image as audio through a speaker.

[0433] In one embodiment, a spatial design method using an electronic device (100) may include the operation of identifying an object for a selected representative image, and the operation of converting the identified object into a three-dimensional object and displaying the object converted into a three-dimensional object on a glass (e.g., glass (320, 330)) at a position corresponding to the space in the user's field of vision.

[0434] In one embodiment, a spatial design method using an electronic device (100) may include an operation of selecting a representative image from among candidate images by user input.

[0435] In one embodiment, a spatial design method using an electronic device (100) may include the operation of displaying a representative image generated based on an image of an object collected from an external electronic device (100) on a glass (e.g., glass (320, 330)).

[0436] In one embodiment, a spatial design method using an electronic device (100) may include the operation of generating candidate images by checking depth information on an image of a specific object in a depth map if the name of a specific object is included in the fusion feature.

[0437] In one embodiment, a spatial design method using an electronic device (100) may include an operation of generating candidate images by checking depth information on an image of a specific object in a depth map, if depth is included in the fusion features.

[0438] In one embodiment, a spatial design method using an electronic device (100) may include generating a prompt based on voice input or text, and generating candidate images based on the generated prompt.

[0439] In one embodiment, a spatial design method using an electronic device (100) may include the operation of selecting a representative image based on an image selected or preferred by the user.

[0440] An electronic device according to one embodiment disclosed in this document may be of various forms. The electronic device may include, for example, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a consumer electronics device. The electronic device according to the embodiment of this document is not limited to the aforementioned devices.

[0441] The embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of said embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of said items unless the relevant context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as "1st" and "2nd," or "first" and "second," may be used simply to distinguish one component from another and do not limit the components in any other aspect (e.g., importance or order). Where a (e.g., first) component is referred to as “coupled” or “connected” to another (e.g., second) component, with or without the terms “functionally” or “communicatively,” it means that the component may be connected to the other component directly (e.g., via a wire), wirelessly, or through a third component.

[0442] As used in one embodiment of this document, the term “module” may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

[0443] One embodiment of the present document may be implemented as software (e.g., a program) comprising one or more instructions stored in a storage medium (e.g., internal memory or external memory) that is readable by a machine (e.g., an electronic device (100)). For example, a processor (e.g., a processor (120)) of the machine (e.g., an electronic device (100)) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, "non-transitory" simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and the term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.

[0444] According to one embodiment, the method disclosed herein may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.

[0445] According to one embodiment, each of the components described above (e.g., a module or program) may include a single entity or multiple entities, and some of the multiple entities may be separated and placed in other components. According to one embodiment, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or a similar manner as they were performed by the corresponding component prior to integration. According to one embodiment, operations performed by a module, program, or other component may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.

Claims

1. An electronic device comprising: a frame equipped with glass including a display; a wearing structure coupled to the frame to allow the frame to be seated on a user's head; at least one sensor; a camera including a depth camera; a microphone; a speaker; at least one processor; and memory storing instructions that, when executed individually or collectively by the at least one processor, cause the electronic device to: receive a voice input related to a space through the microphone; obtain, based on the voice input, depth information related to an image and the space through the depth camera; convert the voice collected through the microphone into text; create a virtual object to be placed in the space based on the converted text; and display the virtual object on the display.

2. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: identify text features in the converted text; identify image features in the obtained information; identify a fused feature in which the image features and the text features are fused; apply a history state based on the fused feature; generate candidate images based on the fused feature and automatically select a representative image from among the candidate images; display the selected representative image on the glass; and output a caption for the selected representative image as audio through the speaker.

3. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: identify an object for the selected representative image; and convert the identified object into three dimensions and display the three-dimensionally converted object on the glass at a position corresponding to the space within the user's field of view.

4. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to select the representative image from among the candidate images based on user input including at least one of voice input or gesture input.

5. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to display, on the glass, a representative image generated based on an image of an object collected from an external electronic device.

6. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to generate the candidate images by checking depth information on an image of a specific object in a depth map if the name of the specific object is included in the fused feature in which the image features and the text features are fused.

7. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to generate the candidate images by checking depth information on an image of a specific object in a depth map if depth is included in the fused feature in which the image features and the text features are fused.

8. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: generate a prompt based on the voice input or the text, and generate the candidate images based on the generated prompt; and select the representative image based on an image selected or preferred by the user.

9. A spatial design method using an electronic device, the method comprising: receiving a voice input related to a space through a microphone; obtaining, based on the voice input, depth information related to an image and the space through a depth camera; converting the voice input collected through the microphone into text; creating a virtual object to be placed in the space based on the converted text; and displaying the virtual object on a display.

10. The method of claim 9, further comprising: identifying image features in the obtained information; identifying text features in the converted text; identifying a fused feature in which the image features and the text features are fused; applying a history state based on the fused feature; generating candidate images based on the fused feature and automatically selecting a representative image from among the candidate images; displaying the selected representative image on the glass; and outputting a caption for the selected representative image as audio through a speaker.

11. The method of claim 10, further comprising: identifying an object for the selected representative image; and converting the identified object into three dimensions and displaying the three-dimensionally converted object on the glass at a position corresponding to the space within the user's field of view.

12. The method of claim 10, further comprising selecting the representative image from among the candidate images based on user input including at least one of voice input or gesture input.

13. The method of claim 10, further comprising displaying, on the glass, a representative image generated based on an image of an object collected from an external electronic device.

14. The method of claim 10, further comprising: generating the candidate images by checking depth information on an image of a specific object in a depth map if the name of the specific object is included in the fused feature in which the image features and the text features are fused; or generating the candidate images by checking depth information on the image of the specific object in the depth map if depth is included in the fused feature in which the image features and the text features are fused.

15. The method of claim 10, further comprising: generating a prompt based on the voice input or the text, and generating the candidate images based on the generated prompt; or selecting the representative image based on an image selected or preferred by the user.