Electronic device, method, and storage medium for providing information of object

The described device addresses the weight and power issues of glasses-type wearables by using a microphone, camera, and AI to identify objects and provide information, enhancing user interaction without extra hardware.

WO2026141890A1 · PCT designated stage · Publication Date: 2026-07-02 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
Filing Date: 2025-10-16
Publication Date: 2026-07-02

Application Information

Patent Timeline

16 Oct 2025: Application
02 Jul 2026: Publication (WO2026141890A1)

IPC: G06F3/01; G06T7/11; G06T7/50; G06F3/16; G06F3/00; G06V40/20; G06F3/04845; G06F3/0481; G06F3/04842

AI Tagging

Technology Topics

Computer hardware Engineering

AI Technical Summary

Technical Problem

Glasses-type wearable devices with additional hardware for user input recognition are heavy and power-consuming, limiting their long-term use.

Method used

An electronic device equipped with a microphone, camera, and processor that uses an artificial intelligence model to identify user utterances and determine target objects in images, optionally using multimodal input, to provide information without additional hardware.

Benefits of technology

Enables accurate recognition and information provision about target objects in a lightweight and power-efficient manner, utilizing existing hardware to enhance user interaction.

Abstract

An electronic device, a method, and a storage medium for providing information of an object are provided. The electronic device comprises: a microphone; at least one camera; an output interface; at least one processor including a processing circuit; and a memory storing instructions. The instructions, when executed individually or collectively by the at least one processor, instruct the electronic device to acquire an image including at least one object through the at least one camera. The instructions instruct the electronic device to identify a user's utterance received through the microphone by using an artificial intelligence model. The instructions instruct the electronic device to determine whether multi-modal input is required according to whether a target object is specified by the user's utterance. The instructions instruct the electronic device to determine the target object included in the image on the basis of the user's utterance when multi-modal input is unnecessary. The instructions instruct the electronic device to obtain a predetermined area including the target object. The instructions instruct the electronic device to obtain information of the target object included in the predetermined area. The instructions instruct the electronic device to provide information of the target object through the output interface.

Description

Electronic device, method, and storage medium for providing information about an object

[0001] The embodiments of this document relate to electronic devices, methods, and storage media, and, for example, to electronic devices, methods, and storage media that recognize an object included in an image and provide information about the object.

[0002] Glasses-type wearable devices may include additional hardware devices to accurately recognize user input. For example, glasses-type wearable devices may include an eye-tracking sensor that receives user gaze information and display the tracked gaze information on a display. Glasses-type wearable devices may include multiple sensors that recognize the user's hands and recognize input through the user's hands by accurately tracking them. Glasses-type wearable devices can receive user input more accurately by using a separate controller. However, glasses-type wearable devices containing additional hardware devices are heavy and consume a lot of power, which may limit long-term use. Recently, lightweight glasses-type wearable devices containing only a microphone, speaker, and camera are being commercialized.

[0003] The information described above may be provided merely as related art to aid in understanding the present disclosure. None of the foregoing is to be claimed as prior art related to the present disclosure or to be used in determining prior art.

[0004] An electronic device according to various embodiments of this document may include a microphone, at least one camera, an output interface, at least one processor including a processing circuit, and a memory for storing instructions. When the instructions are executed individually or collectively by the at least one processor, the electronic device may acquire an image containing at least one object through the at least one camera. The instructions may enable the electronic device to identify a user's utterance received through the microphone using an artificial intelligence model. The instructions may enable the electronic device to determine whether multimodal input is required based on whether the target object is identified by the user's utterance. The instructions may enable the electronic device to determine the target object included in the image based on the user's utterance when the multimodal input is unnecessary. The instructions may enable the electronic device to acquire a certain area containing the target object. The instructions may enable the electronic device to obtain information about the target object included within the certain area. The instructions may enable the electronic device to provide information about the target object through the output interface.

[0005] A method for providing information about an object in an electronic device according to various embodiments of the present document may include the operation of acquiring an image containing at least one object through at least one camera. The method may include the operation of identifying a user's utterance received through a microphone using an artificial intelligence model. The method may include the operation of determining whether multimodal input is required based on whether the target object is identified by the user's utterance. If the multimodal input is unnecessary, the method may include the operation of determining the target object included in the image based on the user's utterance. The method may include the operation of acquiring a certain area containing the target object. The method may include the operation of acquiring information about the target object included within the certain area. The method may include the operation of providing information about the target object through an output interface.
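The sequence of operations in the method above can be illustrated with a minimal sketch. It is not the applicant's implementation: the names `DetectedObject`, `region_around`, `provide_object_info`, the 10-pixel margin, and the three callables passed in (standing in for the utterance-based step, the pointer-based step, and the information lookup) are all hypothetical and introduced only for illustration. Image acquisition and speech recognition are assumed to have already produced the object list and the utterance text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class DetectedObject:
    """Hypothetical record for one object found in the camera image."""
    label: str
    box: Box


def region_around(box: Box, image_size: Tuple[int, int], margin: int = 10) -> Box:
    """Return an area containing the object: its box grown by a margin, clamped to the image."""
    left, top, right, bottom = box
    width, height = image_size
    return (max(0, left - margin), max(0, top - margin),
            min(width, right + margin), min(height, bottom + margin))


def provide_object_info(
    objects: Sequence[DetectedObject],
    image_size: Tuple[int, int],
    utterance: str,
    specify_by_utterance: Callable[[str, Sequence[DetectedObject]], Optional[DetectedObject]],
    specify_with_pointer: Callable[[str, Sequence[DetectedObject]], Optional[DetectedObject]],
    lookup_info: Callable[[DetectedObject, Box], str],
) -> str:
    # Try to specify the target object from the utterance alone.
    target = specify_by_utterance(utterance, objects)
    # If that fails, multimodal input is required (assumed here: a pointer in the image).
    if target is None:
        target = specify_with_pointer(utterance, objects)
    if target is None:
        return "The target object could not be determined."
    # Acquire the area containing the target, then the information for that area.
    region = region_around(target.box, image_size)
    return lookup_info(target, region)
```

Passing the three steps in as callables keeps the control flow (utterance first, multimodal input only when needed) separate from any particular recognition model.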

[0006] A non-transitory computer-readable storage medium having a program recorded thereon for a method of providing information about an object in an electronic device according to various embodiments of the present document may include instructions for performing an operation of acquiring an image containing at least one object through at least one camera. The storage medium may include instructions for performing an operation of identifying a user's utterance received through a microphone using an artificial intelligence model. The storage medium may include instructions for performing an operation of determining whether multimodal input is required depending on whether the target object is identified by the user's utterance. If the multimodal input is unnecessary, the storage medium may include instructions for performing an operation of determining the target object included in the image based on the user's utterance. The storage medium may include instructions for performing an operation of acquiring a certain area containing the target object. The storage medium may include instructions for performing an operation of acquiring information about the target object included within the certain area. The storage medium may include instructions for performing an operation of providing information about the target object through an output interface.

[0007] The above and other aspects, features, and advantages of specific embodiments of the present disclosure may become more apparent from the following detailed description, taken in conjunction with the accompanying drawings. In the drawings:

[0008] FIG. 1 is a block diagram of an electronic device in a network environment according to various embodiments.

[0009] FIG. 2 is a block diagram illustrating the configuration of an electronic device according to various embodiments.

[0010] FIG. 3 is a block diagram illustrating a configuration for analyzing user input according to various embodiments.

[0011] FIG. 4 is a flowchart illustrating the operation of an electronic device according to various embodiments.

[0012] FIGS. 5 to 13 are drawings illustrating an operation for determining a target object according to various embodiments.

[0013] FIG. 14 is a drawing illustrating an operation to change a target object according to various embodiments.

[0014] FIGS. 15a and 15b are drawings illustrating an operation for determining a target object from an image acquired in the past according to various embodiments.

[0015] FIG. 16 is a flowchart illustrating a method of providing information of an object in an electronic device according to various embodiments.

[0016] Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings so that those skilled in the art can easily implement them. However, the present disclosure may be embodied in various different forms and is not limited to the examples described herein. In relation to the description of the drawings, the same or similar reference numerals may be used for identical or similar components. Furthermore, in the drawings and related descriptions, descriptions of well-known functions and configurations may be omitted for clarity and brevity.

[0017] FIG. 1 is a block diagram of an electronic device (101) in a network environment (100) according to various embodiments.

[0018] Referring to FIG. 1, in a network environment (100), an electronic device (101) may communicate with an electronic device (102) through a first network (198) (e.g., a short-range wireless communication network) or with at least one of an electronic device (104) or a server (108) through a second network (199) (e.g., a long-range wireless communication network). According to one embodiment, the electronic device (101) may communicate with the electronic device (104) through a server (108). According to one embodiment, the electronic device (101) may include a processor (120), memory (130), input module (150), sound output module (155), display module (160), audio module (170), sensor module (176), interface (177), connection terminal (178), haptic module (179), camera module (180), power management module (188), battery (189), communication module (190), subscriber identification module (196), or antenna module (197). In some embodiments, at least one of these components (e.g., connection terminal (178)) may be omitted from the electronic device (101), or one or more other components may be added. In some embodiments, some of these components (e.g., sensor module (176), camera module (180), or antenna module (197)) may be integrated into a single component (e.g., display module (160)).

[0019] The processor (120) can control at least one other component (e.g., a hardware or software component) of the electronic device (101) connected to the processor (120) by executing software (e.g., a program (140)), and can perform various data processing or operations. According to one embodiment, as at least part of the data processing or operations, the processor (120) can store commands or data received from other components (e.g., a sensor module (176) or a communication module (190)) in volatile memory (132), process the commands or data stored in volatile memory (132), and store the resulting data in non-volatile memory (134). According to one embodiment, the processor (120) may include a main processor (121) (e.g., a central processing unit or an application processor) or an auxiliary processor (123) that can operate independently or together with it (e.g., a graphics processing unit, a neural processing unit (NPU), an image signal processor, a sensor hub processor, or a communication processor). For example, if the electronic device (101) includes a main processor (121) and an auxiliary processor (123), the auxiliary processor (123) may be configured to use lower power than the main processor (121) or to be specialized for a designated function. The auxiliary processor (123) may be implemented separately from the main processor (121) or as part thereof.

[0020] The auxiliary processor (123) may control at least some of the functions or states associated with at least one component of the electronic device (101) (e.g., display module (160), sensor module (176), or communication module (190)) on behalf of the main processor (121) while the main processor (121) is in an inactive (e.g., sleep) state, or together with the main processor (121) while the main processor (121) is in an active (e.g., application execution) state. According to one embodiment, the auxiliary processor (123) (e.g., image signal processor or communication processor) may be implemented as part of another functionally related component (e.g., camera module (180) or communication module (190)). According to one embodiment, the auxiliary processor (123) (e.g., neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be generated through machine learning. Such learning may be performed, for example, on the electronic device (101) itself where the artificial intelligence model is executed, or through a separate server (e.g., server (108)). The learning algorithm may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but is not limited to the examples described above. The artificial intelligence model may include a plurality of artificial neural network layers. The artificial neural network may be a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a deep Q-network, or a combination of two or more of the above, but is not limited to the examples described above. In addition to the hardware structure, the artificial intelligence model may additionally or alternatively include a software structure.

[0021] The memory (130) can store various data used by at least one component of the electronic device (101) (e.g., processor (120) or sensor module (176)). The data may include, for example, input data or output data for software (e.g., program (140)) and related commands. The memory (130) may include volatile memory (132) or non-volatile memory (134). The non-volatile memory (134) may include at least one internal memory (136) and an external memory (138).

[0022] The program (140) may be stored as software in memory (130) and may include, for example, an operating system (142), middleware (144), or an application (146).

[0023] The input module (150) can receive commands or data to be used for a component of the electronic device (101) (e.g., processor (120)) from outside the electronic device (101) (e.g., user). The input module (150) may include, for example, a microphone, a mouse, a keyboard, a key (e.g., a button), or a digital pen (e.g., a stylus pen).

[0024] The sound output module (155) can output a sound signal to the outside of the electronic device (101). The sound output module (155) may include, for example, a speaker or a receiver. The speaker may be used for general purposes, such as multimedia playback or recording playback. The receiver may be used to receive incoming calls. According to one embodiment, the receiver may be implemented separately from the speaker or as part thereof.

[0025] The display module (160) can visually provide information to the outside (e.g., a user) of the electronic device (101). The display module (160) may include, for example, a display, a holographic device, or a projector, and a control circuit for controlling the corresponding device. According to one embodiment, the display module (160) may include a touch sensor configured to detect a touch, or a pressure sensor configured to measure the intensity of the force generated by the touch.

[0026] The audio module (170) can convert sound into an electrical signal or, conversely, convert an electrical signal into sound. According to one embodiment, the audio module (170) can acquire sound through the input module (150) or output sound through the sound output module (155) or an external electronic device (e.g., electronic device (102)) (e.g., speaker or headphones) connected directly or wirelessly to the electronic device (101).

[0027] The sensor module (176) can detect the operating state of the electronic device (101) (e.g., power or temperature) or the external environmental state (e.g., user state) and generate an electrical signal or data value corresponding to the detected state. According to one embodiment, the sensor module (176) may include, for example, a gesture sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an accelerometer sensor, a grip sensor, a proximity sensor, a color sensor, an IR (infrared) sensor, a biosensor, a temperature sensor, a humidity sensor, or an illuminance sensor.

[0028] The interface (177) may support one or more specified protocols that can be used for the electronic device (101) to be connected directly or wirelessly to an external electronic device (e.g., electronic device (102)). According to one embodiment, the interface (177) may include, for example, a high definition multimedia interface (HDMI), a universal serial bus (USB) interface, an SD card interface, or an audio interface.

[0029] The connection terminal (178) may include a connector through which the electronic device (101) can be physically connected to an external electronic device (e.g., electronic device (102)). According to one embodiment, the connection terminal (178) may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).

[0030] The haptic module (179) can convert an electrical signal into a mechanical stimulus (e.g., vibration or movement) or an electrical stimulus that the user can perceive through tactile or kinesthetic senses. According to one embodiment, the haptic module (179) may include, for example, a motor, a piezoelectric element, or an electric stimulation device.

[0031] The camera module (180) can capture still images and video. According to one embodiment, the camera module (180) may include one or more lenses, image sensors, image signal processors, or flashes.

[0032] The power management module (188) can manage power supplied to the electronic device (101). According to one embodiment, the power management module (188) can be implemented, for example, as at least part of a power management integrated circuit (PMIC).

[0033] The battery (189) can supply power to at least one component of the electronic device (101). According to one embodiment, the battery (189) may include, for example, a non-rechargeable primary battery, a rechargeable secondary battery, or a fuel cell.

[0034] The communication module (190) can support the establishment of a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device (101) and an external electronic device (e.g., electronic device (102), electronic device (104), or server (108)), and the performance of communication through the established communication channel. The communication module (190) may include one or more communication processors that operate independently of the processor (120) (e.g., application processor) and support direct (e.g., wired) communication or wireless communication. According to one embodiment, the communication module (190) may include a wireless communication module (192) (e.g., cellular communication module, short-range wireless communication module, or GNSS (global navigation satellite system) communication module) or a wired communication module (194) (e.g., LAN (local area network) communication module, or power line communication module). The corresponding communication module among these communication modules can communicate with an external electronic device (104) through a first network (198) (e.g., a short-range communication network such as Bluetooth, WiFi (wireless fidelity) direct, or IrDA (infrared data association)) or a second network (199) (e.g., a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or WAN)). These various types of communication modules may be integrated into a single component (e.g., a single chip) or implemented as multiple separate components (e.g., multiple chips). The wireless communication module (192) can identify or authenticate the electronic device (101) within a communication network such as the first network (198) or the second network (199) using subscriber information (e.g., international mobile subscriber identity (IMSI)) stored in the subscriber identification module (196).

[0035] The wireless communication module (192) can support 5G networks following 4G networks and next-generation communication technologies, for example, new radio (NR) access technology. NR access technology can support high-speed transmission of high-capacity data (enhanced mobile broadband (eMBB)), minimization of terminal power and connection of multiple terminals (massive machine type communications (mMTC)), or high reliability and low latency (ultra-reliable and low-latency communications (URLLC)). The wireless communication module (192) can support a high-frequency band (e.g., mmWave band) to achieve a high data transmission rate, for example. The wireless communication module (192) can support various technologies for securing performance in the high-frequency band, such as beamforming, massive MIMO (multiple-input and multiple-output), full-dimensional MIMO (FD-MIMO), array antenna, analog beam-forming, or large-scale antenna. The wireless communication module (192) can support various requirements specified in the electronic device (101), external electronic device (e.g., electronic device (104)), or network system (e.g., second network (199)). According to one embodiment, the wireless communication module (192) can support a peak data rate (e.g., 20 Gbps or more) for eMBB realization, loss coverage (e.g., 164 dB or less) for mMTC realization, or U-plane latency (e.g., downlink (DL) and uplink (UL) each 0.5 ms or less, or round trip 1 ms or less) for URLLC realization.

[0036] An antenna module (197) can transmit a signal or power to the outside (e.g., an external electronic device) or receive a signal or power from the outside. According to one embodiment, the antenna module (197) may include an antenna comprising a radiator made of a conductor or a conductive pattern formed on a substrate (e.g., a PCB). According to one embodiment, the antenna module (197) may include a plurality of antennas (e.g., an array antenna). In this case, at least one antenna suitable for a communication method used in a communication network, such as a first network (198) or a second network (199), may be selected from the plurality of antennas, for example, by a communication module (190). A signal or power may be transmitted or received between the communication module (190) and an external electronic device through the selected at least one antenna. According to some embodiments, in addition to the radiator, other components (e.g., a radio frequency integrated circuit (RFIC)) may be additionally formed as part of the antenna module (197).

[0037] According to various embodiments, the antenna module (197) may form a mmWave antenna module. According to one embodiment, the mmWave antenna module may include a printed circuit board, an RFIC disposed on or adjacent to a first surface (e.g., bottom surface) of the printed circuit board and capable of supporting a specified high frequency band (e.g., mmWave band), and a plurality of antennas (e.g., array antennas) disposed on or adjacent to a second surface (e.g., top surface or side surface) of the printed circuit board and capable of transmitting or receiving a signal of the specified high frequency band.

[0038] At least some of the above components can be connected to each other via a communication method between peripheral devices (e.g., bus, GPIO (general purpose input and output), SPI (serial peripheral interface), or MIPI (mobile industry processor interface)) and exchange signals (e.g., commands or data) with each other.

[0039] According to one embodiment, commands or data may be transmitted or received between the electronic device (101) and an external electronic device (104) through a server (108) connected to a second network (199). Each of the external electronic devices (102, or 104) may be the same or a different type of device as the electronic device (101). According to one embodiment, all or part of the operations performed on the electronic device (101) may be performed on one or more of the external electronic devices (102, 104, or 108). For example, if the electronic device (101) needs to perform a function or service automatically or in response to a request from a user or another device, the electronic device (101) may request one or more external electronic devices to perform at least part of the function or service instead of performing the function or service itself or additionally. One or more external electronic devices that receive the above request may execute at least part of the requested function or service, or additional function or service related to the request, and transmit the result of the execution to the electronic device (101). The electronic device (101) may provide the result as is or additionally processed as at least part of the response to the request. For this purpose, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used. The electronic device (101) may provide ultra-low latency services using, for example, distributed computing or mobile edge computing. In one embodiment, the external electronic device (104) may include an Internet of Things (IoT) device. The server (108) may be an intelligent server using machine learning and / or neural networks. 
According to one embodiment, the external electronic device (104) or the server (108) may be included within a second network (199). The electronic device (101) can be applied to intelligent services (e.g., smart home, smart city, smart car, or healthcare) based on 5G communication technology and IoT-related technology.
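The request-and-respond pattern described in paragraph [0039] can be sketched as follows. This is only an illustration under stated assumptions: `perform_function`, `local_fn`, `remote_fn`, and `post_process` are hypothetical names, and falling back to local execution when the remote call raises `ConnectionError` is an assumption that the text above does not describe.

```python
from typing import Any, Callable, Optional


def perform_function(
    payload: Any,
    local_fn: Callable[[Any], Any],
    remote_fn: Optional[Callable[[Any], Any]] = None,
    post_process: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Run a function on the device, or ask an external device to run it instead."""
    if remote_fn is not None:
        try:
            # Request an external device (e.g., a server) to perform the function.
            result = remote_fn(payload)
        except ConnectionError:
            # Assumption for this sketch: run on the device if the request fails.
            result = local_fn(payload)
    else:
        result = local_fn(payload)
    # Provide the result as is, or additionally processed.
    return post_process(result) if post_process is not None else result
```

For example, `perform_function("mug", local_fn=str.upper)` runs entirely on the device, while supplying `remote_fn` routes the same payload to the external device first.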

[0040] FIG. 2 is a block diagram illustrating the configuration of an electronic device according to various embodiments.

[0041] According to one embodiment, with reference to FIG. 2, the electronic device (200) may include a microphone (210), a camera (220), an output interface (230), a memory (240), and a processor (250).

[0042] According to one embodiment, a microphone (210) (e.g., input module (150) of FIG. 1) can receive sounds from the surrounding environment or speech from a user. A processor (250) can identify the received speech from a user using an artificial intelligence model and perform an action corresponding to the speech from the user. For example, the microphone (210) may include a standard microphone, a surround microphone, and / or a directional microphone.

[0043] According to one embodiment, a camera (220) (e.g., camera module (180) of FIG. 1) can capture the surrounding environment to acquire an image. For example, the image may include a static image (e.g., a still image) containing one frame and / or a dynamic image (e.g., a video) containing multiple frames. The image may include one or more objects. For example, the camera (220) may include an RGB camera, a depth camera, a wide-angle camera, and / or a telephoto camera. The electronic device (200) may include one or more cameras (220).

[0044] According to one embodiment, an output interface (230) (e.g., the sound output module (155) and / or the display module (160) of FIG. 1) can output data (or information) processed by the processor (250). For example, the output interface (230) may include a speaker (e.g., the sound output module (155) of FIG. 1) and / or a display (e.g., the display module (160) of FIG. 1). The speaker may output information related to the user's speech and / or data processed based on the user's speech as voice and / or notification sounds, and the display may display text and / or images.

[0045] According to one embodiment, a memory (240) (e.g., memory (130) of FIG. 1) may store data, algorithms, programs, instructions, etc. that perform the functions of an electronic device (200). Instructions, etc. stored in the memory (240) may be loaded into a processor (250) and executed by the processor (250). The memory (240) may include a database containing information about objects.

[0046] According to one embodiment, a processor (250) (e.g., processor (120) of FIG. 1) can control each component of the electronic device (200). The electronic device (200) may include one or more processors (250). For example, the processor (250) may correspond to a plurality of processors that collectively perform a plurality of functions by dividing them among the processors.

[0047] The processor (250) may include various processing circuits and / or multiple processors. For example, the term “processor” as used herein, including in the claims, may include various processing circuits including at least one processor. Here, one or more of the at least one processor may perform the various functions described in this document in a distributed manner, individually and / or collectively. When the terms “processor,” “at least one processor,” and “one or more processors” as used herein are described as performing numerous functions, these terms may encompass, by example and without limitation, situations where one processor performs some of the mentioned functions and other processor(s) perform others of the mentioned functions, and situations where a single processor can perform all the mentioned functions. Additionally, at least one processor may include a combination of processors performing various mentioned / disclosed functions, and may perform them, for example, in a distributed manner. At least one processor may execute program instructions to achieve or perform various functions.

[0048] For example, the processor (250) can acquire an image containing an object through the camera (220). The image may contain one or more objects. As an example, if the electronic device (200) includes a display, the processor (250) can display the acquired image on the display. The processor (250) can receive a user's speech through the microphone (210). The processor (250) can identify the received user's speech using an artificial intelligence model.

[0049] For example, the processor (250) can determine whether a target object is specified among the objects included in the image based on the identified user's utterance. For example, specifying a target object may mean a case where the processor (250) can determine which object the user refers to among one or more objects included in the image. The processor (250) can specify a target object among the objects included in the image based solely on the user's utterance. As an example, if only one object is included in the image and the user utters, "When was this made?", the processor (250) can specify that the object referred to by the user is the one object included in the image. As an example, if a white object and a blue object are included in the image and the user utters, "What is the price of the white item?", the processor (250) can specify that the object referred to by the user is the white object.

[0050] For example, if the target object is identified solely by the user's utterance, the processor (250) can determine the target object included in the image. For example, determining the target object may mean identifying the location, size, shape, type, and / or features of the object referred to by the user.

[0051] For example, if the target object is not identified solely by the user's utterance, the processor (250) may determine that multimodal input is required. If it is determined that multimodal input is required, the processor (250) may additionally identify the target object using a pointer. As an example, the pointer may include a part of the body (e.g., a finger), a pre-set form of said part of the body (e.g., two fingers extended), a pre-set gesture (e.g., a gesture of drawing a circle with a finger), and / or a pre-set object (e.g., a smart pen). The processor (250) may acquire an image containing the pointer through the camera (220) and identify the pointer from the image. The processor (250) may identify the target object based on the user's utterance and the pointer. When the target object is identified, the processor (250) may determine the target object.

[0052] For example, the processor (250) can determine a target object based on the end point of the pointer and / or a certain range from the end point. The processor (250) can determine an object included within a certain range from the end point of the pointer and / or the end point as a target object. Multiple objects may be included within a certain range from the end point of the pointer and / or the end point. If multiple objects are included within a certain range from the end point of the pointer and / or the end point, the processor (250) can determine the target object using additional information. For example, the processor (250) can obtain depth information of the object through the camera (220). As an example, the electronic device (200) may further include a sensor (e.g., the sensor module (176) of FIG. 1). The processor (250) can obtain depth information of the object through the sensor. When the user's utterance refers to a distant location, the processor (250) can determine the object at the farthest location among the multiple objects as the target object based on depth information. When the user's utterance refers to a near location, the processor (250) can determine the object at the closest location among the multiple objects as the target object based on depth information. For example, when multiple objects are included within a certain range from the end point of a pointer and / or the end point, and the user utters "How much is this?", the processor (250) can determine the object at the closest location among the multiple objects as the target object based on depth information. For example, when the user utters "How much is that?", the processor (250) can determine the object at the farthest location among the multiple objects as the target object based on depth information.
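The depth-based selection described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the `Candidate` type, the word lists `NEAR_WORDS` and `FAR_WORDS`, and the function `pick_by_depth` are hypothetical names, and a simple word match stands in for the artificial intelligence model that interprets the utterance.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    depth_m: float  # assumed: distance from the camera, from depth information


# Assumed word lists; the source only gives "this" and "that" as examples.
NEAR_WORDS = {"this", "these", "here"}
FAR_WORDS = {"that", "those", "there"}


def pick_by_depth(utterance, candidates):
    """Pick the nearest or farthest candidate depending on the utterance."""
    if not candidates:
        return None
    words = {w.strip("?.,!").lower() for w in utterance.split()}
    if words & NEAR_WORDS:
        return min(candidates, key=lambda c: c.depth_m)
    if words & FAR_WORDS:
        return max(candidates, key=lambda c: c.depth_m)
    return None  # the utterance gives no near/far hint


near_pointer = [Candidate("cup", 0.5), Candidate("vase", 2.0)]
closest = pick_by_depth("How much is this?", near_pointer)
farthest = pick_by_depth("How much is that?", near_pointer)
```

Returning `None` when the utterance carries no near or far hint leaves room for the re-entry guidance described elsewhere in this document.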

[0053] For example, the processor (250) can determine the target object by extending the range from the end point of the pointer. As an example, if the target object is not determined based on the end point of the pointer, the processor (250) can determine the target object within a first range of size from the end point of the pointer (e.g., a radius of 1 cm from the end point of the pointer). If the target object is not determined within the first range of size from the end point of the pointer, the processor (250) can determine the target object within a second range of size from the end point of the pointer (e.g., a radius of 2 cm from the end point of the pointer).
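The range expansion described above can be sketched as follows. This is a simplified illustration under stated assumptions: each object is reduced to a single centre point in centimetres, and the function name `find_with_expanding_radius` and the default radii (0, 1 and 2 cm, following the example values above) are hypothetical.

```python
import math


def find_with_expanding_radius(tip, objects, radii_cm=(0.0, 1.0, 2.0)):
    """Search around the pointer end point, widening the radius step by step.

    tip: (x, y) of the pointer end point.
    objects: dict mapping an object name to its (x, y) centre (assumed model).
    Returns (names found, radius used), or ([], None) if nothing is found.
    """
    for radius in radii_cm:
        hits = [
            name
            for name, (x, y) in objects.items()
            if math.hypot(x - tip[0], y - tip[1]) <= radius
        ]
        if hits:
            return hits, radius
    return [], None


scene = {"mug": (1.5, 0.0), "pen": (5.0, 5.0)}
found, used_radius = find_with_expanding_radius((0.0, 0.0), scene)
```

If several names are returned at the same radius, additional information such as depth would still be needed to choose among them.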

[0054] For example, the processor (250) can determine a target object based on a path (e.g., a straight line, a curve, or a closed curve) formed by the movement of the pointer. For example, if a closed curve is formed as the pointer moves and one object is included within the closed curve, the processor (250) can determine the object included within the closed curve as the target object. For example, if parts of multiple objects are included within the closed curve, the processor (250) can determine the object that contains the largest portion and / or the object that contains more than a preset ratio as the target object. For example, the processor (250) can determine the target object after identifying the target object. For example, the processor (250) can identify and determine the target object substantially simultaneously.
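The closed-curve selection described above can be sketched as follows. This sketch makes several assumptions not stated in the source: the path is a polygon of (x, y) vertices, each object is an axis-aligned box, the overlap is estimated by sampling a grid of points, and the two criteria (largest portion, preset ratio) are combined with "and". All function names and the 0.5 ratio are hypothetical.

```python
def point_in_polygon(px, py, poly):
    """Ray-casting test: is (px, py) inside the closed polygon?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal ray
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def fraction_inside(box, poly, steps=10):
    """Estimate the share of box (x0, y0, x1, y1) lying inside the polygon."""
    x0, y0, x1, y1 = box
    count = 0
    for i in range(steps):
        for j in range(steps):
            px = x0 + (i + 0.5) * (x1 - x0) / steps
            py = y0 + (j + 0.5) * (y1 - y0) / steps
            count += point_in_polygon(px, py, poly)
    return count / (steps * steps)


def pick_in_closed_curve(boxes, poly, min_ratio=0.5):
    """Return the object with the largest enclosed share, if above min_ratio."""
    ratios = {name: fraction_inside(box, poly) for name, box in boxes.items()}
    if not ratios:
        return None
    best = max(ratios, key=ratios.get)
    return best if ratios[best] >= min_ratio else None


curve = [(0, 0), (10, 0), (10, 10), (0, 10)]
boxes = {"lamp": (2, 2, 6, 6), "chair": (8, 8, 16, 16)}
target = pick_in_closed_curve(boxes, curve)
```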

[0055] For example, the received user utterance and the acquired pointer (or an area or path associated with the pointer) may together be referred to as multimodal input. The processor (250) can process the received multimodal input to determine the target object. The processor (250) can determine the target object using the received multimodal input, without additional hardware, and provide information about the target object.

[0056] For example, the processor (250) can determine a target object from an image acquired in the past based on the user's utterance. The image acquired by the processor (250) may include a dynamic image (e.g., a video). As an example, Building A may be included in an image (or frame) of a dynamic image at a certain point in time, and Building A may disappear from an image at a current point in time. If the user utters, "Where is the entrance to the parking lot of the building from earlier?", the processor (250) can determine Building A, which is included in the image at a previous point in time, as the target object.
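One way to support a lookup in earlier frames, as in the Building A example above, is to keep a bounded history of the labels detected in each frame. The following is a minimal sketch under assumptions: the class `FrameHistory`, its methods, and the idea of storing only label sets per timestamp are hypothetical and are not described in the source.

```python
from collections import deque


class FrameHistory:
    """Bounded record of which object labels appeared in recent frames."""

    def __init__(self, maxlen=300):
        # Oldest entries are dropped automatically once maxlen is reached.
        self._frames = deque(maxlen=maxlen)

    def add(self, timestamp, labels):
        self._frames.append((timestamp, frozenset(labels)))

    def last_seen(self, label):
        """Return the timestamp of the most recent frame containing label."""
        for timestamp, labels in reversed(self._frames):
            if label in labels:
                return timestamp
        return None


history = FrameHistory()
history.add(1.0, {"Building A", "tree"})
history.add(2.0, {"tree"})  # Building A has left the field of view
seen_at = history.last_seen("Building A")
```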

[0057] For example, the processor (250) may provide guidance to the user to re-enter their speech and / or pointer. If the processor (250) cannot determine the target object based on the pointer and additional information (e.g., depth information, location information) (or if the target object cannot be identified), the processor (250) may provide guidance to the user to re-enter their speech and / or pointer. As an example, the processor (250) may output audio guidance through a speaker, such as “Please repeat that” and / or “Please move the pointer to the correct location.” As an example, the processor (250) may display text guidance through a display, such as “Please be specific” and / or “Please move the pointer to select the object.”

[0058] For example, if a target object is determined, the processor (250) may display an indicator representing the determined target object through a display. As an example, the indicator representing the target object may include a line, a shape, a color and / or shade. The processor (250) may output the name, type and / or area of the target object through an output interface (230).

[0059] For example, the processor (250) can acquire (e.g., crop, copy) a certain area containing a target object. As an example, the processor (250) can determine the edges (e.g., outlines) of the determined target object and acquire a certain area along the determined edges.

[0060] For example, the electronic device (200) may further include a communication interface (e.g., the communication module (190) of FIG. 1). The processor (250) may transmit an image containing the determined target object to an external device through the communication interface. As an example, the external device may perform operations such as changing the target object, resizing the area to be acquired, and / or repositioning the area to be acquired according to user input, and generate operation data (or information). The processor (250) may receive the operation data generated by the external device through the communication interface. The processor (250) may acquire a certain area containing the target object based on the received operation data.

[0061] For example, the processor (250) can obtain information about a target object included within a certain area obtained. As an example, if a database contains information about the target object, the processor (250) can obtain information about the target object based on the database. If the database does not contain information about the target object, the processor (250) can obtain information about the target object using an external device. As an example, the processor (250) can obtain information about the target object using an artificial intelligence model. The processor (250) can provide information about the target object corresponding to the user's utterance through the output interface (230). As an example, if the user's utterance is to inquire about the history of the target object, the processor (250) can output the history of the target object through the output interface (230). If the user's utterance is to inquire about the price of the target object, the processor (250) can output the price of the target object through the output interface (230). For example, the processor (250) can output information of the target object as an audio signal through a speaker. The processor (250) can display information of the target object as text and / or an image through a display.
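The lookup order described above (local database first, external source otherwise, then the field matching the utterance) can be sketched as follows. This is an illustrative sketch only: `lookup_info`, the `external_lookup` callable, the record layout, and the placeholder values are all hypothetical, and a substring match stands in for real intent analysis.

```python
def lookup_info(target, utterance, database, external_lookup):
    """Return the part of the target's record that the utterance asks about."""
    record = database.get(target)
    if record is None:
        # Not in the local database: fall back to the external source.
        record = external_lookup(target)
    text = utterance.lower()
    for field in ("price", "history"):
        if field in text and field in record:
            return record[field]
    return record  # no specific field requested: return everything known


# Placeholder values for illustration only.
local_db = {
    "phone case": {"price": "PRICE_PLACEHOLDER", "history": "HISTORY_PLACEHOLDER"}
}
answer = lookup_info(
    "phone case", "What is the price of this?", local_db, lambda name: {}
)
```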

[0062] For example, various embodiments of this document can accurately identify a target object included in an image acquired by an electronic device without additional hardware devices and provide information about the target object. For example, various embodiments of this document can easily determine a target object using a user's multimodal input in a lightweight electronic device (200) composed of simple hardware.

[0063] FIG. 3 is a block diagram illustrating a configuration for analyzing user input according to various embodiments.

[0064] According to one embodiment, with reference to FIG. 3, the electronic device (200) may include a task-based analysis system (300). The task-based analysis system (300) may include a camera (310) (e.g., the camera module (180) of FIG. 1 or the camera (220) of FIG. 2), a camera driver (320), an automatic speech recognition (ASR) module (330), a natural-language understanding (NLU) module (340), an image analysis module (350), a user input analysis module (360), an artificial intelligence model (370), and a database (380). As an example, some components of the task-based analysis system (300) may be stored in the memory (240) and loaded into the processor (250) to operate.

[0065] According to one embodiment, the electronic device (200) may include one or more cameras (310). At least some of the one or more cameras (310) may be able to capture images in one direction relative to the electronic device (200). The camera driver (320) may drive the one or more cameras (310) so that they can capture images. The ASR (330) may convert a user's utterance into text. As an example, the user may make an utterance including a wake word. The electronic device (200) may activate a voice recognition operation in response to the wake word and receive the user's utterance. As an example, the user may say, "Hi, tell me the information about the cushion I see now," and "Hi" may be the wake word. The electronic device (200) may activate a voice recognition operation in response to the utterance "Hi" and receive the utterance "Tell me the information about the cushion I see now." The ASR (330) may convert the received user's utterance into text. As an example, the electronic device (200) may receive a trigger input and perform a voice recognition operation. The trigger input may include a pre-set button input and / or a pre-set touch input. As an example, when a user presses a pre-set button, the electronic device (200) may activate the microphone (210). The electronic device (200) may receive the user's utterance through the microphone (210), and the ASR (330) may convert the received user's utterance into text. As an example, the electronic device (200) may process the user's utterance without a wake word and / or a trigger input. When the user speaks, the electronic device (200) may receive the user's utterance, and the ASR (330) may convert the user's utterance into text. The electronic device (200) may analyze the converted text and determine which portions of the converted text need to be processed.
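The wake-word handling in the "Hi" example above can be sketched on the text level as follows. This is a simplification: in practice the wake word would typically be detected on audio before transcription, and the function name `strip_wake_word` and its return convention are hypothetical.

```python
def strip_wake_word(transcript, wake_word="hi"):
    """Return the command that follows the wake word, or None if it is absent."""
    text = transcript.strip()
    if text.lower().startswith(wake_word):
        rest = text[len(wake_word):]
        # Require a word boundary so that "History ..." does not match "hi".
        if not rest or not rest[0].isalnum():
            return rest.lstrip(" ,.!").strip()
    return None


command = strip_wake_word(
    "Hi, tell me the information about the cushion I see now"
)
```

A return value of `None` means that no wake word was found, so the utterance would not be passed on for further processing in this mode.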

[0066] According to one embodiment, the NLU (340) can determine the user's intention based on the converted text (or the user's utterance). For example, the NLU (340) can determine whether to activate the camera (310) based on the converted text. If the NLU (340) determines that operation of the camera (310) is not necessary, the electronic device (200) (e.g., the NLU (340)) may not drive the camera (310). If the NLU (340) determines that operation of the camera (310) is necessary, the electronic device (200) (e.g., the NLU (340)) may drive the camera (310). As an example, if the user utters "What is the weather like today?", the NLU (340) may determine that operation of the camera (310) is unnecessary and may not drive the camera (310). For example, if a user utters, "What breed of cat is in front of me right now?", the NLU (340) may determine that image information is needed and may request the camera driver (320) to acquire image information. The camera driver (320) may drive at least one of the one or more cameras (310) and acquire an image in accordance with the request of the NLU (340). For example, the image may include a still image and / or a video. For example, the camera (310) may be driven by a specific trigger keyword. If the user's utterance includes a specific trigger keyword, the NLU (340) may request the camera driver (320) to acquire image information, and the camera driver (320) may drive the camera (310).
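The trigger-keyword variant described above can be sketched as a simple text check. This is only a rough stand-in for the NLU (340), which would normally infer intent with a trained model; the keyword list `TRIGGER_KEYWORDS` and the function `needs_camera` are hypothetical.

```python
# Assumed keyword list; the source does not specify the trigger keywords.
TRIGGER_KEYWORDS = ("in front of me", "i see", "look at")


def needs_camera(text, keywords=TRIGGER_KEYWORDS):
    """Decide whether the camera should be driven for this utterance."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


weather = needs_camera("What is the weather like today?")
cat = needs_camera("What breed of cat is in front of me right now?")
```

Driving the camera only when the check succeeds matches the stated aim of limiting power consumption on a lightweight device.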

[0067] According to one embodiment, when the camera (310) acquires an image, the image analysis module (350) can analyze the image acquired from the camera (310). The image analysis module (350) can determine whether the acquired image contains an object intended (or referred to) by the user (e.g., a target object) and can determine the object intended by the user. Data related to the determined object intended by the user can be used when generating result information. For example, the image analysis module (350) can acquire (e.g., crop, copy) a certain area containing the target object (or information related to the certain area (e.g., coordinate information)) and transmit the acquired certain area to an artificial intelligence model (370).

[0068] For example, if image information is required, an image acquired from a camera (310) can be transmitted to an image analysis module (350). The image analysis module (350) can identify objects contained in the image. According to one example, the image analysis module (350) may use computer vision technology to identify objects in the image. For example, the image analysis module (350) may identify objects using an artificial intelligence model (370). As an example, the image analysis module (350) may recognize objects contained in the image using deep learning or machine learning. The image analysis module (350) can determine whether the object intended by the user is identified among the identified objects. For example, the image analysis module (350) may identify the object using an artificial intelligence model (370). If the object intended by the user is identified, the image analysis module (350) can determine the object intended by the user. For example, the electronic device (200) may include a database (380) containing information about an object. An image analysis module (350) can determine whether information about an object corresponding to the object intended by the user is included in the database (380). For example, the database (380) may be generated through learning about the object and may be generated based on the object's tags. The database may include information about famous buildings, works of art, manufactured goods, and / or food products whose packaging is specified. For example, if the database (380) contains information about the target object (e.g., the name and price of a mobile phone case), the image analysis module (350) may generate result information based on the database (380).

[0069] According to one embodiment, the image analysis module (350) can identify a pointer included in the acquired image. As an example, the pointer may include a part of the body (e.g., hand, finger), a pre-set shape of the part of the body, a pre-set gesture (or movement of the part of the body), and / or a pre-set object. The image analysis module (350) can determine (or identify) the pointer as one of the objects included in the image. According to one embodiment, the user input analysis module (360) can determine whether the pointer is present in the acquired image and which object the pointer refers to. For example, if the object intended by the user is not specified (or determined) by the image analysis module (350) alone, the user input analysis module (360) can analyze the image of the area surrounding the pointer and specify the object intended by the user.

[0070] For example, the user input analysis module (360) can determine a target object based on a certain range from a point of the pointer or from the end point of the pointer. For example, the user input analysis module (360) can determine a target object by expanding the range from the end point of the pointer. For example, the user input analysis module (360) can determine a target object using depth information of the object. For example, the user input analysis module (360) can determine a target object based on a path (e.g., a closed curve) formed by the movement of the pointer.

[0071] According to one embodiment, the artificial intelligence model (370) can generate result information to be provided to the user based on multimodal input (e.g., the user's utterance and the acquired image). As an example, the artificial intelligence model (370) may include a large language model (LLM) and / or a large multimodal model (LMM). The artificial intelligence model (370) may receive a certain area acquired from an image from the image analysis module (350) and the converted text (or the user's utterance) from the NLU (340). As an example, the artificial intelligence model (370) may receive information about the area surrounding a pointer or about a determined target object from the user input analysis module (360). The artificial intelligence model (370) may generate result information to be provided to the user (e.g., information about the target object) based on data related to the object generated through the analysis of the user's utterance and / or the acquired image. For example, the artificial intelligence model (370) may include an artificial intelligence model for recognizing user utterances, an artificial intelligence model for image analysis, and an artificial intelligence model for generating result information. As an example, the artificial intelligence model for recognizing user utterances, the artificial intelligence model for image analysis, and the artificial intelligence model for generating result information may each be implemented separately. As an example, the artificial intelligence model for recognizing user utterances, the artificial intelligence model for image analysis, and the artificial intelligence model for generating result information may be implemented as a single model.

[0072] For example, if the database (380) does not contain information about the target object, the artificial intelligence model (370) can generate result information using an external device. For example, the artificial intelligence model (370) can transmit information related to the target object image (e.g., shape and color of a mobile phone case) to an external device and receive result information from the external device.

[0073] As an example, all or part of the task-based analysis system (300) may be included in an external device. The electronic device (200) may determine an object intended by the user and acquire a certain area containing the determined object intended by the user. As an example, a certain area of the object intended by the user may be acquired based on the edges of the object intended by the user. The electronic device (200) may transmit the user's utterance and the acquired certain area to an external device including the task-based analysis system (300). The external device may generate result information to be provided to the user based on the received user's utterance and / or the acquired certain area. The external device may transmit the generated result information to be provided to the user to the electronic device (200). The electronic device (200) that receives the result information may output the result information through an output interface (230).

[0074] FIG. 4 is a flowchart illustrating the operation of an electronic device according to various embodiments.

[0075] In the following embodiments, each operation may be performed sequentially, but is not necessarily performed sequentially. For example, the order of each operation may be changed, and at least two operations may be performed in parallel.

[0076] According to one embodiment, operations 410 to 490 may be understood to be performed by a processor (e.g., processor (120) of FIG. 1 or processor (250) of FIG. 2) of an electronic device (e.g., electronic device (101) of FIG. 1 or electronic device (200) of FIG. 2).

[0077] According to one embodiment, the electronic device (200) can acquire an image (410) and receive a user's speech (420). The electronic device (200) can acquire an image containing an object using a camera (220). As an example, if the electronic device (200) needs to acquire an image based on a user's speech, it can acquire an image using the camera (220). When a user speaks, the electronic device (200) can receive the user's speech using a microphone (210).

[0078] According to one embodiment, the electronic device (200) can determine whether multimodal input is required (430). For example, if a target object is identified based on the user's utterance, the electronic device (200) can determine that multimodal input is not required. For example, if a target object is not identified solely by the user's utterance, the electronic device (200) can determine that multimodal input is required.

[0079] According to one embodiment, if it is determined that multimodal input is not required (430-NO), the electronic device (200) can determine a target object based on the user's utterance (440). The electronic device (200) can acquire a certain area containing the target object (450) and acquire information about the identified target object (470). As an example, the electronic device (200) can transmit a certain area containing the target object (or related information) to an external device. The external device can acquire information about the target object based on the received data. As an example, the external device can acquire information about the target object using an artificial intelligence model. The external device can transmit the acquired information about the target object to the electronic device (200). The electronic device (200) can provide the acquired information about the target object using an output interface (230).

[0080] According to one embodiment, if it is determined that multimodal input is required (430-YES), the electronic device (200) can determine whether it can identify (or specify) a target object through the analysis of the acquired image (460). For example, the electronic device (200) can determine a pointer included in the image, a certain area centered on the pointer, and / or a path formed by the pointer, and identify the target object. As an example, the pointer may include a part of the body (e.g., a finger), a pre-set form of said part of the body (e.g., a form with two fingers extended), a pre-set gesture (e.g., a gesture of drawing a circle with a finger), and / or a pre-set object (e.g., a smart pen). As an example, the electronic device (200) can identify an object located at the end point of the pointer as the target object. As an example, the electronic device (200) can identify an object included within a certain distance area from the pointer (or a point of the pointer) as the target object. As an example, the electronic device (200) can estimate the direction (e.g., vector) value and / or coordinate value indicated by the pointer and estimate the object located in that direction and / or at that point. The electronic device (200) can identify the estimated object as a target object. As an example, the electronic device (200) can identify an object located within a path formed by the pointer as a target object. As an example, if multiple objects are included within an area or path at a certain distance from the pointer, the electronic device (200) can identify all of the multiple objects as target objects.

[0081] According to one embodiment, if the target object is not identified through image analysis (460-NO), the electronic device (200) may make an additional inquiry to the user. If the user makes an additional utterance in response to the inquiry, the electronic device (200) receives the user's utterance (420) and can determine whether multimodal input is required (430).

[0082] According to one embodiment, when a target object is identified through image analysis (460-YES), the electronic device (200) can search for the edges of the target object and obtain an area containing the target object (e.g., crop, copy) (470). For example, when a target object is identified, the electronic device (200) can obtain an area containing the target object based on the edges of the target object.

[0083] According to one embodiment, the electronic device (200) can transmit the acquired area, along with the user's utterance, to an artificial intelligence model (e.g., LLM and / or LMM) (480) and can acquire information about the target object (490). The electronic device (200) can acquire information about the target object using the artificial intelligence model based on the user's utterance and the acquired area including the target object. As an example, the artificial intelligence model may be included in an external device. The electronic device (200) can transmit the acquired area, along with the user's utterance, to the external device. The external device can acquire information about the target object using the artificial intelligence model based on the received user's utterance and the acquired area including the target object. The external device can transmit the acquired information about the target object to the electronic device (200). The electronic device (200) can provide the acquired information about the target object using an output interface (230).
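The overall flow of FIG. 4 as described in the preceding paragraphs can be summarized in a short control-flow sketch. This is a simplification under assumptions: each stage is passed in as a hypothetical callable, both branches use the same crop and query steps, and the additional inquiry of the 460-NO branch is reduced to a status value. Operation numbers in the comments follow the text above.

```python
def handle_request(utterance, image, identify_from_speech,
                   identify_from_pointer, crop, query_model):
    """Run one pass of the flow; all stage functions are supplied by the caller."""
    # 430/440: try to specify the target from the utterance alone.
    target = identify_from_speech(utterance, image)
    if target is None:
        # 430-YES, 460: multimodal input is required, so use the pointer.
        target = identify_from_pointer(utterance, image)
        if target is None:
            # 460-NO: ask the user to repeat the utterance and / or pointer.
            return {"status": "ask_again"}
    region = crop(image, target)          # acquire the area with the target
    info = query_model(utterance, region)  # 480/490: obtain the information
    return {"status": "ok", "target": target, "info": info}


result = handle_request(
    "What is the price of the white item?",
    "image-placeholder",
    identify_from_speech=lambda u, i: "white item",
    identify_from_pointer=lambda u, i: None,
    crop=lambda i, t: (i, t),
    query_model=lambda u, r: "info-placeholder",
)
```

Passing the stages as callables keeps the sketch neutral on whether the model runs on the electronic device (200) or on an external device.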

[0084] FIGS. 5 to 13 are drawings illustrating an operation for determining a target object according to various embodiments.

[0085] According to one embodiment, FIGS. 5 to 13 illustrate a field of view seen by a user wearing an electronic device (200) or a screen displayed on the electronic device (200).

[0086] According to one embodiment, with reference to FIG. 5, a screen (1110) containing one object (11) is shown. A user may make an utterance related to the object (11), and the electronic device (200) may receive the user's utterance. As an example, the user may say, "Tell me the history of that." The electronic device (200) may determine whether multimodal input is required. If the target object is identified based on the user's utterance, the electronic device (200) may determine that multimodal input is not required. If the target object is not identified based on the user's utterance, the electronic device (200) may determine that multimodal input is required. Whether multimodal input is required may have substantially the same meaning as whether the target object is identified.

[0087] According to one embodiment, although the user utters "that," the screen (1110) contains only one object (11), so the electronic device (200) can identify the target object and determine that multimodal input is not required. The electronic device (200) can determine that the one object (11) included in the screen (1110) is the target object. The electronic device (200) can display an indicator (51) representing the target object. As an example, the screen (1110) illustrated in FIG. 5 may include the Eiffel Tower. Since the electronic device (200) can identify the Eiffel Tower as the target object, it can determine that multimodal input is not required. If the database of the electronic device (200) contains information about the Eiffel Tower, the electronic device (200) can obtain information about the Eiffel Tower based on the database.

[0088] According to one embodiment, the electronic device (200) can crop a certain area containing a target object. As an example, the electronic device (200) can crop a certain area in the shape of a specific figure or the shape of the target object. The electronic device (200) can transmit the user's utterance and the cropped area to an artificial intelligence model. As an example, if the database contains information about the target object, the electronic device (200) can generate result information based on the database without transmitting the user's utterance and the cropped area to the artificial intelligence model. As an example, the artificial intelligence model may include an LLM and / or an LMM. The artificial intelligence model can acquire information about the target object. As an example, if the artificial intelligence model is included in an external device, the electronic device (200) can transmit the user's utterance and the cropped area to the external device. The artificial intelligence model can acquire information about the target object based on the user's utterance and the cropped area. The external device can transmit the acquired information about the target object to the electronic device (200). The electronic device (200) can provide information about a target object obtained from an artificial intelligence model (or, external device) to the user.

[0089] According to one embodiment, with reference to FIG. 6, a screen (1120) including a first object (13) and a second object (15) is shown. In the screen (1120) shown in FIG. 6, the first object (13) may be located on the left side and the second object (15) may be located on the right side relative to the user's field of view. The user may make utterances related to the first object (13) and the second object (15), and the electronic device (200) may receive the user's utterances. As an example, the user may say, "What is the price of the object on the left?" The electronic device (200) may determine whether multimodal input is required. If the target object is identified based on the user's utterance, the electronic device (200) may determine that multimodal input is not required. If the target object is not identified based on the user's utterance, the electronic device (200) may determine that multimodal input is required.

[0090] According to one embodiment, the user uttered "the object on the left," and although the screen (1120) contains two objects (13, 15), the electronic device (200) can identify the first object (13) as the target object based on the user's field of view and determine that multimodal input is not required. The electronic device (200) can identify and determine the first object (13) included in the screen (1120) as the target object. The electronic device (200) can display an indicator (53) representing the target object.

[0091] According to one embodiment, the electronic device (200) can crop a certain area containing a target object. As an example, the electronic device (200) can crop a certain area in the shape of a specific figure or the shape of the target object. For example, if the database contains information about the target object, the electronic device (200) can generate information about the target object based on the database. If the database does not contain information about the target object, the electronic device (200) can transmit the user's utterance and the cropped area to an artificial intelligence model. As an example, the artificial intelligence model may include an LLM and / or an LMM. The artificial intelligence model can acquire information about the target object. For example, the electronic device (200) can acquire information about the target object using an external device. As an example, the electronic device (200) can transmit information related to the target object image to an external device and receive information about the target object from the external device. For example, if an artificial intelligence model is included in an external device, the electronic device (200) can transmit the user's utterance and cropped area to the external device. The artificial intelligence model can obtain information about a target object based on the user's utterance and cropped area. The external device can transmit the obtained information about the target object to the electronic device (200). The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model (or the external device) to the user.
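The database-first flow described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `get_object_info`, the dictionary-based database, and the `query_model` callable (standing in for an on-device or external artificial intelligence model) are all assumptions introduced for the example.

```python
from __future__ import annotations

from typing import Callable


def get_object_info(
    object_key: str,
    utterance: str,
    cropped_area: bytes,
    database: dict[str, str],
    query_model: Callable[[str, bytes], str],
) -> str:
    """Return information about the target object.

    The local database is consulted first. The model (hypothetical callable,
    on-device or on an external device) is queried only when the database
    has no entry for the target object.
    """
    cached = database.get(object_key)
    if cached is not None:
        return cached
    # No database entry: send the utterance and the cropped area to the model.
    return query_model(utterance, cropped_area)
```

In this sketch the utterance and the cropped area leave the lookup function only when the database lookup misses, which mirrors the order of operations in the paragraph above.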

[0092] As an example, the electronic device (200) can identify the first object (13) as the target object based on the utterance "the object on the left" and the screen (1120) shown in FIG. 6. Since the electronic device (200) can identify the target object, it can determine that multimodal input is not required. If the database (380) does not contain information about the first object (13), the electronic device (200) can transmit information related to the image of the first object (13) (e.g., shape, color) to an external device and receive information about the first object (13) from the external device.
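As a rough illustration of the decision in FIG. 6, the sketch below resolves a positional phrase against detected objects and reports whether multimodal input is required. The `DetectedObject` type, the `center_x` field, and the simple keyword matching are assumptions made for the example; the document itself uses an artificial intelligence model to interpret the utterance.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DetectedObject:
    label: str
    center_x: float  # horizontal center in image coordinates (0 = left edge)


def resolve_by_position(
    utterance: str, objects: Sequence[DetectedObject]
) -> Optional[DetectedObject]:
    """Pick a target from a positional phrase, or return None if ambiguous."""
    if not objects:
        return None
    text = utterance.lower()
    # Keyword matching is a stand-in for model-based utterance analysis.
    if "left" in text:
        return min(objects, key=lambda o: o.center_x)
    if "right" in text:
        return max(objects, key=lambda o: o.center_x)
    # Without a positional cue, only a single object is unambiguous.
    return objects[0] if len(objects) == 1 else None


def multimodal_input_required(
    utterance: str, objects: Sequence[DetectedObject]
) -> bool:
    """Multimodal input is needed only when the utterance alone is ambiguous."""
    return resolve_by_position(utterance, objects) is None
```

For example, with a first object at `center_x=100` and a second object at `center_x=500`, the utterance "the object on the left" resolves to the first object, while "What is the price of this?" leaves the target unresolved and calls for multimodal input.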

[0093] According to one embodiment, with reference to FIG. 7, a screen (1130) including a plurality of objects and a pointer (3) is shown. A user may make an utterance related to an object included in the screen (1130), and the electronic device (200) may receive the user's utterance. As an example, the user may say, "What is the price of this?" The electronic device (200) may determine whether multimodal input is required.

[0094] According to one embodiment, the user utters "this," and since the screen (1130) contains multiple objects, the electronic device (200) cannot identify the target object and may determine that multimodal input is required. The electronic device (200) can identify the pointer (3) included in the screen (1130). As an example, the pointer (3) may include a part of the body, a pre-set shape of a part of the body, a pre-set gesture, and / or a pre-set object. For example, the electronic device (200) can identify the target object based on the end point of the pointer (3). The electronic device (200) may determine the object included at the end point of the pointer (3) (e.g., the tip of a finger) as the target object. As illustrated in FIG. 7, a third object (17) may be located at the end point of the pointer (3). The electronic device (200) can identify and determine the third object (17) as the target object based on the user's speech and pointer (3). The electronic device (200) can display an indicator (77) indicating the target object.

[0095] According to one embodiment, the electronic device (200) can crop a certain area containing a target object and obtain information about the target object using an artificial intelligence model. The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model to a user.

[0096] According to one embodiment, with reference to FIG. 8, a screen (1140) including a plurality of objects (21, 22, 23) and a pointer (3) is shown. A user may make an utterance related to an object included in the screen (1140), and the electronic device (200) may receive the user's utterance. As an example, the user may make an utterance inquiring about information of an object (e.g., history, type, price, brand). The electronic device (200) may determine whether multimodal input is required.

[0097] According to one embodiment, if the target object is not specified solely by the user's utterance, the electronic device (200) may determine that multimodal input is required. The electronic device (200) may identify a pointer (3) included in the screen (1140).

[0098] For example, the electronic device (200) can identify a target object based on an area (5) that includes a certain radius (R) from the end point of the pointer (3). For example, the electronic device (200) can identify a target object by expanding the radius (R) from the end point of the pointer (3). As an example, if there is no object at the end point of the pointer (3), the electronic device (200) can determine a target object within a first size range (e.g., a radius of 1 cm from the end point of the pointer) from the end point of the pointer (3). If there is no target object within the first size range from the end point of the pointer (3), the electronic device (200) can identify a target object within a second size range (e.g., a radius of 2 cm from the end point of the pointer) from the end point of the pointer (3).
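The stepwise expansion of the search radius can be sketched as below. Objects are modeled as axis-aligned bounding boxes and the radii are expressed in the same units as the coordinates; both choices, along with the names `distance_to_box` and `candidates_near`, are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def distance_to_box(point: Point, box: Box) -> float:
    """Shortest distance from a point to a box (0 if the point is inside)."""
    px, py = point
    x_min, y_min, x_max, y_max = box
    dx = max(x_min - px, 0.0, px - x_max)
    dy = max(y_min - py, 0.0, py - y_max)
    return math.hypot(dx, dy)


def candidates_near(
    end_point: Point,
    boxes: Dict[str, Box],
    radii: Sequence[float] = (0.0, 1.0, 2.0),
) -> List[str]:
    """Return the objects found at the smallest radius that yields any hit.

    Radius 0 corresponds to an object located exactly at the end point;
    larger radii correspond to the first and second size ranges.
    """
    for radius in radii:
        hits = [
            name
            for name, box in boxes.items()
            if distance_to_box(end_point, box) <= radius
        ]
        if hits:
            return hits
    return []
```

If the returned list contains more than one object, further disambiguation (for example, by depth information) is still needed.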

[0099] For example, if a plurality of objects (21, 22, 23) are included within an area (5) that includes a certain radius (R) from the end point of the pointer (3), the electronic device (200) can identify a target object using additional information. As an example, the electronic device (200) can acquire depth information of the plurality of objects (21, 22, 23) and identify a target object based on the acquired depth information and the user's utterance. As an example, if the user utters "What is the price of this?", the electronic device (200) can determine that the user is pointing to an object located at a relatively close distance. The electronic device (200) can identify the fourth object (21), which is the closest object among the plurality of objects (21, 22, 23), and determine it as the target object. For example, if the user utters "What is the price of that?", the electronic device (200) can determine that the user is pointing to an object located at a relatively far distance. The electronic device (200) can identify the fifth object (23), which is the farthest object among the plurality of objects (21, 22, 23), and determine it as the target object.
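The depth-based choice between "this" and "that" can be illustrated with the short sketch below. The function name `pick_by_depth`, the depth dictionary, and the word-level matching are assumptions; a real system would rely on the artificial intelligence model to interpret the demonstrative.

```python
import re
from typing import Dict, Optional


def pick_by_depth(utterance: str, depths: Dict[str, float]) -> Optional[str]:
    """Choose among candidates using a demonstrative and per-object depth.

    "this" is taken to mean the nearest candidate and "that" the farthest.
    Returns None when there are no candidates or no demonstrative is found.
    """
    if not depths:
        return None
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    if "this" in words:
        return min(depths, key=depths.get)
    if "that" in words:
        return max(depths, key=depths.get)
    return None
```

With depths of 0.5, 1.0, and 2.0 for objects 21, 22, and 23, "What is the price of this?" selects object 21 and "What is the price of that?" selects object 23.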

[0100] According to one embodiment, the electronic device (200) can crop a certain area containing a target object and obtain information about the target object using an artificial intelligence model. The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model to a user.

[0101] According to one embodiment, with reference to FIG. 9, a screen (1150) including a plurality of objects (21, 22, 23) and a pointer (3) is shown. A user may make an utterance related to an object included in the screen (1150), and the electronic device (200) may receive the user's utterance. As an example, the user may make an utterance inquiring about information of an object. The electronic device (200) may determine whether multimodal input is required.

[0102] According to one embodiment, if the target object is not specified solely by the user's utterance, the electronic device (200) may determine that multimodal input is required. The electronic device (200) may identify a pointer (3) included in the screen (1150).

[0103] For example, the electronic device (200) can identify a target object based on a path (7) formed by the movement of the pointer (3). As an example, the path (7) may include a straight line, a curve, or a closed curve. If one object (27) is included within the path (7) formed by the movement of the pointer (3), the electronic device (200) can identify and determine the one object (27) included within the path (7) as the target object. For example, if parts of multiple objects are included within the path (7), the electronic device (200) can identify, as the target object, the object with the largest portion inside the path (7) and / or the object with more than a preset ratio inside the path (7). For example, if multiple objects are included within the path (7), the electronic device (200) can identify the target object using additional information (e.g., depth information, location information). For example, if multiple objects are included within the path (7), the electronic device (200) may request that the user provide additional information and / or re-enter the path (7).
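One way to estimate how much of each object lies inside a closed path is sketched below. It approximates the enclosed portion by sampling points in each object's bounding box and testing them against the path with a ray-casting point-in-polygon check. The bounding-box model, the sampling density, the `min_ratio` default, and all function names are assumptions for illustration; the document does not specify how the portion is computed.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: True if (x, y) lies inside the closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def enclosed_fraction(box: Box, path: Sequence[Point], samples: int = 20) -> float:
    """Approximate the fraction of the box area that lies inside the path."""
    x0, y0, x1, y1 = box
    hits = 0
    for i in range(samples):
        for j in range(samples):
            px = x0 + (i + 0.5) * (x1 - x0) / samples
            py = y0 + (j + 0.5) * (y1 - y0) / samples
            hits += point_in_polygon(px, py, path)
    return hits / (samples * samples)


def select_by_path(
    boxes: Dict[str, Box], path: Sequence[Point], min_ratio: float = 0.5
) -> Optional[str]:
    """Return the object with the largest enclosed fraction, if it meets min_ratio."""
    fractions = {name: enclosed_fraction(box, path) for name, box in boxes.items()}
    best = max(fractions, key=fractions.get, default=None)
    if best is None or fractions[best] < min_ratio:
        return None
    return best
```

This sketch combines the two criteria (largest portion and preset ratio) into one rule. When it returns `None`, the device would fall back to additional information or ask the user to re-enter the path.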

[0104] According to one embodiment, the electronic device (200) can crop a certain area containing a target object and obtain information about the target object using an artificial intelligence model. The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model to a user.

[0105] According to one embodiment, with reference to FIG. 10, a screen (1160) including a plurality of pointers (3, 9) is illustrated. For example, the plurality of pointers (3, 9) may include parts of a body (e.g., fingers). The electronic device (200) may learn the shape, color, size, and / or features of the parts of the body in advance using an artificial intelligence model. As an example, the learning of the parts of the body may be performed using an artificial intelligence model on an external device, and the electronic device (200) may receive the learned information.

[0106] According to one embodiment, a user may make an utterance related to an object included in the screen (1160), and the electronic device (200) may receive the user's utterance. As an example, the user may make an utterance inquiring about information about the object. The electronic device (200) may determine whether multimodal input is required. If the target object is not specified solely by the user's utterance, the electronic device (200) may determine that multimodal input is required. The electronic device (200) may determine a pointer included in the screen (1160).

[0107] According to one embodiment, when a plurality of pointers (3, 9) exist on a screen (1160), the electronic device (200) can determine a pointer corresponding to a user based on learned information. As an example, a first pointer (3) and a second pointer (9) may be included on the screen. The first pointer (3) may be a pointer corresponding to a learned user. According to one example, the electronic device (200) learns the user's physical information (e.g., shape, features, color, skin color of the hand) and can recognize the learned physical information of the user as a pointer corresponding to the user. The electronic device (200) can determine the first pointer (3) as a pointer corresponding to a user based on the learned information. The electronic device (200) can identify and determine a target object based on the first pointer (3). The electronic device (200) can ignore the second pointer (9). The electronic device (200) can identify and determine a target object (29) based on the end point of the first pointer (3) and / or a certain area from the end point. The electronic device (200) can display an indicator (79) representing the target object (29).
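Selecting the pointer that corresponds to the learned user can be sketched as a similarity comparison. The document does not say how learned physical information is represented; the feature vectors, the cosine-similarity measure, the `threshold` value, and the function names below are all assumptions made for the example.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_user_pointer(
    candidates: Dict[str, Sequence[float]],
    learned: Sequence[float],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the candidate pointer most similar to the learned features.

    Candidates scoring below the threshold are ignored, so the result is
    None when no pointer on the screen matches the learned user.
    """
    best_name: Optional[str] = None
    best_score = threshold
    for name, features in candidates.items():
        score = cosine_similarity(features, learned)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

In the FIG. 10 scenario, a first pointer whose features closely match the learned hand would be returned, and a second pointer that falls below the threshold would be ignored.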

[0108] According to one embodiment, the electronic device (200) can crop a certain area containing a target object and obtain information about the target object using an artificial intelligence model. The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model to a user.

[0109] According to one embodiment, with reference to FIG. 11, a screen (1170) including a pointer (3a) is illustrated. For example, the pointer (3a) may include a preset shape and / or a preset gesture of a part of the body. The electronic device (200) may pre-set a preset shape and / or a preset gesture of a part of the body. As an example, the preset shape and / or a preset gesture of a part of the body may include one finger, two fingers, a gesture of drawing a specific shape with fingers, a finger forming a specific shape, or a tapping gesture with fingers. For example, the electronic device (200) may store shooting information and / or learning information learned based on artificial intelligence that includes the preset shape and / or a preset gesture of a part of the body. As an example, the shooting information and / or learning information may be acquired by an external device and transmitted to the electronic device (200).

[0110] According to one embodiment, a user may make an utterance related to an object included in the screen (1170), and the electronic device (200) may receive the user's utterance. As an example, the user may make an utterance inquiring about information about the object. The electronic device (200) may determine whether multimodal input is required. If the target object is not identified solely by the user's utterance, the electronic device (200) may determine that multimodal input is required. If it is determined that multimodal input is required, the electronic device (200) may acquire the screen (1170) using a camera (220) and determine the pointer (3a) by analyzing a part of the user's body included in the screen (1170). For example, the electronic device (200) may identify the shape, location, and / or gesture of the part of the body.

[0111] According to one embodiment, the electronic device (200) can identify a pointer (3a) included in the screen (1170) and determine whether the pointer (3a) is a preset shape and / or a preset gesture based on shooting information and / or learning information. As an example, if the image is a still image, the electronic device (200) can determine the shape and / or location of a part of the body, and if the image is a video, the electronic device (200) can determine the shape, location, and / or gesture (or movement) of a part of the body. If the identified pointer (3a) is not a preset shape and / or a preset gesture, the electronic device (200) can ignore the pointer (3a). If the identified pointer (3a) is a preset shape and / or a preset gesture, the target object (31) can be identified and determined based on the pointer (3a). For example, if the pointer (3a) is a pointer corresponding to the user described in FIG. 10 and is a pointer corresponding to a preset shape and / or a preset gesture, the electronic device (200) can determine the target object.

[0112] For example, the electronic device (200) can determine an object located at the end point of the pointer (3a) as the target object. If an object is not determined from the end point of the pointer (3a) (e.g., if an object is not located at the end point), the electronic device (200) can determine the target object by expanding the area (81) from the end point of the pointer (3a). For example, the electronic device (200) can automatically expand the area (81) according to a preset method, or can expand the area (81) according to the user's gesture input. If a target object is not determined in a certain area (81) from the end point of the pointer (3a), the electronic device (200) can determine the target object based on additional utterances by the user (e.g., the user's response to a question).

[0113] According to one embodiment, the electronic device (200) can crop a certain area containing a target object and obtain information about the target object using an artificial intelligence model. The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model to a user.

[0114] According to one embodiment, with reference to FIG. 12, a screen (1180) including a pointer (510) is shown. For example, the pointer (510) may include a preset object. As an example, the preset object may include a stylus pen, a remote control and / or a stick.

[0115] According to one embodiment, the electronic device (200) may store information of a pre-set object. For example, the electronic device (200) may store shooting information including a pre-set object and / or learning information learned based on artificial intelligence. As an example, the shooting information and / or learning information may be acquired by an external device and transmitted to the electronic device (200).

[0116] According to one embodiment, a user may make an utterance related to an object included in the screen (1180), and the electronic device (200) may receive the user's utterance. As an example, the user may make an utterance inquiring about information about the object. The electronic device (200) may determine whether multimodal input is required. If the target object is not specified solely by the user's utterance, the electronic device (200) may determine that multimodal input is required. For example, if it is determined that multimodal input is required, the electronic device (200) may acquire the screen (1180) using a camera (220) and determine the pointer (510) by analyzing a pre-set object included in the screen (1180).

[0117] According to one embodiment, the electronic device (200) can identify a pointer (510) included in the screen (1180) and determine whether the pointer (510) is a pre-set object based on shooting information and / or learning information. If the identified pointer (510) is not a pre-set object, the electronic device (200) can ignore the pointer (510). If the identified pointer (510) is a pre-set object, the target object (33) can be identified and determined based on the pointer (510). For example, the electronic device (200) can check the shape, location, and / or gesture of the pointer (510) (or pre-set object). As an example, if the image is a still image, the electronic device (200) can determine the shape and / or location of the pointer (510), and if the image is a video, the electronic device (200) can determine the shape, location, and / or gesture (or movement) of the pointer (510). The electronic device (200) can identify and determine a target object (33) based on the shape, location, and / or gesture (or movement) of the pointer (510). As an example, the electronic device (200) can determine the end point of the pointer (510) and / or a certain range from the end point based on the shape and / or location of the pointer (510). The electronic device (200) can determine a target object (33) based on the end point of the pointer (510) and / or a certain range from the end point. For example, if a pre-set object is a type of electronic device, the electronic device (200) can receive location information from the pre-set object (e.g., pointer (510)). 
The electronic device (200) can identify and determine a target object (33) based on the location information of the electronic device (200), the received location information of the pointer (510), and the pointer (510) and / or the user's utterance. As an example, the electronic device (200) can obtain the distance and direction of the pointer (510) from the electronic device (200) based on the location information of the electronic device (200) and the received location information of the pointer (510), and can determine an object located at a point reached by extending the obtained distance and direction of the pointer as a target object.
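The distance-and-direction step can be sketched with simple vector arithmetic. The sketch assumes that the device, the pointer, and the candidate objects all have known 3D positions in a shared coordinate frame, and that an object counts as "on the line" if it lies within a `tolerance` of the extended ray; these assumptions and the function names are illustrative only.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def pointing_ray(device_pos: Vec3, pointer_pos: Vec3) -> Tuple[float, Vec3]:
    """Return the distance and unit direction from the device to the pointer."""
    dx, dy, dz = (p - d for p, d in zip(pointer_pos, device_pos))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    if distance == 0.0:
        raise ValueError("device and pointer positions coincide")
    return distance, (dx / distance, dy / distance, dz / distance)


def object_along_ray(
    device_pos: Vec3,
    pointer_pos: Vec3,
    objects: Dict[str, Vec3],
    tolerance: float = 0.1,
) -> Optional[str]:
    """Return the nearest object beyond the pointer that lies on the extended ray."""
    distance, direction = pointing_ray(device_pos, pointer_pos)
    best: Optional[Tuple[float, str]] = None
    for name, pos in objects.items():
        rel = [o - d for o, d in zip(pos, device_pos)]
        along = sum(r * u for r, u in zip(rel, direction))
        if along <= distance:
            continue  # the object must lie beyond the pointer itself
        closest = tuple(d + along * u for d, u in zip(device_pos, direction))
        off_axis = math.dist(pos, closest)
        if off_axis <= tolerance and (best is None or along < best[0]):
            best = (along, name)
    return best[1] if best else None
```

For a device at the origin and a pointer half a unit straight ahead, an object two units straight ahead is selected, while an object offset one unit to the side is not.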

[0118] For example, the electronic device (200) can identify an object located at the end point of the pointer (510) as the target object. If the object is not identified from the end point of the pointer (510) (e.g., if the object is not located at the end point), the electronic device (200) can identify the target object by expanding the area (83) from the end point of the pointer (510). For example, the electronic device (200) can automatically expand the area (83) according to a preset method, or can expand the area (83) according to the user's gesture input. If the target object is not identified in a certain area (83) from the end point of the pointer (510), the electronic device (200) can identify the target object based on additional utterances by the user (e.g., the user's response to a question).

[0119] According to one embodiment, the electronic device (200) can crop a certain area containing a target object and obtain information about the target object using an artificial intelligence model. The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model to a user.

[0120] According to one embodiment, with reference to FIG. 13, a screen (1190) including a pointer (3) is shown. For example, the pointer (3) may include a part of the body including a preset object (520). As an example, the preset object may include a wearable device (e.g., a smart watch, a smart ring, a smart band), and the pointer (3) may include the hand (or finger) of a user wearing the wearable device.

[0121] According to one embodiment, the electronic device (200) may store information of a pre-set object. For example, the electronic device (200) may store shooting information including a pre-set object and / or learning information learned based on artificial intelligence. As an example, the shooting information and / or learning information may be acquired by an external device and transmitted to the electronic device (200).

[0122] According to one embodiment, a user may make an utterance related to an object included in the screen (1190), and the electronic device (200) may receive the user's utterance. As an example, the user may make an utterance inquiring about information about the object. The electronic device (200) may determine whether multimodal input is required. If the target object is not specified solely by the user's utterance, the electronic device (200) may determine that multimodal input is required. The electronic device (200) may determine whether a pointer (3) included in the screen (1190) includes a pre-set object (520). For example, the electronic device (200) may analyze the screen (1190) and determine whether the pre-set object (520) is included. For example, the electronic device (200) may determine whether the pre-set object (520) is included (e.g., whether it is worn) based on information received from the pre-set object (520).

[0123] According to one embodiment, if the pointer (3) does not include a pre-set object (520), the electronic device (200) may ignore the pointer (3). If the pointer (3) includes a pre-set object (520), the electronic device (200) may identify the pointer (3) and determine and identify a target object (35) based on the pointer (3). For example, if the pre-set object is a type of electronic device, the electronic device (200) may receive location information from the pre-set object (520). As an example, the electronic device (200) may obtain location information of the pointer (3) by considering the shape and / or size of the pointer (3) from the location information of the pre-set object (520). As an example, the electronic device (200) may obtain corrected location information of the pre-set object (520) by considering the shape and / or size of the pointer (3) from the location information of the pre-set object (520). The electronic device (200) can identify and determine a target object (35) based on the location information of the electronic device (200), the location information of the pointer (3) (or, corrected location information of a pre-set object (520)), the pointer (3) and / or the user's speech. As an example, the electronic device (200) can obtain the distance and direction of the pointer (3) from the electronic device (200) based on the location information of the electronic device (200) and the location information of the pointer (3) (or, corrected location information of a pre-set object (520)), and determine an object located at a point extending the distance and direction of the obtained pointer as a target object. The electronic device (200) can display an indicator (85) representing the target object (35).

[0124] According to one embodiment, when the pointer (3) is the user's hand and the pre-set object (520) is a wearable device, the electronic device (200) can increase the accuracy of the selected object by recognizing the user's hand based on information about the wearable device worn by the user. For example, when the user wears a wearable device (e.g., a smart ring, a smart watch, or a smart band), the electronic device (200) can recognize that the user is wearing the wearable device and use that information to identify the user's hand. The electronic device (200) can also use additional information, such as whether the hand wearing the wearable device is the left hand or the right hand, to identify the user's hand.

[0125] According to one embodiment, the electronic device (200) can crop a certain area containing a target object and obtain information about the target object using an artificial intelligence model. The electronic device (200) can provide the information about the target object obtained from the artificial intelligence model to a user.

[0126] For example, if the electronic device (200) cannot determine or identify the target object based on the pointer and additional information (e.g., depth information, location information), the electronic device (200) may provide guidance prompting the user to re-enter the utterance and / or the pointer. As an example, the electronic device (200) may output audio guidance such as "Please repeat that" and / or "Please move the pointer to the correct location" through a speaker. As an example, the electronic device (200) may display text guidance such as "Please provide additional information" and / or "Please move the pointer to select the object" through a display.

[0127] FIG. 14 is a drawing illustrating an operation to change a target object according to various embodiments.

[0128] Referring to FIG. 14, an electronic device (200) and an external device (530) are illustrated. The electronic device (200) can acquire an image containing an object. The electronic device (200) can receive a user's utterance. If the target object is not identified solely by the user's utterance, the electronic device (200) may determine that multimodal input is required. The electronic device (200) can identify a pointer (3) included on the screen. The electronic device (200) can identify and determine the target object based on the user's utterance and the pointer (3). The electronic device (200) can display an indicator (87a) representing the target object.

[0129] According to one embodiment, the electronic device (200) can transmit the acquired image to an external device (530). The external device (530) can display the received image. The external device (530) can perform operations such as selecting a target object, changing a target object, resizing a certain area, and / or adjusting the position of a certain area according to user input.

[0130] As an example, the electronic device (200) can transmit an image in which the target object is not specified to an external device (530). The external device (530) displays the image in which the target object is not specified and can select one of the objects included in the image as the target object based on user input. The external device (530) can display an indicator indicating the selected target object.

[0131] As an example, as illustrated in FIG. 14, the electronic device (200) can transmit an image containing a selected target object to an external device (530). The image may display a first indicator (87a) representing the target object selected in the electronic device (200). The external device (530) may move the first indicator (87a) according to user input. The external device (530) may display a second indicator (87b) at the position to which the first indicator (87a) has been moved, and determine the object (37) on which the second indicator (87b) is displayed as the changed target object.

[0132] As an example, the electronic device (200) can transmit an image containing a certain area containing a target object to an external device (530). The external device (530) can display a certain area containing a target object. The external device (530) can perform the operation of positioning and / or resizing the certain area containing the target object according to user input.

[0133] For example, an external device (530) can obtain information about a target object based on result data obtained by performing operations such as selecting a target object, changing a target object, adjusting the size of a certain area, and / or adjusting the position of a certain area. The external device (530) transmits the obtained information about the target object to an electronic device (200), and the electronic device (200) can output the received information about the target object. For example, the external device (530) can transmit result data obtained by performing operations such as selecting a target object, changing a target object, adjusting the size of a certain area, and / or adjusting the position of a certain area to the electronic device (200). The electronic device (200) can obtain information about the target object based on the received result data and output information about the target object.

[0134] FIGS. 15a and 15b are drawings illustrating an operation for determining a target object from an image acquired in the past according to various embodiments.

[0135] Referring to FIG. 15a, a user wearing an electronic device (200) is illustrated. The electronic device (200) can acquire an image of the surrounding environment within a certain range relative to the front. The electronic device (200) can store the acquired image of the surrounding environment. Depending on the movement of the body (e.g., face, neck) of the user wearing the electronic device (200), the image of the surrounding environment acquired by the electronic device (200) may change.

[0136] For example, when the body of a user wearing the electronic device (200) moves from left to right, the electronic device (200) can sequentially acquire and store images of the surrounding environment from the left direction to the right direction.

[0137] As an example, as illustrated in FIG. 15b, the electronic device (200) may sequentially acquire and store a first image (1210), a second image (1220), and a third image (1230). The second image (1220) may include a cushion (89) (e.g., an object), and the electronic device (200) may be positioned in a direction in which the third image (1230) is currently being viewed. When a user speaks, the electronic device (200) may determine a target object based on the user's speech and acquire information about the target object.

[0138] For example, while the electronic device (200) is positioned in the direction in which the third image (1230) is viewed, the user may utter, "What is the price of the cushion from a moment ago?" Based on the phrase "a moment ago," the electronic device (200) may determine that the target object is an object included in a previous image, and based on the word "cushion," the electronic device (200) may identify the target object as a cushion. The electronic device (200) may determine the target object from an image acquired in the past based on the user's utterance. The electronic device (200) may analyze the first image (1210) and the second image (1220) and determine the cushion (89) included in the second image (1220) as the target object. The electronic device (200) may crop a certain area containing the cushion (89) and obtain price information of the cushion (89) using an artificial intelligence model. The electronic device (200) may provide the price information of the target object obtained from the artificial intelligence model to the user.
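The selection of a target object from a previously stored image can be sketched as follows. This is a minimal illustration, assuming that each stored image already carries a set of object labels produced by some detector and that a reference to the past is recognized by simple cue phrases. The Frame type, the cue list, and the labels other than "cushion" are hypothetical and are not taken from this document.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical cue phrases; the document gives only "a moment ago" as an example.
PAST_CUES = ("a moment ago", "just now", "earlier")


@dataclass
class Frame:
    frame_id: int
    labels: set  # labels assumed to come from an object detector


def refers_to_past(utterance: str) -> bool:
    """Return True if the utterance contains one of the assumed past-time cues."""
    text = utterance.lower()
    return any(cue in text for cue in PAST_CUES)


def find_target_in_past(frames, utterance: str, current_id: int):
    """Search stored frames, newest first, for a label mentioned in the utterance.

    Returns (frame_id, label), or None if the utterance does not refer to the
    past or no stored frame contains a mentioned object.
    """
    if not refers_to_past(utterance):
        return None
    text = utterance.lower()
    for frame in reversed(frames):
        if frame.frame_id == current_id:
            continue  # skip the image currently being viewed
        for label in sorted(frame.labels):
            if label in text:
                return frame.frame_id, label
    return None


# A bounded buffer keeps only the most recent images.
frames = deque(maxlen=30)
frames.append(Frame(1210, {"lamp"}))
frames.append(Frame(1220, {"cushion", "sofa"}))
frames.append(Frame(1230, {"table"}))

utterance = "What is the price of the cushion from a moment ago?"
print(find_target_in_past(frames, utterance, current_id=1230))  # (1220, 'cushion')
```

The bounded deque is one possible way to limit how many past images are retained; the document does not state a retention policy.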

[0139] FIG. 16 is a flowchart illustrating a method of providing information of an object in an electronic device according to various embodiments.

[0140] In the following embodiments, each operation may be performed sequentially, but is not necessarily performed sequentially. For example, the order of each operation may be changed, and at least two operations may be performed in parallel.

[0141] According to one embodiment, operations 1610 to 1670 may be understood to be performed in a processor (e.g., processor (120) of FIG. 1 or processor (230) of FIG. 2) of an electronic device (e.g., electronic device (101) of FIG. 1 or electronic device (200) of FIG. 2).

[0142] For example, the electronic device (200) can acquire an image containing at least one object (1610). The electronic device (200) can acquire an image of the surrounding environment through a camera (220).

[0143] For example, the electronic device (200) can identify the received user's utterance using an artificial intelligence model (1620). The electronic device (200) can receive the user's utterance through a microphone (210). The electronic device (200) can convert the received user's utterance into text and identify the user's utterance based on the converted text using an artificial intelligence model.

[0144] For example, the electronic device (200) can determine whether multimodal input is required based on whether the target object is specified by the user's utterance (1630). As an example, if the target object is specified based only on the user's utterance, the electronic device (200) can determine that multimodal input is unnecessary. As an example, if the target object is not specified based only on the user's utterance, the electronic device (200) can determine that multimodal input is required.
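The decision in operation 1630 can be illustrated with a small sketch. The document does not define how the device decides that an utterance "specifies" the target object; the rule below (exactly one detected object label is mentioned in the converted text) is an assumption made purely for illustration, and the function name is hypothetical.

```python
def needs_multimodal_input(utterance_text: str, detected_labels: list) -> bool:
    """Return True when the utterance alone does not single out one detected object.

    Assumed rule: the target is 'specified' only if exactly one detected object
    label appears in the converted text.
    """
    text = utterance_text.lower()
    matches = [label for label in detected_labels if label.lower() in text]
    return len(matches) != 1


labels = ["cushion", "lamp"]
print(needs_multimodal_input("What is the price of the cushion?", labels))  # False
print(needs_multimodal_input("What is this?", labels))                      # True
```

Under this rule, an utterance that names an object appearing twice in the image (for example, two cushions) also falls through to multimodal input, since the utterance by itself cannot distinguish between them.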

[0145] For example, when multimodal input is unnecessary, the electronic device (200) can determine the target object based on the user's utterance (1640). The electronic device (200) can acquire a certain area containing the target object (1650) and acquire information about the target object contained within the certain area (1660).
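Acquiring a certain area containing the target object (operation 1650) can be pictured as cropping the image around the object's bounding box. The sketch below assumes the image is a plain list of pixel rows and that a bounding box (left, top, right, bottom) with exclusive right and bottom edges is already known. The margin parameter is a hypothetical addition, not something the document specifies.

```python
def crop_region(image, box, margin=0):
    """Crop the area of `image` covered by `box`, optionally padded by `margin` pixels.

    image: list of rows, each row a list of pixel values.
    box:   (left, top, right, bottom), right and bottom exclusive.
    The padded box is clamped to the image bounds.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    left, top, right, bottom = box
    left = max(0, left - margin)
    top = max(0, top - margin)
    right = min(width, right + margin)
    bottom = min(height, bottom + margin)
    return [row[left:right] for row in image[top:bottom]]


# 4 x 5 test image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(5)] for r in range(4)]
print(crop_region(image, (1, 1, 3, 3)))  # [[11, 12], [21, 22]]
```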

[0146] For example, the electronic device (200) may include a database. If the database contains information about a target object, the electronic device (200) may obtain information about the target object based on the database. If the database does not contain information about the target object, the electronic device (200) may obtain information about the target object using an external device. For example, the electronic device (200) may transmit information related to an image of the target object to an external device and receive information about the target object from the external device.
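The lookup order described above (local database first, external device only on a miss) can be sketched as follows. The dictionary-based database and the query_external callable are hypothetical placeholders; the document does not describe the database structure or the protocol used with the external device.

```python
def get_object_info(object_key, local_db, query_external):
    """Return (info, source) for a target object.

    local_db:       dict standing in for the on-device database (assumption).
    query_external: callable standing in for a request to the external device
                    (assumption); it is called only when the database has no entry.
    """
    info = local_db.get(object_key)
    if info is not None:
        return info, "database"
    return query_external(object_key), "external"


local_db = {"cushion": {"price": "example price A"}}


def fake_external(key):
    # Placeholder for transmitting image-related information and receiving a reply.
    return {"price": f"example price for {key}"}


print(get_object_info("cushion", local_db, fake_external))
print(get_object_info("lamp", local_db, fake_external))
```

Returning the source alongside the information is a design choice of this sketch; it makes it easy to see which path was taken when testing.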

[0147] For example, the electronic device (200) can provide information about the target object (1670). The electronic device (200) can output information about the target object as sound through a speaker and can display information about the target object through a display.

[0148] According to one embodiment, if multimodal input is required, the electronic device (200) can determine a target object using multimodal input. For example, an image acquired by the electronic device (200) may include a pointer. As an example, the pointer may include a part of the body, a preset shape of the part of the body, a preset gesture, and / or a preset object. The electronic device (200) can determine a target object based on the user's utterance and the pointer. As an example, the electronic device (200) can determine a target object based on the end point of the pointer and / or a certain range from the end point. As an example, the electronic device (200) can determine a target object based on a path formed by the pointer. If multiple objects are included at the end point, within a certain range from the end point, and / or within the path, the electronic device (200) can determine a target object by using additional information. For example, the electronic device (200) can acquire depth information of the objects. The electronic device (200) can determine a target object based on the user's utterance, the end point, a certain range from the end point, the path, and / or the depth information. For example, if the user's utterance refers to a distant place, the electronic device (200) can determine, based on the depth information, the farthest object among the plurality of objects at the end point, within a certain range from the end point, and / or within the path as the target object. If the user's utterance refers to a nearby place, the electronic device (200) can determine, based on the depth information, the closest object among the plurality of objects at the end point, within a certain range from the end point, and / or within the path as the target object.
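The pointer-based selection with depth disambiguation can be sketched as follows. This is only an illustration under several assumptions: objects are represented by a center point and a single depth value, the "certain range" is a circle of a given radius around the pointer's end point, and references to distant or nearby places are detected with small hypothetical word lists. None of these representations is specified in the document.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical cue words for "distant" and "nearby" references.
FAR_WORDS = {"that", "there"}
NEAR_WORDS = {"this", "here"}


@dataclass
class DetectedObject:
    label: str
    center: tuple   # (x, y) in pixels
    depth_m: float  # distance from the camera, assumed to come from depth information


def select_target(objects, end_point, radius, utterance):
    """Pick the target among objects within `radius` of the pointer end point.

    Returns None when nothing is in range or when several candidates remain and
    the utterance gives no distance cue (the caller may then ask the user to
    repeat the utterance and/or the pointer input).
    """
    candidates = [o for o in objects if math.dist(o.center, end_point) <= radius]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    if words & FAR_WORDS:   # if both kinds of cue appear, the far cue wins here
        return max(candidates, key=lambda o: o.depth_m)
    if words & NEAR_WORDS:
        return min(candidates, key=lambda o: o.depth_m)
    return None


objects = [
    DetectedObject("cup", (100, 100), 0.5),
    DetectedObject("vase", (110, 105), 2.0),
    DetectedObject("chair", (400, 400), 3.0),
]
print(select_target(objects, (105, 100), 20, "What is that over there?").label)  # vase
print(select_target(objects, (105, 100), 20, "What is this here?").label)        # cup
```

Matching whole words rather than substrings matters in this sketch, since "here" would otherwise also match inside "there".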

[0149] For example, the electronic device (200) can determine a target object from an image acquired in the past based on the user's utterance. The image acquired by the electronic device (200) may include a dynamic image. The electronic device (200) may store an image (or frame) from a previous point in time of the dynamic image. If the user's utterance includes an object included in a past image, the electronic device (200) can determine a target object included in an image from a previous point in time.

[0150] For example, if a plurality of objects are included at the end point, within a certain range from the end point, and / or within the path, the electronic device (200) may provide a guide to re-enter the user's utterance and / or the pointer.

[0151] For example, the electronic device (200) may acquire a certain area containing a target object and acquire information about the target object based on the user's utterance (or converted text) and the acquired certain area. As an example, the electronic device (200) may transmit the user's utterance and the acquired area to an external device. The external device may acquire information about the target object based on the received user's utterance and the acquired area and transmit the acquired information about the target object to the electronic device (200).

[0152] As an example, an electronic device may include a microphone, at least one camera, an output interface, at least one processor including a processing circuit, and a memory for storing instructions. When the instructions are executed individually or collectively by the at least one processor, the electronic device may acquire an image containing at least one object through the at least one camera. The instructions may enable the electronic device to identify a user's utterance received through the microphone using an artificial intelligence model. The instructions may enable the electronic device to determine whether multimodal input is required based on whether the target object is identified by the user's utterance. The instructions may enable the electronic device to determine the target object included in the image based on the user's utterance when the multimodal input is unnecessary. The instructions may enable the electronic device to acquire a certain area containing the target object. The instructions may enable the electronic device to acquire information about the target object included within the certain area. The above instructions allow the electronic device to provide information of the target object through the output interface.

[0153] As an example, the image may include a pointer acquired through the at least one camera. The instructions allow the electronic device to determine the target object based on the user's utterance and the pointer when the multimodal input is required.

[0154] As an example, the above instructions allow the electronic device to determine the target object based on a certain range from the end point of the pointer.

[0155] For example, the instructions allow the electronic device to acquire depth information of the at least one object through the at least one camera. The instructions allow the electronic device to determine the object furthest among the plurality of objects as the target object based on the depth information when a plurality of objects are included within a certain range from the end point of the pointer and the user's utterance refers to a distant place. The instructions allow the electronic device to determine the object closest among the plurality of objects as the target object based on the depth information when a plurality of objects are included within a certain range from the end point of the pointer and the user's utterance refers to a nearby place.

[0156] As an example, the instructions may provide a guide for the electronic device to re-enter at least one of the user's utterance and the pointer when a plurality of objects are included within the specified range from the end point of the pointer.

[0157] As an example, the above instructions allow the electronic device to determine the target object based on the path formed by the movement of the pointer.

[0158] As an example, the pointer may include at least one of a part of the body, a preset shape of the part of the body, a preset gesture, and a preset object.

[0159] As an example, the above instructions allow the electronic device to determine the target object from an image acquired in the past based on the user's utterance.

[0160] As an example, the electronic device may further include a communication interface. The instructions may cause the electronic device to transmit the image containing the target object to an external device through the communication interface. The instructions may cause the electronic device to receive operation data related to at least one operation from the external device through the communication interface when at least one operation among changing the target object, resizing the certain area, and repositioning the certain area is performed at the external device. The instructions may cause the electronic device to obtain information about the target object based on the operation data.

[0161] As an example, the output interface may include at least one of a speaker and a display. The instructions may cause the electronic device to output at least one of the name, type, area, and information of the target object through at least one of the speaker and the display.

[0162] As an example, a method for providing information about an object in an electronic device may include the operation of acquiring an image containing at least one object through at least one camera. The method may include the operation of identifying a user's utterance received through a microphone using an artificial intelligence model. The method may include the operation of determining whether multimodal input is required based on whether the target object is identified by the user's utterance. If the multimodal input is unnecessary, the method may include the operation of determining the target object included in the image based on the user's utterance. The method may include the operation of acquiring a certain area containing the target object. The method may include the operation of acquiring information about the target object included within the certain area using an artificial intelligence model. The method may include the operation of providing information about the target object through an output interface.

[0163] As an example, a non-transitory computer-readable storage medium having a program recorded thereon that performs a method of providing information about an object in an electronic device may include instructions for performing an operation of acquiring an image containing at least one object through at least one camera. The storage medium may include instructions for performing an operation of identifying a user's utterance received through a microphone using an artificial intelligence model. The storage medium may include instructions for performing an operation of determining whether multimodal input is required depending on whether the target object is identified by the user's utterance. If the multimodal input is unnecessary, the storage medium may include instructions for performing an operation of determining the target object included in the image based on the user's utterance. The storage medium may include instructions for performing an operation of acquiring a certain area containing the target object. The storage medium may include instructions for performing an operation of acquiring information about the target object included within the certain area. The storage medium may include instructions for performing an operation of providing information about the target object through an output interface.

[0164] As an example, an electronic device may include a microphone, a speaker, one or more cameras, one or more processors including processing circuits, and a memory for storing instructions. When the instructions are executed individually or collectively by the one or more processors, the electronic device may identify a voice command input through the microphone using a first artificial intelligence model in a situation where at least some of the one or more cameras are shooting toward one direction relative to the electronic device. When the instructions are executed individually or collectively by the one or more processors, the electronic device may identify at least one of the shape and movement of a body part located within the screen acquired by shooting toward the one direction, or at least one of the shape and movement of a first object extended from the body part, using a second artificial intelligence model. When the above instructions are executed individually or collectively by the one or more processors, the electronic device may specify a point within the screen or a certain range within the screen based on the voice command, or based further on at least one of the shape and the movement in addition to the voice command. When the above instructions are executed individually or collectively by the one or more processors, the electronic device may identify a second object located at the one point or a second object included at least partially within the certain range using a third artificial intelligence model. When the above instructions are executed individually or collectively by the one or more processors, the electronic device may process the voice command related to the second object.

[0165] As an example, when the instructions are executed individually or collectively by the one or more processors, the electronic device may generate a voice signal representing the identified second object using the first artificial intelligence model. When the instructions are executed individually or collectively by the one or more processors, the electronic device may output the voice signal through the speaker.

[0166] For example, when the instructions are executed individually or collectively by the one or more processors, the electronic device can identify a second object included in an image taken at a past point in time based on the voice command, if the shooting is periodic still image shooting or video shooting. When the instructions are executed individually or collectively by the one or more processors, the electronic device can process the voice command related to the second object. When the instructions are executed individually or collectively by the one or more processors, the electronic device can output the result of processing the voice command through the speaker.

[0167] As an example, when the above instructions are executed individually or collectively by the one or more processors, the electronic device may specify the one point based at least partially on the end point of the body part or the end point of the first object, or specify the certain range by a radius of a specified length from the one point.

[0168] As an example, when the above instructions are executed individually or collectively by the one or more processors, the electronic device may specify the certain range based on a closed curve formed at least partially by the shape of the body part or the first object, or, if the shooting is periodic still image shooting or video shooting, the certain range may be specified based on a closed curve formed at least partially by the movement of the body part or the first object.
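One way to test whether an object lies within a range bounded by a closed curve is a point-in-polygon check. The sketch below uses the standard ray-casting (even-odd) rule and assumes that the closed curve traced by the body part or the first object has already been sampled into a list of (x, y) vertices. That sampling step, and the function name, are assumptions of this sketch rather than details from the document.

```python
def point_in_closed_path(point, path):
    """Return True if `point` lies inside the polygon defined by `path`.

    path: list of (x, y) vertices; the last vertex is implicitly joined to the first.
    Uses the even-odd ray-casting rule: a horizontal ray from the point toward +x
    crosses the boundary an odd number of times when the point is inside.
    """
    x, y = point
    inside = False
    n = len(path)
    for i in range(n):
        x1, y1 = path[i]
        x2, y2 = path[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_closed_path((5, 5), square))   # True
print(point_in_closed_path((15, 5), square))  # False
```

An object could then be treated as included in the certain range if, for example, its center point passes this test; the document leaves open how partial inclusion is evaluated.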

[0169] As an example, the electronic device may further include a display. When the instructions are executed individually or collectively by the one or more processors, the electronic device may display on the display at least one of the identified body part, the first object, the second object, the one point, the certain range, and the result of processing the voice command.

[0170] As an example, when the above instructions are executed individually or collectively by the one or more processors, the electronic device may receive at least one of the voice command, the shape of the body part, the movement of the body part, the shape of the first object, and the movement of the first object again through the microphone or the display, if there are multiple second objects included within the specified range.

[0171] As an example, the electronic device may further include a wireless communication circuit. When the instructions are executed individually or collectively by the one or more processors, the electronic device may transmit image data based at least partially on the screen acquired by the capture to an external electronic device via the wireless communication circuit. When the instructions are executed individually or collectively by the one or more processors, the electronic device may receive adjustment data related to at least one operation from the external electronic device via the wireless communication circuit, if at least one operation among position adjustment of the one point within the screen, position adjustment of the certain range within the screen, size adjustment of the certain range within the screen, and shape adjustment within the screen is performed by the external device. When the instructions are executed individually or collectively by the one or more processors, the electronic device may identify the second object based at least partially on the adjustment data.

[0172] The electronic devices according to the various examples disclosed in this document may be of various forms. Electronic devices may include, for example, portable communication devices (e.g., smartphones), computer devices, portable multimedia devices, portable medical devices, cameras, wearable devices, or consumer electronics. The electronic devices according to the embodiments of this document are not limited to the devices described above.

[0173] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of said embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of said items unless the relevant context clearly indicates otherwise. In this document, each of phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may include any one of the items listed together in the corresponding phrase, or all possible combinations thereof. Terms such as “1st” and “2nd,” or “first” and “second,” may be used simply to distinguish a component from another component and do not limit the components in any other aspect (e.g., importance or order). Where any (e.g., first) component is referred to as “coupled” or “connected” to another (e.g., second) component, with or without the terms “functionally” or “communicatively,” it means that said component may be connected to said other component directly (e.g., by wire), wirelessly, or through a third component.

[0174] The term “module” as used in the various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

[0175] Various embodiments of the present document may be implemented as software (e.g., program (140)) comprising one or more instructions stored in a storage medium (e.g., internal memory (136) or external memory (138)) readable by a machine (e.g., electronic device (101)). For example, a processor (e.g., processor (120)) of the machine (e.g., electronic device (101)) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The storage medium readable by the machine may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' simply means that the storage medium is a tangible device and does not contain a signal (e.g., electromagnetic waves), and the term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.

[0176] According to one embodiment, the method according to the various embodiments disclosed herein may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.

[0177] According to various embodiments, each component (e.g., module or program) of the components described above may include a single entity or multiple entities, and some of the multiple entities may be separated and placed in other components. According to various embodiments, one or more of the components or operations among the aforementioned components may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or similar manner as those performed by the corresponding component among the multiple components prior to integration. According to various embodiments, operations performed by the module, program, or other components may be executed sequentially, in parallel, iteratively, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.

[0178] The effects of this document are not limited to those mentioned above, and other unmentioned effects will be clearly understood by a person skilled in the art from the description above. Although this disclosure has been described and explained with reference to various embodiments, the various embodiments are for illustrative purposes only and not for limitation. A person skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and full scope of the disclosure, including the appended claims and their equivalents. Furthermore, any of the embodiments described herein may be used in conjunction with other embodiments described herein.

Claims

1. An electronic device comprising: a microphone; at least one camera; an output interface; at least one processor including a processing circuit; and a memory storing instructions, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire an image including at least one object through the at least one camera; identify a user's utterance received through the microphone using an artificial intelligence model; determine whether multimodal input is required based on whether a target object is specified by the user's utterance; when the multimodal input is unnecessary, determine the target object included in the image based on the user's utterance; acquire a certain area including the target object; acquire information about the target object included within the certain area; and provide the information about the target object through the output interface.

2. The electronic device of claim 1, wherein the image includes a pointer acquired through the at least one camera, and wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to determine the target object based on the user's utterance and the pointer when the multimodal input is required.

3. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to determine the target object based on a certain range from an end point of the pointer.

4. The electronic device of claim 3, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: acquire depth information of the at least one object through the at least one camera; when a plurality of objects are included within the certain range from the end point of the pointer and the user's utterance refers to a distant location, determine the farthest object among the plurality of objects as the target object based on the depth information; and when a plurality of objects are included within the certain range from the end point of the pointer and the user's utterance refers to a nearby location, determine the closest object among the plurality of objects as the target object based on the depth information.

5. The electronic device of claim 3, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to provide a guide to re-enter at least one of the user's utterance and the pointer when a plurality of objects are included within the certain range from the end point of the pointer.

6. The electronic device of claim 2, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to determine the target object based on a path formed by movement of the pointer.

7. The electronic device of claim 2, wherein the pointer includes at least one of a part of a body, a preset shape of the part of the body, a preset gesture, and a preset object.

8. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to determine the target object from an image acquired in the past based on the user's utterance.

9. The electronic device of claim 1, further comprising a communication interface, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: transmit the image including the target object to an external device through the communication interface; when at least one operation among changing the target object, adjusting the size of the certain area, and adjusting the position of the certain area is performed on the external device, receive operation data related to the at least one operation from the external device through the communication interface; and acquire information about the target object based on the operation data.

10. The electronic device of claim 1, wherein the output interface includes at least one of a speaker and a display, and wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to output at least one of the name, type, area, and information of the target object through at least one of the speaker and the display.

11. A method for providing information of an object in an electronic device, the method comprising: acquiring an image including at least one object through at least one camera; identifying a user's utterance received through a microphone using an artificial intelligence model; determining whether multimodal input is required based on whether a target object is specified by the user's utterance; when the multimodal input is unnecessary, determining the target object included in the image based on the user's utterance; acquiring a certain area including the target object; acquiring information about the target object included within the certain area; and providing the information about the target object through an output interface.

12. The method of claim 11, wherein the image includes a pointer acquired through the at least one camera, and wherein the method further comprises, when the multimodal input is required, determining the target object based on the user's utterance and the pointer.

13. The method of claim 12, wherein determining the target object comprises determining the target object based on a certain range from an end point of the pointer.

14. The method of claim 13, further comprising acquiring depth information of the at least one object through the at least one camera, wherein determining the target object comprises: when a plurality of objects are included within the certain range from the end point of the pointer and the user's utterance refers to a distant location, determining the farthest object among the plurality of objects as the target object based on the depth information; and when a plurality of objects are included within the certain range from the end point of the pointer and the user's utterance refers to a nearby location, determining the closest object among the plurality of objects as the target object based on the depth information.

15. A non-transitory computer-readable storage medium having recorded thereon a program for performing a method of providing information of an object in an electronic device, the program comprising: instructions for acquiring an image including at least one object through at least one camera; instructions for identifying a user's utterance received through a microphone using an artificial intelligence model; instructions for determining whether multimodal input is required based on whether a target object is specified by the user's utterance; instructions for determining the target object included in the image based on the user's utterance when the multimodal input is unnecessary; instructions for acquiring a certain area including the target object; instructions for acquiring information about the target object included within the certain area; and instructions for providing the information about the target object through an output interface.