Electronic device for analyzing and searching for photo on basis of sound and operation method therefor

The electronic device uses sound analysis with an artificial neural network to classify emotions and preferences, addressing the limitations of conventional photo search technologies by enhancing the accuracy and ease of finding desired photos based on subjective criteria.

WO2026142029A1 · PCT designated stage · Publication Date: 2026-07-02 · SAMSUNG ELECTRONICS CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
Filing Date: 2025-12-02
Publication Date: 2026-07-02

Application Information

Patent Timeline

02 Dec 2025: Application
02 Jul 2026: Publication

WO2026142029A1

IPC: G06F16/58; G06F16/532; G06F16/538; G06F16/51; G10L25/63; G10L25/03; G06N3/096

AI Tagging

Technology Topics

Feature vector; Acoustic voice analysis


Smart Images

Figure KR2025020447_02072026_PF_FP_ABST

Patent Text Reader

Abstract

Provided are an electronic device for analyzing and searching for a photo on the basis of a sound, and an operation method therefor. The electronic device may: acquire a sound signal by means of a microphone for a preset time before and after a time point at which a photo is captured by using a camera; input a feature vector extracted from the sound signal to an artificial neural network model; acquire at least one type of emotion or taste pertaining to the sound signal by analyzing the feature vector by using the artificial neural network model; and store the at least one type of emotion or taste as metadata for the photo captured by using the camera.


Description

Electronic device for analyzing and searching for photos based on sound and method of operation thereof

[0001] The present disclosure relates to an electronic device for analyzing and searching for photographs based on sound, and a method of operation thereof. Specifically, the present disclosure describes an electronic device, and a method of operation thereof, that perform semantic analysis of a photograph based on a sound signal recorded before and after the time the photograph is taken, and that search for the photograph based on the result of the semantic analysis when a user enters a search query.

[0002] With the recent widespread adoption of camera-equipped mobile devices such as smartphones, most people now take photos using these devices. Generally, thousands to tens of thousands of photos are stored on mobile devices. Consequently, search technologies are evolving to enable users to locate specific images among this vast collection. The performance of photo search technology is significantly improving due to advancements in Artificial Intelligence (AI)-based classification techniques that recognize specific objects within images and output recognition results.

[0003] In the case of photo search, subjective emotions or personal preferences at the time of taking the photo are often important, in addition to the presence or absence of specific objects within the image. For example, for photos capturing moments one wishes to preserve as memories—such as a baby looking happy or cute, admiring a magnificent landscape while traveling, or laughing joyfully with friends—the subjective emotion at the moment of shooting is a more important factor than the presence or absence of specific objects. For photos where emotion is important, it is difficult to search for them using object keyword searches based on existing AI-based object recognition modules or classification models, and there is a problem of low accuracy in photo search.

[0004] In conventional technology, a user who wants to find photos based on the subjective emotions or personal preferences felt at the time of shooting must instead search with object keywords, such as the clothes worn at the moment of shooting, the toys being played with, or the shooting location. This approach relies on the user's memory and makes the search for the desired photo complicated. In addition, conventional object recognition models can recognize only a limited number of emotions, and it is particularly difficult to recognize emotions accurately from a single photo.

[0005] One aspect of the present disclosure provides a method by which an electronic device analyzes and searches for a photograph based on sound. The method of operation of the electronic device may include acquiring a sound signal through a microphone for a preset time before and after the time at which a photograph is taken using a camera. The method may include inputting a feature vector extracted from the acquired sound signal into an artificial neural network model trained to classify at least one of emotion and taste from sound signal training data, and acquiring at least one type of emotion or taste pertaining to the sound signal by analyzing the feature vector using the artificial neural network model. The method may include storing the at least one acquired type of emotion or taste as metadata for the photograph taken using the camera.

[0006] One aspect of the present disclosure provides an electronic device for analyzing and searching photographs based on sound. The electronic device may include a camera; a microphone; at least one processor comprising processing circuitry; and a memory for storing one or more instructions. By having the one or more instructions executed individually or collectively by at least one processor, the electronic device may acquire a sound signal through the microphone during a preset time before and after the time of taking a photograph using the camera. By having the one or more instructions executed individually or collectively by at least one processor, the electronic device may input a feature vector extracted from the sound signal into an artificial neural network model trained to classify at least one of emotion and taste from sound signal learning data, and by analyzing the feature vector using the artificial neural network model, acquire at least one type of emotion and taste regarding the sound signal. By executing one or more of the above instructions individually or collectively by at least one processor, the electronic device can store at least one type of acquired emotion and taste as metadata for a photograph taken using a camera.

[0007] One aspect of the present disclosure provides a computer program product comprising a computer-readable storage medium. The storage medium may include instructions readable by the electronic device for the electronic device to perform the operations of acquiring a sound signal through a microphone for a preset time before and after a time when a photograph is taken using a camera, inputting a feature vector extracted from the acquired sound signal into an artificial neural network model trained to classify at least one of emotion and taste from sound signal learning data, and acquiring at least one type of emotion and taste regarding the sound signal by analyzing the feature vector using the artificial neural network model, and storing at least one type of the acquired emotion and taste as metadata for a photograph taken using the camera.

[0008] One aspect of the present disclosure provides a method by which an electronic device analyzes and searches a video based on sound. The method may include obtaining a plurality of segmented sound signal data by dividing a sound signal included in the video into units of a preset time interval. The method may include inputting a plurality of feature vectors extracted from the plurality of segmented sound signal data into an artificial neural network model trained to classify at least one of emotion and taste from sound signal training data, and obtaining at least one of an emotion and a taste for each of the plurality of segmented sound signal data by analyzing the feature vectors with the model. The method may include storing the at least one obtained emotion or taste as metadata for the plurality of image frames included in the time interval of the video that corresponds to each of the plurality of segmented sound signal data.
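The segmentation and per-frame tagging described in paragraph [0008] can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the names `Segment`, `split_into_segments`, and `tag_frames`, the use of a constant frame rate, and the labels passed in are all assumptions made for the example, and the classification step itself is replaced by a ready-made list of labels.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    start_s: float     # segment start time in seconds
    end_s: float       # segment end time in seconds (exclusive)
    first_frame: int   # index of the first video frame in the segment
    last_frame: int    # index of the last video frame in the segment (inclusive)


def split_into_segments(duration_s: float, interval_s: float, fps: float) -> List[Segment]:
    """Divide [0, duration_s) into fixed-length intervals and map each to frame indices.

    Assumes a constant frame rate (hypothetical simplification).
    """
    if duration_s <= 0 or interval_s <= 0 or fps <= 0:
        raise ValueError("duration_s, interval_s and fps must be positive")
    total_frames = int(duration_s * fps)
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + interval_s, duration_s)
        first = int(start * fps)
        last = min(int(end * fps), total_frames) - 1
        if last >= first:
            segments.append(Segment(start, end, first, last))
        start = end
    return segments


def tag_frames(segments: List[Segment], segment_labels: List[str]) -> Dict[int, str]:
    """Attach each segment's label to every frame index inside that segment."""
    frame_tags = {}
    for seg, label in zip(segments, segment_labels):
        for frame in range(seg.first_frame, seg.last_frame + 1):
            frame_tags[frame] = label
    return frame_tags


if __name__ == "__main__":
    segs = split_into_segments(duration_s=10.0, interval_s=4.0, fps=30.0)
    # The labels below stand in for the output of the classification model.
    tags = tag_frames(segs, ["pleasure", "surprise", "pleasure"])
    print(len(segs), tags[0], tags[120], tags[299])
```

In this sketch the final segment is allowed to be shorter than the preset interval; how a real device treats a trailing partial segment is not specified in the text.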

[0009] The present disclosure can be easily understood from the combination of the following detailed description and the accompanying drawings, where reference numerals denote structural elements.

[0010] FIG. 1 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to analyze at least one of emotion and taste based on a sound signal and to store the analysis result as metadata regarding a photograph.

[0011] FIG. 2 is a flowchart illustrating a method in which an electronic device according to one embodiment of the present disclosure analyzes at least one of emotion and taste based on an acoustic signal and stores the analysis result as metadata regarding a photograph.

[0012] FIG. 3 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to display a search result of a photograph based on a search query regarding at least one of emotion and taste.

[0013] FIG. 4 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to display a search result of a photograph based on a search query regarding at least one of emotion and taste.

[0014] FIG. 5 is a block diagram illustrating the components of an electronic device according to one embodiment of the present disclosure.

[0015] FIG. 6 is a flowchart illustrating a method in which an electronic device according to one embodiment of the present disclosure analyzes an acoustic signal to obtain at least one type of emotion and taste, and stores at least one type of emotion and taste as metadata for a photograph.

[0016] FIG. 7 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to analyze an acoustic signal to obtain at least one type of emotion and taste, and to store at least one type of emotion and taste as metadata for a photograph.

[0017] FIG. 8a is a drawing illustrating an emotion / taste classification model according to one embodiment of the present disclosure.

[0018] FIG. 8b is a drawing illustrating a plurality of artificial neural network models according to one embodiment of the present disclosure.

[0019] FIG. 9 is a flowchart illustrating a method in which an electronic device according to one embodiment of the present disclosure calculates a preference for at least one type of emotion and taste, and stores the calculated preference as metadata for a photograph.

[0020] FIG. 10 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to calculate a preference for at least one type of emotion and taste.

[0021] FIG. 11 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to display a search result of a photograph when a search query including a search term regarding a search target object is input.

[0022] FIG. 12 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to display a search result of a photograph when a search query is entered.

[0023] FIG. 13 is a flowchart illustrating a method in which an electronic device according to one embodiment of the present disclosure analyzes at least one of emotion and taste based on an acoustic signal of a video and stores the analysis result as metadata regarding the video.

[0024] FIG. 14 is a diagram illustrating the operation of an electronic device according to one embodiment of the present disclosure to analyze at least one of emotion and taste based on an audio signal of a video and to store the analysis result as metadata regarding the video.

[0025] FIG. 15 is a flowchart illustrating a method in which an electronic device according to one embodiment of the present disclosure calculates a preference based on an audio signal of a video and stores the calculated preference as metadata regarding the video.

[0026] FIG. 16 is a graph illustrating the operation of an electronic device according to one embodiment of the present disclosure to calculate a preference based on an audio signal of a video and to store the calculated preference as metadata for a segment within the video.

[0027] FIG. 17 is a flowchart illustrating a method for an electronic device according to one embodiment of the present disclosure to display a search result of a video when a search query including a search term regarding a search target object is input.

[0028] The terms used in the embodiments of this specification have been selected to be as widely used as possible, taking into account the functions of the present disclosure; however, these terms may vary depending on the intent of those skilled in the art, case law, the emergence of new technologies, etc. Additionally, in specific cases, terms have been arbitrarily selected by the applicant, and in such cases, their meanings will be described in detail in the description section of the relevant embodiments. Therefore, terms used in this specification should be defined not merely by their names, but based on their meanings and the overall content of the present disclosure.

[0029] Singular expressions may include plural expressions unless the context clearly indicates otherwise. Terms used herein, including technical or scientific terms, may have the same meaning as generally understood by those skilled in the art to which this specification pertains.

[0030] Throughout this disclosure, when a part is described as "comprising" a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may include additional components. Furthermore, terms such as "...part," "...module," etc., as used in this specification refer to a unit that processes at least one function or operation, and this may be implemented in hardware or software, or as a combination of hardware and software.

[0031] As used in this disclosure, the expression “configured to” may be replaced, depending on the context, with, for example, “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” or “capable of.” The term “configured to” does not necessarily mean only “specifically designed to” in hardware. Instead, in some situations, the expression “system configured to” may mean that the system is “capable of” performing an operation together with other devices or components. For example, the phrase “processor configured to perform A, B, and C” may mean a dedicated processor for performing those operations (e.g., an embedded processor), or a general-purpose processor (e.g., a CPU or an application processor) capable of performing those operations by executing one or more software programs stored in memory.

[0032] In addition, when a component is described in the present disclosure as being "connected" or "coupled" to another component, it should be understood that the component may be directly connected or directly coupled to the other component but, unless specifically stated otherwise, may also be connected or coupled through another component in between.

[0033] All functions or operations described in this disclosure may be processed individually by a single processor and / or collectively by a plurality of processors. A single processor or a combination of a plurality of processors may include circuitry that performs processing, such as an Application Processor (AP), Communication Processor (CP), Graphics Processing Unit (GPU), Neural Processing Unit (NPU), Microprocessor Unit (MPU), System on Chip (SoC), Integrated Circuit (IC), etc.

[0034] It should be understood that each block of the flowcharts illustrated in the present disclosure, and combinations of those blocks, may be performed by one or more computer programs comprising computer-executable instructions. The one or more computer programs may all be stored in a single memory or may be divided and stored across multiple different memories.

[0035] In the present disclosure, "sound signal" refers to signal data obtained by receiving sound, such as voice, music, or other audio, through a microphone and converting the received sound into an electrical signal. In one embodiment of the present disclosure, the sound signal may be obtained by recording ambient sound, such as speech, laughter, interjections, sighs, ambient noise, or music, through a microphone during a preset time interval before and after a photograph is taken using a camera, and converting the recorded sound into an alternating-current voltage signal using a transducer or the like.

[0036] In the present disclosure, 'video' refers to image data including an object moving over time. The video may include a plurality of image frames arranged chronologically and displayed. In one embodiment of the present disclosure, the video may further include sound data such as voice or music.

[0037] In the present disclosure, 'metadata' refers to structured data about data, that is, attribute information describing other data (e.g., photos or videos). Metadata is used as an index for finding specific data within a large amount of data and may include information such as the location, content, creator, rights conditions, usage conditions, and usage history of the photos or videos. In one embodiment of the present disclosure, metadata may include classification or identification information for at least one item of content included in the photos or videos, such as objects, actions, behaviors, situations, or events. However, it is not limited thereto, and metadata according to one embodiment of the present disclosure may include information indicating at least one of subjective emotions and personal preferences before or after the time the photos or videos were taken.

[0038] Metadata may include, for example, tags regarding photos or videos.

[0039] In the present disclosure, a "search query" refers to a request entered by a user to find a desired item in storage or a database in which photo or video data is stored. In one embodiment of the present disclosure, the search query may include keywords representing the content to be searched for. However, it is not limited thereto, and the search query according to one embodiment of the present disclosure may include search terms relating to at least one of emotions and tastes.
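As a rough illustration of how tags stored as metadata could serve as a search index for such a query, the following sketch filters photos by tag. It is a minimal example under stated assumptions: the `Photo` class, the `search_photos` function, and the rule that every query term must match a tag are hypothetical choices for this example, not details taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Photo:
    path: str
    tags: Set[str] = field(default_factory=set)  # e.g. object, emotion, or taste tags


def search_photos(photos: List[Photo], query: str) -> List[Photo]:
    """Return photos whose tags contain every term of the query (case-insensitive).

    Requiring all terms to match is an assumption made for this sketch.
    """
    terms = {t.lower() for t in query.split()}
    if not terms:
        return []
    return [p for p in photos if terms <= {t.lower() for t in p.tags}]


if __name__ == "__main__":
    library = [
        Photo("a.jpg", {"pleasure", "baby"}),
        Photo("b.jpg", {"sadness"}),
        Photo("c.jpg", {"Pleasure"}),
    ]
    print([p.path for p in search_photos(library, "pleasure")])
    print([p.path for p in search_photos(library, "pleasure baby")])
```

A real search would likely also handle synonyms and ranking; exact tag matching is used here only to keep the example short.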

[0040] In the present disclosure, functions related to 'Artificial Intelligence' are operated through a processor and memory. The processor may be composed of one or more processors. In this case, the one or more processors may be general-purpose processors such as CPUs, APs, and DSPs (Digital Signal Processors), graphics-dedicated processors such as GPUs and VPUs (Vision Processing Units), or AI-dedicated processors such as NPUs. The one or more processors control the processing of input data according to predefined operation rules or AI models stored in memory. Alternatively, if the one or more processors are AI-dedicated processors, the AI-dedicated processors may be designed with a hardware structure specialized for processing a specific AI model.

[0041] The predefined operation rules or artificial intelligence models are characterized by being created through learning. Here, being created through learning means that a basic artificial intelligence model is trained by a learning algorithm using a plurality of training data, thereby producing predefined operation rules or an artificial intelligence model configured to exhibit the desired characteristics (or achieve the desired objectives). Such learning may be performed on the device on which the artificial intelligence according to the present disclosure runs, or through a separate server and / or system. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but are not limited to these.

[0042] In the present disclosure, an 'artificial intelligence model' may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values and performs neural network operations through operations between the results of operations of a previous layer and the plurality of weights. The plurality of weights possessed by the plurality of neural network layers may be optimized by the learning results of the artificial intelligence model. For example, the plurality of weights may be updated so that the loss value or cost value obtained from the artificial intelligence model during the learning process is reduced or minimized. The artificial neural network model may include a Deep Neural Network (DNN), such as a Convolutional Neural Network, a Recurrent Neural Network, a Restricted Boltzmann Machine, a Deep Belief Network, a Bidirectional Recurrent Deep Neural Network, or Deep Q-Networks, but is not limited to the examples described above.
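The layer operation and weight update described in paragraph [0042] can be illustrated with a single fully connected layer. This is a toy sketch, not the model of the disclosure: the squared-error loss, the plain gradient-descent update, the learning rate, and the function names `dense_forward`, `squared_error`, and `gradient_step` are assumptions chosen only to show a loss value being reduced.

```python
from typing import List, Tuple


def dense_forward(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One fully connected layer: each output is a weighted sum of the previous layer's outputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def squared_error(outputs: List[float], targets: List[float]) -> float:
    """Loss value: sum of squared differences between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


def gradient_step(inputs, weights, biases, targets, lr: float = 0.1) -> Tuple[list, list]:
    """Update weights and biases once in the direction that lowers the squared error."""
    outputs = dense_forward(inputs, weights, biases)
    new_weights, new_biases = [], []
    for row, b, o, t in zip(weights, biases, outputs, targets):
        grad_out = 2.0 * (o - t)  # derivative of the loss with respect to this output
        new_weights.append([w - lr * grad_out * x for w, x in zip(row, inputs)])
        new_biases.append(b - lr * grad_out)
    return new_weights, new_biases


if __name__ == "__main__":
    x, w, b, y = [1.0, 2.0], [[0.0, 0.0]], [0.0], [1.0]
    before = squared_error(dense_forward(x, w, b), y)
    w, b = gradient_step(x, w, b, y)
    after = squared_error(dense_forward(x, w, b), y)
    print(before, after)  # the loss after the update is smaller than before
```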

[0043] Embodiments of the present disclosure are described below with reference to the attached drawings so that those skilled in the art can easily implement them. However, the present disclosure may be embodied in various different forms and is not limited to the embodiments described herein.

[0044] Embodiments of the present disclosure will be described in detail below with reference to the drawings.

[0045] FIG. 1 is a diagram illustrating the operation of an electronic device (100) according to one embodiment of the present disclosure analyzing at least one of an emotion and a taste based on a sound signal (30) and storing the analysis result as metadata (70) regarding a photograph (i).

[0046] Referring to FIG. 1, the electronic device (100) may be a mobile device such as a smartphone, tablet PC, laptop computer, digital camera, e-book terminal, digital broadcasting terminal, PDA (Personal Digital Assistant), PMP (Portable Multimedia Player), navigation device, or MP3 player. In one embodiment of the present disclosure, the electronic device (100) may be a home appliance such as a TV, air conditioner, robot vacuum cleaner, or clothing care machine. However, it is not limited thereto, and in one embodiment of the present disclosure, the electronic device (100) may be implemented as a wearable device such as a smart watch, glasses-type augmented reality device (e.g., AR glasses), head-mounted device (HMD), or body-attached device (e.g., skin pad).

[0047] Referring to FIG. 1, the electronic device (100) may include a camera (110), a microphone (120), and a display (150). FIG. 1 illustrates only the functions and / or components essential for describing the operation of the electronic device (100), and the components included in the electronic device (100) are not limited to those shown in FIG. 1. The components included in the electronic device (100) will be described in detail with reference to FIG. 5.

[0048] The electronic device (100) can photograph an object (10) using the camera (110) and can obtain an acoustic signal (30) by using the microphone (120) to receive speech (20) from the object (10) or surrounding sounds during a preset time before and after the time of photography. The electronic device (100) can obtain information (50) regarding at least one of subjective emotions and personal preferences before and after the time of photography by inputting the obtained acoustic signal (30) into an artificial neural network model (40) and analyzing the acoustic signal (30) using the artificial neural network model (40). In one embodiment of the present disclosure, as a result of performing inference using the artificial neural network model (40), the electronic device (100) can obtain, as output for the input acoustic signal (30), at least one type of emotion or taste and a confidence value for each such type. The electronic device (100) may store, as metadata regarding the photo (i) acquired by the camera (110), the emotion or taste information whose confidence value is greater than or equal to a preset threshold. In one embodiment of the present disclosure, the electronic device (100) may store at least one type of emotion or taste, for example, 'pleasure (70)', as a tag on the photo (i).

[0049] FIG. 2 is a flowchart illustrating a method in which an electronic device (100) according to one embodiment of the present disclosure analyzes at least one of emotion and taste based on an acoustic signal and stores the analysis result as metadata regarding a photograph. Hereinafter, with reference to FIG. 1 and FIG. 2 together, the function and / or operation of the electronic device (100) analyzing a photograph (i) based on an acoustic signal (30) and storing the analysis result as metadata regarding the photograph (i) will be described in detail.

[0050] In step S210, the electronic device (100) acquires a sound signal through a microphone during a preset time before and after the time of taking a picture using a camera. In the present disclosure, the 'time of taking' is the time point at which the electronic device (100) receives an input in which a user presses a shooting button UI (user interface) displayed by a camera application or a hardware shooting button, and the 'preset time before and after the time of taking' may include a preset time before that time point and a preset time after it. The preset time may be, for example, 3 seconds before and after the time of taking, but is not limited thereto. In one embodiment of the present disclosure, the electronic device (100) may activate the microphone to acquire a sound signal when a camera application is executed and a preview image of a subject is displayed.

[0051] Referring also to the embodiment illustrated in FIG. 1, when a camera application is executed, the electronic device (100) can obtain a preview image (P) by capturing an object (10) through the camera (110) and can display the preview image (P) in real time on the display (150). While the preview image (P) is being displayed, the electronic device (100) can activate the microphone (120). Upon receiving a user input pressing a hardware shooting button or a shooting button UI, the electronic device (100) can photograph the object (10) using the camera (110) and obtain a photo (i). The electronic device (100) can receive sound through the microphone (120) for a preset time before and after the time at which the photo (i) is taken in response to the user input, and can obtain an acoustic signal (30) by converting the received sound into an electrical signal. The microphone (120) can record, for example, speech, laughter, interjections, sighs, ambient noise, or music, and can convert the recorded sound into an alternating-voltage electrical signal using a transducer or the like to produce the acoustic signal (30). In the embodiment illustrated in FIG. 1, the acoustic signal (30) may include a voice signal for "Wow~ so pretty" acquired from the laughter and speech (20) of the object (10).
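One way to keep audio from before the shutter press is to buffer microphone samples continuously while the preview is shown. The sketch below illustrates that idea with a ring buffer; it is an assumption about a possible implementation, not something the disclosure specifies. The class name `CaptureWindowRecorder`, its methods, and the sample-by-sample feeding are hypothetical.

```python
from collections import deque
from typing import Iterable, List


class CaptureWindowRecorder:
    """Keeps the most recent `pre_s` seconds of audio and, once the shutter fires,
    keeps collecting until `post_s` more seconds have arrived."""

    def __init__(self, sample_rate: int, pre_s: float = 3.0, post_s: float = 3.0):
        self.pre_len = int(sample_rate * pre_s)
        self.post_len = int(sample_rate * post_s)
        # Ring buffer: older samples fall out automatically once it is full.
        self.pre_buffer = deque(maxlen=self.pre_len)
        self.post_buffer: List[float] = []
        self.shutter_pressed = False

    def feed(self, samples: Iterable[float]) -> None:
        """Called with each chunk of microphone samples while the preview is shown."""
        for s in samples:
            if not self.shutter_pressed:
                self.pre_buffer.append(s)
            elif len(self.post_buffer) < self.post_len:
                self.post_buffer.append(s)

    def press_shutter(self) -> None:
        """Marks the time of taking; later samples go to the post-shutter buffer."""
        self.shutter_pressed = True

    def is_complete(self) -> bool:
        return self.shutter_pressed and len(self.post_buffer) >= self.post_len

    def window(self) -> List[float]:
        """Samples from before and after the shutter press, in time order."""
        return list(self.pre_buffer) + self.post_buffer


if __name__ == "__main__":
    rec = CaptureWindowRecorder(sample_rate=10, pre_s=1.0, post_s=0.5)
    rec.feed(range(25))        # preview running, microphone active
    rec.press_shutter()        # user presses the shooting button
    rec.feed(range(100, 110))  # audio arriving after the shot
    print(rec.is_complete(), rec.window())
```

The ring buffer bounds memory use no matter how long the preview stays open, which is why it was chosen for this sketch.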

[0052] Referring again to FIG. 2, in step S220, the electronic device (100) inputs a feature vector extracted from the acoustic signal into an artificial neural network model and obtains at least one type of emotion or taste pertaining to the acoustic signal by analyzing the feature vector using the artificial neural network model. In one embodiment of the present disclosure, the electronic device (100) may preprocess the acquired acoustic signal, extract a feature vector from the preprocessed acoustic signal data, and input the extracted feature vector into the artificial neural network model. The preprocessing may include, for example, normalizing the acoustic signal data, resampling it, and removing background noise. In one embodiment of the present disclosure, the electronic device (100) may extract features representing at least one of volume, tempo, and speech from the preprocessed acoustic signal data and convert the extracted features into a feature vector by vector embedding. By inputting the feature vector into the artificial neural network model and performing inference with the model, the electronic device (100) obtains at least one type of emotion or taste into which the acoustic signal is classified, together with at least one confidence value representing the probability that the acoustic signal is classified into that type.
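The preprocessing steps named in paragraph [0052] (normalization, resampling, background-noise removal) and a simple feature extraction can be sketched as below. The specific techniques are stand-ins chosen for brevity and are assumptions, not the disclosed method: peak normalization, linear-interpolation resampling, a fixed-threshold noise gate, and a two-element feature vector of RMS volume and zero-crossing rate. All function names and default values are hypothetical, and no learned vector embedding is performed.

```python
import math
from typing import List


def normalize(samples: List[float]) -> List[float]:
    """Scale samples so the largest absolute value is 1.0 (peak normalization)."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else list(samples)


def resample_linear(samples: List[float], src_rate: float, dst_rate: float) -> List[float]:
    """Resample by linear interpolation between neighbouring samples."""
    if not samples:
        return []
    n_out = max(1, int(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def noise_gate(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Zero out samples whose magnitude is below the threshold (crude noise removal)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def feature_vector(samples: List[float]) -> List[float]:
    """Return [rms_volume, zero_crossing_rate] for the signal."""
    if not samples:
        return [0.0, 0.0]
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(samples) - 1) if len(samples) > 1 else 0.0
    return [rms, zcr]


def preprocess(samples: List[float], src_rate: float, dst_rate: float = 16000) -> List[float]:
    """Normalize, resample, then gate out low-level noise."""
    return noise_gate(resample_linear(normalize(samples), src_rate, dst_rate))


if __name__ == "__main__":
    raw = [0.0, 0.5, -0.5, 0.25, -0.25, 0.0]
    print(feature_vector(preprocess(raw, src_rate=8000, dst_rate=16000)))
```

A production pipeline would more likely use spectral features and a trained embedding; the two scalar features here only show the shape of the data handed to the model.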

[0053] Referring also to the embodiment illustrated in FIG. 1, the electronic device (100) can preprocess the acoustic signal (30) and obtain signal data regarding laughter and speech (e.g., "Wow~ so pretty") from the preprocessed acoustic signal. In one embodiment of the present disclosure, the electronic device (100) can perform Automatic Speech Recognition (ASR) to convert the voice signal of the speech (20) in the preprocessed acoustic signal (30) into text. The electronic device (100) can apply vector embedding separately to the acoustic signal data of the laughter and to the text converted from the speech, converting each into a feature vector. Since vector embedding is a known technique in the art, a detailed description is omitted.

[0054] The electronic device (100) can input the feature vector into the artificial neural network model (40) and analyze the feature vector using the model. In one embodiment of the present disclosure, the artificial neural network model (40) may be a deep neural network model trained by supervised learning, in which a plurality of training data consisting of acoustic signals are applied as inputs and a label value indicating the emotion or taste for each item of training data is applied as the ground truth. The deep neural network model may be, for example, a Convolutional Neural Network or a Transformer. However, it is not limited thereto, and the deep neural network model may also be implemented as, for example, a recurrent neural network, a restricted Boltzmann machine, a deep belief network, a bidirectional recurrent deep neural network, or a deep Q-network.

[0055] The electronic device (100) can obtain information (50) regarding at least one of emotion and taste pertaining to the acoustic signal (30) by inputting the feature vector into the artificial neural network model (40) and analyzing the feature vector through inference with the model. The information (50) may include at least one type of emotion or taste and at least one confidence value representing the probability that the acoustic signal (30) is classified into that type. Referring to the embodiment illustrated in FIG. 1, as a result of inference by the artificial neural network model (40), types of subjective emotion such as pleasure, surprise, and sadness are obtained from the acoustic signal (30), together with a confidence value for each type. For example, the confidence value for classifying the acoustic signal (30) as pleasure may be 0.8, the confidence value for surprise may be 0.2, and the confidence value for sadness may be 0.1. However, these types of emotions and confidence values are merely examples given for convenience of explanation, and the present disclosure is not limited to them.

[0056] Referring again to FIG. 2, in step S230, the electronic device (100) stores the at least one type of emotion or taste as metadata for the photograph acquired through the camera. In the present disclosure, 'metadata' may include classification information or identification information regarding at least one of the contents of a photograph or video, such as objects, actions, behaviors, situations, or events. In one embodiment of the present disclosure, the metadata may include a tag regarding the photograph or video. The electronic device (100) may acquire a photograph by capturing an object using the camera and store, as a tag on the photograph, the at least one type of emotion or taste obtained by analyzing the acoustic signal with the artificial neural network model. In one embodiment of the present disclosure, the electronic device (100) may store as a tag only those types of emotion or taste whose confidence value is greater than or equal to a preset threshold.

[0057] Referring also to FIG. 1, among the types of emotion obtained as a result of inference by the artificial neural network model (40), an emotion type whose confidence value is greater than or equal to a preset threshold can be stored as a tag on the photo (i). The preset threshold may be, for example, 0.5, but is not limited thereto. In the embodiment illustrated in FIG. 1, among the emotion types joy, surprise, and sadness, the electronic device (100) selects joy as the emotion type whose confidence value is greater than or equal to 0.5 and can store the selected 'joy (70)' as a tag on the photo (i) taken by the camera (110).
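The threshold-based tag selection described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function name `select_tags` and the dictionary layout are assumptions, while the confidence values and the 0.5 threshold are the example figures given in the text.

```python
from typing import Dict, List


def select_tags(confidences: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the types whose confidence value is greater than or equal to the threshold.

    `select_tags` is a hypothetical helper name; the text only states that
    types meeting a preset threshold may be stored as tags.
    """
    return [label for label, value in confidences.items() if value >= threshold]


# Example confidence values taken from the FIG. 1 description.
fig1_confidences = {"joy": 0.8, "surprise": 0.2, "sadness": 0.1}

tags = select_tags(fig1_confidences)  # only "joy" meets the 0.5 threshold
```

With a lower threshold, more than one type could be stored as a tag on the same photo.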

[0058] FIG. 3 is a diagram illustrating the operation of an electronic device (100) according to one embodiment of the present disclosure displaying a search result of a photograph based on a search query (310) regarding at least one of emotion and taste.

[0059] Referring to FIG. 3, the electronic device (100) can display a plurality of photos stored in memory (140, see FIG. 5) (Operation ①).

[0060] The electronic device (100) can receive a user input entering a search query that includes a search term regarding at least one of emotion and taste (Operation ②).

[0061] The electronic device (100) can display at least one photo that matches the search term of the input search query among a plurality of previously stored photos (Operation ③).

[0062] FIG. 4 is a flowchart illustrating a method in which an electronic device (100) according to one embodiment of the present disclosure displays a search result of a photo based on a search query regarding at least one of emotions and tastes. Hereinafter, with reference to FIG. 3 and FIG. 4 together, a function and / or operation of the electronic device (100) displaying a search result of a photo based on a search query regarding at least one of emotions and tastes input by a user will be described in detail.

[0063] In step S410, the electronic device (100) receives a user's search query containing search terms regarding at least one of emotions and tastes. The user may enter a search query containing search terms regarding subjective emotions, such as joy, sadness, or surprise. The user may enter a search query containing search terms regarding personal tastes, such as cuteness, coolness, beauty, or deliciousness. The electronic device (100) may display a plurality of previously stored photos and receive a search query from the user through a search UI (user interface).

[0064] Referring also to operation ① illustrated in FIG. 3, the electronic device (100) displays a plurality of photos (i1, i2, i3, ..., in) previously stored in the image storage of the memory (140, see FIG. 5) through the display (150). The electronic device (100) can display a search UI (300) for searching photos on the display (150). The user can select the search UI (300) through user input, such as touch, and enter a search query containing a search term regarding at least one of emotions and tastes. Referring to operation ② illustrated in FIG. 3, the electronic device (100) can receive user input entering a search query (310) for 'joy' among subjective emotions through the search UI (300). However, the search query is not limited to a search term composed of words regarding emotions or tastes, and the electronic device (100) may also receive a search query composed of natural language including a search term regarding emotions or tastes. For example, the electronic device (100) may receive natural language search queries through the search UI (300) that include search terms related to emotions, such as "find photos taken during happy moments" or "find photos of sad moments," or search terms related to tastes, such as "find my cool look" or "find photos of delicious food."

[0065] Referring again to FIG. 4, in step S420, the electronic device (100) searches the plurality of previously stored photos for at least one photo whose stored metadata matches the search term, by comparing the metadata stored for each of the plurality of photos with the metadata corresponding to the search term included in the search query. Referring also to operation ③ of the embodiment illustrated in FIG. 3, the electronic device (100) can search the metadata of each of the plurality of photos (i1, i2, i3, ..., in) to identify metadata (e.g., tags) matching the search term representing the emotion 'joy', and can retrieve at least one photo (i2, i7, i8, i12) in which the identified metadata is stored. The electronic device (100) can display the at least one retrieved photo (i2, i7, i8, i12) as a search result on the display (150).
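As a rough illustration of step S420, the comparison between a search term and the tags stored for each photo might look like the sketch below. The `Photo` class, the `search_by_tag` function, and the case-insensitive exact match are all assumptions made for this example; the text does not specify the matching rule or the storage layout.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Photo:
    """Hypothetical record: a photo identifier plus its stored emotion/taste tags."""
    photo_id: str
    tags: Set[str] = field(default_factory=set)


def search_by_tag(photos: List[Photo], search_term: str) -> List[Photo]:
    """Return photos having a tag equal to the search term, ignoring case."""
    term = search_term.strip().lower()
    return [p for p in photos if term in {t.lower() for t in p.tags}]


# Illustrative library; the identifiers echo the i1, i2, ... naming in the text,
# but the tag assignments are made up for the example.
library = [
    Photo("i1", {"sadness"}),
    Photo("i2", {"joy"}),
    Photo("i3", set()),
    Photo("i7", {"joy", "surprise"}),
]

results = search_by_tag(library, "Joy")
result_ids = [p.photo_id for p in results]  # ["i2", "i7"]
```

A natural-language query would first need its search term extracted (for example, mapping "happy moments" to 'joy'), which this sketch does not attempt.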

[0066] Although not illustrated in FIG. 3, the electronic device (100) according to one embodiment of the present disclosure can display the at least one retrieved photo (i2, i7, i8, i12) as search results at the top of the display (150) and display the remaining photos, which did not match the search, at the bottom of the display (150) in chronological order.
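The display ordering in this embodiment, with matches first and the remaining photos in chronological order, can be sketched as below. `StoredPhoto`, its `taken_at` timestamp field, and `order_for_display` are hypothetical names; the text does not state how matched photos are ordered among themselves, so this sketch simply keeps their stored order.

```python
from typing import List, NamedTuple


class StoredPhoto(NamedTuple):
    """Hypothetical record used only for this illustration."""
    photo_id: str
    taken_at: int        # assumed capture timestamp, e.g. seconds since some epoch
    tags: frozenset      # emotion/taste tags stored as metadata


def order_for_display(photos: List[StoredPhoto], search_term: str) -> List[StoredPhoto]:
    """Place photos matching the search term first, then the rest by capture time."""
    matched = [p for p in photos if search_term in p.tags]
    # Non-matching photos are sorted oldest-first (chronological order).
    rest = sorted(
        (p for p in photos if search_term not in p.tags),
        key=lambda p: p.taken_at,
    )
    return matched + rest


gallery = [
    StoredPhoto("i1", 100, frozenset({"sadness"})),
    StoredPhoto("i2", 200, frozenset({"joy"})),
    StoredPhoto("i3", 50, frozenset()),
    StoredPhoto("i7", 300, frozenset({"joy"})),
]

display_order = [p.photo_id for p in order_for_display(gallery, "joy")]
# ["i2", "i7", "i3", "i1"]
```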

[0067] In conventional technology, when searching for photos based on subjective emotions or personal preferences, one must use object keywords such as the clothes worn at the moment the photo was taken, the toys played with, or the location where the photo was taken. However, this relies on memory, and there are technical limitations in the complex search process for finding the desired photo. Although attempts are continuously being made to use object recognition models to recognize subjective emotions felt by users, such as joy, happiness, and cuteness, current object recognition models are only capable of recognizing a limited number of emotions, and there is a problem in that it is difficult to accurately recognize emotions, particularly with just a single photo.

[0068] The present disclosure aims to provide an electronic device (100) and a method of operation thereof, which obtains an acoustic signal by recording sound at the time of taking a photograph, analyzes at least one of subjective emotions and preferences by analyzing the acoustic signal, and stores the analysis result as metadata regarding the photograph, thereby enabling the user to easily and accurately search for the photograph they want when searching for photographs.

[0069] The electronic device (100) according to the embodiment illustrated in FIGS. 1 to 4 may acquire an acoustic signal (30) through the microphone (120) during a preset time before and after the time of taking a photograph using the camera (110), and may store at least one of the emotions and tastes obtained by analyzing the acoustic signal (30) with the artificial neural network model (40) as metadata regarding the photograph (i). When a search query (310) including a search term regarding emotions or tastes is input, the electronic device (100) according to one embodiment of the present disclosure searches for at least one photo (i2, i7, i8, i12) matching the search query by using the metadata stored for each of the previously stored photographs (i1, i2, i3, ..., in), thereby providing the technical effect of improving the convenience and accuracy of searches based on emotions or tastes. In addition, the electronic device (100) according to one embodiment of the present disclosure provides a technical effect that can improve the user's search experience and increase the frequency and duration of use of the photo search function in a photo application (e.g., a 'gallery application') by enabling faster finding of a desired photo among at least one photo provided as a search result.

[0070] FIG. 5 is a block diagram illustrating the components of an electronic device (100) according to one embodiment of the present disclosure.

[0071] Referring to FIG. 5, the electronic device (100) may include a camera (110), a microphone (120), a processor (130), a memory (140), and a display (150). The camera (110), the microphone (120), the processor (130), the memory (140), and the display (150) may each be electrically and / or physically connected to one another. FIG. 5 illustrates only essential components for explaining the function and / or operation of the electronic device (100), and the components included in the electronic device (100) are not limited to those illustrated in FIG. 5. In one embodiment of the present disclosure, the electronic device (100) may further include a battery that supplies driving power to the camera (110), the microphone (120), the processor (130), and the display (150). In one embodiment of the present disclosure, the electronic device (100) may further include a hardware device for receiving user input, such as a physical button, keyboard, keypad, mouse, trackball, jog switch, etc.

[0072] A camera (110) is configured to capture an object and acquire an image. The camera (110) may include a lens module, an image sensor, and an image processing module. The camera (110) may acquire a still image or video of an object by means of an image sensor (e.g., CMOS or CCD). A still image may be, for example, a photograph. A video may include multiple image frames that are continuously acquired by capturing an object through the camera (110). The image processing module may store a still image consisting of a single image frame acquired through the image sensor or video data consisting of multiple image frames in an image storage (146) within a memory (140).

[0073] In one embodiment of the present disclosure, when the electronic device (100) is a mobile device such as a smartphone or tablet PC, the camera (110) may be implemented in a small form factor so as to be mounted on the mobile device and may be implemented as a lightweight RGB camera that consumes low power. However, it is not limited thereto.

[0074] The camera (110) may include two or more cameras. The camera (110) may be implemented as a plurality of stereo cameras. However, it is not limited thereto, and in one embodiment of the present disclosure, the electronic device (100) may include three or more cameras.

[0075] The microphone (120) is a device configured to acquire an acoustic signal by receiving ambient sounds or voices spoken by a person. The microphone (120) can record, for example, speech, laughter, interjections, sighs, ambient noise, or music, and can acquire an acoustic signal by converting the received sound into an electrical signal of alternating-current voltage using a transducer or the like. In the present disclosure, 'acoustic signal' (or 'sound signal') refers to signal data acquired by converting sound, such as voice, music, or ambient sound, received through the microphone (120) into an electrical signal. In one embodiment of the present disclosure, the microphone (120) is activated for a preset time before and after the time at which a photo or video is taken by the camera (110) to receive ambient sounds or speech, and can convert the received ambient sounds or speech into an acoustic signal.

[0076] The processor (130) can execute one or more instructions of a program stored in memory (140). The processor (130) may be composed of hardware components that perform arithmetic, logic, and input / output operations and image processing. Although the processor (130) is depicted as a single element in FIG. 5, it is not limited thereto. In one embodiment of the present disclosure, the processor (130) may be composed of one or more elements.

[0077] The processor (130) may include various processing circuits and / or multiple processors. For example, the term 'processor' as used in the disclosure, including in the claims, may include at least one processor and various processing circuits. In at least one processor, one or more processors may be configured to perform the various functions described herein in a distributed manner, individually and / or collectively. As used herein, 'processor', 'at least one processor', and 'one or more processors' may be configured to perform various functions. However, these terms cover, without limitation, situations where one processor performs some of the functions and other processor(s) perform other parts of the functions, and situations where a single processor can perform all functions. Additionally, at least one processor may include a combination of processors performing various functions of the disclosed functions in a distributed manner. At least one processor may execute program instructions to achieve or perform various functions.

[0078] The processor (130) may be implemented as a general-purpose processor such as a CPU (Central Processing Unit), AP (Application Processor), or DSP (Digital Signal Processor), a graphics-dedicated processor such as a GPU (Graphics Processing Unit) or VPU (Vision Processing Unit), or an artificial intelligence-dedicated processor such as an NPU (Neural Processing Unit). The processor (130) may be controlled to process input data according to predefined operation rules or an artificial intelligence model. Alternatively, if the processor (130) is an artificial intelligence-dedicated processor, the artificial intelligence-dedicated processor may be designed with a hardware structure specialized for processing a specific artificial intelligence model.

[0079] The memory (140) may be composed of at least one type of storage medium, such as a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), or an optical disk.

[0080] The memory (140) may store instructions related to functions and / or operations in which the electronic device (100) analyzes at least one of emotions and tastes based on an acoustic signal acquired through the microphone (120) and stores the analysis results as metadata regarding the photograph. In one embodiment of the present disclosure, the memory (140) may store at least one of instructions, an algorithm, a data structure, program code, and an application program that can be read by the processor (130). The instructions, algorithm, data structure, and program code stored in the memory (140) may be implemented in a programming or scripting language such as, for example, C, C++, Java, assembler, etc.

[0081] The memory (140) may store instructions, algorithms, data structures, or program codes related to a preprocessing module (141), an emotion / taste classification module (142), a preference calculation module (143), a metadata storage module (144), and a photo search module (145). A 'module' included in the memory (140) refers to a unit that processes a function or operation performed by the processor (130), and this may be implemented as software such as instructions, algorithms, data structures, or program code. In one embodiment of the present disclosure, the memory (140) may include an image storage (146), which is a storage space for storing photos or videos.

[0082] The embodiments described below can be implemented by the processor (130) executing instructions or program code stored in the memory (140). Hereinafter, the functions and / or operations performed by the processor (130) by executing the instructions or program code of each of the plurality of modules stored in the memory (140) will be described in detail.

[0083] The processor (130) can acquire an acoustic signal through the microphone (120) during a preset time before and after the time of taking a picture using the camera (110). In the present disclosure, the time of taking may be the time at which an input touching the shooting button UI (user interface), displayed on the display (150) while the user runs the camera application, is received, and the 'preset time before and after the time of taking' may include a preset time before that time and a preset time after it. However, the present disclosure is not limited thereto, and the electronic device (100) may further include a shooting button configured as hardware. In this case, the time of taking may be the time at which an input pressing the shooting button is received. The 'preset time' may be, for example, 3 seconds before and after the time of taking, but is not limited thereto.
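One way to keep audio from both before and after the shutter input is a rolling buffer that is frozen when the input arrives. The sketch below is an assumption-laden illustration, not the disclosed implementation: the class `CaptureWindowRecorder`, its methods, and the choice to apply the same window length on each side of the shutter are all hypothetical, with the 3-second default taken from the example in the text.

```python
from collections import deque
from typing import Deque, Iterable, List


class CaptureWindowRecorder:
    """Hypothetical recorder keeping `window_seconds` of audio on each side of the shutter."""

    def __init__(self, sample_rate: int, window_seconds: float = 3.0) -> None:
        self.window_samples = int(sample_rate * window_seconds)
        # Rolling buffer: once full, the oldest pre-shutter samples are dropped.
        self._pre: Deque[float] = deque(maxlen=self.window_samples)
        self._snapshot: List[float] = []
        self._post: List[float] = []
        self._capturing_post = False

    def feed(self, samples: Iterable[float]) -> None:
        """Append microphone samples, before or after the shutter input."""
        for s in samples:
            if not self._capturing_post:
                self._pre.append(s)
            elif len(self._post) < self.window_samples:
                self._post.append(s)

    def on_shutter(self) -> None:
        """Freeze the pre-shutter audio and start collecting post-shutter audio."""
        self._snapshot = list(self._pre)
        self._post = []
        self._capturing_post = True

    def is_complete(self) -> bool:
        return self._capturing_post and len(self._post) >= self.window_samples

    def acoustic_signal(self) -> List[float]:
        """Pre-shutter samples followed by post-shutter samples."""
        return self._snapshot + self._post


# Tiny example: 4 samples per second and a 1-second window on each side.
recorder = CaptureWindowRecorder(sample_rate=4, window_seconds=1.0)
recorder.feed(range(10))              # preview running; only samples 6..9 are retained
recorder.on_shutter()                 # shutter input received
recorder.feed([10, 11, 12, 13, 14])   # sample 14 falls outside the window
signal = recorder.acoustic_signal()   # [6, 7, 8, 9, 10, 11, 12, 13]
```

The rolling buffer means the microphone can stay active during the preview without the stored audio growing beyond the preset window.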

[0084] In one embodiment of the present disclosure, the processor (130) may execute a camera application and display a preview image of a subject on a display (150). While the preview image is being displayed, the processor (130) may activate the operation of the microphone (120). Upon receiving user input such as pressing a shooting button UI or a hardware shooting button, the processor (130) may control the camera (110) to photograph the subject and acquire a photograph. The processor (130) may receive sounds, such as ambient sounds or voices uttered by the subject, through the microphone (120) for a preset time before and after the time of taking a photograph using the camera (110), and may convert the received sounds into electrical signals to acquire acoustic signals.

[0085] The preprocessing module (141) is composed of instructions or program code for executing the function and / or operation of preprocessing an acoustic signal. The processor (130) can preprocess the acoustic signal obtained through the microphone (120) by executing the instructions or program code of the preprocessing module (141). The processor (130) can perform preprocessing such as normalization, resampling, or background noise removal on the data of the acoustic signal. In one embodiment of the present disclosure, the processor (130) can obtain, from the preprocessed acoustic signal, a speech signal of an utterance by the subject being photographed. For example, the processor (130) can extract the speech signal from the preprocessed acoustic signal and perform Automatic Speech Recognition (ASR) to convert the speech signal into text.

[0086] A processor (130) can extract acoustic features representing at least one of volume, tone, tempo, and utterance from a preprocessed acoustic signal, and convert the extracted acoustic features into a feature vector by vector embedding. In one embodiment of the present disclosure, the processor (130) can obtain a sound feature vector by vector embedding acoustic features regarding volume, tone, or tempo, and obtain an utterance feature vector by vector embedding text converted from the utterance.

[0087] The emotion / taste classification module (142) is composed of instructions or program code for executing a function and / or operation of analyzing a feature vector obtained from an acoustic signal using an artificial neural network model and classifying the acoustic signal into at least one of the emotion and taste at the time of taking the photo according to the analysis result. 'Emotion' refers to a subjective state of mind or mood regarding a phenomenon or event, and may include, for example, joy, sadness, surprise, or anger. 'Taste' refers to a direction or tendency of desire that arises on an individual basis, and may include, for example, cuteness, coolness, or deliciousness. The processor (130) can obtain information regarding at least one of the emotion and taste by analyzing the feature vector obtained from the acoustic signal by executing the instructions or program code of the emotion / taste classification module (142). In one embodiment of the present disclosure, the emotion / taste classification module (142) may include an artificial neural network model. The artificial neural network model may be a deep neural network model trained by supervised learning, in which multiple training data items consisting of acoustic signals are applied as inputs and label values regarding emotions or tastes for each training data item are applied as ground truths. The deep neural network model may be, for example, a convolutional neural network or a Transformer. However, the present disclosure is not limited thereto, and the deep neural network model may also be implemented as, for example, a recurrent neural network, a restricted Boltzmann machine, a deep belief network, a bidirectional recurrent deep neural network, or a deep Q-network.

[0088] The processor (130) can obtain information regarding at least one of emotion and taste for an acoustic signal by inputting the feature vector into the artificial neural network model and performing inference with the model. The information may include at least one type of emotion or taste and, for each type, a confidence value representing the probability that the acoustic signal is classified into that type.
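The text does not say how the model's raw outputs become confidence values. One common possibility, shown here purely as an assumption, is an independent sigmoid per type, which yields values in the range 0 to 1 that need not sum to 1. The function `to_confidences` and the notion of per-type raw scores ("logits") are hypothetical and not taken from the disclosure.

```python
import math
from typing import Dict, Sequence


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_confidences(labels: Sequence[str], logits: Sequence[float]) -> Dict[str, float]:
    """Map raw per-type model scores to confidence values between 0 and 1.

    Assumption: each emotion or taste type is scored independently, so several
    types can have high (or low) confidence at the same time.
    """
    if len(labels) != len(logits):
        raise ValueError("labels and logits must have the same length")
    return {label: _sigmoid(z) for label, z in zip(labels, logits)}


# Made-up raw scores for illustration only.
confidences = to_confidences(["joy", "surprise", "sadness"], [1.5, 0.0, -2.0])
```

A softmax over all types would be an equally plausible reading when the types are treated as mutually exclusive.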

[0089] A specific embodiment in which a processor (130) preprocesses an acoustic signal and obtains classification information regarding at least one of emotion and taste from the acoustic signal using an artificial neural network model will be described in detail with reference to FIGS. 6 and 7.

[0090] The preference calculation module (143) is composed of instructions or program code for executing a function and / or operation of calculating a preference for a photo, based on at least one type of emotion or taste obtained by analyzing the acoustic signal and the confidence value for each type, and outputting the calculated preference. The processor (130) can calculate the preference for a photo based on the at least one type of emotion or taste and the confidence values by executing the instructions or program code of the preference calculation module (143). In one embodiment of the present disclosure, the processor (130) can calculate the preference for a photo by assigning an importance to each type of emotion or taste output by the emotion / taste classification module (142) and multiplying the importance of each type by its confidence value. The importance is a weight for a type of emotion or taste and may be preset for each type. In one embodiment of the present disclosure, the importance may be preset based on user input or on history information regarding the user's shooting and searching. For example, an importance of 2 may be preset for joy, 1 for surprise, and 0.5 for sadness. However, these importance values are merely examples, and the present disclosure is not limited to them.
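The importance-times-confidence calculation can be illustrated with the example figures from the text (importance values 2, 1, and 0.5, combined with the confidence values 0.8, 0.2, and 0.1 from the FIG. 1 example). The function name `preference_scores`, the default importance of 1.0 for types without a preset value, and the pairing of those two sets of example numbers are assumptions made for this sketch.

```python
from typing import Dict


def preference_scores(
    confidences: Dict[str, float],
    importance: Dict[str, float],
    default_importance: float = 1.0,  # assumption: fallback weight for unlisted types
) -> Dict[str, float]:
    """Multiply the importance of each type by its confidence value."""
    return {
        label: importance.get(label, default_importance) * value
        for label, value in confidences.items()
    }


# Example importance values from the text and confidence values from FIG. 1.
importance = {"joy": 2.0, "surprise": 1.0, "sadness": 0.5}
example_confidences = {"joy": 0.8, "surprise": 0.2, "sadness": 0.1}

scores = preference_scores(example_confidences, importance)
# joy: 2 x 0.8 = 1.6, surprise: 1 x 0.2 = 0.2, sadness: 0.5 x 0.1 = 0.05

# The text describes one embodiment that stores the maximum preference value as a tag.
photo_preference = max(scores.values())
```

With these numbers the photo's preference is 1.6, driven by the joy type.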

[0091] A specific embodiment in which the processor (130) calculates a preference for a photograph based on at least one type of emotion or taste and the corresponding confidence value will be described in detail with reference to FIG. 9 and FIG. 10.

[0092] The metadata storage module (144) is composed of instructions or program code for executing a function and / or operation of storing information regarding at least one of emotion and taste as metadata of a photograph. In the present disclosure, 'metadata' may include classification or identification information regarding at least one of the contents of a photograph or video, such as objects, actions, behaviors, situations, or events. In one embodiment of the present disclosure, the metadata may include a tag regarding the photograph or video. The processor (130) may store information regarding emotion and / or taste as a tag for the photograph or video by executing the instructions or program code of the metadata storage module (144). In one embodiment of the present disclosure, the processor (130) may store at least one type of emotion or taste output by the emotion / taste classification module (142) as metadata for a photograph taken by the camera (110). For example, among the types of emotion and taste, the processor (130) can store as a tag on the photo each type whose confidence value is greater than or equal to a preset threshold.

[0093] The processor (130) can store the preference calculated by the preference calculation module (143) as metadata for the photo. In one embodiment of the present disclosure, the processor (130) can store the preference as a tag in the photo. In one embodiment of the present disclosure, the processor (130) can store the maximum value among the preference values regarding at least one type of emotion and taste as a tag in the photo.

[0094] The photo search module (145) is composed of instructions or program code for executing functions and / or operations of searching for photos that match a search query entered by a user and outputting search results. In the present disclosure, 'search query' refers to a signal by which a user requests a search for a desired item in a database in which photos are stored, such as the image storage (146). In one embodiment of the present disclosure, the search query may include keywords representing the content to be searched. However, it is not limited thereto, and the search query according to one embodiment of the present disclosure may include search terms related to at least one of emotions and tastes. The search query may also include keywords for searching for videos.

[0095] When a search query is entered by a user through a user input interface or a touchscreen integrated with the display (150), the processor (130) can, by executing the instructions or program code of the photo search module (145), output as a search result at least one photo that matches the search term included in the search query among the plurality of photos already stored in the image storage (146). For example, if the search query entered by the user includes a search term regarding at least one of emotion and taste, the processor (130) can search for at least one photo that matches the search term among the plurality of photos by comparing the metadata stored for each of the plurality of photos already stored in the image storage (146) with the search term included in the search query. The processor (130) can display the at least one retrieved photo on the display (150).

[0096] When a search query received from a user does not include a search term regarding at least one of emotions and tastes, but includes a search term regarding a target object, e.g., bag, shoe, gift, person, puppy, Christmas, etc., the processor (130) can search for at least one photo among a plurality of previously stored photos that includes metadata matching the target object included in the search query. The processor (130) can sort the at least one searched photo in descending order based on the preference for each of the at least one photo. The processor (130) can display the at least one photo sorted based on preference through the display (150) as a photo search result. A specific embodiment in which the processor (130) searches for a photo matching the search query when a search query including a search term regarding a target object is input, and sorts and displays the searched photo based on preference, will be described in detail with reference to FIGS. 11 and 12.
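For a target-object query, the matching photos are ranked by their stored preference. A minimal sketch follows, assuming each photo carries a set of object tags and a single numeric preference; the record type `IndexedPhoto` and the function `search_objects_by_preference` are hypothetical names, and the sample data is invented.

```python
from typing import List, NamedTuple


class IndexedPhoto(NamedTuple):
    """Hypothetical record: object tags plus the preference stored as metadata."""
    photo_id: str
    object_tags: frozenset
    preference: float


def search_objects_by_preference(
    photos: List[IndexedPhoto], target_object: str
) -> List[IndexedPhoto]:
    """Return photos tagged with the target object, highest preference first."""
    matches = [p for p in photos if target_object in p.object_tags]
    # sorted() is stable, so photos with equal preference keep their stored order.
    return sorted(matches, key=lambda p: p.preference, reverse=True)


indexed = [
    IndexedPhoto("a", frozenset({"puppy"}), 0.2),
    IndexedPhoto("b", frozenset({"gift", "person"}), 1.6),
    IndexedPhoto("c", frozenset({"puppy", "person"}), 1.1),
    IndexedPhoto("d", frozenset({"puppy"}), 0.7),
]

ranked_ids = [p.photo_id for p in search_objects_by_preference(indexed, "puppy")]
# ["c", "d", "a"]
```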

[0097] In the description of the photo search module (145), a plurality of photos are depicted and described as being stored in the local storage space of the electronic device (100), for example, image storage (146), but are not limited thereto. In one embodiment of the present disclosure, a plurality of photos may be stored in web storage or a cloud server. In this case, the electronic device (100) further includes a communication interface configured to perform data transmission and reception with an external web storage or cloud server, transmits a search query to the web storage or cloud server through the communication interface, and can receive at least one photo matching the search query from the web storage or cloud server.

[0098] The image storage (146) is a storage device within the memory (140) that stores multiple photos or videos captured by the camera (110). However, it is not limited thereto, and the image storage (146) may also store images or videos saved from websites, applications, and the like. The image storage (146) may be composed of non-volatile memory. Non-volatile memory refers to a storage medium that stores and maintains information even when power is not supplied, and can use the stored information again when power is supplied. Non-volatile memory may include, for example, at least one of flash memory, hard disk, SSD (Solid State Drive), multimedia card micro type, card type memory (e.g., SD or XD memory, etc.), ROM (Read Only Memory), magnetic memory, magnetic disk, and optical disk.

[0099] Although the image storage (146) in FIG. 5 is depicted as a component included within the memory (140), the present disclosure is not limited to what is depicted in the drawings. In one embodiment of the present disclosure, the image storage (146) may be configured as a database within the electronic device (100) which is a component separate from the memory (140).

[0100] The display (150) is configured to display search results of photos or videos under the control of the processor (130). In one embodiment of the present disclosure, the display (150) may display a search UI (user interface) that receives a search query for photo search. The display (150) may be implemented as at least one of, for example, a liquid crystal display, a thin film transistor-liquid crystal display, an organic light-emitting diode, a flexible display, a 3D display, and an electrophoretic display.

[0101] FIG. 6 is a flowchart illustrating a method in which an electronic device (100) according to one embodiment of the present disclosure analyzes an acoustic signal to obtain at least one type of emotion and taste, and stores at least one type of emotion and taste as metadata for a photograph.

[0102] Steps S620 and S630 illustrated in FIG. 6 embody the operation of step S220 of FIG. 2. Step S610 of FIG. 6 can be performed after step S210 illustrated in FIG. 2 is performed, and step S620 can be performed after step S610. Step S640 illustrated in FIG. 6 embodies the operation of step S230 of FIG. 2.

[0103] FIG. 7 is a diagram illustrating the operation of an electronic device (100) according to one embodiment of the present disclosure analyzing an acoustic signal (700) to obtain at least one type of emotion and taste, and storing at least one type of emotion and taste (740) as metadata for a photograph.

[0104] Hereinafter, the function and / or operation of the electronic device (100) will be described in detail with reference to FIG. 6 and FIG. 7.

[0105] In step S610 of FIG. 6, the electronic device (100) preprocesses the acoustic signal. In one embodiment of the present disclosure, the electronic device (100) may perform preprocessing such as normalizing the data of the acoustic signal, resampling, and removing background noise. In the present disclosure, 'normalization' may refer to a scaling process that applies a specific gain to the acoustic signal to adjust the dynamic range of the acoustic signal data and adjusts the peak of the amplitude or volume. In the present disclosure, 'resampling' may refer to a process that changes the format and sound quality of the acoustic signal data. In one embodiment of the present disclosure, resampling may include upsampling and downsampling.
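As a rough sketch of two of the preprocessing operations in step S610, the code below applies a single gain so that the peak amplitude reaches a target level, and reduces the sample count by averaging consecutive samples. Both functions are illustrative assumptions: the text does not specify the normalization target, and averaging blocks of samples is only one simple form of downsampling, not necessarily the resampling method of the disclosure.

```python
from typing import List, Sequence


def peak_normalize(samples: Sequence[float], target_peak: float = 1.0) -> List[float]:
    """Apply one gain to all samples so the largest absolute amplitude equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def downsample(samples: Sequence[float], factor: int) -> List[float]:
    """Reduce the sample count by averaging each block of `factor` samples.

    A trailing block shorter than `factor` is averaged over its actual length.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    blocks = (samples[i:i + factor] for i in range(0, len(samples), factor))
    return [sum(block) / len(block) for block in blocks]


normalized = peak_normalize([0.25, -0.5, 0.1])          # gain of 2 applied
reduced = downsample([1.0, 3.0, 5.0, 7.0, 9.0], 2)      # [2.0, 6.0, 9.0]
```

Background noise removal is omitted here because the text does not indicate which technique is used.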

[0106] In step S620, the electronic device (100) extracts features representing at least one of volume, tone, tempo, and utterance from an acoustic signal and converts the extracted features into a feature vector. In one embodiment of the present disclosure, the electronic device (100) may convert feature data regarding volume, tone, and tempo extracted from the acoustic signal into a sound feature vector by vector embedding. In one embodiment of the present disclosure, the electronic device (100) may convert utterance data extracted from the acoustic signal into an utterance feature vector by vector embedding.

[0107] Referring to steps S610 and S620 of FIG. 6 together with FIG. 7, the electronic device (100) can acquire an acoustic signal (700) by recording ambient sounds or speech of an object, etc., for a preset time before and after the time of taking a photograph through a microphone (120). The processor (130, see FIG. 5) of the electronic device (100) can preprocess the acoustic signal (700) output by the microphone (120) by executing instructions or program code of a preprocessing module (141). In one embodiment of the present disclosure, the preprocessing module (141) may include a normalization module (141a), a resampling module (141b), a noise removal module (141c), and vector embedding modules (141d, 141e). The processor (130) can execute instructions or program code of the normalization module (141a) to adjust the dynamic range of the acoustic signal (700) and adjust the peak or volume of the amplitude. The processor (130) can execute instructions or program code of the resampling module (141b) to perform preprocessing that changes the format and sound quality of the acoustic signal (700). The processor (130) can execute instructions or program code of the noise removal module (141c) to remove unnecessary background noise from the acoustic signal (700). Preprocessed acoustic signal data (710) can be obtained through the preprocessing process.

[0108] The vector embedding modules (141d, 141e) may be composed of instructions or program code for executing an embedding function that converts feature data extracted from the preprocessed acoustic signal data (710) into numerical values (vector values) in an n-dimensional vector space. The first vector embedding module (141d) may be configured to perform embedding that converts sound features (711) of the preprocessed acoustic signal data (710), for example, features regarding volume, timbre, and tempo, into an n-dimensional feature vector. The processor (130) may extract the sound features (711) regarding volume, timbre, and tempo from the preprocessed acoustic signal data (710) and input the extracted sound features (711) into the first vector embedding module (141d) to convert them into a sound feature vector (721). The second vector embedding module (141e) may be configured to perform embedding that converts the data of the utterance (712) into an n-dimensional feature vector. The processor (130) may extract the data of the utterance (712) from the preprocessed acoustic signal data (710) and input it into the second vector embedding module (141e) to convert it into an utterance feature vector (722). Although not shown in FIG. 7, the preprocessing module (141) may further include an automatic speech recognition (ASR) module. The ASR module may be composed of instructions or program code for executing a function and / or operation of converting utterance data into text by performing ASR when the utterance data is input. In one embodiment of the present disclosure, the processor (130) may input the utterance data into the ASR module and convert the utterance data into text by performing ASR. In this case, the processor (130) can obtain the utterance feature vector (722) by inputting the text into the second vector embedding module (141e) and embedding the text into a vector.

[0109] Referring again to FIG. 6, in step S630, the electronic device (100) inputs a feature vector to an artificial neural network model and performs inference using the artificial neural network model to obtain at least one type of emotion or taste for which the acoustic signal is classified and a confidence value for at least one type. Referring together to FIG. 7, the emotion / taste classification module (142) may include an artificial neural network model (142a). In one embodiment of the present disclosure, the 'artificial neural network model (142a)' may be a deep neural network model trained through a supervised learning method in which a plurality of training data consisting of acoustic signals are applied as inputs, and a label value regarding the emotion or taste for each of the plurality of training data is applied as the ground truth. Since the artificial neural network model (142a) is identical to the artificial neural network model described in FIG. 1, FIG. 2, and FIG. 5, a redundant description is omitted.

[0110] The processor (130) inputs the sound feature vector (721) and the utterance feature vector (722) into the artificial neural network model (142a) and performs inference using the artificial neural network model (142a) to analyze both vectors, thereby obtaining confidence values (730), each of which is a probability that the acoustic signal (700) is classified into a given type of emotion or taste. In the embodiment illustrated in FIG. 7, as a result of inference by the artificial neural network model (142a), types of subjective emotion such as joy, surprise, and sadness are obtained from the feature vectors (721, 722) extracted from the acoustic signal (700), together with a confidence value for each of joy, surprise, and sadness. The 'confidence value' is a value representing the probability that the acoustic signal (700) is classified into a specific type of emotion or taste as a result of inference by the artificial neural network model (142a), and can be determined as a value within the range of 0 to 1. For example, a confidence value of 0.9 for 'joy' means that the probability that the acoustic signal (700) is classified as the emotion of joy is 0.9, that is, 90%. Likewise, a confidence value of 0.2 for 'surprise' means that the probability that the acoustic signal (700) is classified as the emotion of surprise is 20%, and a confidence value of 0.1 for 'sadness' means that the probability that the acoustic signal (700) is classified as the emotion of sadness is 10%.

[0111] In the embodiment illustrated in FIG. 7, the artificial neural network model (142a) is illustrated and described as outputting a type regarding an individual’s subjective emotion, such as joy, surprise, or sadness, but the present disclosure is not limited thereto. In one embodiment of the present disclosure, the artificial neural network model (142a) may output a type and confidence value representing personal preferences, such as cuteness, coolness, or beauty, as a result of inference.

[0112] In step S640 of FIG. 6, the electronic device (100) stores at least one type of emotion or taste having a confidence value greater than or equal to a preset threshold among the confidence values of each of at least one type as a tag in a photograph. Referring together to the embodiment illustrated in FIG. 7, the processor (130) identifies a type of emotion / taste, 'joy,' among the types and confidence values (730) of emotion / taste output by the artificial neural network model (142a), having a confidence value greater than or equal to a preset threshold, e.g., 0.5, and can store the identified 'joy' as a tag for a photograph taken at a time before or after the acoustic signal (700) is recorded.
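
The threshold-based tag selection of step S640 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `select_tags` and the dictionary layout of the metadata are hypothetical, while the confidence values and the threshold of 0.5 are taken from the FIG. 7 example.

```python
def select_tags(confidences, threshold=0.5):
    """Return the emotion / taste types whose confidence value meets the threshold."""
    return [label for label, value in confidences.items() if value >= threshold]


# Confidence values (730) from the FIG. 7 example.
fig7_output = {"joy": 0.9, "surprise": 0.2, "sadness": 0.1}

# Hypothetical metadata layout: a list of tags stored with the photograph.
photo_metadata = {"tags": select_tags(fig7_output)}
print(photo_metadata)  # {'tags': ['joy']}
```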

[0113] FIG. 8a is a drawing illustrating an emotion / taste classification model (810) according to one embodiment of the present disclosure.

[0114] Referring to FIG. 8a, the emotion / taste classification model (810) may be implemented as an artificial intelligence model. In one embodiment of the present disclosure, the emotion / taste classification model (810) may be included in an emotion / taste classification module (142, see FIG. 5). The emotion / taste classification model (810) may be a deep neural network model trained through a supervised learning method that applies feature vectors extracted from a plurality of training data consisting of acoustic signals as inputs, and applies label values regarding emotions or tastes for each of the plurality of training data as ground truths. The deep neural network model may be, for example, a Convolutional Neural Network or a Transformer. However, this is not limited to this, and deep neural network models may also be implemented as, for example, recurrent neural networks, restricted Boltzmann machines, deep belief networks, bidirectional recurrent deep neural networks, or deep Q-networks.

[0115] When a sound feature vector (801) and a speech feature vector (802) are input to a learned emotion / taste classification model (810), the model can output the type of emotion / taste and a confidence value (820) by analyzing the input sound feature vector (801) and speech feature vector (802) through inference. For example, as a result of inference by the emotion / taste classification model (810), the type of emotion, such as joy, sadness, or surprise, and the type of taste, such as coolness, may be output together. Among the types of emotions, the confidence value of joy may be, for example, 0.8, the confidence value of sadness may be, for example, 0.3, and the confidence value of surprise may be, for example, 0.1. The confidence value of coolness, which is a type of taste, may be, for example, 0.05.

[0116] FIG. 8b is a drawing illustrating a plurality of artificial neural network models (831, 832) according to one embodiment of the present disclosure.

[0117] Referring to FIG. 8b, the emotion classification model (831) and the taste classification model (832) can each be implemented as an artificial intelligence model. In one embodiment of the present disclosure, both the emotion classification model (831) and the taste classification model (832) can be included in an emotion / taste classification module (142, see FIG. 5).

[0118] The emotion classification model (831) may be a deep neural network model trained through a supervised learning method that applies feature vectors extracted from multiple training data consisting of acoustic signals as inputs and applies label values regarding emotions for each of the multiple training data as correct answers. The taste classification model (832) may be a deep neural network model trained through a supervised learning method that applies feature vectors extracted from multiple training data consisting of acoustic signals as inputs and applies label values regarding tastes for each of the multiple training data as correct answers. Since the description of the deep neural network model itself, excluding the input and output of the training data, is the same as that in FIG. 8a, redundant descriptions are omitted.

[0119] When a sound feature vector (801) and a speech feature vector (802) are input to the trained emotion classification model (831), the model can output emotion types and confidence values (821) by analyzing the input sound feature vector (801) and speech feature vector (802) through inference. For example, as a result of inference by the emotion classification model (831), a confidence value can be output for each of the emotion types into which the feature vectors (801, 802) can be classified, such as joy, sadness, surprise, and anger. Among the emotion types, the confidence value for joy may be, for example, 0.8, the confidence value for sadness may be, for example, 0.3, the confidence value for surprise may be, for example, 0.1, and the confidence value for anger may be, for example, 0.05.

[0120] When a sound feature vector (801) and a speech feature vector (802) are input to the trained taste classification model (832), the model can output taste types and confidence values (822) by analyzing the input sound feature vector (801) and speech feature vector (802) through inference. For example, as a result of inference by the taste classification model (832), a confidence value can be output for each of the taste types into which the feature vectors (801, 802) can be classified, such as cuteness, coolness, prettiness, and deliciousness. Among the types of taste, the confidence value for cuteness may be, for example, 0.5, the confidence value for coolness may be, for example, 0.3, the confidence value for prettiness may be, for example, 0.2, and the confidence value for deliciousness may be, for example, 0.05.

[0121] FIG. 9 is a flowchart illustrating a method in which an electronic device (100) according to one embodiment of the present disclosure calculates a preference for at least one type of emotion and taste, and stores the calculated preference as metadata for a photograph.

[0122] Steps S910 to S930 illustrated in FIG. 9 are steps that embody the operation of step S230 of FIG. 2. Step S910 illustrated in FIG. 9 can be performed after the operation of step S220 of FIG. 2 has been performed.

[0123] FIG. 10 is a diagram illustrating the operation of an electronic device (100) according to one embodiment of the present disclosure to calculate a preference for at least one type of emotion and taste.

[0124] Hereinafter, the function and / or operation of the electronic device (100) will be described in detail with reference to FIG. 9 and FIG. 10 together.

[0125] In step S910 of FIG. 9, the electronic device (100) calculates a preference for a photograph based on a preset importance for at least one type of emotion or taste and the confidence value for each of the at least one type. In one embodiment of the present disclosure, the electronic device (100) may calculate the preference for the photograph by multiplying the confidence value for each of the at least one type of emotion or taste by the importance preset for that type. The 'importance' is a value that serves as a weight for at least one type of emotion or taste and may be preset for each type of emotion and taste.

[0126] Referring together to FIG. 10, the processor (130, see FIG. 5) inputs the types of emotion / taste and their confidence values (1000) into the preference calculation module (143) and can calculate a preference for a photo by executing instructions or program code of the preference calculation module (143). When at least one type of emotion or taste and a confidence value for each of the at least one type are input, the processor (130) can identify the types of emotion or taste (1010) having a confidence value greater than or equal to a threshold. The threshold may be, for example, 0.5, but is not limited thereto. In the embodiment illustrated in FIG. 10, the confidence value of pleasure is 0.7, the confidence value of surprise is 0.8, the confidence value of sadness is 0.05, and the confidence value of anger is 0.01; therefore, the types of emotion or taste having a confidence value greater than or equal to the threshold (e.g., 0.5) are 'pleasure' and 'surprise'. The processor (130) may multiply the confidence value of each of pleasure and surprise, which are the types of emotion or taste (1010) having a confidence value greater than or equal to the threshold, by the importance of that type. In one embodiment of the present disclosure, the importance may be preset by user input. However, it is not limited thereto, and in one embodiment of the present disclosure, the importance may be set based on user history information regarding shooting and searching. For example, the importance for 'pleasure' may be preset to 2, and the importance for 'surprise' may be preset to 1. The processor (130) can obtain a preference value of 1.4 by multiplying 0.7, the confidence value of pleasure, by 2, the preset importance for pleasure, and a preference value of 0.8 by multiplying 0.8, the confidence value of surprise, by 1, the preset importance for surprise.

[0127] Referring again to FIG. 9, in step S920, the electronic device (100) identifies the maximum value among the calculated preferences. Referring together to the embodiment illustrated in FIG. 10, the processor (130) can identify the preference (1030) of 1.4, which is the maximum value among the calculated preference values of 1.4 and 0.8.

[0128] In step S930, the electronic device (100) stores the maximum value of the identified preference as a tag for the photo. Referring together to the embodiment illustrated in FIG. 10, the processor (130) can store the preference (1030) calculated as 1.4 as a tag for the photo.
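
Steps S910 to S930 can be worked through with the FIG. 10 numbers in a short Python sketch. The function name `compute_preferences` and the default importance of 1.0 for types without a preset importance are assumptions for illustration; the confidence values, importance values, and threshold are those given for FIG. 10.

```python
def compute_preferences(confidences, importance, threshold=0.5):
    """Multiply each confidence value at or above the threshold by its preset importance."""
    return {
        label: value * importance.get(label, 1.0)  # default weight is an assumption
        for label, value in confidences.items()
        if value >= threshold
    }


# Values from the FIG. 10 example.
fig10_confidences = {"pleasure": 0.7, "surprise": 0.8, "sadness": 0.05, "anger": 0.01}
fig10_importance = {"pleasure": 2, "surprise": 1}

# S910: per-type preference values (about 1.4 for pleasure, 0.8 for surprise).
preferences = compute_preferences(fig10_confidences, fig10_importance)

# S920: maximum preference; S930 would store this value as a tag for the photo.
photo_preference = max(preferences.values(), default=0.0)
```

Sadness and anger fall below the threshold and therefore do not contribute a preference value, which matches the selection of 'pleasure' and 'surprise' described for FIG. 10.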

[0129] FIG. 11 is a flowchart illustrating a method for an electronic device (100) according to one embodiment of the present disclosure to display a search result of a photo when a search query including a search term regarding a search target object is input.

[0130] FIG. 12 is a diagram illustrating the operation of an electronic device (100) according to one embodiment of the present disclosure displaying a search result of a photo when a search query (1210) is input through a search UI (1200).

[0131] Hereinafter, the function and / or operation of the electronic device (100) will be described in detail with reference to FIG. 11 and FIG. 12 together.

[0132] In step S1110 of FIG. 11, the electronic device (100) receives a search query containing search terms regarding a search target object. The 'search target object' refers to a person, animal, object, behavior, action, attribute, situation, event, etc., that the user wishes to search for. Search terms regarding the search target object may include, for example, a bag, shoes, a gift, a person, a puppy, Christmas, etc., but are not limited thereto. In one embodiment of the present disclosure, the electronic device (100) may display a plurality of previously stored photos and receive a search query containing search terms regarding the search target object from the user through a search UI (user interface). Referring together to FIG. 12, the electronic device (100) displays, through the display (150), a plurality of photos (i1, i2, ..., in) previously stored in memory (140, see FIG. 5). The electronic device (100) can display a search UI (1200) (operation ①) and receive a search query for searching photos (operation ②). The user can select the search UI (1200) through user input such as touch and enter a search query (1210) containing a search term regarding the search target object. In the embodiment illustrated in FIG. 12, the electronic device (100) can receive a search query (1210) containing the search term "Christmas", which is the search target object, through the search UI (1200).

[0133] However, the search query (1210) of the present disclosure is not limited to including only search terms as illustrated in FIG. 12. In one embodiment of the present disclosure, the electronic device (100) may receive a search query composed of natural language that includes search terms regarding a search target object. For example, the electronic device (100) may receive a natural language search query that includes a search term 'Christmas', such as "find a photo taken on Christmas Day" or "find a photo of a Christmas gift", through a search UI (1200).

[0134] Referring again to FIG. 11, in step S1120, the electronic device (100) searches for at least one photo among a plurality of previously stored photos that matches a search query. In one embodiment of the present disclosure, the electronic device (100) can search for at least one photo among a plurality of photos that stores metadata matching a search term by comparing the metadata stored for each of the plurality of photos stored in memory (140) with a search term included in the search query.

[0135] In step S1130, the electronic device (100) displays at least one retrieved photo sorted in descending order based on preference. In one embodiment of the present disclosure, the electronic device (100) may identify a preference value among the metadata stored for each of the at least one retrieved photo and sort the at least one photo in order from the photo with the highest preference value to the photo with the lowest preference value. The electronic device (100) may display at least one photo sorted in descending order as a photo search result on the display (150). Referring together to operation ③ illustrated in FIG. 12, the processor (130, see FIG. 5) of the electronic device (100) may identify a preference value for each of the at least one photo (i2, i5, i8, i9) by analyzing the metadata stored for each of the at least one photo (i2, i5, i8, i9) that matches the search query (1210). In the embodiment illustrated in FIG. 12, among at least one retrieved photo (i2, i5, i8, i9), the preference value of the fifth photo (i5) may be the largest, the preference value of the ninth photo (i9) may be smaller than that of the fifth photo (i5), the preference value of the eighth photo (i8) may be smaller than that of the ninth photo (i9), and the preference value of the second photo (i2) may be the smallest. In this case, the processor (130) may sort the retrieved photos in descending order according to the preference value, for example, in the order of the fifth photo (i5), the ninth photo (i9), the eighth photo (i8), and the second photo (i2). The processor (130) may display at least one photo (i2, i5, i8, i9) sorted in descending order according to the preference value on the display (150).
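
The metadata matching of step S1120 and the descending sort of step S1130 can be sketched as below. The `Photo` class, its `keywords` field, and all preference values are hypothetical and chosen only so that the result reproduces the order described for FIG. 12 (i5, i9, i8, i2); the disclosure does not give the actual values.

```python
from dataclasses import dataclass, field


@dataclass
class Photo:
    photo_id: str
    keywords: set = field(default_factory=set)  # hypothetical metadata field
    preference: float = 0.0                      # preference stored as a tag


def search_photos(photos, search_term):
    """Return photos whose metadata matches the term, highest preference first."""
    term = search_term.lower()
    matches = [p for p in photos if term in p.keywords]
    return sorted(matches, key=lambda p: p.preference, reverse=True)


# Illustrative library; preference values are made up.
library = [
    Photo("i1", {"beach"}, 1.0),
    Photo("i2", {"christmas"}, 0.6),
    Photo("i5", {"christmas", "gift"}, 1.8),
    Photo("i8", {"christmas"}, 0.9),
    Photo("i9", {"christmas"}, 1.4),
]

results = search_photos(library, "Christmas")
print([p.photo_id for p in results])  # ['i5', 'i9', 'i8', 'i2']
```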

[0136] When the electronic device (100) according to the embodiment illustrated in FIGS. 11 and 12 receives a search query (1210) containing a search term regarding a search target object, it displays at least one searched photo (i2, i5, i8, i9) sorted in descending order according to preference determined based on at least one of emotion and taste, rather than simply listing them in chronological order, thereby providing a technical effect that reflects the user's personal preference and improves the user experience regarding the search function. In addition, the electronic device (100) according to one embodiment of the present disclosure provides a technical effect that enables the user to find the photo they want more quickly and conveniently among the at least one photo (i2, i5, i8, i9) provided as a search result.

[0137] FIG. 13 is a flowchart illustrating a method in which an electronic device (100) according to one embodiment of the present disclosure analyzes at least one of emotion and taste based on an acoustic signal of a video and stores the analysis result as metadata regarding the video.

[0138] FIG. 14 is a diagram illustrating the operation of an electronic device (100) according to one embodiment of the present disclosure analyzing at least one of an emotion and a taste based on an acoustic signal (1410) of a video (1400) and storing the analysis result as metadata regarding the video (1400).

[0139] Hereinafter, the function and / or operation of the electronic device (100) will be described in detail with reference to FIG. 13 and FIG. 14 together.

[0140] In step S1310 of FIG. 13, the electronic device (100) divides the sound signal included in the video into pre-set time intervals to obtain a plurality of interval sound signal data. In one embodiment of the present disclosure, the electronic device (100) can extract sound signal data included in the video. The extracted sound signal is data regarding sound received through a microphone (120, see FIG. 5) during video recording, and may include, for example, electrical signals regarding speech, laughter, interjections, sighs, ambient noise, or music. The electronic device (100) can obtain a plurality of interval sound signal data by dividing the extracted sound signal into pre-set specific time lengths, for example, windows.

[0141] In one embodiment of the present disclosure, an electronic device (100) may divide an acoustic signal such that some of the segmented acoustic signal data overlap. Referring together with FIG. 14, a processor (130, see FIG. 5) of the electronic device (100) may extract an acoustic signal (1410) from a video (1400) and divide the acoustic signal (1410) into pre-set time segments. The processor (130) may divide the acoustic signal (1410) to obtain a plurality of segmented acoustic signal data (1410-1 to 1410-5). The plurality of segmented acoustic signal data (1410-1 to 1410-5) may overlap each other at some time intervals. For example, the first interval acoustic signal data (1410-1) may be acoustic signal data during the time interval from the first time point (t1) to the third time point (t3), and the second interval acoustic signal data (1410-2) may be acoustic signal data during the time interval from the second time point (t2) to the fourth time point (t4). The first interval acoustic signal data (1410-1) and the second interval acoustic signal data (1410-2) may overlap each other during the time interval between the second time point (t2) and the third time point (t3). Similarly, the third interval acoustic signal data (1410-3) may be acoustic signal data during the time interval from the third time point (t3) to the fifth time point (t5), and the second interval acoustic signal data (1410-2) and the third interval acoustic signal data (1410-3) may overlap each other during the time interval between the third time point (t3) and the fourth time point (t4).
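
The overlapping division described above corresponds to a sliding window whose hop is shorter than its length. The following sketch assumes, for illustration only, that the time points t1 to t7 are equally spaced and mapped to the indices 0 to 6, that each segment spans two units, and that consecutive segments start one unit apart; the function name `split_into_windows` is hypothetical.

```python
def split_into_windows(total_length, window, hop):
    """Return (start, end) pairs for windows of length `window` advanced by `hop`."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    spans = []
    start = 0
    while start + window <= total_length:
        spans.append((start, start + window))
        start += hop
    return spans


# t1..t7 mapped to 0..6 (assumption): window of 2 units, hop of 1 unit.
spans = split_into_windows(total_length=6, window=2, hop=1)
print(spans)  # [(0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
```

Under this mapping, the first span (0, 2) corresponds to the first segment (t1 to t3) and the second span (1, 3) to the second segment (t2 to t4), which share the unit between t2 and t3.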

[0142] Referring again to FIG. 13, in step S1320, the electronic device (100) inputs a plurality of feature vectors extracted from the plurality of segmented acoustic signal data into an artificial neural network model and analyzes the plurality of segmented acoustic signal data using the artificial neural network model to obtain at least one type of emotion or taste. In one embodiment of the present disclosure, the electronic device (100) may preprocess the plurality of segmented acoustic signal data, extract features representing at least one of volume, tempo, and speech from each of the preprocessed segmented acoustic signal data, and obtain a plurality of feature vectors by vector embedding the extracted features. The electronic device (100) inputs the plurality of feature vectors into the artificial neural network model and performs inference using the artificial neural network model to analyze the input feature vectors, thereby obtaining at least one type of emotion or taste into which the acoustic signal is classified and at least one confidence value representing the probability that the acoustic signal is classified into the at least one type. Since the artificial neural network model is identical to the artificial neural network model described with reference to FIGS. 1, 2, 5, and 7, redundant descriptions are omitted.

[0143] Referring together to FIG. 14, the processor (130) obtains a plurality of feature vectors from the plurality of segment acoustic signal data (1410-1 to 1410-5) and performs inference by inputting the plurality of feature vectors into the artificial neural network model, thereby obtaining, for each of the plurality of segment acoustic signal data (1410-1 to 1410-5), at least one type of emotion or taste and a confidence value for the at least one type. For example, as a result of inference by the artificial neural network model, the emotion type 'joy' and a confidence value for joy (1421) may be obtained. Additionally, the emotion type 'surprise' and a confidence value for surprise (1422) may be obtained. For example, when the first segment acoustic signal data (1410-1) is applied as input to the artificial neural network model, a confidence value of 0.5 for joy and a confidence value of 0.1 for surprise may be obtained as a result of inference. When the second segment acoustic signal data (1410-2) is applied as input to the artificial neural network model, a confidence value of 0.05 for joy and a confidence value of 0.02 for surprise may be obtained. For the third segment acoustic signal data (1410-3), the confidence value for joy may be 0.5 and the confidence value for surprise may be 0.9; for the fourth segment acoustic signal data (1410-4), the confidence value for joy may be 0.5 and the confidence value for surprise may be 0.8; and for the fifth segment acoustic signal data (1410-5), the confidence value for joy may be 0.3 and the confidence value for surprise may be 0.6.

[0144] In one embodiment of the present disclosure, a weighted sum of confidence values can be obtained for each time interval in which the plurality of segment acoustic signal data (1410-1 to 1410-5) overlap. For example, for the time interval between the second time point (t2) and the third time point (t3), in which the first segment acoustic signal data (1410-1) and the second segment acoustic signal data (1410-2) overlap, a weighted sum of 0.275 can be calculated by multiplying 0.5, the confidence value of joy obtained from the first segment acoustic signal data (1410-1), by 1 / 2, multiplying 0.05, the confidence value of joy obtained from the second segment acoustic signal data (1410-2), by 1 / 2, and summing the products. Likewise, a weighted sum of 0.06 can be calculated by multiplying 0.1, the confidence value of surprise obtained from the first segment acoustic signal data (1410-1), by 1 / 2, multiplying 0.02, the confidence value of surprise obtained from the second segment acoustic signal data (1410-2), by 1 / 2, and summing the products. Through the weighted sum operation described above, the processor (130) can calculate the confidence value of joy (1421) as 0.275 and the confidence value of surprise (1422) as 0.46 for the time interval between the third time point (t3) and the fourth time point (t4), the confidence value of joy (1421) as 0.5 and the confidence value of surprise (1422) as 0.85 for the time interval between the fourth time point (t4) and the fifth time point (t5), and the confidence value of joy (1421) as 0.4 and the confidence value of surprise (1422) as 0.75 for the time interval between the fifth time point (t5) and the sixth time point (t6).
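
The weighted sum over overlapping segments can be reproduced with a short Python sketch using the per-segment confidence values given for FIG. 14. The function name `overlap_confidences` is hypothetical, and applying equal weights of 1 / 2 to every pair of adjacent segments is an assumption carried over from the t2 to t3 example.

```python
# Per-segment confidence values from the FIG. 14 example (1410-1 to 1410-5).
segments = [
    {"joy": 0.5, "surprise": 0.1},
    {"joy": 0.05, "surprise": 0.02},
    {"joy": 0.5, "surprise": 0.9},
    {"joy": 0.5, "surprise": 0.8},
    {"joy": 0.3, "surprise": 0.6},
]


def overlap_confidences(segments, weights=(0.5, 0.5)):
    """Weighted sum of the confidence values of each pair of adjacent segments."""
    w_prev, w_next = weights
    return [
        {label: w_prev * prev[label] + w_next * nxt[label] for label in prev}
        for prev, nxt in zip(segments, segments[1:])
    ]


# overlaps[0] covers t2 to t3, overlaps[1] t3 to t4,
# overlaps[2] t4 to t5, and overlaps[3] t5 to t6.
overlaps = overlap_confidences(segments)
for interval, values in zip(["t2-t3", "t3-t4", "t4-t5", "t5-t6"], overlaps):
    print(interval, {k: round(v, 3) for k, v in values.items()})
```

With equal weights, the sketch yields 0.275 and 0.06 for t2 to t3, 0.275 and 0.46 for t3 to t4, and 0.5 and 0.85 for t4 to t5, in agreement with the values above. For t5 to t6 it yields 0.4 for joy and 0.7 for surprise, whereas the description states 0.75 for surprise, so the weights used for that interval may differ from 1 / 2; either value is above the 0.5 threshold.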

[0145] Referring again to FIG. 13, in step S1330, the electronic device (100) stores at least one of an emotion and a taste as metadata for a plurality of image frames included in a time interval corresponding to each of the plurality of audio signal data intervals among the time intervals of the video. In one embodiment of the present disclosure, the electronic device (100) may store at least one type of emotion or taste having a reliability value greater than or equal to a preset threshold among the reliability values of each of at least one type as a tag for image frames included in the corresponding time interval of the video. The preset threshold may be, for example, 0.5, but is not limited thereto.

[0146] Referring together with FIG. 14, among the plurality of time intervals, the time interval in which a reliability value calculated through the weighted sum is greater than or equal to the preset threshold (e.g., 0.5) is the time interval between the fourth time point (t4) and the sixth time point (t6). For the image frames (f_j to f_n) included in the time interval between the fourth time point (t4) and the sixth time point (t6), the processor (130) can store the type of emotion or preference having a value greater than or equal to the threshold as metadata (1430). For example, in the time interval between the fourth time point (t4) and the fifth time point (t5), the reliability value of joy is 0.5 and the reliability value of surprise is 0.85, so the processor (130) can store a first tag (1431) indicating 'joy' and a second tag (1432) indicating 'surprise' for the image frames (f_j to f_k-1) included in that time interval among all the image frames (f_1 to f_n) included in the video (1400). Likewise, in the time interval between the fifth time point (t5) and the sixth time point (t6), the reliability value of joy is 0.4, which is below the threshold, while the reliability value of surprise is greater than or equal to the threshold, so the processor (130) can store a third tag (1433) indicating 'surprise' for the image frames (f_k to f_n) included in that time interval among all the image frames (f_1 to f_n) included in the video (1400). In one embodiment of the present disclosure, the processor (130) may not store a tag for the image frames in the remaining time intervals in which the reliability value is determined to be below the preset threshold, for example, between the first time point (t1) and the fourth time point (t4) and between the sixth time point (t6) and the seventh time point (t7).
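
The threshold-based tagging described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the function name `tags_for_interval` and the dictionary layout are assumptions made for illustration, and only the reliability values stated in the text (0.5 and 0.85 for the interval t4 to t5, 0.4 for joy in the interval t5 to t6) are used.

```python
from typing import Dict, List

THRESHOLD = 0.5  # example threshold from the text; the disclosure does not fix it


def tags_for_interval(reliability: Dict[str, float],
                      threshold: float = THRESHOLD) -> List[str]:
    """Return the emotion/preference types whose reliability is >= threshold.

    Hypothetical helper: the name and the dict-based input are assumptions.
    """
    return [label for label, value in reliability.items() if value >= threshold]


# Interval t4..t5 of the FIG. 14 example: both types reach the threshold.
tags_t4_t5 = tags_for_interval({"joy": 0.5, "surprise": 0.85})

# A joy reliability of 0.4 (interval t5..t6) falls below the threshold,
# so no 'joy' tag is produced for it.
tags_joy_only = tags_for_interval({"joy": 0.4})

print(tags_t4_t5)     # ['joy', 'surprise']
print(tags_joy_only)  # []
```

An interval for which the returned list is empty would simply receive no tag, which corresponds to the untagged intervals mentioned above.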

[0147] FIG. 15 is a flowchart illustrating a method in which an electronic device (100) according to one embodiment of the present disclosure calculates a preference based on an audio signal of a video and stores the calculated preference as metadata regarding the video.

[0148] Steps S1510 and S1520 illustrated in FIG. 15 are steps that embody the operation of step S1320 of FIG. 13. Step S1510 illustrated in FIG. 15 may be performed after the operation of step S1310 of FIG. 13 has been performed. Steps S1530 and S1540 illustrated in FIG. 15 are steps that embody the operation of step S1330 of FIG. 13.

[0149] FIG. 16 is a graph illustrating the operation of an electronic device (100) according to one embodiment of the present disclosure, which calculates a preference (1610) based on an audio signal of a video and stores the calculated preference as metadata for a segment (1620) within the video.

[0150] Hereinafter, the function and / or operation of the electronic device (100) will be described in detail with reference to FIG. 15 and FIG. 16 together.

[0151] In step S1510 of FIG. 15, the electronic device (100) extracts a feature vector from each of the plurality of segmented acoustic signal data. In one embodiment of the present disclosure, the electronic device (100) may divide an acoustic signal included in a video into a plurality of segmented acoustic signal data and extract features representing at least one of volume, timbre, tempo, and speech from each of the divided plurality of segmented acoustic signal data. The electronic device (100) may convert the extracted feature data into a feature vector by vector embedding. Since step S1510 is identical to step S620 of FIG. 6 and the operation of FIG. 7 except that it extracts a feature vector from the plurality of segmented acoustic signal data, a redundant description is omitted.
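
As a rough illustration of dividing an acoustic signal into intervals and extracting one feature per interval, the following Python sketch splits a list of samples into fixed-length intervals and computes a root-mean-square value as a stand-in for 'volume'. The helper names, the sample values, and the use of RMS are assumptions; timbre, tempo, speech features and the vector embedding step are not shown.

```python
import math
from typing import List


def split_into_intervals(samples: List[float], sample_rate: int,
                         interval_sec: float) -> List[List[float]]:
    """Divide a signal into consecutive intervals of interval_sec seconds."""
    size = max(1, int(sample_rate * interval_sec))
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def rms_volume(interval: List[float]) -> float:
    """Root-mean-square amplitude, used here as a simple volume feature."""
    return math.sqrt(sum(x * x for x in interval) / len(interval))


# Hypothetical 2-second signal sampled at 4 Hz (illustrative values only).
samples = [0.0, 0.5, -0.5, 0.0, 1.0, -1.0, 1.0, -1.0]
intervals = split_into_intervals(samples, sample_rate=4, interval_sec=1.0)

# One single-element feature vector per interval.
feature_vectors = [[rms_volume(chunk)] for chunk in intervals]
```

In this sketch each interval yields a one-dimensional vector; a real feature vector would combine several features before being passed to the model.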

[0152] In step S1520, the electronic device (100) inputs the extracted feature vector into an artificial neural network model and performs inference using the artificial neural network model to analyze the feature vector, thereby obtaining at least one type of emotion or preference into which each of the multiple segmented acoustic signal data is classified, and a confidence value for each of the at least one type. In one embodiment of the present disclosure, as a result of inference by the artificial neural network model, types of emotions such as joy, surprise, and sadness, which are subjective emotions, are obtained from the feature vectors extracted from the multiple segmented acoustic signal data, and confidence values for each of the types of emotions can be obtained. In addition, in one embodiment of the present disclosure, as a result of inference by the artificial neural network model, types of preferences such as cuteness, coolness, and prettiness, which are personal preferences, are obtained from the feature vectors extracted from the multiple segmented acoustic signal data, and confidence values for each of the types of preferences can be obtained. Since the method of inference by the artificial neural network model and the weighted sum of confidence values are the same as those described with reference to FIGS. 13 and 14, redundant descriptions are omitted.

[0153] Referring to FIG. 16, when analyzing multiple interval acoustic signal data using an artificial neural network model, the reliability value (1601), which is the probability value that multiple interval acoustic signal data can be classified as pleasure among the types of emotions, can be calculated as 0.5 in the time interval between the first time point (t1) and the second time point (t2), 0.275 in the time interval between the second time point (t2) and the fourth time point (t4), and 0.5 in the time interval between the fourth time point (t4) and the fifth time point (t5). Additionally, as a result of analyzing multiple interval acoustic signal data using an artificial neural network model, the reliability value (1602), which is the probability value that multiple interval acoustic signal data can be classified as surprise among the types of emotions, can be calculated as 0.06 in the time interval between the second time point (t2) and the third time point (t3), 0.46 in the time interval between the third time point (t3) and the fourth time point (t4), and 0.85 in the time interval between the fourth time point (t4) and the fifth time point (t5).

[0154] Referring again to FIG. 15, in step S1530, the electronic device (100) calculates a preference for each of the multiple segmented audio signal data based on a pre-set importance for at least one type of emotion and preference and a reliability value for each of the at least one type. In one embodiment of the present disclosure, the electronic device (100) can calculate a preference for an image frame in a time segment corresponding to the multiple segmented audio signal data included in the video by applying a pre-set importance for each of the at least one type of emotion or preference and performing an operation of multiplying the importance by the reliability value for each of the at least one type. The 'importance' is a value of the concept of weight for at least one type of emotion or preference and may be pre-set for each of the at least one type of emotion and preference. In one embodiment of the present disclosure, the importance may be pre-set by user input or pre-set based on user's history information regarding shooting and searching. For example, the importance of the emotion type 'joy' may be set to 2, and the importance of the emotion type 'sadness' may be set to 0.5.

[0155] Referring together with FIG. 16, the processor (130, see FIG. 5) of the electronic device (100) can calculate a preference of 1 by multiplying the reliability value of pleasure, 0.5, by the importance value of pleasure, 2, in the time interval between the first time point (t1) and the second time point (t2). Additionally, the electronic device (100) can calculate 1 and 1.7, respectively, by multiplying the reliability value of pleasure, 0.5, by the importance value, 2, and the reliability value of surprise, 0.85, by the importance value, 2, in the time interval between the fourth time point (t4) and the fifth time point (t5). The processor (130) can determine the maximum value among the multiple preference values, 1.7, as the preference for the time interval between the fourth time point (t4) and the fifth time point (t5).

[0156] In one embodiment of the present disclosure, the processor (130) may not calculate the preference value if the reliability value of the emotion or preference is smaller than a preset threshold (α). The threshold (α) may be, for example, 0.5. Referring to the embodiment illustrated in FIG. 16, in the time interval between the second time point (t2) and the third time point (t3), the reliability value of pleasure (1601) is 0.275 and the reliability value of surprise (1602) is 0.06, so both pleasure and surprise are below the threshold (α). Therefore, the preference in the time interval between the second time point (t2) and the third time point (t3) is determined to be 0. Similarly, in the time interval between the third time point (t3) and the fourth time point (t4), the reliability value of pleasure (1601) is 0.275 and the reliability value of surprise (1602) is 0.46, so both pleasure and surprise are below the threshold (α). Therefore, the preference in the time interval between the third time point (t3) and the fourth time point (t4) is also determined to be 0.
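
The preference calculation of FIG. 16 can be reproduced with a short Python sketch. It assumes a dictionary representation of the reliability and importance values and a hypothetical helper `interval_preference`; the numbers are those stated in the text (importance 2 for pleasure and surprise, 0.5 for sadness, threshold α = 0.5).

```python
from typing import Dict

ALPHA = 0.5  # example threshold (alpha) from the text
IMPORTANCE = {"pleasure": 2.0, "surprise": 2.0, "sadness": 0.5}


def interval_preference(reliability: Dict[str, float],
                        importance: Dict[str, float] = IMPORTANCE,
                        alpha: float = ALPHA) -> float:
    """Maximum of importance * reliability over types with reliability >= alpha.

    Returns 0.0 when every type is below alpha (hypothetical helper).
    """
    scores = [importance[label] * value
              for label, value in reliability.items()
              if value >= alpha]
    return max(scores, default=0.0)


# Reliability values stated for the FIG. 16 example.
fig16 = {
    "t1-t2": {"pleasure": 0.5},
    "t2-t3": {"pleasure": 0.275, "surprise": 0.06},
    "t3-t4": {"pleasure": 0.275, "surprise": 0.46},
    "t4-t5": {"pleasure": 0.5, "surprise": 0.85},
}
preferences = {span: interval_preference(values) for span, values in fig16.items()}
print(preferences)
```

With these inputs the intervals t1-t2 and t4-t5 receive preferences of 1 and 1.7, and the two intervals in between receive 0, matching the values described above.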

[0157] Referring again to FIG. 15, in step S1540, the electronic device (100) stores the preference as a tag for a plurality of image frames included in a time interval corresponding to each of the plurality of interval audio signal data within the video. The electronic device (100) may extract a time interval within the entire time interval of the video where the preference value exceeds 0, and store the preference value as a tag for a plurality of image frames included in the extracted time interval. Referring together to FIG. 16, the processor (130) may extract a first interval (1622) between a first time point (t1) and a second time point (t2) and a second interval (1624) between a fourth time point (t4) and a fifth time point (t5) within the entire time interval of the video where the preference value exceeds 0, and store the preference value as a tag for a plurality of image frames corresponding to the extracted first interval (1622) and second interval (1624) among the entire image frames included in the video. For example, for the image frames played in the first interval (1622), a preference value of 1 can be stored as a tag, and for the image frames played in the second interval (1624), a preference value of 1.7 can be stored as a tag.
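
One way to picture the storing step is a mapping from frame index to preference value, filled in only for intervals whose preference exceeds 0. The sketch below is a hedged illustration: the function `tag_frames`, the frame rate, and the interval timestamps in seconds are all assumptions, since the text identifies the intervals only by time points t1 to t5.

```python
from typing import Dict, List, Tuple


def tag_frames(intervals: List[Tuple[float, float, float]],
               fps: float, frame_count: int) -> Dict[int, float]:
    """Map frame index -> preference for intervals with preference > 0.

    Each interval is (start_sec, end_sec, preference). Hypothetical helper.
    """
    tags: Dict[int, float] = {}
    for start, end, preference in intervals:
        if preference <= 0:
            continue  # intervals with preference 0 receive no tag
        first = int(start * fps)
        last = min(int(end * fps), frame_count)
        for index in range(first, last):
            tags[index] = preference
    return tags


# Hypothetical 5-second clip at 2 frames per second (timestamps are illustrative).
intervals = [(0.0, 1.0, 1.0), (1.0, 3.0, 0.0), (3.0, 4.0, 1.7), (4.0, 5.0, 0.0)]
frame_tags = tag_frames(intervals, fps=2.0, frame_count=10)
print(frame_tags)  # {0: 1.0, 1: 1.0, 6: 1.7, 7: 1.7}
```

Frames outside the two tagged intervals do not appear in the mapping, which mirrors the extraction of only those intervals whose preference value exceeds 0.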

[0158] FIG. 17 is a flowchart illustrating a method for an electronic device (100) according to one embodiment of the present disclosure to display a search result of a video when a search query including a search term regarding a search target object is input.

[0159] In step S1710, the electronic device (100) receives a search query containing search terms regarding a search target object. The 'search target object' refers to a person, animal, object, behavior, action, attribute, situation, event, etc., that the user wishes to search for. Search terms regarding the search target object may include, for example, a bag, shoes, a gift, a person, a puppy, Christmas, etc., but are not limited thereto. In one embodiment of the present disclosure, the electronic device (100) may display a plurality of previously stored videos and receive a search query containing search terms regarding the search target object from the user through a search UI (user interface). Since step S1710 is identical to step S1110 of FIG. 11 except that it searches for videos rather than photos, a redundant description is omitted.

[0160] In step S1720, the electronic device (100) searches for at least one video that matches a search query among a plurality of previously stored videos and a matching section in at least one video. In one embodiment of the present disclosure, the electronic device (100) can search for a section of at least one video that stores metadata matching a search term among a plurality of videos by comparing metadata stored for each of the plurality of videos stored in memory (140) with a search term included in the search query. The processor (130, see FIG. 5) of the electronic device (100) can analyze tags stored for a plurality of image frames included in each of the plurality of videos and search for at least one video containing tags that match a search query and a time section corresponding to a plurality of image frames within at least one video as a result of the analysis.

[0161] In step S1730, the electronic device (100) determines the preference of the searched section by calculating the average value of the preferences of a plurality of image frames included within the searched section in each of at least one video. In one embodiment of the present disclosure, the processor (130) may obtain a preference value by analyzing the tag of each of the plurality of image frames included within the searched section in each of at least one video. The processor (130) may determine the preference value of the searched section by calculating the average of the preference values obtained from each of the plurality of image frames.

[0162] In step S1740, the electronic device (100) displays the at least one searched video in descending order based on the preference of the searched segment. The processor (130) can sort the videos from the video with the highest preference to the video with the lowest preference and display the sorted videos as video search results through a display (150, see FIG. 5). In one embodiment of the present disclosure, taking into account the characteristics of a video highlight, the processor (130) can start playing a thumbnail video from a time point that is a preset offset time earlier than the start time of the searched segment, rather than from the start time of the searched segment itself. The offset time may be, for example, 1 second or more and less than 2 seconds, but is not limited thereto.
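
Steps S1730 and S1740 can be sketched in Python as averaging the per-frame preferences of each matched section, sorting in descending order, and moving the thumbnail start earlier by an offset. The `Section` class, the helper names, the example values, and the clamping of the start time to 0 are assumptions for illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

OFFSET_SEC = 1.0  # example offset; the text suggests 1 second or more and less than 2


@dataclass
class Section:
    """A matched section of a video (hypothetical structure)."""
    video_id: str
    start_sec: float
    frame_preferences: List[float]  # assumed non-empty


def section_preference(section: Section) -> float:
    """Average of the preference tags of the frames in the section."""
    return mean(section.frame_preferences)


def rank_sections(sections: List[Section]) -> List[Section]:
    """Sort matched sections from highest to lowest average preference."""
    return sorted(sections, key=section_preference, reverse=True)


def thumbnail_start(section: Section, offset: float = OFFSET_SEC) -> float:
    """Start the thumbnail slightly before the section, never before 0."""
    return max(0.0, section.start_sec - offset)


sections = [
    Section("video_a", start_sec=0.5, frame_preferences=[1.0, 1.0]),
    Section("video_b", start_sec=12.0, frame_preferences=[1.7, 1.7, 1.0]),
]
ranked = rank_sections(sections)
print([s.video_id for s in ranked])  # ['video_b', 'video_a']
print(thumbnail_start(ranked[0]))    # 11.0
```

Clamping at 0 is a design choice of this sketch for sections that begin within the first second of a video; the text does not say how that case is handled.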

[0163] When the electronic device (100) according to the embodiment illustrated in FIG. 17 receives a search query containing a search term regarding a search target object, it does not simply sort and display at least one searched video in chronological order, but sorts the videos in descending order based on preference, and pre-plays a time segment containing specific image frames with high preference among the entire segment of the searched video in the form of a thumbnail video, thereby providing a technical effect that can improve user convenience and user experience regarding the video search function.

[0164] The present disclosure provides a method for an electronic device (100) to analyze and search for a photograph based on sound. The method of operation of the electronic device (100) may include a step (S210) of acquiring a sound signal through a microphone for a preset time before and after the time of taking a photograph using a camera (110). The method of operation of the electronic device (100) may include a step (S220) of acquiring at least one type of emotion and taste regarding the sound signal by inputting a feature vector extracted from the acquired sound signal into an artificial neural network model trained to classify at least one of emotion and taste from sound signal learning data, and analyzing the feature vector using the artificial neural network model. The method of operation of the electronic device (100) may include a step (S230) of storing at least one type of emotion and taste acquired as metadata for a photograph taken through the camera.

[0165] In one embodiment of the present disclosure, the method of operation of the electronic device (100) may further include: a step (S410) of receiving a user’s search query including a search term regarding at least one of emotion and taste; a step (S420) of searching for at least one photo among a plurality of previously stored photos that stores metadata matching the search term by comparing the plurality of metadata stored for each of the plurality of photos with the metadata corresponding to the search term included in the search query; and a step (S430) of displaying the searched at least one photo as a search result for the search query.
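
The tag-matching search of steps S410 to S430 can be pictured with the following minimal Python sketch. The `Photo` class, the file names, and the case-insensitive exact match are assumptions; the disclosure does not specify how a search term is compared with stored metadata.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Photo:
    """A stored photo with emotion/taste tags as metadata (hypothetical)."""
    path: str
    tags: Set[str] = field(default_factory=set)


def search_photos(photos: List[Photo], term: str) -> List[Photo]:
    """Return the photos whose tags contain the search term (case-insensitive)."""
    wanted = term.strip().lower()
    return [photo for photo in photos
            if wanted in {tag.lower() for tag in photo.tags}]


# Illustrative library; the file names are made up.
library = [
    Photo("IMG_0001.jpg", {"joy", "surprise"}),
    Photo("IMG_0002.jpg", {"sadness"}),
    Photo("IMG_0003.jpg", {"Joy"}),
]
results = search_photos(library, "joy")
print([photo.path for photo in results])  # ['IMG_0001.jpg', 'IMG_0003.jpg']
```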

[0166] In one embodiment of the present disclosure, the step (S220) of obtaining at least one type of emotion and taste regarding the acoustic signal may include: a step (S610) of extracting features representing at least one of volume, tone, tempo, and speech included in the acoustic signal from the acoustic signal, and converting the extracted features into a feature vector by vector embedding; and a step (S620) of inputting the feature vector into an artificial neural network model and analyzing the input feature vector by performing inference using the artificial neural network model to obtain at least one type of emotion or taste to which the acoustic signal is classified and at least one confidence value representing the probability that the acoustic signal is classified into at least one type. The step (S230) of storing at least one type of emotion and taste as metadata for a photograph may include a step (S630) of storing at least one type of emotion and taste having a confidence value greater than or equal to a preset threshold among the confidence values of each of the at least one type as a tag in the photograph.

[0167] In one embodiment of the present disclosure, the step (S230) of storing at least one type of emotion and taste as metadata for a photograph may further include: the step (S910) of calculating a preference for a photograph based on a pre-set importance for at least one type of emotion or taste and a reliability value for each of at least one type; and the step (S930) of storing the calculated preference as a tag for the photograph.

[0168] In one embodiment of the present disclosure, in the step of storing the preference as a tag in the photo, the electronic device (100) may store the maximum value among the preference values regarding at least one of emotion and taste as a tag in the photo.

[0169] In one embodiment of the present disclosure, the pre-set importance may be pre-set based on user input or history information regarding the user's shooting and searching.

[0170] In one embodiment of the present disclosure, the method of operation of the electronic device (100) may further include the step of receiving a user’s search query including a search term regarding a search target object (S1110); the step of searching for at least one photo among a plurality of previously stored photos that matches the search term included in the search query (S1120); and the step of sorting the at least one searched photo in descending order based on preference and displaying it as a search result (S1130).

[0171] In one embodiment of the present disclosure, the artificial neural network model may include an emotion classifier model trained by supervised learning, which applies a feature vector extracted from first acoustic signal learning data as input and a label value regarding the type of emotion as the ground truth. The artificial neural network model may include a taste classifier model trained by supervised learning, which applies a feature vector extracted from second acoustic signal learning data as input and a label value regarding the type of taste as the ground truth.

[0172] The present disclosure provides an electronic device (100) for analyzing and searching for photographs based on sound. The electronic device (100) may include a camera (110); a microphone (120); at least one processor (130) including processing circuitry; and a memory (140) for storing one or more instructions. By executing the one or more instructions individually or collectively by the at least one processor (130), the electronic device (100) may acquire a sound signal through the microphone (120) for a preset time before and after the time of taking a photograph using the camera (110). By executing the one or more instructions individually or collectively by the at least one processor (130), the electronic device (100) may input a feature vector extracted from the sound signal into an artificial neural network model trained to classify at least one of an emotion and a taste from sound signal learning data, and analyze the feature vector using the artificial neural network model, thereby obtaining at least one type of emotion and taste regarding the sound signal. By executing the one or more instructions individually or collectively by the at least one processor (130), the electronic device (100) may store the obtained at least one type of emotion and taste as metadata for a photograph taken through the camera (110).

[0173] In one embodiment of the present disclosure, the electronic device (100) may further include a display (150). By executing the one or more instructions by the at least one processor (130), the electronic device (100) may receive a user’s search query including a search term regarding at least one of emotion and taste and, by comparing a plurality of metadata stored for each of a plurality of previously stored photos with metadata corresponding to the search term included in the search query, search for at least one photo among the plurality of photos that stores metadata matching the search term. By executing the one or more instructions by the at least one processor (130), the electronic device (100) may display the at least one searched photo on the display (150) as a search result for the search query.

[0174] In one embodiment of the present disclosure, by executing one or more instructions by at least one processor (130), the electronic device (100) can extract features representing at least one of volume, tone, tempo, and utterance included in the acoustic signal from the acoustic signal, convert the extracted features into a feature vector by vector embedding, input the feature vector into an artificial neural network model, and analyze the input feature vector by performing inference using the artificial neural network model, thereby obtaining at least one type of emotion or taste to which the acoustic signal is classified and at least one confidence value representing the probability that the acoustic signal is classified into said at least one type. By executing one or more instructions by at least one processor (130), the electronic device (100) can store at least one type of emotion or taste having a confidence value greater than or equal to a preset threshold among the confidence values of each of the at least one type as a tag in a photograph.

[0175] In one embodiment of the present disclosure, by executing one or more of the instructions by at least one processor (130), the electronic device (100) can calculate a preference for a photo based on a pre-set importance for at least one type of emotion or taste and a reliability value for each of at least one type, and store the calculated preference as a tag on the photo.

[0176] In one embodiment of the present disclosure, by executing one or more of the instructions by at least one processor (130), the electronic device (100) can store the maximum value among the preference values regarding at least one of emotion and taste as a tag in a photograph.

[0177] In one embodiment of the present disclosure, the pre-set importance may be pre-set based on user input or history information regarding the user's shooting and searching.

[0178] In one embodiment of the present disclosure, the electronic device (100) may further include a display (150). By executing the one or more instructions by the at least one processor (130), the electronic device (100) may receive a user’s search query including a search term regarding a search target object and search for at least one photo among a plurality of previously stored photos that matches the search term included in the search query. By executing the one or more instructions by the at least one processor (130), the electronic device (100) may sort the at least one searched photo in descending order based on preference and display it on the display (150) as a search result.

[0179] In one embodiment of the present disclosure, the artificial neural network model may include an emotion classifier model trained by supervised learning, which applies a feature vector extracted from first acoustic signal learning data as input and a label value regarding the type of emotion as the ground truth. The artificial neural network model may include a taste classifier model trained by supervised learning, which applies a feature vector extracted from second acoustic signal learning data as input and a label value regarding the type of taste as the ground truth.

[0180] The present disclosure provides a computer program product comprising a computer-readable storage medium. The storage medium may include instructions readable by the electronic device (100) for the electronic device (100) to perform the operations of acquiring a sound signal through a microphone for a preset time before and after the time of taking a photograph using a camera (110), inputting a feature vector extracted from the acquired sound signal into an artificial neural network model trained to classify at least one of emotion and taste from sound signal learning data, and acquiring at least one type of emotion and taste regarding the sound signal by analyzing the feature vector using the artificial neural network model, and storing at least one type of the acquired emotion and taste as metadata for a photograph taken through the camera.

[0181] The present disclosure provides a method for an electronic device (100) to analyze and search a video based on sound. The method of operation of the electronic device (100) may include a step (S1310) of obtaining a plurality of segmented sound signal data by dividing a sound signal included in a video into pre-set time interval units. The method of operation of the electronic device (100) may include a step (S1320) of obtaining at least one of the emotions and preferences regarding the plurality of segmented sound signal data by analyzing the plurality of segmented sound signal data by inputting a plurality of feature vectors extracted from the plurality of segmented sound signal data into an artificial neural network model trained to classify at least one of emotions and preferences from sound signal learning data. The method of operation of the electronic device (100) may include a step (S1330) of storing at least one of the obtained emotions and preferences as metadata for a plurality of image frames included in a time interval corresponding to each of the plurality of segmented sound signal data among the time intervals of the video.

[0182] In one embodiment of the present disclosure, the step (S1320) of obtaining at least one emotion and preference regarding the plurality of segmented acoustic signal data may include: the step (S1510) of extracting a feature vector representing at least one of volume, timbre, tempo, and speech included in the acoustic signal from each of the plurality of segmented acoustic signal data; and the step (S1520) of inputting the extracted feature vector into an artificial neural network model and analyzing the feature vector by performing inference using the artificial neural network model to obtain at least one type of emotion or preference to which each of the plurality of segmented acoustic signal data is classified and at least one confidence value representing the probability that each of the plurality of segmented acoustic signal data is classified into at least one type. The step (S1330) of storing at least one of the above emotions and tastes as metadata for a plurality of image frames may include the step (S1540) of storing at least one type of emotion and taste having a confidence value greater than or equal to a preset threshold among the confidence values of each of at least one type as a tag for a plurality of image frames included in a time interval corresponding to each of the plurality of interval acoustic signal data.

[0183] In one embodiment of the present disclosure, the step of storing at least one of the emotion and preference as metadata for the plurality of image frames may further include: a step of calculating a preference for each of the plurality of segmented audio signal data based on a pre-set importance for at least one type of emotion and preference and a reliability value for each of the at least one type; and a step of storing the calculated preference as a tag for the plurality of image frames included in the time interval corresponding to each of the plurality of segmented audio signal data in the video.

[0184] In one embodiment of the present disclosure, the method of operation of the electronic device (100) may further include the step of receiving a user’s search query including a search term regarding a search target object (S1710); and the step of searching for at least one video among a plurality of previously stored videos that matches the search term included in the search query and a matching section in at least one video (S1720). The method of operation of the electronic device (100) may further include the step of determining the preference of the searched section by calculating the average value of the preference of each of the plurality of image frames included within the searched section in each of the at least one video (S1730); and the step of sorting at least one video in descending order based on the preference of the searched section and displaying it as a video search result (S1740).

[0185] The present disclosure provides a computer program product comprising a computer-readable storage medium. The storage medium may include instructions readable by the electronic device (100) for the electronic device (100) to perform the operations of: dividing an audio signal included in a video into predetermined time interval units to obtain a plurality of interval audio signal data; inputting a plurality of feature vectors extracted from the plurality of interval audio signal data into an artificial neural network model trained to classify at least one of an emotion and a preference from audio signal learning data to analyze the plurality of interval audio signal data, thereby obtaining at least one of an emotion and a preference regarding the plurality of interval audio signal data; and storing at least one of the obtained emotion and preference as metadata for a plurality of image frames included in a time interval corresponding to each of the plurality of interval audio signal data among the time intervals of the video.

[0186] A program executed by the electronic device (100) described in the present disclosure may be implemented as a hardware component, a software component, and / or a combination of a hardware component and a software component. The program may be executed by any system capable of executing computer-readable instructions.

[0187] Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired or command the processing unit independently or collectively.

[0188] Software can be implemented as a computer program containing instructions stored on a computer-readable storage medium. Examples of computer-readable recording media include magnetic storage media (e.g., ROM (read-only memory), RAM (random-access memory), floppy disks, hard disks, etc.) and optical reading media (e.g., CD-ROMs, DVDs (Digital Versatile Discs)). Computer-readable recording media can be distributed across networked computer systems, allowing computer-readable code to be stored and executed in a distributed manner. The medium is readable by a computer, stored in memory, and can be executed by a processor.

[0189] Computer-readable storage media may be provided in the form of non-transitory storage media. Here, 'non-transitory' means only that the storage medium does not contain a signal and is tangible, and does not distinguish between cases where data is stored semi-permanently or temporarily on the storage medium. For example, a 'non-transitory storage medium' may include a buffer in which data is stored temporarily.

[0190] In addition, the program according to the embodiments disclosed herein may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product.

[0191] A computer program product may include a software program and a computer-readable storage medium on which the software program is stored. For example, the computer program product may include a product in the form of a software program that is distributed electronically through the manufacturer of the electronic device (100) or an electronic market (e.g., Samsung Galaxy Store™). For electronic distribution, at least a portion of the software program may be stored on a storage medium or temporarily created. In this case, the storage medium may be a server of the manufacturer of the electronic device (100), a server of an electronic market, or a storage medium of a relay server that temporarily stores the software program.

[0192] A computer program product may include a storage medium of a server or a storage medium of an electronic device (100) in a system composed of an electronic device (100) and / or a server. Alternatively, if there is a third device that is communicatively connected to the electronic device (100), the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include a software program itself that is transmitted from the electronic device (100) to the third device or from the third device to the electronic device (100).

Claims

1. A method by which an electronic device (100) analyzes and searches for a photo based on sound, the method comprising:
a step (S210) of acquiring a sound signal through a microphone (120) during a preset time before and after a time point at which a photo is captured using a camera (110);
a step (S220) of inputting a feature vector extracted from the acquired sound signal into an artificial neural network model trained to classify at least one of emotion and taste from sound signal learning data, and analyzing the feature vector using the artificial neural network model to acquire at least one type of emotion and taste regarding the sound signal; and
a step (S230) of storing the acquired at least one type of emotion and taste as metadata for the photo captured using the camera (110).

2. The method of claim 1, further comprising:
a step (S410) of receiving a user's search query including a search term regarding at least one of emotions and tastes;
a step (S420) of searching for, among a plurality of previously stored photos, at least one photo storing metadata that matches the search term, by comparing the metadata stored for each of the plurality of photos with metadata corresponding to the search term included in the search query; and
a step (S430) of displaying the at least one searched photo as a search result for the search query.
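
For illustration only, the tag search of steps S410 to S430 might be sketched as follows. This is a minimal sketch under assumptions, not the claimed implementation: the `Photo` record, the `search_by_tag` helper, the file names, and the case-insensitive whole-word matching are all hypothetical, and the claims do not specify how a search term is mapped to metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Photo:
    """Hypothetical photo record; `tags` stands in for the stored metadata."""
    path: str
    tags: set[str] = field(default_factory=set)


def search_by_tag(photos: list[Photo], query: str) -> list[Photo]:
    """Return photos whose tags contain any term of the query (cf. S420).

    Assumption: a search term matches a tag when the two are equal,
    ignoring case.
    """
    terms = {term.lower() for term in query.split()}
    return [p for p in photos if terms & {t.lower() for t in p.tags}]


photos = [
    Photo("img_001.jpg", {"joy", "lively"}),
    Photo("img_002.jpg", {"calm"}),
    Photo("img_003.jpg", {"joy"}),
]
results = search_by_tag(photos, "joy")
print([p.path for p in results])  # ['img_001.jpg', 'img_003.jpg']
```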

3. The method of claim 1, wherein
the step (S220) of acquiring the at least one type of emotion and taste regarding the sound signal comprises:
a step (S610) of extracting, from the acquired sound signal, features representing at least one of volume, tone, tempo, and utterance included in the sound signal, and vector-embedding the extracted features to convert them into the feature vector; and
a step (S620) of inputting the feature vector into the artificial neural network model and performing inference using the artificial neural network model to analyze the input feature vector, thereby acquiring at least one type, among the emotions or tastes, into which the sound signal is classified and at least one confidence value representing the probability that the sound signal is classified into the at least one type, and
the step (S230) of storing the at least one type of emotion and taste as metadata for the photo comprises
a step (S630) of storing, as a tag in the photo, at least one type of emotion and taste whose confidence value, among the confidence values of each of the at least one type, is greater than or equal to a preset threshold.
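
The thresholding of step S630 can be illustrated with a short sketch. The function name `select_tags`, the label names, the scores, and the threshold of 0.5 are assumptions chosen for the example; the claim only requires that a type be kept when its value is greater than or equal to a preset threshold.

```python
def select_tags(confidences: dict[str, float], threshold: float = 0.5) -> dict[str, float]:
    """Keep only the emotion/taste types whose confidence reaches the threshold.

    `confidences` maps each type returned by the classifier to its
    confidence value; the comparison is inclusive (>=), as in S630.
    """
    return {label: c for label, c in confidences.items() if c >= threshold}


# Hypothetical classifier output for one sound signal.
scores = {"joy": 0.82, "surprise": 0.55, "sadness": 0.07}
print(select_tags(scores, threshold=0.5))  # {'joy': 0.82, 'surprise': 0.55}
```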

4. The method of claim 3, wherein the step (S230) of storing the at least one type of emotion and taste as metadata for the photo further comprises:
a step (S910) of calculating a preference for the photo based on a preset importance for the at least one type of emotion or taste and the confidence value for each of the at least one type; and
a step (S930) of storing the calculated preference as a tag in the photo.

5. The method of claim 4, wherein the step of storing the preference as a tag in the photo comprises storing, as a tag in the photo, the maximum value among preference values regarding the at least one type of emotion and taste.
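
As a worked illustration of the preference calculation (S910) and of storing the maximum preference value, the sketch below assumes that the preference for each type is the product of its preset importance and its confidence value. The claims state only that the preference is "based on" these two quantities, so the product rule, the function name `photo_preference`, and the numbers are assumptions.

```python
def photo_preference(confidences: dict[str, float], importance: dict[str, float]) -> float:
    """Return the largest per-type preference value for one photo.

    Assumption: per-type preference = importance * confidence.
    Types with no preset importance contribute 0.0.
    """
    values = [importance.get(label, 0.0) * c for label, c in confidences.items()]
    return max(values, default=0.0)


confidences = {"joy": 0.8, "calm": 0.6}   # hypothetical classifier output
importance = {"joy": 0.5, "calm": 1.0}    # hypothetical preset importance
# joy: 0.5 * 0.8 = 0.4, calm: 1.0 * 0.6 = 0.6, so the stored value is 0.6
print(photo_preference(confidences, importance))  # 0.6
```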

6. The method of claim 4, further comprising:
a step (S1110) of receiving a user's search query including a search term regarding a search target object;
a step (S1120) of searching for, among a plurality of previously stored photos, at least one photo that matches the search term included in the search query; and
a step (S1130) of sorting the at least one searched photo in descending order based on the preference and displaying it as a search result.
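
The ordering of steps S1120 and S1130 might look like the following sketch. How a photo is matched to a search target object is not specified in the claims; here it is assumed that each photo already carries a set of object labels. The `RankedPhoto` record and the `search_and_rank` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RankedPhoto:
    """Hypothetical record: object labels plus the stored preference tag."""
    path: str
    objects: frozenset
    preference: float


def search_and_rank(photos, term):
    """Find photos containing the object (S1120), highest preference first (S1130)."""
    hits = [p for p in photos if term.lower() in p.objects]
    return sorted(hits, key=lambda p: p.preference, reverse=True)


library = [
    RankedPhoto("a.jpg", frozenset({"dog", "beach"}), 0.40),
    RankedPhoto("b.jpg", frozenset({"cat"}), 0.90),
    RankedPhoto("c.jpg", frozenset({"dog"}), 0.75),
]
print([p.path for p in search_and_rank(library, "dog")])  # ['c.jpg', 'a.jpg']
```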

7. The method of any one of claims 1 to 6, wherein the artificial neural network model comprises:
an emotion classifier model trained by supervised learning using a feature vector extracted from first sound signal learning data as input and a label value regarding the type of emotion as ground truth; and
a taste classifier model trained by supervised learning using a feature vector extracted from second sound signal learning data as input and a label value regarding the type of taste as ground truth.

8. An electronic device (100) for analyzing and searching for a photo based on sound, the electronic device (100) comprising:
a camera (110);
a microphone (120);
at least one processor (130) including processing circuitry; and
a memory (140) storing one or more instructions,
wherein the one or more instructions, when executed individually or collectively by the at least one processor (130), cause the electronic device (100) to:
acquire a sound signal through the microphone (120) during a preset time before and after a time point at which a photo is captured using the camera (110);
input a feature vector extracted from the sound signal into an artificial neural network model trained to classify at least one of emotion and taste from sound signal learning data, and analyze the feature vector using the artificial neural network model to acquire at least one type of emotion and taste regarding the sound signal; and
store the acquired at least one type of emotion and taste as metadata for the photo captured using the camera (110).
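
One possible way to obtain sound from both before and after the capture time point is a rolling buffer that is already being filled when the photo is taken. The sketch below is an assumption, not something the claims specify: the `CaptureWindow` class is hypothetical, and the window lengths are counted in samples rather than seconds to keep the example small.

```python
from collections import deque


class CaptureWindow:
    """Keeps the last `pre` samples and, after a capture, collects `post` more."""

    def __init__(self, pre: int, post: int):
        if pre < 1 or post < 1:
            raise ValueError("pre and post must be at least 1 sample")
        self._ring = deque(maxlen=pre)  # rolling pre-capture buffer
        self._post = post
        self._clip = None               # clip being assembled, if any
        self._remaining = 0

    def on_capture(self):
        """Call when the photo is taken: freeze the pre-capture samples."""
        self._clip = list(self._ring)
        self._remaining = self._post

    def push(self, sample):
        """Feed one sample; return the full clip once the post window is filled."""
        self._ring.append(sample)
        if self._clip is None:
            return None
        self._clip.append(sample)
        self._remaining -= 1
        if self._remaining > 0:
            return None
        clip, self._clip = self._clip, None
        return clip


window = CaptureWindow(pre=3, post=2)
clip = None
for i in range(10):
    if i == 5:
        window.on_capture()  # photo taken between samples 4 and 5
    result = window.push(i)
    if result is not None:
        clip = result
print(clip)  # [2, 3, 4, 5, 6]
```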

9. The electronic device (100) of claim 8, further comprising a display (150),
wherein the one or more instructions, when executed by the at least one processor (130), cause the electronic device (100) to:
receive a user's search query including a search term regarding at least one of emotions and tastes;
search for, among a plurality of previously stored photos, at least one photo storing metadata that matches the search term, by comparing the metadata stored for each of the plurality of photos with metadata corresponding to the search term included in the search query; and
display the at least one searched photo on the display (150) as a search result for the search query.

10. The electronic device (100) of claim 8, wherein the one or more instructions, when executed by the at least one processor (130), cause the electronic device (100) to:
extract, from the sound signal, features representing at least one of volume, tone, tempo, and utterance included in the sound signal, and vector-embed the extracted features to convert them into the feature vector;
input the feature vector into the artificial neural network model and perform inference using the artificial neural network model to analyze the input feature vector, thereby acquiring at least one type of emotion or taste into which the sound signal is classified and at least one confidence value representing the probability that the sound signal is classified into the at least one type; and
store, as a tag in the photo, at least one type of emotion and taste whose confidence value, among the confidence values of each of the at least one type, is greater than or equal to a preset threshold.

11. The electronic device (100) of claim 10, wherein the one or more instructions, when executed by the at least one processor (130), cause the electronic device (100) to:
calculate a preference for the photo based on a preset importance for the at least one type of emotion or taste and the confidence value for each of the at least one type; and
store the calculated preference as a tag in the photo.

12. The electronic device (100) of claim 11, wherein the one or more instructions, when executed by the at least one processor (130), cause the electronic device (100) to store, as a tag in the photo, the maximum value among preference values regarding the at least one type of emotion and taste.

13. The electronic device (100) of claim 11, wherein the importance is preset based on a user input or on history information regarding the user's shooting and searching.

14. The electronic device (100) of claim 11, further comprising a display (150),
wherein the one or more instructions, when executed by the at least one processor (130), cause the electronic device (100) to:
receive a user's search query including a search term regarding a search target object;
search for, among a plurality of previously stored photos, at least one photo that matches the search term included in the search query; and
sort the at least one searched photo in descending order based on the preference and display it on the display (150) as a search result.

15. The electronic device (100) of any one of claims 8 to 14, wherein the artificial neural network model comprises:
an emotion classifier model trained by supervised learning using a feature vector extracted from first sound signal learning data as input and a label value regarding the type of emotion as ground truth; and
a taste classifier model trained by supervised learning using a feature vector extracted from second sound signal learning data as input and a label value regarding the type of taste as ground truth.