Electronic device and control method therefor

The electronic device uses adaptive sampling and probability calculations to synchronize subtitles with the correct speaker, addressing the issue of incorrect speaker-subtitle matching during video playback.

WO2026134984A1 · PCT designated stage · Publication Date: 2026-06-25 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
Filing Date: 2025-12-10
Publication Date: 2026-06-25

Application Information

Patent Timeline

10 Dec 2025: Application
25 Jun 2026: Publication (WO2026134984A1)

IPC: G10L17/08; G10L15/26; G10L25/87; G06F40/279; G10L15/22; G10L15/06

AI Tagging

Application Domain

Natural language data processing · Speech recognition

Technology Topics

Speech sound · Audio frequency

AI Technical Summary

Technical Problem

Conventional systems fail to accurately match speakers with subtitles when there are changes in speakers during video playback, leading to incorrect subtitle synchronization.

Method used

An electronic device that identifies speakers by analyzing audio signals using a sampling period and weight values, adjusting the sampling period based on speech segment detection points, and calculating probability values to match speakers with subtitles.

Benefits of technology

Effectively synchronizes subtitles with the correct speaker even when multiple speakers are present, reducing recognition errors and ensuring accurate speaker identification.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The instructions, when executed collectively or individually by at least one processor, cause the electronic device to: obtain an audio signal sampled from an image according to a sampling period corresponding to a speech section included in the audio signal; and identify a speaker who is generating speech in the image, on the basis of the obtained audio signal and a weight value corresponding to the length of text included in the image.

Description

Electronic device and method of controlling the same

[0001] The present disclosure relates to an electronic device and a method for controlling the same, and more specifically, to an electronic device for matching subtitles with a speaker and a method for controlling the same.

[0002] Recently, voice signals and subtitles corresponding to the voice signals are provided to the user simultaneously.

[0003] However, conventionally, when the speaker generating the voice signal changed while a subtitle was being displayed on the screen, the subtitle was sometimes matched with a speaker other than the one who generated the voice signal corresponding to that subtitle.

[0004] Accordingly, the purpose of the present disclosure is to provide an invention that improves upon the prior art.

[0005] An electronic device according to one embodiment of the present disclosure comprises: a display; a communication interface; a memory for storing instructions; and at least one processor. When the instructions are executed collectively or individually by the at least one processor, the electronic device acquires an audio signal sampled from an image according to a sampling period corresponding to a voice segment included in the audio signal, and identifies a speaker generating speech in the image based on a weight value corresponding to the length of the acquired audio signal and the text included in the image. In this way, subtitles can match the speaker even when the speaker changes or when two or more speakers speak simultaneously.

[0006] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may acquire a sampling period less than or equal to a preset value if the detection point for identifying the sampling period corresponds to the start point of a voice segment included in the audio signal, and acquire a sampling period greater than or equal to a preset value if the detection point for identifying the sampling period corresponds to a point after a threshold time has elapsed since the start point of a voice segment included in the audio signal.

[0007] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may be controlled to reset the sampling period if the detection point for identifying the sampling period corresponds to the end point of the voice segment included in the audio signal.

[0008] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may obtain the sampling period using the number of sampled audio signals obtained based on the total number of sampled audio signals, the length of the audio signals, and the length of the audio signals corresponding to the subtitle output.

[0009] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may perform a function for recognizing text included within a text area obtained from an image output to the display and obtain a length corresponding to the text obtained through the function.

[0010] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may obtain a length during which the text of the image output to the display and the audio signal are simultaneously output, using the length of the audio signal corresponding to the identified text obtained based on the length corresponding to the identified text, and obtain the weight value based on that simultaneous-output length and the sampling period.

[0011] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may obtain a probability value for identifying a speaker generating speech in the video based on the acquired weight value and the total number of the sampled audio signals.

[0012] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may identify that the first speaker is generating the voice if the acquired probability value corresponds to a first probability value corresponding to a first speaker generating the voice included in the audio signal, and identify that the second speaker is generating the voice if the acquired probability value corresponds to a second probability value corresponding to a second speaker generating the voice included in the audio signal.

[0013] In one embodiment, when the instructions are executed collectively or individually by the at least one processor, the electronic device may control the display to display text corresponding to the audio signal of the identified first speaker and information about the first speaker together when the first speaker is identified as generating voice, and to display text corresponding to the audio signal of the identified second speaker and information about the second speaker together when the second speaker is identified as generating voice.

[0014] In one embodiment, if the voice signal included in the acquired audio signal is identified as containing voice signals of multiple speakers, the weight value may be obtained based on the identification result.

[0015] A control method for an electronic device according to one embodiment of the present disclosure may include: a step of acquiring an audio signal sampled from an image according to a sampling period corresponding to a voice segment included in the audio signal; and a step of identifying a speaker generating voice in the image based on a weight value corresponding to the length of the acquired audio signal and the text included in the image.

[0016] In one embodiment, the step of setting the sampling period comprises: a step of obtaining the sampling period at or below a preset value when the detection point for identifying the sampling period corresponds to the start point of a voice segment included in the audio signal; and a step of obtaining the sampling period at or above a preset value when the detection point for identifying the sampling period corresponds to a point after a threshold time has elapsed since the start point of the voice segment included in the audio signal.

[0017] In one embodiment, the step of setting the sampling period may include the step of resetting the sampling period when the detection point for identifying the sampling period corresponds to the end point of a voice segment included in the audio signal.

[0018] In one embodiment, the step of setting the sampling period may include the step of obtaining the sampling period using the number of sampled audio signals obtained based on the total number of sampled audio signals, the length of the audio signal, and the length of the audio signal corresponding to the subtitle output.

[0019] In one embodiment, the step of obtaining information about text of an image displayed on the display may include: a step of performing a function for recognizing text included within a text area obtained from an image output to the display; and a step of obtaining a length corresponding to the text obtained through the function.

[0020] In one embodiment, the method may include: a step of obtaining a length during which the text of an image output to the display and the audio signal are simultaneously output using the length of an audio signal corresponding to the identified text obtained based on the length corresponding to the identified text; and a step of obtaining a weight value based on the length during which the text of an image output to the display and the audio signal are simultaneously output and the sampling period.

[0021] In one embodiment, the method may include the step of obtaining a probability value for identifying a speaker generating speech in the video based on the obtained weight value and the total number of sampled audio signals.

[0022] In one embodiment, the method may include the step of identifying that the first speaker is generating the voice if the acquired probability value corresponds to a first probability value corresponding to a first speaker generating the voice included in the audio signal, and identifying that the second speaker is generating the voice if the acquired probability value corresponds to a second probability value corresponding to a second speaker generating the voice included in the audio signal.

[0023] In one embodiment, if the first speaker is identified as generating voice, text corresponding to the audio signal of the identified first speaker and information about the first speaker may be displayed together, and if the second speaker is identified as generating voice, text corresponding to the audio signal of the identified second speaker and information about the second speaker may be displayed together.

[0024] In a non-transitory computer-readable recording medium comprising a program for executing a method of controlling an electronic device, the method of controlling the electronic device may include: a step of acquiring an audio signal sampled from an image according to a sampling period corresponding to a voice segment included in the audio signal; and a step of identifying a speaker generating voice in the image based on a weight value corresponding to the length of the acquired audio signal and the text included in the image.

[0025] The contents of the present disclosure can be best understood by referring to the attached drawings.

[0026] FIG. 1 is a drawing for explaining an embodiment of an electronic device according to one or more embodiments of the present disclosure.

[0027] FIG. 2 is a block diagram for illustrating an electronic device according to one or more embodiments of the present disclosure.

[0028] FIGS. 3 to 7 and FIG. 9 are drawings for explaining a method for identifying a speaker corresponding to a subtitle based on a sampling period and weight values according to one or more embodiments of the present disclosure. By utilizing weighted sampling, recognition errors due to conversation length can be effectively reduced.

[0029] FIGS. 8 to 10 are flowcharts for explaining a method for controlling an electronic device according to one or more embodiments of the present disclosure.

[0030] The embodiments described herein are subject to various modifications and may have various forms; specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the scope of specific embodiments and should be understood to include various modifications, equivalents, and / or alternatives of the embodiments of the present disclosure. In relation to the description of the drawings, similar reference numerals may be used for similar components.

[0031] In describing the present disclosure, if it is determined that a detailed description of related known functions or configurations could unnecessarily obscure the essence of the present disclosure, such detailed description is omitted.

[0032] Additionally, the following embodiments may be modified in various other forms, and the scope of the present disclosure is not limited to the following embodiments. Rather, these embodiments are provided to make the present disclosure more faithful and complete and to fully convey the present disclosure to those skilled in the art.

[0033] The terms used in this disclosure are used merely to describe specific embodiments and are not intended to limit the scope of the rights. Singular expressions include plural expressions unless the context clearly indicates otherwise.

[0034] In the present disclosure, expressions such as “have,” “may have,” “include,” or “may include” indicate the presence of such features (e.g., numerical values, functions, actions, or components such as parts) and do not exclude the presence of additional features.

[0035] In the present disclosure, expressions such as “A or B,” “at least one of A or / and B,” or “one or more of A or / and B” may include all possible combinations of items listed together. For example, “A or B,” “at least one of A and B,” or “at least one of A or B” may refer to cases including (1) at least one A, (2) at least one B, or (3) both at least one A and at least one B.

[0036] Expressions such as "first" or "second" used in this disclosure may modify various components regardless of order and / or importance, and are used only to distinguish one component from another and do not limit said components.

[0037] Where it is stated that a certain component (e.g., a first component) is "(operatively or communicatively) coupled with / to" or "connected to" another component (e.g., a second component), it should be understood that the said certain component may be directly connected to the said other component or connected through another component (e.g., a third component).

[0038] On the other hand, when it is stated that a certain component (e.g., a first component) is "directly connected" or "directly coupled" to another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between said certain component and said other component.

[0039] As used in this disclosure, the expression “configured to” may be replaced, depending on the context, with, for example, “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” or “capable of.” The term “configured to” may not necessarily mean only “specifically designed to” in hardware.

[0040] Instead, in some situations, the expression “device configured to do something” may mean that the device is “capable of doing something” together with other devices or components. For example, the phrase “processor configured (or set) to perform A, B, and C” may mean a dedicated processor for performing those operations (e.g., an embedded processor), or a general-purpose processor (e.g., a CPU or application processor) capable of performing those operations by executing one or more software programs stored in a memory device.

[0041] In the embodiments, a 'module' or 'part' performs at least one function or operation and may be implemented in hardware or software, or a combination of hardware and software. Additionally, a plurality of 'modules' or a plurality of 'parts' may be integrated into at least one module and implemented by at least one processor, except for the 'module' or 'part' that needs to be implemented in specific hardware.

[0042] Meanwhile, the various elements and areas in the drawings are depicted schematically. Accordingly, the technical concept of the present invention is not limited by the relative sizes or spacing depicted in the attached drawings.

[0043] Hereinafter, various embodiments of the present invention will be described in detail using the attached drawings.

[0044] The features described with reference to one embodiment can be combined with the features of other embodiments without introducing new details.

[0045] FIG. 1 is a drawing for illustrating an embodiment of an electronic device according to one or more embodiments of the present disclosure. Although the electronic device in FIG. 1 is depicted as a TV, this is merely one embodiment, and it is obvious that it can be implemented as various types of electronic devices such as a tablet, laptop computer, smartphone, camera, PC, etc.

[0046] The electronic device (100) can output an audio signal and a subtitle corresponding to the audio signal as shown in FIG. 1.

[0047] For example, when a male actor speaks the voice saying "What is for dinner tonight?", the display (130) can display the subtitle "What is for dinner tonight?". At this time, while the subtitle "What is for dinner tonight?" is displayed on the display (130), a female actor can speak the voice saying "Dinner is Jajangmyeon" in response to the male actor's question.

[0048] Conventionally, as shown in FIG. 1, when the speaker changes from a male actor to a female actor while the subtitle "What is for dinner tonight?" is displayed, there were cases where the subtitle "What is for dinner tonight?" was incorrectly attributed to the female actor, as if it corresponded to a voice signal she generated.

[0049] To solve these problems, the electronic device (100) can identify whether a voice signal generated by a speaker has started when sampling an audio signal included in video data. If the electronic device (100) identifies that a voice signal has started, it can set a short sampling period. The sampling period may include the interval at which sampling takes place; a shorter sampling period corresponds to a higher sampling frequency (sampling rate). Sampling may include the process of converting an audio signal in analog form into digital data. The voice signal may include a portion of an audio signal containing voice or a sound similar to voice. The voice signal may include a portion of an audio signal that can be converted into subtitles.

[0050] In addition, if the segment where voice is output and the segment where subtitles are displayed overlap in the sampled audio signal, the weight value can be set high, and if the segment where voice is output and the segment where subtitles are displayed do not overlap, the weight value can be set low. In addition, the weight value can be set low even if voice is output by two or more speakers.

[0051] In this case, the weight value may be a value indicating whether the subtitle and the speaker who uttered the subtitle match. The weight value may indicate the degree of agreement between the subtitle and the speaker speaking at the time the subtitle is displayed. Specifically, if the weight value is large, it may indicate a case where the subtitle and the speaker who uttered the subtitle match significantly. For example, a high weight value may indicate that there is a significant overlap between the time the subtitle is displayed and the time the speaker uttering the subtitle speaks. On the other hand, if the weight value is small, it may indicate a case where the speaker generating the audio signal changes or multiple speakers simultaneously generate audio signals, resulting in a mismatch between the subtitle and the speaker who uttered the subtitle.
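
As a rough illustration of the high/low weighting described in the two paragraphs above, the following Python sketch assigns a weight to one sampled audio signal. The names `sample_weight`, `HIGH_WEIGHT`, and `LOW_WEIGHT`, the numeric values, and the representation of segments as (start, end) pairs in seconds are assumptions made for illustration; the disclosure does not specify them.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec), an assumed representation

HIGH_WEIGHT = 1.0  # assumed value standing in for a "high" weight
LOW_WEIGHT = 0.1   # assumed value standing in for a "low" weight


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two intervals share a span of positive length."""
    return min(a[1], b[1]) > max(a[0], b[0])


def sample_weight(voice: Interval, subtitle: Interval, active_speakers: int) -> float:
    """Weight for one sample: high only when a single speaker talks
    while the subtitle is on screen, low otherwise."""
    if active_speakers >= 2:
        # Two or more speakers talking at once: the match is unreliable.
        return LOW_WEIGHT
    if overlaps(voice, subtitle):
        return HIGH_WEIGHT
    return LOW_WEIGHT
```

Under these assumptions, a voice segment of (0.0, 2.0) against a subtitle shown during (0.5, 3.0) receives the high weight, while the same segment with two active speakers receives the low weight.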

[0052] The electronic device (100) can identify a speaker who generates a voice signal corresponding to a subtitle using a weight value and a sampled audio signal. Specifically, the electronic device (100) can calculate a probability value for distinguishing a speaker by multiplying the sampled audio signal by a weight value. This will be described in detail below.

[0053] The electronic device (100) can identify the speaker matching the subtitle using the acquired probability value. As illustrated in FIG. 1, the electronic device (100) can identify the subtitle "What is for dinner tonight?" as the subtitle of the male actor even if the speaker changes from a male actor to a female actor. In this way, the electronic device (100) can display the speaker speaking the subtitle even if the speaker is not currently speaking. For example, the display (130) can display the subtitle for the voice signal generated by the female actor and the name of the female actor to indicate to the viewer who is speaking.
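
One possible reading of the weighted calculation in the two paragraphs above is a weighted vote over the sampled audio signals. The sketch below assumes that each sample already carries a speaker label from some separate recognizer (not described here) and a weight, and it divides each speaker's summed weight by the total number of samples. The function name `speaker_scores`, the labels, and the weight values are hypothetical, and the exact formula used in the disclosure may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def speaker_scores(samples: List[Tuple[str, float]]) -> Dict[str, float]:
    """samples: (speaker label assumed for the sample, weight of the sample).
    Returns the summed weight per speaker divided by the number of samples."""
    if not samples:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for speaker, weight in samples:
        totals[speaker] += weight
    n = len(samples)
    return {speaker: total / n for speaker, total in totals.items()}


# Dinner example: three high-weight samples taken while the male actor speaks
# with the subtitle on screen, then two low-weight samples taken after the
# female actor starts to answer while the same subtitle is still displayed.
samples = [("male actor", 1.0)] * 3 + [("female actor", 0.1)] * 2
scores = speaker_scores(samples)
subtitle_speaker = max(scores, key=scores.get)
```

With these assumed weights the male actor's score is clearly larger than the female actor's, so the subtitle stays attributed to the male actor even though the last samples contain the female actor's voice.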

[0054] Meanwhile, for the sake of convenience of explanation, the above description was made on the premise that the electronic device (100) includes a display (130), but this is merely one embodiment, and it is obvious that the electronic device (100) can output video data or audio signals using HDMI or DisplayPort (DP) without including a display (130).

[0055] If the electronic device (100) does not include a display (130), the electronic device (100) may transmit data to an external device that includes a display and output video data or audio signals from the external device.

[0056] FIG. 2 is a block diagram for explaining the configuration of an electronic device (100). As shown in FIG. 2, the electronic device (100) may include a communication interface (110), a memory (120), a display (130), a speaker (140), and a processor (150). The configuration shown in FIG. 2 is merely an example of various embodiments, and some components may be omitted or new components may be added depending on the configuration of the electronic device (100).

[0057] First, the communication interface (110) is configured to communicate with various types of external devices according to various types of communication methods. In particular, the communication interface (110) can acquire video data including subtitles and audio signals from an external device. Additionally, the communication interface (110) can acquire sampled audio signals from an external device or acquire data regarding a method for setting the sampling period. The communication interface (110) may also acquire weight values or data regarding a method for acquiring weight values, and may acquire data regarding a method for identifying a speaker.

[0058] A wireless communication module may be a module that communicates wirelessly with an external device. For example, the wireless communication module may include at least one module among a Wi-Fi module, a Bluetooth module, an infrared communication module, an Ultra Wide-Band (UWB) module, or other communication modules.

[0059] A wired communication module may be a module that communicates with an external device via a wire. For example, a wired communication module may include at least one of a Local Area Network (LAN) module, an Ethernet module, a pair cable, a coaxial cable, or a fiber optic cable.

[0060] The memory (120) can store an operating system (OS) for controlling the overall operation of the components of the electronic device (100) and instructions or data related to the components of the electronic device (100). In particular, the memory (120) can store video data including data for audio signals and subtitles. Additionally, the memory (120) may store data regarding a method for acquiring a sampled audio signal or a sampling period, data regarding weight values, data regarding a method for acquiring a sampled audio signal to which weight values are applied, or speaker characteristic information.

[0061] The memory (120) can be implemented in various forms such as volatile memory (e.g., DRAM (dynamic RAM), SRAM (static RAM), or SDRAM (synchronous dynamic RAM)), non-volatile memory (e.g., OTPROM (one time programmable ROM), PROM (programmable ROM), EPROM (erasable and programmable ROM), EEPROM (electrically erasable and programmable ROM), mask ROM, flash ROM, or flash memory (e.g., NAND flash or NOR flash)), a hard drive, or a solid state drive (SSD).

[0062] The display (130) can display various information. In particular, the display (130) can display subtitles and characteristic information of the speaker speaking the audio signal corresponding to the subtitles. For example, the display (130) can display subtitles for the voice signal generated by the female actor and the name of the female actor.

[0063] The display (130) can be implemented as various types of displays such as an LCD (Liquid Crystal Display), an OLED (Organic Light Emitting Diode) display, and a PDP (Plasma Display Panel). The display (130) may also include a driving circuit, a backlight unit, etc., which can be implemented in forms such as an a-si TFT (amorphous silicon thin film transistor), an LTPS (low temperature poly silicon) TFT, and an OTFT (organic TFT). The display (130) can be implemented as a touch screen combined with a touch sensor, a flexible display, a 3D (three-dimensional) display, etc. According to various embodiments of the present disclosure, the display (130) may include not only a display panel that outputs an image, but also a bezel that houses the display panel.

[0064] The speaker (140) can output an audio signal. For example, the speaker (140) can output an audio signal obtained through the communication interface (110) or an audio signal stored in the memory (120). At this time, the output of the speaker (140) can be controlled by the processor (150). For example, the timing of the speaker (140) outputting the audio signal can be controlled by the processor (150).

[0065] Meanwhile, the speaker (140) may include various types of speaker modules. For example, the speaker (140) may include a woofer speaker, a mid-range speaker, and a tweeter.

[0066] Additionally, the speaker (140) can be placed at various locations on the electronic device (100). For example, if the electronic device (100) is a TV, the speaker (140) can be placed on the top of the electronic device (100). However, this is merely an example, and the speaker (140) can be placed on the bottom of the electronic device (100). Additionally, the speaker (140) can output audio data in various directions. For example, the speaker (140) can output audio data toward the top of the electronic device (100). However, this is merely an example, and the speaker (140) can output audio data toward the bottom, side, or front of the electronic device (100). When multi-channel audio data is received, the speaker (140) can output audio according to the channel information.

[0067] The processor (150) may include one or more processors. Specifically, the one or more processors may include one or more of a CPU (Central Processing Unit), GPU (Graphics Processing Unit), APU (Accelerated Processing Unit), MIC (Many Integrated Core), DSP (Digital Signal Processor), NPU (Neural Processing Unit), hardware accelerator, or machine learning accelerator. The processor (150) may control one or any combination of other components of an electronic device and may perform operations or data processing related to communication. The one or more processors (150) may execute one or more programs or instructions stored in memory. For example, the one or more processors (150) may perform a method according to one embodiment of the present disclosure by executing one or more instructions stored in memory.

[0068] One or more processors (150) may be implemented as a single-core processor including one core, or as one or more multicore processors including multiple cores (e.g., homogeneous multicore or heterogeneous multicore). When one or more processors (150) are implemented as multicore processors, each of the multiple cores included in the multicore processor may include internal processor memory such as cache memory or on-chip memory, and a common cache shared by multiple cores may be included in the multicore processor. Additionally, each of the multiple cores included in the multicore processor (or some of the multiple cores) may independently read and execute program instructions for implementing a method according to one embodiment of the present disclosure, or all (or some) of the multiple cores may be linked together to read and execute program instructions for implementing a method according to one embodiment of the present disclosure.

[0069] In particular, the processor (150) can acquire an audio signal sampled from the video according to a sampling period corresponding to a voice segment included in the audio signal. The processor (150) can identify a speaker generating voice in the video based on a weight value corresponding to the length of the acquired audio signal and the text included in the video. A voice segment may represent a segment where a voice signal is output. A voice segment may represent the length of the voice signal (e.g., in seconds).

[0070] Additionally, if the detection point for identifying the sampling period corresponds to the start point of the voice segment included in the audio signal, the processor (150) can obtain a sampling period less than or equal to a preset value. The preset value may be a threshold value. The preset value may be set by the user. If the detection point for identifying the sampling period corresponds to a point where a threshold time has elapsed since the start point of the voice segment included in the audio signal, the processor (150) can obtain a sampling period greater than or equal to a preset value. The threshold time may be set by the user. The threshold time may be set in advance. That is, sampling is performed at shorter intervals immediately after the start of the voice segment included in the audio signal, and subsequently, sampling is performed at longer intervals.

[0071] Additionally, the processor (150) can control the sampling period to be reset if the detection point for identifying the sampling period corresponds to the end point of the voice segment included in the audio signal. The processor (150) can obtain the sampling period using the number of sampled audio signals obtained based on the total number of sampled audio signals, the length of the audio signal, and the length of the audio signal corresponding to the subtitle output.
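
The start, threshold, and end behaviour described in the two paragraphs above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the constants `SHORT_PERIOD_MS`, `LONG_PERIOD_MS`, and `THRESHOLD_MS`, the use of integer milliseconds, and the reading of "reset" as "start again with the short period in the next voice segment" are all assumptions.

```python
from typing import List

SHORT_PERIOD_MS = 100   # assumed: a period at or below the preset value
LONG_PERIOD_MS = 500    # assumed: a period at or above the preset value
THRESHOLD_MS = 1000     # assumed threshold time after the segment start


def sampling_period(t_ms: int, segment_start_ms: int) -> int:
    """Short period right after a voice segment starts, long period once
    the threshold time has elapsed since the start point."""
    if t_ms - segment_start_ms < THRESHOLD_MS:
        return SHORT_PERIOD_MS
    return LONG_PERIOD_MS


def sample_times(segment_start_ms: int, segment_end_ms: int) -> List[int]:
    """Time stamps at which one voice segment is sampled. Each call starts
    from the short period again, which models the reset at the end point
    of the previous segment."""
    times: List[int] = []
    t = segment_start_ms
    while t < segment_end_ms:
        times.append(t)
        t += sampling_period(t, segment_start_ms)
    return times
```

For a three-second segment starting at 0 ms, this sketch samples every 100 ms during the first second and every 500 ms afterwards. A later segment, for example `sample_times(5000, 5250)`, begins with the short period again.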

[0072] The processor (150) can perform a function for recognizing text contained within a text area obtained from an image output to a display, and can obtain a length corresponding to the text obtained through the function. Based on the length corresponding to the identified text, the processor (150) can obtain the length of the audio signal corresponding to that text and, using it, the length of time during which the text of the image output to the display and the audio signal are output simultaneously. The processor (150) can then obtain a weight value based on this simultaneous-output length and the sampling period.

[0073] The processor (150) can obtain a probability value for identifying a speaker generating voice in a video based on the acquired weight value and the total number of sampled audio signals. The processor (150) can identify that the first speaker is generating voice if the acquired probability value corresponds to a first probability value corresponding to a first speaker generating voice included in the audio signal. The processor (150) can identify that the second speaker is generating voice if the acquired probability value corresponds to a second probability value corresponding to a second speaker generating voice included in the audio signal. When the processor (150) identifies that the first speaker is generating voice, it can control the display to display text corresponding to the audio signal of the identified first speaker and information about the first speaker together. When the processor (150) identifies that the second speaker is generating voice, it can control the display to display text corresponding to the audio signal of the identified second speaker and information about the second speaker together.

[0074] If the voice segments included in the acquired audio signal correspond to voice segments of multiple speakers, the processor (150) can obtain a weight value based on the multiple voice segments.

[0075] Hereinafter, each step described above will be described in detail with reference to FIGS. 3 to 7.

[0076] First, the processor (150) can acquire audio data including a voice segment. The processor (150) can recognize the start or end point of the voice segment and change the sampling period. At this time, sampling may refer to the process of converting an analog audio signal into digital data.

[0077] In one embodiment, as illustrated in FIG. 3, the processor (150) can obtain a sampling period less than or equal to a preset value when the detection point for identifying the sampling period corresponds to the start point of a voice segment included in the audio signal. The processor (150) can obtain a sampling period greater than or equal to a preset value when the detection point for identifying the sampling period corresponds to a point where a threshold time has elapsed since the start point of a voice segment included in the audio signal. Additionally, the processor (150) may control the sampling period to be reset when the detection point for identifying the sampling period corresponds to the end point of a voice segment included in the audio signal.

[0078] In another embodiment, the processor (150) can dynamically set the sampling period based on the length of the audio signal and the length of the subtitle, as shown in FIG. 4. Using an adaptive or dynamic sampling period has the advantage of improving the voice and dialogue recognition rate.

[0079] The processor (150) can adjust the sampling period when short lines of dialogue appear consecutively. Specifically, the processor (150) can obtain a probability distribution based on the length of the subtitles. The probability distribution value can be (length of the interval containing the audio signal) / (time when the subtitles are displayed).

[0080] At this time, the processor (150) can identify the length of the interval containing the audio signal based on the acquired probability distribution value. If the length of the interval containing the audio signal is smaller than a threshold value, the processor (150) can identify that the speed at which the speaker outputs the audio signal (or dialogue) will increase. Accordingly, the processor (150) can set the sampling period to be short.

[0081] As illustrated in FIG. 4, the processor (150) can recognize dialogue at the point (410) when the audio signal is identified.

[0082] Afterward, the processor (150) can obtain a probability distribution value of an area containing an audio signal and determine whether the obtained value is smaller than a threshold value.

[0083] When the processor (150) identifies that the acquired value is smaller than a threshold value, it can set the sampling period to be short. At this time, since the sampling period is set to be short, the processor (150) can identify subtitles in near real-time.
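The check described in paragraphs [0079] to [0083] can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the ratio threshold of 0.5, and the two period values are assumptions chosen only to make the example concrete.

```python
def subtitle_coverage_ratio(audio_interval_s: float, subtitle_display_s: float) -> float:
    """(length of the interval containing the audio signal) / (time the subtitle is displayed)."""
    if subtitle_display_s <= 0:
        raise ValueError("subtitle_display_s must be positive")
    return audio_interval_s / subtitle_display_s


def choose_sampling_period(ratio: float,
                           ratio_threshold: float = 0.5,   # hypothetical threshold
                           short_period_s: float = 0.05,   # hypothetical short period
                           long_period_s: float = 0.2) -> float:  # hypothetical long period
    """Pick a short period when the ratio falls below the threshold."""
    # A small ratio suggests short lines following one another quickly,
    # so the signal is sampled more often.
    return short_period_s if ratio < ratio_threshold else long_period_s


ratio = subtitle_coverage_ratio(audio_interval_s=1.0, subtitle_display_s=4.0)
period = choose_sampling_period(ratio)
```

With a 1-second audio interval inside a 4-second subtitle display, the ratio is 0.25, which is below the assumed threshold, so the short period is selected.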

[0084] When short lines of dialogue follow one another, if the sampling period is constant, the point at which the start of a line of dialogue is recognized may be delayed. In other words, if short lines of dialogue follow one another for a short period, the next line of dialogue may begin while the previous line of dialogue is still being recognized.

[0085] Accordingly, if the processor (150) identifies that short lines of dialogue follow one another continuously, it can recognize the lines of dialogue in real time by setting the sampling period to be short.

[0086] Additionally, the processor (150) may reset the sampling period when it identifies that the voice segment included in the audio signal has ended. At this time, the method for setting the sampling period will be described later with reference to FIG. 5.

[0087] First, the processor (150) can acquire an audio signal (S510). Meanwhile, for the sake of convenience of explanation, the audio signal is described below on the premise that it is an audio signal already stored in the video data, but this is only one embodiment, and the processor (150) can acquire an audio signal through a microphone (not shown).

[0088] Specifically, the electronic device (100) can acquire an audio signal using a microphone (not shown) included in the electronic device (100), and can also acquire an audio signal using a microphone included in an external remote control.

[0089] Meanwhile, the electronic device (100) may transmit the acquired audio signal to an external server. At this time, the external server may perform only the function of converting the acquired audio signal into text, or it may identify the speaker by sampling the acquired audio signal or using a weight value corresponding to the length of the text included in the video.

[0090] However, this is merely one embodiment, and it is obvious that the operation of transmitting the audio signal to an external server can be omitted, and the electronic device (100) can convert the audio signal into text, sample the audio signal, and identify the speaker using weight values.

[0091] In addition, it goes without saying that the audio signal can be a voice signal input by the user into the remote control.

[0092] At this time, the audio signal may include a voice segment, and the voice segment may include a voice spoken by at least one speaker.

[0093] The processor (150) can identify whether a voice segment begins in the audio signal (S520). That is, it is determined whether the audio signal contains dialogue or voice.

[0094] In one embodiment, the processor (150) can detect a voice segment by using energy per frame of the audio signal to identify a voice segment in the audio signal.
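A minimal sketch of energy-based voice segment detection is given below. It assumes non-overlapping frames, mean squared amplitude as the energy measure, and a fixed energy threshold; none of these details are specified in the disclosure, and the function names are hypothetical.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def voiced_frames(samples: Sequence[float], frame_len: int,
                  energy_threshold: float) -> List[bool]:
    """Mark a frame as voiced when its energy exceeds the (assumed) threshold."""
    return [e > energy_threshold for e in frame_energies(samples, frame_len)]


# Silence, a short burst, then silence again.
signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
flags = voiced_frames(signal, frame_len=4, energy_threshold=0.01)
```

The first `True` entry in `flags` would mark the start of a voice segment, and the next `False` entry after it would mark its end.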

[0095] In another embodiment, the processor (150) may detect a speech segment by computing a zero-crossing rate for each frame of the audio signal, extracting a feature vector from each frame, and determining the presence or absence of a speech signal from the extracted feature vector using a Support Vector Machine (SVM).
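The feature extraction step of this embodiment might look like the sketch below. The two-element feature vector (energy, zero-crossing rate) is an assumption, since the disclosure does not list the features, and the SVM classifier itself is omitted; a trained classifier would take these vectors as input.

```python
from typing import List, Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_features(frame: Sequence[float]) -> List[float]:
    """Hypothetical per-frame feature vector: [energy, zero-crossing rate]."""
    if not frame:
        raise ValueError("frame must not be empty")
    energy = sum(x * x for x in frame) / len(frame)
    return [energy, zero_crossing_rate(frame)]


features = frame_features([0.5, -0.5, 0.5, -0.5])
```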

[0096] However, this is merely one example, and the processor (150) can identify voice segments in an audio signal using a method of Voice Activity Detection (VAD) or a method of Mel-Frequency Cepstral Coefficients (MFCC).

[0097] If the start of a voice segment is not identified (S520-N), the processor (150) can set a longer sampling period (S530). If dialogue or voice is not identified in the audio signal, the sampling rate may be lowered. In this case, the processor (150) can save resources by reducing the frequency of audio signal data collection.

[0098] However, if the processor (150) identifies that a voice segment begins in the audio signal (S520-Y), it may set the sampling period to be short (S540). If dialogue or voice is identified in the audio signal, the sampling rate may be increased.

[0099] At this time, to obtain the sampling period, the processor (150) can use the following mathematical formula to obtain, from among the sections in which the voice segment and the subtitle are output, the section in which they are output simultaneously.

[0100] $$T_{sub} = N_{word} \times t_{word}$$

[0101] <Mathematical Formula 1>

[0102] Here, $T_{sub}$ is the total length (time) of the audio segment corresponding to the subtitle, $N_{word}$ is the number of words included in the subtitle, and $t_{word}$ is the average time required to utter a single word.

[0103] For example, if the subtitle is "What is for dinner tonight?" and the average utterance time per word is 0.5 seconds, the number of words included in the subtitle is 5. In this case, $T_{sub}$ is 0.5 × 5, that is, 2.5 seconds. The processor (150) can use $T_{sub}$ to identify how well the time during which the subtitle is displayed matches the time during which the audio segment is output.
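The word-count estimate can be illustrated with a short sketch. Splitting on whitespace to count words is an assumption made for this example (the disclosure does not say how words are counted), and the function name is hypothetical. The subtitle is written as "What is for dinner tonight?" so that it contains five whitespace-separated words.

```python
def estimated_subtitle_audio_length(subtitle: str, avg_word_time_s: float) -> float:
    """Estimate the audio length as (number of words) x (average time per word)."""
    word_count = len(subtitle.split())  # assumed word-counting rule
    return word_count * avg_word_time_s


length_s = estimated_subtitle_audio_length("What is for dinner tonight?", 0.5)
```

Five words at 0.5 seconds each give an estimated length of 2.5 seconds, matching the example in the text.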

[0104] Meanwhile, the processor (150) can obtain the number of audio samples to be used to calculate the sampling period. At this time, the processor (150) can use the following <Mathematical Formula 2>.

[0105] $$N_{s} = N_{total} \times \frac{T_{sub}}{T_{total}}$$

[0106] ($N_{s}$: the number of sampled audio signals used to obtain the sampling period, $N_{total}$: total number of sampled audio signals, $T_{total}$: total length of the audio signal, $T_{sub}$: length of the audio signal corresponding to the time the subtitle is displayed on the screen)

[0107] For example, as illustrated in FIG. 3, while a first speaker subtitle "What is for dinner tonight?" is displayed on the display (130), the first speaker and the second speaker may generate audio signals. Specifically, the first speaker may generate the first speaker's audio signal "What is for dinner tonight?", and then the second speaker may generate the second speaker's audio signal "Dinner is."

[0108] At this time, the processor (150) can identify that $T_{total}$ is 3.5, $T_{sub}$ is 2.5, and the total number of sampled audio signals $N_{total}$ is 35. Using <Mathematical Formula 2>, the processor (150) can then identify $N_{s}$ as 35 × 2.5 / 3.5 = 25.
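The numbers in this example (3.5, 2.5, 35, and 25) are consistent with a simple proportional relationship between sample count and duration. The sketch below assumes that relationship and uniformly spaced samples; the function name and the rounding step are assumptions.

```python
def samples_in_subtitle_window(total_samples: int,
                               total_audio_s: float,
                               subtitle_audio_s: float) -> int:
    """Number of samples falling within the subtitle's audio, assuming uniform sampling."""
    if total_audio_s <= 0:
        raise ValueError("total_audio_s must be positive")
    # Rounding guards against floating-point error in the division.
    return round(total_samples * subtitle_audio_s / total_audio_s)


count = samples_in_subtitle_window(total_samples=35, total_audio_s=3.5, subtitle_audio_s=2.5)
```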

[0109] The processor (150) can set the sampling period using the number of audio samples obtained by the method described above.

[0110] Additionally, when the processor (150) identifies that the speaker's speech has started in the audio signal, it can increase the frequency of collecting audio signal data by setting the sampling period short.

[0111] On the other hand, if the voice segment lasts for a threshold time (S550-Y), the processor (150) can set the sampling period to be long (S560). Specifically, if the voice segment lasts for a threshold time, the processor (150) can identify that the speaker uttering the voice signal has not changed. Therefore, the processor (150) can save resources by setting the sampling period to be long.

[0112] Meanwhile, if the processor (150) identifies that the voice segment has not lasted for a threshold time (S550-N), it can set the sampling period to be short (S540). That is, if the voice segment lasts for less than the threshold time, the processor (150) can increase the frequency of audio signal collection because the speaker speaking the audio signal may change.

[0113] According to the method described above, the processor (150) can change the sampling period according to the voice segment included in the audio signal and acquire the audio signal sampled according to the changed sampling period.
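The decision flow of steps S520 to S560 can be summarized in a small function. This is a sketch under stated assumptions: the threshold time and the two period values are placeholders, and the function name is hypothetical.

```python
def next_sampling_period(voice_started: bool,
                         voiced_duration_s: float,
                         threshold_time_s: float = 1.0,   # hypothetical threshold time
                         short_period_s: float = 0.05,    # hypothetical short period
                         long_period_s: float = 0.2) -> float:  # hypothetical long period
    """Choose the sampling period following the S520 to S560 branches."""
    if not voice_started:
        # S520-N -> S530: no voice segment, sample less often to save resources.
        return long_period_s
    if voiced_duration_s >= threshold_time_s:
        # S550-Y -> S560: the speaker is assumed not to have changed.
        return long_period_s
    # S520-Y or S550-N -> S540: the speaker may change, sample more often.
    return short_period_s


periods = [
    next_sampling_period(False, 0.0),
    next_sampling_period(True, 0.3),
    next_sampling_period(True, 1.5),
]
```

Calling the function once per detection point would yield a long period during silence, a short period just after speech starts, and a long period again once the speech has lasted beyond the threshold time.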

[0114] The processor (150) can obtain a weight value applied to the sampled audio signal as illustrated in FIG. 5. Specifically, when an audio signal generated by a speaker and a subtitle corresponding to the audio signal are output simultaneously, the weight value may be set to a large value because it may include information about the speaker corresponding to the subtitle.

[0115] The processor (150) can set a small weight value to apply to the sampling of the second speaker's audio signal when the audio signal generated by the second speaker is output while the first speaker's subtitle is displayed on the display (130). Accordingly, even if the second speaker's audio signal is generated while the first speaker's subtitle is displayed on the display (130), the processor (150) sets the weight value to apply to the second speaker's audio signal to a small value so that the first speaker's subtitle can be identified as a subtitle generated by the first speaker.

[0116] A specific process for identifying subtitles and the speakers corresponding to the subtitles using weight values will be described later with reference to FIGS. 6 and 7.

[0117] First, the processor (150) can identify a text area of an image displayed on the display (130) (S710). At this time, the text area of the image may be located on the lower or upper side of the display (130).

[0118] The processor (150) can recognize characters in the text area and identify the length of the text (S720).

[0119] Specifically, the processor (150) performs a function for recognizing text included within a text area obtained from an image output to a display (130), and can obtain a length corresponding to the text obtained through the above-described function.

[0120] The processor (150) can use an OCR (Optical Character Recognition) method to analyze pixel data of an image displayed on a display (130) to recognize a pattern of characters and convert the recognized characters into text.

[0121] In addition, the processor (150) can identify the length of the text by using OCR (Optical Character Recognition) to identify spaces between characters.

[0122] The processor (150) can obtain a weight value to apply to the sampled audio signal according to the length of the recognized text (S730). Specifically, based on the length corresponding to the identified text, the processor (150) can obtain the length of the audio signal corresponding to that text and, using it, the length of time during which the text of the image output to the display (130) and the audio signal are output simultaneously. Additionally, the processor (150) can obtain a weight value based on this simultaneous-output length and the sampling period.

[0123] The method for obtaining weight values will be described later with reference to FIGS. 8 and 9.

[0124] The processor (150) can obtain the length of the audio signal based on the length identified according to the method described above (S810).

[0125] For example, if the recognized subtitle is "Tonight's dinner is Jajangmyeon.", the length of the audio signal "Tonight's dinner is Jajangmyeon." corresponding to the recognized subtitle can be obtained.

[0126] At this time, the processor (150) can calculate the length of the audio signal output while subtitles are displayed using the length of the audio signal (S820).

[0127] For example, if a recognized subtitle is displayed for 5 seconds and an audio signal corresponding to the recognized subtitle is output for 7 seconds, the processor (150) can obtain the signal corresponding to the first 5 seconds of the audio signal.

[0128] However, if the audio signal is output for 3 seconds and the recognized subtitle is displayed for 5 seconds, the processor (150) can obtain the length of the audio signal output for 3 seconds.
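Both examples reduce to taking the shorter of the two durations, provided the subtitle and its audio start at the same moment. That common start time is an assumption of this sketch, and the function name is hypothetical.

```python
def simultaneous_output_length(subtitle_display_s: float, audio_length_s: float) -> float:
    """Length of the audio output while the subtitle is displayed (common start assumed)."""
    return min(subtitle_display_s, audio_length_s)


first_case = simultaneous_output_length(5, 7)   # subtitle ends before the audio does
second_case = simultaneous_output_length(5, 3)  # audio ends before the subtitle does
```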

[0129] Meanwhile, the processor (150) can obtain a weight value based on the audio signal length and sampling period (S830).

[0130] For example, even though the first speaker's voice utterance has ended and the second speaker's voice utterance has begun, a subtitle corresponding to the first speaker's voice may be displayed on the display (130). At this time, the processor (150) may set a high weight value when the first speaker's voice utterance and the subtitle corresponding to the first speaker's voice are output simultaneously. However, the processor (150) may set a low weight value when the first speaker's subtitle is output on the display (130) even though the second speaker's voice utterance has begun.
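One way to picture this weighting is to assign a high weight to samples taken while the subtitle's own audio is playing and a low weight to samples taken afterward. The sketch below is illustrative only: the weight values 1.0 and 0.2 are hypothetical and are not derived from the weight formula of the disclosure.

```python
from typing import List


def sample_weights(sample_times_s: List[float],
                   overlap_end_s: float,
                   high_weight: float = 1.0,   # hypothetical value
                   low_weight: float = 0.2) -> List[float]:  # hypothetical value
    """High weight while subtitle and matching voice overlap, low weight afterward."""
    return [high_weight if t < overlap_end_s else low_weight for t in sample_times_s]


# The first speaker talks for 2.5 s; the subtitle stays on screen after that.
weights = sample_weights([0.0, 1.0, 2.4, 2.5, 3.0], overlap_end_s=2.5)
```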

[0131] Meanwhile, as illustrated in FIG. 9, the processor (150) can set a low weight value when the first speaker and the second speaker speak simultaneously.

[0132] Specifically, when the voice of the first speaker and the voice of the second speaker are output simultaneously, a probability (speaker match probability) regarding whether the first speaker and the second speaker are the same person can be obtained. The method for calculating the speaker match probability will be described in detail below in step S740 of FIG. 7.

[0133] The processor (150) can identify that when the voices (or speech) of the first speaker and the second speaker are output simultaneously, the 'speaker match probability' between the first speaker and the second speaker is less than a threshold value. At this time, the processor (150) can reduce the weight value after identifying that the 'match probability' of the first and second speakers is less than a threshold value.
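The weight reduction for overlapping speech could be sketched as below. The match threshold and the reduction factor are assumptions, as is the function name; the disclosure states only that the weight is reduced when the speaker match probability is below a threshold.

```python
def adjust_weight_for_overlap(weight: float,
                              speaker_match_probability: float,
                              match_threshold: float = 0.5,    # hypothetical threshold
                              reduction_factor: float = 0.5) -> float:  # hypothetical factor
    """Reduce the weight when overlapping voices are unlikely to be the same speaker."""
    if speaker_match_probability < match_threshold:
        return weight * reduction_factor
    return weight


reduced = adjust_weight_for_overlap(1.0, speaker_match_probability=0.2)
unchanged = adjust_weight_for_overlap(1.0, speaker_match_probability=0.9)
```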

[0134] Returning to FIG. 7, the processor (150) can obtain a probability value based on the sampling period and weight value obtained according to the method described above (S740).

[0135] Additionally, the processor (150) can obtain information about the speaker corresponding to the subtitle using the obtained probability value (S750).

[0136] Specifically, when the voice utterance of the first speaker and the subtitle corresponding to the first speaker's voice are output simultaneously, the person speaking the voice can be identified as the first speaker. Additionally, when the second speaker speaks and the subtitle corresponding to the first speaker's voice is output simultaneously, the weight value of the second speaker's voice utterance is set low, so the subtitle corresponding to the first speaker's voice can be identified as voice utterance data of the first speaker rather than the second speaker.

[0137] The processor (150) can obtain a probability value based on the audio signal and weight value sampled according to the acquired sampling period using the following mathematical formula.

[0138]

[0139] (The formula is expressed in terms of the total number of sampled audio signals, the weight, and the sample.)

[0140] <Mathematical Formula 3>

[0141] That is, the processor (150) can obtain a probability value for identifying the speaker generating the voice in the video based on the acquired weight value and the total number of sampled audio signals. Using multiple adaptive sampling may have the advantage of improving the recognition rate of voice and conversation.

[0142] The weight can be calculated using the following mathematical formula.

[0143]

[0144] (The formula is expressed in terms of a weight threshold and time.)

[0145] <Mathematical Formula 4>

[0146] At this time, the processor (150) can identify that the first speaker is generating the voice if the acquired probability value corresponds to the first probability value corresponding to the first speaker generating the voice included in the audio signal, and can identify that the second speaker is generating the voice if the acquired probability value corresponds to the second probability value corresponding to the second speaker generating the voice included in the audio signal.

[0147] Additionally, based on the identification result, when it is identified that the first speaker is generating voice, the processor (150) can control the display (130) to display the text corresponding to the audio signal of the identified first speaker together with information about the first speaker. When it is identified that the second speaker is generating voice, the processor (150) can control the display (130) to display the text corresponding to the audio signal of the identified second speaker together with information about the second speaker.

[0148] For example, the electronic device (100) can set the probability value corresponding to the first speaker to 0 and the probability value corresponding to the second speaker to 1. At this time, if the obtained probability value is 0.3, the electronic device (100) can identify the speaker corresponding to the subtitle as the first speaker. However, if the obtained probability value is 0.85, the speaker corresponding to the subtitle can be identified as the second speaker.
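The decision in this example can be sketched as a nearest-reference comparison. The probability formula itself is not legible in this text, so the form used below (the sum of weight times per-sample label, divided by the total number of samples) is an assumption built from the variables the text lists, not the disclosed formula. The per-sample labels (0.0 for the first speaker, 1.0 for the second) and the function names are likewise hypothetical.

```python
from typing import Sequence


def weighted_speaker_probability(labels: Sequence[float],
                                 weights: Sequence[float]) -> float:
    """Assumed form: sum(weight * label) / total number of samples."""
    if not labels or len(labels) != len(weights):
        raise ValueError("labels and weights must be non-empty and of equal length")
    return sum(w * s for w, s in zip(weights, labels)) / len(labels)


def identify_speaker(probability: float,
                     first_ref: float = 0.0,
                     second_ref: float = 1.0) -> str:
    """Return the speaker whose reference probability value is closest."""
    if abs(probability - first_ref) <= abs(probability - second_ref):
        return "first"
    return "second"


# Two samples from the first speaker, then two low-weight samples from the second.
p = weighted_speaker_probability([0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.5, 0.5])
speaker = identify_speaker(p)
```

Because the second speaker's samples carry a low weight, the resulting value stays close to the first speaker's reference value, so the subtitle is attributed to the first speaker.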

[0149] In the above-described method, the processor (150) can recognize a speaker that matches (or corresponds to) the dialogue based on the sampled audio signal and weight values.

[0150] FIG. 10 is a flowchart illustrating a method for controlling an electronic device according to one embodiment of the present disclosure.

[0151] First, the electronic device (100) can acquire an audio signal sampled from an image according to a sampling period corresponding to a voice segment included in the audio signal (S1010). The electronic device (100) can acquire a sampling period based on whether the voice segment included in the audio signal has started or whether the voice segment has lasted for a threshold time.

[0152] The electronic device (100) can identify a speaker generating speech in a video based on a weight value corresponding to the length of the text included in the video and the acquired audio signal (S1020). The electronic device (100) can acquire a sampled audio signal using the sampling period, the audio signal, and the length of the text included in the video obtained by the method described above.

[0153] Additionally, methods according to various embodiments of the present disclosure may be provided by being included in a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store™) or directly between two user (20) devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., downloadable app) may be temporarily stored or temporarily created in a device-readable storage medium, such as the memory of a manufacturer's server, an application store's server, or a relay server.

[0154] A method according to various embodiments of the present disclosure may be implemented as software comprising instructions stored on a storage medium readable by a machine (e.g., a computer). The machine is a device capable of calling stored instructions from the storage medium and operating according to the called instructions, and may include a server device or an electronic device according to the disclosed embodiments.

[0155] Meanwhile, a device-readable storage medium may be provided in the form of a non-transitory readable recording medium. Here, 'non-transitory readable recording medium' simply means that it is a tangible device and does not contain a signal (e.g., electromagnetic waves), and this term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily. For example, a 'non-transitory storage medium' may include a buffer in which data is stored temporarily.

[0156] When the above instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly or by using other components under the control of the processor. The instruction may include code generated or executed by a compiler or an interpreter.

[0157] Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described above. It is understood that various modifications can be made by those skilled in the art without departing from the essence of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the perspective of the present disclosure.

Claims

1. An electronic device comprising: a display; a communication interface; a memory storing instructions; and at least one processor, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to: acquire an audio signal sampled from an image according to a sampling period corresponding to a voice segment included in the audio signal; and identify a speaker generating speech in the image based on a weight value corresponding to the length of the acquired audio signal and the text included in the image.

2. The electronic device of claim 1, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to: obtain a sampling period less than or equal to a preset value when the detection point for identifying the sampling period corresponds to the start point of the voice segment included in the audio signal; and obtain a sampling period greater than or equal to a preset value when the detection point for identifying the sampling period corresponds to a point at which a threshold time has elapsed since the start point of the voice segment included in the audio signal.

3. The electronic device of claim 1, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to reset the sampling period when the detection point for identifying the sampling period corresponds to the end point of a voice segment included in the audio signal.

4. The electronic device of claim 1, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to obtain the sampling period using the number of sampled audio signals obtained based on the total number of sampled audio signals, the length of the audio signal, and the length of the audio signal corresponding to subtitle output.

5. The electronic device of claim 1, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to: perform a function for recognizing text contained within a text area obtained from an image output on the display; and obtain a length corresponding to the text obtained through the function.

6. The electronic device of claim 5, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to: obtain, using the length of the audio signal corresponding to the identified text, which is obtained based on the length corresponding to the identified text, a length during which the text of the image output to the display and the audio signal are simultaneously output; and obtain the weight value based on the sampling period and the length during which the text of the image output on the display and the audio signal are simultaneously output.

7. The electronic device of claim 6, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to obtain a probability value for identifying a speaker generating the voice in the image based on the obtained weight value and the total number of the sampled audio signals.

8. The electronic device of claim 7, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to: identify that a first speaker generates the voice if the obtained probability value corresponds to a first probability value corresponding to the first speaker generating the voice included in the audio signal; and identify that a second speaker generates the voice if the obtained probability value corresponds to a second probability value corresponding to the second speaker generating the voice included in the audio signal.

9. The electronic device of claim 8, wherein the instructions, when executed collectively or individually by the at least one processor, cause the electronic device to: control the display to display text corresponding to the audio signal of the identified first speaker together with information about the first speaker when the first speaker is identified as generating voice; and control the display to display text corresponding to the audio signal of the identified second speaker together with information about the second speaker when the second speaker is identified as generating voice.

10. The electronic device of claim 6, wherein the electronic device obtains a weight value based on multiple voice segments when a voice segment included in the acquired audio signal corresponds to voice segments of multiple speakers.

11. A method for controlling an electronic device, the control method comprising: a step of acquiring an audio signal sampled from an image according to a sampling period corresponding to a voice segment included in the audio signal; and a step of identifying a speaker generating speech in the image based on a weight value corresponding to the length of the acquired audio signal and the text included in the image.

12. The control method of claim 11, wherein the step of setting the sampling period comprises: a step of obtaining a sampling period less than or equal to a preset value when the detection point for identifying the sampling period corresponds to the start point of a voice segment included in the audio signal; and a step of obtaining a sampling period greater than or equal to a preset value when the detection point for identifying the sampling period corresponds to a point at which a threshold time has elapsed since the start point of the voice segment included in the audio signal.

13. The control method of claim 11, wherein the step of setting the sampling period comprises a step of resetting the sampling period when the detection point for identifying the sampling period corresponds to the end point of a voice segment included in the audio signal.

14. The control method of claim 11, wherein the step of setting the sampling period comprises a step of obtaining the sampling period using the number of sampled audio signals obtained based on the total number of sampled audio signals, the length of the audio signal, and the length of the audio signal corresponding to subtitle output.

15. A non-transitory computer-readable recording medium comprising a program for executing a method of controlling an electronic device, the control method comprising: a step of acquiring an audio signal sampled from an image according to a sampling period corresponding to a voice segment included in the audio signal; and a step of identifying a speaker generating speech in the image based on a weight value corresponding to the length of the acquired audio signal and the text included in the image.