Display device interworking with external device

The display device synchronizes AI-generated translated subtitles with media playback by transmitting audio data to an external device for real-time translation, addressing language barriers and improving user experience.

WO2026141748A1 · PCT designated stage · Publication Date: 2026-07-02 · LG ELECTRONICS INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
LG ELECTRONICS INC
Filing Date
2024-12-27
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing display devices face difficulties in synchronizing translated subtitles with video playback, limiting users' access to multimedia content due to language barriers.

Method used

A display device transmits audio data to an external device for AI-based translation and real-time subtitle generation, synchronizing the subtitles with media playback using time stamp information.

Benefits of technology

Enables accurate alignment of translated subtitles with media playback, enhancing user experience by overcoming language barriers and ensuring synchronized subtitle display.

✦ Generated by Eureka AI based on patent content.


Abstract

This display device comprises: a transmission module configured to transmit a data stream including video data and audio data to an external device; a media playback module configured to parse the audio data from the data stream and decode the parsed audio data; a text reception module configured to receive, from the external device, text data obtained by converting speech detected in the data stream into text and timestamp information; and a subtitle generation module configured to synchronize, on the basis of the timestamp information, the audio data extracted from the data stream with the text data.

Description

Display device that interacts with external devices

[0001] The present disclosure relates to a display device that interacts with an external device. More specifically, the present disclosure relates to a display device that displays subtitles in interaction with an external device in a content provision system.

[0002] Recently, digital TV services using wired or wireless communication networks have become commonplace. Digital TV services can provide a variety of services that were not available through existing analog broadcasting services.

[0003] For example, IPTV (Internet Protocol Television) and SMART TV services, which are types of digital TV services, provide interactivity that allows users to actively select the types of programs and viewing time.

[0004] Meanwhile, among media content, video is provided through telecommunication or broadcasting networks, so it can be delivered to all users regardless of region or language. However, because the language used in a video varies by country or region, the videos a user can actually enjoy are inevitably limited by the user's language ability.

[0005] For example, domestic (Korean) users can watch videos produced in English-speaking countries through display devices. However, since the audio and subtitles of such videos are produced in English, domestic users are limited in their ability to make use of videos created in English-speaking countries.

[0006] Meanwhile, translated subtitles can be displayed via voice recognition while playing video content on a display device. However, since the display device generates the translated subtitles while playing the video content, there are difficulties in displaying the subtitles on the screen in sync with the playback of the video content.

[0007] The present disclosure is intended to resolve the difficulty of displaying translated subtitles in sync with video playback, which arises because the display device generates the translated subtitles while it is playing the video content.

[0008] The present disclosure is for transmitting a data stream containing audio to an external device, wherein the external device decodes the audio to generate AI-based translated text and provides it to a display device.

[0009] The present disclosure is intended to propose a method for synchronizing media being played on a webOS-based display device with subtitles generated in real time through an external device.

[0010] The present disclosure is for transmitting media being played to an external device and generating subtitles in real time using artificial intelligence (AI).

[0011] The present disclosure is for adaptively controlling the timing of transmitting audio data to be translated to an external device according to the state of the external device.

[0012] A display device that interacts with an external device according to the present specification includes: a transmission module configured to transmit a data stream including video data and audio data to an external device via a communication module; a media playback module configured to transmit the data stream to the transmission module, parse the audio data from the data stream, and decode the parsed audio data; a text receiving module configured to receive text data, in which voice detected in the data stream is converted into text, and time stamp information from the external device via the communication module; and a subtitle generation module configured to synchronize the audio data extracted from the data stream with the text data based on the time stamp information. The subtitle generation module displays the synchronized text data as the subtitle on a specific frame of the screen.

[0013] According to an embodiment, the media playback module may include: a source input module configured to receive the data stream including the video data and the audio data; a demultiplexer configured to classify the video data, the audio data, and control information in the data stream; a parser configured to parse the audio data; a decoder configured to decode the parsed audio data; and an audio sink module configured to output the decoded audio data. The extracted audio data may be transmitted to the transmission module through the source input module.

[0014] According to an embodiment, the transmission module may include a second source input module configured to receive the data stream extracted from the source input module; a queue configured to store the data stream; and a TCP sink module configured to transmit the data stream stored in the queue to the external device via TCP.
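
The queue-plus-TCP-sink arrangement described above can be illustrated with a minimal Python sketch. This is an illustration only, not the disclosed implementation: the class name `TcpSink`, the `max_chunks` bound, and the use of a background thread are all assumptions. The sketch expects a TCP socket that is already connected to the external device.

```python
import queue
import socket
import threading


class TcpSink:
    """Hypothetical sink: buffers stream chunks in a queue and sends them over TCP."""

    def __init__(self, sock: socket.socket, max_chunks: int = 64):
        self._sock = sock
        # Bounded queue: push() blocks when the network cannot keep up.
        self._queue = queue.Queue(maxsize=max_chunks)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def push(self, chunk: bytes) -> None:
        """Store one chunk of the data stream for later transmission."""
        self._queue.put(chunk)

    def close(self) -> None:
        """Flush everything already queued, then stop the worker thread."""
        self._queue.put(None)  # sentinel marking the end of the stream
        self._worker.join()

    def _drain(self) -> None:
        while True:
            chunk = self._queue.get()
            if chunk is None:
                break
            self._sock.sendall(chunk)  # sendall retries until the whole chunk is sent
```

Decoupling the producer (the media pipeline) from the sender through a queue lets playback continue even when the TCP connection stalls briefly.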

[0015] According to an embodiment, the display device may further include a processor configured to control the operation of the external device through the transmission module and the text receiving module. The processor may control a text generation module configured to generate text data from the voice included in the audio data extracted from the data stream.

[0016] According to an embodiment, the processor may control the text generation module to extract and decode the audio data from the data stream transmitted through the transmission module. The processor may control the text generation module to collect first audio data in which voice is detected from the decoded audio data. The processor may control the text generation module to generate text data in which the voice is converted into text. The processor may control the text generation module to translate the text data into a plurality of languages.

[0017] According to an embodiment, the text generation module controlled by the processor may include: a demultiplexer configured to classify video data, audio data, and control information in the data stream; an audio parser configured to parse the audio data; an audio decoder configured to decode the parsed audio data; and a voice activity detector (VAD) configured to detect the voice in the decoded audio data.
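
The specification does not state how the voice activity detector decides that a frame contains voice. As a stand-in, the following sketch uses a simple energy threshold on decoded PCM audio, which is a common baseline technique and not necessarily what the disclosure intends. The frame size (640 bytes, i.e. 20 ms of 16 kHz 16-bit mono audio), the threshold value, and the function names are assumptions.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def detect_voice(pcm: bytes, frame_bytes: int = 640, threshold: float = 500.0) -> list[bool]:
    """Return one flag per frame: True where the frame energy reaches the threshold."""
    return [
        frame_rms(pcm[i : i + frame_bytes]) >= threshold
        for i in range(0, len(pcm), frame_bytes)
    ]
```

Only the frames flagged as voice would then need to be passed on to speech-to-text conversion.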

[0018] According to an embodiment, the text generation module controlled by the processor may further include: a speech-to-text (STT) module configured to convert the voice detected by the voice activity detector from the decoded audio data into the text data; a text translation module configured to translate the converted text data into a plurality of languages to generate translated text data; and a TCP transmission module configured to transmit the translated text data to the text receiving module of the display device in WebVTT format.

[0019] According to an embodiment, the text receiving module may include a third source input module configured to receive a second data stream containing the translated text data received from the text generation module; and a transmission module configured to transmit the second data stream containing the text data to the subtitle generation module in WebVTT format.

[0020] According to an embodiment, the subtitle generation module can extract a second time stamp of the second data stream transmitted from the text receiving module via the WebVTT method. The subtitle generation module can detect translated text data having the second time stamp corresponding to the first time stamp of the audio data. The subtitle generation module can control the subtitle of the translated text data having the second time stamp to be synchronized with the audio data of the first time stamp. The subtitle generation module can display the synchronized translated text data as the subtitle on a specific frame of the first time stamp of the screen.
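
The timestamp matching described above can be sketched as a lookup from the audio presentation time (the first time stamp) to the translated cue whose interval (the second time stamp) contains it. The names `Cue`, `SubtitleTrack`, and `cue_at`, the millisecond unit, and the half-open interval convention are assumptions made for illustration.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    start_ms: int  # second time stamp: start of the translated text
    end_ms: int    # end of the translated text (exclusive)
    text: str


class SubtitleTrack:
    """Hypothetical store of translated cues, kept sorted by start time."""

    def __init__(self):
        self._cues = []
        self._starts = []

    def add(self, cue: Cue) -> None:
        # Cues may arrive out of order from the external device.
        idx = bisect.bisect_right(self._starts, cue.start_ms)
        self._starts.insert(idx, cue.start_ms)
        self._cues.insert(idx, cue)

    def cue_at(self, audio_pts_ms: int):
        """Return the cue covering the given audio time stamp, or None."""
        idx = bisect.bisect_right(self._starts, audio_pts_ms) - 1
        if idx < 0:
            return None
        cue = self._cues[idx]
        return cue if audio_pts_ms < cue.end_ms else None
```

A renderer would call `cue_at` with the time stamp of the audio frame currently being output and draw the returned text on the corresponding video frame.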

[0021] According to an embodiment, the data stream transmitted from the media playback module to the transmission module may be one of an MPEG-4 data stream of a first quality, a TS data stream of a second quality higher than the first quality, and an MKV data stream. The TS data stream and the MKV data stream may be a first type of data stream, and the MPEG-4 data stream may be a second type of data stream. The data stream transmitted from the transmission module to the external device may be the first type of data stream or the second type of data stream.

[0022] According to an embodiment, the second data stream containing the text data transmitted from the external device to the text receiving module may be a WebVTT data stream.
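
For orientation, a WebVTT payload consists of a `WEBVTT` header line followed by cues, each with a `start --> end` timing line and the cue text. The following sketch serializes translated text into that shape. The function names and the `(start_ms, end_ms, text)` tuple layout are assumptions; the disclosure does not specify how the external device builds the stream.

```python
def format_timestamp(ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def build_webvtt(cues) -> str:
    """Serialize (start_ms, end_ms, text) tuples into a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start_ms, end_ms, text in cues:
        lines.append(f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}")
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)
```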

[0023] According to an embodiment, if the external device is a mobile terminal, the transmission module can convert the first type of data stream of the first quality into the second type of data stream. The transmission module can transmit the converted second type of data stream to the external device.

[0024] According to an embodiment, the processor can transmit the data stream to the external device through the transmission module when power is supplied to the external device or when the battery level is above a threshold. When power is not supplied to the external device or when the battery level is below the threshold, the processor can transmit audio data parsed from the data stream to the external device through a voice detection module.
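
The power-based selection above can be sketched as a small decision function. The paragraph does not say what happens exactly at the threshold or when the battery level is unknown, so the sketch assumes that a level equal to the threshold counts as sufficient and that an unknown level counts as insufficient. The names `SendMode` and `choose_send_mode` and the default threshold of 0.3 are hypothetical.

```python
from enum import Enum
from typing import Optional


class SendMode(Enum):
    FULL_STREAM = "full_stream"    # send the whole data stream via the transmission module
    PARSED_AUDIO = "parsed_audio"  # send parsed audio via the voice detection module


def choose_send_mode(
    externally_powered: bool,
    battery_level: Optional[float],
    threshold: float = 0.3,
) -> SendMode:
    """Pick what to send based on the external device's power state."""
    if externally_powered:
        return SendMode.FULL_STREAM
    # Assumption: an unreported battery level is treated as too low.
    if battery_level is not None and battery_level >= threshold:
        return SendMode.FULL_STREAM
    return SendMode.PARSED_AUDIO
```

Sending only parsed audio spares a battery-constrained external device the demultiplexing and parsing work.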

[0025] According to an embodiment, if the data stream is a first type of data stream, the processor can transmit audio data parsed from the data stream to the external device through a voice detection module. If the data stream is a second type of data stream, the processor can transmit the data stream to the external device through a transmission module.

[0026] According to an embodiment, if the processing speed associated with voice detection, text conversion, and translation by the external device is less than a threshold speed, the processor can transmit audio data parsed from the data stream to the external device through a voice detection module. If the processing speed is greater than or equal to the threshold speed, the processor can transmit the data stream to the external device through a transmission module.

[0027] According to an embodiment, the voice detection module may include: a source input module configured to receive audio data extracted from a parser of the media playback module; a second audio decoder configured to decode the extracted audio data; a voice activity detector (VAD) configured to detect the voice in the decoded audio data; a payloader module configured to form a data stream including a header for synchronization and a payload of data associated with the detected voice; and a transmission module configured to transmit the data stream including the header and the payload to the external device through the communication module.
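
The payloader step can be illustrated with a sketch that prepends a synchronization header to each block of detected voice data. The header layout used here (an 8-byte presentation time stamp in milliseconds followed by a 4-byte payload length, both in network byte order) is purely hypothetical; the actual header format of the disclosure is the subject of FIG. 8 and is not reproduced in this text.

```python
import struct

# Hypothetical header: uint64 time stamp (ms) + uint32 payload length, big-endian.
HEADER = struct.Struct("!QI")


def pack_packet(pts_ms: int, payload: bytes) -> bytes:
    """Prefix the voice payload with a synchronization header."""
    return HEADER.pack(pts_ms, len(payload)) + payload


def unpack_packet(packet: bytes):
    """Split a packet back into (pts_ms, payload), checking the declared length."""
    pts_ms, length = HEADER.unpack_from(packet)
    payload = packet[HEADER.size : HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    return pts_ms, payload
```

Carrying the time stamp in the header is what allows the text returned by the external device to be matched back to the audio it was derived from.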

[0028] According to an embodiment, the data stream may include a first data stream and a second data stream following the first data stream. While decoding the first voice of the first data stream, the processor may transmit the second data stream to the external device or parse audio data from the second data stream and transmit it to the voice detection module. The processor may store translated second text data for the second voice of the second data stream from the external device in a storage unit. The processor may extract a translated first subtitle of the text data from the storage unit. The processor may display the extracted first subtitle on the screen so as to be synchronized with the output of the first voice.

[0029] According to an embodiment, the processor may be configured to receive status information and capability information of the external device and to control the timing of transmitting the data stream to the transmission module or transmitting the audio data to the voice detection module. The processor may transmit the data stream to the transmission module if the processing speed of the external device is greater than or equal to the threshold speed. If the processing speed is less than the threshold speed, the processor may decode the audio data extracted through the parser through the voice detection module and transmit the audio data decoded to a first sound quality to the external device. If the text conversion accuracy of the external device is less than or equal to the threshold ratio, the processor may transmit the audio data decoded to a second sound quality through the audio decoder of the media playback module to the external device through the voice detection module. The second sound quality may be of higher quality than the first sound quality.
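
The routing rules in this embodiment can be summarized in a short decision function. The paragraph does not state which rule wins when both the speed condition and the accuracy condition apply, so the sketch assumes the accuracy check takes precedence. All names and threshold parameters are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DATA_STREAM = "data_stream"                    # whole stream via the transmission module
    AUDIO_FIRST_QUALITY = "audio_first_quality"    # audio decoded by the voice detection module
    AUDIO_SECOND_QUALITY = "audio_second_quality"  # higher-quality audio from the media playback decoder


@dataclass(frozen=True)
class ExternalStatus:
    processing_speed: float  # reported speed of detection, conversion, and translation
    stt_accuracy: float      # reported text conversion accuracy, as a ratio in [0, 1]


def choose_route(status: ExternalStatus, speed_threshold: float, accuracy_threshold: float) -> Route:
    """Select what the display device sends, given the external device's status."""
    # Assumption: poor accuracy overrides the speed-based choice.
    if status.stt_accuracy <= accuracy_threshold:
        return Route.AUDIO_SECOND_QUALITY
    if status.processing_speed >= speed_threshold:
        return Route.DATA_STREAM
    return Route.AUDIO_FIRST_QUALITY
```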

[0030] According to an embodiment, if the translation processing speed of the external device is below a threshold speed, the processor can decode the audio data extracted through the parser through the voice detection module and transmit the audio data decoded to the first sound quality to the external device.

[0031] According to an embodiment, the processor can determine whether the speech times of multiple speakers included in the audio data extracted through the audio decoder of the media playback module overlap. If the speech times overlap, the processor can transmit sequence information including a start time stamp and an end time stamp of each speaker's speech time, and the audio data, to the external device. The processor can display the translated subtitles of the text data synchronized with the speech times according to the sequence information.
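
The overlap check and the sequence information described above can be sketched as follows. The `SpeechSegment` type, the millisecond unit, and the dictionary layout of the sequence information are assumptions. The sketch also assumes that segments from the same speaker never overlap each other, and it treats segments that merely touch (one ends exactly when the next starts) as non-overlapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSegment:
    speaker: str
    start_ms: int  # start time stamp of the speaker's speech
    end_ms: int    # end time stamp of the speaker's speech


def has_overlap(segments) -> bool:
    """True if any segment starts before an earlier segment has ended."""
    latest_end = None
    for seg in sorted(segments, key=lambda s: s.start_ms):
        if latest_end is not None and seg.start_ms < latest_end:
            return True
        latest_end = seg.end_ms if latest_end is None else max(latest_end, seg.end_ms)
    return False


def sequence_info(segments) -> list:
    """Per-speaker start and end time stamps, ordered by start time."""
    return [
        {"speaker": s.speaker, "start_ms": s.start_ms, "end_ms": s.end_ms}
        for s in sorted(segments, key=lambda s: s.start_ms)
    ]
```

When `has_overlap` returns True, the sequence information would accompany the audio data so that each translated subtitle can be aligned with the right speaker's speech time.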

[0032] According to the present specification, as an external device generates translated subtitles while a display device plays video content, the difficulty of displaying translated subtitles on the screen at the time the video content is played can be resolved.

[0033] According to the present specification, a data stream containing audio is transmitted to an external device, and the external device decodes the audio to generate AI-based translated text, allowing the display device to focus on playing content on the screen.

[0034] According to the present specification, a method can be proposed to synchronize subtitles generated in real time through an external device by transmitting audio data to an external device in advance before decoding media on a webOS-based display device.

[0035] According to the present specification, media currently being played can be transmitted to an external device to generate subtitles in real time using artificial intelligence (AI), and the subtitles can be accurately aligned with the timing of media playback to enhance the user's viewing experience.

[0036] According to the present specification, it is possible to determine whether to transmit a data stream or parsed audio data to an external device by considering the power status of the external device and the processing speed associated with voice detection, text conversion, and translation. Accordingly, translation work can be performed adaptively depending on the state of the external device.

[0037] According to the present specification, it is possible to determine whether to transmit the data stream or the parsed audio data to an external device by considering the type of the data stream. Accordingly, translation work can be performed in collaboration with an external device in a manner optimized for the data stream.

[0038] According to the present specification, the timing of transmitting audio data to be translated to an external device can be adaptively controlled depending on the state of the external device, such as text translation accuracy and translation processing speed.

[0039] FIG. 1 is a block diagram illustrating the configuration of a display device according to one embodiment of the present disclosure.

[0040] FIG. 2 is a drawing for explaining a content server according to an embodiment of the present disclosure.

[0041] FIG. 3 is a drawing for explaining a content provision system according to an embodiment of the present disclosure.

[0042] FIG. 4 shows a detailed configuration diagram of a display device that transmits a data stream to an external device according to an embodiment and displays subtitles received from the external device.

[0043] FIG. 5 shows a detailed configuration diagram of an external device that interacts with the display device of FIG. 4, which displays subtitles.

[0044] FIG. 6 shows a configuration diagram of a display device and an external device that transmit parsed audio data before decoding.

[0045] FIG. 7 shows a detailed configuration diagram of an external device that interacts with the display device of FIG. 6, which displays subtitles.

[0046] FIG. 8 shows the header format of audio data and text data transmitted between a display device and an external device.

[0047] FIG. 9 shows a detailed configuration diagram of a display device that transmits sequence information and audio data to an external device and receives and displays subtitles from the external device.

[0048] FIG. 10 shows a detailed configuration diagram of an external device that interacts with the display device of FIG. 9, which displays subtitles.

[0049] FIG. 11 shows a configuration diagram of a display device and an external device that further include a voice message tracking module in the embodiment of FIG. 9.

[0050] FIG. 12 shows a detailed configuration diagram of an external device that interacts with the display device of FIG. 11, which displays subtitles.

[0051] FIG. 13 shows the configuration of a display device configured to control when a data stream or audio data is transmitted to a transmission module.

[0052] FIG. 14 is a flowchart of a control method that controls the transmission of a data stream to an external device or the transmission of parsed audio data according to the state of an external device according to the present disclosure.

[0053] FIG. 15 is a flowchart of a control method that controls the timing of adaptively transmitting audio data according to the state of an external device according to the present disclosure.

[0054] FIG. 16 shows a flowchart of a control method for transmitting audio data of a different format to an external device by determining whether the speech times of multiple speakers overlap.

[0055] It should be noted that technical terms used in this specification are used merely to describe specific embodiments and are not intended to limit the invention. Additionally, singular expressions used in this specification include plural expressions unless the context clearly indicates otherwise. The suffixes "module" and "part" for components used in the following description are assigned or used interchangeably solely for the ease of drafting the specification and do not inherently possess distinct meanings or roles.

[0056] In this specification, terms such as "composed of" or "comprising" should not be interpreted as necessarily including all of the various components or steps described in the specification, and should be interpreted as potentially excluding some of the components or steps, or including additional components or steps.

[0057] In addition, when describing the technology disclosed in this specification, if it is determined that a detailed description of related prior art could obscure the essence of the technology disclosed in this specification, such detailed description is omitted.

[0058] In addition, the attached drawings are intended only to facilitate understanding of the embodiments disclosed in this specification. The technical concept disclosed in this specification is not limited by the attached drawings and should be understood to include all modifications, equivalents, and substitutions that fall within the concept and technical scope of the present invention. Furthermore, not only each of the embodiments described below but also combinations of those embodiments may fall within the concept and technical scope of the present invention as such modifications, equivalents, and substitutions.

[0059] Hereinafter, embodiments disclosed in this specification will be described in detail with reference to the attached drawings.

[0060] FIG. 1 is a block diagram illustrating the configuration of a display device according to one embodiment of the present disclosure.

[0061] Referring to FIG. 1, the display device (100) may include a broadcast receiver (130), an external device interface unit (135), a storage unit (140), a user input interface (150), a processor (170), a communication module (173), a voice acquisition unit (175), a display unit (180), an audio output unit (185), and a power supply unit (190). Since the external device interface unit (135) performs wired communication with a peripheral device, the external device interface unit (135) may be referred to as a wired communication module. Since the communication module (173) performs wireless communication through a wireless signal, it may be referred to as a wireless communication module.

[0062] The broadcast receiver (130) may include a tuner (131), a demodulator (132), and a network interface unit (133).

[0063] The tuner (131) can tune to a specific broadcast channel according to a channel tuning command. The tuner (131) can receive a broadcast signal for the tuned specific broadcast channel.

[0064] The demodulator (132) can separate the received broadcast signal into a video signal, an audio signal, and a data signal related to the broadcast program, and can restore the separated video signal, audio signal, and data signal into a form that can be output.

[0065] The network interface unit (133) may provide an interface for connecting the display device (100) to a wired / wireless network including the Internet network. The network interface unit (133) may transmit or receive data to or from other users or other electronic devices through the connected network or another network linked to the connected network.

[0066] The network interface unit (133) can access a specific web page through a connected network or another network linked to the connected network. That is, it can access a specific web page through a network and transmit or receive data with the corresponding server.

[0067] In addition, the network interface unit (133) can receive content or data provided by a content provider or network operator. That is, the network interface unit (133) can receive content such as movies, advertisements, games, VOD, broadcast signals, and related information provided by a content provider or network provider through a network.

[0068] Additionally, the network interface unit (133) can receive firmware update information and update files provided by the network operator, and can transmit data to the internet, content provider, or network operator.

[0069] The network interface unit (133) can select and receive a desired application among the applications that are open to the public through the network.

[0070] The external device interface unit (135) can receive an application or a list of applications within an adjacent external device and transmit it to a processor (170) or a storage unit (140).

[0071] The external device interface unit (135) can provide a connection path between the display device (100) and an external device. The external device interface unit (135) can receive one or more of video and audio output from an external device connected to the display device (100) wirelessly or via a wired connection and transmit them to the processor (170). The external device interface unit (135) may include a plurality of external input terminals. The plurality of external input terminals may include an RGB terminal, one or more HDMI (High Definition Multimedia Interface) terminals, and a component terminal.

[0072] The video signal of an external device input through the external device interface unit (135) can be output through the display unit (180). The voice signal of an external device input through the external device interface unit (135) can be output through the audio output unit (185).

[0073] The external device that can be connected to the external device interface unit (135) may be any one of a set-top box, Blu-ray player, DVD player, game console, soundbar, smartphone, PC, USB memory, or home theater, but this is merely an example.

[0074] In addition, some of the content data stored in the display device (100) can be transmitted to another user or other electronic device selected among other users or other electronic devices that are previously registered in the display device (100).

[0075] The storage unit (140) can store programs for each signal processing and control within the processor (170), and can store signal-processed video, audio, or data signals.

[0076] Additionally, the storage unit (140) may perform the function of temporarily storing video, audio, or data signals input from the external device interface unit (135) or the network interface unit (133), and may also store information regarding a predetermined image through a channel memory function.

[0077] The storage unit (140) can store an application or a list of applications input from an external device interface unit (135) or a network interface unit (133).

[0078] The display device (100) can play content files (video files, still image files, music files, document files, application files, etc.) stored in the storage unit (140) and provide them to the user.

[0079] The user input interface (150) can transmit a signal input by the user to the processor (170) or transmit a signal from the processor (170) to the user. For example, the user input interface (150) can receive and process control signals such as power on/off, channel selection, and screen setting from the remote control device (200), or can transmit control signals from the processor (170) to the remote control device (200), according to various communication methods such as Bluetooth, Ultra Wideband (UWB), ZigBee, Radio Frequency (RF) communication, or Infrared (IR) communication.

[0080] Additionally, the user input interface (150) can transmit control signals input from local keys (not shown), such as a power key, channel key, volume key, and setting value, to the processor (170).

[0081] The image signal processed by the processor (170) can be input to the display unit (180) and displayed as an image corresponding to the image signal. Additionally, the image signal processed by the processor (170) can be input to an external output device through the external device interface unit (135).

[0082] The voice signal processed by the processor (170) can be output as audio to the audio output unit (185). Additionally, the voice signal processed by the processor (170) can be input to an external output device through the external device interface unit (135).

[0083] In addition, the processor (170) can control the overall operation within the display device (100).

[0084] Additionally, the processor (170) can control the display device (100) by means of user commands or internal programs input through the user input interface (150). The processor (170) can connect to a network to enable the user to download desired applications or a list of applications into the display device (100). The processor (170) may be configured to execute at least one application program to control the display device (100). The processor (170) may be configured to play media of a data stream including video and audio through a media playback module (100). The media playback module (100) may be an application program of a media player that plays media. The processor (170) may detect voice included in the audio data through a voice detection module (20).

[0085] The processor (170) enables the processed video or audio signal, such as channel information selected by the user, to be output through the display unit (180) or audio output unit (185).

[0086] Additionally, the processor (170) enables a video signal or audio signal from an external device, such as a camera or camcorder, which is input through the external device interface unit (135), to be output through the display unit (180) or audio output unit (185) in accordance with an external device video playback command received through the user input interface (150).

[0087] Meanwhile, the processor (170) can control the display unit (180) to display an image. For example, the processor (170) can control a broadcast image input through the tuner (131), an external input image input through the external device interface unit (135), an image input through the network interface unit (133), or an image stored in the storage unit (140) to be displayed on the display unit (180). In this case, the image displayed on the display unit (180) may be a still image or a video, and may be a 2D image or a 3D image.

[0088] Additionally, the processor (170) can control the playback of content stored in the display device (100), received broadcast content, or external input content input from the outside, and the content may be in various forms such as broadcast video, external input video, audio file, still image, connected web screen, and document file.

[0089] The communication module (173) can communicate with an external device via wired or wireless communication. The communication module (173) can perform short-range communication with an external device. To this end, the communication module (173) can support short-range communication by using at least one of Bluetooth™, BLE (Bluetooth Low Energy), RFID (Radio Frequency Identification), Infrared Data Association (IrDA), UWB (Ultra Wideband), ZigBee, NFC (Near Field Communication), Wi-Fi (Wireless-Fidelity), Wi-Fi Direct, and Wireless USB (Wireless Universal Serial Bus) technologies. The communication module (173) can support wireless communication between the display device (100) and a wireless communication system, between the display device (100) and another display device (100), or between the display device (100) and a network where another display device (100) (or an external server) is located, via a wireless area network. The wireless area network may be a wireless personal area network.

[0090] Here, another display device (100) may be a wearable device (e.g., a smartwatch, smart glass, head-mounted display, or mobile terminal such as a smartphone) capable of exchanging (or interacting with) data with the display device (100) according to the present invention. A communication module (173) may detect (or recognize) a wearable device capable of communicating around the display device (100). Furthermore, if the detected wearable device is an authenticated device to communicate with the display device (100) according to the present invention, the processor (170) may transmit at least a portion of the data processed in the display device (100) to the wearable device through the communication module (173). Thus, a user of the wearable device may use the data processed in the display device (100) through the wearable device.

[0091] The voice acquisition unit (175) can acquire audio. The voice acquisition unit (175) may include at least one microphone (not shown) and can acquire audio around the display device (100) through the microphone (not shown).

[0092] The display unit (180) can generate a driving signal by converting the video signal, data signal, OSD signal processed by the processor (170) or the video signal, data signal, etc. received from the external device interface unit (135) into R, G, and B signals, respectively.

[0093] Meanwhile, since the display device (100) illustrated in FIG. 1 is merely an embodiment of the present invention, some of the illustrated components may be integrated, added, or omitted depending on the specifications of the actual implemented display device (100).

[0094] That is, as needed, two or more components may be combined into a single component, or a single component may be subdivided into two or more components. In addition, the functions performed in each block are intended to explain embodiments of the present invention, and the specific operations or devices do not limit the scope of the present invention.

[0095] FIG. 2 is a drawing for explaining a content server according to an embodiment of the present disclosure.

[0096] Referring to FIGS. 1 and 2, the content server (300) can provide a recommendation service that recommends content that a viewer using the display device (100) may prefer.

[0097] The content server (300) may include a communication interface (310), memory (320), and a processor (330).

[0098] The content server (300) can transmit and receive data to and from at least one display device (100) via wired or wireless communication through the communication interface (310).

[0099] The memory (320) may include a content information database (321). The content information database (321) may store information related to content played on each device. For example, the content information database (321) may store content playback information, content setting information, or application installation information in association with the identification information of each device.

[0100] When the processor (330) receives a content recommendation request from a display device (100) or an external device, it can recommend content optimized for each device based on data stored in the content information database (321).

[0101] FIG. 3 is a drawing for explaining a content provision system according to an embodiment of the present disclosure.

[0102] Referring to FIGS. 1 to 3, the content providing system (1000) may include at least one display device (100), at least one remote control device (200), a content server (300), and an external device (400).

[0103] The processor (170) of the display device (100) can play content.

[0104] Additionally, the processor (170) can generate content playback information regarding the played content. Additionally, the processor (170) can generate content setting information, which is information regarding the quality, volume, and preferred channel status set when playing the content.

[0105] Content playback information may include at least one of content identification information, content genre information, content playback start time information, content playback end time information, and content total playback time information for the played content.

[0106] Content setting information may include at least one of quality information set for the content when playing the content, volume information, and preferred channel information regarding whether the user has registered the channel providing the content as a preferred channel.

[0107] The processor (170) can transmit device identification information of the display device (100), generated content playback information, and generated content setting information to the content server (300) through the communication module (173). The device identification information may be unique identification information for distinguishing the display device (100) from other devices.

[0108] The content server (300) can store content playback information and content setting information received from the display device (100) in the content information database (321) in association with device identification information.
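
As an illustrative, non-limiting sketch, the association between device identification information and the stored content playback information and content setting information could be modeled as follows. The class names, field names, and units are assumptions introduced only for this example and are not defined in the present disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PlaybackInfo:
    """Content playback information (field names and units are assumed)."""
    content_id: str
    genre: str
    start_time: float      # playback start time, in seconds
    end_time: float        # playback end time, in seconds
    total_duration: float  # total playback time, in seconds


@dataclass
class SettingInfo:
    """Content setting information (field names are assumed)."""
    quality: str
    volume: int
    preferred_channel: bool


@dataclass
class DeviceRecord:
    """Everything stored for one device identifier."""
    playback: List[PlaybackInfo] = field(default_factory=list)
    settings: List[SettingInfo] = field(default_factory=list)


class ContentInfoDatabase:
    """In-memory stand-in for the content information database (321)."""

    def __init__(self) -> None:
        self._records: Dict[str, DeviceRecord] = {}

    def store(self, device_id: str, playback: PlaybackInfo,
              setting: SettingInfo) -> None:
        # Create the record on first use, then append to it.
        record = self._records.setdefault(device_id, DeviceRecord())
        record.playback.append(playback)
        record.settings.append(setting)

    def lookup(self, device_id: str) -> Optional[DeviceRecord]:
        # Returns None when nothing has been stored for this device.
        return self._records.get(device_id)
```

In such a sketch, a recommendation request would carry only the device identification information, and the server would call `lookup` to retrieve the history on which a recommendation is based.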

[0109] Meanwhile, the processor (170) can receive a content recommendation command through the user input interface unit (150) or the voice acquisition unit (175).

[0110] When the processor (170) receives a content recommendation command, it can transmit device identification information of the display device (100) and a content recommendation request to the content server (300) through the communication module (173).

[0111] The communication interface (310) of the content server (300) can receive device identification information and a content recommendation request from the display device (100).

[0112] The processor (330) of the content server (300) can obtain content playback information and content setting information associated with the display device (100) from the content information database (321) based on device identification information.

[0113] The processor (330) can generate content recommendation information and recommendation setting information for the display device (100) based on content playback information and content setting information. The content recommendation information may include recommended content identification information and recommended content genre information for at least one recommended content. Additionally, the recommendation setting information may include recommended image quality setting information and preferred channel information.

[0114] The processor (330) can transmit content recommendation information and recommendation setting information to the display device (100) through the communication interface (310).

[0115] The processor (170) can receive content recommendation information and recommendation setting information from the content server (300) through the communication module (173).

[0116] The processor (170) can display at least one recommended content based on the received content recommendation information. Additionally, when a playback command for the recommended content is input through the user input interface unit (150) or the voice acquisition unit (175), the processor (170) can set the quality of the recommended content to be played based on the received recommendation setting information and play it.

[0117] After the quality of the recommended content has been set and the content is being played, if the user requests a change to a preferred channel, the channel can be changed to the preferred channel based on the preferred channel information.

[0118] Meanwhile, the display device (100) can mirror the content currently being played to an external device (400). The external device (400) may include another display device or a mobile device. In this case, the mirrored content can be viewed through the external device (400). Therefore, viewing information regarding the mirrored content needs to serve as basic data for recommending content to the external device (400).

[0119] Meanwhile, when the display device (100) performs a mirroring operation to an external device (400), it may receive a control command from the external device (400) to control the display device (100). The control command may include a content change command to change the content being played from the first content to the second content. When the display device (100) receives the content change command, it may play the changed content. In this case, the display device (100) needs to transmit content playback information regarding the changed content to the content server (300) as information for content recommendation by the external device (400).

[0120] Hereinafter, a display device that displays subtitles in conjunction with an external device in a content provision system according to the present disclosure will be described. In this regard, FIG. 4 shows a detailed configuration diagram of a display device that transmits a data stream to an external device according to an embodiment and displays subtitles received from the external device. FIG. 5 shows a detailed configuration diagram of an external device that interacts with the display device displaying subtitles of FIG. 4.

[0121] With reference to FIGS. 4 and 5, a display device (100) that interacts with an external device (400) according to the present disclosure will be described. The display device (100) may be configured to include a media playback module (10), a text receiving module (30), a subtitle generating module (40), and a transmission module (60). In this regard, the media playback module (10), the text receiving module (30), the subtitle generating module (40), and the transmission module (60) may be composed of software modules and executed by a processor (170).

[0122] Each software module is formed as a connected structure where the output of one processing stage serves as the input to the next. The software module consists of an instruction pipeline structure in which multiple instructions are executed step-by-step, divided into detailed cycles such as fetching, decoding, and computation, and executed by each pipeline stage. Additionally, multiple software modules are configured as a software pipeline where the output of each software module is automatically connected as the input to another software module.

[0123] Accordingly, the media playback module (10) may be referred to as a media playback pipeline, and the text receiving module (30) may be referred to as a receiver pipeline. The transmission module (60) may be referred to as a sender pipeline.
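
As an illustrative, non-limiting sketch, a software pipeline in which the output of each processing stage is automatically supplied as the input to the next stage could be expressed as follows. The stage functions shown here are hypothetical placeholders and do not perform real demultiplexing, parsing, or decoding.

```python
from typing import Callable, Iterable, List

Stage = Callable[[object], object]


def build_pipeline(stages: Iterable[Stage]) -> Stage:
    """Chain stages so that each stage's output feeds the next stage."""
    stage_list = list(stages)

    def run(item: object) -> object:
        for stage in stage_list:
            item = stage(item)
        return item

    return run


# Hypothetical stand-ins for the demultiplexer, parser, and audio decoder.
def demultiplex(stream: dict) -> bytes:
    # Select the audio part of a toy "data stream".
    return stream["audio"]


def parse(audio: bytes) -> dict:
    # Attach toy codec information to the audio payload.
    return {"codec": "pcm", "payload": audio}


def decode(parsed: dict) -> List[int]:
    # Turn the payload bytes into a list of sample values.
    return list(parsed["payload"])


media_playback_pipeline = build_pipeline([demultiplex, parse, decode])
```

Under these assumptions, a receiver pipeline or a sender pipeline would be built in the same way from its own list of stages.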

[0124] The transmission module (60) may be configured to transmit a data stream containing video data and audio data to an external device (400) via a communication module (173). In this regard, the data stream may include an MPEG-4 data stream, a TS (Transport Stream), an MKV (Matroska Multimedia Container) data stream, etc.

[0125] The media playback module (10) transmits the received media data in a container format to an external device (400), which is an AI offloading device. Since the external device (400) receives the media data in a container format, it performs media processing before performing AI processing. Accordingly, the external device (400) performs demultiplexing, audio parsing, and audio decoding, and then performs voice activity detection, STT conversion, and translation processing.

[0126] Finally, the external device (400) adds subtitle timestamp information to the finally translated text and transmits the resulting subtitles to the display device (100), which is a webOS device. The display device (100) parses the timestamp information and text from the received data, sets the timestamp and duration for the subtitles to be output, and outputs the subtitles in conjunction with media playback.

[0127] The media playback module (10) may be configured to transmit, to the transmission module (60), a data stream containing video data and audio data to be converted into text data. The media playback module (10) may also be configured to parse the audio data from the data stream and to decode the parsed audio data.

[0128] The text receiving module (30) may be configured to receive text data, in which voice detected in a data stream is converted into text, and timestamp information from an external device (400) via a communication module (173). The communication module (173) may be configured to receive a second signal from the external device (400) containing data necessary for the voice to be displayed as a subtitle on a specific frame of the screen. The communication module (173) may be implemented as a short-range wireless communication module configured to transmit a first signal to the external device (400) and receive a second signal from the external device (400). The external device (400) may be implemented as a mobile terminal, a tablet terminal, a PC, or another display device. Meanwhile, the external device (400) may be connected to the display device (100) via a wired interface.

[0129] If the external device (400) is configured to receive the data stream itself rather than voice or audio data, the external device (400) may be connected via a wired interface. The external device (400) receiving the data stream may have increased power consumption due to audio decoding and voice detection. Therefore, the external device (400) receives the data stream while connected to power, performs audio decoding and voice detection independently, and subsequently performs text conversion and translation into a specific language.

[0130] Meanwhile, the subtitle generation module (40) may be configured to synchronize audio data and text data extracted from a data stream based on time stamp information. The subtitle generation module (40) may be configured to display the synchronized text data as subtitles on a specific frame of the screen.

[0131] The media playback module (10) may be configured to include components corresponding to a plurality of sub-modules. The media playback module (10) may be configured to include a source input module (SRC) (11), a demultiplexer (12), a parser (13), an audio decoder (14), and an audio sink module (15).

[0132] The source input module (11) may be configured to receive a data stream containing video data and audio data. The source input module (11) may be configured to receive a data stream containing video data and audio data from a content server (300) through a broadcast receiver (130). The source input module (11) may be configured to transmit a data stream containing video data and audio data to be converted into text data to a transmission module (60). The data stream transmitted to the transmission module (60) through the source input module (11) is raw data prior to being classified into video data and audio data. The source input module (11) may be configured to output a data stream containing video data and audio data to a demultiplexer (12). The demultiplexer (12) may be configured to classify video data, audio data, and control information from the data stream.

[0133] Audio data extracted through the source input module (11) can be transmitted to the transmission module (60). Meanwhile, the parser (13) can be configured to parse audio data classified by the demultiplexer (12) via the source input module (11). The parser (13) can parse the audio data to extract playback time, audio codec information, and audio metadata. The audio decoder (14) can be configured to decode and extract the parsed audio data. The audio decoder (14) can be configured to decode and extract audio data based on the audio codec information and audio metadata. The audio data extracted through the audio decoder (14) can be transmitted to the voice detection module (20).

[0134] The audio sink module (15) may be configured to output decoded audio data. The audio sink module (15) may be configured to output decoded audio data through an audio output unit (185), such as a speaker. Before audio data is output through the audio sink module (15), the display device (100) must receive the text data to be displayed as subtitles. That is, the display device (100) must receive, from the external device (400), text data into which the speech has been converted or text data of a second speech into which the speech has been translated.

[0135] Meanwhile, the transmission module (60) that transmits the data stream to an external device (400) may be composed of a plurality of sub-modules. The transmission module (60) may be configured to include a second source input module (61), a queue (62), and a TCP sink module (63). The second source input module (61) may be configured to receive a data stream extracted from the source input module (11) of the media playback module (10). The queue (62) may be configured to store the input data stream. The TCP sink module (63) may be configured to transmit the data stream stored in the queue (62) to the external device (400) via TCP.
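
As an illustrative, non-limiting sketch, the cooperation of a queue and a TCP sink could be arranged as follows: chunks of the data stream are stored in a bounded queue, and a separate thread drains the queue and writes each chunk to an already connected stream socket. The class name `TcpSink`, its methods, and the queue size are assumptions made for this example only.

```python
import queue
import socket
import threading
from typing import Optional


class TcpSink:
    """Drains queued data-stream chunks into a connected stream socket."""

    def __init__(self, sock: socket.socket, maxsize: int = 64) -> None:
        self._sock = sock
        # None is used as an end-of-stream marker.
        self._queue: "queue.Queue[Optional[bytes]]" = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(target=self._drain, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def push(self, chunk: bytes) -> None:
        # Blocks when the queue is full, which throttles the producer.
        self._queue.put(chunk)

    def close(self) -> None:
        # Signal end of stream and wait until every queued chunk is sent.
        self._queue.put(None)
        self._thread.join()

    def _drain(self) -> None:
        while True:
            chunk = self._queue.get()
            if chunk is None:
                break
            # sendall retries until the whole chunk has been written.
            self._sock.sendall(chunk)
```

Decoupling the producer from the socket in this way lets the source input side continue to accept data while the network write is in progress, up to the capacity of the queue.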

[0136] Meanwhile, the processor (170) may be configured to control the operation of the external device (400) through the transmission module (60) and the text reception module (30). The processor (170) may control the text generation module (420) of the external device (400) to generate text data from voice included in audio data extracted from a data stream. The external device (400) may include a communication module (410) and a text generation module (420) that generates text based on AI. The text generation module (420) may be configured to generate text data from voice included in audio data extracted from a data stream.

[0137] The processor (170) can control the text generation module (420) to extract and decode audio data from a data stream transmitted through the transmission module (60). The processor (170) can control the text generation module (420) to collect first audio data in which voice is detected from the decoded audio data. The processor (170) can control the text generation module (420) to generate text data in which the voice included in the first audio data is converted into text. The processor (170) can control the text generation module (420) to translate the text data into multiple languages. In this regard, the processor (170) can control the text generation module (420) to generate text data translated into a specific language according to user input through the display device (100).

[0138] A text generation module (420) controlled by a processor (170) may be configured to include a source input module (431), a demultiplexer (432), an audio parser (433), an audio decoder (434), and a voice activity detector (VAD) (435). The source input module (431), the demultiplexer (432), the audio parser (433), the audio decoder (434), and the voice activity detector (435) may constitute a voice detection module (430). Thus, the text generation module (420) may be configured to include a voice detection module (430).

[0139] The text generation module (420), controlled via the processor (170), may be configured to further include a speech-to-text (STT) module (423), a text translation module (424), and a TCP transmission module (426). The STT module (423) may be referred to as a text converter. The STT module (423) may be configured to convert speech detected by the voice activity detector (435) from decoded audio data into text data. The text translation module (424) may be configured to generate translated text data by translating the converted text data into multiple languages. The TCP transmission module (426) may transmit the translated text data to the text receiving module (30) of the display device (100) in a WebVTT manner.

[0140] The text receiving module (30) may be configured to include a third source input module (31) and a transmission module (32). The third source input module (31) may be configured to receive a second data stream containing translated text data received from the text generation module (420).

[0141] The transmission module (32) may be configured to transmit a second data stream containing text data to the subtitle generation module (40) in a WebVTT (Web Video Text Tracks) manner. The second data stream may be configured as a WebVTT-based data stream. Since the transmission module (32) transmits the second data stream in synchronization with the application program of the subtitle generation module (40), it may also be referred to as an app sync module.
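
As an illustrative, non-limiting sketch, the start time, duration, and text of each subtitle could be extracted from a WebVTT-based payload as follows. The parser handles only the basic cue layout (an optional cue identifier, a timing line of the form `start --> end`, and one or more text lines); cue settings, notes, and styling blocks are ignored. The names `Cue` and `parse_webvtt` are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# WebVTT timestamp: optional hours, then mm:ss.ttt
_TS = r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
_CUE_TIMING = re.compile(rf"^{_TS}\s+-->\s+{_TS}")


@dataclass
class Cue:
    start: float     # seconds from the start of the media
    duration: float  # seconds
    text: str


def _to_seconds(h: Optional[str], m: str, s: str, ms: str) -> float:
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_webvtt(payload: str) -> List[Cue]:
    """Return the cues found in a WebVTT document, in document order."""
    cues: List[Cue] = []
    # Blocks are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", payload.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _CUE_TIMING.match(line)
            if match:
                groups = match.groups()
                start = _to_seconds(*groups[:4])
                end = _to_seconds(*groups[4:])
                text = "\n".join(lines[i + 1:])
                cues.append(Cue(start, end - start, text))
                break  # remaining lines belong to this cue's text
    return cues
```

The `start` and `duration` values obtained in this way correspond to the timestamp and duration that the display device sets for each subtitle to be output.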

[0142] The subtitle generation module (40) may be configured to extract a first time stamp of a second data stream transmitted via WebVTT from the text receiving module (30) and a second time stamp of the translated text data. The subtitle generation module (40) may detect translated text data having a second time stamp corresponding to the first time stamp of the audio data. In this regard, the value of the second time stamp may be the same as the value of the first time stamp or a value within a threshold error range of it. The threshold error range may be set to 0.1 times, 0.2 times, or 0.3 times the length of the audio segment. For example, the length of the audio segment may be set to 6 seconds or 3 seconds to correspond to the length of the video segment. The length of the audio segment may also be set within a range of 2 seconds to 4 seconds.
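
As an illustrative, non-limiting sketch, the detection of translated text data whose second time stamp corresponds to the first time stamp within a threshold error range could be performed as follows. The function name, the representation of text data as `(time stamp, text)` pairs, and the default values (a 3-second audio segment and a ratio of 0.1) are assumptions chosen from the ranges mentioned above.

```python
from typing import List, Optional, Tuple


def match_subtitle(
    first_ts: float,
    cues: List[Tuple[float, str]],
    segment_length: float = 3.0,
    tolerance_ratio: float = 0.1,
) -> Optional[Tuple[float, str]]:
    """Return the cue whose time stamp is closest to first_ts.

    A cue qualifies only if its time stamp differs from first_ts by no
    more than tolerance_ratio * segment_length seconds. Returns None
    when no cue qualifies.
    """
    threshold = tolerance_ratio * segment_length
    best: Optional[Tuple[float, str]] = None
    best_error = threshold
    for second_ts, text in cues:
        error = abs(second_ts - first_ts)
        if error <= best_error:
            best, best_error = (second_ts, text), error
    return best
```

With the default values, the threshold is 0.3 seconds, so a cue stamped at 3.2 seconds would be matched to audio stamped at 3.0 seconds, whereas audio stamped at 4.5 seconds would have no matching cue among cues stamped at 0.0, 3.2, and 6.0 seconds.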

[0143] The subtitle generation module (40) can control the subtitles of the translated text data having a second time stamp to be synchronized with the audio data of the first time stamp. The subtitle generation module (40) can display the translated text data synchronized with a specific frame of the first time stamp on the screen as subtitles.

[0144] As described above, the data stream transmitted from the media playback module (10) to the transmission module (60) may be one of an MPEG-4 data stream, a TS data stream, and an MKV data stream. That is, the data stream transmitted from the media playback module (10) to the transmission module (60) may be either an MPEG-4 data stream of a first quality, or a TS data stream or an MKV data stream of a second quality higher than the first quality. The TS data stream and the MKV data stream are data streams of the first type, and the MPEG-4 data stream may be a data stream of the second type.

[0145] The data stream transmitted from the transmission module (60) to the external device (400) may be a first type of data stream or a second type of data stream. The second data stream containing text data transmitted from the external device (400) to the text receiving module (30) may be a WebVTT data stream.

[0146] If the display device (100) is a mobile terminal such as a smartphone or tablet device, the data stream may be composed of an MPEG-4 data stream. If live content of HD quality is played through the display device (100), the data stream may be composed of a TS data stream. If VOD-based video content is played through the display device (100), the data stream may be composed of an MKV data stream.

[0147] TS data streams and MKV data streams consist of a first type of data stream having a lower video compression rate and higher image quality than MPEG-4 data streams. MPEG-4 data streams consist of a second type of data stream having a higher video compression rate than TS data streams and MKV data streams.

[0148] Meanwhile, if the external device (400) is a mobile terminal such as a smartphone or tablet device, it may be unable to decode TS data streams and MKV data streams, unlike MPEG-4 data streams. Therefore, the data stream transmitted through the transmission module (60) may consist of an MPEG-4 data stream. The transmission module (60) may convert a first type of data stream having the second quality into a second type of data stream having the first quality, which is lower than the second quality. The transmission module (60) may transmit the converted second type of data stream to the external device (400).

[0149] In this regard, the source input module (61) may further include a source conversion module that converts a first type of data stream having the second quality into a second type of data stream having the first quality. Accordingly, by using the second type of data stream, which has a higher compression rate, the external device (400) can perform voice detection, text conversion, and translation in a shorter time than with the first type of data stream.

[0150] The embodiments of FIGS. 4 and 5 allow the data stream to be transmitted to the external device (400), which is an AI offloading device, as soon as it is input to the display device (100) on which a media player is running. Since the display device (100) can transmit data to the external device (400) for AI processing in advance of the rendering time, delays caused by AI subtitle processing can be mitigated. Additionally, implementation is simple due to the relatively low complexity, and accurate synchronization of subtitles and media playback can be achieved for VOD content.

[0151] Meanwhile, the embodiments of FIGS. 4 and 5 can be utilized for URL-based streaming playback. Additionally, an external device (400), which is an AI offloading device, must possess media processing capabilities. The external device (400) must perform media processing such as demultiplexing, parsing, and decoding, as well as text timestamp processing. Furthermore, in the case of a live stream, a base time must be managed separately to synchronize subtitles and media playback. Status information of the external device (400) may be detected for media processing and text timestamp processing. In this regard, the display device (100) may transmit a data stream to the external device (400) or transmit parsed / decoded audio data to the external device (400).

[0152] Meanwhile, a display device that displays subtitles in conjunction with an external device according to the present disclosure may transmit audio data, rather than a data stream, to the external device depending on the capabilities and status of the external device. Additionally, the timing of transmitting audio data to the external device may be controlled depending on the capabilities and status of the external device. Meanwhile, a display device that displays subtitles in conjunction with an external device according to an embodiment may transmit parsed audio data to a voice detection module in advance before the audio data is decoded.

[0153] In this regard, FIG. 6 shows a configuration diagram of a display device and an external device that transmit parsed audio data prior to decoding. FIG. 7 shows a detailed configuration diagram of an external device that interacts with the display device that displays subtitles of FIG. 6. Referring to FIG. 6 and FIG. 7, the display device (100) may be configured to transmit decoded audio data to an external device (400) along with sequence information associated with a time stamp.

[0154] Referring to FIGS. 1 through 7, the display device (100) may be configured to include a media playback module (10), a voice detection module (20), a communication module (173), a text reception module (30), and a subtitle generation module (40). The media playback module (10) may be configured to include a source input module (SRC) (11), a demultiplexer (12), a parser (13), an audio decoder (14), and an audio sink module (15). Redundant descriptions of the operation of the aforementioned modules are omitted, and the descriptions given with reference to FIGS. 4 and 5 apply instead.

[0155] The source input module (11) may be configured to receive a data stream containing video data and audio data. The demultiplexer (12) may be configured to classify video data, audio data, and control information from the data stream. The parser (13) may be configured to parse audio data. The audio decoder (14) may be configured to decode and extract the parsed audio data. The audio decoder (14) may be configured to decode and extract audio data based on audio codec information and audio metadata. The audio sink module (15) may be configured to output the decoded audio data.

[0156] Meanwhile, audio data extracted through the parser (13) can be transmitted to the voice detection module (20). Thus, audio data extracted through the parser (13), rather than decoded audio data, can be transmitted to the voice detection module (20). The audio data extracted through the parser (13) can be composed of an audio elementary stream that further includes playback time, audio codec information, and audio metadata.

[0157] The media playback module (10) can extract an audio elementary stream and transmit it to the voice detection module (20). The voice detection module (20) can be configured to decode the audio elementary stream and collect first audio data in which voice is detected from the decoded audio elementary stream. The communication module (173) can transmit transmission data with a header inserted into the first audio data to an external device (400).

[0158] The voice detection module (20) may be configured to include a source input module (21), a second audio decoder (22), a voice activity detector (VAD) (23), a payloader module (24), and a transmission module (25). The voice detection module (20) is thus additionally equipped with its own second audio decoder (22). As a result, as soon as audio data is decoded, it can be transmitted to an external device (400) so that AI-based translation processing is performed in advance. Only the first audio data in which voice is detected can be selected and collected through the voice activity detector (23). A header containing synchronization information may be inserted before the first audio data in which voice is detected is transmitted to the external device (400) that performs AI offloading. Since communication and processing occur only when voice is detected, communication and processing efficiency can be improved.

[0159] The external device (400) does not need to perform media processing on the audio data. By checking the inserted header information, it determines the size of the voice-containing audio data to be received. After receiving audio data of the size indicated in the header, it performs speech-to-text (STT) processing and translation. Therefore, the external device (400) performs only AI-based STT processing and translation. Meanwhile, when the external device (400) transmits text data to the display device (100), such as a TV, the synchronization information of the received header is transmitted along with it. The display device (100) synchronizes the subtitles with the media playback by referring to the received text and the synchronization information of the header.

[0160] The source input module (21) may be configured to receive audio data extracted from the parser (13). The source input module (21) may receive an audio elementary stream, which is audio data extracted from the parser (13), and transmit it to a second audio decoder (22). The second audio decoder (22) may be configured to decode the audio data extracted from the parser (13).

[0161] The voice activity detector (23) may be configured to detect voice in the decoded audio data. The voice activity detector (23) may be configured to distinguish between a voice region where speech occurs and a pause region in the decoded audio data, and to detect the voice spoken in the voice region. The payloader module (24) may configure a data stream including a header for synchronization and a payload of data associated with the detected voice. The transmission module (25) may be configured to transmit the data stream containing the header and payload to an external device (400) via the communication module (173).
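
As an illustrative, non-limiting sketch, voice regions and pause regions in decoded PCM samples could be distinguished by comparing the energy of short frames against a threshold. The present disclosure does not specify a detection algorithm; the energy-based approach, the function name, the 20-millisecond frame length, and the threshold value are all assumptions made for this example, and a practical detector would typically be more elaborate.

```python
from typing import List, Sequence, Tuple


def detect_voice_regions(
    samples: Sequence[int],
    sample_rate: int,
    frame_ms: int = 20,
    energy_threshold: float = 500.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of regions treated as voice.

    A frame is treated as voice when its root-mean-square amplitude is at
    or above energy_threshold; consecutive voice frames form one region.
    Trailing samples that do not fill a whole frame are ignored.
    """
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")

    regions: List[Tuple[float, float]] = []
    region_start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = (sum(s * s for s in frame) / frame_len) ** 0.5
        t = i * frame_len / sample_rate
        if rms >= energy_threshold:
            if region_start is None:
                region_start = t  # a voice region begins here
        elif region_start is not None:
            regions.append((region_start, t))  # a pause region begins here
            region_start = None
    if region_start is not None:
        # The audio ended while still inside a voice region.
        regions.append((region_start, n_frames * frame_len / sample_rate))
    return regions
```

Only the samples falling inside the returned regions would then be handed to the payloader, which is what limits communication and processing to the portions in which voice is detected.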

[0162] The external device (400) may include a communication module (410) and a text generation module (420) that generates text based on AI. The text generation module (420) may include a source input module (421), a de-payloader module (422), a speech-to-text (STT) module (423), a text translation module (424), a payloader module (425), and a transmission module (426).

[0163] Meanwhile, the first audio data (PCM1) may be PCM data converted by pulse code modulation (PCM). The transmission data transmitted to the external device (400) may consist of a header and the first audio data (PCM1) following the header. The reception data received from the external device (400) may consist of a header followed by text data (TEXT) corresponding to a second voice into which the voice is translated. Alternatively, the reception data may consist of a header followed by second audio data (PCM2) in which the voice is translated into the second voice.

[0164] In this regard, FIG. 8 illustrates the header format of audio data and text data transmitted between a display device and an external device. Referring to FIG. 8, the header may be configured to include a sequence, a timestamp, a duration, a header length, and a payload length. The header may further be configured to include a stream type, audio information, text information, and video information.
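
As a non-limiting illustration, a header carrying the fields listed above can be serialized in front of a payload as sketched below. The field widths, the field order, the network byte order, and the names SyncHeader, build_packet, and parse_packet are assumptions made for this example; FIG. 8 is not reproduced here and the actual layout may differ.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

# Assumed layout (not taken from the disclosure): network byte order,
# stream_type u8, sequence u32, timestamp u64 (ms), duration u32 (ms),
# header_length u16, payload_length u32.
_HEADER_FMT = "!BIQIHI"
HEADER_SIZE = struct.calcsize(_HEADER_FMT)


@dataclass
class SyncHeader:
    stream_type: int
    sequence: int
    timestamp_ms: int
    duration_ms: int
    header_length: int
    payload_length: int


def build_packet(stream_type: int, sequence: int, timestamp_ms: int,
                 duration_ms: int, payload: bytes) -> bytes:
    """Prefix the payload (PCM or text bytes) with a synchronization header."""
    header = struct.pack(_HEADER_FMT, stream_type, sequence, timestamp_ms,
                         duration_ms, HEADER_SIZE, len(payload))
    return header + payload


def parse_packet(data: bytes) -> Tuple[SyncHeader, bytes]:
    """Read the header, then exactly payload_length bytes after it."""
    if len(data) < HEADER_SIZE:
        raise ValueError("truncated header")
    header = SyncHeader(*struct.unpack_from(_HEADER_FMT, data, 0))
    end = header.header_length + header.payload_length
    if len(data) < end:
        raise ValueError("truncated payload")
    return header, data[header.header_length:end]
```

Because the receiver learns the payload size from the header, it can wait for exactly that many bytes before starting STT processing, and it can echo the sequence and timestamp fields back with the translated text.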

[0165] With reference to FIGS. 1 to 8, the operation of a display device that displays subtitles based on synchronization information of a header in conjunction with an external device according to the present disclosure is described. In this regard, received data received from an external device (400) may consist of a header and text data.

[0166] Meanwhile, if playback of the second voice is requested, playback of the second voice can be performed on the display device (100) through the text-to-speech (TTS) engine of the display device (100). The text receiving module (30) may be configured to receive a header and text data (TEXT). The text data (TEXT) may be converted into a translated second voice through the TTS engine. The audio decoder (14) of the media playback module (10) may be configured to include a TTS engine, and the audio decoder (14) of the media playback module (10) may convert the text data (TEXT) into a translated second voice.

[0167] When a request is made to play a second voice, the audio decoder (14) may be configured to convert text data (TEXT) into a second voice. The converted second voice may be output through an audio sink module (15). As a request is made to play a second voice, the output of the voice of the first audio data needs to be stopped.

[0168] When the media playback module (10) is requested to play a second voice, the audio decoder (14) can perform decoding of the second audio data or convert text data (TEXT) into the second voice. Meanwhile, since decoding of the first audio data (PCM1) is performed in the voice detection module (20), the audio decoder (14) of the media playback module (10) can be configured to convert text data (TEXT) into the second voice. Accordingly, it is preferable that the received data transmitted from the external device (400) to the display device (100) consists only of text data (TEXT) without audio data.

[0169] The text receiving module (30) can determine the size of the first audio data (PCM1) based on the synchronization information in the header. The synchronization information in the header may include information regarding the sequence, timestamp, duration, and payload_length. The subtitle generation module (40) can synchronize the first audio data (PCM1) and the text data (TEXT) based on the size of the first audio data (PCM1). The subtitle generation module (40) can synchronize the first audio data (PCM1) and the text data (TEXT) based on the size of the first region containing voice in the first audio data (PCM1).
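
As a non-limiting illustration, the size of uncompressed PCM data determines its playback length, which is why the size of the first audio data (PCM1) can serve as a basis for synchronization. The sketch below assumes 16 kHz, mono, 16-bit PCM; these parameters and the function names pcm_duration_ms and subtitle_window are assumptions for this example and are not stated in the disclosure.

```python
from typing import Tuple


def pcm_duration_ms(num_bytes: int, sample_rate: int = 16000,
                    channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Playback length in milliseconds of a block of uncompressed PCM data."""
    frame_bytes = channels * bytes_per_sample
    if num_bytes % frame_bytes:
        raise ValueError("size is not a whole number of PCM frames")
    return num_bytes / frame_bytes * 1000 / sample_rate


def subtitle_window(start_ms: float, num_bytes: int) -> Tuple[float, float]:
    """Display interval for text tied to a voice region starting at start_ms."""
    return start_ms, start_ms + pcm_duration_ms(num_bytes)
```

For example, under these assumptions 32,000 bytes of PCM1 correspond to one second of speech, so the corresponding text data would be shown for one second from the timestamp of the voice region.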

[0170] The text receiving module (30) can determine the size of the second audio data (PCM2) corresponding to the text data (TEXT) based on the synchronization information of the header. The subtitle generation module (40) can synchronize the second audio data (PCM2) and the text data (TEXT) based on the size of the second area containing the second voice in the second audio data (PCM2). The subtitle generation module (40) can synchronize the second audio data (PCM2) and the text data (TEXT) based on the size of the second audio data (PCM2). The subtitle generation module (40) can be configured to display the text data (TEXT) synchronized to a specific frame of the screen as a translated subtitle. The media playback module (10) can output the second audio data (PCM2) synchronized to the translated subtitle of the text data (TEXT) through the voice output module (20).

[0171] Meanwhile, a display device that displays subtitles in conjunction with an external device according to the present disclosure may be configured such that a voice detection module (20) generates translated subtitles for audio data of a subsequent data stream in conjunction with an external device (400). With reference to FIGS. 1 to 8, an operation of generating translated subtitles in advance before audio data is decoded will be described.

[0172] In this regard, while the audio decoder (15) of the media playback module (10) decodes voice, the voice detection module (20) can generate translated subtitles for subsequent audio data in conjunction with an external device (400). Meanwhile, the pre-translated subtitles for the subsequent audio data can be stored in the storage unit (140).

[0173] In this regard, the data stream may include a first data stream and a second data stream following the first data stream. The processor (170) of the display device (100) configured to execute the aforementioned modules may be configured to store pre-translated text data in the storage unit (140).

[0174] The processor (170) can store translated second text data (TEXT2) for the second voice of the second data stream (DS2) from an external device (400) in the storage unit (140). The processor (170) can extract the translated first subtitle of the text data (TEXT) from the storage unit (140). The processor (170) can display the extracted first subtitle on the screen so as to be synchronized with the output of the first voice.

[0175] Meanwhile, a display device that displays translated subtitles in conjunction with an external device according to the present disclosure transmits decoded audio data, and the external device may perform only operations related to translation. In this regard, FIG. 9 shows a detailed configuration diagram of a display device that transmits sequence information and audio data to an external device and receives and displays subtitles from the external device. FIG. 10 shows a detailed configuration diagram of an external device that interacts with the display device displaying subtitles of FIG. 9. Referring to FIG. 9 and FIG. 10, the display device (100) may be configured to transmit decoded audio data to an external device (400).

[0176] With reference to FIGS. 1 to 3, FIG. 9 and FIG. 10, a display device that displays subtitles in conjunction with an external device according to the present disclosure will be described. In this regard, the display device (100) may be configured to include a media playback module (10), a voice detection module (20a), a communication module (173), a text reception module (30), and a subtitle generation module (40). In this regard, the media playback module (10), the voice detection module (20a), the text reception module (30), and the subtitle generation module (40) may be composed of software modules and executed by a processor (170).

[0177] The media playback module (10) may be configured to extract audio data from a data stream and transmit it to a voice detection module (20a). The voice detection module (20a) may be configured to collect first audio data in which voice is detected from the extracted audio data. The communication module (173) may be configured to transmit a first signal containing the first audio data to an external device (400).

[0178] The communication module (173) may be configured to receive a second signal from an external device (400) containing data necessary for voice to be displayed as a subtitle in a specific frame of the screen. The communication module (173) may be implemented as a short-range wireless communication module configured to transmit a first signal to the external device (400) and receive a second signal from the external device (400). The external device (400) may be implemented as a mobile terminal, a tablet terminal, a PC, or another display device. Meanwhile, the external device (400) may be connected to the display device (100) via a wired interface.

[0179] The voice detection module (20a) may be configured to include a source input module (21), a voice activity detector (VAD) (23), and a transmission module (25).

[0180] The source input module (21) may be configured to receive audio data extracted from the audio decoder (14). The source input module (21) may transmit the audio data extracted from the audio decoder (14) to the voice activity detector (23). The voice activity detector (23) may be configured to distinguish between a voice region where speech occurs and a pause region in the decoded audio data, and to detect the speech that is spoken in the voice region. The transmission module (25) may be configured to transmit the detected speech to an external device (400) through the communication module (173).

[0181] Accordingly, PCM data, which is audio data decoded by the media playback module (10), is transmitted to the voice detection module (20a). The voice detection module (20a) can select and collect only the first audio data in which voice is detected through the voice activity detector (23). Meanwhile, the external device (400) does not require media processing for the audio data, and the external device (400) can directly perform STT (speech to text) processing and translation. Therefore, the external device (400) can perform only AI-based processing related to text conversion and translation.

[0182] The text receiving module (30) may be configured to receive, from an external device (400) via the communication module (173), text data in which the voice, or a second voice into which the voice is translated, is converted into text. The subtitle generation module (40) may be configured to synchronize the text data based on one of the time stamp information of the first audio data, the size of the text data, or the synchronization information included in the header of the first audio data. The subtitle generation module (40) may be configured to synchronize the first audio data and the text data based on the time stamp information, the size of the audio / text data, and the synchronization information. The subtitle generation module (40) may be configured to display the synchronized text data as subtitles on a specific frame of the screen.

[0183] The media playback module (10) may be configured to transmit audio data to the voice detection module (20a) through a plurality of sub-modules. The media playback module (10) may be configured to include a source input module (SRC) (11), a demultiplexer (12), a parser (13), an audio decoder (14), and an audio sink module (15).

[0184] The source input module (11) may be configured to receive a data stream containing video data and audio data. The source input module (11) may be configured to receive a data stream containing video data and audio data from a content server (300) through a broadcast receiver (130). The source input module (11) may be configured to output the data stream containing video data and audio data to a demultiplexer (12). The demultiplexer (12) may be configured to classify video data, audio data, and control information from the data stream.

[0185] The parser (13) may be configured to parse audio data. The parser (13) may parse audio data to extract playback time, audio codec information, and audio metadata. The audio decoder (14) may be configured to decode and extract the parsed audio data. The audio decoder (14) may be configured to decode and extract audio data based on audio codec information and audio metadata. The audio data extracted through the audio decoder (14) may be transmitted to the voice detection module (20a).

[0186] The audio sink module (15) may be configured to output decoded audio data. The audio sink module (15) may be configured to output decoded audio data through an audio output unit (185), such as a speaker.

[0187] Meanwhile, a display device that displays subtitles in conjunction with an external device according to the present disclosure can perform synchronization between audio data and subtitles based on time stamp information. In this regard, FIG. 11 shows a configuration diagram of a display device and an external device that further includes a voice message tracking module in the embodiment of FIG. 9. FIG. 12 shows a detailed configuration diagram of an external device that interacts with the display device displaying subtitles of FIG. 11. Referring to FIG. 11 and FIG. 12, the display device (100) may be configured to transmit decoded audio data to an external device (400) along with sequence information associated with a time stamp.

[0188] With reference to FIGS. 1 to 3, FIG. 11, and FIG. 12, a display device that generates and controls subtitles based on time stamp information through a voice message tracking module (50) is described. In this regard, descriptions that overlap with the operation of FIGS. 9 and 10 are omitted, and the descriptions given for FIGS. 9 and 10 apply.

[0189] The media playback module (10) can extract decoded audio data and transmit it to the voice detection module (20a). The decoded audio data may consist of pulse code modulation (PCM) data. The voice detection module (20a) may be configured to include a source input module (21), a voice activity detector (VAD) (23), and a transmission module (25). The source input module (21) may be configured to receive audio data extracted from the audio decoder (14). The source input module (21) may transmit the audio data extracted from the audio decoder (14) to the voice activity detector (23). The voice activity detector (23) may be configured to distinguish between a voice region where speech occurs and a pause region in the decoded audio data, and to detect the voice spoken in the voice region. The transmission module (25) may be configured to transmit the detected voice to an external device (400) via the communication module (173).

[0190] Accordingly, PCM data, which is audio data decoded by the media playback module (10), is transmitted to the voice detection module (20a). The voice detection module (20a) can select and collect only the first audio data in which voice is detected through the voice activity detector (23). Only the first audio data in which voice is detected is transmitted to an external device (400) that performs AI offloading. Before transmitting the first audio data, time stamp information for the corresponding section can be recorded sequentially in the voice message tracker (50).

[0191] In the external device (400), media processing for audio data is not required, and the external device (400) can directly perform speech-to-text (STT) processing and translation. Therefore, the external device (400) can perform only AI-based processing related to text conversion and translation. Meanwhile, when text data is transmitted to the display device (100), received sequence information is transmitted along with it. When the text and sequence are received by the display device (100), the time between media playback and subtitles can be synchronized by referring to the time stamp information of the sequence recorded in the voice message tracker (50).

[0192] For the aforementioned operation, the display device (100) of FIG. 6 may be configured to further include a voice message tracker (50). Since the voice message tracker (50) is implemented as a software module, it may be referred to as a voice message tracking module.

[0193] The voice message tracker (50) may be configured to record timestamp information of the first audio data. The timestamp information of the first audio data may include information regarding the sequence, start timestamp, and end timestamp. The text receiving module (30) may be configured to receive synchronized text data and sequence information of the first audio data from an external device (400).
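
As a non-limiting illustration, the voice message tracker can be sketched as a table keyed by sequence number: a start and end timestamp are recorded before a voice segment is sent, and the same sequence number returned with the translated text is used to look them up. The class name VoiceMessageTracker, its method names, and the use of millisecond integers are assumptions for this example.

```python
from typing import Dict, Optional, Tuple


class VoiceMessageTracker:
    """Records (start, end) timestamps per transmitted voice segment."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._segments: Dict[int, Tuple[int, int]] = {}

    def record(self, start_ms: int, end_ms: int) -> int:
        """Store timestamps before transmission; return the sequence to send."""
        seq = self._next_seq
        self._segments[seq] = (start_ms, end_ms)
        self._next_seq += 1
        return seq

    def resolve(self, seq: int) -> Optional[Tuple[int, int]]:
        """Look up and release the timestamps for a returned sequence number."""
        return self._segments.pop(seq, None)
```

In such a sketch the external device never needs to interpret timestamps; it only returns the sequence number it received, and the display device maps it back to the playback position.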

[0194] The first audio data may consist of PCM data converted by pulse code modulation (PCM). The transmission data transmitted to the external device (400) may be a combination of transmission sequence information (Tx_seq) associated with time stamp information and the first audio data (PCM1), converted by PCM, following the transmission sequence information (Tx_seq).

[0195] Accordingly, the display device (100) can transmit only voice data to an external device (400) that performs an AI off-loading function. Additionally, before transmitting voice data to the external device (400), the display device (100) can sequentially record time stamp information for the corresponding segment of the voice data in the voice message tracker (50). Since communication between the display device (100) and the external device (400) occurs only when voice is detected, communication efficiency can be increased.

[0196] The received data from the external device (400) may be a combination of sequence information (seq) and second audio data (PCM2) following the sequence information (seq). In this regard, the second audio data (PCM2) is data in which the voice to be played on the display device (100) is translated into a second voice. The second audio data containing the translated second voice may be second PCM data converted by pulse code modulation. Alternatively, the received data from the external device (400) may be a combination of sequence information (seq) and text data (TEXT) following the sequence information (seq). The text data (TEXT) may correspond to the second audio data containing the translated second voice.

[0197] The subtitle generation module (40) may be configured to synchronize the first audio data (PCM1) with the text data (TEXT) corresponding to the translated second voice based on time stamp information and transmission sequence information (Tx_seq). Meanwhile, a request to play the translated second voice may be made via the remote control device (200) or by default settings. When a request to play the translated second voice is made, the subtitle generation module (40) may synchronize the second audio data (PCM2) with the text data (TEXT) corresponding to the translated second voice based on time stamp information and sequence information (seq). The subtitle generation module (40) may be configured to output the text data (TEXT) synchronized with the first or second audio data to the screen.

[0198] When the display device (100) does not manage time stamp information through a voice message tracker, the subtitle generation module (40) can set the period for displaying subtitles by considering the size of the text data. The subtitle generation module (40) can set the period for displaying subtitles in proportion to the size of the text data of the translated second voice.
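
As a non-limiting illustration, a display time proportional to the size of the text data can be computed as sketched below. The disclosure does not give a proportionality constant or say whether size is counted in characters or bytes; the per-character rate, the character-based measure, the lower and upper bounds, and the function name subtitle_duration_ms are all assumptions for this example.

```python
def subtitle_duration_ms(text: str, ms_per_char: int = 60,
                         min_ms: int = 1000, max_ms: int = 7000) -> int:
    """Display time proportional to text length, clamped to assumed bounds."""
    proportional = len(text) * ms_per_char
    return max(min_ms, min(max_ms, proportional))
```

The bounds are a design choice added here so that very short subtitles remain readable and very long ones do not stay on screen indefinitely; within the bounds the result is strictly proportional to the text length.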

[0199] The subtitle generation module (40) can synchronize the first audio data (PCM1) with the text data (TEXT) corresponding to the translated second voice based on the set period for displaying subtitles. When a request is made to play the translated second voice, the subtitle generation module (40) can synchronize the second audio data (PCM2) with the text data (TEXT) corresponding to the translated second voice based on the set period for displaying subtitles. The subtitle generation module (40) can be configured to output the text data (TEXT) synchronized with the first or second audio data to the screen.

[0200] Meanwhile, the subtitle generation module (40) can control the screen to flush the subtitle currently being displayed and to display a newly received subtitle. The subtitle generation module (40) can set a second period (D2) for displaying a second subtitle in proportion to the size of third audio data (PCM3), which follows the first audio data (PCM1) and in which a third voice is detected. The subtitle generation module (40) can synchronize the third audio data (PCM3) with second text data for the third voice based on the second period (D2). Meanwhile, the subtitle generation module (40) can control the screen to flush the displayed subtitle when the first period (D1) set for that subtitle expires. The subtitle generation module (40) can control the synchronized second text data to be displayed on the screen as the second subtitle for the second period (D2).
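
As a non-limiting illustration, the two flush conditions described above (arrival of a new subtitle, and expiry of the display period) can be sketched as a small state holder. The class name SubtitleScreen, the method names show and tick, and the millisecond clock passed in by the caller are assumptions for this example.

```python
from typing import Optional


class SubtitleScreen:
    """Holds at most one subtitle and the time at which it expires."""

    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.expires_at_ms: Optional[int] = None

    def show(self, text: str, now_ms: int, period_ms: int) -> None:
        # A newly received subtitle replaces (flushes) whatever is displayed.
        self.current = text
        self.expires_at_ms = now_ms + period_ms

    def tick(self, now_ms: int) -> Optional[str]:
        # Flush the displayed subtitle once its period has expired.
        if self.current is not None and now_ms >= self.expires_at_ms:
            self.current = None
            self.expires_at_ms = None
        return self.current
```

Calling tick once per rendered frame would return the subtitle to draw for that frame, or None when nothing should be shown.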

[0201] The first audio data (PCM1) and the third audio data (PCM3) may consist of the first PCM data and the third PCM data converted by pulse code modulation (PCM). The transmitted data sent to the external device (400) is the converted PCM data. The received data received from the external device (400) may be the second audio data (PCM2) in which the voice is translated into the second voice. Additionally, the received data received from the external device (400) may be the fourth audio data (PCM4) in which the third voice following the voice is translated into the fourth voice. The second audio data (PCM2) and the fourth audio data (PCM4) containing the translated second voice and the fourth voice may be the second PCM data and the fourth PCM data.

[0202] Meanwhile, the received data from the external device (400) may be implemented as text data (TEXT). The received data from the external device (400) may be first text data (TEXT1) corresponding to second audio data (PCM2) of the translated second voice. Additionally, the received data from the external device (400) may be second text data (TEXT2) corresponding to fourth audio data (PCM4) of the translated fourth voice.

[0203] Meanwhile, the processor (170) of the display device (100) configured to execute the aforementioned modules may selectively execute a switch between the control methods of the aforementioned embodiments based on the operating state and capability of the external device (400). In this regard, FIG. 13 shows the configuration of a display device configured to control the timing at which a data stream or audio data is transmitted to a transmission module.

[0204] Referring to FIGS. 1 through 13, the processor (170) may be configured to include a media playback module (10), a voice detection module (20a, 20), a transmission module (60), and a switching module (171). When power is supplied to an external device (400) or when the battery level is above a threshold, a data stream including video data and audio data may be transmitted to the external device (400) through the transmission module (60).

[0205] If power is not supplied to the external device (400) or the battery level is below a threshold, audio data parsed from the data stream through the parser (13) can be transmitted to the external device (400) through the voice detection module (20). Meanwhile, while the parsed audio data is being transmitted to the external device (400) through the voice detection module (20), it can be determined whether the translation processing speed at the external device (400) is below a threshold speed.

[0206] If the translation processing speed of the external device (400) is below the threshold speed, the external device (400) needs to receive audio data that is small in capacity or short in length. Accordingly, the switching module (171) can transmit the parsed audio data to the voice detection module (20), so that audio data decoded to a first sound quality is transmitted to the external device (400) through the voice detection module (20). The parsed audio data can be transmitted through the source input module (21) to the second audio decoder (22) and the voice activity detector (23) of the voice detection module (20) to perform decoding and voice detection.

[0207] Meanwhile, if the translation processing speed in the external device (400) is greater than or equal to the threshold speed, the switching module (171) can transmit audio data decoded from the audio decoder (15) of the media playback module (10) to the external device (400). Accordingly, the switching module (171) can transmit audio data decoded from the audio decoder (15) to the external device (400) through the voice detection module (20a). The decoded audio data can be transmitted to the voice activity detector (23) through the source input module (21) of the voice detection module (20a) to enable voice detection.

[0208] The processor (170) may be configured to receive status information and capability information of an external device (400) and to control the timing of transmitting a data stream to a transmission module (60) or transmitting audio data to a voice detection module (20a, 20). In this regard, FIG. 14 is a flowchart of a control method for controlling the transmission of a data stream to an external device or the transmission of parsed audio data according to the state of the external device according to the present disclosure. FIG. 15 is a flowchart of a control method for adaptively controlling the timing of transmitting audio data according to the state of the external device according to the present disclosure.

[0209] Referring to FIGS. 1 through 14, the control method may be performed by a processor (170) of a display device. The processor (170) may determine (S10) whether power is supplied to an external device (400) or whether the battery level of the external device (400) is above a threshold value. If power is supplied to the external device (400) or if the battery level is above the threshold value, the processor (170) may transmit a data stream to a transmission module (60) (S50). The processor (170) may transmit a data stream stored in a queue of the transmission module (60) to the external device (400) (S60).

[0210] When power is not supplied to the external device (400) or the battery level is below the threshold, a transition may be made from the data stream-based packet transmission method of FIGS. 10 and 11 to the parsed audio data-based packet transmission method of FIGS. 9 and 10. In this case, the processor (170) may transmit the audio data parsed from the data stream to the voice detection module (20) (S110). The processor (170) may transmit the decoded audio data to the external device (400) (S310).

[0211] Meanwhile, the packet transmission method may be switched by determining whether the data stream is a low-capacity MPEG-4 data stream with a high compression ratio. In this regard, the data stream transmitted from the media playback module (10) to the transmission module (60) may be one of an MPEG-4 data stream, a TS data stream, and an MKV data stream. More specifically, it may be one of an MPEG-4 data stream of a first quality, a TS data stream of a second quality higher than the first quality, and an MKV data stream. The MPEG-4 data stream may be a data stream of the first type, and the TS data stream and the MKV data stream may be data streams of the second type.

[0212] The processor (170) can determine whether the data stream is a first type of data stream of low quality / low capacity. If the data stream is a first type of data stream of low quality / low capacity, the processor (170) can transmit the data stream to the transmission module (60) (S50). The processor (170) can transmit the data stream stored in the queue of the transmission module (60) to an external device (400) (S60). If the data stream is a second type of data stream of high quality / high capacity, the processor (170) can transmit the audio data parsed from the data stream to the voice detection module (20) (S110). The processor (170) can transmit the decoded audio data to the external device (400) (S310).

[0213] The processor (170) can determine (S30) whether the processing speed associated with voice detection, text conversion, and translation by the external device (400) is greater than or equal to a threshold speed. If the processing speed associated with voice detection, text conversion, and translation is greater than or equal to the threshold speed, the processor (170) can transmit the data stream to the transmission module (60) (S50). The processor (170) can transmit the data stream stored in the queue of the transmission module (60) to the external device (400) (S60). If the processing speed associated with voice detection, text conversion, and translation is less than the threshold speed, the processor (170) can transmit the audio data parsed from the data stream to the voice detection module (20) (S110). The processor (170) can transmit the decoded audio data to the external device (400) (S310).
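
As a non-limiting illustration, the determinations described for FIG. 14 can be sketched as a single selection function. The disclosure describes the power/battery check, the stream-capacity check, and the processing-speed check (S30) separately; chaining them so that the full data stream is sent only when all three favor it is an assumption of this example, as are the names TxMode, ExternalDeviceState, and select_tx_mode and the numeric threshold values.

```python
from dataclasses import dataclass
from enum import Enum


class TxMode(Enum):
    DATA_STREAM = "data_stream"    # S50 / S60: send the whole data stream
    PARSED_AUDIO = "parsed_audio"  # S110 / S310: send audio via voice detection


@dataclass
class ExternalDeviceState:
    powered: bool            # external power is supplied
    battery_level: float     # assumed range 0.0 .. 1.0
    processing_speed: float  # assumed relative speed; 1.0 = real time


def select_tx_mode(state: ExternalDeviceState, low_capacity_stream: bool,
                   battery_threshold: float = 0.3,
                   speed_threshold: float = 1.0) -> TxMode:
    """Choose what to transmit based on the external device's state."""
    # Power / battery check (S10).
    if not (state.powered or state.battery_level >= battery_threshold):
        return TxMode.PARSED_AUDIO
    # High-capacity streams are not forwarded whole.
    if not low_capacity_stream:
        return TxMode.PARSED_AUDIO
    # Processing-speed check (S30).
    if state.processing_speed < speed_threshold:
        return TxMode.PARSED_AUDIO
    return TxMode.DATA_STREAM
```

Sending the whole stream shifts demultiplexing, decoding, and voice detection to the external device, which is why the sketch falls back to parsed audio whenever the device appears constrained.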

[0214] Referring to FIG. 15, a transition can be made between a packet transmission method based on parsed audio data and a packet transmission method based on decoded audio data. Referring to FIGS. 1 through 15, the processor (170) can determine (S90) whether the external device (400) can detect synchronization information of the header. That is, it can be determined (S90) whether the external device (400) supports synchronization based on NAIS header information such as the time stamp, duration, and header length of FIG. 8.

[0215] In this regard, the capability information of the external device (400) refers to the ability to detect header information and to detect and translate the voice payload from the audio data. If the external device (400) does not have the capability to detect the NAIS header, the processor (170) can transmit and receive audio data based on the configuration of FIGS. 9 to 12. Accordingly, the processor (170) can transmit and receive data streams based on sequence information or subtitle display periods according to the size of the text data.

[0216] The status information of the external device (400) may include information regarding the accuracy with which the external device (400) converts speech into text and its translation processing speed. In this regard, the audio decoder (15) of the media playback module (10) performs decoding at a quality level suitable for output through the audio sink module (16). Meanwhile, the second audio decoder (22) of the voice detection module (20) performs decoding at a quality level sufficient to detect speech from the audio data. The text conversion accuracy based on the audio data transmitted through the second audio decoder (22) may be lower than the text conversion accuracy based on the audio data transmitted through the audio decoder (15).

[0217] If the external device (400) can detect the synchronization information of the header, the processor (170) can transmit the audio data extracted through the parser (13) to the voice detection module (20) (S110). The processor (170) can decode the audio data extracted through the parser (13) into a first sound quality through the second audio decoder (22) (S210). The processor (170) can transmit the audio data decoded into the first sound quality to the external device (400) (S310).

[0218] The processor (170) can determine (S350) whether the text conversion accuracy of the external device (400) is below a threshold ratio. For example, the threshold ratio for the text conversion accuracy can be set to 90%, 95%, 97%, 98%, 99%, etc., but is not limited thereto and can be changed depending on the application.

[0219] If the text conversion accuracy is below the threshold ratio, the processor (170) can decode (S220) the audio data into a second sound quality through the audio decoder (15) of the media playback module (10). The processor (170) can transmit (S320) the audio data decoded into the second sound quality to an external device (400) through the voice detection module (20a). The second sound quality can be set or implemented to be higher than the first sound quality.

[0220] If the text conversion accuracy is above a threshold ratio, the audio data extracted through the parser (13) can be transmitted to the voice detection module (20) (S110). The processor (170) can control (S210) the audio data extracted through the parser (13) to be decoded into a first sound quality through the voice detection module (20). The processor (170) can transmit the audio data decoded into the first sound quality to an external device (400) (S310).

[0221] Meanwhile, if the external device (400) cannot detect or does not support the synchronization information of the header, the processor (170) can decode (S220) the audio data into a second sound quality through the audio decoder (15) of the media playback module (10) based on the configuration of FIGS. 9 to 12. The processor (170) can transmit (S320) the audio data decoded into the second sound quality to the external device (400) through the voice detection module (20a). The second sound quality may be set or implemented as a higher quality than the first sound quality.

[0222] The processor (170) receives status information and capability information of the external device (400) and can control the timing of transmitting the data stream to the transmission module (60) or transmitting audio data to the voice detection module (20a, 20). If the processing speed of the external device (400) is greater than or equal to a threshold speed, the processor (170) can transmit the data stream to the transmission module (60) (S50). If the processing speed of the external device (400) is less than the threshold speed, the audio data extracted through the parser (13) can be transmitted to the voice detection module (20) (S110).

[0223] When audio data decoded through the audio decoder (15) is transmitted to the external device (400) through the voice detection module (20a), the time required for the external device (400) to perform translation processing can be reduced compared to the method of transmitting audio data to the external device (400) through the voice detection module (20) before it is transmitted to the audio decoder (15).

[0224] Accordingly, the processor (170) can determine (S360) whether the translation processing speed of the external device (400) is below a threshold speed. If the translation processing speed of the external device (400) is below a threshold speed, the processor (170) can transmit the audio data extracted through the parser (13) to the voice detection module (20) (S110). The processor (170) can control (S210) the audio data extracted through the parser (13) to be decoded into a first sound quality through the voice detection module (20). The processor (170) can transmit the audio data decoded into the first sound quality to the external device (400) (S310).

[0225] The translation processing speed of the external device (400) may be defined as the speed of the operation associated with the text translation module (424). Alternatively, it may be defined as the speed of the operations from the depayloader module (422) to the payloader module (425). If the translation processing speed of the external device (400) is greater than or equal to a threshold speed, the processor (170) may determine (S350) whether the text conversion accuracy of the external device (400) is less than or equal to a threshold ratio. The processing speed / time associated with voice detection, text conversion, and translation of the external device (400) may be defined as the speed / time of the operations of the text generation module (420) and the voice detection module (430). Alternatively, it may be defined as the speed / time of the operations from the demultiplexer (432) to the text translation module (424). Therefore, the processing associated with voice detection, text conversion, and translation may take longer than the translation processing alone.

[0226] Meanwhile, if the text conversion accuracy of the external device (400) is below the threshold ratio and the translation processing speed is above the threshold speed, it is necessary to improve the text conversion accuracy. To improve the text conversion accuracy, audio data decoded by the audio decoder (15) into a second sound quality higher than the first sound quality, based on the configuration of FIGS. 9 to 12, can be transmitted to the external device (400).
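
The selection among the transmission paths described in paragraphs [0217] to [0226] can be sketched as follows. This is only a minimal illustration under stated assumptions, not the disclosed implementation: the names `ExternalStatus`, `AudioPath`, and `choose_audio_path`, the default threshold values, the units of the speed fields, and the order in which the individual checks are combined are all hypothetical. Only the step labels in the comments are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class AudioPath(Enum):
    # Whole data stream handed to the transmission module (60).
    DATA_STREAM = "S50"
    # Parser (13) -> second audio decoder (22) -> external device (400).
    FIRST_QUALITY = "S110-S210-S310"
    # Audio decoder (15) -> voice detection module (20a) -> external device (400).
    SECOND_QUALITY = "S220-S320"


@dataclass
class ExternalStatus:
    """Hypothetical container for status/capability information."""
    detects_header_sync: bool   # can the device use the header sync information?
    processing_speed: float     # voice detection + text conversion + translation
    translation_speed: float    # translation only
    text_accuracy: float        # text conversion accuracy, 0.0 to 1.0


def choose_audio_path(status, speed_threshold=1.0,
                      translation_threshold=1.0, accuracy_threshold=0.95):
    """Pick a path; the ordering of the checks is an assumption."""
    # Fast external device: send the whole data stream (S50).
    if status.processing_speed >= speed_threshold:
        return AudioPath.DATA_STREAM
    # No header synchronization support: use the playback decoder path.
    if not status.detects_header_sync:
        return AudioPath.SECOND_QUALITY
    # S360: slow translation, so send audio early at the first sound quality.
    if status.translation_speed < translation_threshold:
        return AudioPath.FIRST_QUALITY
    # S350: translation is fast enough but accuracy is low, so raise quality.
    if status.text_accuracy < accuracy_threshold:
        return AudioPath.SECOND_QUALITY
    return AudioPath.FIRST_QUALITY
```

In this sketch the comparison against the accuracy threshold is strict, whereas the text uses both "below" and "less than or equal to"; which boundary applies is left open here.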

[0227] Meanwhile, it is necessary to determine whether to transmit only PCM-based audio data to an external device (400) as in FIG. 5, or to transmit to an external device (400) a combination of sequence information and PCM-based audio data as in FIG. 7. If the speech times of multiple speakers overlap in the audio data, the speeches must be separated based on accurate time stamp information to generate translated subtitles. On the other hand, if the speech times do not overlap and the speech of a single speaker is detected in each speech time interval, it is possible to generate translated subtitles for each speech without sequence information containing time stamp information.

[0228] In this regard, FIG. 16 shows a flowchart of a control method for transmitting audio data of a different format to an external device depending on whether the speech times of multiple speakers overlap. With reference to FIGS. 1 to 16, a control method for transmitting audio data of a different format performed by a processor (170) is described.

[0229] The processor (170) can determine (S201) whether the speech times of multiple speakers included in the audio data extracted through the audio decoder (15) of the media playback module (10) overlap. If the speech times overlap, the processor (170) can transmit (S301), to the external device (400), the audio data together with sequence information including the start time stamp and end time stamp of each speaker's speech time. The processor (170) can receive (S401) the sequence information, text data of the translated subtitles, and / or second audio data from the external device (400). The processor (170) can display the translated subtitles of the text data synchronized with the speech times according to the sequence information.
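
The overlap determination (S201) and the use of sequence information for display can be illustrated as follows. This is a minimal sketch under assumptions: the `Utterance` type, the dictionary layout of the sequence information, and the function names are hypothetical, and time stamps are taken to be seconds on a common playback clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str
    start: float  # start time stamp of the speech time, in seconds
    end: float    # end time stamp of the speech time, in seconds


def speech_times_overlap(utterances):
    """Return True if any two speech times overlap (S201)."""
    ordered = sorted(utterances, key=lambda u: u.start)
    # After sorting by start time, any overlap shows up between neighbours.
    return any(prev.end > nxt.start for prev, nxt in zip(ordered, ordered[1:]))


def build_sequence_info(utterances):
    """Sequence information sent along with the audio data (S301)."""
    return [{"speaker": u.speaker, "start": u.start, "end": u.end}
            for u in utterances]


def active_subtitles(sequence_info, translations, playback_time):
    """Translated subtitles whose speech time covers the playback time.

    `translations` maps an index into `sequence_info` to translated text.
    """
    return [translations[i]
            for i, entry in enumerate(sequence_info)
            if i in translations
            and entry["start"] <= playback_time < entry["end"]]
```

With two speakers talking over each other, both subtitles are selected during the shared interval, which is the case where separate start and end time stamps per speaker are needed.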

[0230] If the speech times of each speaker do not overlap, the processor (170) can sequentially transmit (S301) segments of audio data corresponding to the speech times to an external device (400). The processor (170) can receive (S402) text data of translated subtitles and / or second audio data from the external device (400). The processor (170) can set the respective periods (S410) for the segments of audio data corresponding to speech times that are distinguished and do not overlap in the time domain. The processor (170) can display (S502) each of the translated subtitles of the segments of text data corresponding to the segments of audio data for a respective period set according to the sizes of the segments of text data.
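
For the non-overlapping case, setting a display period for each segment according to the size of its text (S410, S502) might look like the sketch below. The per-character duration, the minimum period, and the rule that a subtitle is cut off when the next segment starts are illustrative assumptions, and the function names are hypothetical.

```python
def display_period(text, seconds_per_char=0.25, min_seconds=1.0):
    """Display period derived from the size of the text (assumed linear)."""
    return max(min_seconds, len(text) * seconds_per_char)


def schedule_subtitles(segments):
    """Build (start, end, text) entries for non-overlapping segments.

    `segments` is a list of (segment_start_seconds, translated_text),
    ordered by start time.
    """
    schedule = []
    for idx, (start, text) in enumerate(segments):
        end = start + display_period(text)
        if idx + 1 < len(segments):
            # Assumption: never keep a subtitle on screen past the next one.
            end = min(end, segments[idx + 1][0])
        schedule.append((start, end, text))
    return schedule
```

Because no sequence information is exchanged in this branch, the schedule relies only on the order in which segments were sent and on the length of the returned text.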

[0231] The foregoing has described a display device that displays subtitles in conjunction with an external device in a content provision system according to the present disclosure. The technical effects of the display device that displays subtitles in conjunction with an external device in a content provision system according to the present disclosure may be summarized as follows, but are not limited thereto.

[0232] According to the present specification, as an external device generates translated subtitles while a display device plays video content, the difficulty of displaying translated subtitles on the screen at the time the video content is played can be resolved.

[0233] According to the present specification, a data stream containing audio is transmitted to an external device, and the external device decodes the audio to generate AI-based translated text, allowing the display device to focus on playing content on the screen.

[0234] According to the present specification, a method is provided for synchronizing subtitles generated in real time by an external device, by transmitting audio data to the external device before the media is decoded on a webOS-based display device.

[0235] According to the present specification, media currently being played can be transmitted to an external device to generate subtitles in real time using artificial intelligence (AI), and the subtitles can be accurately aligned with the timing of media playback to enhance the user's viewing experience.

[0236] According to the present specification, it is possible to determine whether to transmit a data stream or parsed audio data to an external device by considering the power status of the external device and the processing speed associated with voice detection, text conversion, and translation. Accordingly, translation work can be performed adaptively depending on the state of the external device.

[0237] According to the present specification, it is possible to determine whether to transmit the data stream or the parsed audio data to an external device by considering the type of the data stream. Accordingly, translation work can be performed in collaboration with an external device in a manner optimized for the data stream.

[0238] According to the present specification, the timing of transmitting audio data to be translated to an external device can be adaptively controlled depending on the state of the external device, such as its text conversion accuracy and translation processing speed.

[0239] The foregoing disclosure may be implemented as computer-readable code on a medium on which a program is recorded. A computer-readable medium includes all types of recording devices in which data that can be read by a computer system is stored. Examples of computer-readable media include a Hard Disk Drive (HDD), a Solid State Disk (SSD), a Silicon Disk Drive (SDD), ROM, RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc., and also include implementations in the form of a carrier wave (e.g., transmission over the Internet). Additionally, the computer may include a control unit (180) of a terminal. Accordingly, the above detailed description should not be interpreted restrictively in any respect and should be considered exemplary. The scope of the invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the invention are included within the scope of the invention.

Claims

1. A display device that interworks with an external device, comprising: a transmission module configured to transmit a data stream including video data and audio data to the external device via a communication module; a media playback module configured to transmit the data stream to the transmission module, parse the audio data from the data stream, and decode the parsed audio data; a text receiving module configured to receive, from the external device through the communication module, text data in which voice detected in the data stream is converted into text, and timestamp information; and a subtitle generation module configured to synchronize the audio data extracted from the data stream with the text data based on the timestamp information, wherein the subtitle generation module displays the synchronized text data as a subtitle in a specific frame of the screen.

2. The display device of claim 1, wherein the media playback module includes: a source input module configured to receive the data stream including the video data and the audio data; a demultiplexer configured to classify video data, audio data, and control information in the data stream; a parser configured to parse the audio data; a decoder configured to decode the parsed audio data; and an audio sink module configured to output the decoded audio data, and wherein the extracted audio data is transmitted to the transmission module through the source input module.

3. The display device of claim 2, wherein the transmission module includes: a second source input module configured to receive the data stream extracted from the source input module; a queue configured to store the data stream; and a TCP sink module configured to transmit the data stream stored in the queue to the external device via TCP.

4. The display device of claim 2, further comprising a processor configured to control the operation of the external device through the transmission module and the text receiving module, wherein the processor controls a text generation module configured to generate text data from voice included in the audio data extracted from the data stream.

5. The display device of claim 4, wherein the processor: controls the text generation module to extract and decode the audio data from the data stream transmitted through the transmission module; controls the text generation module to collect first audio data in which voice is detected from the decoded audio data; controls the text generation module to generate text data in which the voice is converted into text; and controls the text generation module to translate the text data into multiple languages.

6. The display device of claim 5, wherein the text generation module controlled through the processor includes: a demultiplexer configured to classify video data, audio data, and control information in the data stream; an audio parser configured to parse the audio data; an audio decoder configured to decode the parsed audio data; and a voice activity detector (VAD) configured to detect the voice in the decoded audio data.

7. The display device of claim 6, wherein the text generation module controlled through the processor further includes: an STT (speech-to-text) module configured to convert the voice detected through the voice activity detector in the decoded audio data into the text data; a text translation module configured to generate translated text data by translating the converted text data into multiple languages; and a TCP transmission module configured to transmit the translated text data to the text receiving module of the display device via WebVTT.

8. The display device of claim 7, wherein the text receiving module includes: a third source input module configured to receive a second data stream containing the translated text data received from the text generation module; and a transmission module configured to transmit the second data stream containing the text data to the subtitle generation module in a WebVTT manner.

9. The display device of claim 8, wherein the subtitle generation module: extracts a second time stamp of the second data stream transmitted from the text receiving module in the WebVTT manner; detects translated text data having the second time stamp corresponding to a first time stamp of the audio data; controls the subtitles of the translated text data having the second time stamp to be synchronized with the audio data of the first time stamp; and displays the synchronized translated text data as subtitles at a specific frame of the screen corresponding to the first time stamp.

10. The display device of claim 4, wherein the data stream transmitted from the media playback module to the transmission module is one of an MPEG-4 data stream of a first quality, a TS data stream of a second quality higher than the first quality, and an MKV data stream, wherein the TS data stream and the MKV data stream are data streams of a first type and the MPEG-4 data stream is a data stream of a second type, and wherein the data stream transmitted from the transmission module to the external device is the first type of data stream or the second type of data stream.

11. The display device of claim 10, wherein the second data stream containing the text data transmitted from the external device to the text receiving module is a WebVTT data stream.

12. The display device of claim 10, wherein, if the external device is a mobile terminal, the transmission module converts the first type data stream of the first image quality into the second type data stream and transmits the converted second type data stream to the external device.

13. The display device of claim 5, wherein the processor: transmits the data stream to the external device through the transmission module when power is supplied to the external device or its battery level is above a threshold; and transmits audio data parsed from the data stream to the external device through a voice detection module when the external device is not powered or its battery level is below the threshold.

14. The display device of claim 10, wherein the processor: transmits audio data parsed from the data stream to the external device through the voice detection module if the data stream is the first type of data stream; and transmits the data stream to the external device through the transmission module if the data stream is the second type of data stream.

15. The display device of claim 10, wherein the processor: transmits audio data parsed from the data stream to the external device through the voice detection module if the processing speed associated with voice detection, text conversion, and translation by the external device is less than a threshold speed; and transmits the data stream to the external device through the transmission module if the processing speed is greater than or equal to the threshold speed.

16. The display device of claim 15, wherein the voice detection module includes: a source input module configured to receive audio data extracted from the parser of the media playback module; a second audio decoder configured to decode the extracted audio data; a voice activity detector configured to detect the voice from the decoded audio data; a payloader module configured to compose a data stream including a header for synchronization and a payload of data associated with the detected voice; and a transmission module configured to transmit the data stream containing the header and the payload to the external device through the communication module.

17. The display device of claim 16, wherein the data stream includes a first data stream and a second data stream following the first data stream, and wherein the processor: while decoding the first voice of the first data stream, transmits the second data stream to the external device, or parses audio data from the second data stream and transmits it to the voice detection module; stores, in a storage unit, translated second text data for the second voice of the second data stream received from the external device; extracts the first translated subtitle of the text data from the storage unit; and displays the extracted first subtitle on the screen so as to be synchronized with the output of the first voice.

18. The display device of claim 17, wherein the processor is configured to receive status information and capability information of the external device and to control the timing of transmitting the data stream to the transmission module or transmitting the audio data to the voice detection module, wherein the processor: transmits the data stream to the transmission module if the processing speed of the external device is greater than or equal to the threshold speed; decodes the audio data extracted through the parser through the voice detection module and transmits the audio data decoded to a first sound quality to the external device if the processing speed is less than the threshold speed; and transmits audio data decoded into a second sound quality through the audio decoder of the media playback module to the external device through the voice detection module if the text conversion accuracy of the external device is below a threshold ratio, and wherein the second sound quality is higher than the first sound quality.

19. The display device of claim 18, wherein, if the translation processing speed of the external device is below a threshold speed, the processor decodes the audio data extracted through the parser through the voice detection module and transmits the audio data decoded to the first sound quality to the external device.

20. The display device of claim 17, wherein the processor: determines whether the speech times of multiple speakers included in the audio data extracted through the audio decoder of the media playback module overlap; transmits, when the speech times overlap, the audio data and sequence information including the start time stamp and end time stamp of each speaker's speech time to the external device; and displays translated subtitles of the text data in synchronization with the speech times according to the sequence information.