Structured video documents

The system generates a semantically rich structured document from audio and image data to address the inefficiencies in existing video player interfaces, enabling semantic query understanding and interactive content retrieval.

JP7883592B2 · Active · Publication Date: 2026-07-01 · GOOGLE LLC

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
GOOGLE LLC
Filing Date
2023-03-02
Publication Date
2026-07-01

AI Technical Summary

Technical Problem

Existing video player interfaces lack the ability to semantically understand user queries and efficiently retrieve specific content, relying on inefficient scrubbing and keyword searches that do not account for the meaning of queries or higher-level content.

Method used

A system that generates a semantically rich structured document from audio and image data, using a large-scale language model to process user queries and provide coherent responses, incorporating author-provided text and speaker diarization to enhance interaction with video content.

Benefits of technology

Enables efficient retrieval of specific video content through semantic interpretation of queries, providing accurate and interactive responses without requiring creator-provided structured text, thus enhancing user experience.



Abstract

The method (500) includes receiving a content feed (120) including audio data (122) corresponding to a voice utterance (123) and processing the content feed to generate a semantically rich structured document (300). The structured document includes a transcript (310) of the voice utterance and further includes a plurality of words (312), each of the plurality of words being aligned with a corresponding audio segment (222) indicating a time at which the word was recognized in the audio data. During playback of the content feed, the method also includes receiving a query (112) from a user requesting information included in the content feed and processing the query and the structured document with a large-scale language model (180) to generate a response (182) to the query. The response conveys the requested information included in the content feed. The method also includes providing the response to the query for output from a user device (102) associated with the user.

Description

Technical Field

[0001] This disclosure relates to structured video documents.

Background Art

[0002] Video is a common environment where users watch entertainment, news, and educational content. However, there are still challenges with video for users who use it as an information medium because of the limited ability to search for and retrieve video content. In the case of information-based tasks, users typically interact with the user interface of a timeline-based video player to scrub the video forward / backward to find specific content of interest. With the function of generating transcripts / captions for the dialogues in the video, users can enter keyword searches to find relevant content in the transcripts / captions, and the function of searching for content in the video has been improved to some extent. However, these user interfaces that use transcripts / captions to search for content do not have the ability to understand the meaning of queries spoken (or typed) for specific content in the video, let alone the ability to execute queries with semantically required information.

Summary of the Invention

[0003] One aspect of this disclosure provides a computer-implemented method. When executed on data processing hardware, the computer-implemented method causes the data processing hardware to perform operations that include receiving a content feed containing audio data corresponding to a speech utterance, and processing the content feed to generate a semantically rich structured document. The structured document includes a transcript of the speech utterance and contains multiple words, each of which is aligned with a corresponding audio segment of the audio data indicating the time the word was recognized in the audio data. During playback of the content feed, the operations also include receiving a query from a user requesting information contained in the content feed, and processing the query and the structured document using a large language model to generate a response to the query, where the response conveys the requested information contained in the content feed. The operations also include providing the response to the query for output from a user device associated with the user.

[0004] Embodiments of the present disclosure may include one or more of the following optional features. In some embodiments, the operation also includes extracting a segment of a transcript containing the requested information conveyed by the response to a query, wherein the segment of the transcript is bounded by a start word and an end word; identifying the start audio segment of the audio data as the corresponding audio segment of the audio data aligned with the start word that bounds the segment of the transcript; and identifying the end audio segment of the audio data as the corresponding audio segment of the audio data aligned with the end word that bounds the segment of the transcript. In those embodiments, providing the response to a query includes replaying the audio data from the start audio segment to the end audio segment of the audio data from the user's device relating to the user. The content feed may further include image data comprising multiple image frames, and the operation further includes pausing the playback of the multiple image frames of the image data while replaying the audio data from the start audio segment to the end audio segment of the audio data.
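The mapping from a transcript segment to a replay window can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the `AlignedWord` type, the `find_phrase` and `replay_window` helpers, and the sample timings are all hypothetical, and the word-level start and end times are assumed to come from the alignment stored in the structured document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedWord:
    """A transcript word with the time span of its aligned audio segment (hypothetical type)."""
    text: str
    start_s: float
    end_s: float


def find_phrase(words: List[AlignedWord], phrase: str) -> Tuple[int, int]:
    """Return the indices of the start word and end word that bound `phrase`."""
    tokens = phrase.lower().split()
    lowered = [w.text.lower() for w in words]
    for i in range(len(lowered) - len(tokens) + 1):
        if lowered[i:i + len(tokens)] == tokens:
            return i, i + len(tokens) - 1
    raise LookupError(f"phrase not found: {phrase!r}")


def replay_window(words: List[AlignedWord], start_idx: int, end_idx: int) -> Tuple[float, float]:
    """Return (start, end) in seconds of the audio to replay for the bounded segment."""
    if not (0 <= start_idx <= end_idx < len(words)):
        raise ValueError("segment bounds out of range")
    return words[start_idx].start_s, words[end_idx].end_s


# Made-up sample alignment for illustration only.
words = [
    AlignedWord("add", 12.0, 12.3),
    AlignedWord("half", 12.3, 12.6),
    AlignedWord("a", 12.6, 12.7),
    AlignedWord("teaspoon", 12.7, 13.2),
    AlignedWord("of", 13.2, 13.3),
    AlignedWord("cumin", 13.3, 13.9),
]
start_idx, end_idx = find_phrase(words, "half a teaspoon of cumin")
window = replay_window(words, start_idx, end_idx)
```

A player could then seek the audio to `window[0]`, play until `window[1]`, and hold the image frames paused in the meantime.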

[0005] In some examples, the content feed further includes image data containing multiple image frames, and the semantically rich structured document further includes author-provided text recognized within one or more of the multiple image frames. Here, the author-provided text is aligned with corresponding audio segments of audio data so as to indicate the time when the author-provided text was recognized within one or more image frames. In these examples, processing the content feed to generate a semantically rich structured document may further include annotating the transcript of the spoken utterance with the author-provided text by inserting the author-provided text between adjacent pairs of words in the transcript, based on the corresponding audio segments of audio data aligned with the author-provided text recognized within one or more image frames.
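One possible way to annotate a transcript with author-provided text is sketched below. The function name `annotate_transcript`, the bracket notation for inserted text, and the single timestamp per word and per overlay are assumptions made for illustration; the disclosure only states that the text is inserted between adjacent words based on the aligned audio segments.

```python
from typing import List, Tuple


def annotate_transcript(
    words: List[Tuple[str, float]],
    overlays: List[Tuple[str, float]],
) -> List[str]:
    """Insert each overlay between the adjacent pair of words whose times surround it.

    words:    (word, recognized_at_seconds), sorted by time.
    overlays: (author_provided_text, recognized_at_seconds), in any order.
    """
    pending = sorted(overlays, key=lambda o: o[1])
    out: List[str] = []
    k = 0
    for text, t in words:
        # Emit every overlay recognized at or before this word's time.
        while k < len(pending) and pending[k][1] <= t:
            out.append(f"[{pending[k][0]}]")
            k += 1
        out.append(text)
    # Overlays recognized after the last word go at the end.
    out.extend(f"[{text}]" for text, _ in pending[k:])
    return out


# Made-up sample data for illustration only.
annotated = annotate_transcript(
    [("add", 1.0), ("the", 1.4), ("cumin", 1.8), ("now", 2.5)],
    [("1/2 tsp cumin", 2.0)],
)
```

Here the on-screen text recognized at 2.0 seconds lands between "cumin" (1.8 s) and "now" (2.5 s), so a language model reading the annotated transcript sees the amount next to the spoken mention.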

[0006] A response to a query may include a text response that conveys the requested information as a coherent and focused response to the query. In some embodiments, the operation may also include performing text-to-speech conversion on the text response to generate a synthesized speech representation of the response to the query, and providing the response to the query to the output from the user device may include outputting the synthesized speech representation of the response to the query so that it is audible from the user device. In these embodiments, the operation may further include pausing playback of the content feed while outputting the synthesized speech representation of the response to the query so that it is audible from the user device. Furthermore, the text response to the query may further include one or more references to source material related to the requested information.

[0007] In some examples, the large-scale language model includes a pre-trained large-scale language model, which performs few-shot learning using structured documents as context for the query to generate a response to the query. The query may include a question in natural language, and the response to the query may include a natural language response to the question.
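Using the structured document as context for a pre-trained model can be illustrated with a simple few-shot prompt builder. This is a hedged sketch: the prompt layout, the `build_prompt` function, and the example texts are hypothetical, and no particular model or API is implied by the source.

```python
from typing import List, Tuple


def build_prompt(
    structured_document: str,
    query: str,
    examples: List[Tuple[str, str, str]],
) -> str:
    """Assemble a few-shot prompt: worked (document, question, answer) examples,
    followed by the real structured document and the user's query."""
    parts = []
    for doc, question, answer in examples:
        parts.append(f"Document:\n{doc}\nQuestion: {question}\nAnswer: {answer}\n")
    # The final block is left open so the model completes the answer.
    parts.append(f"Document:\n{structured_document}\nQuestion: {query}\nAnswer:")
    return "\n".join(parts)


# Made-up sample content for illustration only.
prompt = build_prompt(
    structured_document="[speaker_1] add the cumin [1/2 tsp cumin] now",
    query="How much cumin?",
    examples=[(
        "[speaker_1] preheat the oven [180 C] first",
        "What temperature?",
        "180 C.",
    )],
)
```

The resulting string would be passed to whichever language model the application uses; the model call itself is intentionally left out.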

[0008] In some embodiments, processing a content feed to generate a semantically rich structured document includes: dividing audio data into multiple audio segments; performing speaker diarization on the multiple audio segments to predict diarization results that include the corresponding speaker labels assigned to each audio segment; and indexing the transcript of the speech utterances using the corresponding speaker labels assigned to each audio segment separated from the audio data.
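Indexing a transcript by diarization labels can be sketched as below. The tuple layouts and the `index_by_speaker` helper are assumptions for illustration; the diarization itself (predicting the labels) is taken as given.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def index_by_speaker(
    words: List[Tuple[str, float]],
    segments: List[Tuple[float, float, str]],
) -> Dict[str, List[str]]:
    """Group transcript words under the speaker label of the segment they fall in.

    words:    (word, recognized_at_seconds)
    segments: (start_seconds, end_seconds, speaker_label), half-open [start, end)
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for word, t in words:
        for start, end, label in segments:
            if start <= t < end:
                index[label].append(word)
                break
    return dict(index)


# Made-up sample data for illustration only.
by_speaker = index_by_speaker(
    [("hello", 0.5), ("everyone", 1.0), ("thanks", 2.2)],
    [(0.0, 2.0, "speaker_1"), (2.0, 4.0, "speaker_2")],
)
```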

[0009] Another aspect of the present disclosure provides a system including data processing hardware and memory hardware that communicates with the data processing hardware. The memory hardware stores instructions, which, when executed by the data processing hardware, cause the data processing hardware to perform operations including receiving a content feed containing audio data corresponding to spoken utterances, and processing the content feed to generate a semantically rich structured document. The structured document includes a transcript of the spoken utterances and contains multiple words, each of which is aligned with a corresponding audio segment of the audio data indicating the time the word was recognized in the audio data. During playback of the content feed, the operation also includes receiving a query from a user requesting information contained in the content feed, and processing the query and the structured document using a large language model to generate a response to the query, where the response conveys the requested information contained in the content feed. The operation also includes providing the response to the query to the output from a user device relating to the user.

[0010] This embodiment may include one or more of the following optional features. In some embodiments, the operation also includes extracting a segment of a transcript containing the requested information conveyed by the response to a query, wherein the segment of the transcript is bounded by a start word and an end word; identifying the start audio segment of the audio data as the corresponding audio segment of the audio data aligned with the start word that bounds the segment of the transcript; and identifying the end audio segment of the audio data as the corresponding audio segment of the audio data aligned with the end word that bounds the segment of the transcript. In those embodiments, providing a response to a query includes replaying the audio data from the start audio segment to the end audio segment of the audio data from the user device relating to the user. The content feed may further include image data comprising multiple image frames, and the operation further includes pausing the playback of the multiple image frames of the image data while replaying the audio data from the start audio segment to the end audio segment of the audio data.

[0011] In some examples, the content feed further includes image data containing multiple image frames, and the semantically rich structured document further includes author-provided text recognized within one or more of the multiple image frames. Here, the author-provided text is aligned with corresponding audio segments of audio data so as to indicate the time when the author-provided text was recognized within one or more image frames. In these examples, processing the content feed to generate a semantically rich structured document may further include annotating the transcript of the spoken utterance with the author-provided text by inserting the author-provided text between adjacent pairs of words in the transcript, based on the corresponding audio segments of audio data aligned with the author-provided text recognized within one or more image frames.

[0012] A response to a query may include a text response that conveys the requested information as a coherent and focused response to the query. In some embodiments, the operation may also include performing text-to-speech conversion on the text response to generate a synthesized speech representation of the response to the query, and providing the response to the query to the output from the user device may include outputting the synthesized speech representation of the response to the query so that it is audible from the user device. In these embodiments, the operation may further include pausing playback of the content feed while outputting the synthesized speech representation of the response to the query so that it is audible from the user device. Furthermore, the text response to the query may further include one or more references to source material related to the requested information.

[0013] In some examples, the large-scale language model includes a pre-trained large-scale language model, which performs few-shot learning using structured documents as context for the query to generate a response to the query. The query may include a question in natural language, and the response to the query may include a natural language response to the question.

[0014] In some embodiments, processing a content feed to generate a semantically rich structured document includes: dividing audio data into multiple audio segments; performing speaker diarization on the multiple audio segments to predict diarization results that include the corresponding speaker labels assigned to each audio segment; and indexing the transcript of the spoken utterances using the corresponding speaker labels assigned to each audio segment separated from the audio data.

[0015] Details of one or more embodiments of this disclosure are shown in the accompanying drawings and the following description. Other embodiments, features, and advantages will become apparent from the description and drawings, as well as from the claims.

Brief Description of the Drawings

[0016]
[Figure 1] A schematic diagram of an exemplary system that allows users to interact with videos using structured documents generated about the video.
[Figure 2] A schematic diagram of an exemplary document structuring unit for generating a structured document from audio and image data of an audiovisual feed.
[Figure 3] A schematic diagram of an exemplary structured document, including transcripts, author-provided text, and annotated transcripts.
[Figure 4] A schematic diagram of an exemplary video interface that displays information from a generated structured document about the video while the video is playing.
[Figure 5] A flowchart of an exemplary sequence of operations for interacting with a content feed using structured documents while the content feed is playing.
[Figure 6] A schematic diagram of an exemplary computing device that can be used to carry out the systems and methods described herein.

Modes for Carrying Out the Invention

[0017] In various drawings, the same reference numeral indicates the same component.

[0018] Video players used by media playback applications and web browsers allow users to issue commands to control video playback. For example, users can not only play / pause / stop videos, but also scan forward / backward via dedicated buttons / commands or by scrubbing along a video timeline. Recent advances in automatic speech recognition (ASR) have made it possible for users to issue these video playback commands via voice. While the video timeline feature allows users to preview visual content frame by frame, users often need to scrub back and forth multiple times along the timeline to find content of interest. For example, if a user is watching a video explaining how to cook a particular dish, and the presenter / instructor / participant (also called "speaker") lists the ingredients and their respective proportions, and the user doesn't have time to figure out the proportion of one of the ingredients, the user must manually scrub backward through the video to replay the part where the instructor lists the ingredients. Clearly, scrubbing along a timeline is an inefficient and time-consuming process for users trying to find content of interest. Furthermore, the interactive timeline search is limited to the visual content of each video frame and does not retrieve audio content or higher-level content such as plot / scene descriptions or topics.

[0019] Some video player user interfaces leverage the text transcripts and captions of the audio content within a video to support keyword searches entered by the user. In the example above, where a user is watching a cooking instruction video, if the speaker mentions the proportion of cumin, the user may be able to enter (speak or text) the keyword "how much cumin?" and find the proportion of cumin in the audio content transcript. However, if the instructor / presenter only mentions a list of ingredients by name without specifying the proportion of each ingredient needed for the dish, and instead the creator of the cooking instruction video provides / adds a creator-provided chart that visually shows the proportions of each ingredient, the user will not be able to find the information they are looking for through a keyword search because that information is not in the transcript.

[0020] Furthermore, information extracted from audio transcripts and captions in response to keyword searches is often time-consuming to read and difficult to find the necessary content. This is because audio transcripts and captions tend to contain the hesitations and redundancies typical of audio. Transcripts simply present long blocks of text, and captions simply contain sequences of short phrases. Therefore, the lack of structured organization limits users' ability to browse specific topics within a video or to get an overview of any kind of video content.

[0021] Some of the inherent shortcomings of existing technologies for searching for specific content within a video can be addressed by allowing video creators to embed structured text to create a navigable representation of the video, enabling users / viewers to search for content they query. In addition to transcripts and captions, creators can embed structured text documents in their videos that convey key topics, chapter titles, plot summaries, and summaries of different segments of the video. While using creator-provided structured text documents can be somewhat effective in enabling users / viewers to find specific content within a video, the vast majority of creators are unwilling to undertake this due to the resources and costs required to create and embed the necessary structured text documents in their videos. Furthermore, even if creators are willing to create structured text documents for their video content, the content selected in response to user searches will only be as extensive as the structured text documents that creators choose to embed in their videos. In other words, it is simply impossible for creators to anticipate all kinds of content that might be subject to user searches in order to include it in structured text documents. By the same logic, when users enter queries to find content of interest within a video, embedding a creator-provided structured text document into the video will never provide a truly interactive experience for the user / viewer, because they cannot interpret the meaning of the query in the same way as it is integrated by the creator-provided structured text document.

[0022] Embodiments herein relate to automatically generating a semantically rich structured document about a content feed (i.e., video) to enable semantic interpretation of queries asking for information contained in the content feed. Referring to FIG. 1, system 100 includes user 2 who views content feed 120 played on computing / user device 10 via media player application 150. Media player application 150 may be a stand-alone application executed on user device 10 or a web application accessed via a web browser. In the illustrated example, content feed 120 includes a recorded cooking recipe video played on computing device 10 for user 2 to view and interact with. In the examples herein, content feed 120 is shown as an audio-visual (AV) feed (e.g., video) that includes both audio data 122 (e.g., audio content, audio signal, or audio stream) and image data 124 (e.g., image content or video content), but content feed 120 may be an audio-only feed that includes only audio data 122, such as a podcast episode or an audio book, without limitation. For simplicity, content feed 120 may be referred to herein as synonymous with video, AV signal, AV feed, or simply AV data, unless otherwise specified.

[0023] System 100 also includes a remote system 130 that communicates with the computing device 10 via a network. The remote system 130 may be a distributed system with scalable / elastic resources (e.g., a cloud computing environment or a storage abstraction). The resources include computing resources 134 (e.g., data processing hardware) and / or storage resources 136 (e.g., memory hardware). In some embodiments, the remote system 130 hosts a media player application 150 (e.g., on the computing resources 134), coordinates the playback of the content feed 120 on the computing device 10, generates semantically rich structured documents 300 about the content feed 120, and uses the structured documents 300 to enable user 2 to interact with the content feed 120 via the computing device 10 by issuing queries 112 requesting information contained in the content feed 120 during its playback. For example, the data processing hardware 134 of the remote system 130 can execute instructions stored in the memory hardware 136 of the remote system 130 for running the application 150. Additionally or alternatively, the media player application 150 may run on the computing device 10 associated with user 2. For example, the data processing hardware 12 of the computing device 10 can execute instructions stored in the memory hardware 14 of the computing device 10 for running the application 150. Some examples of the data processing hardware 12 include a central processing unit (CPU), a graphics processing unit (GPU), or a tensor processing unit (TPU).

[0024] The computing device 10 includes, or communicates with, a display 11 capable of displaying a video interface 400 for presenting image data 124, and a speaker 18 for audible output of audio data 122. The audio data 122 may correspond to voice utterances 123 spoken by a performer(s), instructor(s), narrator, participant in a gathering, host, or other individual recorded in the video 120. Some examples of the computing device 10 include computers, laptops, mobile computing devices, smart TVs, monitors, smart devices (e.g., smart speakers, smart displays, smart appliances), and wearable devices. In the illustrated example, the content feed 120 includes a recorded cooking instruction video that is played on the computing device 10 for user 2 to view and interact with. Embodiments herein relate to a media player application 150 that provides an interactive experience to user 2 during playback of a cooking instruction video 120, which enables user 2 to issue natural language queries 112 requesting information contained in the video 120, and thus application 150 retrieves the requested information using a semantically rich structured document 300 generated about the video 120 and provides user 2 with a response 182 containing the requested information.

[0025] User 2 can issue query 112 as an audio query 112 captured in streaming audio by a microphone 16 that communicates with computing device 10. Application 150 (or another application) can then perform speech recognition to convert audio query 112 into its corresponding text representation. Alternatively, user 2 can input query 112 via input device 20, which can include a physical keyboard or a virtual keyboard presented for display on video interface 400 that communicates with computing device 10. Input device 20 can also include a mouse, stylus, or graphical user interface, whereby the user can input a query 112 that requests information about an object displayed in image data 124 (e.g., words in closed captions, words / phrases in creator-provided text, entities shown in a scene of the video) by selecting the object or hovering over the object.

[0026] For example, cooking instruction video 120 plays a segment in which the performer mixes a series of ingredients needed to make a popular Thai seafood curry dish called Haw Mok Talay. If user 2 did not catch the amount of cumin mentioned earlier while playing video 120, they can issue query 112, "How much cumin?" to confirm the amount of cumin. The user does not need to manually scrub back through video 120 to find the segment in which the amount of cumin is mentioned; application 150 can retrieve the amount of cumin (i.e., the requested information) from a semantically rich structured document 300. For example, if the performer clearly states the amount of cumin in a voice utterance 123, application 150 can retrieve the amount of cumin from the transcript 310 of the audio data 122. As another type of information, the structured document 300 may also include author-provided text 320 recognized within image data 124. For example, the creator of video 120 may temporarily display creator-provided text 320 in image data 124 indicating that the recipe requires half a teaspoon of cumin. Thus, application 150 can extract the amount of cumin from the creator-provided text 320 contained in structured document 300, regardless of whether the performer explicitly stated the amount in the voice utterance 123, and provide a response 182 to the user's query 112 informing them that half a teaspoon of cumin is required.

[0027] The media player application 150 includes a document structuring unit 200, a large-scale language model 180, and an output module 190. The document structuring unit 200 is configured to receive / incorporate and process an audiovisual feed 120 to generate a semantically rich structured document 300. In particular, the document structuring unit 200 can automatically generate a structured document 300 for an imported audiovisual feed 120 without requiring the creator of the audiovisual feed 120 to provide any structured text, or to ask the creator to contribute to the creation of the structured document 300. Thus, the document structuring unit 200 can import any new or existing content feed 120 and generate a structured document 300 on the fly without input from the creator of the feed 120.

[0028] The audiovisual feed 120 captured by the document structuring unit 200 includes audio data 122 and image data 124. The audio data 122 can characterize the speech utterance 123, and the image data 124 may include multiple image frames 125 (125a to n (Figure 2)). As described above, the content feed 120 may include an audio-only feed 120 that includes only the audio data 122. The structured document 300 includes a transcript 310 of the speech utterance 123. Referring to Figure 2, the transcript 310 includes multiple words, and the structured document 300 aligns each word in the transcript 310 with the corresponding audio segment 222 (Figure 2) of the audio data 122. The corresponding audio segment 222 (Figure 2) of the audio data 122 indicates the time when the word was recognized in the audio data 122. That is, the structured document 300 includes a timestamp for each word in the transcript 310.
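A per-word timestamp of this kind can be modeled with a small data structure. The sketch below assumes fixed-length audio segments and uses hypothetical names (`Word`, `StructuredDocument`, `add_word`, `timestamp_of`); the disclosure does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    segment: int  # index of the audio segment the word is aligned with


@dataclass
class StructuredDocument:
    segment_len_s: float  # assumed fixed segment length, in seconds
    words: List[Word] = field(default_factory=list)

    def add_word(self, text: str, recognized_at_s: float) -> None:
        """Align a recognized word with the segment that contains its recognition time."""
        self.words.append(Word(text, int(recognized_at_s // self.segment_len_s)))

    def timestamp_of(self, text: str) -> Optional[float]:
        """Return the start time of the segment aligned with the first match, if any."""
        for word in self.words:
            if word.text == text:
                return word.segment * self.segment_len_s
        return None


# Made-up sample data for illustration only.
doc = StructuredDocument(segment_len_s=0.5)
doc.add_word("cumin", 13.3)
```

With half-second segments, a word recognized at 13.3 seconds is aligned with segment 26, whose start time serves as the word's timestamp.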

[0029] In some embodiments, the document structuring unit 200 also processes the image data 124 to determine whether any author-provided text 320 is recognized in the image data 124. In these embodiments, the structured document 300 generated by the document structuring unit 200 also includes any author-provided text 320 recognized in one or more image frames 125a-n (Figure 2) of the image data 124. The document structuring unit 200 may recognize any author-provided text 320 in each image frame 125 using techniques such as optical character recognition (OCR). The author-provided text 320 used here may include charts of text (any combination of letters, words, or other symbols) superimposed on the scenes shown in the image frames by the video creator to convey necessary content to the user / viewer 2. The author-provided text 320 may also include any text recognized in the actual scenes shown in the image frames. The structured document 300 can align any recognized author-provided text 320 with the corresponding audio segment 222 (Figure 2) of the audio data 122 to indicate the time when the author-provided text 320 was recognized in one or more image frames 125.

[0030] The document structuring unit 200 can further perform other processing techniques (e.g., speaker diarization, summarization, and formatting, without limitation) on the captured audiovisual feed 120, and save the results of these processing techniques in the structured document 300. Speaker diarization answers the question of "who is speaking when" and has various applications. Some examples of these applications include retrieving multimedia information, speaker turn analysis, audio processing, and automatic transcription of conversational audio. The document structuring unit 200 can utilize a text generation model. The text generation model takes in the transcript 310 and / or author-provided text 320 and outputs key topics or corresponding summaries for one or more different segments of the audiovisual feed 120. Formatting can identify different chapters / scenes within the audiovisual feed 120.

[0031] Figure 2 shows an example of a document structuring unit 200, which includes a diarization module 220, an automatic speech recognition (ASR) module 230, an optical character recognition (OCR) module 240, and a generator 250. Application 150 runs the ASR module 230 to generate a transcript 310 of speech utterances 123 spoken by one or more speakers (e.g., actors / participants) in a content feed 120 (e.g., an audiovisual signal including audio data 122 and image data 124, or an audio-only signal including only audio data 122).

[0032] The diarization module 220 is configured to receive audio data 122 corresponding to utterances 123 from one or more speakers in a content feed 120 (and optionally image data 124 showing the faces of the speaker(s)), divide the audio data 122 into multiple segments 222 (222a-n) (e.g., fixed-length segments or variable-length segments), and generate a diarization result 224 containing corresponding speaker labels 226 assigned to each segment 222 using a probabilistic model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally image data 124). That is, the diarization module 220 performs a series of speaker recognition tasks on short utterances (e.g., segments 222) to determine whether two segments 222 of a given conversation were spoken by the same speaker. Simultaneously, the diarization module 220 can run a face tracking routine to identify which participant is speaking in which segment 222 and further refine speaker recognition. The diarization module 220 is then configured to repeat the above process for all segments 222 of the conversation. Here, the diarization result 224 assigns time-stamped speaker labels 226 (226a-n) to the received audio data 122. The time-stamped speaker labels 226, 226a-n not only identify who is speaking in a given segment 222, but also identify when the speaker changed between adjacent segments 222.
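How per-segment speaker labels yield both "who spoke when" and the speaker change points can be sketched as follows. The sketch assumes fixed-length segments and already-predicted labels; the `speaker_turns` helper is hypothetical, and the probabilistic model that predicts the labels is outside its scope.

```python
from typing import List, Tuple


def speaker_turns(labels: List[str], segment_len_s: float) -> List[Tuple[str, float, float]]:
    """Merge consecutive segments with the same label into (label, start_s, end_s) turns."""
    turns: List[Tuple[str, float, float]] = []
    for i, label in enumerate(labels):
        start = i * segment_len_s
        end = start + segment_len_s
        if turns and turns[-1][0] == label:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1] = (label, turns[-1][1], end)
        else:
            turns.append((label, start, end))
    return turns


# Made-up labels for four one-second segments, for illustration only.
turns = speaker_turns(["A", "A", "B", "A"], segment_len_s=1.0)
# A speaker change occurs at the start of every turn after the first.
change_points = [start for _, start, _ in turns[1:]]
```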

[0033] The ASR module 230 is configured to receive audio data 122 corresponding to the utterance 123 (and optionally image data 124 representing the face of the speaker(s) of the utterance 123). The ASR module 230 transcribes the audio data 122 into a corresponding ASR result 232, where the ASR result 232 refers to a text transcript of the audio data 122 (e.g., transcript 310) or a group of text transcript candidates. In some examples, the ASR module 230 communicates with the diarization module 220 to utilize the diarization result 224 associated with the audio data 122 to improve speech recognition based on the utterance 123. For example, the ASR module 230 may apply different speech recognition models (e.g., language model, prosodic model) to different speakers identified from the diarization result 224. Additionally or alternatively, the ASR module 230 and / or the diarization module 220 (or any other component of application 150) may index the transcript 310 of the audio data 122 using the time-stamped speaker label 226 predicted for each segment 222 obtained from the diarization result 224. As shown in Figure 2, the transcript 310 for the content feed 120 may be indexed speaker by speaker to identify what each speaker said, associating multiple parts of the transcript 310 with their respective speakers.

[0034] In some embodiments, the document structuring unit 200 receives captions for speech utterances 123 that were previously generated by another application or provided by the creator of the content feed 120. The captions may be aligned / timestamped with the audio segment 222 of the audio data 122 to indicate when the captions for utterances 123 were spoken. The captions can be used as a transcript 310 without requiring the ASR module 230 to process the audio data 122, or the captions can be used in combination with or to improve the recognition result 232 for the transcript 310 generated by processing the audio data 122. In some examples, if the captions do not contain punctuation, the document structuring unit 200 adds punctuation to the previously generated captions to improve the accuracy of the response 182 generated by the large language model 180.

[0035] The transcript 310 of the utterance 123 included in the structured document 300 also includes alignment information 315. The alignment information 315 aligns each word 312 of the multiple words 312 (312a-n) in the transcript 310 (Figure 3) with the corresponding audio segment 222 of the audio data 122 that indicates the time when that word was recognized.

[0036] The OCR module 240 is configured to recognize any author-provided text 320 that may be present in one or more image frames 125a-n of the image data 124. The OCR module 240 may include an OCR machine learning model (e.g., a recognition engine) 244 trained to recognize any author-provided text 320 within each image frame 125. As used herein, the author-provided text 320 may include graphical text (any combination of letters, words, or other symbols) that the creator of the video superimposes on the scene depicted in the image frame 125 to convey relevant content to the user / viewer 2. The author-provided text 320 may also include any text recognized in the actual scene depicted in the image frame 125. In some examples, the OCR module 240 includes an OCR data store 242 that the OCR machine learning model 244 accesses to recognize specific fonts, symbols, or text patterns. The author-provided text 320 recognized in one or more image frames 125a-n may further include corresponding alignment information 322. Here, the alignment information 322 provides an alignment between any recognized author-provided text 320 and the corresponding audio segment 222 (Figure 2) of the audio data 122 that indicates the time when the author-provided text 320 was recognized within one or more image frames 125.

[0037] In some scenarios, the ASR module 230 uses the recognized author-provided text 320 and corresponding alignment information 322 to improve the accuracy of the transcript 310. Referring to the example above where the content feed 120 contains a cooking instruction video, the ASR module 230 may produce a recognition result 232 that misidentifies the name of the Thai dish "Haw Mok Talay" as "Hamook Taley". Similarly, previously generated captions may misidentify the name of the dish. In some cases, part of the author-provided text 320 recognized in the image data 124 may contain the phrase "Haw Mok Talay". In some examples, the correct spelling ("Haw Mok Talay") may be a less confident hypothesis in the list of candidate hypotheses included in the speech recognition result 232. Therefore, if there is a match with the phrase "Haw Mok Talay" present in the recognized author-provided text 320, the confidence of the candidate hypothesis "Haw Mok Talay" in the recognition result 232 increases, and as a result, it is ultimately selected to be included in the transcript 310.
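The confidence adjustment described above can be sketched in Python as follows. The additive `boost`, the case-insensitive substring match, and the confidence values are assumptions chosen for illustration; the specification only states that a match with the recognized author-provided text increases the confidence of the matching hypothesis.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (candidate text, confidence in [0, 1])


def select_hypothesis(hypotheses: List[Hypothesis],
                      author_text: str,
                      boost: float = 0.3) -> str:
    """Raise the confidence of any hypothesis that also appears in the
    author-provided text, then return the highest-confidence candidate."""
    author_lower = author_text.lower()
    rescored = []
    for text, confidence in hypotheses:
        if text.lower() in author_lower:
            # Assumed rescoring rule: simple additive boost, capped at 1.0.
            confidence = min(1.0, confidence + boost)
        rescored.append((text, confidence))
    return max(rescored, key=lambda item: item[1])[0]


# Illustrative confidences (made up); the misrecognition starts out ahead.
candidates = [("Hamook Taley", 0.55), ("Haw Mok Talay", 0.40)]

print(select_hypothesis(candidates, "Haw Mok Talay"))
print(select_hypothesis(candidates, ""))
```

With the on-screen text present, the correctly spelled candidate overtakes the misrecognition; without it, the original ranking is unchanged.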

[0038] In some embodiments, the generator 250 receives a transcript 310, author-provided text 320 recognized in one or more image frames 125, and corresponding alignment information 315, 322, and generates a structured document 300 by annotating the transcript 310 of the speech utterance 123 with the author-provided text 320. In these embodiments, the alignment information 315, 322 can indicate which parts of the speech utterance 123 the author-provided text 320 is likely relevant to. For example, the generator 250 can insert the author-provided text 320 between adjacent pairs of words in the transcript 310 based on the corresponding audio segments 222 of the audio data 122 aligned with the author-provided text 320 recognized in one or more image frames 125.

[0039] Figure 3 shows an example of a semantically rich structured document 300 generated by the document structuring unit 200 in Figures 1 and 2 for an audiovisual feed 120 corresponding to a cooking instruction video. The transcript 310 relates to a speaker's utterance 123 describing part of the procedure for making satay marinade for the Thai dish "Haw Mok Talay". The transcript 310 contains multiple words 312 (312a-n), and the corresponding alignment information 315 aligns each word 312a-n with the corresponding audio segment 222 of the audio data 122 indicating the time when that word 312 was recognized. The author-provided text 320 includes "1.5 tsp coriander" and "0.5 tsp cumin," indicating the respective amounts of coriander and cumin needed for the satay marinade. Notably, these amounts are not included in the transcript 310 because the speaker does not make any utterance 123 stating them. The corresponding alignment information 322 provides the alignment between the author-provided text 320 and the corresponding audio segment 222 (Figure 2) of the audio data 122, which indicates the time when the author-provided text 320 was recognized within one or more image frames 125.

[0040] In the illustrated example, the structured document 300 also includes an annotated transcript 330. The annotated transcript 330 uses the alignment information 315, 322 to insert author-provided text 320 between adjacent pairs of words in the transcript 310 (for example, "anything" and "Coriander") based on the corresponding audio segments 222 of the audio data 122 that are aligned with the author-provided text 320 recognized within one or more image frames 125. Here, the annotated transcript 330 includes author-provided text 320 inserted at the appropriate locations in the transcript 310.
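One way to picture the insertion of author-provided text between adjacent words is the short Python sketch below. It assumes that each word and each piece of author-provided text is reduced to a single timestamp and that inserted text is marked with square brackets; both choices, the function name, and the timings are illustrative assumptions rather than details taken from the specification.

```python
import bisect
from typing import List, Tuple

Timed = Tuple[str, float]  # (text, time in seconds), assumed layout


def annotate_transcript(words: List[Timed], author_text: List[Timed]) -> str:
    """Insert each author-provided text item between the pair of adjacent
    words whose timestamps surround the time the text was recognized."""
    times = [time for _, time in words]
    tokens = [word for word, _ in words]
    # Process later items first so that earlier insert positions,
    # computed against the original word list, remain valid.
    for text, time in sorted(author_text, key=lambda item: item[1], reverse=True):
        position = bisect.bisect_right(times, time)
        tokens.insert(position, f"[{text}]")
    return " ".join(tokens)


# Illustrative word timings (made up); the overlay strings come from Figure 3.
words = [("anything", 10.0), ("Coriander", 11.0), ("and", 11.4), ("cumin", 11.8)]
overlays = [("1.5 tsp coriander", 10.5), ("0.5 tsp cumin", 11.6)]

print(annotate_transcript(words, overlays))
```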

[0041] Referring again to Figure 1, in some embodiments, during playback of a content feed (e.g., an audiovisual feed) 120, the large-scale language model 180 is configured to receive a semantically rich structured document 300 and a query 112 issued by user 2 as input, and to produce, as output, a response 182 that conveys the requested information contained in the content feed 120 in response to the query 112. In some examples, the query 112 includes a natural language question, and the response 182 to the query 112 includes a natural language response that provides an answer to the question. For example, the response 182 to the query 112 may include a text response that conveys the requested information as a coherent and focused response to the query 112, generated by the large-scale language model 180. In some examples, the large-scale language model 180 may further augment the coherent / focused response 182 to the query 112 by referring to source material to emphasize the reliability of the information contained in the response 182. In other words, the text response 182 to the query 112 may include one or more references to source material relating to the requested information (e.g., a link, for an entity mentioned in the response 182, that can lead user 2 to further information). In addition to generating text to provide a natural language response / answer 182 to a natural language query 112, the large-scale language model 180 may perform other generation tasks (e.g., generating natural language text that summarizes one or more parts of a structured document 300).

[0042] The large-scale language model 180 may include a pre-trained large-scale language model 180 that is pre-trained on general world knowledge using one or more generative tasks (i.e., multi-task learning) to learn highly effective contextual representations. Thus, the large-scale language model 180 may include a Multitask Unified Model (MUM). The pre-trained large-scale language model 180 may be based on a transformer model, a conformer model, or another encoder / decoder architecture with a multi-head attention mechanism. For example, the pre-trained large-scale language model 180 may include one encoder branch for encoding the structured document 300, another encoder branch for encoding the query 112, and a common decoder that receives both encodings and generates a response answering the query. Notably, transformer / conformer models can be efficiently parallelized for training large-scale language models and have been shown to generalize better and achieve significantly better performance than language models based on autoregressive neural network architectures such as recurrent neural network models. The pre-trained large-scale language model 180 may contain more than one billion parameters, and may contain upwards of one trillion parameters.

[0043] Embodiments herein relate to a pre-trained large-scale language model 180 that performs few-shot learning using a structured document 300 as context for generating a response 182 to a query 112. That is, few-shot learning fine-tunes the parameters of the pre-trained large-scale language model 180 to adapt it to the downstream task of retrieving requested information contained in an audiovisual feed 120 in response to a query 112 issued by user 2. Few-shot learning is particularly useful for tasks with limited available training data, because the language model 180 can generalize well based on a structured document 300 that provides specific labeled examples for improving the retrieval of relevant information (e.g., information contained in the audiovisual feed 120 currently being viewed by user 2). By providing the query 112 and the structured document 300 as an input pair to the pre-trained large-scale language model 180, the structured document 300 is labeled as relevant for generating a response 182 as an output. The large-scale language model 180 can also perform zero-shot learning tasks, in which the language model 180 relies on the general world knowledge it already possesses when generating responses 182 to queries 112.

[0044] After the large-scale language model 180 generates the response 182, the output module 190 is configured to provide the response 182 for output from the user computing device 10. The output module 190 may include any combination of a playback settings controller 190a, a user interface (UI) generator 190b, and a text-to-speech (TTS) system 190c. Continuing with the example in which the audiovisual feed 120 includes a cooking instruction video, user 2 may provide the query 112, “What, how were they toasted?” upon realizing that they missed the details the presenter gave in the video about how to toast the coriander seeds used in the recipe. The response 182 may include the answer “They were toasted in a dry saute pan.” In some examples, the output module 190 receives the response 182 and the structured document 300 as input and extracts a segment of the transcript 310 (and / or a segment of the annotated transcript 330 (Figure 3)) containing the requested information conveyed by the response 182 to the query 112. For example, a segment extracted from the transcript 310 might contain "You want to toast them in a dry saute pan," which is bounded by the start word "You" and the end word "pan." The output module 190 can then identify the start audio segment 222 (Figure 2) of the audio data 122 as the corresponding audio segment aligned with the start word that bounds the segment of the transcript 310, and the end audio segment 222 of the audio data 122 as the corresponding audio segment aligned with the end word that bounds the segment of the transcript 310. Using the identified start audio segment 222 and end audio segment 222, the output module 190 can instruct the playback settings controller 190a to replay the audio data 122 from the start audio segment to the end audio segment as audible output from the speaker 18, so that the utterance 123 "You want to toast them in a dry saute pan" is replayed and thereby conveys the response 182 to the query 112. Notably, the playback settings controller 190a can pause the playback of the multiple image frames 125 of the image data 124 while replaying the requested audio data 122. In some examples, the controller 190a pauses the playback of the audiovisual feed 120 in response to receiving the query 112.
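The mapping from an extracted transcript segment to a replay range can be sketched as follows. The sketch assumes each word is stored with the start and end time of its aligned audio segment and that the extracted segment is located by exact word match; the function name, the tuple layout, the leading word "Okay", and the timings are hypothetical.

```python
from typing import List, Optional, Tuple

AlignedWord = Tuple[str, float, float]  # (word, segment start, segment end), assumed


def replay_span(aligned_words: List[AlignedWord],
                excerpt: str) -> Optional[Tuple[float, float]]:
    """Find the excerpt in the transcript and return the start time of the
    audio segment aligned with its first word and the end time of the
    audio segment aligned with its last word."""
    tokens = excerpt.split()
    if not tokens:
        return None
    words = [word for word, _, _ in aligned_words]
    count = len(tokens)
    for index in range(len(words) - count + 1):
        if words[index:index + count] == tokens:
            start_time = aligned_words[index][1]
            end_time = aligned_words[index + count - 1][2]
            return start_time, end_time
    return None


# Illustrative alignment: each word gets a made-up 0.5 s audio segment.
sentence = "Okay You want to toast them in a dry saute pan".split()
aligned = [(word, 62.0 + 0.5 * i, 62.5 + 0.5 * i) for i, word in enumerate(sentence)]

print(replay_span(aligned, "You want to toast them in a dry saute pan"))
```

A playback controller could then seek to the returned start time and stop at the returned end time while video frames stay paused.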

[0045] In some additional examples, the output module 190 instructs the TTS system 190c to perform TTS conversion on the text response 182 output from the large-scale language model 180 to generate a synthesized speech representation of the response 182 to the query 112. The output module 190 can use the TTS system 190c in a scenario where the requested information conveyed by the response 182 to the query 112 is not present in the transcript 310 and is therefore never conveyed in the speech utterance 123. In such a case, there is no opportunity to replay any part of the audio data 122 to convey the requested information. For example, referring to Figure 3, the response 182 to the query 112, "How much cumin?", can only be ascertained by the large-scale language model 180 from the author-provided text 320, as is also evident from the annotated transcript 330. In this example, the output module 190 can receive the text response 182, "half a teaspoon of cumin seeds," generated by the large-scale language model 180 as a response to the query 112, and instruct the TTS system 190c to perform text-to-speech conversion on the text response 182, generating a synthesized speech representation of the response 182 to the query 112. The media player application 150 can then output the synthesized speech representation audibly from the speaker 18 of the computing device 10. Notably, the playback settings controller 190a can completely pause playback of the audiovisual feed 120 while the synthesized speech representation conveying the response 182 "half a teaspoon of cumin seeds" is being audibly output from the speaker 18 of the computing device 10. In some examples, the controller 190a pauses playback of the audiovisual feed 120 in response to receiving the query 112.
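The choice between replaying recorded audio and synthesizing speech can be summarized in a small Python sketch. The `OutputAction` type and its field names are assumptions introduced here; the sketch only encodes the rule that replay is possible when a transcript segment supports the response, and that synthesized speech is used otherwise.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OutputAction:
    """Hypothetical description of what an output module would do."""
    kind: str                                   # "replay_audio" or "synthesize_speech"
    pause_video: bool                           # whether video playback is paused
    span: Optional[Tuple[float, float]] = None  # replay range in seconds, if any
    text: Optional[str] = None                  # text handed to a TTS system, if any


def plan_output(response_text: str,
                span: Optional[Tuple[float, float]]) -> OutputAction:
    """Replay the original audio when a supporting transcript segment exists;
    otherwise fall back to synthesized speech of the text response."""
    if span is not None:
        return OutputAction("replay_audio", pause_video=True, span=span)
    return OutputAction("synthesize_speech", pause_video=True, text=response_text)


# The answer is spoken in the video: replay (time range is made up).
spoken = plan_output("They were toasted in a dry saute pan.", (62.5, 67.5))
# The answer exists only as on-screen text: synthesize speech.
on_screen_only = plan_output("half a teaspoon of cumin seeds", None)

print(spoken.kind, on_screen_only.kind)
```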

[0046] Furthermore, the output module 190 can instruct the UI generator 190b to generate an image of the text response 182, and then display the image of the text response 182 on the video interface 400, which is displayed on the display 11 of the computing device 10 during playback of the audiovisual feed 120. Here, user 2 can easily read the image of the text response 182 presented on the video interface 400 while watching the video 120. The text response 182 may include one or more references to source materials related to the requested information. For example, the image of the text response 182 presented on the video interface 400 may have hyperlinks to references to source materials related to the requested information. User 2 can view additional information or move to another source (e.g., a web page) by simply hovering over (e.g., via a mouse) or simply touching a word of interest in the text response presented on the video interface 400.

[0047] Figure 4 shows an exemplary video interface 400 that the media player application 150 displays on the display 11 of the computing device 10 while the audiovisual feed 120 is playing. In this example, the media player application 150 also displays information from a semantically rich structured document 300 generated about the audiovisual feed 120 on the video interface 400, allowing user 2 to interact with the structured document 300 while the audiovisual feed 120 is playing. For example, it may display transcripts 310 of utterances spoken by two different speakers, and further, it may display corresponding speaker labels 204 indicating which parts of the transcript 310 were spoken by each speaker. The structured document 300 may further provide multimodal interaction, such as providing hyperlinks to specific terms or entities that may be relevant to user 2 listed in the transcript 310. For example, user 2 can find additional information about the term "cryptocurrency" by selecting the term or by hovering the mouse over it. The video interface 400 may then add a definition of cryptocurrency or an excerpt from the Wikipedia page about cryptocurrency.

[0048] The structured document 300 may further provide summaries 410 of the relevant chapters / sections / scenes of the audiovisual feed 120 for presentation in the video interface 400. Here, the summaries 410 can be generated by the large language model 180 in Figure 1 based on information extracted from the transcript 310, author-provided text 320, and / or annotated transcript 330. User 2 can select one of the summaries 410, and the video player can proceed to that part of the video.

[0049] The video interface 400 of the media player application 150 also provides a playback settings control 450 that allows user 2 to control the playback of the audiovisual feed 120. For example, the playback settings control 450 may include buttons for play, forward / reverse scan, pause, and a video timeline that user 2 can manipulate to scrub the video.

[0050] Figure 5 provides a flowchart illustrating an exemplary sequence of operations for a method 500 of interacting with a content feed 120 using a structured document 300 while the content feed 120 is being played back. In operation 502, method 500 includes receiving a content feed 120 containing audio data 122, which corresponds to a speech utterance 123. The content feed 120 may also include an audiovisual feed further containing image data 124, which contains a plurality of image frames 125a to n.

[0051] In operation 504, method 500 includes processing a content feed 120 to generate a semantically rich structured document 300, where the structured document 300 includes a transcript 310 of a speech utterance 123. The transcript 310 may include multiple words 312, each of which is aligned with a corresponding audio segment 222 of the audio data 122 indicating the time when the word 312 was recognized in the audio data 122.

[0052] In operation 506, while the content feed 120 is being played back, method 500 includes receiving a query 112 from user 2 requesting information contained in the content feed 120. In operation 508, while the content feed 120 is being played back, method 500 includes processing the query 112 and the structured document 300 using the large-scale language model 180 to generate a response 182 to the query 112, where the response 182 conveys the requested information contained in the content feed 120. In operation 510, method 500 includes providing the response 182 to the query 112 for output from the user device 10 associated with user 2.
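Operations 506 to 510 can be pictured with the following Python sketch, in which the language model is represented by an arbitrary callable. Combining the document and the query into a single text prompt is an assumption made here for simplicity (the description also mentions separate encoder branches for the document and the query), and `build_prompt`, `answer_query`, and `stub_model` are hypothetical names. The stub returns a fixed string and stands in for a real model only so that the example runs.

```python
from typing import Callable


def build_prompt(structured_document: str, query: str) -> str:
    """Combine the structured document and the user query into one input.
    The prompt wording is an illustrative assumption."""
    return (
        "Answer the question using only the video document.\n\n"
        f"Document:\n{structured_document}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer_query(structured_document: str,
                 query: str,
                 language_model: Callable[[str], str]) -> str:
    """Process the query together with the structured document and
    return the model's response as clean text."""
    prompt = build_prompt(structured_document, query)
    return language_model(prompt).strip()


def stub_model(prompt: str) -> str:
    """Stand-in for a large language model; always returns a fixed answer."""
    return " half a teaspoon of cumin seeds "


# Annotated transcript fragment with made-up surrounding words.
document = "anything [1.5 tsp coriander] Coriander and [0.5 tsp cumin] cumin"
response = answer_query(document, "How much cumin?", stub_model)

print(response)
```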

[0053] A software application (i.e., a software resource) can refer to computer software that causes a computing device to perform a task. In some examples, a software application may be called an “application,” “app,” or “program.” Exemplary applications include, but are not limited to, system diagnostic applications, system administration applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and game applications.

[0054] Non-transitory memory can be a physical device used to temporarily or permanently store programs (e.g., sequences of instructions) or data (e.g., program state information) for use by a computing device. Non-transitory memory may be volatile and / or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware such as boot programs). Examples of volatile memory include, but are not limited to, random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), phase-change memory (PCM), and disk or tape.

[0055] Figure 6 is a schematic diagram of an exemplary computing device 600 that can be used to implement the systems and methods described herein. The computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown herein, their connections and relationships, and their functions are for illustrative purposes only and are not intended to limit the embodiments of the invention described and / or claimed herein.

[0056] The computing device 600 comprises a processor 610, memory 620, a storage device 630, a high-speed interface / controller 640 connected to the memory 620 and high-speed expansion ports 650, and a low-speed interface / controller 660 connected to a low-speed bus 670 and the storage device 630. Each of the components 610, 620, 630, 640, 650, and 660 is interconnected using various buses and can be mounted on a common motherboard or in other configurations as needed. The processor 610 processes instructions for execution within the computing device 600, including instructions stored in the memory 620 or on the storage device 630, to display graphical information for a graphical user interface (GUI) on an external input / output device such as a display 680 connected to the high-speed interface 640. In other embodiments, multiple processors and / or multiple buses may be used, along with multiple memories and multiple types of memory as needed. Multiple computing devices 600 may also be connected so that each device performs some of the required operations (e.g., as a server bank, a group of blade servers, or a multiprocessor system).

[0057] Memory 620 stores information non-transitorily within the computing device 600. Memory 620 may be a computer-readable medium, a volatile memory unit, or a non-volatile memory unit. The non-transitory memory 620 may be a physical device used to temporarily or permanently store programs (e.g., sequences of instructions) or data (e.g., program state information) for use by the computing device 600. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase-change memory (PCM), and disk or tape.

[0058] The storage device 630 can provide high-capacity storage to the computing device 600. In some embodiments, the storage device 630 is a computer-readable medium. In various different embodiments, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, a tape device, flash memory or another similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. In a further embodiment, a computer program product is tangibly embodied in an information carrier. The computer program product includes instructions that, when executed, perform one or more of the above-described methods. The information carrier is a computer-readable or machine-readable medium such as memory 620, the storage device 630, or memory on the processor 610.

[0059] The high-speed controller 640 manages bandwidth-intensive operations for the computing device 600, while the low-speed controller 660 manages low-bandwidth-intensive operations. This assignment of roles is merely an example. In some embodiments, the high-speed controller 640 is connected to memory 620, a display 680 (e.g., via a graphics processor or accelerator), and a high-speed expansion port 650 that can accept various expansion cards (not shown). In some embodiments, the low-speed controller 660 is connected to a storage device 630 and a low-speed expansion port 690. The low-speed expansion port 690 may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet, etc.) and can be connected to one or more input / output devices such as a keyboard, pointing device, scanner, or network devices such as a switch or router, for example, via a network adapter.

[0060] The computing device 600 can be implemented in many different forms, as shown in the figure. For example, it can be implemented as a standard server 600a, or multiple times in a group of such servers 600a, or as a laptop computer 600b, or as part of a rack server system 600c.

[0061] Various embodiments of the systems and technologies described herein can be realized in digital electronic circuits and / or optical circuits, integrated circuits, specially designed ASICs (application-specific integrated circuits), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and / or interpretable on a programmable system including at least one programmable processor (which may be dedicated or general-purpose) connected to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[0062] These computer programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented in a high-level procedural programming language and / or an object-oriented programming language and / or an assembly language / machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer-readable medium, apparatus and / or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and / or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and / or data to a programmable processor.

[0063] The processes and logical flows described herein can be executed by one or more programmable processors (also called data processing hardware) that perform functions by executing one or more computer programs, acting on input data, and producing outputs. Processes and logical flows can also be executed by dedicated logic circuits, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). Processors suitable for executing computer programs include, by example, both general-purpose and dedicated microprocessors, as well as any one or more processors in any type of digital computer. Generally, a processor receives instructions and data from read-only memory, random-access memory, or both. The basic elements of a computer are a processor for executing instructions, and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or is operably connected to one or more such mass storage devices for receiving data from or transferring data to or both. However, a computer is not required to have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, such as semiconductor memory devices including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. Processors and memory may be complemented by or incorporated into dedicated logic circuits.

[0064] To provide user interaction, one or more aspects of this disclosure can be implemented in a computer having a display device for displaying information to the user (e.g., a CRT (cathode ray tube), an LCD (liquid crystal display) monitor, or a touchscreen), and optionally a keyboard and pointing device (e.g., a mouse or trackball) that allows the user to input into the computer. Other types of devices can also be used to provide user interaction; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or haptic feedback), and input from the user can be received in any form, such as acoustic, spoken language, or haptic input. Furthermore, the computer can interact with the user by sending and receiving documents to and from the user's device; for example, it can interact with the user by sending a web page to a web browser on the user's client device in response to a request received from a web browser.

[0065] Several embodiments have been described. Nevertheless, it will be understood that various modifications can be made without departing from the spirit and scope of this disclosure. Therefore, other embodiments are within the scope of the following claims.

Claims

1. A computer-implemented method (500) that, when executed on data processing hardware (134), causes the data processing hardware (134) to perform operations comprising: receiving a content feed (120) containing audio data (122), the audio data (122) corresponding to a speech utterance (123); and processing the content feed (120) to generate a semantically rich structured document (300), wherein the structured document (300) includes a transcript (310) of the speech utterance (123), the transcript (310) includes a plurality of words (312), and each of the plurality of words (312) is aligned with a corresponding audio segment (222) of the audio data (122) indicating the time at which the word (312) was recognized, wherein the operations further comprise, during playback of the content feed (120): receiving, from a user, a query (112) requesting information contained in the content feed (120); and processing the query (112) and the structured document (300) using a large-scale language model (180) to generate a response (182) to the query (112), the response (182) conveying the requested information contained in the content feed (120), wherein the operations further comprise providing the response (182) to the query (112) for output from a user device (10) associated with the user, wherein the query (112) includes a question in natural language, and wherein the response (182) to the query (112) includes a natural language response (182) to the question.

2. The computer-implemented method (500) according to claim 1, wherein the operations further comprise: extracting a segment of the transcript (310) containing the requested information conveyed by the response (182) to the query (112), the segment of the transcript (310) being bounded by a start word (312) and an end word (312); identifying a start audio segment of the audio data (122) as the corresponding audio segment (222) of the audio data (122) that is aligned with the start word (312) bounding the segment of the transcript (310); and identifying an end audio segment of the audio data (122) as the corresponding audio segment (222) of the audio data (122) that is aligned with the end word (312) bounding the segment of the transcript (310), wherein providing the response (182) to the query (112) includes replaying, from the user device (10) associated with the user, the audio data (122) from the start audio segment to the end audio segment of the audio data (122).

3. The computer-implemented method (500) according to claim 2, wherein the content feed (120) further includes image data (124) that includes a plurality of image frames (125), and the operations further comprise pausing playback of the plurality of image frames (125) of the image data (124) while playing back the audio data (122) from the start audio segment to the end audio segment.

4. The computer-implemented method (500) according to claim 3, wherein the content feed (120) further includes image data (124) that includes a plurality of image frames (125), and the semantically rich structured document (300) further includes author-provided text recognized in one or more image frames (125) of the plurality of image frames (125), the author-provided text being aligned with a corresponding audio segment (222) of the audio data (122) to indicate the time at which the author-provided text was recognized in the one or more image frames (125).

5. The computer-implemented method (500) according to claim 4, wherein processing the content feed (120) to generate the semantically rich structured document (300) includes annotating the transcript (310) of the speech utterance (123) with the author-provided text by inserting the author-provided text between a pair of adjacent words (312) in the transcript (310), based on the corresponding audio segment (222) of the audio data (122) that is aligned with the author-provided text recognized in the one or more image frames (125).
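
The annotation step in claim 5 amounts to merging two time-ordered streams. The sketch below inserts each piece of on-screen text after the last word recognized at or before the text's timestamp. The tuple layouts, the `[ON-SCREEN: ...]` marker, and the function name `annotate` are hypothetical, and the words are assumed to be sorted by time already.

```python
from bisect import bisect_right
from typing import List, Tuple

TimedWord = Tuple[str, float]      # (word, time recognized in audio, seconds)
TimedCaption = Tuple[str, float]   # (on-screen text, time recognized in frames)


def annotate(words: List[TimedWord],
             captions: List[TimedCaption]) -> List[str]:
    """Insert each caption after the last word recognized at or before it."""
    starts = [t for _, t in words]  # assumed sorted in ascending order
    # slots[i] holds captions that go before word i; the last slot is the tail.
    slots: List[List[str]] = [[] for _ in range(len(words) + 1)]
    for text, t in sorted(captions, key=lambda c: c[1]):
        slots[bisect_right(starts, t)].append(f"[ON-SCREEN: {text}]")
    tokens: List[str] = list(slots[0])
    for i, (word, _) in enumerate(words):
        tokens.append(word)
        tokens.extend(slots[i + 1])
    return tokens


words = [("add", 10.0), ("two", 10.4), ("cups", 10.8), ("now", 11.5)]
captions = [("2 cups = 480 ml", 11.0)]
print(" ".join(annotate(words, captions)))
# add two cups [ON-SCREEN: 2 cups = 480 ml] now
```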

6. The computer-implemented method (500) according to claim 5, wherein the response (182) to the query (112) includes a text response (182) that conveys the requested information as a response (182) to the query (112).

7. The computer-implemented method (500) according to claim 6, wherein the operations further comprise performing text-to-speech conversion on the text response (182) to generate a synthesized speech representation of the response (182) to the query (112), and wherein providing the response (182) to the query (112) as output from the user device (102) includes audibly outputting the synthesized speech representation of the response (182) to the query (112) from the user device (102).

8. The computer-implemented method (500) according to claim 7, wherein the operations further comprise pausing playback of the content feed (120) while the synthesized speech representation of the response (182) to the query (112) is audibly output from the user device (102).
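
One way to picture the behavior in claims 7 and 8 is a scope that pauses playback while the spoken response is output. In the sketch below, `Player` is a minimal stand-in rather than a real media API, `synthesize_and_play` is a hypothetical text-to-speech hook, and resuming playback afterwards is an assumption that the claims do not state.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


class Player:
    """Minimal stand-in for a media player; not a real library API."""

    def __init__(self) -> None:
        self.playing = True
        self.log: List[str] = []

    def pause(self) -> None:
        self.playing = False
        self.log.append("pause")

    def resume(self) -> None:
        self.playing = True
        self.log.append("resume")


@contextmanager
def paused(player: Player) -> Iterator[None]:
    """Pause playback for the duration of the block, then restore it."""
    was_playing = player.playing
    if was_playing:
        player.pause()
    try:
        yield
    finally:
        # Resume only if playback was running before (an assumption).
        if was_playing:
            player.resume()


def speak_response(player: Player, text: str,
                   synthesize_and_play: Callable[[str], None]) -> None:
    # `synthesize_and_play` is a hypothetical text-to-speech hook.
    with paused(player):
        synthesize_and_play(text)


player = Player()
spoken: List[str] = []
speak_response(player, "Preheat to 200 degrees.", spoken.append)
```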

9. A computer-implemented method (500) according to any one of claims 6 to 8, wherein the text response (182) to the query (112) further includes one or more references to source material relating to the requested information.

10. The computer-implemented method (500) according to claim 9, wherein the large-scale language model (180) includes a pre-trained large-scale language model (180) and generates the response (182) to the query (112) by performing fusion learning using the structured document (300) as context for the query (112).

11. The computer-implemented method (500) according to claim 1, wherein processing the content feed (120) to generate the semantically rich structured document (300) includes:
dividing the audio data (122) into a plurality of audio segments (222);
performing speaker diarization on the plurality of audio segments (222) to predict a diarization result (224) that includes a corresponding speaker label (226) assigned to each audio segment (222); and
indexing the transcript (310) of the speech utterance (123) using the corresponding speaker label (226) assigned to each audio segment (222) divided from the audio data (122).
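
The indexing step in claim 11 can be sketched as grouping transcript words into speaker turns. The diarization result is taken as given input here; the diarization model itself is not shown. The tuple layouts, the `"unknown"` fallback label, and the function names `label_for` and `index_by_speaker` are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, speaker_label)
TimedWord = Tuple[str, float]       # (word, time recognized, seconds)


def label_for(t: float, segments: List[Segment]) -> str:
    """Speaker label of the segment containing time t, or 'unknown'."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return "unknown"


def index_by_speaker(words: List[TimedWord],
                     segments: List[Segment]) -> List[Tuple[str, str]]:
    """Group consecutive words that share a speaker label into turns."""
    labelled = [(label_for(t, segments), w) for w, t in words]
    return [
        (speaker, " ".join(w for _, w in group))
        for speaker, group in groupby(labelled, key=lambda p: p[0])
    ]


segments = [(0.0, 2.0, "speaker_1"), (2.0, 4.0, "speaker_2")]
words = [("hello", 0.2), ("there", 0.8), ("hi", 2.1), ("back", 2.6)]
print(index_by_speaker(words, segments))
# [('speaker_1', 'hello there'), ('speaker_2', 'hi back')]
```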

12. A system (100) comprising:
data processing hardware (134); and
memory hardware (136) in communication with the data processing hardware (134), the memory hardware (136) storing instructions that, when executed by the data processing hardware (134), cause the data processing hardware (134) to perform operations comprising:
receiving a content feed (120) that includes audio data (122), the audio data (122) corresponding to a speech utterance (123);
processing the content feed (120) to generate a semantically rich structured document (300), wherein the structured document (300) includes a transcript (310) of the speech utterance (123), the transcript (310) includes a plurality of words (312), and each word (312) of the plurality of words (312) is aligned with a corresponding audio segment (222) of the audio data (122) indicating the time at which the word (312) was recognized; and
during playback of the content feed (120):
receiving, from a user, a query (112) requesting information contained in the content feed (120);
processing the query (112) and the structured document (300) using a large-scale language model (180) to generate a response (182) to the query (112), the response (182) conveying the requested information contained in the content feed (120); and
providing the response (182) to the query (112) as output from a user device (102) associated with the user,
wherein the query (112) includes a natural language question, and the response (182) to the query (112) includes a natural language response (182) to the question.

13. The system (100) according to claim 12, wherein the operations further comprise:
extracting a segment of the transcript (310) that contains the requested information conveyed by the response (182) to the query (112), the segment of the transcript (310) being bounded by a start word (312) and an end word (312);
identifying a start audio segment of the audio data (122) as the corresponding audio segment (222) of the audio data (122) that is aligned with the start word (312) bounding the segment of the transcript (310); and
identifying an end audio segment of the audio data (122) as the corresponding audio segment (222) of the audio data (122) that is aligned with the end word (312) bounding the segment of the transcript (310),
wherein providing the response (182) to the query (112) includes playing back, from the user device (102) associated with the user, the audio data (122) from the start audio segment to the end audio segment.

14. The system (100) according to claim 13, wherein the content feed (120) further includes image data (124) that includes a plurality of image frames (125), and the operations further comprise pausing playback of the plurality of image frames (125) of the image data (124) while playing back the audio data (122) from the start audio segment to the end audio segment.

15. The system (100) according to claim 14, wherein the content feed (120) further includes image data (124) that includes a plurality of image frames (125), and the semantically rich structured document (300) further includes author-provided text recognized in one or more image frames (125) of the plurality of image frames (125), the author-provided text being aligned with a corresponding audio segment (222) of the audio data (122) to indicate the time at which the author-provided text was recognized in the one or more image frames (125).

16. The system (100) according to claim 15, wherein processing the content feed (120) to generate the semantically rich structured document (300) includes annotating the transcript (310) of the speech utterance (123) with the author-provided text by inserting the author-provided text between a pair of adjacent words (312) in the transcript (310), based on the corresponding audio segment (222) of the audio data (122) that is aligned with the author-provided text recognized in the one or more image frames (125).

17. The system (100) according to claim 16, wherein the response (182) to the query (112) includes a text response (182) that conveys the requested information as a response (182) to the query (112).

18. The system (100) according to claim 17, wherein the operations further comprise performing text-to-speech conversion on the text response (182) to generate a synthesized speech representation of the response (182) to the query (112), and wherein providing the response (182) to the query (112) as output from the user device (102) includes audibly outputting the synthesized speech representation of the response (182) to the query (112) from the user device (102).

19. The system (100) according to claim 18, wherein the operations further comprise pausing playback of the content feed (120) while the synthesized speech representation of the response (182) to the query (112) is audibly output from the user device (102).

20. The system (100) according to claim 19, wherein the text response (182) to the query (112) further includes one or more references to source material relating to the requested information.

21. The system (100) according to claim 20, wherein the large-scale language model (180) includes a pre-trained large-scale language model (180) and generates the response (182) to the query (112) by performing fusion learning using the structured document (300) as context for the query (112).

22. The system (100) according to any one of claims 12 to 21, wherein processing the content feed (120) to generate the semantically rich structured document (300) includes:
dividing the audio data (122) into a plurality of audio segments (222);
performing speaker diarization on the plurality of audio segments (222) to predict a diarization result (224) that includes a corresponding speaker label (226) assigned to each audio segment (222); and
indexing the transcript (310) of the speech utterance (123) using the corresponding speaker label (226) assigned to each audio segment (222) divided from the audio data (122).