Question and answer method, device and equipment for video stream and storage medium
By employing routing between multi-modal and text-based tasks, dual-track storage, and an asynchronous processing architecture, the problem of balancing high precision and low latency in streaming video understanding agent frameworks is solved. This enables efficient and accurate multi-modal information processing and external knowledge acquisition, meeting the need for highly intelligent responses.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING BAIDU NETCOM SCI & TECH CO LTD
- Filing Date
- 2026-03-05
- Publication Date
- 2026-06-19

Figure CN122240878A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to the field of multimodal understanding technologies such as large models, machine question answering, image recognition, and audio recognition.
Background Technology
[0002] Streaming video, also known as video streaming, employs a technology that transmits, decodes, and plays video simultaneously. It eliminates the need to download the complete video file to a local device; instead, it transmits video data in segments over the network as a data stream. The device can begin playback after receiving and decoding a small amount of video data, with the remaining data continuously and dynamically supplemented during playback. A streaming video understanding agent framework can perform low-latency, high-efficiency, and continuous intelligent semantic understanding and analysis on real-time video streams, and perform tasks such as real-time target detection, behavior recognition, anomaly warning, and content analysis on dynamically transmitted video streams.
Summary of the Invention
[0003] This disclosure provides question-and-answer methods, apparatus, devices, and storage media for video streaming.
[0004] According to one aspect of this disclosure, a question-answering method for video streams is provided, comprising:
acquiring a first video stream within the start and end interval of a query request;
determining, based on the image-text relevance between the query request and the video frame images in the first video stream, whether to execute a multi-modal branch answering task or a text branch answering task;
in response to determining to execute the multi-modal branch answering task, generating answer content based on the query request and the video frame images in the first video stream, using at least a multi-modal question-answering model; and
in response to determining to execute the text branch answering task, generating answer content based on the query request, using a text question-answering model.
[0005] According to another aspect of this disclosure, a question-answering apparatus for a video stream is provided, comprising:
an acquisition module, used to acquire the first video stream within the start and end interval of the query request;
a determination module, used to determine whether to execute a multi-modal branch answering task or a text branch answering task based on the image-text relevance between the query request and the video frame images in the first video stream;
a multi-modal question-answering module, used to generate answer content based on the query request and the video frame images in the first video stream, using at least a multi-modal question-answering model, in response to determining to execute the multi-modal branch answering task; and
a text question-answering module, used to generate answer content based on the query request, using a text question-answering model, in response to determining to execute the text branch answering task.
[0006] According to another aspect of this disclosure, an electronic device is provided, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform any of the methods described in the present disclosure.
[0007] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform any of the methods according to embodiments of this disclosure.
[0008] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements any of the methods according to embodiments of this disclosure.
[0009] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0010] The accompanying drawings are provided to better understand the embodiments of this disclosure and are not intended to limit the scope of this disclosure. Wherein:
Figure 1 is a flowchart illustrating a question-and-answer method for video streams according to an embodiment of the present disclosure;
Figure 2 is a schematic diagram of the structure of a streaming video understanding system according to an embodiment of the present disclosure;
Figure 3 is a flowchart illustrating a question-and-answer method for video streams according to another embodiment of this disclosure;
Figure 4 is a schematic diagram of the structure of the perception and storage offloading layer of a streaming video understanding system;
Figure 5 is a schematic diagram of the structure of the decision-making agent in a streaming video understanding system;
Figure 6 is a schematic diagram of the response generation layer of a streaming video understanding system;
Figure 7 is a flowchart illustrating a question-and-answer method for video streams according to another embodiment of this disclosure;
Figure 8 is a flowchart illustrating a question-and-answer method for video streams according to another embodiment of this disclosure;
Figure 9 is a schematic diagram of the structure of a question-and-answer device for video streaming according to an embodiment of the present disclosure;
Figure 10 is a schematic diagram of a question-and-answer device for video streaming according to another embodiment of the present disclosure;
Figure 11 is a block diagram of an electronic device used to implement the methods of the embodiments of this disclosure.
Detailed Implementation
[0011] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0012] An example of a streaming video understanding agent framework is as follows:
1. Streaming KV Cache: Employs a tiered storage mechanism spanning video memory (GPU memory) and host memory, combined with dynamic attention retrieval, to solve the video memory overflow problem of long video streams.
[0013] 2. Proactive Expectation Planning: A lightweight decision-making agent is introduced to generate three types of plans ("reactive," "predictive," and "speculative") based on historical memory and the current frame, and to select the optimal action through heuristic evaluation.
[0014] 3. Tool-assisted perception: The intelligent agent can call on visual tools such as zooming and target tracking to obtain high-definition details of future frames.
[0015] 4. Asynchronous processing architecture: Decouples the perception flow from the decision flow, and determines whether to respond to the user immediately or continue to observe the next frame through a triggering mechanism.
[0016] The following problems exist in the streaming video understanding agent framework:
1. The challenge of balancing high precision and low latency: In applications such as autonomous driving, augmented reality (AR) glasses assistants, or intelligent monitoring, users require both millisecond-level real-time response and extremely high inference capabilities (such as inference capabilities on the order of 235B parameters). A single model architecture cannot simultaneously meet the demands of rapid processing of continuous video streams and deep inference for complex problems.
[0017] 2. Inefficient response to non-visual tasks: In video calls or companionship scenarios, users often intersperse questions with plain text (such as chat or weather) or knowledge-based questions requiring internet access. If the system forces all input to be processed jointly through a visual encoder and the video stream context, it leads to inefficient response and introduces unnecessary visual interference.
[0018] 3. Resource accumulation caused by static redundant frames: There are a lot of static or similar frames in the video stream. The lack of an effective input filtering mechanism causes the system to continuously occupy video memory and perform calculations on invalid information.
[0019] 4. Closed-domain limitation and knowledge hallucination: Streaming video models rely only on internal training knowledge and the current screen, and cannot obtain real-time external information (such as news and guides). They are prone to hallucinating or refusing to answer when faced with questions that go beyond the content of the screen.
[0020] Figure 1 is a flowchart illustrating a question-answering method 100 for a video stream according to an embodiment of the present disclosure. In one embodiment, the method may include:
S110. Obtain the first video stream within the start and end interval of the query request;
S120. Based on the image-text relevance between the query request and the video frame images in the first video stream, determine whether to execute a multi-modal branch answering task or a text branch answering task;
S130. In response to determining to execute the multi-modal branch answering task, generate answer content based on the query request and the video frame images in the first video stream, using at least a multi-modal question-answering model;
S140. In response to determining to execute the text branch answering task, generate answer content based on the query request using a text question-answering model.
[0021] In the embodiments disclosed herein, as shown in Figure 2, the input processing layer (Layer 0) of the streaming video understanding system can acquire query requests and real-time video streams. The streaming video understanding system can process user-input queries in a split-stream manner. Users can input queries via text, voice, etc. If the query request includes voice, it can be converted into text. The time from the start of inputting the query request to its end can be understood as the start and end interval of the query request. Within this start and end interval, the real-time video stream acquired in parallel by the streaming video understanding system can be called the first video stream. For example, the first video stream acquired within the start and end interval of query request A includes 100 video frames, and the first video stream acquired within the start and end interval of query request B includes 150 video frames. The image-text relevance between the query request and the first video stream can be calculated based on the text corresponding to the query request and the multiple video frames included in the first video stream. Alternatively, the multiple video frames included in the first video stream can first be deduplicated and quality-checked, and the remaining video frames can then be used to calculate the image-text relevance with the text corresponding to the query request. There are various methods for obtaining image-text relevance. For example, Euclidean distance, cosine similarity, Manhattan distance, or dot product similarity can be used to calculate image-text relevance. Alternatively, deep learning models can be used to predict image-text relevance. Furthermore, the overall image-text relevance can be obtained by taking the average, the maximum, or a weighted sum of the image-text relevance values between the query request and the individual video frames used from the first video stream.
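The relevance calculation described above can be illustrated with a minimal sketch. It assumes that the query text and each video frame have already been encoded into vectors of the same dimension by some encoder that is not shown here; the function names cosine_similarity and overall_relevance and the mode parameter are hypothetical and are not taken from the disclosure. Cosine similarity is used as one of the measures named above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def overall_relevance(query_vec, frame_vecs, mode="mean", weights=None):
    """Aggregate per-frame relevance into one score (mean, max, or weighted sum)."""
    scores = [cosine_similarity(query_vec, f) for f in frame_vecs]
    if not scores:
        return 0.0
    if mode == "mean":
        return sum(scores) / len(scores)
    if mode == "max":
        return max(scores)
    if mode == "weighted":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weights must match the number of frames")
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError(f"unknown mode: {mode}")
```

The choice between the mean, the maximum, and a weighted sum changes how strongly a single highly relevant frame influences the overall score; which one suits a given deployment is left open here.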
[0022] In this embodiment, an image-text relevance threshold can be preset to control task routing. If the image-text relevance between the query request and the first video stream is greater than or equal to the threshold, the text in the query request is highly relevant to the currently acquired video stream, and it can be determined to execute the multi-modal branch answering task. In response to determining to execute the multi-modal branch answering task, a multi-modal question-answering model is used to generate answer content that matches the query request and the video stream, based on the text of the query request and the images, sounds, etc., in the video stream. If the image-text relevance between the query request and the first video stream is less than the threshold, the text corresponding to the query request has low relevance to the currently acquired video stream, and it can be determined to execute the text branch answering task. In response to determining to execute the text branch answering task, a text question-answering model is used to generate answer content that matches the query, mainly using the text corresponding to the query request. The multi-modal question-answering model and the text question-answering model can each employ a large language model. The multi-modal question-answering model can comprehensively process multi-modal data such as text, speech, and images. The text question-answering model primarily processes text, but can also have multi-modal capabilities. The model sizes of the multi-modal question-answering model and the text question-answering model can be the same or different. For example, the multi-modal question-answering model can use a medium-sized language model, such as 10B to 100B parameters, or a very large-scale language model, such as one larger than 100B parameters. The text question-answering model can use a small-scale language model, such as one smaller than 10B parameters. Of course, the text question-answering model can also use a medium-sized or very large-scale language model.
[0023] By comparing the text and image relevance of query requests with video streams, precise task type segmentation can be achieved. A multi-modal question-answering model is used to process content with high text and image relevance, while a text question-answering model is used to process content with low text and image relevance, thereby reducing unnecessary computing power consumption and improving answer accuracy and speed.
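The threshold-based routing described above reduces to a single comparison. The following is a minimal sketch; the function name route_task, the branch labels, and the default threshold of 0.5 are illustrative assumptions, since the disclosure does not give a concrete threshold value.

```python
def route_task(relevance, threshold=0.5):
    """Pick the answering branch from the image-text relevance score.

    A score greater than or equal to the threshold selects the multi-modal
    branch; a lower score selects the text branch. The default threshold is
    an arbitrary placeholder, not a value from the disclosure.
    """
    return "multimodal" if relevance >= threshold else "text"

# Example: a query closely tied to the current frames versus small talk.
branch_for_visual_query = route_task(0.82)
branch_for_chat_query = route_task(0.12)
```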
[0024] Figure 3 is a flowchart illustrating a question-answering method 300 for video streams according to another embodiment of the present disclosure. This method can be used to implement one or more features of the above-described question-answering method for video streams. In one embodiment, the method 300 may further include:
S310. If the query request includes historical requirements, encode the query request to obtain a query vector;
S320. Retrieve, in the feature index library, historical frame vectors whose similarity to the query vector meets a threshold, and obtain the relevant frame index; wherein the feature index library includes the mapping relationship between the index and the historical frame vector;
S330. Retrieve, in the original image library, the historical frame image corresponding to the relevant frame index, to obtain the historical frame image related to the query request; wherein the original image library includes the mapping relationship between the index and the historical frame image.
[0025] In this embodiment of the disclosure, the streaming video understanding system may employ dual-track storage. For example, as shown in Figure 2, the streaming video understanding system can include an original image library and a feature index library in the perception and storage offloading layer (Layer 1). As shown in Figure 4, after some or all video frame images from a real-time video stream are encoded using a visual encoder and processed with intelligent chunk-wise pre-fill, a mapping relationship between the index and the video frame vectors (also known as video frame feature vectors) of the real-time video stream is constructed in the feature index library. The video frame vectors to be stored in the feature index library can be called historical frame vectors. A mapping relationship between the index and the video frame images of the real-time video stream is also constructed in the original image library. The video frame images to be stored in the original image library can be called historical frame images. The index of a given video frame image in the original image library and the index of its encoded video frame vector in the feature index library can be the same or have a conversion relationship. For example, the mapping relationship between video frame image M1 of the real-time video stream and index ID1 is stored in the original image library, and the mapping relationship between the video frame vector F1 encoded from video frame image M1 and index ID1 is stored in the feature index library.
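The dual-track storage described above can be sketched as two dictionaries that share one index. This is an illustrative in-memory sketch only: the class name DualTrackStore and its methods are hypothetical, and a real system would place the two libraries on different storage tiers rather than in Python dictionaries.

```python
from dataclasses import dataclass, field

@dataclass
class DualTrackStore:
    """Two libraries keyed by the same index (hypothetical sketch)."""
    feature_index: dict = field(default_factory=dict)    # index -> historical frame vector
    original_images: dict = field(default_factory=dict)  # index -> historical frame image
    next_id: int = 1

    def add_frame(self, image, vector):
        """Store the image and its encoded vector under one shared index."""
        index = f"ID{self.next_id}"
        self.next_id += 1
        self.feature_index[index] = vector
        self.original_images[index] = image
        return index

    def get_image(self, index):
        """Fetch the original image for an index found via the feature index."""
        return self.original_images[index]

# Mirrors the M1 / F1 / ID1 example in the text.
store = DualTrackStore()
first_index = store.add_frame(image="M1", vector=[0.1, 0.2])
```

Sharing the index is the simplest case of the "same or convertible" relationship mentioned above; a conversion function between two index spaces would serve equally well.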
[0026] In this embodiment of the disclosure, some query requests may include historical requirements, for example, "What animals appeared 10 minutes ago?" or "Please help find the red vehicle that passed through this intersection 1 hour ago." If the query request explicitly or implicitly raises a need for historical information analysis, historical frame images associated with the query request can be dynamically retrieved through the original image library and the feature index library. The feature index library, also known as a KV cache, can include a short-term KV cache and a long-term KV cache. The mapping relationships in the short-term KV cache can be offloaded to the long-term KV cache after a period of time. For example, the short-term KV cache can reside in GPU memory and store short-term historical frame vectors covering, for example, the past day, week, or month, while the long-term KV cache can reside in CPU memory and store long-term historical frame vectors, such as those older than one month or three months. The original image library, also known as an original image buffer, can be located on the hard disk to support the high-resolution frame requirements of ultra-large-scale language models.
[0027] After receiving a user's query request (Q), the streaming video understanding system can calculate an attention score. If the score exceeds a threshold (Max-a), it performs filtering and dynamic retrieval. First, the query request is encoded into a query vector using a visual encoder, which can be the same as the one used to update the original image library and the feature index library. The similarity or relevance of this query vector to the historical frame vectors in the feature index library is then compared. For example, similarity or relevance can be calculated using Euclidean distance, cosine similarity, Manhattan distance, dot product similarity, etc., or a model can be used to predict the similarity or relevance of different vectors. The short-term key-value (KV) cache can be loaded first, and the query vector compared with each vector in the short-term KV cache. If a similar vector is found, the search of the long-term KV cache can be skipped. If no similar vector is found, the long-term KV cache can be searched further. The indexes corresponding to vectors similar to the query vector are called relevant frame indexes or similar frame indexes. If multiple similar vectors are found, a list of relevant frame indexes is obtained. The relevant frame index or list of relevant frame indexes is sent to the model that needs to use the original images. Based on it, the model can find the original image, i.e., the historical frame image, corresponding to each relevant frame index in the original image library.
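The two-tier lookup described above (short-term cache first, long-term cache only on a miss, then original images by index) can be sketched as follows. The sketch assumes the query vector and historical frame vectors are plain lists of floats and uses cosine similarity as one of the measures mentioned; the function names and the threshold of 0.8 are hypothetical, and the attention-score gate (Max-a) is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_tier(query_vec, tier, threshold):
    """Return indexes in one cache tier whose similarity meets the threshold, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in tier.items()]
    hits = [(score, idx) for score, idx in scored if score >= threshold]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [idx for _, idx in hits]

def retrieve_history(query_vec, short_term, long_term, images, threshold=0.8):
    """Search the short-term tier first; fall back to the long-term tier on a miss."""
    indexes = search_tier(query_vec, short_term, threshold)
    if not indexes:
        indexes = search_tier(query_vec, long_term, threshold)
    # Resolve relevant frame indexes to original images in the image library.
    return [(idx, images[idx]) for idx in indexes if idx in images]
```

A linear scan is used here for clarity; an approximate nearest-neighbor index could replace search_tier without changing the two-tier control flow.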
[0028] In this embodiment of the disclosure, the processing of real-time video streams and the processing of query requests can be carried out in parallel. In other words, while processing query requests, real-time video streams can continue to be collected, and the feature index library and the original image library can be updated.
[0029] Searching both the feature index library and the original image library makes it possible to retrieve relevant historical frame images for query requests that include historical requirements, and to obtain the original high-resolution frames on demand. Using these historical frame images when generating answer content gives the generation step access to the most original visual details, meets the needs of large models for high-quality input, and improves the accuracy of the generated answer content.
[0030] In one embodiment, the method may further include:
inputting historical frame images from the second video stream into the transformation model to generate a key-value cache matrix;
pooling the key-value cache matrix to obtain the historical frame vector;
storing the mapping relationship between the index and the historical frame vector in the feature index library; and
storing the mapping relationship between the index and the historical frame image in the original image library.
[0031] In this embodiment, all video frame images in the real-time video stream (or second video stream) can be processed and then used to update the feature index library and the original image library respectively. Alternatively, keyframe images in the real-time video stream can be processed and then used to update the feature index library and the original image library respectively. Keyframe images may include the video frame images remaining after deduplication and/or low-quality filtering of the real-time video stream. For ease of distinction, the video frame images to be added to the libraries can be referred to as historical frame images.
[0032] Historical frame images obtained from the second video stream are input into a transformation model, such as a small-scale language model. This model utilizes attention mechanisms to generate a key-value (KV) cache matrix based on the pixels in the image. Global pooling is then performed on the KV cache matrix to obtain the image feature vector, i.e., the historical frame vector. This process compresses a high-dimensional token sequence into a low-dimensional global feature vector. A mapping relationship between indices and historical frame vectors is constructed in a feature index library. Similarly, a mapping relationship between indices and historical frame images is constructed in the original image library. The index of a historical frame image in the original image library and the index of the historical frame vector encoded from that historical frame image in the feature index library can be the same or have a transformation relationship.
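The pooling step above, which compresses a token sequence into one global feature vector, can be illustrated with mean pooling. The text only says "global pooling", so the use of the mean (rather than, say, the maximum) is an assumption, as is the function name global_mean_pool; the matrix is represented as a list of per-token rows.

```python
def global_mean_pool(token_matrix):
    """Average a (num_tokens x dim) matrix over tokens into one dim-length vector.

    Mean pooling is an assumption; the source only specifies global pooling.
    """
    if not token_matrix:
        raise ValueError("token_matrix must contain at least one token")
    dim = len(token_matrix[0])
    if any(len(row) != dim for row in token_matrix):
        raise ValueError("all token rows must have the same dimension")
    count = len(token_matrix)
    return [sum(row[j] for row in token_matrix) / count for j in range(dim)]

# Two tokens of dimension 2 collapse into a single 2-dimensional frame vector.
frame_vector = global_mean_pool([[1.0, 2.0], [3.0, 4.0]])
```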
[0033] Dual-track storage facilitates the on-demand retrieval of original high-resolution historical frame images, enabling the generation of responses to capture the most original visual details, meeting the high-quality input requirements of large models, and improving the accuracy of the generated responses.
[0034] In one implementation, the method may further include one or more of the following:
performing differential detection between different video frame images of the second video stream, and removing duplicate video frame images from the video frame images of the second video stream based on the differential detection results, thereby obtaining historical frame images;
performing quality detection on the video frame images of the second video stream to filter out video frame images that fall below the quality requirements, thereby obtaining historical frame images.
[0035] In the embodiments disclosed herein, as shown in Figure 4, the real-time video stream acquired by the input processing layer (Layer 0) of the streaming video understanding system can be referred to as the second video stream. An opening statement can be provided using an opening statement toolkit to guide the acquisition of the real-time video stream or to prompt the user for a query. The second video stream is typically acquired continuously; therefore, over certain time periods it may include the first video stream, i.e., the portion acquired within the start and end interval of a query request. Video streams acquired before, during, or after the user's query request can all be processed in the same manner as the second video stream.
[0036] For example, a gatekeeper can be used to perform differential deduplication on video frames in the second video stream. Differential detection between different video frames in the second video stream can include pixel-level differential detection and/or feature-level differential detection. If the differential detection result indicates that the difference between two video frames is small, for example below a preset difference threshold, the two frames can be determined to be duplicates, and one of them can be removed, thus filtering out invalid static frame images. The frame to remove can be chosen at random, or a further quality check can be performed so that the lower-quality frame of the two is removed.
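Pixel-level differential deduplication, as described above, can be sketched by comparing each frame with the most recently kept frame. The sketch assumes frames are equal-size grayscale images flattened into lists of pixel values; the function names and the threshold of 5.0 are hypothetical, and the sketch always keeps the earlier frame of a duplicate pair, which is only one of the options the text mentions.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equal-size grayscale frames."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def deduplicate(frames, diff_threshold=5.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = []
    for frame in frames:
        if not kept or mean_abs_diff(kept[-1], frame) >= diff_threshold:
            kept.append(frame)
    return kept

# Four tiny 2x2 frames: the 2nd and 4th are near-copies of their predecessors.
stream = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [100, 100, 100, 100],
    [100, 100, 101, 100],
]
valid_frames = deduplicate(stream)
```

Comparing against the last kept frame rather than the immediately preceding frame prevents slow drift from slipping through as a chain of individually small differences.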
[0037] Quality inspection of the video frames in the second video stream can filter out low-quality frames such as solid-color images and blurred images. The remaining valid frames (v_t) can be distributed to different modules of the decision-making agent for processing; see arrows L1-1 to L1-6 in Figure 4 and Figure 5.
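One possible way to flag the solid-color and blurred frames mentioned above is sketched below. It is an assumption, not the disclosed method: low pixel variance stands in for "solid color", and low variance of a 4-neighbor Laplacian response stands in for "blurred", which is a common sharpness heuristic. Frames are assumed to be grayscale images given as lists of rows, and both thresholds are placeholders.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def laplacian_variance(img):
    """Variance of a 4-neighbor Laplacian over interior pixels; low values suggest blur."""
    height, width = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return variance(responses) if responses else 0.0

def passes_quality(img, min_pixel_var=1.0, min_sharpness=10.0):
    """Reject near-solid-color frames first, then frames with too little edge energy."""
    pixels = [p for row in img for p in row]
    if variance(pixels) < min_pixel_var:
        return False
    return laplacian_variance(img) >= min_sharpness

solid = [[50] * 4 for _ in range(4)]
checker = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]
```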
[0038] Differential detection and quality inspection can be performed individually or in combination. For example, differential detection is first performed on 1000 video frames from the video stream, filtering out 400 invalid static frames. Quality inspection is then performed on the remaining 600 video frames, filtering out 300 low-quality frames and leaving 300 video frames. Alternatively, quality inspection is first performed on 1500 video frames from the video stream, filtering out 500 low-quality frames. Differential detection is then performed on the remaining 1000 video frames, filtering out 800 invalid static frames and leaving 200 video frames. The remaining video frames are then used as historical frame images and stored using dual-track storage.
[0039] Differential detection can filter out invalid static frame images, and quality detection can filter out low-quality images, thereby retaining valid and high-quality images, reducing resource waste caused by invalid frames, and lowering resource consumption.
[0040] In one implementation, in S130, using at least a multi-modal question-answering model to generate answer content based on the query request and the video frame images in the first video stream includes:
S340. Using a multi-modal question-answering model, generate answer content based on the query request, the video frame images in the first video stream, and one or more of: historical frame images related to the query request, visual tool call results, online search results, text memory, and historical questions and answers.
[0041] In this embodiment of the disclosure, the streaming video understanding system can integrate multiple types of information to generate response content. For example, as shown in Figure 2, the response generation layer (Layer 3) of the streaming video understanding system can receive query requests. If the query request has no historical requirements, it can be input into the multi-modal question-answering model along with video frame images from the first video stream to generate the answer. If the query request has historical requirements, historical frame images related to the query request can first be retrieved using the feature index library and the original image library, and the query request, the video frame images from the first video stream, and the retrieved historical frame images can then be input into the multi-modal question-answering model to generate the answer. As shown in Figure 6, the response generation layer can first perform route discrimination; see steps S110 to S140. The route discrimination module determines whether to execute a multi-modal branch answering task or a text branch answering task based on the image-text relevance. If the image-text relevance is high, the query request is related to the content of the video stream, and the multi-modal branch answering task (also known as the visual task) can be executed. The multi-modal question-answering model combines various information, such as the query request, images in the first video stream, historical frame images, and historical questions and answers, to generate the final answer content. If the image-text relevance is low, the text branch answering task (also known as the plain text task) can be executed. The text corresponding to the query request and the historical questions and answers are packed into an LLM context (LLM Context Packing), and the final answer content is then generated by the text question-answering model. Previously generated historical questions and answers can be stored in a historical database for use by the multi-modal question-answering model and/or the text question-answering model.
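The LLM context packing step for the text branch can be sketched as simple string assembly. The prompt layout, the function name pack_context, and the max_turns limit are all illustrative assumptions; the disclosure does not specify how the context is formatted.

```python
def pack_context(query, history_qa, max_turns=3, extra=None):
    """Build a plain-text prompt from recent Q&A turns, optional notes, and the query.

    history_qa is a list of (question, answer) pairs, oldest first. The layout
    is a hypothetical example, not a format taken from the disclosure.
    """
    recent = history_qa[-max_turns:] if max_turns > 0 else []
    lines = []
    for question, answer in recent:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    for note in extra or []:
        lines.append(f"Context: {note}")
    lines.append(f"User: {query}")
    return "\n".join(lines)

prompt = pack_context("And tomorrow?", [("What is the weather today?", "Sunny.")])
```

The same function could carry visual tool results or search results through the extra argument when the multi-modal branch reuses the packing step.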
[0042] Multimodal question-answering models can comprehensively utilize information from multiple modalities. Among them, historical frame images have high resolution, which can make the generated answer content reflect the original visual details and improve the accuracy of the generated answer content.
[0043] In one embodiment, the method may further include: The visual tool invocation model is used to analyze one or more of the query request, video frame images in the first video stream, and retrieved historical frame images to determine the visual tool to be invoked; wherein, the visual tool is used to process the video frame images in the first video stream and / or historical frame images related to the query request to obtain the visual tool invocation result.
[0044] In one implementation, in S130, at least a multi-modal question-answering model is used to generate answer content based on the query request and video frame images in the first video stream, including: using the multi-modal question-answering model to generate answer content based on the query request, video frame images in the first video stream, and visual tool call results.
[0045] In embodiments of this disclosure, if the query request includes a need to invoke visual tools, the streaming video understanding system can invoke various visual tools from a visual tool library. For example, as shown in Figure 2, a streaming video understanding system can include a decision-making agent (Layer 2) that can select visual tools through a visual tool invocation model. The visual tool library can include visual tools (or visual operators) such as Optical Character Recognition (OCR), plant and animal recognition, celebrity recognition, artifact recognition, image search, and logo recognition. If the query request has no historical requirements, the visual tool invocation model can analyze the query request and the video frame images in the first video stream to determine the visual tool to be invoked. If the query request has historical requirements, the visual tool invocation model can analyze the query request, the video frame images in the first video stream, and the retrieved historical frame images to determine the visual tool to be invoked. For example, invoking the plant and animal recognition operator can identify plant and animal information from frame images, and invoking the OCR operator can identify text from frame images. The results obtained after processing with the visual tools, such as text recognition results, plant and animal recognition results, and image search results, can be called visual tool invocation results. As shown in Figure 5, the visual tool invocation results can be sent to the planning generation model for later use, or they can be sent to the LLM context packing module of the response generation layer for later use; see arrow L2-5 in Figure 5 and Figure 6.
[0046] By calling various visual tools on demand through the model, visual knowledge can be expanded and redundant calls can be avoided.
[0047] Figure 7 is a flowchart illustrating a question-answering method 700 for video streams according to another embodiment of the present disclosure. This method can be used to implement one or more features of the above-described question-answering method for video streams. In one embodiment, the method 700 may further include: S710. If it is determined that there is a search intent based on one or more of the query request, the video frame images in the first video stream, and the historical frame images related to the query request, the query request is rewritten to obtain rewritten text. S720. Invoke the search tool to perform an online search based on the rewritten text and obtain the online search results.
[0048] In one implementation, in S130, at least a multi-modal question-answering model is used to generate answer content based on the query request and video frame images in the first video stream, including: using the multi-modal question-answering model to generate answer content based on the query request, video frame images in the first video stream, and online search results.
[0049] In this embodiment of the disclosure, if the query request includes a search requirement, the streaming video understanding system can initiate a network search function. For example, as shown in Figure 5, the decision-making agent can include a classification operator, which can determine whether a query request has search intent. The classification operator can analyze the input query request Q on its own to determine whether it has search intent, or it can perform a comprehensive analysis of the query request and valid frames (Q+v_t). Valid frames (v_t) can include video frame images retained after deduplication and low-quality removal operations on the first video stream. If the query request has no historical demand, the classification operator can analyze the query request and the video frame images in the first video stream to determine whether it has search intent. If the query request has historical demand, the classification operator can analyze the query request, the video frame images in the first video stream, and the retrieved historical frame images to determine whether it has search intent. For example, if the query request includes "Please explain the maintenance methods for plant P1 in the video," it indicates a demand for searching plant maintenance methods. Similarly, if the query request includes "The latest travel guides for the attractions in the video," it indicates a demand for searching for travel guides for the attractions.
[0050] If there is search intent, the rewriting operator can be used to rewrite the original text corresponding to the query request, resulting in rewritten text. If there is no search intent, the search can be skipped. Rewritten text is usually more suitable for searching than the original text, and the rewriting can be done using a rewriting template or a model. After rewriting, a second determination can be made based on the rewritten text to determine whether there is search intent. If the second determination shows no search intent, the search can be skipped. If the second determination shows search intent, a text search tool can be used to perform an online search using the rewritten text to obtain online search results. As shown in Figure 5, online search results can be sent to the decision generation model (planning generation model) for further processing, or they can be distributed to the LLM context packaging module of the multi-modal or text branch of the response generation layer for later use; see arrows L2-1 and L2-2 in Figure 5 and Figure 6.
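The two-stage search flow above (intent check, rewrite, second intent check, search) can be sketched as follows. All names here are hypothetical: `has_search_intent` is a keyword stand-in for the classification operator, `rewrite_query` is a trivial template standing in for the rewriting operator, and `search_fn` is whatever text search tool the caller supplies.

```python
from typing import Callable, List, Optional

def has_search_intent(text: str) -> bool:
    """Keyword stand-in for the classification operator (assumption)."""
    cues = ("latest", "maintenance", "rating", "guide")
    return any(cue in text.lower() for cue in cues)

def rewrite_query(query: str, entity: str) -> str:
    """Template rewrite (assumption): prepend the entity the query refers to."""
    return f"{entity}: {query}"

def maybe_search(query: str,
                 entity: str,
                 search_fn: Callable[[str], List[str]]) -> Optional[List[str]]:
    """Return online search results, or None when the search is skipped."""
    if not has_search_intent(query):
        return None  # first determination: no search intent, skip
    rewritten = rewrite_query(query, entity)
    if not has_search_intent(rewritten):
        return None  # second determination on the rewritten text
    return search_fn(rewritten)
```

Returning `None` rather than an empty list lets downstream code distinguish a skipped search from a search that found nothing.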
[0051] As shown in Figure 5, the decision agent of the streaming video understanding system can also include an asynchronous semantic memory loop that generates captions / text descriptions, incrementally updates the semantic memory m_t (as a Markov process), and uses the textual semantic memory m_t for planning.
[0052] Figure 8 is a flowchart illustrating a question-answering method 800 for video streams according to another embodiment of this disclosure. This method can be used to implement one or more features of the above-described question-answering method for video streams. In one embodiment, the method 800 may further include: S810. Using a decision generation model, based on one or more of the visual tool call results, online search results, text memory, query request, and image-text relevance, determine whether the current information can answer the query request. S820. If the current information can answer the query request, perform the step of determining a multi-mode branch answer task or a text branch answer task. This step can be performed before S120.
[0053] In this embodiment of the disclosure, the decision generation model (or planning generation module) of the streaming video understanding system can determine whether the current information is sufficient to answer the query request. If sufficient, it then executes step S120 to determine whether to execute the multi-mode branch answering task or the text branch answering task. If insufficient, it can continue to wait. As shown in Figure 5, the planning generation module integrates the query request, online search results, tool call results, historical frame images, and text memory (or semantic memory) to determine whether the information is sufficient to answer the query request. If so, a response plan is prepared (PLAN A: Respond); otherwise, a wait plan is prepared (PLAN B: Wait). Heuristic evaluation is performed on the response plan and the wait plan, further determining whether to respond or wait based on preset heuristic principles. The heuristic evaluation result determines whether to trigger a response (Yes (Respond)) or trigger a wait (No (Wait)). If a response is triggered (see arrows L2-3 and L2-4 in Figure 5 and Figure 6), the routing discrimination model of the response generation layer can determine task routing based on the image-text relevance and generate the final answer content according to the task routing result. The scale of the decision generation model can be smaller than that of the question answering model used subsequently.
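The respond-or-wait decision above can be illustrated with a small rule-based sketch. It is not the decision generation model itself: the sufficiency rules and the wait budget `max_wait_steps` are hypothetical stand-ins, since the disclosure does not specify the preset heuristic principles.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CurrentInfo:
    """Information available to the planner at decision time."""
    query: str
    needs_search: bool = False
    needs_tools: bool = False
    search_results: List[str] = field(default_factory=list)
    tool_results: List[str] = field(default_factory=list)
    text_memory: str = ""

def is_sufficient(info: CurrentInfo) -> bool:
    """Rule stand-in (assumption): required evidence must already be present."""
    if info.needs_search and not info.search_results:
        return False
    if info.needs_tools and not info.tool_results:
        return False
    return True

def decide(info: CurrentInfo, waited_steps: int = 0, max_wait_steps: int = 3) -> str:
    """Return "respond" (PLAN A) or "wait" (PLAN B)."""
    plan = "respond" if is_sufficient(info) else "wait"
    # Hypothetical heuristic: stop waiting once the wait budget is spent.
    if plan == "wait" and waited_steps >= max_wait_steps:
        plan = "respond"
    return plan
```

Only a "respond" outcome would hand control to the routing step that chooses between the multi-modal branch and the text branch.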
[0054] Decision generation models and question-answering models can balance real-time response and deep reasoning, respectively. For example, an asymmetric decoupling architecture that combines high-frequency decision-making by a small model with deep reasoning by a large model can achieve both rapid response and high-IQ answers.
[0055] In one implementation, in S130, at least a multi-modal question-answering model is used to generate answer content based on the query request and video frame images in the first video stream, including: using the multi-modal question-answering model to generate answer content based on one or more of the query request, video frame images in the first video stream, retrieved historical frame images, visual tool call results, online search results, text memory, and historical questions and answers.
[0056] In this embodiment of the disclosure, the multi-modal question-answering model of the streaming video understanding system can generate answer content by integrating various types of information. For example, if the query request has no historical requirements, the multi-modal question-answering model can generate answer content based on the query request, video frame images in the first video stream, visual tool call results, online search results, text memory, and historical questions and answers. If the query request has historical requirements, the multi-modal question-answering model can generate answer content based on the query request, video frame images in the first video stream, retrieved historical frame images, visual tool call results, online search results, text memory, and historical questions and answers. Of course, if there is no search requirement, online search results are not required when generating an answer; if there is no visual tool call requirement, visual tool call results are not required when generating an answer.
[0057] By integrating rich information through multimodal question-answering models, the accuracy of generated answers can be improved. Frame images are beneficial for revealing visual details, results from visual tool calls and online search results help broaden the knowledge boundaries of large models, and historical question answers help improve the reasonableness of answers by combining context.
[0058] Some streaming video understanding agent frameworks can be improved in the following aspects: 1. Lack of an "image-text relevance routing" mechanism leads to low efficiency in processing pure text tasks: The architecture of streaming video understanding agent frameworks defaults to treating all user queries as video-related, forcibly calling the visual perception module and key-value retrieval. When users ask questions unrelated to the video (such as greetings or general knowledge questions), this causes unnecessary inference delays. This disclosure introduces a classification operator that can accurately identify "pure text tasks" and route them to the image-free branch, directly calling the LLM for processing.
[0059] 2. Lack of "external knowledge acquisition" capability, limited to a closed visual domain: The tool library of the streaming video understanding agent framework is limited to visual perception tools and supports a limited variety, making it unable to handle problems requiring external information (such as "querying the rating of the movie mentioned earlier"). This disclosure, however, integrates rewriting and text search operators, using internet connectivity for searching, thus breaking down the knowledge boundaries of video models.
[0060] 3. Lack of sophisticated deduplication and differentiation mechanisms at the input end: Although the streaming video understanding agent framework has a key-value (KV) cache eviction mechanism, it lacks preprocessing of the original video stream at the input. This means that even for completely still images, the visual encoder still needs to work continuously and generate tokens, increasing the system's base load. This disclosure effectively filters invalid frames through a gatekeeper, reducing the downstream load.
[0061] 4. The final response capability is limited by the model size, making it difficult to meet high intelligence requirements: The streaming video understanding agent framework uses a small-scale model, such as a 7B model, for the final response generation. Although it is an improvement over the 3B model, the 7B model still lags significantly behind ultra-large-scale language models (such as the 235B model) in terms of reasoning ability, knowledge breadth, and generation quality when dealing with complex and open-ended questions. This makes it difficult for its final response to reach top-level depth, accuracy, and fluency, limiting its performance ceiling in advanced application scenarios. This disclosure can introduce a larger-scale model to overcome this "intelligence bottleneck."
[0062] 5. The singularity of the KV caching strategy cannot support the on-demand retrieval of raw visual information for ultra-large models during inference: The KV cache design of streaming video understanding agent frameworks mainly serves its own backtracking. This disclosure, however, designs a dual-track storage system of raw buffer and streaming KV. The raw buffer is specifically designed to provide ultra-large models with raw, high-resolution frames that can be retrieved on demand, rather than relying on compressed streaming KV caches. This ensures that the most original visual details can be obtained during the final response, meeting the high-quality input requirements of large models.
[0063] This disclosure proposes a solution applicable to smart glasses (AR/VR), avatar robots, intelligent security monitoring systems, and real-time multimodal AI assistants that require continuous processing of video streams and complex multi-round interactions with users.
[0064] This disclosure includes a streaming video understanding method and system based on asymmetric model collaboration, global feature index retrieval, and dynamic visual tool invocation. The solution constructs a four-in-one processing chain, specifically including: 1. Asymmetric architecture: Decoupling the high-frequency decision-making ability of small models (such as 3B language models) from the deep reasoning ability of large models (such as 235B language models); 2. Global index retrieval: A lightweight index is built using global pooling features generated by a small model, guiding the large model to trace back to the original high-resolution image; 3. Expanding the knowledge boundaries of large models: dynamic visual tool invocation and online search functionality; 4. Large-scale model reasoning generation that integrates original images and tool call results: decision generation and multi-modal / text branch response.
[0065] In one example, the specific technical solution and algorithm logic are detailed below: (I) Differential gating and dual-track storage of the input video stream (see Figure 4). The system performs splitting processing on the input video stream, constructing a physically isolated "original image library" and "feature index library," as described below: The Gatekeeper removes invalid frames: it filters invalid static frames based on a similarity threshold and filters low-quality video frames, such as solid color images and blurry images, based on image quality detection.
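The Gatekeeper step above can be sketched as follows. The concrete measures are assumptions chosen for illustration: frames are flattened grayscale pixel lists in [0, 1], static frames are detected by mean absolute difference against the last kept frame, and solid-color frames are detected by near-zero pixel variance. Blur detection is omitted from this sketch.

```python
from statistics import pvariance
from typing import List, Optional, Sequence

Frame = Sequence[float]  # flattened grayscale pixels in [0, 1] (assumption)

def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def gatekeeper(frames: Sequence[Frame],
               diff_threshold: float = 0.05,
               min_variance: float = 1e-4) -> List[int]:
    """Return the indices of valid frames to keep."""
    kept: List[int] = []
    last_kept: Optional[Frame] = None
    for i, frame in enumerate(frames):
        # Quality check: a solid-color frame has (almost) no pixel variance.
        if pvariance(frame) < min_variance:
            continue
        # Differential check: drop frames nearly identical to the last kept one.
        if last_kept is not None and mean_abs_diff(frame, last_kept) < diff_threshold:
            continue
        kept.append(i)
        last_kept = frame
    return kept
```

Comparing against the last kept frame, rather than the immediately preceding frame, keeps slow drift from being discarded one small step at a time.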
[0066] Dual-track storage: The original image buffer (i.e., the "original image library") stores the original images to disk and constructs an ID_t → Image_t mapping.
[0067] KV storage (i.e., "feature index library"): The original image is input into a small model (e.g., Qwen2.5 VL 3B) to generate KVCache_t, and global pooling is performed to compress the high-dimensional token sequence into a global feature vector Vec_t, constructing an ID_t → Vec_t mapping.
[0068] (II) Feature Index Construction and ID Retrieval Based on Global Pooling. This embodiment utilizes a KV Cache to generate compressed "fingerprints" for retrieval, and the retrieval results are used to extract the original image. The technical description is as follows: Dynamic retrieval: The user query is input and encoded into a query vector Vec_Q, which is compared by similarity calculation with the Vec_t of all historical frames in the "feature index library" to select the top-k relevant frames; Output retrieval IDs: A list of retrieved keyframe IDs, List_ID = {id_1, id_2, …, id_k}, is output; these index IDs can be used to retrieve the original images from the "original image library".
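The dual-track storage and top-k retrieval described above can be sketched as follows. Several details are assumptions made for illustration: mean pooling stands in for the unspecified global pooling, cosine similarity stands in for the unspecified similarity calculation, and the class name `DualTrackStore` and its methods are hypothetical. Token features are passed in directly rather than produced by a real model.

```python
import math
from typing import Dict, List, Sequence

def global_pool(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool a token sequence (n_tokens x dim) into one vector (assumption)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class DualTrackStore:
    def __init__(self) -> None:
        self.original_images: Dict[int, str] = {}        # ID_t -> Image_t (e.g. a path)
        self.feature_index: Dict[int, List[float]] = {}  # ID_t -> Vec_t

    def add(self, frame_id: int, image: str, tokens: Sequence[Sequence[float]]) -> None:
        """Store the original image and its pooled feature vector under one ID."""
        self.original_images[frame_id] = image
        self.feature_index[frame_id] = global_pool(tokens)

    def retrieve_ids(self, query_vec: Sequence[float], k: int) -> List[int]:
        """Return List_ID: the IDs of the top-k most similar historical frames."""
        ranked = sorted(self.feature_index,
                        key=lambda i: cosine(query_vec, self.feature_index[i]),
                        reverse=True)
        return ranked[:k]

    def load_originals(self, ids: Sequence[int]) -> List[str]:
        """Fetch the original images for the retrieved IDs."""
        return [self.original_images[i] for i in ids]
```

Because only the pooled vectors are scanned at query time, the original images are touched only for the few IDs that retrieval returns.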
[0069] (III) Expanding the knowledge boundaries of large models based on small models (see also Figure 5). This embodiment of the disclosure acquires external information in real time through visual tools and online search tools, thereby expanding the knowledge boundary of the large model and obtaining more accurate answers. The technical description is as follows: Dynamic visual tool invocation: A large model, such as the 7B model, analyzes user queries and input video frames to invoke visual tools on demand, acquiring external visual knowledge while avoiding redundant calls. The visual tool library includes numerous visual operators such as OCR, plant and animal recognition, celebrity recognition, cultural relic recognition, image search, and logo recognition.
[0070] Internet search tool: Determines the search intent based on the current user query and input video frames. If a search is required, it obtains internet knowledge by rewriting the query and using text search tools.
[0071] (IV) Large-scale model inference generation by integrating the original image and the results of tool calls (see Figure 6). This disclosure first uses a decision module to determine whether the current information is sufficient to support an answer to the user's query. Then, it combines video frames within the current user query range, relevant historical frames, and tool call results to generate the final answer. The technical description is as follows: Decision generation: The decision module can be a large model, such as the 7B model. It uses input visual tool results, online search results, text memory, user query, and image-text relevance to determine whether all the information that has appeared so far can answer the current user query.
[0072] Generating the final answer: The system determines whether to use the multi-modal branch or the text branch based on image-text relevance. If the current user query involves historical visual information, the frames in the retrieved List_ID are added to the input to ensure the accuracy of the generated answer. Based on List_ID, the original images can be obtained from the original image library.
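The branch selection and history loading described above can be sketched as follows. The relevance score is assumed to come from an upstream routing discrimination model; the threshold value, the function name `generate_answer`, and the callable signatures of the two models are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence

def generate_answer(query: str,
                    current_frames: Sequence[str],
                    relevance: float,
                    multimodal_model: Callable[[str, List[str]], str],
                    text_model: Callable[[str], str],
                    retrieved_ids: Optional[Sequence[int]] = None,
                    original_image_library: Optional[Dict[int, str]] = None,
                    threshold: float = 0.5) -> str:
    """Route to the text branch or the multi-modal branch by image-text relevance."""
    if relevance < threshold:
        # Pure text task: no frames are loaded and no retrieval is performed.
        return text_model(query)
    history: List[str] = []
    if retrieved_ids and original_image_library is not None:
        # Load the original images for the retrieved List_ID.
        history = [original_image_library[i] for i in retrieved_ids]
    return multimodal_model(query, list(current_frames) + history)
```

In this sketch the original image library is consulted only on the multi-modal branch, so text-only queries never pay for image loading.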
[0073] One application scenario example disclosed herein may include intelligent traffic monitoring, the specific process of which is as follows: Streaming index: Surveillance cameras can capture images 24/7. The Gatekeeper filters still images. Valid frames are stored on disk (original images), and after being encoded by a small model and globally pooled, they are stored in the index library as vectors (each frame occupies only a tiny number of bytes).
[0074] User query: The traffic police asked, "Tell me the license plate number of that red sports car three minutes ago, and what brand it was?" Retrieval and Original Image Loading: The system vectorizes the "red sports car" in the query and quickly retrieves the most similar frame images, such as frames 1054 and 1055 (ID), from the feature index library. The system then reads the original monitoring screenshots of frames 1054 and 1055 from the disk based on the ID.
[0075] Dynamic visual tool application / online search: The logo recognition operator identifies the sports car brand, the OCR operator recognizes the license plate, and online search can provide more information about the sports car. Large-scale model inference: The decision-making module determines that an answer can currently be triggered, and then inputs the relevant historical frames (frames 1054 and 1055), video frames during the user's query, visual tool call results, and online search results into a large-scale language model, such as the 235B model, to generate the final answer: "The red sports car just now was a Ferrari, and its license plate number is Su A•88."
[0076] This disclosure utilizes an asymmetric collaborative architecture, global feature indexing, dynamic tool invocation, and dual-track storage to build core competitiveness for the product across four dimensions: performance, user experience, cost, and scenario adaptation. Compared to general streaming agents and streaming video models, it offers one or more of the following advantages: Breaking through the core contradiction of "high precision - low latency" and expanding the ability to be applied in high-end scenarios: If the scale of a single model is small, it is impossible to balance real-time response and deep reasoning. However, the embodiments disclosed in this publication achieve rapid response and high-IQ answers through an asymmetric decoupling architecture of "small model high-frequency decision-making + large model deep reasoning".
[0077] Task response efficiency is doubled: To address the inefficiency of forcing all tasks to follow the visual processing path, this embodiment of the disclosure achieves precise task type routing through the "image-text relevance routing" mechanism: pure text tasks (such as chat and weather check) directly call the LLM text branch without loading the visual encoder and KV retrieval module.
[0078] Significantly reduced resource consumption: This embodiment of the disclosure optimizes the input end through a gatekeeper, reducing resource waste caused by invalid frames.
[0079] Breaking the limitations of closed domains and significantly reducing knowledge hallucination: A model that relies solely on internal training knowledge and video footage is prone to hallucination when faced with queries about external information (such as "the latest travel guides for the attractions in the video" or "the ratings of the mentioned movies"). This disclosure expands the boundaries of knowledge through a "dynamic visual tool library + online search".
[0080] Improved visual detail fidelity and higher accuracy for complex visual tasks: Relying on compressed KV cache for inference can easily lead to the loss of original visual details, resulting in insufficient accuracy for complex tasks (such as license plate recognition and small target analysis). The dual-track storage design (Raw Buffer + KV Cache) of this disclosure allows large models to retrieve original high-definition frames on demand and combine them with dynamic vision tools for detail enhancement.
[0081] Figure 9 is a schematic diagram of a question-and-answer device 900 for a video stream according to an embodiment of the present disclosure. In one embodiment, the device 900 may include: an acquisition module 901, used to acquire the first video stream within the start and end interval of the query request; a determining module 902, used to determine whether to perform a multi-mode branch answering task or a text branch answering task based on the image-text relevance between the query request and the video frame images in the first video stream; a multimodal question answering module 903, used to generate answer content based on the query request and video frame images in the first video stream, at least using a multimodal question answering model, in response to determining to execute the multimodal branch answering task; and a text question answering module 904, used to generate answer content based on the query request by using a text question answering model in response to determining to execute the text branch answering task.
[0082] Figure 10 is a schematic diagram of a question-and-answer device 1000 for video streaming according to another embodiment of this disclosure. The device may include: an acquisition module 1001, a determination module 1002, a multi-modal question-and-answer module 1003, and a text question-and-answer module 1004. For the functions and roles of each module, refer to the corresponding modules in the aforementioned question-and-answer device 900 for video streaming.
[0083] In one implementation, the multi-modal question-answering module 1003 is further configured to, in response to determining to execute the multi-modal branch answering task, generate answer content using a multi-modal question-answering model based on one or more of the query request, video frame images in the first video stream, historical frame images related to the query request, visual tool call results, online search results, text memory, and historical questions and answers.
[0084] In one embodiment, the device 1000 may further include: Encoding module 1005 is used to encode the query request into a query vector when the query request includes historical requirements; The first retrieval module 1006 is used to retrieve historical frame vectors in the feature index library that have a similarity to the query vector that meets a threshold, and obtain a relevant frame index; wherein, the feature index library includes a mapping relationship between the index and the historical frame vector; The second retrieval module 1007 is used to retrieve the historical frame image corresponding to the relevant frame index in the original image library to obtain the historical frame image related to the query request; wherein, the original image library includes the mapping relationship between the index and the historical frame image.
[0085] In one implementation, as shown in Figure 10, the device 1000 may further include: a key-value processing module 1008, used to input historical frame images from the second video stream into the conversion model to generate a key-value cache matrix; a pooling module 1009, used to perform pooling processing on the key-value cache matrix to obtain historical frame vectors; a first storage module 1010, used to store the mapping relationship between the index and the historical frame vector into the feature index library; and a second storage module 1011, used to store the mapping relationship between the index and the historical frame image into the original image library.
[0086] In one implementation, as shown in Figure 10, the device 1000 may also include one or more of the following: a differential detection module 1012, used to perform differential detection between different video frame images of the second video stream, so as to remove duplicate video frame images from the video frame images of the second video stream according to the differential detection results and obtain historical frame images; and a quality detection module 1013, used to perform quality detection on the video frame images of the second video stream to filter out video frame images that fall below the quality requirements, so as to obtain historical frame images.
[0087] In one implementation, the multimodal question answering module is further configured to, in response to determining to perform the multimodal branch answering task, generate answer content using a multimodal question answering model based on the query request, video frame images in the first video stream, and historical frame images related to the query request.
[0088] In one implementation, as shown in Figure 10, the device 1000 may further include: a visual tool invocation module 1014, used to analyze one or more of the query request, video frame images in the first video stream, and historical frame images related to the query request using a visual tool invocation model to determine the visual tool to be invoked; wherein the visual tool is used to process the video frame images in the first video stream and/or the retrieved historical frame images to obtain the visual tool invocation result.
[0089] In one implementation, the multi-modal question-answering module is further configured to, in response to determining to execute the multi-modal branch answering task, generate answer content using a multi-modal question-answering model based on the query request, video frame images in the first video stream, and visual tool call results.
[0090] In one implementation, as shown in Figure 10, the device 1000 may further include: a rewriting module 1015, used to rewrite the query request to obtain rewritten text when it is determined that there is a search intent based on one or more of the query request, the video frame images in the first video stream, and the retrieved historical frame images; and a search module 1016, used to call a search tool to perform an online search based on the rewritten text and obtain online search results.
[0091] In one implementation, the multi-modal question-answering module is further configured to, in response to determining to perform the multi-modal branch answering task, generate answer content using a multi-modal question-answering model based on the query request, video frame images in the first video stream, and online search results.
[0092] In one implementation, as shown in Figure 10, the device 1000 may further include: a decision generation module 1017, used to determine whether the current information can answer the query request based on one or more of the visual tool invocation results, online search results, text memory, query request, and image-text relevance; if the current information can answer the query request, it performs the step of determining a multi-mode branch answer task or a text branch answer task.
[0093] In one implementation, the multimodal question-answering module is further configured to generate answer content using a multimodal question-answering model based on the query request, video frame images in the first video stream, retrieved historical frame images, visual tool call results, and online search results.
[0094] The specific functions and examples of each module and submodule of the apparatus in this disclosure can be found in the relevant descriptions of the corresponding steps in the above method embodiments, and will not be repeated here.
[0095] The acquisition, storage, and application of any type of information, such as user personal information, involved in the technical solutions disclosed herein comply with relevant laws and regulations and do not violate public order and good morals.
[0096] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0097] Figure 11 shows a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
[0098] As shown in Figure 11, device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 1102 or a computer program loaded into random access memory (RAM) 1103 from storage unit 1108. The RAM 1103 may also store various programs and data required for the operation of device 1100. The computing unit 1101, ROM 1102, and RAM 1103 are interconnected via bus 1104. Input/output (I/O) interface 1105 is also connected to bus 1104.
[0099] Multiple components in device 1100 are connected to I/O interface 1105, including: input unit 1106, such as a keyboard or mouse; output unit 1107, such as various types of monitors and speakers; storage unit 1108, such as a magnetic disk or optical disk; and communication unit 1109, such as a network card, modem, or wireless transceiver. Communication unit 1109 allows device 1100 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0100] The computing unit 1101 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the various methods and processes described above, such as a question-and-answer method for video streaming. For example, in some embodiments, the question-and-answer method for video streaming may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 1108. In some embodiments, part or all of the computer program may be loaded and / or installed on device 1100 via ROM 1102 and / or communication unit 1109. When the computer program is loaded into RAM 1103 and executed by the computing unit 1101, one or more steps of the question-and-answer method for video streaming described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform a question-and-answer method for video streaming by any other suitable means (e.g., by means of firmware).
[0101] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0102] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0103] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0104] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0105] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0106] Computer systems can include clients and servers. Clients and servers are generally remote from each other and typically interact via a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. A server can be a cloud server, a server in a distributed system, or a server incorporating blockchain technology.
[0107] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this disclosure can be achieved; no limitation is imposed herein.
[0108] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A question-answering method for video streams, comprising: acquiring a first video stream within the start and end interval of a query request; determining, based on the image-text relevance between the query request and video frame images in the first video stream, whether to execute a multi-modal branch answering task or a text branch answering task; in response to determining to execute the multi-modal branch answering task, generating answer content based on the query request and the video frame images in the first video stream using at least a multi-modal question-answering model; and in response to determining to execute the text branch answering task, generating answer content based on the query request using a text question-answering model.
2. The method according to claim 1, wherein generating answer content based on the query request and the video frame images in the first video stream using at least a multi-modal question-answering model includes: using the multi-modal question-answering model to generate answer content based on the query request, the video frame images in the first video stream, and one or more of: historical frame images related to the query request, visual tool call results, online search results, text memory, and historical questions and answers.
3. The method according to claim 2, further comprising: If the query request includes historical requirements, the query request is encoded to obtain a query vector; Retrieve historical frame vectors in the feature index library that have a similarity to the query vector that meets a threshold to obtain the relevant frame index; wherein, the feature index library includes the mapping relationship between the index and the historical frame vector; Retrieve the historical frame image corresponding to the relevant frame index in the original image library to obtain the historical frame image related to the query request; wherein, the original image library includes the mapping relationship between indexes and historical frame images.
4. The method according to claim 3, further comprising: Historical frame images from the second video stream are input into the transformation model to generate a key-value cache matrix; The key-value cache matrix is pooled to obtain the historical frame vector; The mapping relationship between the index and the historical frame vector is stored in the vector index library; The mapping relationship between the index and the historical frame image is stored in the original image library.
5. The method according to claim 4, further comprising one or more of the following: Differential detection is performed between different video frame images of the second video stream to remove duplicate video frame images from the video frame images of the second video stream based on the differential detection results, thereby obtaining historical frame images; The video frame images of the second video stream are subjected to quality detection to filter out video frame images that are below the quality requirements, thereby obtaining historical frame images.
6. The method according to any one of claims 2 to 5, further comprising: using a visual tool invocation model to analyze one or more of the query request, the video frame images in the first video stream, and the retrieved historical frame images to determine the visual tool to be invoked; wherein the visual tool is used to process the video frame images in the first video stream and/or the historical frame images related to the query request to obtain the visual tool invocation result.
7. The method according to any one of claims 2 to 5, further comprising: If a search intent is determined based on one or more of the query request, video frame images in the first video stream, and historical frame images related to the query request, the query request is rewritten to obtain rewritten text. The search tool is invoked to perform an online search based on the rewritten text, and the online search results are obtained.
8. The method according to claim 7, further comprising: using a decision generation model to determine, based on one or more of the visual tool call results, the online search results, the text memory, the query request, and the image-text relevance, whether the current information is sufficient to answer the query request; and if the current information is sufficient to answer the query request, proceeding to the step of determining whether to execute the multi-modal branch answering task or the text branch answering task.
9. A question-and-answer device for a video stream, comprising: an acquisition module, used to acquire a first video stream within the start and end interval of a query request; a determination module, used to determine whether to execute a multi-modal branch answering task or a text branch answering task based on the image-text relevance between the query request and video frame images in the first video stream; a multi-modal question answering module, used to generate answer content based on the query request and the video frame images in the first video stream, using at least a multi-modal question-answering model, in response to determining to execute the multi-modal branch answering task; and a text question answering module, used to generate answer content based on the query request, using a text question-answering model, in response to determining to execute the text branch answering task.
10. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
11. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 8.
12. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 8.
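For illustration only, the routing of claim 1 and the two-library storage and retrieval of claims 3 and 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not specify the pooling operation, the similarity measure, or how the image-text relevance is turned into a branch decision, so mean pooling, cosine similarity, and a fixed relevance threshold are assumed here. All names (`DualTrackStore`, `mean_pool`, `route`, `answer`) are hypothetical, and the two question-answering models are passed in as plain callables.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity (assumed metric; the claims only say 'similarity')."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_pool(matrix: Sequence[Vector]) -> List[float]:
    """Average the rows of a matrix into one vector (assumed pooling for claim 4)."""
    rows = len(matrix)
    return [sum(col) / rows for col in zip(*matrix)]


@dataclass
class DualTrackStore:
    # index -> historical frame vector (the feature index library of claim 3)
    feature_index: Dict[int, List[float]] = field(default_factory=dict)
    # index -> historical frame image (the original image library of claim 3)
    original_images: Dict[int, bytes] = field(default_factory=dict)

    def add(self, index: int, kv_cache_matrix: Sequence[Vector], image: bytes) -> None:
        """Pool the key-value cache matrix and store both mappings under one index."""
        self.feature_index[index] = mean_pool(kv_cache_matrix)
        self.original_images[index] = image

    def retrieve(self, query_vector: Vector, threshold: float) -> List[bytes]:
        """Return the images whose frame vectors meet the similarity threshold."""
        hits = [i for i, v in self.feature_index.items()
                if cosine(query_vector, v) >= threshold]
        return [self.original_images[i] for i in hits]


def route(relevance: float, threshold: float = 0.5) -> str:
    """Pick a branch from the image-text relevance (the threshold is an assumption)."""
    return "multimodal" if relevance >= threshold else "text"


def answer(query: str, frames: List[bytes], relevance: float,
           multimodal_model: Callable[[str, List[bytes]], str],
           text_model: Callable[[str], str],
           threshold: float = 0.5) -> str:
    """Dispatch the query to the multi-modal or the text question-answering model."""
    if route(relevance, threshold) == "multimodal":
        return multimodal_model(query, frames)
    return text_model(query)
```

In this sketch the feature index and the original images share the same integer index, so a similarity search only touches the compact vectors and the full-size images are read only for the frames that match.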