Online memory and feature aggregation for long video understanding in multimodal large language models

The online memory mechanism with internal memories and feature aggregation addresses the limitations of multimodal LLMs in handling long video sequences, enabling efficient processing and real-time inference by reducing GPU memory and computational demands.

WO2026129128A1 · PCT designated stage · Publication Date: 2026-06-25 · INTEL CORP +2

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: INTEL CORP
Filing Date: 2024-12-17
Publication Date: 2026-06-25

Application Information

Patent Timeline

17 Dec 2024

Application

25 Jun 2026

Publication

WO2026129128A1

IPC: G06N3/0455

AI Tagging

Application Domain

Biological models

Technology Topics

Internal memory, Linguistic model

Similar Technology Patents

Golf assistance device
DE102025143479A1Golfing accessoriesMulti bandInternal memory
Video detection model training method, video anomaly detection method, and electronic device
CN122067051BInternal memoryAnomaly detection
Eight-channel dram
WO2026136046A1Memory adressing/allocation/relocation Digital storageInternal memorySoftware engineering
Timing controller storing log data and display device including the same, and method of operating timing controller
US20260179525A1Static indicating devices Communication interfaceInternal memory
A memory leak detection processing method and device for a storage system
CN117743005BFault responseInternal memoryTerm memory

AI Technical Summary

Technical Problem

Multimodal large language models struggle with long video sequences due to limited context length and high GPU memory and computational demands, particularly in transformer-based neural networks, whose memory and processing requirements grow quadratically as context length increases.

Method used

Implement an online memory mechanism with internal memories and feature aggregation using online kernels and dilated convolutions to capture long-term contextual information efficiently, reducing GPU memory costs and enabling processing of longer video sequences.

Benefits of technology

The solution allows multimodal LLMs to handle video inputs many times longer than existing models while maintaining similar GPU memory costs, supporting real-time inference on streaming inputs and capturing long-contextual dependencies effectively.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

Multimodal large language models can be used to understand information in different modalities, such as video and text. Visual encoders are implemented to produce input tokens to the multimodal large language model based on the video, and an input encoder can produce input tokens to the multimodal large language model based on the text. When the length of the video grows, such systems can demand significantly more memory and computational resources. To address this, a visual encoder is modified to include internal memories and a feature memory to store the features generated by the visual encoder. In addition, the visual encoder includes online kernels having different contextual ranges to learn long-context features. An online feature aggregator is implemented to aggregate windows of features stored in the feature memory. In some cases, the online feature aggregator can take tokens produced by the further input encoder into account.

Description

ONLINE MEMORY AND FEATURE AGGREGATION FOR LONG VIDEO UNDERSTANDING IN MULTIMODAL LARGE LANGUAGE MODELS

Background

[0001] Deep neural networks (DNNs) are a type of machine learning model used extensively for a variety of artificial intelligence applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost, especially during training or learning. DNNs have extremely high computing demands as there can be a large number of operations as well as a large amount of data to read and write.

Brief Description of the Drawings

[0002] Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

[0003] FIG. 1 depicts a multimodal large language model having input encoders for different modalities and a large language model to process tokens generated by the input encoders, according to some embodiments of the disclosure.

[0004] FIG. 2 depicts a multimodal large language model having input encoders for different modalities and a large language model to process tokens generated by the input encoders, according to some embodiments of the disclosure.

[0005] FIG. 3 depicts a multimodal large language model having input encoders for different modalities and a large language model to process tokens generated by the input encoders, according to some embodiments of the disclosure.

[0006] FIG. 4 depicts a visual encoder having one or more internal memories, a feature memory, a memory initialization process, and a memory update process, according to some embodiments of the disclosure.

[0007] FIG. 5 depicts a feature memory, an online feature aggregator, a further input encoder, and a large language model, according to some embodiments of the disclosure.

[0008] FIG. 6 depicts a flowchart illustrating a method for generating tokens for inputs with long-context, according to some embodiments of the disclosure.

[0009] FIG. 7 is a block diagram of an exemplary computing device, according to some embodiments of the disclosure.

Detailed Description

Overview

[0010] A large language model (LLM) is an advanced artificial intelligence system trained on massive amounts of text data to understand, generate, and process human language. These models, typically built using DNNs like transformers, learn complex patterns, grammar, context, and semantic relationships from diverse text sources including books, websites, articles, and databases. By processing billions of parameters, LLMs can perform a wide range of language tasks such as text generation, translation, summarization, question-answering, and even complex reasoning. Unlike other natural language processing models, LLMs can understand nuanced context, generate human-like text, and adapt to different writing styles and domains. LLMs often utilize massive computational resources. LLMs can provide sophisticated language capabilities, engaging in detailed, contextually appropriate conversations and assisting with various intellectual and creative tasks.

[0011] Multimodal LLMs, a variation of an LLM, can be used to understand information in different modalities, such as text, images, video, audio, sensor data, signals, etc. For simplicity, various examples herein are described where a multimodal LLM processes and understands information in video and text. Examples of multimodal LLMs include Video-ChatGPT, VideoLLaMA, and VideoLLaVA. Video-ChatGPT is a vision-language model that can understand and engage in conversations about video content, analyzing visual details, actions, and context within video clips. VideoLLaMA is an open-source multimodal model designed to process and understand video data, enabling tasks like video question-answering, description, and analysis. VideoLLaVA is a large language and vision assistant model that extends the capabilities of LLMs to video understanding, allowing for more advanced reasoning and conversational interactions about video content.

[0012] Multimodal LLMs build on the strengths of existing LLMs, such as LLaMA and Vicuna, and combine them with input encoders to understand the relationship between images and language. A multimodal LLM may include (separate) input encoders to generate input tokens based on respective inputs having different modalities. Herein, input encoders are also referred to as feature extraction networks that can extract features from the input and generate the input tokens. Visual encoders can be used as an input encoder or a feature extraction network for generating input tokens to the multimodal LLM based on input video. Visual encoders may include vision transformers (ViTs) and / or convolutional neural networks. A language tokenizer can be used as a further input encoder or a further feature extraction network for generating input tokens to the multimodal LLM based on input text. The multimodal LLM further includes an LLM that processes the input tokens generated by the input encoders or feature extraction networks. The LLM, implementing transformer-based processing, may serve as a backbone that can offer complex reasoning across different input types and enable the multimodal LLM to generate contextually rich responses that integrate information from the different modalities. The multimodal LLM may further include an output generation layer to produce outputs such as text, classifications, or other task-specific responses, drawing on the understanding formed by the LLM that is processing the input tokens produced by the input encoders / feature extraction networks.

[0013] Multimodal LLMs struggle with long video sequences. Some issues relate to the limited context length and high demand on graphics processing unit (GPU) memory. These limitations are due to the transformer architecture's self-attention mechanism, which involves comparing each token to every other token. This quadratic computational complexity means that as the context length increases, the memory and processing requirements grow quadratically, making it computationally intensive to handle very long-context inputs. When the length of the input video grows, such systems can demand significantly more memory and computational resources. Transformer-based neural networks can be found in the visual encoder, which can demand high GPU memory and compute resources when producing input tokens for long video sequences, and in the backbone LLM, which can demand high GPU memory and compute resources when processing a large number of input tokens produced from the long video sequences.
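
To make the scaling concrete, the following minimal sketch counts the entries of a single self-attention score matrix at a few context lengths. The function names and the 2-byte (half-precision) entry size are illustrative assumptions and do not come from the disclosure; real implementations may avoid materializing the full matrix.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Pairwise scores one attention head computes for one sequence."""
    return num_tokens * num_tokens


def attention_score_bytes(num_tokens: int, bytes_per_entry: int = 2) -> int:
    """Memory for one head's score matrix (assumed 2 bytes per entry)."""
    return attention_score_entries(num_tokens) * bytes_per_entry


# Doubling the context length quadruples the number of scores.
for n in (2048, 4096, 8192):
    print(n, attention_score_entries(n), attention_score_bytes(n))
```

Doubling the number of tokens multiplies the score count by four, which is the quadratic growth described in the paragraph above.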

[0014] For example, LLaMA processes up to 2048 tokens by default. Visual encoders, such as a ViT used to produce input tokens based on video, can be trained to handle 32-256 tokens per image. For long video input, the visual tokens generated by the ViT would be too numerous to be fed directly into an LLM serving as a backbone of a multimodal LLM. As a result, the backbone LLM either processes tokens produced based on short video sequences only or incurs the cost of huge GPU memory when dealing with the large number of input tokens produced from a long video sequence input. In one system, longer visual token sequences, e.g., up to 1 million tokens, can result in high GPU memory requirements and extensive computation cost due to the quadratic scaling nature of the attention mechanism.

[0015] To address these issues relating to handling long video inputs, a visual encoder, such as a ViT, is modified to include internal memories and a feature memory to store the features generated by the visual encoder. The visual encoder, which acts as an input encoder for processing input video, can include successive machine learning blocks or layers, such as machine learning blocks or layers of a ViT transformer-encoder. Herein, the visual encoder may include a first machine learning block, followed by a second machine learning block, followed by a third machine learning block, …, and followed by a Kth machine learning block. One or more internal memories may be introduced at one or more outputs of the blocks or layers of the visual encoder to store outputs generated by the blocks or layers of the visual encoder. An internal memory can store outputs produced for different frames of the input video by a block in the visual encoder. An internal memory can be provided for one or more blocks in the visual encoder to capture the outputs produced by the one or more blocks. A feature memory can store (final) outputs produced by the visual encoder for the different frames of the input video. The online memory mechanism, including the internal memories and the feature memory, allows the visual encoder to support dynamic video input length, meaning that the visual encoder does not require fixed input video lengths, while keeping GPU memory costs low. The online memory mechanism having fixed-sized memories can be initialized during an initialization process and subsequently updated frame by frame without losing long-term contextual information. The online memory mechanism also means that GPU memory usage can be flexibly decided ahead of time, and the GPU memory usage would be determined by the memory length set for the fixed-sized memories, and not by video length.

[0016] In addition, the visual encoder includes online kernels having different contextual ranges to learn long-context features. The online kernels can serve as multi-scale temporal adapters for the visual encoder to allow the visual encoder to capture long-term contextual information. Specifically, an online kernel can be applied to process multiple outputs of a block of the visual encoder, where the multiple outputs have a certain contextual distance. The multiple outputs may be associated with time instances or frames, and the temporal distance of the multiple outputs can correspond to a contextual distance. The online kernel may include a dilated convolution kernel, or a kernel that can be applied to temporally dilated inputs. An online kernel may include trainable parameters or weights that can be used to operate on a plurality of inputs having a particular temporal receptive field. The online kernel operating on multiple inputs means that cross-frame interactions learned from training can be applied to produce outputs which capture information over a wide context. In some cases, the contextual distances may vary the temporal receptive field at different machine learning blocks or layers of the backbone network. For example, the contextual distance may increase to widen the temporal receptive field gradually and progressively as the kernel is applied deeper in the visual encoder, or as a block number increases.

[0017] An online feature aggregator is implemented to aggregate windows of features stored in the feature memory. The online feature aggregator addresses the GPU memory and compute usage issues by aggregating features generated by the visual encoder and stored in the feature memory and generating a consistent number of visual tokens that are progressively updated over time. The LLM serving as a backbone would no longer need to process a large number of input visual tokens (where short visual tokens for many video frames are collated into long visual tokens), thereby reducing GPU memory and compute costs in the backbone LLM. In particular, the online feature aggregator may operate on a sliding window of features stored in the feature memory and update the visual tokens based on one window of features in the feature memory at a time. The resulting visual tokens can still capture long-contextual dependencies and information across a long duration video, while keeping memory costs low by only operating on a window of features at a time and updating the visual tokens progressively based on one window of features at a time. In other words, the online feature aggregator can be memory efficient and can perform effective feature aggregation over a long duration video. In some embodiments, the online feature aggregator can be implemented based on transformer-based neural networks to effectively aggregate features contextually. The online feature aggregator can implement a self-attention mechanism based on query tokens stored in a query memory, and a cross-attention mechanism based on attention weights produced by the self-attention mechanism and a window of features in the feature memory. In some embodiments, key-value caching in the online feature aggregator can be implemented to reduce computational resources in attention mechanisms of the online feature aggregator.
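
The windowed, progressive update described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the disclosed implementation: the learned query, key, and value projections are replaced by identity mappings, the self-attention step over the query tokens and key-value caching are omitted, and the residual update rule and all names (`cross_attend`, `aggregate`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Each query token attends over one window of features.

    Assumption: keys and values are the raw features (no learned projections).
    """
    scale = math.sqrt(len(features[0]))
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        context = [sum(w * f[d] for w, f in zip(weights, features))
                   for d in range(len(q))]
        # Assumed residual update: keeps what earlier windows contributed.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def aggregate(feature_memory, queries, window, stride):
    """Slide over the feature memory, updating a fixed set of query tokens."""
    for start in range(0, len(feature_memory) - window + 1, stride):
        queries = cross_attend(queries, feature_memory[start:start + window])
    return queries


# 64 stored per-frame features of dimension 2, aggregated into 3 tokens.
feature_memory = [[math.sin(t), math.cos(t)] for t in range(64)]
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
visual_tokens = aggregate(feature_memory, queries, window=8, stride=8)
print(len(visual_tokens))
```

The number of output tokens equals the number of query tokens regardless of how many features are stored, and each step touches only one window, which is the property the paragraph relies on to bound memory use.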

[0018] In some cases, the online feature aggregator can take tokens produced by the further input encoder into account. Because the online feature aggregator is implemented efficiently (the online feature aggregator processes a window of features at a time along with at least a part of the query memory), fusion of multimodal information, such as fusing features generated by the visual encoder with input tokens generated by one or more other input encoders and / or feature extraction networks, can be performed as the feature aggregator operates on the windows of features stored in the feature memory. The fusion mechanism can be added to the online feature aggregator without incurring significant memory and computational resources. In some embodiments, the online feature aggregator can include a further cross-attention mechanism based on attention weights produced by the cross-attention mechanism and the input tokens produced by a further input encoder or a further feature extraction network.

[0019] By implementing one or more of these features, the resulting multimodal LLM can handle video inputs many times (e.g., 100 times) longer than models such as VideoLLaVA and VideoLLaMA (which require fixed video lengths of 8 frames or 16 frames), while maintaining similar GPU memory costs. The following table shows results from an experiment that compares one implementation versus VideoLLaVA and VideoLLaMA:

Table 1. Comparison of one implementation of a disclosed embodiment with VideoLLaVA and VideoLLaMA

[0020] Efficient implementation of a multimodal LLM having one or more of the features described herein can mean that the multimodal LLM can accept streaming inputs (e.g., live video or live sequence of frames) and perform inference almost in real-time.

[0021] While many examples herein refer to a multimodal LLM that processes long video input (e.g., a video sequence of video frames having thousands of frames or more) and text input, it is envisioned that the teachings can also apply to multimodal LLMs that process long duration input of other modalities or long sequences of frames of other modalities. Examples include a sequence of audio signal data frames, a sequence of sensor signal data frames, a sequence of point cloud frames, a sequence of measurement data frames, a sequence of time-series data frames, etc. It is also envisioned that the teachings can apply to multimodal LLMs that process several long duration inputs of various modalities. It is also envisioned that the teachings can apply to multimodal LLMs that process more than two inputs with different modalities.

Shortcomings of some methods for dealing with long input video sequences

[0022] FIG. 1 depicts multimodal LLM 100 having input encoders for different modalities and LLM 106 to process tokens generated by the input encoders, according to some embodiments of the disclosure. Multimodal LLM 100 may include a plurality of input encoders, such as an input encoder to process video input 180 and further input encoder to process language input 108. Video input 180 may include a sequence of frames, such as a video sequence of image frames. Language input 108 may include natural language text, e.g., “Can you describe what the video is about, where it was filmed, and what actions happened in the video? ” The input encoder to process video input 180 may produce input tokens, e.g., visual tokens 120, and the further input encoder to process language input 108 may produce input tokens, e.g., language tokens 124. The generated input tokens from the input encoders can be input into LLM 106, so that LLM 106 can extract information from the different modalities.

[0023] The input encoder to process video input 180 may produce input tokens, e.g., visual tokens 120, based on video input 180. The input encoder may include visual encoder 102, which may extract features based on one or more frames in video input 180. Visual encoder 102 may include a transformer-based neural network. Visual encoder 102 may include a convolutional neural network. Visual encoder 102 may include one or more other suitable neural networks or machine learning models to process video input 180.

[0024] The further feature extraction network to process language input 108 may produce input tokens, e.g., language tokens 124, based on language input 108. The further input encoder may include tokenizer 110. Tokenizer 110 may convert language input 108 into input tokens that LLM 106 can process by breaking down language input 108 into meaningful units. Tokenizer 110 can split language input 108 into tokens, where tokens can include words, sub-words, or character-level components. Each unique token can be mapped to a unique integer identifier in a vocabulary, allowing the components to be represented numerically. Tokenizer 110 can convert language input 108 into processable input tokens, or token sequence, having language tokens 124. Tokenizer 110 may facilitate input encoding and / or feature extraction by translating language input 108 into language tokens 124, by transforming language input 108 into token embeddings.
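
The mapping from text to integer identifiers can be illustrated with a toy word-level tokenizer. This is only a sketch: production tokenizers typically use learned sub-word vocabularies, and the helper names `build_vocab` and `tokenize` and the `<unk>` placeholder are assumptions made for the example.

```python
def build_vocab(corpus):
    """Assign each distinct lowercase word a unique integer identifier."""
    vocab = {"<unk>": 0}  # identifier reserved for words not in the vocabulary
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def tokenize(text, vocab):
    """Split on whitespace and look up each word's identifier."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


vocab = build_vocab(["describe what the video is about"])
ids = tokenize("What is the video about", vocab)
print(ids)
```

Each identifier would then index a row of an embedding table to obtain the token embeddings consumed by the LLM.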

[0025] There are two groups of techniques to deal with long sequence input: optimizing attention computation and compressing token representation. In optimizing attention computation, the computational efficiency of attention mechanisms can be optimized to operate with linear complexity, or dynamic token attention mechanism that utilizes multiple experts can be implemented to reduce the overall complexity of the attention process. For example, visual encoder 102 can include one or more techniques for optimizing the attention computation. In compressing token representation, the length of input tokens, e.g., video sequences, can be condensed. For example, 30 minutes of video sampled at standard rates may result in half a million tokens, which is more than what state-of-the-art LLM architectures using optimized attention algorithms can process. Visual inputs can have redundancy and do not require the entirety of visual information to be processed. Compressing token representation can remove the redundancy and extract compact visual tokens. For example, feature aggregation 104 can be added to the output of visual encoder 102 to make visual tokens 120 more compact or compressed. In some applications, both techniques can be used together.
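
The order of magnitude quoted above can be reproduced with simple arithmetic. The sampling rate of 1 frame per second and the 256 tokens per frame used here are assumptions chosen for illustration (256 is the upper end of the 32-256 range mentioned earlier); the disclosure does not state which rates it has in mind.

```python
def visual_token_count(duration_s: float, frames_per_s: float,
                       tokens_per_frame: int) -> int:
    """Total visual tokens produced for a video of the given duration."""
    return int(duration_s * frames_per_s) * tokens_per_frame


# Assumed: 30 minutes, 1 sampled frame per second, 256 tokens per frame.
total = visual_token_count(30 * 60, 1.0, 256)
llm_context = 2048  # default LLaMA context length cited in the text
print(total, total // llm_context)
```

Under these assumptions the video yields 460,800 tokens, roughly half a million and 225 times a 2048-token context.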

[0026] Multimodal LLM 100 can implement sparse sampling to make visual tokens 120 more compact. For example, visual encoder 102 can implement temporal and / or spatial sampling of frames of video input 180 and process fewer input data. In another example, feature aggregation 104 can implement temporal or spatial sampling of features generated by visual encoder 102 and disregard some of the features. One multimodal LLM implementation may sample 8 frames from a video. While sparse sampling enhances efficiency, it often results in suboptimal performance due to the sparsity of information and the potential for missing important information.

[0027] In some cases, feature aggregation 104 may implement average pooling to combine features generated by visual encoder 102. Average pooling can result in many details being lost. Implementing importance sampling in feature aggregation 104 would still risk certain details being lost. In some cases, feature aggregation 104 can implement the Q-Former technique, which can aggregate features by performing cross-attention between features extracted from different frames. Unfortunately, computational complexity can increase significantly with the increase of video length, which means that feature aggregation 104 implementing the Q-Former technique is not suitable for processing extremely long video inputs.
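
The loss of detail under average pooling can be seen in a small example: two feature sequences with very different temporal content collapse to the same pooled vector. The data and the helper name `average_pool` are made up for illustration.

```python
def average_pool(features):
    """Average a list of equal-length feature vectors element-wise."""
    count = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / count for d in range(dim)]


# A steady signal and a single brief event have the same average.
steady = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
spiky = [[0.0, 0.0], [0.0, 0.0], [4.0, 4.0], [0.0, 0.0]]
print(average_pool(steady), average_pool(spiky))
```

After pooling, a downstream model cannot tell whether something happened in one frame or was present throughout.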

[0028] FIG. 2 depicts multimodal LLM 200 having input encoders for different modalities and LLM 106 to process tokens generated by the input encoders, according to some embodiments of the disclosure. Visual encoder 102 may process sliding windows of video frames (e.g., a window of 8 video frames at a time, sliding by a certain sliding stride amount) to extract respective features from the sliding windows of video frames. Sliding window of frames 222 may be processed by visual encoder 102 to generate a set of features to be stored in feature memory 202. A further sliding window of frames 224 may be processed by visual encoder 102 to generate a further set of features to be stored in feature memory 202. Feature memory 202 is included at the output of visual encoder 102 to maintain a history of the respective features generated by visual encoder 102. Feature memory 202 can update a fixed length set of features based on the newly generated features as a form of feature aggregation. Feature memory 202 can learn a compact representation for visual tokens 120 based on historical features generated by visual encoder 102. Even though feature memory 202 can compress visual tokens 120, visual encoder 102 remains memory and computationally intensive and struggles to handle longer input sequences because resource requirements would still increase when the length of the window of video frames increases.
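
The sliding-window traversal of frames can be sketched as follows. The window of 8 frames follows the example in the text; the stride of 4 and the total of 20 frames are assumptions, since the text only refers to "a certain sliding stride amount".

```python
def sliding_windows(num_frames: int, window: int, stride: int):
    """Return the frame indices covered by each full window."""
    return [list(range(start, start + window))
            for start in range(0, num_frames - window + 1, stride)]


windows = sliding_windows(num_frames=20, window=8, stride=4)
for frames in windows:
    print(frames[0], frames[-1])  # first and last frame index of each window
```

Each window is encoded as a unit, so the per-step cost of the encoder still depends on the window length.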

[0029] Moreover, multimodal LLM 100 of FIG. 1 and multimodal LLM 200 of FIG. 2 do not implement fusion of modalities before the input tokens (visual tokens 120 and language tokens 124) are finally input into LLM 106.

Exemplary multimodal LLM for processing long duration inputs and capturing long-contextual information without incurring significant penalties in memory and computation costs

[0030] In view of the issues mentioned above, the architecture illustrated for multimodal LLM 200 of FIG. 2 can be augmented and modified. An online memory mechanism having internal memories is inserted at the output of blocks within the visual encoder. Also, online kernels performing temporally dilated convolutions are added as multi-scale temporal adapters to enable the visual encoder to better capture long-term contextual information. An online feature aggregator is added to better perform feature aggregation, compress the visual tokens before the visual tokens are fed into the backbone LLM, and effectively extract long-contextual information from the feature memory. The online feature aggregator can efficiently and effectively implement fusion with input tokens produced by one or more other input encoders, such as the language tokens, to better capture cross-modality relationships, such as the visual-language relationship.

[0031] FIG. 3 depicts multimodal LLM 300 having input encoders for different modalities and LLM 106 to process tokens generated by the input encoders, according to some embodiments of the disclosure. Multimodal LLM 300 includes an input encoder to process video input 180. Video input 180 is an example of a sequence of frames. Frames have respective frame indices indicating timing and / or position within the sequence of frames. The input encoder includes visual encoder 302, feature memory 370, and online feature aggregator 376. The input encoder produces visual tokens 388. Multimodal LLM 300 includes a further input encoder to process language input 108. The further input encoder includes tokenizer 110.

[0032] Visual encoder 302 may include a ViT. A ViT is an adaptation of the transformer-based neural network architecture, which is designed for sequential data processing. A ViT adapts the transformer-based neural network to perform computer vision tasks on sequential video frames. A ViT may include a patch embedding component, which can divide an input image into fixed-size patches (e.g., 16x16 pixels). Each patch may be flattened and linearly projected to a lower-dimensional space referred to as a patch embedding. The ViT may include a position embedding component, which may add a learnable position embedding to each patch embedding to retain spatial information. A learnable embedding (e.g., class token) can be prepended to the sequence of patch embeddings. This token's final state can be used for classification tasks. The ViT may include a transformer-encoder, which includes multiple transformer-based encoder machine learning blocks or layers. Herein, the blocks or layers are depicted as ViT blocks. As depicted, visual encoder 302 includes ViT block 310, ViT block 320, and so on. ViT block 310 may have a block number that is equal to 1 (B=1). ViT block 320 may have a block number that is equal to 2 (B=2). The ViT blocks may be provided as layers and in series. The block number B increases with depth in visual encoder 302. A ViT block may include a multi-head self-attention (MSA) block and a feed-forward network (FFN) block. An MSA block allows each patch to attend to other patches and capture relationships between patches. The MSA block can include multiple attention heads operating in parallel, where each attention head can compute query, key and value projections. The FFN block may follow the MSA block and can include a multi-layer perceptron having, e.g., two linear transformations and an activation in between. In some implementations, a ViT block or layer may include layer normalization (LN), MSA block, a residual connection that adds the input to the MSA output, further LN, FFN block, and a further residual connection that adds the input to the FFN output.
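
The patch-splitting step can be illustrated with a small pure-Python sketch. The 224x224 single-channel image is an assumed size chosen so that 16x16 patches divide it evenly, and the helper name `patchify` is hypothetical; the linear projection, position embedding, and class token are omitted.

```python
def patchify(image, patch: int):
    """Split a 2-D image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch row by row into one vector.
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches


# Assumed 224x224 image whose pixel value encodes its own position.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
patches = patchify(image, 16)
print(len(patches), len(patches[0]))
```

With these sizes there are 14 x 14 = 196 patches of 256 values each; a ViT would project each one to a patch embedding and prepend the class token, giving a sequence of 197 tokens.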

[0033] Visual encoder 302 further includes one or more internal memories, e.g., memory 312, memory 322, and so on. The internal memories may be fixed length memories, or fixed length memory banks, that are allocated to store historical outputs generated by a block in visual encoder 302. The internal memories may be first-in-first-out memories. As an illustration, the internal memories may have a length or size R for storing R historical outputs generated by the block to which an internal memory is added. The length or size R may correspond to a number of outputs produced for a number of frames. The total size of an internal memory may be equal to R x vector length of an output generated by a block. An internal memory may be allocated to each block or a subset of the blocks in visual encoder 302 (e.g., every other block in visual encoder 302). The length or size R for the internal memories may differ depending on the depth at which the memory is used. Using internal memories to store a fixed number of latest historical outputs allows visual encoder 302 to maintain contextual information across frames yet process the frames of video input 180 one at a time. A more detailed example of the memories is illustrated in FIG. 4.
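
A fixed-length first-in-first-out memory of this kind can be sketched with Python's `collections.deque`. The class name `InternalMemory`, its methods, and the choice of R = 4 are illustrative assumptions; the memory initialization process referenced for FIG. 4 is not modeled here.

```python
from collections import deque


class InternalMemory:
    """Holds the R most recent outputs of one encoder block (FIFO)."""

    def __init__(self, length: int):
        # With maxlen set, appending to a full deque drops the oldest entry.
        self.slots = deque(maxlen=length)

    def push(self, output):
        self.slots.append(output)

    def latest(self, frames_back: int = 0):
        """Return the output stored `frames_back` frames before the newest."""
        return self.slots[-1 - frames_back]

    def __len__(self):
        return len(self.slots)


memory = InternalMemory(length=4)  # assumed R = 4
for frame_index in range(10):
    memory.push([float(frame_index)] * 3)  # stand-in for a block output
print(len(memory), memory.latest(), memory.latest(3))
```

After ten frames the memory still holds only four entries (frames 6 through 9), so its footprint is R times the output vector length regardless of how many frames have been seen.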

[0034] Visual encoder 302 further includes one or more online kernels, e.g., online kernel 314, online kernel 324, and so on. The online kernels may be inserted between the ViT blocks to process one or more parts or elements of a memory. The online kernels may have associated contextual distances and operate on multiple outputs, stored in the memory, that were produced by a ViT block for different frames of video input 180 or time instances. The online kernels may have associated temporal receptive fields. The online kernels may have associated temporal dilation rates, because the inputs to the online kernel may be temporally dilated. A temporal dilation rate of an online kernel may correspond to the frame distance or spacing between the multiple outputs produced by the ViT block. For example, online kernel 314 may be applied to one or more parts of memory 312. Online kernel 324 may be applied to one or more parts of memory 322. The online kernels can apply temporal dilated convolutions after the ViT blocks, with varying dilation rates or contextual distances to control the temporal scope or temporal receptive field at each block or layer. The dilation rates can be varied using a sparse temporal context approach, where the dilation rates may incrementally increase as the block number increases (or the online kernel is deeper in visual encoder 302). The online kernels can serve as a multi-scale temporal adapter that captures long-range cross-frame features by combining and / or transforming the multiple outputs produced by a ViT block for different frames of video input 180. A more detailed example of the online kernels is illustrated in FIG. 4.
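
One way to picture a temporally dilated kernel applied to stored block outputs is the sketch below. It is an illustration under assumptions rather than the disclosed kernel: each tap uses a single scalar weight (a trained kernel would typically have per-channel or matrix weights), taps that reach before the start of the history are clamped to the oldest entry, and the function names are hypothetical.

```python
def temporal_span(num_taps: int, dilation: int) -> int:
    """Number of frames covered by a kernel with the given taps and dilation."""
    return (num_taps - 1) * dilation + 1


def dilated_temporal_kernel(history, weights, dilation: int):
    """Combine stored outputs spaced `dilation` frames apart.

    `history` is ordered oldest to newest; tap 0 reads the newest entry.
    """
    newest = len(history) - 1
    dim = len(history[0])
    out = [0.0] * dim
    for tap, weight in enumerate(weights):
        index = max(newest - tap * dilation, 0)  # assumed clamping at the start
        for d in range(dim):
            out[d] += weight * history[index][d]
    return out


# Eight stored one-dimensional outputs for frames 0..7.
history = [[float(t)] for t in range(8)]
out = dilated_temporal_kernel(history, [0.5, 0.25, 0.25], dilation=2)
print(out, temporal_span(3, 2))
```

With three taps and a dilation of 2, the kernel reads frames 7, 5, and 3 and covers a span of five frames; increasing the dilation at deeper blocks widens that span without adding taps.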

[0035] An online kernel can include one or more kernels having trainable or learnable parameters or weights. In some cases, an online kernel may include a lightweight neural network having trainable or learnable parameters (e.g., two convolutional neural network layers) . As used herein, a kernel may include one or more operations that can be applied to the kernel’s inputs to produce an output. The operations may be applied according to trainable or learnable parameters. An example of an operation is a filter with one or more filter parameters. Another example of an operation is a convolution operation with one or more convolution matrix parameters.

[0036] As part of the online memory mechanism, feature memory 370 may include a fixed length memory, or a fixed length memory bank, that is allocated to store historical outputs generated by visual encoder 302 (e.g., a last block or layer of visual encoder 302, features extracted by visual encoder 302 for a frame in video input 180). Feature memory 370 may be a first-in-first-out memory. As an illustration, feature memory 370 may have a length or size F for storing F historical outputs generated by visual encoder 302. The length or size F may correspond to a number of outputs produced by visual encoder 302 for a number of frames. The total size of feature memory 370 may be equal to F x vector length of an output generated by visual encoder 302. Using feature memory 370 to store a fixed number of latest historical outputs of visual encoder 302 allows online feature aggregator 376 of multimodal LLM 300 to aggregate features extracted by visual encoder 302 over a long context while still performing feature aggregation one set of features at a time. A more detailed example of feature memory 370 is illustrated in FIG. 5.

[0037] Visual encoder 302 can process each frame of video input 180, one frame at a time, through successive blocks of visual encoder 302. ViT block 310 may process a frame of a sequence of frames to produce an output. The output generated by ViT block 310 may be stored in memory 312. Online kernel 314 may be applied to one or more parts of memory 312 (which includes one or more historical outputs of ViT block 310 that are stored in memory 312) to generate an output. ViT block 320 may process the output from online kernel 314 and generate an output. The output from ViT block 320 may be stored in memory 322. Online kernel 324 may be applied to one or more parts of memory 322 (which includes one or more historical outputs of ViT block 320 that are stored in memory 322) to generate an output. The processing for the frame may continue similarly for additional blocks/layers. Visual encoder 302 may generate one or more features based on the output of online kernel 324. The one or more features generated by visual encoder 302 may be stored in feature memory 370. Visual encoder 302 may repeat the processing through the successive blocks/layers of visual encoder 302 and online kernels, while updating the internal memories, for each frame of the sequence of frames and produce one or more features for each frame. In other words, visual encoder 302 may repeat the processing for a further frame in the sequence of frames and generate one or more further features for the further frame. The one or more features produced for each frame may be stored in feature memory 370 (for a period of time).
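The per-frame flow above (block, memory, kernel, next block, and finally the feature memory) can be sketched as follows. This is a toy illustration under stated assumptions: `Stage`, `encode_frame`, `toy_block`, and `toy_kernel` are hypothetical names, a `tanh` stands in for a ViT block, and a mean over the memory stands in for a learned online kernel.

```python
from collections import deque

import numpy as np


class Stage:
    """A stand-in ViT block followed by its internal memory and online kernel."""

    def __init__(self, block_fn, kernel_fn, memory_length):
        self.block_fn = block_fn
        self.kernel_fn = kernel_fn
        self.memory = deque(maxlen=memory_length)  # FIFO internal memory

    def process(self, x):
        self.memory.append(self.block_fn(x))       # store the block output
        return self.kernel_fn(list(self.memory))   # kernel reads the memory


def encode_frame(frame, stages, feature_memory):
    """Pass one frame through every stage and store the resulting features."""
    x = frame
    for stage in stages:
        x = stage.process(x)
    feature_memory.append(x)
    return x


def toy_block(x):
    return np.tanh(x)                 # placeholder for a ViT block (assumption)


def toy_kernel(history):
    return np.mean(history, axis=0)   # placeholder for a learned kernel (assumption)


stages = [Stage(toy_block, toy_kernel, memory_length=3),
          Stage(toy_block, toy_kernel, memory_length=5)]
feature_memory = deque(maxlen=4)      # feature memory of size F = 4

rng = np.random.default_rng(0)
for frame in rng.normal(size=(10, 8)):   # 10 frames of 8-dimensional toy data
    features = encode_frame(frame, stages, feature_memory)
```

Only one frame is in flight at any time, and every memory stays at its fixed length no matter how many frames are streamed.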

[0038] Online feature aggregator 376 may process the one or more features generated by visual encoder 302, which are stored in feature memory 370. Online feature aggregator 376 may include query memory 366 and feature aggregator 380. Query memory 366 may include a fixed length memory, or a fixed length memory bank, that is allocated to store historical outputs, or a latest aggregated output, generated by feature aggregator 380. In some cases, query memory 366 may be a first-in-first-out memory. As an illustration, query memory 366 may have a length or size Q for storing Q historical outputs generated by feature aggregator 380. Feature aggregator 380 may process and perform feature aggregation on a sliding window of features or a subset of features stored in feature memory 370, one sliding window or one subset at a time. Query memory 366 may be updated progressively or iteratively as feature aggregator 380 performs feature aggregation, one sliding window or one subset of features at a time. Tokens produced by feature aggregator 380 may be added to query memory 366 while one or more old tokens in query memory 366 may be discarded. Query memory 366 may be initialized randomly, or with a random set of values, and may be updated progressively at each iteration of feature aggregator 380.

[0039] Feature aggregator 380 may process one or more parts in feature memory 370. An output generated by feature aggregator 380 may be stored in query memory 366. Feature aggregator 380 may then process at least a part of query memory 366 and one or more further parts in feature memory 370 to generate a further output. The further output generated by feature aggregator 380 may be stored in query memory 366. After storing the further output in query memory 366, feature aggregator 380 may process at least a further part of query memory 366 and one or more yet further parts of feature memory 370 to generate a yet further output. The yet further output of feature aggregator 380 may be stored in query memory 366.

[0040] Using query memory 366 to store latest historical outputs of feature aggregator 380 and the progressive / iterative nature of feature aggregator 380 allow online feature aggregator 376 to aggregate features extracted by visual encoder 302 over a long-context but yet allow online feature aggregator 376 to perform feature aggregation one set of features at a time and maintain a fixed number of visual tokens to ensure compact representation of information over the long-context. A more detailed example of online feature aggregator 376 is illustrated in FIG. 5.

[0041] In some embodiments, feature aggregator 380 may process one or more tokens extracted by a further input encoder of multimodal LLM 300 to implement fusion of modalities. For example, feature aggregator 380 may receive one or more tokens generated by tokenizer 110, e.g., language tokens 124, as input in addition to a part of query memory 366 and one or more parts in feature memory 370. Feature aggregator 380 can take information from another modality into account. A more detailed example of online feature aggregator 376 performing fusion of modalities is illustrated in FIG. 5.

[0042] The input encoder having visual encoder 302, feature memory 370 and online feature aggregator 376 may produce visual tokens 388 based on video input 180. The further input encoder having tokenizer 110 may produce language tokens 124 based on language input 108. LLM 106 may process one or more tokens extracted by the input encoder (e.g., visual tokens 388) and one or more further tokens extracted by the further input encoder (e.g., language tokens 124). LLM 106 may include a machine learning model, such as a transformer-based neural network, to process a sequence of input tokens and generate an output. Because visual tokens 388 have a limited number of tokens, the memory and computing resource demands of LLM 106 can be made more manageable.

Exemplary input encoder having an online memory mechanism

[0043] The online memory mechanism includes a multi-scale online latent memory that captures long-range cross-frame features. To optimize memory usage and computational efficiency, a sparse temporal context approach is used. This involves the integration of an internal memory and an online kernel, e.g., a dilated convolution layer, after each ViT block in the visual encoder. In some cases, the internal memory and the online kernel are added to a subset of the ViT blocks in the visual encoder. The online kernels have varying dilation rates to control the temporal scope. As the block number increases, or as the block gets deeper in visual encoder 302, the context range can widen due to incrementally increased dilation rates. The dilation rate for each layer is a hyperparameter that can be fine-tuned based on one or more metrics. The outputs of the online kernel can capture features at different time scales and can be processed by a further ViT block. Historical outputs can be stored in a fixed length memory bank, such as the internal memories described herein. The online kernels, e.g., an online dilated convolution, can facilitate the efficient learning of long-context visual features. The online kernels with varying dilation rates are complemented by internal memories, which can store historical outputs of a respective ViT block so that the outputs having a particular contextual range can be processed by the online kernels. Moreover, a feature memory can serve as the output repository for the visual encoder.

[0044] An online kernel can receive and process inputs associated with a current frame index and one or more further inputs associated with one or more further frame indices. The frame indices of the inputs of the online kernel, such as the difference between the smallest frame index and the largest frame index, may define the contextual distance or temporal contextual range of the online kernel. The different frame indices may define a temporal receptive field of the online kernel. The online kernel may have a specific contextual distance or temporal contextual range, which may be defined by the difference in frame indices associated with the inputs. The contextual distance or temporal contextual range may be based on a dilation rate of the online kernel. The dilation rate of the online kernel may specify the distances or spacing of the frame indices. The spacing or distance between the inputs means that the online kernel can perform temporally dilated operations on its inputs. In some embodiments, a multi-scale temporal adapter performs temporally dilated operations, or operations on temporally dilated inputs according to a specific dilation rate or contextual distance. The operations may be performed based on parameters that may extract cross-frame interactions.
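To make the relationship between dilation rate and frame indices concrete, the sketch below lists which frame indices a causal kernel would read for a given kernel size and dilation rate. The function names `tap_indices` and `index_difference` are illustrative assumptions; the disclosure does not prescribe this exact indexing.

```python
def tap_indices(current_index, kernel_size, dilation_rate):
    """Frame indices read by a causal kernel, oldest first, current frame last."""
    return [current_index - k * dilation_rate
            for k in range(kernel_size - 1, -1, -1)]


def index_difference(indices):
    """Difference between the largest and the smallest frame index."""
    return max(indices) - min(indices)


taps = tap_indices(current_index=10, kernel_size=3, dilation_rate=2)
difference = index_difference(taps)
frames_spanned = difference + 1   # inclusive count of frame indices covered

print(taps, difference, frames_spanned)   # [6, 8, 10] 4 5
```

Here three inputs spaced two frames apart cover five consecutive frame indices. Whether the contextual distance is counted as the index difference or as the inclusive number of frames spanned is a convention, so both values are computed.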

[0045] In some embodiments, an online kernel may include a dilated convolution layer, or a kernel that can be applied to temporally dilated inputs. The dilation rates of the online kernels may vary or incrementally increase as the block number increases. One illustration is depicted in FIG. 4. As the dilation rates increase, the contextual distance or temporal contextual range of the online kernel also increases. The maximum temporal contextual distance set for the online kernels may correspond to a number of frames or a temporal receptive field over which cross-frame interaction is extracted from the input sequence. The maximum temporal contextual distance may be a hyperparameter, which may be set to optimize for one or more metrics. In some embodiments, the contextual distances or temporal contextual ranges of online kernels may increase linearly. In some embodiments, the contextual distances or temporal contextual ranges of online kernels may increase exponentially. The dilation rates (and thus the contextual distances or temporal contextual ranges) of the online kernels may be hyperparameters. In some cases, the dilation rates, and the rates at which the dilation rates increase, may be set to optimize for one or more metrics.

[0046] The kernel sizes (e.g., the number of inputs processed by the kernel) of the online kernels may be hyperparameters. A larger kernel size may enable an online kernel to extract more cross-frame interaction, but at the cost of higher complexity (e.g., more parameters and operations) for the visual encoder 302. In some cases, the kernel sizes may be set to optimize for one or more metrics. The kernel may have a size S, e.g., S = 2, 3, 4, 5, 6, 7 or 8. In some cases, kernel sizes are the same for each B, or the block or layer number to which the online kernel is added. In some cases, kernel sizes may increase as block number increases (or as the temporal receptive field widens) . If an input to an online kernel is unavailable (e.g., at or near the beginning of the input sequence) , the input may be padded with one or more zeros, and the online kernel may be applied further to an input padded with one or more zeros.

[0047] Given a streaming video input $X = x_0, \ldots, x_T$, an online kernel (e.g., a dilated convolution layer) is a sequence model having an output represented by $y_t = g(x_t, M_t)$, where $M_t$ represents the internal memory at the output of the $l$-th ViT block having latent features extracted at the $l$-th layer. The online kernel function $g(\cdot)$ represents a stack of 1D convolutions with filters. A feature output $y_t$ of the visual encoder having $L$ ViT blocks can be represented by:

[0048] Although each internal memory has a fixed capacity, specifically a length denoted by R, the temporal contextual range that the internal memory covers can extend well beyond R. This is because the latent features at a particular layer represent an aggregation of historical features of the current block and one or more earlier blocks, with the contextual breadth determined by the contextual ranges of the current block and the one or more earlier blocks. Suppose the depth of the ViT blocks is $L$ (B=L) and the dilation range is set to $2^l$ at each layer $l$; the contextual range of the online kernel inserted at the output of the ViT block with block number B=L is then approximately $2^L + R$.

[0049] The update process is streamlined, involving the passage of a single frame through the visual encoder at a time. The design of the internal memories facilitates progressive computation of the dilated convolution, ensuring that at each time step, only operations pertaining to the current frame are executed. The dilated convolution is causal, meaning that the dilated convolution may operate on outputs produced for a current frame and one or more earlier / past frames. Applying a causal operation means that the online kernel can operate in an online manner, one frame at a time, and only depend on data already present in the internal memory. The internal memory is sized to ensure that the online kernel can access the internal memory for outputs of the ViT block having the particular contextual range or dilation rate.
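The progressive, causal computation described above can be illustrated with a small streaming kernel that reads only from its own fixed length memory. This is a sketch under assumptions: `OnlineDilatedKernel` and `offline_reference` are hypothetical names, each tap has a single scalar weight rather than a full convolution filter, and the memory is pre-filled with zeros so that early frames see zero-padded history. The streamed outputs are compared against a conventional causal dilated convolution computed over the whole sequence.

```python
from collections import deque

import numpy as np


class OnlineDilatedKernel:
    """Causal dilated kernel evaluated one frame at a time (illustrative only)."""

    def __init__(self, weights, dilation_rate, feature_dim):
        self.weights = np.asarray(weights, dtype=float)  # weights[0] is the oldest tap
        self.dilation = dilation_rate
        span = (len(self.weights) - 1) * dilation_rate + 1
        # Zero pre-fill: taps that precede the first frame read as zeros.
        self.memory = deque([np.zeros(feature_dim)] * span, maxlen=span)

    def step(self, block_output):
        self.memory.append(np.asarray(block_output, dtype=float))
        # Newest entry is at index -1; earlier taps are one dilation step apart.
        taps = [self.memory[-1 - k * self.dilation]
                for k in range(len(self.weights))]
        return sum(w * tap for w, tap in zip(self.weights, reversed(taps)))


def offline_reference(frames, weights, dilation_rate):
    """Same causal dilated convolution computed over the full sequence."""
    frames = np.asarray(frames, dtype=float)
    size = len(weights)
    outputs = np.zeros_like(frames)
    for t in range(len(frames)):
        for k in range(size):
            source = t - (size - 1 - k) * dilation_rate
            if source >= 0:                      # earlier indices are zero padding
                outputs[t] += weights[k] * frames[source]
    return outputs


rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 4))
weights = [0.2, 0.3, 0.5]

kernel = OnlineDilatedKernel(weights, dilation_rate=2, feature_dim=4)
streamed = np.stack([kernel.step(frame) for frame in frames])
reference = offline_reference(frames, weights, dilation_rate=2)
```

Because the kernel is causal, the streamed result for every frame matches the full-sequence result, while the streaming version holds only as many entries as the kernel's span requires.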

[0050] FIG. 4 depicts visual encoder 302 having one or more internal memories, a feature memory 370, a memory initialization process, and a memory update process, according to some embodiments of the disclosure. The memory initialization process is implemented to fill up the internal memories, one block / layer at a time, and for a batch of frames at different frame indices. The memory update process is implemented to update the internal memories, frame by frame (e.g., one frame index at a time) .

[0051] In the memory initialization process, a batch of frames are processed by ViT block 310 to produce respective outputs, and the respective outputs may be stored in memory 312. As discussed with FIG. 3, ViT block 310 may process a frame of a sequence of frames to produce an output. The output generated by ViT block 310 may be stored in memory 312. ViT block 310 at block number B=1 may perform processing of additional frames in the batch of frames and produce additional outputs for the additional frames which are stored in memory 312. Memory 312 may store historical outputs of ViT block 310 produced by processing the batch of frames.

[0052] Once memory 312 has sufficient data, online kernel 314 may be applied to one or more parts of memory 312 (which includes one or more historical outputs of ViT block 310 that are stored in memory 312) to generate an output. Online kernel 314 may operate on two, three, or four or more outputs of ViT block 310 stored in memory 312, where the outputs have a certain contextual distance D. The contextual distance D may measure the temporal or frame index difference across the outputs processed by online kernel 314. The contextual distance D may be based on a dilation rate of online kernel 314 that specifies a spacing of frame indices between the inputs of online kernel 314. In some embodiments, the contextual range of online kernel 314 is 3 (D=3), where the three outputs span across three frame indices.

[0053] Continuing in the memory initialization process, ViT block 320 at block number B=2 may process the output from online kernel 314 and generate an output. The output from ViT block 320 may be stored in memory 322. ViT block 320 may process additional outputs generated by online kernel 314 for additional frame indices and produce additional outputs for the additional frame indices. The additional outputs are stored in memory 322. Memory 322 may store historical outputs of ViT block 320.

[0054] Once memory 322 has sufficient data, online kernel 324 may be applied to one or more parts of memory 322 (which includes one or more historical outputs of ViT block 320 that are stored in memory 322) to generate an output. Online kernel 324 may operate on two, three, or four or more outputs of ViT block 320 stored in memory 322, where the outputs have a certain contextual distance D. The contextual distance D may measure the temporal or frame index difference across the outputs processed by online kernel 324. The contextual distance D may be based on a dilation rate of online kernel 324 that specifies a spacing of frame indices between the inputs of online kernel 324. In some embodiments, the contextual range of online kernel 324 is 5 (D=5), where the three outputs span across five frame indices.

[0055] Continuing in the memory initialization process, ViT block 330 at block number B=3 may process the output from online kernel 324 and generate an output. The output from ViT block 330 may be stored in memory 332. ViT block 330 may process additional outputs generated by online kernel 324 for additional frame indices and produce additional outputs for the additional frame indices. The additional outputs are stored in memory 332. Memory 332 may store historical outputs of ViT block 330.

[0056] Once memory 332 has sufficient data, online kernel 334 may be applied to one or more parts of memory 332 (which includes one or more historical outputs of ViT block 330 that are stored in memory 332) to generate an output. Online kernel 334 may operate on two, three, or four or more outputs of ViT block 330 stored in memory 332, where the outputs have a certain contextual distance D. The contextual distance D may measure the temporal or frame index difference across the outputs processed by online kernel 334. The contextual distance D may be based on a dilation rate of online kernel 334 that specifies a spacing of frame indices between the inputs of online kernel 334. In some embodiments, the contextual range of online kernel 334 is 9 (D=9), where the three outputs span across nine frame indices.

[0057] Continuing in the memory initialization process, ViT block 340 at block number B=4 may process the output from online kernel 334 and generate an output. The output from ViT block 340 may be stored in a further internal memory. ViT block 340 may process additional outputs generated by online kernel 334 for additional frame indices and produce additional outputs for the additional frame indices. The additional outputs are stored in the further memory. The further memory may store historical outputs of ViT block 340.

[0058] Features produced by the last ViT block for a frame in the batch of frames may be stored in feature memory 370. Features produced by the last ViT block for additional frames in the batch of frames may be stored in feature memory 370. The memory initialization process may continue block by block or layer by layer until the internal memories and feature memory 370 are initialized.
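The layer-by-layer initialization over a batch of frames can be sketched as below. The function name `initialize_memories`, the scalar frame values, the `+1` stand-in for a ViT block, and the mean stand-in for an online kernel are all assumptions chosen to keep the arithmetic easy to follow; they are not taken from the disclosure.

```python
from collections import deque


def initialize_memories(batch, block_fns, kernel_fns, memory_lengths,
                        feature_memory_length):
    """Fill the internal memories one block at a time for a batch of frames."""
    memories = []
    layer_inputs = list(batch)
    for block_fn, kernel_fn, length in zip(block_fns, kernel_fns, memory_lengths):
        memory = deque(maxlen=length)
        kernel_outputs = []
        for x in layer_inputs:                 # every frame index in the batch
            memory.append(block_fn(x))         # store the block output
            kernel_outputs.append(kernel_fn(list(memory)))
        memories.append(memory)
        layer_inputs = kernel_outputs          # inputs to the next block
    # The outputs of the last stage initialize the feature memory.
    feature_memory = deque(layer_inputs, maxlen=feature_memory_length)
    return memories, feature_memory


def toy_block(x):
    return x + 1.0                             # placeholder for a ViT block


def toy_kernel(history):
    return sum(history) / len(history)         # placeholder for an online kernel


memories, feature_memory = initialize_memories(
    batch=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    block_fns=[toy_block, toy_block],
    kernel_fns=[toy_kernel, toy_kernel],
    memory_lengths=[2, 3],
    feature_memory_length=4,
)
```

After initialization each memory holds only its most recent entries, which is the state from which a frame-by-frame update process can continue.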

[0059] In some cases, the contextual range of an online kernel is $2^B$, where B is the block number of the ViT block directly after which the online kernel is inserted. In some cases, the parts of an internal memory on which an online kernel operates have a contextual distance. Therefore, the online kernels at different block numbers may operate on outputs having the same contextual distances or different contextual distances depending on the depth of the online kernel. In some cases, the contextual distances monotonically increase as the block number increases. In some cases, the contextual distances increase in an exponential manner as the block number increases. The contextual distance associated with online kernel 314 may be smaller than the contextual distance associated with online kernel 324. The contextual distance associated with online kernel 324 may be smaller than the contextual distance associated with online kernel 334.

[0060] In the memory update process after the memory initialization process, when a ViT block produces a new output for a new frame in the sequence of frames, the new output is stored in the internal memory of that ViT block, and the oldest output of the ViT block is discarded from the internal memory. The internal memory may act as a first-in-first-out memory. In other words, after the memory initialization process, the new output produced by ViT block 310 for a new frame may be stored in memory 312, and the oldest output of ViT block 310 in memory 312 may be discarded from memory 312. A new output produced by ViT block 320 for a new frame may be stored in memory 322, and the oldest output of ViT block 320 in memory 322 may be discarded from memory 322. A new output produced by ViT block 330 may be stored in memory 332, and the oldest output of ViT block 330 in memory 332 may be discarded from memory 332. The same update process may be applied to additional internal memories deeper in visual encoder 302. When the last ViT block produces a new output for a new frame in the sequence of frames, the new output is stored in feature memory 370 and the oldest output (e.g., features extracted by visual encoder 302) may be discarded.

[0061] In the depicted example, the contextual distance or temporal contextual range of an online kernel may be provided by $\min(2^B + 1, \text{max})$, where B is the block or layer number to which the online kernel is added, and max is the maximum temporal contextual range. For example, the contextual distance D for online kernel 314 may be $\min(2^B + 1, \text{max}) = 3$ where B=1. The contextual distance D for online kernel 324 may be $\min(2^B + 1, \text{max}) = 5$ where B=2. The contextual distance D for online kernel 334 may be $\min(2^B + 1, \text{max}) = 9$ where B=3.
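The schedule in the depicted example can be checked with a few lines of Python. The function names and the value 64 used for the maximum are illustrative assumptions. The `tap_spacing` helper is an inference rather than a statement from the disclosure: it assumes the inputs are evenly spaced and that the contextual distance counts the frame indices spanned inclusively.

```python
def contextual_distance(block_number, max_range):
    """Contextual distance for the kernel after a given block, capped at max_range."""
    return min(2 ** block_number + 1, max_range)


def tap_spacing(distance, kernel_size):
    """Frame spacing between evenly spaced inputs spanning `distance` frame indices."""
    return (distance - 1) // (kernel_size - 1)


distances = [contextual_distance(b, max_range=64) for b in (1, 2, 3)]
spacings = [tap_spacing(d, kernel_size=3) for d in distances]

print(distances)   # [3, 5, 9]
print(spacings)    # [1, 2, 4]
```

Under these assumptions, a three-input kernel reads every frame after block 1, every second frame after block 2, and every fourth frame after block 3, and the cap takes effect once $2^B + 1$ exceeds the maximum.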

[0062] In the depicted example, the number of inputs processed by an online kernel is 3. In some embodiments, the number of inputs is 2. It is envisioned that other numbers can be used, and the number of inputs being processed by a given online kernel may be a hyperparameter. Increasing the number of inputs may increase the complexity of the online kernel, but having additional inputs may extract more cross-frame interaction from the outputs of a ViT block. In some embodiments, the number of inputs processed by an online kernel may increase as the block number of the ViT block to which the online kernel is added increases. In some embodiments, the number of inputs processed by an online kernel may decrease as the block number of the ViT block to which the online kernel is added increases.

[0063] In some cases, the length or size of a given internal memory, R, may vary based on the contextual range of the online kernel which operates on parts of the internal memory. A larger internal memory, or larger R, may be allocated for an online kernel which has a greater contextual range to ensure the online kernel has access to the historical outputs of the ViT block after which the online kernel is inserted.

Exemplary online feature aggregator to condense the length of the input tokens

[0064] To extract compact visual tokens, an online feature aggregation method can be implemented to operate iteratively over the duration of the long sequence input, one window of features extracted by the visual encoder at a time. In addition, the online feature aggregation method can implement a cross-attention mechanism between the language features and the visual features. For streamlined computation in the online feature aggregator, a key-value cache mechanism that retains the keys and values of historical features can be implemented to eliminate redundant calculations during the attention process. In other words, key tensors and value tensors calculated for historical features can be cached and reused in the iterative process of generating new tokens, instead of recomputing the key tensors and value tensors when generating the next token. Consequently, visual tokens are learned progressively in an online mode, enhancing the model's efficient learning capabilities.

[0065] The online feature aggregation module can learn critical visual features in an iterative manner. Given the feature memory of size F storing F historical feature outputs of the visual encoder produced for F frames (the feature outputs stored in the feature memory may be denoted as y0, y1, …, yf), feature aggregation can be conducted progressively using the feature memory. At each iteration, a sliding window slides on the feature memory with a sliding space of W, and the sliding window of features in the feature memory is processed by the online feature aggregator. In some cases, the sliding window size W=1. In some cases, the sliding window size W=2. In some cases, the sliding window size W=4. The sliding window size can mean that the feature aggregation process is conducted every W frames. W may be a hyperparameter, which can be set to optimize for one or more metrics.
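One reading of "conducted every W frames" is sketched below: feature outputs are pushed into a fixed length feature memory as they arrive, and the current contents are handed to the aggregator once every W frames. The generator name `stream_aggregation_windows` and the use of integers in place of feature outputs are assumptions for illustration.

```python
from collections import deque


def stream_aggregation_windows(feature_outputs, memory_size_f, stride_w):
    """Yield the feature memory contents once every `stride_w` frames."""
    feature_memory = deque(maxlen=memory_size_f)
    for count, feature in enumerate(feature_outputs, start=1):
        feature_memory.append(feature)         # oldest entry drops out when full
        if count % stride_w == 0:
            yield list(feature_memory)


# Integers stand in for the per-frame feature outputs of the visual encoder.
windows = list(stream_aggregation_windows(range(6), memory_size_f=4, stride_w=2))
print(windows)   # [[0, 1], [0, 1, 2, 3], [2, 3, 4, 5]]
```

A smaller W triggers aggregation more often and produces more overlap between consecutive windows, whereas a larger W reduces the number of aggregation iterations.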

[0066] FIG. 5 depicts feature memory 370, online feature aggregator 376, a further input encoder, and LLM 106, according to some embodiments of the disclosure. The further input encoder may include tokenizer 110 to process language input 108 and generate language tokens 124. Online feature aggregator 376 may include query memory 366 and feature aggregator 380.

[0067] Feature memory 370 as depicted may store historical feature outputs produced and generated by visual encoder 302. A feature output (e.g., having one or more features) generated for a frame is depicted as a shaded circle in feature memory 370. Feature memory 370 may be sized to store F feature outputs extracted for F frames.

[0068] Query memory 366 may be initialized randomly or with random values. Query memory 366 may be initialized with predetermined values. Feature aggregator 380 may iteratively update query memory 366. At every iteration, feature aggregator 380 may predict the next query token ("NEXT Q") and update query memory 366 by adding the predicted query token and discarding an old query token. Query memory 366 may be a fixed length memory, or a fixed length memory bank, that is allocated to store historical outputs generated by feature aggregator 380. Query memory 366 may be a first-in-first-out memory. The fixed length of query tokens stored in query memory 366 may represent the most up to date visual tokens, or visual information about the video input. The most up to date visual tokens are used progressively and iteratively to predict the next visual token and to update query memory 366 accordingly.

[0069] Feature aggregator 380 may include self-attention 510, and cross-attention 520. Optionally, feature aggregator 380 may include cross-attention 530. In some embodiments, cross-attention 520 may be applied before cross-attention 530. In some embodiments, cross-attention 530 may be applied before cross-attention 520.

[0070] Self-attention 510 may perform a self-attention mechanism based on a fixed length of query tokens stored in query memory 366. The fixed length of query tokens may be retrieved from query memory 366 using a sliding window of a fixed size and having a sliding stride of 1. Self-attention is a mechanism where a sequence attends to different positions within itself. Self-attention 510 may determine attention weights where the query tokens are attending to different positions within the query tokens, looking at how each query token relates to every other query token. Self-attention 510 may create tensors Q, K and V based on the fixed length of query tokens, and apply self-attention using the following:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (2)$$

[0071] $d_k$ represents the dimensionality of the key vectors and is used for scaling. Self-attention 510 produces an output, which is used as an input to a following stage in feature aggregator 380, e.g., cross-attention 520, or cross-attention 530.
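For reference, scaled dot-product attention over tensors Q, K and V with scaling by the key dimensionality can be written in a few lines of NumPy. This is the standard formulation, shown here as an assumption about what the attention stages compute; the learned projections that create Q, K and V are omitted, and the function names are illustrative.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = k.shape[-1]                        # dimensionality of the key vectors
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v, weights


rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(4, 16))      # e.g., a fixed length of query tokens

# Self-attention: the same sequence supplies Q, K and V.
output, weights = attention(query_tokens, query_tokens, query_tokens)
```

Each row of `weights` sums to one and describes how strongly one query token attends to every other query token. For cross-attention, K and V would come from a different sequence than Q.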

[0072] Cross-attention 520 may perform a cross-attention mechanism based on two inputs: the output from self-attention 510 and a sliding window of features from feature memory 370. Cross-attention is a mechanism where elements in one sequence attend to elements in another sequence. Cross-attention 520 may determine attention weights where the output from self-attention 510 is attending to different positions within the sliding window of features from feature memory 370, looking at how each token in the output from self-attention 510 relates to every feature in the sliding window of features from feature memory 370. Cross-attention 520 may create three tensors Q, K and V, where Q is based on the output from self-attention 510, and K and V are based on the sliding window of features from feature memory 370, and apply cross-attention using equation 2 as shown above. Cross-attention 520 produces an output, which can be used as an input to a following stage in feature aggregator 380, e.g., cross-attention 530. In some cases, the output of cross-attention 520 is stored in query memory 366.

[0073] In cases where cross-attention 530 is implemented, cross-attention 530 may perform a cross-attention mechanism based on two inputs: the output from cross-attention 520 and language tokens 124. Cross-attention 530 may determine attention weights where the output from cross-attention 520 is attending to different positions within language tokens 124, looking at how each token in the output from cross-attention 520 relates to every token in language tokens 124. Cross-attention 530 may create three tensors Q, K and V, where Q is based on the output from cross-attention 520, and K and V are based on language tokens 124, and apply cross-attention using equation 2 as shown above. Cross-attention 530 produces an output, which can be used as an input to a following stage in feature aggregator 380, e.g., cross-attention 520. In some cases, the output of cross-attention 530 is stored in query memory 366.

[0074] In some embodiments, the order of cross-attention 520 and cross-attention 530 may be swapped, where cross-attention 530 may be applied first to the output of self-attention 510 and language tokens 124, and cross-attention 520 may be applied to the output of cross-attention 530 and the sliding window of features from feature memory 370. The output from cross-attention 520 may be stored in query memory 366.

[0075] In the next iteration, feature aggregator 380 may operate on a further sliding window of features from feature memory 370, and at least a part of query memory 366 to progressively update query memory 366. At least a part of query memory 366 is used as an input to self-attention 510. Cross-attention 520 may process the output of self-attention 510 and the further sliding window of features from feature memory 370 and generate an output. Cross-attention 530 may process the output of cross-attention 520 and language tokens 124 and generate an output. The output of cross-attention 530 may be stored in query memory 366.
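The iteration described above can be sketched end to end with toy data. Several simplifications are assumptions made for brevity and are labeled in the code: the attention stages use identity projections instead of learned weights, the last token of the final stage is taken as the next query token, and `aggregator_step` is a hypothetical name.

```python
from collections import deque

import numpy as np


def attention(q, k, v):
    """Scaled dot-product attention (identity projections, an assumption)."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def aggregator_step(query_memory, feature_window, language_tokens=None):
    """One iteration: self-attention, cross-attention(s), query memory update."""
    queries = np.stack(list(query_memory))
    x = attention(queries, queries, queries)              # self-attention stage
    x = attention(x, feature_window, feature_window)      # visual cross-attention
    if language_tokens is not None:
        x = attention(x, language_tokens, language_tokens)  # language cross-attention
    next_query = x[-1]               # assumption: last token is the next query token
    query_memory.append(next_query)  # FIFO: the oldest query token is discarded
    return next_query


rng = np.random.default_rng(0)
dim, query_length, window_size, stride = 16, 4, 6, 2

# Query memory initialized with random values.
query_memory = deque(rng.normal(size=(query_length, dim)), maxlen=query_length)
feature_memory = rng.normal(size=(12, dim))   # toy feature outputs for 12 frames
language_tokens = rng.normal(size=(5, dim))   # toy language tokens

for start in range(0, len(feature_memory) - window_size + 1, stride):
    window = feature_memory[start:start + window_size]
    last_query = aggregator_step(query_memory, window, language_tokens)
```

The number of query tokens stays fixed at Q across iterations, so the visual representation handed to the language model does not grow with the length of the video.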

[0076] Due to the iterative nature of online feature aggregator 376, a key-value cache can be implemented to cache past key tensors and value tensors calculated for features in the sliding window of features from feature memory 370 (e.g., where consecutive sliding windows share features), language tokens 124, and query tokens in at least a part of query memory 366, to avoid redundant computations in the self-attention and cross-attention mechanisms of feature aggregator 380.
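A minimal sketch of such a cache for overlapping sliding windows follows. It assumes that the key and value projections stay fixed across iterations, so a feature projected once can be reused by every later window that contains it. Keying the cache by feature index, the `KVCache` class, and the toy projection functions are hypothetical, and eviction of stale entries is not shown.

```python
class KVCache:
    """Caches per-feature key/value projections, keyed by feature index."""

    def __init__(self, project_k, project_v):
        self.project_k = project_k
        self.project_v = project_v
        self.store = {}    # feature index -> (key, value)
        self.computed = 0  # number of features actually projected

    def get(self, index, feature):
        """Project a feature only the first time its index is seen."""
        if index not in self.store:
            self.store[index] = (self.project_k(feature), self.project_v(feature))
            self.computed += 1
        return self.store[index]

    def window(self, feature_memory, start, size):
        """Return keys and values for feature_memory[start:start + size]."""
        stop = min(start + size, len(feature_memory))
        pairs = [self.get(i, feature_memory[i]) for i in range(start, stop)]
        return [k for k, _ in pairs], [v for _, v in pairs]

# Toy usage: two windows of size 4 that overlap by two features.
def project_k(f):
    return [2.0 * x for x in f]

def project_v(f):
    return [x + 1.0 for x in f]

feature_memory = [[float(t)] for t in range(6)]
cache = KVCache(project_k, project_v)
keys_a, values_a = cache.window(feature_memory, 0, 4)
keys_b, values_b = cache.window(feature_memory, 2, 4)
```

In this toy run the two windows cover eight positions in total, but only six projections are computed because two features are shared.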

[0077] In some embodiments, the self-attention and cross-attention mechanisms of feature aggregator 380 can be replaced by or include a multi-layer perceptron, a neural network, one or more 1D convolutional layers, etc.

Exemplary method for generating tokens for inputs with long-context

[0078] FIG. 6 depicts a flowchart illustrating a method for generating tokens for inputs with long-context, according to some embodiments of the disclosure. Method 600 can be performed using a computing device, such as computing device 700 in FIG. 7. Method 600 may be performed using or by one or more parts illustrated in FIGS. 4-6.

[0079] In 602, a block of an input encoder processes a frame of a sequence of frames to produce an output.

[0080] In 604, the output of the block is stored in a memory.

[0081] In 606, a kernel of the input encoder is applied to one or more parts of the memory.

[0082] In 608, a further block of the input encoder processes an output of the kernel.

[0083] In 610, an output of the further block is stored in a further memory.

[0084] In 612, a further kernel of the input encoder is applied to one or more parts of the further memory.

[0085] In 614, the input encoder generates one or more features based on an output of the further kernel.

Exemplary computing device

[0086] FIG. 7 is a block diagram of an apparatus or a system, e.g., an exemplary computing device 700, according to some embodiments of the disclosure. One or more computing devices 700 may be used to implement the functionalities described with the FIGS. and herein. A number of components illustrated in FIG. 7 can be included in computing device 700, but any one or more of these components may be omitted or duplicated, as suitable for the application. In some embodiments, some or all of the components included in computing device 700 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, computing device 700 may not include one or more of the components illustrated in FIG. 7, and computing device 700 may include interface circuitry for coupling to the one or more components. For example, the computing device 700 may not include display device 706, and may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 706 may be coupled. In another set of examples, computing device 700 may not include audio input device 718 or an audio output device 708 and may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 718 or audio output device 708 may be coupled.

[0087] Computing device 700 may include processing device 702 (e.g., one or more processing devices, one or more of the same types of processing device, one or more of different types of processing device). Processing device 702 may include electronic circuitry that processes electronic data from data storage elements (e.g., registers, memory, resistors, capacitors, quantum bit cells) to transform that electronic data into other electronic data that may be stored in registers and / or memory. Examples of processing device 702 may include a central processing unit (CPU), a GPU, a quantum processor, a machine learning processor, an artificial intelligence processor, a neural network processor, an artificial intelligence accelerator, an application specific integrated circuit (ASIC), an analog signal processor, an analog computer, a microprocessor, a digital signal processor, a field-programmable gate array (FPGA), a tensor processing unit (TPU), a data processing unit (DPU), etc.

[0088] The computing device 700 may include a memory 704, which may itself include one or more memory devices such as volatile memory (e.g., DRAM), nonvolatile memory (e.g., read-only memory (ROM)), high bandwidth memory (HBM), flash memory, solid state memory, and / or a hard drive. Memory 704 includes one or more non-transitory computer-readable storage media. In some embodiments, memory 704 may include memory that shares a die with the processing device 702.

[0089] In some embodiments, memory 704 includes one or more non-transitory computer-readable media storing instructions executable to perform operations described with the FIGS. and herein, such as the methods and operations illustrated in the FIGS. In some embodiments, memory 704 includes one or more non-transitory computer-readable media storing instructions executable to perform operations of method 600 of FIG. 6. In some cases, the instructions may include configuration files. In some cases, the instructions may include machine-readable instructions according to an instruction set. Exemplary parts that may be encoded as instructions and stored in memory 704 are depicted. Memory 704 may store instructions that encode one or more exemplary parts, such as one or more parts of multimodal LLM 300, one or more parts illustrated in FIG. 4, one or more parts illustrated in FIG. 5, visual encoder 302, tokenizer 110, and LLM 106. The instructions stored in the one or more non-transitory computer-readable media may be executed by processing device 702. Memory 704 may include internal memories, feature memory, and query memory as described herein.

[0090] In some embodiments, memory 704 may store data, e.g., data structures, binary data, bits, metadata, files, blobs, etc., as described with the FIGS. and herein. For example, memory 704 may store input video frames and output predictions.

[0091] In some embodiments, memory 704 may store one or more DNNs (or parts thereof) of visual encoder 302. Memory 704 may store one or more DNNs (or parts thereof) of LLM 106. Memory 704 may store training data used to train the DNN. Exemplary training data may include a training data set having multimodal inputs and corresponding ground truth labels / classifications. Memory 704 may store instructions that perform operations associated with training the DNN. Memory 704 may store input data, output data, intermediate outputs, and intermediate inputs of the one or more DNNs. Memory 704 may store one or more parameters used by the one or more DNNs. Memory 704 may store information that encodes how nodes of the one or more DNNs are connected with each other. Memory 704 may store instructions (e.g., low-level machine code) to perform one or more operations of the one or more DNNs. Memory 704 may store a model definition that specifies one or more operations of a DNN. Memory 704 may store instructions, such as configuration files, that are generated by a compiler based on the model definition.

[0092] In some embodiments, the computing device 700 may include a communication device 712 (e.g., one or more communication devices). For example, the communication device 712 may be configured for managing wired and / or wireless communications for the transfer of data to and from the computing device 700. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not. The communication device 712 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and / or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication device 712 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication device 712 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication device 712 may operate in accordance with Code-division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication device 712 may operate in accordance with other wireless protocols in other embodiments. The computing device 700 may include an antenna 722 to facilitate wireless communications and / or to receive other wireless communications (such as radio frequency transmissions). The computing device 700 may include receiver circuits and / or transmitter circuits. In some embodiments, the communication device 712 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., Ethernet). As noted above, the communication device 712 may include multiple communication chips. For instance, a first communication device 712 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication device 712 may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication device 712 may be dedicated to wireless communications, and a second communication device 712 may be dedicated to wired communications.

[0093] The computing device 700 may include power source / power circuitry 714. The power source / power circuitry 714 may include one or more energy storage devices (e.g., batteries or capacitors) and / or circuitry for coupling components of the computing device 700 to an energy source separate from the computing device 700 (e.g., DC power, AC power, etc.).

[0094] The computing device 700 may include a display device 706 (or corresponding interface circuitry, as discussed above). The display device 706 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display, for example.

[0095] The computing device 700 may include an audio output device 708 (or corresponding interface circuitry, as discussed above). The audio output device 708 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.

[0096] The computing device 700 may include an audio input device 718 (or corresponding interface circuitry, as discussed above). The audio input device 718 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output).

[0097] The computing device 700 may include a GPS device 716 (or corresponding interface circuitry, as discussed above). The GPS device 716 may be in communication with a satellite-based system and may receive a location of the computing device 700, as known in the art.

[0098] The computing device 700 may include a sensor 730 (or one or more sensors). The computing device 700 may include corresponding interface circuitry, as discussed above. Sensor 730 may sense physical phenomena and translate the physical phenomena into electrical signals that can be processed by, e.g., processing device 702. Examples of sensor 730 may include: capacitive sensor, inductive sensor, resistive sensor, electromagnetic field sensor, light sensor, camera, imager, microphone, pressure sensor, temperature sensor, vibrational sensor, accelerometer, gyroscope, strain sensor, moisture sensor, humidity sensor, distance sensor, range sensor, time-of-flight sensor, pH sensor, particle sensor, air quality sensor, chemical sensor, gas sensor, biosensor, ultrasound sensor, a scanner, etc.

[0099] The computing device 700 may include another output device 710 (or corresponding interface circuitry, as discussed above). Examples of the other output device 710 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, haptic output device, gas output device, vibrational output device, lighting output device, home automation controller, or an additional storage device.

[0100] The computing device 700 may include another input device 720 (or corresponding interface circuitry, as discussed above). Examples of the other input device 720 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.

[0101] The computing device 700 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile Internet device, a music player, a tablet computer, a laptop computer, a netbook computer, a personal digital assistant (PDA), a personal computer, a remote control, wearable device, headgear, eyewear, footwear, electronic clothing, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, an Internet-of-Things device, or a wearable computer system. In some embodiments, the computing device 700 may be any other electronic device that processes data.

Select examples

[0102] Example 1 provides an apparatus, including one or more processors; and one or more memories to store data and instructions, where the instructions, when executed by the one or more processors, cause the one or more processors to: process a frame of a sequence of frames by a transformer block of an input encoder to produce an output; store the output of the transformer block of the input encoder in a memory of the one or more memories; apply a temporally dilated kernel of the input encoder to one or more parts of the memory having one or more outputs generated by the transformer block of the input encoder for one or more frames of the sequence of frames; process an output of the temporally dilated kernel of the input encoder by a further transformer block of the input encoder; store an output of the further transformer block of the input encoder in a further memory of the one or more memories; apply a further temporally dilated kernel of the input encoder to one or more parts of the further memory having one or more further outputs generated by the further transformer block of the input encoder for one or more further frames of the sequence of frames; and generate one or more features by the input encoder based on an output of the further temporally dilated kernel of the input encoder.
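The per-frame flow recited in Example 1 can be sketched as follows. This is a toy illustration under stated assumptions: the transformer blocks are replaced by arbitrary callables, each temporally dilated kernel is modeled as a weighted sum over memory entries spaced `dilation` frames apart, taps that would reach before the start of the stream are skipped, and each memory is bounded to the span its kernel can reach. The `OnlineEncoder` class and all parameter names are hypothetical.

```python
from collections import deque

def dilated_kernel(memory, weights, dilation):
    """Weighted sum over memory entries spaced `dilation` steps apart, newest first.

    Taps that reach before the start of the stream are skipped (assumed policy).
    """
    newest = len(memory) - 1
    dim = len(memory[-1])
    out = [0.0] * dim
    for tap, w in enumerate(weights):
        idx = newest - tap * dilation
        if idx < 0:
            break
        for j in range(dim):
            out[j] += w * memory[idx][j]
    return out

class OnlineEncoder:
    """Two blocks, each followed by its own memory and temporally dilated kernel."""

    def __init__(self, block1, block2, weights1, weights2, dilation1=1, dilation2=2):
        self.block1, self.block2 = block1, block2
        self.weights1, self.weights2 = weights1, weights2
        self.dilation1, self.dilation2 = dilation1, dilation2
        # Bounded memories: keep only as many outputs as each kernel can reach.
        self.memory1 = deque(maxlen=(len(weights1) - 1) * dilation1 + 1)
        self.memory2 = deque(maxlen=(len(weights2) - 1) * dilation2 + 1)

    def step(self, frame):
        """Process one incoming frame and return the feature for this time step."""
        self.memory1.append(self.block1(frame))   # process frame, store output
        mixed = dilated_kernel(self.memory1, self.weights1, self.dilation1)
        self.memory2.append(self.block2(mixed))   # further block, further memory
        return dilated_kernel(self.memory2, self.weights2, self.dilation2)

# Toy usage with stand-in blocks and one-dimensional "frames".
encoder = OnlineEncoder(
    block1=lambda f: [2.0 * x for x in f],
    block2=lambda f: [x + 1.0 for x in f],
    weights1=[0.5, 0.5],
    weights2=[1.0, 1.0],
)
features = [encoder.step(frame) for frame in ([1.0], [2.0], [3.0])]
```

Because each frame is processed once and only block outputs are retained, the cost of a step does not depend on how many frames have already been seen.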

[0103] Example 2 provides the apparatus of example 1, where the instructions further cause the one or more processors to: store one or more features extracted by the input encoder in a feature memory of the one or more memories; and process one or more parts in the feature memory by a feature aggregator.

[0104] Example 3 provides the apparatus of example 2, where the instructions further cause the one or more processors to: store an output of the feature aggregator in a query memory; process at least a part of the query memory and one or more further parts in the feature memory by the feature aggregator to generate a further output; and store the further output of the feature aggregator in the query memory.

[0105] Example 4 provides the apparatus of example 3, where the instructions further cause the one or more processors to: after storing the further output of the feature aggregator in the query memory, process at least a further part of the query memory and one or more yet further parts in the feature memory by the feature aggregator to generate a yet further output; and store the yet further output of the feature aggregator in the query memory.

[0106] Example 5 provides the apparatus of any one of examples 2-4, where the instructions further cause the one or more processors to: process, by the feature aggregator, one or more tokens extracted by a further input encoder.

[0107] Example 6 provides the apparatus of any one of examples 2-5, where the instructions further cause the one or more processors to: process, by a machine learning model, one or more tokens extracted by the feature aggregator and one or more tokens extracted by a further input encoder.

[0108] Example 7 provides the apparatus of any one of examples 1-6, where: the temporally dilated kernel has a dilation rate; the further temporally dilated kernel has a further dilation rate; and the dilation rate is different from the further dilation rate.
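One way to see why different dilation rates may be useful is to compute which past frames each kernel reads and how many frames a stack of kernels spans. The sketch below assumes causal kernels that read the newest entry plus evenly spaced older entries; the kernel size of 3 and the rates 1 and 4 are hypothetical values chosen for illustration.

```python
def tap_offsets(kernel_size, dilation):
    """Frame offsets (0 = newest) read by one temporally dilated kernel."""
    return [tap * dilation for tap in range(kernel_size)]

def stacked_receptive_field(layers):
    """Number of frames spanned by causal kernels applied one after another.

    layers: list of (kernel_size, dilation) pairs, in order of application.
    """
    return 1 + sum((k - 1) * d for k, d in layers)

same_rate = stacked_receptive_field([(3, 1), (3, 1)])
mixed_rate = stacked_receptive_field([(3, 1), (3, 4)])
```

With these values, two kernels at the same rate span 5 frames, while rates of 1 and 4 span 11 frames with the same number of taps.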

[0109] Example 8 provides one or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to: process a frame of a sequence of frames by a transformer block of an input encoder to produce an output; store the output of the transformer block of the input encoder in a memory; apply a temporally dilated kernel of the input encoder to one or more parts of the memory having one or more outputs generated by the transformer block of the input encoder for one or more frames of the sequence of frames; process an output of the temporally dilated kernel of the input encoder by a further transformer block of the input encoder; store an output of the further transformer block of the input encoder in a further memory; apply a further temporally dilated kernel of the input encoder to one or more parts of the further memory having one or more further outputs generated by the further transformer block of the input encoder for one or more further frames of the sequence of frames; and generate one or more features by the input encoder based on an output of the further temporally dilated kernel of the input encoder.

[0110] Example 9 provides the one or more non-transitory computer-readable media of example 8, where the instructions further cause the one or more processors to: store one or more features extracted by the input encoder in a feature memory; and process one or more parts in the feature memory by a feature aggregator.

[0111] Example 10 provides the one or more non-transitory computer-readable media of example 9, where the instructions further cause the one or more processors to: store an output generated by the feature aggregator in a query memory; process at least a part of the query memory and one or more further parts in the feature memory by the feature aggregator to generate a further output; and store the further output generated by the feature aggregator in the query memory.

[0112] Example 11 provides the one or more non-transitory computer-readable media of example 10, where the instructions further cause the one or more processors to: after storing the further output of the feature aggregator in the query memory, process at least a part of the query memory and one or more yet further parts in the feature memory by the feature aggregator to generate a yet further output; and store the yet further output of the feature aggregator in the query memory.

[0113] Example 12 provides the one or more non-transitory computer-readable media of any one of examples 9-11, where the instructions further cause the one or more processors to: process, by the feature aggregator, one or more tokens extracted by a further input encoder.

[0114] Example 13 provides the one or more non-transitory computer-readable media of any one of examples 9-12, where the instructions further cause the one or more processors to: process, by a machine learning model, one or more tokens extracted by the feature aggregator and one or more tokens extracted by a further input encoder.

[0115] Example 14 provides the one or more non-transitory computer-readable media of any one of examples 9-13, where: the temporally dilated kernel has a dilation rate; the further temporally dilated kernel has a further dilation rate; and the dilation rate is different from the further dilation rate.

[0116] Example 15 provides a method, including processing a frame of a sequence of frames by a transformer block of an input encoder to produce an output; storing the output of the transformer block of the input encoder in a memory; applying a temporally dilated kernel of the input encoder to one or more parts of the memory having one or more outputs generated by the transformer block of the input encoder for one or more frames of the sequence of frames; processing an output of the temporally dilated kernel of the input encoder by a further transformer block of the input encoder; storing an output of the further transformer block of the input encoder in a further memory; applying a further temporally dilated kernel of the input encoder to one or more parts of the further memory having one or more further outputs generated by the further transformer block of the input encoder for one or more further frames of the sequence of frames; and generating one or more features by the input encoder based on an output of the further temporally dilated kernel of the input encoder.

[0117] Example 16 provides the method of example 15, further including storing one or more features extracted by the input encoder in a feature memory; and processing one or more elements in the feature memory by a feature aggregator.

[0118] Example 17 provides the method of example 16, further including storing an output of the feature aggregator in a query memory; processing at least a part of the query memory and one or more further parts in the feature memory by the feature aggregator to generate a further output; and storing the further output of the feature aggregator in the query memory.

[0119] Example 18 provides the method of example 17, further including after storing the further output generated by the feature aggregator in the query memory, processing at least a part of the query memory and one or more yet further parts in the feature memory by the feature aggregator to generate a yet further output; and storing the yet further output of the feature aggregator in the query memory.

[0120] Example 19 provides the method of any one of examples 16-18, further including processing, by the feature aggregator, one or more tokens extracted by a further input encoder.

[0121] Example 20 provides the method of any one of examples 16-19, further including processing, by a machine learning model, one or more tokens extracted by the feature aggregator and one or more tokens extracted by a further input encoder.

[0122] Example 21 provides the method of any one of examples 15-20, where: the temporally dilated kernel has a dilation rate; the further temporally dilated kernel has a further dilation rate; and the dilation rate is different from the further dilation rate.

[0123] Example A is an apparatus comprising means for carrying out any one of the methods according to examples 15-21.

[0124] Example B includes a multimodal LLM as described and illustrated herein.

[0125] Example C includes a visual encoder having one or more internal memories as described and illustrated herein.

[0126] Example D includes the visual encoder of example C, wherein the visual encoder further includes one or more online kernels as described and illustrated herein.

[0127] Example E includes a feature memory and an online feature aggregator as described and illustrated herein.

Variations and other notes

[0128] Although the operations of the example method shown in and described with reference to FIGS. 3-6 are illustrated as occurring once each and in a particular order, it will be recognized that the operations may be performed in any suitable order and repeated as desired. Additionally, one or more operations may be performed in parallel. Furthermore, the operations illustrated in FIGS. 3-6 may be combined or may include more or fewer details than described.

[0129] Tokens, such as input tokens, language tokens, visual tokens, etc., represent basic units of data that an LLM, or a transformer-based neural network, processes. Language tokens can represent whole words, parts of words, punctuation, spaces, and special characters. Visual tokens can represent parts of images, images, frames, or a chunk of frames in a sequence of frames, etc. Herein, a token is also used to refer to, or is used interchangeably with, a token embedding. A token embedding is a dense vector representation of a token in a high-dimensional space. The dense vector can be a vector of floating-point numbers. The token embedding can capture semantic properties of the token. Input encoders or feature extraction networks can produce input tokens, or token embeddings for the input tokens. Token embeddings enable the LLM or a transformer-based neural network to perform operations to understand token relationships.
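The relationship between a token and its embedding can be illustrated with a small lookup table. This sketch assumes a fixed vocabulary and uses random vectors in place of learned embeddings, so the vectors here carry no semantic meaning; the function names, the vocabulary, and the dimension are hypothetical.

```python
import random

def build_embedding_table(vocab, dim, seed=0):
    """Map each token to a dense vector of floats (random stand-ins for learned values)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}

def embed(tokens, table):
    """Replace each token with its embedding vector."""
    return [table[t] for t in tokens]

table = build_embedding_table(["long", "video", "memory"], dim=4)
embedded = embed(["video", "memory", "video"], table)
```

In a trained model the table entries would be learned parameters, and visual tokens would typically come from an input encoder rather than a vocabulary lookup.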

[0130] Hyperparameters are tuning knobs that can impact one or more metrics of a model. Hyperparameters can be set through techniques such as grid search, random search, cost function optimization, or Bayesian optimization. The one or more metrics can include performance of a model, latency of a model, memory utilization, and compute utilization. The one or more metrics may relate to constraints of a system implementing the model (e.g., resources available, available credits, etc.). The one or more metrics may relate to targets of a model (e.g., max batch size, length of input sequence, target latency, etc.).

[0131] Various models can be trained using training data, or in an unsupervised manner. Parameters of the model (e.g., parameters in: ViT blocks, online kernels, input encoders, LLMs, etc.) may be updated during the training process, or through unsupervised learning.

[0132] A tensor, as used herein, is a mathematical abstraction for representing numerical data across one or more dimensions. Just as a scalar is a single number, a vector is a one-dimensional array, and a matrix is a two-dimensional grid, a tensor can include a scalar, a vector, a matrix, or data structures having three or more dimensions.

[0133] The various implementations described herein may refer to artificial intelligence, machine learning, and deep learning. Machine learning may be a subset of artificial intelligence. Deep learning may be a subset of machine learning. In cases where a deep learning model is mentioned, if suitable for a particular application, a different kind of machine learning model may be used instead. In cases where a deep learning model is mentioned, if suitable for a particular application, a different kind of artificial intelligence model may be used instead. In cases where a deep learning model, machine learning model, or an artificial intelligence model is mentioned, if suitable for a particular application, a digital signal processing system may be used instead.

[0134] The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications may be made to the disclosure in light of the above detailed description.

[0135] For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and / or that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.

[0136] Further, references are made to the accompanying drawings that form a part hereof, and in which are shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.

[0137] Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the disclosed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.

[0138] For the purposes of the present disclosure, the phrase “A or B” or the phrase "A and / or B" means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase “A, B, or C” or the phrase "A, B, and / or C" means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term "between," when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.

[0139] The description uses the phrases "in an embodiment" or "in embodiments," which may each refer to one or more of the same or different embodiments. The terms "comprising," "including," "having," and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as "above," "below," "top," "bottom," and "side" to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives “first,” “second,” and “third,” etc., to describe a common object, merely indicates that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.

[0140] In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.

[0141] The terms "substantially," "close," "approximately," "near," and "about" generally refer to being within +/-20% of a target value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., "coplanar," "perpendicular," "orthogonal," "parallel," or any other angle between the elements, generally refer to being within +/-5-20% of a target value as described herein or as known in the art.

[0142] In addition, the terms "comprise," "comprising," "include," "including," "have," "having," or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, or device that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, or device. Also, the term "or" refers to an inclusive "or" and not to an exclusive "or."

[0143] The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description and the accompanying drawings.

Claims

1. An apparatus, comprising:
one or more processors; and
one or more memories to store data and instructions, wherein the instructions, when executed by the one or more processors, cause the one or more processors to:
process a frame of a sequence of frames by a transformer block of an input encoder to produce an output;
store the output of the transformer block of the input encoder in a memory of the one or more memories;
apply a temporally dilated kernel of the input encoder to one or more parts of the memory having one or more outputs generated by the transformer block of the input encoder for one or more frames of the sequence of frames;
process an output of the temporally dilated kernel of the input encoder by a further transformer block of the input encoder;
store an output of the further transformer block of the input encoder in a further memory of the one or more memories;
apply a further temporally dilated kernel of the input encoder to one or more parts of the further memory having one or more further outputs generated by the further transformer block of the input encoder for one or more further frames of the sequence of frames; and
generate one or more features by the input encoder based on an output of the further temporally dilated kernel of the input encoder.

2. The apparatus of claim 1, wherein the instructions further cause the one or more processors to:
store one or more features extracted by the input encoder in a feature memory of the one or more memories; and
process one or more parts in the feature memory by a feature aggregator.

3. The apparatus of claim 2, wherein the instructions further cause the one or more processors to:
store an output of the feature aggregator in a query memory;
process at least a part of the query memory and one or more further parts in the feature memory by the feature aggregator to generate a further output; and
store the further output of the feature aggregator in the query memory.

4. The apparatus of claim 3, wherein the instructions further cause the one or more processors to:
after storing the further output of the feature aggregator in the query memory, process at least a further part of the query memory and one or more yet further parts in the feature memory by the feature aggregator to generate a yet further output; and
store the yet further output of the feature aggregator in the query memory.

5. The apparatus of claim 2, wherein the instructions further cause the one or more processors to:
process, by the feature aggregator, one or more tokens extracted by a further input encoder.

6. The apparatus of claim 2, wherein the instructions further cause the one or more processors to:
process, by a machine learning model, one or more tokens extracted by the feature aggregator and one or more tokens extracted by a further input encoder.

7. The apparatus of claim 1, wherein:
the temporally dilated kernel has a dilation rate;
the further temporally dilated kernel has a further dilation rate; and
the dilation rate is different from the further dilation rate.

8. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to:
process a frame of a sequence of frames by a transformer block of an input encoder to produce an output;
store the output of the transformer block of the input encoder in a memory;
apply a temporally dilated kernel of the input encoder to one or more parts of the memory having one or more outputs generated by the transformer block of the input encoder for one or more frames of the sequence of frames;
process an output of the temporally dilated kernel of the input encoder by a further transformer block of the input encoder;
store an output of the further transformer block of the input encoder in a further memory;
apply a further temporally dilated kernel of the input encoder to one or more parts of the further memory having one or more further outputs generated by the further transformer block of the input encoder for one or more further frames of the sequence of frames; and
generate one or more features by the input encoder based on an output of the further temporally dilated kernel of the input encoder.

9. The one or more non-transitory computer-readable media of claim 8, wherein the instructions further cause the one or more processors to:
store one or more features extracted by the input encoder in a feature memory; and
process one or more parts in the feature memory by a feature aggregator.

10. The one or more non-transitory computer-readable media of claim 9, wherein the instructions further cause the one or more processors to:
store an output generated by the feature aggregator in a query memory;
process at least a part of the query memory and one or more further parts in the feature memory by the feature aggregator to generate a further output; and
store the further output generated by the feature aggregator in the query memory.

11. The one or more non-transitory computer-readable media of claim 10, wherein the instructions further cause the one or more processors to:
after storing the further output of the feature aggregator in the query memory, process at least a part of the query memory and one or more yet further parts in the feature memory by the feature aggregator to generate a yet further output; and
store the yet further output of the feature aggregator in the query memory.

12. The one or more non-transitory computer-readable media of claim 9, wherein the instructions further cause the one or more processors to:
process, by the feature aggregator, one or more tokens extracted by a further input encoder.

13. The one or more non-transitory computer-readable media of claim 9, wherein the instructions further cause the one or more processors to:
process, by a machine learning model, one or more tokens extracted by the feature aggregator and one or more tokens extracted by a further input encoder.

14. The one or more non-transitory computer-readable media of claim 9, wherein:
the temporally dilated kernel has a dilation rate;
the further temporally dilated kernel has a further dilation rate; and
the dilation rate is different from the further dilation rate.

15. A method, comprising:
processing a frame of a sequence of frames by a transformer block of an input encoder to produce an output;
storing the output of the transformer block of the input encoder in a memory;
applying a temporally dilated kernel of the input encoder to one or more parts of the memory having one or more outputs generated by the transformer block of the input encoder for one or more frames of the sequence of frames;
processing an output of the temporally dilated kernel of the input encoder by a further transformer block of the input encoder;
storing an output of the further transformer block of the input encoder in a further memory;
applying a further temporally dilated kernel of the input encoder to one or more parts of the further memory having one or more further outputs generated by the further transformer block of the input encoder for one or more further frames of the sequence of frames; and
generating one or more features by the input encoder based on an output of the further temporally dilated kernel of the input encoder.

16. The method of claim 15, further comprising:
storing one or more features extracted by the input encoder in a feature memory; and
processing one or more elements in the feature memory by a feature aggregator.

17. The method of claim 16, further comprising:
storing an output of the feature aggregator in a query memory;
processing at least a part of the query memory and one or more further parts in the feature memory by the feature aggregator to generate a further output; and
storing the further output of the feature aggregator in the query memory.

18. The method of claim 17, further comprising:
after storing the further output generated by the feature aggregator in the query memory, processing at least a part of the query memory and one or more yet further parts in the feature memory by the feature aggregator to generate a yet further output; and
storing the yet further output of the feature aggregator in the query memory.

19. The method of claim 16, further comprising:
processing, by the feature aggregator, one or more tokens extracted by a further input encoder.

20. The method of claim 16, further comprising:
processing, by a machine learning model, one or more tokens extracted by the feature aggregator and one or more tokens extracted by a further input encoder.
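
For illustration only, the data flow recited in claims 15 to 18 may be sketched in Python as follows. This is a minimal, non-limiting sketch under stated assumptions and is not the disclosed implementation: `toy_block` is a hypothetical elementwise stand-in for a transformer block, `aggregate` is a hypothetical averaging stand-in for the feature aggregator, and the kernel weights, dilation rates, chunk size, and memory sizes are arbitrary values chosen for the example. The two dilation rates are set to differ, as in claims 7 and 14.

```python
from collections import deque


def toy_block(x, scale):
    """Hypothetical stand-in for a transformer block (elementwise map)."""
    return [scale * v + 1.0 for v in x]


def dilated_kernel(memory, weights, dilation):
    """Weighted sum over stored outputs spaced `dilation` frames apart,
    starting from the newest entry. Taps before the first frame are skipped."""
    entries = list(memory)
    out = [0.0] * len(entries[-1])
    for k, w in enumerate(weights):
        idx = len(entries) - 1 - k * dilation
        if idx < 0:
            break
        out = [o + w * v for o, v in zip(out, entries[idx])]
    return out


class OnlineEncoder:
    """Two-stage encoder; each stage keeps only the outputs its kernel can reach."""

    def __init__(self, weights=(0.5, 0.3, 0.2), dilation=1, further_dilation=2):
        self.weights = weights
        self.dilation = dilation
        self.further_dilation = further_dilation
        taps = len(weights) - 1
        self.memory = deque(maxlen=taps * dilation + 1)
        self.further_memory = deque(maxlen=taps * further_dilation + 1)

    def step(self, frame):
        # Block -> memory -> temporally dilated kernel.
        self.memory.append(toy_block(frame, 1.0))
        mixed = dilated_kernel(self.memory, self.weights, self.dilation)
        # Further block -> further memory -> further temporally dilated kernel.
        self.further_memory.append(toy_block(mixed, 0.5))
        return dilated_kernel(
            self.further_memory, self.weights, self.further_dilation
        )


def aggregate(query_part, feature_parts):
    """Hypothetical feature aggregator: mean of the previous query (if any)
    and the newly selected parts of the feature memory."""
    rows = ([query_part] if query_part is not None else []) + feature_parts
    return [sum(col) / len(rows) for col in zip(*rows)]


def run(frames, chunk=2):
    """Process frames one at a time; aggregate every `chunk` frames."""
    enc = OnlineEncoder()
    feature_memory, query_memory = [], []
    for t, frame in enumerate(frames, 1):
        feature_memory.append(enc.step(frame))
        if t % chunk == 0:
            prev = query_memory[-1] if query_memory else None
            query_memory.append(aggregate(prev, feature_memory[-chunk:]))
    return enc, feature_memory, query_memory
```

In this sketch each frame is processed once as it arrives, and the per-stage memories are bounded by the reach of their kernels (three and five entries for the values above), so the encoder state does not grow with the length of the sequence. The feature memory is kept as an unbounded list here purely for simplicity.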