Distillation method and device of model, electronic equipment and computer program product
By constructing a multi-dimensional model distillation framework using perceptual loss, response loss, and memory loss, this approach addresses the issues of visual similarity, dynamic response transmission characteristics, and memory mechanisms in student models during game development. This results in high-quality, real-time student models, enhancing the game's interactive experience.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: GUANGZHOU BOGUAN TELECOMM TECH LTD
- Filing Date: 2026-03-18
- Publication Date: 2026-06-19
AI Technical Summary
Existing technologies struggle to build high-quality, real-time student models in game development, failing to simultaneously satisfy visual similarity, dynamic response delivery characteristics, and memory mechanisms, resulting in insufficient real-time interactive frame rates and poor user experience.
A model distillation method is adopted that constructs a multi-dimensional distillation framework by combining perceptual loss, response loss, and memory loss. The perceptual loss preserves the fidelity of the student model's visual output, the response loss transfers the teacher model's dynamic response characteristics, and the memory loss aligns the student model's memory mechanism with that of the teacher model.
The trained student model, while maintaining lightweight and real-time performance, provides an interactive experience and intelligent behavior comparable to complex teacher models, significantly improving the performance and user experience of the game object interaction model.
Figure CN122242643A_ABST
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and more specifically, to a model distillation method, a model distillation apparatus, an electronic device, and a computer program product.
Background Technology
[0002] In the field of AI-driven game development, high-fidelity models (i.e., teacher models) can generate cinematic visuals and realistic physical interactions. However, their complex structure and large number of parameters result in computationally intensive inference processes, making it difficult to meet the real-time interactive frame rate requirements of games. Real-time rendering typically requires stable high frame rate output to ensure synchronization between player actions and visual feedback. However, teacher models often cause significant stuttering on ordinary hardware devices, ruining the user experience.
[0003] Meanwhile, lightweight models (i.e., student models) designed for real-time performance have fast inference speeds, but their generation quality has obvious defects, including missing visual details, inconsistent physical logic, and distorted responses to player behavior.
[0004] Therefore, there is an urgent need in this field for a model distillation method that can optimize visual similarity, dynamic response transfer characteristics and memory mechanisms, thereby constructing high-quality and real-time student models.
[0005] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this disclosure, and therefore may include information that does not constitute prior art known to those skilled in the art.
Summary of the Invention
[0006] The purpose of this disclosure is to provide a model distillation method, a model distillation apparatus, an electronic device, and a computer program product, which can at least to some extent optimize visual similarity, dynamic response transfer characteristics, and memory mechanisms, thereby constructing a high-quality and real-time student model.
[0007] According to a first aspect of this disclosure, a model distillation method is provided, comprising: inputting sampled action sequence samples of a game object into a teacher model and a student model respectively, to obtain a first video frame sample and first memory information output by the teacher model, and a second video frame sample and second memory information output by the student model; obtaining a perceptual loss based on image features of the first video frame sample and image features of the second video frame sample; extracting a first time-domain index and a first frequency-domain index of a visual signal in a region of interest of the first video frame sample, and a second time-domain index and a second frequency-domain index of a visual signal in a region of interest of the second video frame sample; obtaining a time-domain loss based on the first time-domain index and the second time-domain index, obtaining a frequency-domain loss based on the first frequency-domain index and the second frequency-domain index, and obtaining a response loss based on the time-domain loss and the frequency-domain loss; determining, based on the first memory information and the second memory information, a global distribution loss between the teacher model and the student model and an instance alignment loss at each time point, and obtaining a memory loss based on the global distribution loss and the instance alignment loss; and obtaining a joint loss based on at least one of the perceptual loss, the response loss, and the memory loss, and updating model parameters of the student model based on the joint loss to obtain a game object interaction model.
[0008] In one exemplary embodiment of this disclosure, the method further includes: The control data for the game object is acquired, and based on the basic action data of the game object, the control data, and the noise data, the action sequence sample of the game object is sampled.
[0009] In one exemplary embodiment of this disclosure, obtaining the perceptual loss based on the image features of the first video frame sample and the image features of the second video frame sample includes: The first video frame sample and the second video frame sample are input into the perception model, and the image features of the first video frame sample and the second video frame sample in multiple intermediate layers of the neural network of the perception model are extracted respectively. The feature distance between the first video frame sample and the second video frame sample is obtained based on the feature differences of the image features of the multiple intermediate layers, and the perceptual loss is obtained based on the feature distance.
[0010] In an exemplary embodiment of this disclosure, obtaining the feature distance between the first video frame sample and the second video frame sample based on the feature differences of the image features of the plurality of intermediate layers includes: The image features of the first video frame sample and the second video frame sample in the multiple intermediate layers are normalized by unit variance to obtain normalized features; The feature differences of the normalized features of the multiple intermediate layers are calculated based on a preset sliding window, and the feature distance between the first video frame sample and the second video frame sample is obtained based on the feature differences.
[0011] In an exemplary embodiment of this disclosure, extracting the first time-domain index and the first frequency-domain index of the visual signal in the region of interest of the first video frame sample includes: determining the region of interest from the first video frame sample based on a preset region selection strategy; extracting, based on the scene type, a target visual signal from multiple visual signals in the region of interest, and constructing a corresponding response curve based on the target visual signal; and extracting the first time-domain index and the first frequency-domain index of the target visual signal based on the response curve corresponding to the target visual signal; wherein the time-domain indices include amplitude and time delay, and the frequency-domain indices include amplitude spectrum, phase spectrum, and group delay.
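The index extraction described above can be illustrated with a minimal sketch. It assumes the response curve has already been reduced to a one-dimensional signal sampled at a fixed interval. The function name `response_metrics`, the use of cross-correlation to estimate the delay, and the dictionary keys are assumptions made for illustration and are not specified by this disclosure.

```python
import numpy as np

def response_metrics(action, response, dt):
    """Estimate time- and frequency-domain indices of a response curve.

    action, response: 1-D signals of equal length sampled every dt seconds
    (hypothetical inputs; the disclosure does not fix their format).
    """
    action = np.asarray(action, dtype=float)
    response = np.asarray(response, dtype=float)

    # Time domain: peak amplitude of the response.
    amplitude = float(np.max(np.abs(response)))

    # Time domain: delay of the response relative to the action, taken as
    # the lag that maximizes the cross-correlation (an assumed estimator).
    a0 = action - action.mean()
    r0 = response - response.mean()
    xcorr = np.correlate(r0, a0, mode="full")
    lag = int(np.argmax(xcorr)) - (len(action) - 1)
    delay = lag * dt

    # Frequency domain: amplitude spectrum, unwrapped phase spectrum, and
    # group delay (negative derivative of phase with respect to angular
    # frequency) of the response signal itself.
    spectrum = np.fft.rfft(response)
    freqs = np.fft.rfftfreq(len(response), d=dt)
    amp_spec = np.abs(spectrum)
    phase_spec = np.unwrap(np.angle(spectrum))
    group_delay = -np.gradient(phase_spec, 2.0 * np.pi * freqs)

    return {
        "amplitude": amplitude,
        "delay": delay,
        "freqs": freqs,
        "amp_spec": amp_spec,
        "phase_spec": phase_spec,
        "group_delay": group_delay,
    }
```

Running the same function on the teacher's and the student's response curves for an identical action yields the first and second indices that the response loss compares.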
[0012] In one exemplary embodiment of this disclosure, the method further includes: If multiple regions of interest exist, the target visual signals extracted from each region of interest are weighted and fused.
[0013] In one exemplary embodiment of this disclosure, determining the global distribution loss of the teacher model and the student model based on the first memory information and the second memory information includes: A mid-range memory summary pool for the teacher model is constructed based on the first memory information, and a mid-range memory summary pool for the student model is constructed based on the second memory information. The mid-range memory summary pool is compressed into a single global memory vector by a global aggregator, and the global distribution loss is obtained based on the global memory vectors of the teacher model and the student model.
[0014] In an exemplary embodiment of this disclosure, the first memory information includes short-term memory vectors of the teacher model at various time points, and the step of constructing a mid-range memory summary pool for the teacher model based on the first memory information includes: The short-term memory vectors of the teacher model at various time points within a preset time window are aggregated by an aggregator to obtain the memory information set of the preset time window, and the mid-range memory summary pool of the teacher model is obtained based on the memory information set of the preset time window.
[0015] In an exemplary embodiment of this disclosure, determining the instance alignment loss of the teacher model and the student model at various time points based on the first memory information and the second memory information includes: Based on a preset instance matching strategy, the short-term memory vectors in the memory information sets of the teacher model and the student model are aligned, and the instance alignment loss of the teacher model and the student model at each time point is determined according to the aligned short-term memory vectors.
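The two-level memory alignment described in the preceding paragraphs can be sketched as follows. This is an illustration under stated assumptions: the aggregators are plain mean pooling, the instance matching strategy pairs vectors at the same time index, both models expose memory vectors of the same dimension, and both losses are squared Euclidean distances. None of these choices is fixed by this disclosure, and the function names are hypothetical.

```python
import numpy as np

def build_summary_pool(short_term, window):
    """Mean-pool short-term memory vectors over non-overlapping windows.

    short_term: array of shape (T, D), one memory vector per time point.
    Returns an array of shape (T // window, D): the mid-range summary pool.
    """
    short_term = np.asarray(short_term, dtype=float)
    n_windows = short_term.shape[0] // window
    trimmed = short_term[: n_windows * window]
    return trimmed.reshape(n_windows, window, -1).mean(axis=1)

def memory_loss(teacher_mem, student_mem, window=4, w_global=1.0, w_instance=1.0):
    """Return (memory loss, global distribution loss, instance alignment loss)."""
    teacher_mem = np.asarray(teacher_mem, dtype=float)
    student_mem = np.asarray(student_mem, dtype=float)

    # Global level: compress each summary pool into one global memory vector
    # (mean pooling stands in for the global aggregator) and compare them.
    g_teacher = build_summary_pool(teacher_mem, window).mean(axis=0)
    g_student = build_summary_pool(student_mem, window).mean(axis=0)
    global_loss = float(np.sum((g_teacher - g_student) ** 2))

    # Instance level: match vectors at the same time index (assumed identity
    # matching) and average the squared distance over time points.
    steps = min(len(teacher_mem), len(student_mem))
    diff = teacher_mem[:steps] - student_mem[:steps]
    instance_loss = float(np.mean(np.sum(diff ** 2, axis=1)))

    total = w_global * global_loss + w_instance * instance_loss
    return total, global_loss, instance_loss
```

If the student's memory dimension differs from the teacher's, a learned projection would be needed before comparison; that step is omitted here.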
[0016] In one exemplary embodiment of this disclosure, the method further includes: If the teacher model is a black-box model, then the mid-range memory summary pool of the teacher model is obtained through an external summarizer; wherein, the external summarizer includes a cascaded structure of a frame-level feature encoder and a temporal aggregator, and the model parameters of the external summarizer are frozen during training.
[0017] In one exemplary embodiment of this disclosure, the method further includes: The loss weights of at least one of the perceptual loss, response loss, and memory loss are adaptively adjusted based on a preset weight adjustment strategy.
[0018] In one exemplary embodiment of this disclosure, the method further includes: When the moving average of the time-domain lag difference between the first video frame sample and the second video frame sample is greater than or equal to a preset time-lag difference threshold, the loss weight corresponding to the response loss is increased.
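The threshold rule above can be sketched as a small scheduler. The exponential form of the moving average, the multiplicative boost, and the upper bound on the weight are assumptions chosen for illustration; the disclosure only states that the response loss weight is increased when the moving average reaches the threshold. The class name `ResponseWeightScheduler` is hypothetical.

```python
class ResponseWeightScheduler:
    """Raise the response-loss weight when the smoothed lag gap is too large."""

    def __init__(self, threshold, base_weight=1.0, boost=1.5,
                 max_weight=10.0, momentum=0.9):
        self.threshold = threshold    # preset time-lag difference threshold
        self.weight = base_weight     # current response-loss weight
        self.boost = boost            # assumed multiplicative increase
        self.max_weight = max_weight  # assumed cap to keep training stable
        self.momentum = momentum      # smoothing factor of the moving average
        self.avg = None               # moving average of the lag difference

    def update(self, lag_diff):
        """Feed one teacher-student lag difference; return the new weight."""
        x = abs(lag_diff)
        if self.avg is None:
            self.avg = x
        else:
            self.avg = self.momentum * self.avg + (1.0 - self.momentum) * x
        if self.avg >= self.threshold:
            self.weight = min(self.weight * self.boost, self.max_weight)
        return self.weight
```

In a training loop, `update` would be called once per step with the measured delay difference, and the returned value would scale the response loss in the joint loss.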
[0019] In one exemplary embodiment of this disclosure, the method further includes: The activation phase of the perceptual loss, the response loss, and the memory loss is determined based on a preset activation strategy.
[0020] According to a second aspect of this disclosure, a model distillation apparatus is provided, comprising: The action sequence input module is used to input the sampled action sequence samples of the game objects into the teacher model and the student model respectively, to obtain the first video frame sample and the first memory information output by the teacher model, and the second video frame sample and the second memory information output by the student model. The perceptual loss determination module is used to obtain the perceptual loss based on the image features of the first video frame sample and the image features of the second video frame sample. The visual index determination module is used to extract a first time-domain index and a first frequency-domain index of the visual signal in the region of interest of the first video frame sample, and a second time-domain index and a second frequency-domain index of the visual signal in the region of interest of the second video frame sample. The response loss determination module is used to obtain time domain loss based on the first time domain index and the second time domain index, obtain frequency domain loss based on the first frequency domain index and the second frequency domain index, and obtain response loss based on the time domain loss and the frequency domain loss. The memory loss determination module is used to determine the global distribution loss of the teacher model and the student model, as well as the instance alignment loss at each time point, based on the first memory information and the second memory information, and to obtain the memory loss based on the global distribution loss and the instance alignment loss. The model parameter update module is used to obtain a joint loss based on at least one of the perceptual loss, the response loss, and the memory loss, and to update the model parameters in the student model based on the joint loss to obtain a game object interaction model.
[0021] According to a third aspect of this disclosure, an electronic device is provided, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the model distillation methods described above by executing the executable instructions.
[0022] According to a fourth aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements any of the model distillation methods described above.
[0023] The exemplary embodiments disclosed herein can have the following beneficial effects: In the model distillation method of the exemplary implementation of this disclosure, a multi-dimensional model distillation framework is constructed by combining perceptual loss, response loss, and memory loss. This framework not only ensures the high fidelity of the student model in visual output, but also effectively transmits the dynamic response characteristics and intrinsic memory logic mechanisms of the teacher model to the student model. This solves the problems of optimizing visual similarity, dynamic response transmission characteristics, and memory mechanisms. The student model trained in this way can provide an interactive experience and intelligent behavior comparable to complex teacher models while maintaining lightweight and real-time performance. This significantly improves the performance and user experience of the game object interaction model and is adapted to resource-constrained game terminal devices.
[0024] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0025] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure. It is obvious that the drawings described below are merely some embodiments of this disclosure, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.
[0026] Figure 1 shows a schematic flowchart of a model distillation method according to an exemplary embodiment of this disclosure;
Figure 2 shows a schematic flowchart of calculating the perceptual loss according to an exemplary embodiment of this disclosure;
Figure 3 shows a flowchart of extracting the first time-domain index and the first frequency-domain index according to an exemplary embodiment of this disclosure;
Figure 4 shows a schematic flowchart of calculating the global distribution loss according to an exemplary embodiment of this disclosure;
Figure 5 shows a schematic diagram of the overall framework for dual-channel distillation according to a specific embodiment of this disclosure;
Figure 6 shows a flowchart of a response loss calculation method according to a specific embodiment of this disclosure;
Figure 7 shows a flowchart of a memory loss calculation method according to a specific embodiment of this disclosure;
Figure 8 shows a schematic flowchart of a dual-channel distillation method according to a specific embodiment of this disclosure;
Figure 9 shows a schematic diagram of an enhanced core technology architecture according to a specific embodiment of this disclosure;
Figure 10 shows a block diagram of a model distillation apparatus according to an exemplary embodiment of this disclosure;
Figure 11 shows a schematic structural diagram of a computer system suitable for implementing the embodiments of this disclosure.
Detailed Implementation
[0027] Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this disclosure more comprehensive and complete, and to fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of embodiments of this disclosure. However, those skilled in the art will recognize that the technical solutions of this disclosure can be practiced with one or more of the specific details omitted, or other methods, components, apparatus, steps, etc., can be employed. In other instances, well-known technical solutions are not shown or described in detail to avoid obscuring various aspects of this disclosure.
[0028] Furthermore, the accompanying drawings are merely illustrative of this disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and therefore repeated descriptions of them will be omitted. Some block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0029] In modern game development, it is necessary to construct a grand, realistic, and AI-driven "world model" that can respond instantly to player actions. An ideal world model is a "teacher model" running on a cloud supercomputer, with beautiful graphics, realistic physics, and logical consistency. However, the complexity and computational load of such a model are enormous. Running it directly on the hardware of ordinary players, such as on a PC (Personal Computer) or mobile device, will inevitably result in severe lag, completely failing to meet the game's requirements for smooth interactive frame rates.
[0030] In some related solutions, the technical challenges of model distillation mainly include the following aspects: 1. The distillation objective is single and ignores intrinsic characteristics. Related solutions focus on making the student model's output similar to the teacher model's in terms of perception (e.g., LPIPS (Learned Perceptual Image Patch Similarity)), features, or relationships, caring only about "whether it looks similar".
[0031] Based on the above problems, this disclosure proposes a "dual-channel mental distillation" that extends the distillation target to two key intrinsic characteristics of the model: interactive feel (alignment of characteristics through action-visual response transmission) and memory organization ability (two-layer alignment of summary pool set statistics and instances).
[0032] 2. "Feel" is immeasurable and cannot be aligned.
[0033] Interactive feel refers to the system's transmission characteristics from user input to visual feedback, including dimensions such as response speed and accuracy. In other words, it refers to whether user input (such as quick turning or jumping) can obtain rapid, accurate, and muscle-memory-compliant visual feedback. It is a subjective and vague concept. Traditional distillation schemes lack methods for quantifying and optimizing the feel, resulting in student models that, while fast, are "half a beat slow" and "soft" in terms of feel.
[0034] Based on the above problems, this disclosure makes the feel measurable through system identification. It formalizes "interactive feel" as the transfer characteristics of the "action → visual response" system and proposes ROI (Region of Interest) response curves, which provide measurable and controllable amplitude, delay, amplitude spectrum, and phase spectrum as explicit distillation indicators, so that "feel" becomes a measurable and alignable objective engineering goal.
[0035] 3. Memory ability is difficult to transfer.
[0036] Memory organization ability refers to a model's ability to organize, extract, and utilize historical information over the long term. Traditional methods (such as feature / attention map alignment) cannot guarantee that the student model reproduces the way the teacher model organizes long-term memory, leading to short-sighted behavior that is superficially similar but lacks the essence.
[0037] Based on the above problems, this disclosure proposes an alignment of memory organization methods. By performing a two-layer alignment of the mid-range memory summary pool with set-level statistical distribution and time-aligned instances, the student model is forced to learn the structure and methods of the teacher model in organizing and refining long-term memory, rather than simply imitating single-frame output.
[0038] 4. Difficulty in project implementation
[0039] Abstract mathematical goals are often difficult to achieve in practice due to high-dimensional output and unstable training.
[0040] Based on the above problems, this disclosure provides a complete and reproducible engineering "recipe" that covers ROI selection, robust metric estimation, training stabilization techniques (such as Huber loss, curriculum learning, and adaptive weights), and black-box teacher scenarios. It is a specialized solution for generative interaction models.
[0041] This example implementation first provides a model distillation method. As shown in Figure 1, the model distillation method may include the following steps: Step S110. Input the action sequence samples into the teacher model and the student model respectively to obtain the first video frame sample and the first memory information output by the teacher model, and the second video frame sample and the second memory information output by the student model.
[0042] Step S120. Obtain the perceptual loss based on the image features of the first video frame sample and the image features of the second video frame sample.
[0043] Step S130. Extract the first time-domain index and the first frequency-domain index of the visual signal in the region of interest of the first video frame sample, and the second time-domain index and the second frequency-domain index of the visual signal in the region of interest of the second video frame sample.
[0044] Step S140. Obtain the time domain loss based on the first time domain index and the second time domain index, obtain the frequency domain loss based on the first frequency domain index and the second frequency domain index, and obtain the response loss based on the time domain loss and the frequency domain loss.
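Step S140 can be illustrated with a minimal sketch that compares teacher and student indices. The Huber penalty is used because this disclosure names it as a training stabilization technique; applying it to every index, the dictionary keys, and the equal default weights are assumptions made for illustration.

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber penalty: quadratic for |x| <= delta, linear beyond it."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x <= delta, 0.5 * x ** 2, delta * (x - 0.5 * delta))

def response_loss(teacher, student, w_time=1.0, w_freq=1.0):
    """Return (response loss, time-domain loss, frequency-domain loss).

    teacher, student: dicts holding scalar "amplitude" and "delay" plus
    array-valued "amp_spec", "phase_spec", and "group_delay" (assumed keys).
    """
    # Time-domain loss: gaps in amplitude and delay.
    time_loss = float(
        huber(teacher["amplitude"] - student["amplitude"])
        + huber(teacher["delay"] - student["delay"])
    )

    # Frequency-domain loss: mean gap per spectrum-like index.
    freq_loss = 0.0
    for key in ("amp_spec", "phase_spec", "group_delay"):
        gap = np.asarray(teacher[key], dtype=float) - np.asarray(student[key], dtype=float)
        freq_loss += float(np.mean(huber(gap)))

    return w_time * time_loss + w_freq * freq_loss, time_loss, freq_loss
```

In an actual training setup the indices would have to be computed with differentiable operations in a deep learning framework so that gradients reach the student model; NumPy is used here only to show the arithmetic.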
[0045] Step S150. Based on the first memory information and the second memory information, determine the global distribution loss of the teacher model and the student model, as well as the instance alignment loss at each time point, and obtain the memory loss based on the global distribution loss and the instance alignment loss.
[0046] Step S160. Obtain the joint loss based on at least one of the perceptual loss, response loss, and memory loss, and update the model parameters in the student model based on the joint loss to obtain the interaction model.
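One way to read step S160 is as a weighted sum over whichever loss terms are currently active. The sketch below assumes a step-based activation schedule, which is only one possible preset activation strategy; the function name `joint_loss` and the dictionary layout are hypothetical.

```python
def joint_loss(losses, weights, step, start_steps):
    """Weighted sum of the loss terms that are active at this training step.

    losses:      dict mapping a loss name to its current value
    weights:     dict mapping a loss name to its weight (default 1.0)
    start_steps: dict mapping a loss name to the step at which it becomes
                 active (default 0); an assumed, step-based schedule
    """
    total = 0.0
    for name, value in losses.items():
        if step >= start_steps.get(name, 0):
            total += weights.get(name, 1.0) * value
    return total
```

For example, training might start with the perceptual loss alone and bring in the response and memory losses at later steps, in the spirit of curriculum learning.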
[0047] In the model distillation method of the exemplary implementation of this disclosure, a multi-dimensional model distillation framework is constructed by combining perceptual loss, response loss, and memory loss. This framework not only ensures the high fidelity of the student model in visual output, but also effectively transmits the dynamic response characteristics and intrinsic memory logic mechanisms of the teacher model to the student model. This solves the problems of optimizing visual similarity, dynamic response transmission characteristics, and memory mechanisms. The student model trained in this way can provide an interactive experience and intelligent behavior comparable to complex teacher models while maintaining lightweight and real-time performance. This significantly improves the performance and user experience of the game object interaction model and is adapted to resource-constrained game terminal devices.
[0048] Below, the steps described above in this example implementation are explained in more detail in conjunction with Figures 2 to 9.
[0049] In step S110, the action sequence samples are input into the teacher model and the student model respectively to obtain the first video frame sample and the first memory information output by the teacher model, and the second video frame sample and the second memory information output by the student model.
[0050] In this example implementation, action sequence samples refer to a data set recording a series of discrete or continuous actions performed by a game object over a period of time. These samples are used to train and evaluate the model's behavior generation capabilities. A game object refers to a virtual object existing in a virtual game environment that can interact with the player or other game elements, such as a character, item, or environmental element. Action sequences can be sampled from an action template library and can be dynamically resampled based on the discriminability of each template for response metrics. For example, different test action template libraries can be scored based on their discriminability for metrics such as amplitude and latency, and dynamically resampled during training to improve the discriminability of transfer characteristics and convergence speed.
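The dynamic resampling of action templates can be sketched as score-proportional sampling. The softmax mapping from discriminability scores to sampling probabilities and the temperature parameter are assumptions made for illustration; the disclosure does not specify how scores are turned into sampling frequencies.

```python
import numpy as np

def template_probabilities(scores, temperature=1.0):
    """Map discriminability scores to sampling probabilities via a softmax."""
    s = np.asarray(scores, dtype=float) / temperature
    s = s - s.max()  # subtract the maximum for numerical stability
    p = np.exp(s)
    return p / p.sum()

def resample_templates(scores, n, rng, temperature=1.0):
    """Draw n template indices, favoring templates with higher scores."""
    p = template_probabilities(scores, temperature)
    return rng.choice(len(p), size=n, p=p)
```

A usage sketch: `resample_templates(scores, 32, np.random.default_rng(0))` returns 32 template indices for the next training batch, and the scores can be refreshed periodically as the student improves.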
[0051] A teacher model is a pre-trained, high-performance model, typically complex in structure and with a large number of parameters; its output is considered a high-quality reference standard. A student model is a relatively simple model with fewer parameters and higher computational efficiency; its goal is to learn the knowledge and behavior of the teacher model through a distillation process. The first and second video frame samples refer to the video frame data generated and output by the teacher model and student model respectively after receiving the same action sequence sample input. The first and second memory information refer to the contextual or memory state information about the sequence generated or stored internally by the teacher model and student model during the processing of the action sequence sample, respectively.
[0052] In this example implementation, control data for a game object can be acquired, and based on the game object's basic action data, control data, and noise data, a sample of the game object's action sequence can be obtained.
[0053] Acquiring control data for a game object refers to collecting the instructions or input information generated when a user or a preset strategy controls the game object. This data directly reflects the behavioral intent of the game object in a specific situation. It can be obtained by recording the sequence of player actions on input devices such as keyboards, mice, and gamepads during gameplay. Basic action data of a game object refers to the predefined or designed set of actions that constitute the object's basic behavioral units. Basic action data provides the skeleton and basic capabilities of the game object's behavior. It can be obtained by extracting preset animation sequences of the game object from the game engine's animation library or from motion capture data, or by defining, through a rule engine, the basic actions that the game object can execute in specific states. Noise data refers to random or non-deterministic perturbations introduced while generating action sequence samples. Introducing noise data increases the diversity and robustness of the action sequence samples, helping the student model learn more general and adaptive behavioral strategies. It can be implemented by superimposing random signals such as Gaussian noise or Poisson noise onto the control data or basic action data to simulate operational errors or environmental uncertainty.
[0054] Based on the aforementioned control data, basic action data, and noise data, a series of continuous game object action commands or state transition sequences is generated according to a given strategy, yielding the action sequence samples of the game object. By combining and sampling these three types of data, action sequences can be generated that both conform to game logic and remain exploratory. These sequences serve as the common input for both the teacher model and the student model, so that the input samples on which the distillation process depends cover the potential behaviors of the game object and simulate real-world uncertainty. This allows the student model to better learn complex behavioral strategies and decision-making logic from the teacher model.
[0055] For example, the action at time t is a d-dimensional vector representing the control variables of the camera or character, such as translation, turning, jumping, and skill triggering. To ensure that model training covers a variety of complex scenarios, a hybrid driving strategy can be adopted. Its action template library works like a comprehensive "driving license test system": it includes recordings of human player operations (corresponding to "real road conditions"), procedural benchmark actions (such as steps, pulses, and sine sweeps, corresponding to the standard test items of "driving test part 2"), and noise perturbations (corresponding to "unexpected situations"), thereby comprehensively evaluating and training the model's response capabilities.
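The procedural benchmark actions and noise perturbations mentioned above can be sketched as simple signal generators. The function names, the linear form of the sine sweep, the additive way of combining the three data sources, and the Gaussian noise model are assumptions made for illustration.

```python
import numpy as np

def step_action(n_steps, t0, height=1.0):
    """Step input: zero before index t0, constant height from t0 onward."""
    a = np.zeros(n_steps)
    a[t0:] = height
    return a

def pulse_action(n_steps, t0, width, height=1.0):
    """Rectangular pulse of the given width starting at index t0."""
    a = np.zeros(n_steps)
    a[t0:t0 + width] = height
    return a

def sine_sweep(n_steps, dt, f0, f1):
    """Linear sine sweep whose frequency rises from f0 to f1 (in Hz)."""
    t = np.arange(n_steps) * dt
    rate = (f1 - f0) / (n_steps * dt)
    return np.sin(2.0 * np.pi * (f0 * t + 0.5 * rate * t ** 2))

def sample_action_sequence(base, control, noise_std, rng):
    """Combine basic action data, control data, and Gaussian noise additively."""
    base = np.asarray(base, dtype=float)
    control = np.asarray(control, dtype=float)
    return base + control + rng.normal(0.0, noise_std, size=base.shape)
```

For a d-dimensional action, each control variable can be given its own generator and the resulting columns stacked into an array of shape (T, d) before noise is added.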
[0056] In step S120, the perceptual loss is obtained based on the image features of the first video frame sample and the image features of the second video frame sample.
[0057] In this example implementation, perceptual loss is a metric used to measure the visual differences between the video frames output by the teacher and student models. This loss aims to ensure that the video frames generated by the student model maintain consistency with the teacher model in image quality and detail. Image features refer to features extracted from video frames that describe the image content and visual attributes.
[0058] In this example implementation, as shown in Figure 2, the perceptual loss is obtained based on the image features of the first video frame sample and the image features of the second video frame sample, which may specifically include the following steps: Step S210. Input the first video frame sample and the second video frame sample into the perception model, and extract the image features of the first video frame sample and the second video frame sample in multiple intermediate layers of the neural network of the perception model.
[0059] A perceptual model is a deep learning model that simulates human visual perception and extracts high-level semantic features from images. Its role is to transform raw video frame samples into feature representations with greater semantic information and a higher level of abstraction. By inputting first and second video frame samples into the perceptual model, the pre-trained knowledge of the model can be used to extract rich image features from different levels of abstraction (i.e., multiple intermediate layers). These features reflect the semantic content and structural information of the image more accurately than the original pixel information, helping to more accurately measure the perceptual differences between two video frame samples. The perceptual model can be a pre-trained convolutional neural network, such as VGG (Visual Geometry Group), capable of extracting image features with good generalization ability.
[0060] Step S220. Obtain the feature distance between the first video frame sample and the second video frame sample based on the feature differences of the image features of multiple intermediate layers, and obtain the perceptual loss based on the feature distance.
[0061] Feature difference refers to the degree of difference between image feature vectors extracted from different intermediate layers of a perceptual model. By calculating the distance between these feature vectors, the similarity or difference between the first and second video frame samples at different perceptual levels can be quantified. The larger the feature distance, the greater the perceptual difference between the two video frame samples, and vice versa. The perceptual loss is calculated based on this feature distance, aiming to make the video frame samples generated by the student model perceptually closer to those generated by the teacher model. Feature distance can be calculated using metrics such as L1 distance, L2 distance, or cosine similarity. For example, for each intermediate layer, the L1 or L2 norm between the feature maps extracted by the two video frame samples at that layer can be calculated, and then the distances from different intermediate layers can be weighted and summed to obtain the final feature distance.
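As a minimal sketch of the weighted per-layer distance described above, the following assumes that the intermediate-layer features have already been extracted and flattened into plain lists. The function names and the layer weights are hypothetical, and no real perception network is invoked here.

```python
def l1_distance(feat_a, feat_b):
    """Mean absolute difference between two equally sized flat feature lists."""
    assert len(feat_a) == len(feat_b)
    return sum(abs(a - b) for a, b in zip(feat_a, feat_b)) / len(feat_a)

def perceptual_loss(teacher_feats, student_feats, layer_weights):
    """Weighted sum of per-layer L1 feature distances."""
    return sum(
        w * l1_distance(ft, fs)
        for w, ft, fs in zip(layer_weights, teacher_feats, student_feats)
    )

# Two hypothetical intermediate layers, flattened.
teacher = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0]]
student = [[0.0, 1.0, 2.0, 5.0], [1.0, 0.0]]
loss = perceptual_loss(teacher, student, layer_weights=[1.0, 0.5])
```

In this toy case each layer has a mean absolute difference of 0.5, so the weighted sum is 1.0 × 0.5 + 0.5 × 0.5 = 0.75.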
[0062] In this example implementation, the first video frame sample output by the teacher model and the second video frame sample output by the student model are not simply compared in terms of image features, but are simultaneously input into a pre-trained perceptual model. This perceptual model can simulate the human visual system's understanding of images, extracting image features at different levels of abstraction through multiple intermediate layers of its neural network. These intermediate layer features capture rich information ranging from low-level texture to high-level semantic concepts. Subsequently, by calculating the differences between the image features extracted by these multiple intermediate layers, the feature distance between the first and second video frame samples at different perceptual levels can be obtained. This multi-level feature distance can more comprehensively and meticulously reflect the visual perceptual differences between the two video frame samples. Finally, the perceptual loss is calculated based on this refined feature distance, enabling the loss function to more accurately guide the student model to learn the visual output of the teacher model, thereby overcoming the limitations of directly comparing the original image features and improving the effectiveness of the distillation process.
[0063] In this example implementation, the image features of the first video frame sample and the second video frame sample in multiple intermediate layers can be normalized by unit variance to obtain normalized features; the feature differences of the normalized features of multiple intermediate layers are calculated based on a preset sliding window, and the feature distance between the first video frame sample and the second video frame sample is obtained based on the feature differences.
[0064] Unit variance normalization is performed on the image features of the first and second video frame samples across multiple intermediate layers. This aims to eliminate scale differences between different feature dimensions, making all features statistically similar in distribution characteristics. This helps prevent certain features from dominating the distance calculation due to their large numerical range, thereby ensuring that all features contribute equally to the final feature difference calculation.
[0065] The feature differences of normalized features from multiple intermediate layers are calculated based on a preset sliding window, which is used to compare the differences of normalized features within local regions. This local comparison strategy can capture subtle spatial changes in image content while exhibiting robustness to global, irrelevant background changes. The sliding window allows for a more refined evaluation of the visual similarity or difference between two video frame samples in different local regions.
[0066] By employing the aforementioned technical solution, unit variance normalization of the intermediate layer image features effectively eliminates the influence of different feature scales, ensuring the fairness of feature comparison. Simultaneously, calculating feature differences based on a preset sliding window makes the feature distance calculation more attentive to local details, more sensitive to changes in local image structure and texture, and enhances robustness to global background changes or slight alignment deviations. This results in more accurate and stable feature distances, thereby improving the effectiveness of perceptual loss and enabling more precise guidance for student models to learn the perceptual capabilities of teacher models, ultimately enhancing the performance of the game object interaction model.
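The normalization and sliding-window comparison can be sketched as follows for a single feature channel laid out along one spatial axis. The one-dimensional layout, the window size, the stride, and the choice to normalize each channel over its spatial positions are assumptions made for illustration only.

```python
import math

def unit_variance(channel):
    """Scale a 1-D list of activations to unit variance (mean left unchanged)."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
    return [v / std for v in channel]

def windowed_distance(feat_t, feat_s, window=3, stride=1):
    """Average of per-window mean absolute differences of normalized features."""
    assert len(feat_t) == len(feat_s) >= window
    a, b = unit_variance(feat_t), unit_variance(feat_s)
    scores = []
    for start in range(0, len(a) - window + 1, stride):
        wa, wb = a[start:start + window], b[start:start + window]
        scores.append(sum(abs(x - y) for x, y in zip(wa, wb)) / window)
    return sum(scores) / len(scores)

# A pure rescaling of the features yields (almost) zero distance after normalization.
d_scaled = windowed_distance([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0])
d_changed = windowed_distance([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 9.0, 4.0, 5.0])
```

The first comparison illustrates why normalization keeps large-valued features from dominating, and the second shows that a local change still produces a non-zero distance.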
[0067] In step S130, the first time-domain index and the first frequency-domain index of the visual signal in the region of interest of the first video frame sample are extracted, as well as the second time-domain index and the second frequency-domain index of the visual signal in the region of interest of the second video frame sample.
[0068] In this example implementation, the region of interest (ROI) refers to a localized area in a video frame that is of particular interest and closely related to the actions of game objects or specific visual responses. The ROI acts like a precisely positioned virtual sensor, which can be fixed to a location on the screen, follow specific targets in the game (such as vehicle taillights or weapon sights), or even be dynamically selected by a separate network as the most noteworthy area.
[0069] Visual signals refer to image information extracted from a region of interest that changes over time. Specifically, a visual signal is a time series extracted from the region of interest, computed as $s(t) = \phi(\mathrm{ROI}(x_t))$. Here, the function $\phi$ converts the pixel information within an ROI into a quantified metric, such as optical flow (the speed and direction of object motion), keypoint displacement, or brightness variation. Which $\phi$ to choose depends on the game scenario. For example, optical flow is more suitable for racing games, while the keypoint displacement of the crosshair is more direct in shooting games. $x_t$, the t-th complete video frame image (a tensor of size H×W×C), is the raw material for the function. By combining ROIs and visual signals, subjective changes in images are converted into objective time-series data that can be analyzed by computers.
[0070] In this example implementation, time-domain metrics refer to characteristic parameters that describe the changes of visual signals in the time dimension, while frequency-domain metrics refer to characteristic parameters that describe the distribution of visual signals in the frequency dimension. Time-domain metrics may include amplitude and time delay, while frequency-domain metrics may include amplitude spectrum, phase spectrum, and group delay, etc.
[0071] In this example implementation, as shown in Figure 3, extracting the first time-domain index and the first frequency-domain index of the visual signal from the region of interest of the first video frame sample can specifically include the following steps: Step S310. Determine the region of interest from the first video frame sample based on a preset region selection strategy.
[0072] For the first input video frame sample, a preset region selection strategy can be used to intelligently identify and determine the region of interest most relevant to the game object's behavior. This process ensures that subsequent analysis can focus on core visual information, avoiding computational redundancy and information noise caused by blindly processing the entire complex video frame.
[0073] For example, ROIs can be determined by at least one of fixed anchors, task semantic anchors, or a learnable saliency selector, and the number of ROIs selected per frame can be 1–3, with the area of a single ROI being 2%–15%.
[0074] Step S320. Extract the target visual signal from multiple visual signals in the region of interest based on the scene type, and construct the corresponding response curve based on the target visual signal.
[0075] The response curve refers to the change in a visual signal over time after an action. For example, given a visual signal $s(t)$, the response curve of an action occurring at time $t_0$ is defined as the change in the visual signal relative to its pre-action value: $r(\tau) = s(t_0 + \tau) - s(t_0^-)$, with $\tau \ge 0$. Here, $t_0$ is the frame at which "the player presses the button / the AI triggers the action", and all subsequent analysis takes it as the starting point. $s(t_0^-)$ is the mathematical notation for the left-hand limit, which, simply put, is the visual signal value at the moment just before the action occurs, equivalent to a "reading before movement." Using it as a baseline zeroes out subsequent signal changes (removing the baseline), so that only the increment caused by the action remains. Regardless of background noise in the game scene, subtracting the baseline leaves a response curve that reflects purely the visual change caused by the action, that is, how much the visual signal in the ROI increases relative to the baseline after the action occurs. Its shape resembles a step response in a control system: it rises rapidly, reaches a peak, and then may oscillate or gradually stabilize. This curve records how the visual signal in the ROI changes after the action occurs, encoding the temporal details of the "feel."
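The baseline subtraction can be sketched in a few lines. In this illustration the discrete counterpart of the left-hand limit is assumed to be the signal value one frame before the action; the function name and the `horizon` parameter are hypothetical.

```python
def response_curve(signal, t0, horizon):
    """Return signal[t0 + k] - signal[t0 - 1] for k = 0 .. horizon - 1."""
    if t0 < 1 or t0 + horizon > len(signal):
        raise ValueError("need one pre-action frame and `horizon` post-action frames")
    baseline = signal[t0 - 1]  # the "reading before movement"
    return [signal[t0 + k] - baseline for k in range(horizon)]

# Hypothetical visual signal; the action is triggered at frame 3.
signal = [2.0, 2.0, 2.0, 2.5, 3.5, 4.0, 4.0]
curve = response_curve(signal, t0=3, horizon=4)
```

For this input the curve is `[0.5, 1.5, 2.0, 2.0]`: the constant background level of 2.0 is removed and only the action-induced increment remains.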
[0076] Within a defined region of interest, target visual signals that have a critical impact on the game object's response can be precisely extracted from multiple possible visual signals, based on the current scene type. Once the target visual signal is determined, its intensity or features over time can be recorded and constructed as a corresponding response curve. The visual signal can be constructed based on at least one of optical flow, keypoint displacement, pixel brightness, depth map, or segmentation mask, and mean-mode filtering and Huber loss are applied to enhance robustness.
[0077] In this example implementation, if there are multiple regions of interest, the target visual signals extracted from each region of interest are weighted and fused.
[0078] Weighted fusion refers to combining multiple target visual signals from different regions of interest according to a preset weighting strategy to form a comprehensive visual signal. This fusion method allows for the differentiation of the importance of different regions. For example, weights can be assigned based on factors such as the size of the region of interest, its relevance to the game object, and its salience in the image. Alternatively, methods such as linear weighted summation, non-linear weighted averaging, or dynamic weighting based on attention mechanisms can be used for fusion.
[0079] Specifically, when the system identifies multiple regions of interest (ROIs) within a video frame sample, it extracts the corresponding target visual signals from each region. Subsequently, based on a pre-defined weighting strategy, different weights are assigned to these target visual signals to reflect their relative importance in the current scene. For example, regions with a greater impact on the behavior of game objects can be assigned higher weights. These weighted target visual signals are combined into a comprehensive visual signal. In this way, the temporal and frequency domain metrics subsequently extracted from this comprehensive visual signal can more comprehensively and accurately reflect the contributions of different regions in complex visual input. This allows the response loss calculated based on these metrics to more accurately measure the behavioral differences between the student model and the teacher model under complex visual stimuli. This weighted fusion mechanism ensures the dominant role of key visual information in loss calculation, effectively improving the accuracy and efficiency of model distillation.
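A linear weighted sum is the simplest of the fusion options mentioned above. The sketch below assumes that the per-ROI signals are equal-length time series and that the weights are normalized to sum to one; the function name and the example weights are hypothetical.

```python
def fuse_signals(signals, weights):
    """Linear weighted sum of per-ROI time series, weights normalized to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    length = len(signals[0])
    assert all(len(s) == length for s in signals)
    return [sum(w * s[t] for w, s in zip(norm, signals)) for t in range(length)]

# Two hypothetical ROIs; the first is considered three times as important.
fused = fuse_signals([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]], weights=[3.0, 1.0])
```

With weights 0.75 and 0.25 after normalization, the fused signal is `[1.5, 2.0, 2.5]`, which stays closer to the first ROI.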
[0080] Step S330. Extract the first time-domain index and the first frequency-domain index of the target visual signal based on the response curve corresponding to the target visual signal.
[0081] Based on the response curve, several core differentiable metrics can be extracted to quantify the "interactive feel." Among the time-domain metrics, amplitude reflects the intensity of the visual signal, and latency quantifies the delay of the visual signal relative to a specific event and can be used to evaluate the reaction speed of game objects. Specifically, the response amplitude A is the approximate extreme value or steady-state mean of the response curve within a preset steady-state window, obtained using softmax pooling, and the latency δ is the time interval from the moment the action is triggered until the response curve first reaches the threshold, that is, the threshold coefficient multiplied by A. The threshold coefficient can take values in [0.3, 0.7], and the steady-state window length can be [8, 64] frames.
[0082] Specifically, the amplitude A is defined as the peak value of the response curve within a preset steady-state window after the action occurs, and can be calculated using a formula such as: $A = \operatorname{softmax\_pool}_{0 \le \tau < T_s} r(\tau) = \sum_{\tau=0}^{T_s-1} \frac{\exp(r(\tau)/\beta)}{\sum_{\tau'=0}^{T_s-1} \exp(r(\tau')/\beta)}\, r(\tau)$
[0083] Here, $r(\tau)$ denotes the response curve at frame offset $\tau$ after the action, and $T_s$ is the steady-state window length, typically [8, 64] frames, indicating how many frames of response are observed after an action is triggered to estimate "how much change this action will ultimately produce." `softmax_pool` is a "differentiable soft maximum" operation. Directly taking the maximum gives zero gradient at all non-maximum points, so the gradient flow is interrupted during training, whereas softmax-weighted averaging provides gradients at all points, resulting in more stable training. As training progresses, the temperature parameter $\beta$ is annealed from large to small, gradually bringing the result closer to the true maximum. "Annealing" refers to gradually decreasing (or increasing) a certain parameter as training progresses, converging from a "coarse" state to a "fine" state. In actual gameplay, if the amplitude A of the student model is smaller than that of the teacher model, players will feel that the movements are "soft" and "lacking in power."
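The effect of the temperature on the soft maximum can be checked with a small sketch. The function name mirrors the `softmax_pool` operation mentioned above, but the signature and the example temperatures are assumptions; a training setup would use a differentiable tensor library rather than plain lists.

```python
import math

def softmax_pool(values, temperature):
    """Softmax-weighted average; approaches max(values) as temperature -> 0."""
    peak = max(values)  # subtract the peak for numerical stability
    weights = [math.exp((v - peak) / temperature) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Hypothetical response curve inside the steady-state window.
curve = [0.5, 1.5, 2.0, 2.0, 1.8]
soft = softmax_pool(curve, temperature=1.0)    # early training: smooth estimate
sharp = softmax_pool(curve, temperature=0.01)  # late training: close to the true peak
```

At a very large temperature the result tends toward the plain mean of the window, and at a small temperature it tends toward the maximum, which matches the annealing behavior described in the text.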
[0084] The time delay δ directly quantifies the "sense of lag" or "responsiveness" of the operation, and can be calculated, for example, as: $\delta = \min\{\tau \ge 0 : r(\tau) \ge \theta A\}$
[0085] Here, $r(\tau)$ is the response curve, A is the amplitude, and $\theta$ is the threshold coefficient, typically 0.5 (i.e., "half-peak"), with a value range of [0.3, 0.7]. The formula measures how much time passes before the response curve first reaches 50% (or 30% to 70%) of the peak value. This estimate is more robust than locating the peak itself: in real-world scenarios the response curve may not be monotonic, and directly finding the peak moment is easily affected by noise. $\tau$ denotes the time offset (in frames) from the moment the action is triggered, and taking the minimum finds the shortest time that satisfies the condition. The smaller δ is, the faster the response, and the smoother and more responsive the operation feels.
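The threshold-crossing rule can be sketched as a simple search over frame offsets. This is the hard, non-differentiable form; a training loss would presumably need a smooth relaxation, which is not shown here. The function name and the convention of returning `None` when the threshold is never reached are assumptions.

```python
def latency(curve, amplitude, theta=0.5):
    """First frame offset k with curve[k] >= theta * amplitude, or None if never reached."""
    threshold = theta * amplitude
    for k, value in enumerate(curve):
        if value >= threshold:
            return k
    return None

# Hypothetical response curve with amplitude 2.0.
curve = [0.0, 0.5, 1.5, 2.0, 2.0]
delay = latency(curve, amplitude=2.0)  # half-peak threshold is 1.0
```

Here the half-peak threshold of 1.0 is first reached at offset 2, so the latency is two frames; lowering the threshold coefficient to 0.25 would give one frame.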
[0086] Frequency-domain metrics, including amplitude spectrum, phase spectrum, and group delay, reveal the periodicity, energy distribution, and propagation characteristics of the different frequency components of visual signals. They are obtained by performing a discrete Fourier transform on the response curve and the action sequence, measuring the amplitude and/or phase spectra at log-uniformly sampled frequency points, and applying phase-unwrapping correction to the phase. These multi-dimensional metrics allow for a deep and detailed characterization of the dynamic response of visual signals.
[0087] Besides intensity and speed, the "quality" of the response, such as whether it is "crisp" or "sluggish," is characterized by frequency-domain characteristics. For example, a Discrete Fourier Transform (DFT) can be applied to the response curve $r$ and the driving action $a$ to obtain their spectra $R(\omega)$ and $U(\omega)$. The ratio of the two, $H(\omega) = R(\omega) / U(\omega)$, is the system's transfer characteristic in the frequency domain. It describes "how strongly the model responds to operating signals of different frequencies, and with how much phase delay." By aligning the amplitude spectrum (which determines the intensity of different frequency components) and the phase spectrum (which determines the delay of different frequency components) of the two models, the student model is made to reproduce the dynamic details of the teacher model in high-frequency micro-operations (such as rapid crosshair adjustments and crisp impacts), rather than only matching macroscopic amplitude and delay.
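The ratio of the response spectrum to the action spectrum can be illustrated with a naive DFT. The sketch below is an assumption-laden illustration: it uses uniformly spaced frequency bins rather than log-uniform sampling, returns wrapped phase without unwrapping, and its function names are hypothetical. Bins where the action carries no energy are not meaningful, because the division is then dominated by the stabilizing constant.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def transfer_characteristic(response, action, eps=1e-8):
    """Per-bin magnitude and wrapped phase of H = R / U for bins 0 .. n // 2."""
    R, U = dft(response), dft(action)
    mags, phases = [], []
    for k in range(len(response) // 2 + 1):
        h = R[k] / (U[k] + eps)  # eps guards against division by zero
        mags.append(abs(h))
        phases.append(cmath.phase(h))
    return mags, phases

n = 16
action = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
# Hypothetical response: the same sinusoid at half amplitude, delayed by one frame.
response = [0.5 * math.sin(2 * math.pi * 2 * (t - 1) / n) for t in range(n)]
mags, phases = transfer_characteristic(response, action)
```

At bin 2, where the action has its energy, the magnitude is 0.5 and the phase is -π/4. Dividing the negated phase by the angular frequency 2π·2/16 recovers the one-frame delay.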
[0088] The extraction methods for the second time-domain index and the second frequency-domain index of the visual signal in the region of interest of the second video frame sample are similar to those for the extraction methods of the first time-domain index and the first frequency-domain index, and will not be elaborated here.
[0089] The above technical solutions enable precise focusing and in-depth quantification of key visual information in video frames. By determining regions of interest based on a preset region selection strategy, interference from irrelevant background information is effectively eliminated, making subsequent visual signal analysis more targeted. Extracting target visual signals based on scene type and constructing response curves ensures a high correlation between the analyzed visual information and the behavioral responses of game objects, avoiding ineffective analysis of non-critical visual elements. Furthermore, by extracting multi-dimensional time-domain and frequency-domain indicators, including amplitude, time delay, amplitude spectrum, phase spectrum, and group delay, from the response curves, the dynamic characteristics of the visual signals can be comprehensively and meticulously characterized. This significantly improves the accuracy and effectiveness of student model behavioral response learning during model distillation, ultimately enabling student models to better reproduce the refined behavioral decisions of teacher models in complex game environments.
[0090] In step S140, the time domain loss is obtained based on the first time domain index and the second time domain index, the frequency domain loss is obtained based on the first frequency domain index and the second frequency domain index, and the response loss is obtained based on the time domain loss and the frequency domain loss.
[0091] In this example implementation, temporal loss and frequency domain loss refer to metrics used to measure the differences in temporal and frequency domain indicators between the visual signals output by the teacher and student models, respectively. Response loss refers to the loss obtained by combining temporal and frequency domain losses, used to measure the consistency between the visual response characteristics of the student model and the teacher model to action sequences. Frequency domain alignment can also minimize the group delay difference between the student and teacher models.
[0092] The response loss, also known as the action-visual response transfer characteristic alignment loss, aims to ensure that the "feel" of the student model is highly consistent with that of the teacher model. This loss transforms subjective feelings such as "responsiveness," "force," and "smoothness" into a series of calculable and optimizable objective engineering indicators (such as response amplitude A, latency δ, and frequency-domain characteristics), and uses these as the targets for alignment.
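One possible way to combine the indicators into a single response loss is sketched below. The absolute-difference terms, the log-magnitude comparison, the wrapped phase difference, and the weights `w_time` and `w_freq` are all assumptions chosen for illustration; the disclosure only states that the response loss is obtained from the time-domain loss and the frequency-domain loss.

```python
import math

def wrap_phase(angle):
    """Map an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def response_loss(teacher, student, w_time=1.0, w_freq=1.0, eps=1e-8):
    """Weighted sum of a time-domain loss and a frequency-domain loss."""
    time_loss = abs(teacher["A"] - student["A"]) + abs(teacher["delta"] - student["delta"])
    n = len(teacher["mag"])
    mag_loss = sum(
        abs(math.log(mt + eps) - math.log(ms + eps))
        for mt, ms in zip(teacher["mag"], student["mag"])
    ) / n
    phase_loss = sum(
        abs(wrap_phase(pt - ps)) for pt, ps in zip(teacher["phase"], student["phase"])
    ) / n
    return w_time * time_loss + w_freq * (mag_loss + phase_loss)

# Hypothetical indicators: the student is weaker (smaller A) and slower (larger delta).
teacher = {"A": 2.0, "delta": 2.0, "mag": [1.0, 0.5], "phase": [0.0, -0.8]}
student = {"A": 1.5, "delta": 3.0, "mag": [1.0, 0.5], "phase": [0.0, -0.8]}
loss = response_loss(teacher, student)
```

Because the spectra match in this example, the whole loss of 1.5 comes from the amplitude gap (0.5) and the latency gap (1.0).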
[0093] In step S150, the global distribution loss of the teacher model and the student model, as well as the instance alignment loss at each time point, are determined based on the first memory information and the second memory information, and the memory loss is obtained based on the global distribution loss and the instance alignment loss.
[0094] In this example implementation, memory loss refers to the loss obtained by combining global distribution loss and instance alignment loss, used to measure the consistency between the student model and the teacher model in terms of memory organization and utilization. Memory loss, formally known as memory summary distribution alignment loss, aims to teach the student model "how to think and remember like the teacher." It minimizes the differences in memory organization between the teacher and student models by comparing their respective "memory summary pools" generated over a period of time, ensuring that the student model not only imitates behavioral appearances but also learns the inherent logical inductive ability.
[0095] Global distribution loss is a metric used to measure the difference in the overall distribution or organization of memory information between the teacher model and the student model over a period of time. Instance alignment loss is a metric used to measure the similarity or difference between memory information instances in the teacher model and the student model at a specific point in time; this loss aims to ensure that the student model's memory remains consistent with the teacher model's in a specific context. In this example implementation, as shown in Figure 4, the global distribution loss of the teacher model and the student model is determined based on the first memory information and the second memory information. This can specifically include the following steps: Step S410. Construct a mid-range memory summary pool for the teacher model based on the first memory information, and construct a mid-range memory summary pool for the student model based on the second memory information.
[0096] A mid-range memory summary refers to the set of summary tokens into which the model compresses and extracts the historical information within a time window. Together, these tokens form a summary pool, which represents the "memory" of the model over that period of time.
[0097] The mid-range memory summary pools for the teacher and student models are collections constructed based on the memory information of the teacher and student models, respectively, to summarize the behavior or memory state of the models over a period of time. The purpose is to integrate discrete short-term memories into a memory set that can reflect the mid-range behavioral patterns of the models.
[0098] In this example implementation, the short-term memory vectors of the teacher model at various time points within a preset time window can be aggregated by an aggregator to obtain a set of memory information for the preset time window, and a mid-range memory summary pool of the teacher model can be obtained based on the set of memory information for the preset time window.
[0099] The first memory information includes short-term memory vectors of the teacher model at various time points. These short-term memory vectors are transient memory representations generated by the teacher model at each discrete time point when processing action sequence samples of game objects, capturing the teacher model's understanding of the game scene or object state at a specific moment. An aggregator is a module used to integrate multiple inputs into a higher-level or more compact representation, and can be used to process a series of short-term memory vectors. A preset time window refers to a continuous time range considered when aggregating short-term memory vectors. This window setting can be based on empirical values or determined through hyperparameter tuning to balance the real-time and long-term nature of memory. The memory information set refers to a set of memory representations obtained by the aggregator aggregating the teacher model's short-term memory vectors at various time points within the preset time window.
[0100] Specifically, when processing action sequence samples of game objects, the teacher model generates discrete short-term memory vectors at various time points, which constitute the first memory information. To extract more generalized mid-range memories from these transient memories, the short-term memory vectors of the teacher model at various time points can be integrated within a preset time window. This aggregation operation can effectively filter out transient noise and capture the inherent patterns and temporal dependencies of the teacher model's behavior within the time window, thus obtaining a set of memory information representing that time window. Subsequently, based on these memory information sets, a mid-range memory summary pool for the teacher model can be constructed. The mid-range memory summary pool is generated within the time window by an aggregator g. The aggregator can include at least one of DeepSets aggregation, Sinkhorn feature aggregation, or Transformer-pooling, and possess at least one update strategy among Top-k retention, exponential decay, or gated update.
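The construction of a summary pool can be sketched with a deliberately simple aggregator. The exponentially decayed weighted mean below is a stand-in chosen for illustration, not DeepSets, Sinkhorn, or Transformer-pooling; splitting the window into overlapping sub-windows that each yield one summary token is likewise an assumption, as are the function names and parameters.

```python
def aggregate_window(vectors, decay=0.9):
    """Exponentially decayed weighted mean; the most recent vector has weight 1."""
    n = len(vectors)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total for d in range(dim)]

def build_summary_pool(short_term, window=4, stride=2):
    """One summary token per sliding sub-window over the short-term memory vectors."""
    return [
        aggregate_window(short_term[start:start + window])
        for start in range(0, len(short_term) - window + 1, stride)
    ]

# Hypothetical 2-D short-term memory vectors for 8 consecutive time points.
short_term = [[float(t), 1.0] for t in range(8)]
pool = build_summary_pool(short_term)
```

With 8 time points, a sub-window of 4, and a stride of 2, the pool contains three summary tokens. The decay makes each token lean toward the more recent vectors in its sub-window.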
[0101] The above technical solution aggregates the short-term memory vectors of the teacher model at various time points within a preset time window using an aggregator, thereby obtaining a set of memory information and constructing a mid-range memory summary pool. This effectively solves the problem that the original instantaneous memory information is too fragmented to capture long-term behavioral patterns. This aggregation mechanism can extract more stable and representative mid-range memory representations from a large amount of short-term memory, enabling the mid-range memory summary pool to more accurately reflect the teacher model's strategies and decision-making logic over a period of time. When the mid-range memory summary pool obtained through this construction method is used to calculate the global distribution loss, the student model can learn deeper and more temporally coherent behavioral patterns of the teacher model, rather than just instantaneous reactions, significantly improving the efficiency of model distillation and the student model's ability to imitate the behavior of complex game objects.
[0102] Step S420. Compress the mid-range memory summary pool into a single global memory vector using a global aggregator, and obtain the global distribution loss based on the global memory vectors of the teacher model and the student model.
[0103] A global aggregator is a module used to compress multiple memory representations in a mid-range memory summarization pool into a single global memory vector. Its role is to extract the most core and representative global features from complex, multi-time-step memory information. A global aggregator can be a simple average pooling layer, a max pooling layer, or a more complex neural network structure, such as an attention mechanism network or a Transformer encoder. It learns how to effectively integrate information from the summarization pool to generate a fixed-dimensional vector that comprehensively reflects the model's overall behavior or strategy. The global memory vector is a single vector representation of the model's overall behavioral pattern or long-term strategy extracted from the model's mid-range memory summarization pool by the global aggregator. This vector is a condensation of all key memory information of the model over a period of time. Global distribution loss is a loss function that measures the difference between the global memory vectors of the teacher model and the student model. Its purpose is to encourage the student model to maintain consistency with the teacher model in the overall behavioral distribution. Global distribution loss can use Wasserstein-1 distance or MMD (Maximum Mean Discrepancy).
[0104] Specifically, global distribution alignment aims to align the macro-level "mindset" of the teacher and student models. A global aggregator (e.g., Transformer-pooling) can be used to combine all tokens in a summary pool into a fixed-dimensional vector, which can be viewed as a "high-level summary of ideas" of the model's overall memory within that time window. Then, the set distribution distance (e.g., Wasserstein distance) between these "summary of ideas" vectors from the teacher and student models can be calculated. This alignment forces the student model to learn how the teacher model organizes and summarizes information, rather than simply imitating individual memory fragments.
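As a minimal sketch of this step, the following compresses each summary pool with mean pooling and then compares the teacher and student global vectors with a squared MMD under an RBF kernel. Mean pooling stands in for a learned global aggregator, and treating the global vectors of several time windows as the two sample sets is an assumption; the function names and the kernel bandwidth are hypothetical.

```python
import math

def mean_pool(pool):
    """Compress a summary pool (list of tokens) into one global memory vector."""
    dim = len(pool[0])
    return [sum(tok[d] for tok in pool) / len(pool) for d in range(dim)]

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of vectors."""
    k_xx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

# Hypothetical summary pools for two time windows (two tokens each, 2-D).
teacher_pools = [[[0.0, 1.0], [2.0, 1.0]], [[1.0, 0.0], [3.0, 2.0]]]
student_pools = [[[0.0, 1.0], [2.0, 1.0]], [[1.0, 0.0], [3.0, 2.0]]]
g_teacher = [mean_pool(p) for p in teacher_pools]
g_student = [mean_pool(p) for p in student_pools]
loss_same = mmd2(g_teacher, g_student)
loss_shifted = mmd2(g_teacher, [[v + 1.0 for v in g] for g in g_student])
```

Identical pools give a loss of zero, while shifting every student global vector produces a strictly positive loss.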
[0105] The above technical solution addresses the difficulty of capturing global behavioral patterns when raw memory information is compared directly. By constructing a mid-range memory summary pool and a global memory vector, the model's memory information over a period of time is effectively summarized, allowing the distillation process to focus on the consistency between the teacher and student models in terms of macroscopic behavioral distribution and long-term strategies. This helps the student model learn the teacher model's overall decision-making logic and game strategy more comprehensively and deeply, improving the student model's performance in complex game scenarios, making its behavioral patterns match the teacher model more closely, and ultimately improving the overall effectiveness of the distillation method and the model's generalization ability.
[0106] In this example implementation, the short-term memory vectors in the memory information sets of the teacher model and the student model can be aligned based on a preset instance matching strategy, and the instance alignment loss of the teacher model and the student model at each time point can be determined based on the aligned short-term memory vectors.
[0107] The pre-defined instance matching strategy aims to establish a correspondence between the short-term memory vectors generated by the teacher and student models at different time points. Alignment refers to finding one or more corresponding short-term memory vectors in the student model's memory information set for each short-term memory vector in the teacher model's memory information set, according to the pre-defined instance matching strategy. This process is a prerequisite for calculating the instance alignment loss, ensuring the validity and accuracy of the comparison. The instance alignment loss function is used to quantify the difference between the aligned short-term memory vectors of the teacher and student models, with the aim of encouraging the student model to mimic the teacher model's memory representation at the instance level.
[0108] Specifically, instance alignment performs a fine-grained spot check after the macro-level "mindset" has been aligned. The formula takes the form L_inst = (1/|W|) Σ_{t∈W} d(m_t^S, π(m_t^T)), that is, over the time points t within the memory window W, it computes the average distance between the student model's memory summary m_t^S and the teacher model's memory summary m_t^T, matched through the alignment mapping π. This ensures that the student model not only learns the teacher's "thinking style" but also stays highly consistent with the teacher's "thinking content" at each key time point.
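The averaging described above can be illustrated with a minimal sketch. It assumes direct time alignment (the t-th student summary is paired with the t-th teacher summary) and Euclidean distance; both are illustrative choices, and the function name `instance_alignment_loss` is hypothetical rather than taken from the disclosure.

```python
import math


def instance_alignment_loss(student, teacher):
    """Mean Euclidean distance between time-aligned memory summaries.

    `student` and `teacher` are equally long sequences of vectors, one
    vector per time point in the memory window (assumed direct alignment).
    """
    if len(student) != len(teacher) or not student:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for s_vec, t_vec in zip(student, teacher):
        total += math.dist(s_vec, t_vec)  # distance at one time point
    return total / len(student)


# Two time points: the first pair is 5.0 apart, the second is identical.
student_summaries = [[0.0, 0.0], [1.0, 1.0]]
teacher_summaries = [[3.0, 4.0], [1.0, 1.0]]
loss = instance_alignment_loss(student_summaries, teacher_summaries)
```

With a learnable projection or a soft matching in place of direct alignment, only the pairing step would change; the averaging over the window stays the same.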
[0109] In this example implementation, the introduction of instance alignment loss compensates for the shortcomings of relying solely on global memory vectors for memory loss calculation. Specifically, after the teacher and student models generate their respective memory information sets, a pre-defined instance matching strategy is applied at the instance level to systematically establish a correspondence between short-term memory vectors in the teacher and student models' memory information sets. This matching process ensures that, in the temporal dimension, short-term memory vectors generated by the teacher and student models at similar times or states can be effectively correlated. The instance alignment loss reflects the similarity of the teacher and student models' instantaneous memory representations at various time points. By incorporating the instance alignment loss into the calculation of joint loss, the student model is guided not only to maintain consistency with the teacher model in the macroscopic global memory distribution but also to align with the teacher model in the microscopic, instance-level memory evolution path, thereby achieving more refined knowledge transfer.
[0110] In the model distillation process, not only are the global memory distributions of the teacher and student models considered, but instance-level alignment and loss calculation of short-term memory vectors at various time points are also introduced. This effectively captures the dynamic behavior and instantaneous state representation of the teacher model at different time steps, forcing the student model to mimic the teacher model's decision-making process and internal state evolution at a more refined level. This helps the student model learn richer temporal dependencies and contextual information from the teacher model, thereby significantly improving the accuracy and robustness of the student model's control over game objects in complex game scenarios and avoiding the loss of details that may result from relying solely on global information.
[0111] In this example implementation, if the teacher model is a black-box model, the mid-range memory summary pool of the teacher model is obtained through an external summarizer; wherein, the external summarizer includes a cascaded structure of a frame-level feature encoder and a temporal aggregator, and the model parameters of the external summarizer are frozen during training.
[0112] A black-box model refers to a model whose internal structure, parameters, and algorithmic logic are invisible or inaccessible to external observers. This means that it is impossible to directly explore its internal workings, modify it, or directly extract features from intermediate layers. In the context of model distillation, when the teacher model is treated as a black-box model, it can typically only be interacted with through its inputs and outputs, and its internal short-term memory vectors cannot be obtained or aggregation operations performed.
[0113] An external summarizer is a module independent of the teacher model. Its function is to extract and generate summary information corresponding to the internal memory information of the teacher model from its external output; this is known as a mid-range memory summary pool. The external summarizer aims to simulate or infer the memory features that the teacher model might generate when processing input. For example, the external summarizer can be a standalone neural network module whose input is video frames and output is a memory summary. A frame-level feature encoder is part of the external summarizer; its function is to process the input video frames and extract the visual features of each frame. A temporal aggregator is another component of the external summarizer. It receives the continuous frame feature sequence output from the frame-level feature encoder and aggregates it over time to generate a summary representing the memory information over a period of time. Its purpose is to capture the temporal dynamics and contextual information in the video sequence, integrating discrete frame features into a more generalized mid-range memory summary. A cascaded structure refers to the frame-level feature encoder and the temporal aggregator being connected sequentially; that is, the output of the frame-level feature encoder serves as the input of the temporal aggregator. This connection method ensures that each frame is first extracted independently, and then these frame-level features are fed into a temporal aggregator in chronological order to capture temporal dependencies and generate the final mid-range memory summary.
[0114] As a stable feature extraction and memory summary generation module, the behavior of the external summarizer is not affected by the student model training process. That is, all learnable parameters of the external summarizer remain unchanged throughout the distillation training process and no gradient updates are performed.
[0115] In this example implementation, a separate external summarizer is introduced, designed as a cascaded structure comprising a frame-level feature encoder and a temporal aggregator. When the internal structure of the teacher model is inaccessible, the frame-level feature encoder first processes the first video frame sample output by the teacher model frame by frame, extracting the visual features of each frame. Subsequently, these temporally ordered frame-level visual features are input into the temporal aggregator. The temporal aggregator is responsible for integrating these serialized frame features over time, thereby generating a mid-range memory summarization pool representing the teacher model's "memory" over a period of time. To ensure that the external summarizer provides a stable memory reference that does not change with the student model's training during the distillation process, its model parameters are frozen throughout the training period. In this way, even if the teacher model is a black-box model, a mid-range memory summarization pool for memory loss calculation can be effectively inferred and constructed from its observable output, enabling the distillation process to proceed smoothly. This overcomes the limitations imposed by black-box teacher models, greatly expands the applicability of the model distillation method, allowing it to be applied to a wider range of black-box teacher model scenarios, and improving the flexibility and practicality of the distillation method.
[0116] In step S160, a joint loss is obtained based on at least one of the perceptual loss, response loss, and memory loss, and the model parameters in the student model are updated based on the joint loss to obtain the interaction model.
[0117] In this example implementation, joint loss refers to the total loss obtained by combining at least one of perceptual loss, response loss, and memory loss, and is used to guide the training process of the student model. Model parameters refer to the numerical weights and biases that constitute the internal structure and function of the student model; by adjusting these parameters, the model can learn and improve its performance. The interaction model refers to the student model obtained after distillation training, which can generate high-quality video frames based on the input action sequence and has visual response and memory capabilities similar to those of the teacher model, thereby achieving effective control over game objects.
[0118] The formula for calculating the joint loss can be, for example:
[0119] Here, the terms of the formula denote, respectively, the joint loss, the perceptual loss, the global distribution loss, the instance alignment loss, the time-domain loss, and the frequency-domain loss, together with three weighting coefficients.
[0120] This function acts like a commander-in-chief: by adjusting the weighting coefficients, it balances the three goals of "drawing a good likeness" (perceptual loss), "remembering well" (memory loss), and "good feel" (response loss), which together guide the optimization direction of the student model.
[0121] In this example implementation, the loss weights of at least one of the losses, namely perceptual loss, response loss, and memory loss, can be adaptively adjusted based on a preset weight adjustment strategy.
[0122] A predefined weight adjustment strategy refers to a set of predefined rules or algorithms used to guide the dynamic modification of weights associated with different losses during model distillation. This strategy aims to dynamically balance the impact of individual loss terms on the overall joint loss, thereby more effectively guiding the student model's learning. For example, adaptive weight adjustment can be achieved through uncertainty-weighted or gradient-normalization-based strategies.
[0123] In this example implementation, by adaptively adjusting the loss weights of at least one of the perceptual loss, response loss, and memory loss, the importance of different learning objectives in model training can be dynamically balanced, avoiding the problems of low training efficiency or poor model performance that may be caused by fixed weights.
[0124] In this example implementation, when the moving average of the time lag difference in the time domain indicators of the first video frame sample and the second video frame sample is greater than or equal to a preset time lag difference threshold, the loss weight corresponding to the response loss can be increased.
[0125] The difference in time-domain metrics, specifically time delay, between the first video frame sample output by the teacher model and the second video frame sample output by the student model is continuously monitored, and its moving average is calculated. When this moving average reaches or exceeds a preset time delay difference threshold, it indicates that the student model deviates significantly from the teacher model in terms of temporal response, requiring intervention. The time delay difference threshold can be set based on specific application scenarios, game type requirements for temporal accuracy, and experience.
[0126] Increasing the weight of the response loss refers to dynamically increasing the weight coefficient of the response loss when calculating the joint loss. The response loss is a key indicator for measuring the differences between the teacher model and the student model in the time and frequency domain responses of visual signals. When significant time delay differences are detected, increasing the weight of the response loss can make the model more biased towards reducing inconsistencies in the time and frequency domains during parameter updates, thereby prompting the student model to adjust its temporal behavior more quickly and maintain better synchronization with the teacher model.
[0127] In this example implementation, the model distillation process is further optimized by introducing a dynamic monitoring and weight adjustment mechanism for time delay differences. This targeted weight adjustment enables the student model to learn and correct its temporal dynamic behavior more effectively, allowing the distillation process to more accurately align the performance of the student model and the teacher model in the time dimension.
[0128] In this example implementation, the activation phase of perceptual loss and at least one of response loss and memory loss can also be determined based on a preset activation strategy.
[0129] The preset activation strategy aims to dynamically determine which loss terms should be activated and participate in the calculation of joint loss based on the current state or objective of distillation training. Specifically, before calculating the joint loss, the system queries the activation strategy based on the current training stage or preset conditions to determine which terms among perceptual loss, response loss, and memory loss should be activated. Only loss terms determined to be activated will participate in the construction of joint loss. For example, only perceptual loss can be activated in the early stages of training, and response loss and memory loss can be gradually activated as training progresses; different loss activation combinations can be preset for game scenarios or game object behaviors of different complexities; or a strategy based on loss value or gradient changes can be used to temporarily disable a loss term when its value or gradient tends to stabilize or fall below a certain threshold over a period of time to avoid overfitting or computational redundancy.
[0130] For example, a curriculum learning strategy can be introduced during frequency domain alignment, in which the weights w(ω) of the high-frequency subbands are increased linearly from 0 to 1 during the training process.
[0131] The above technical solution allows for flexible control of the activation status of perceptual loss, response loss, and memory loss based on different stages of distillation training or specific needs. This loss term selection mechanism based on a preset activation strategy effectively avoids the waste of computational resources and gradient interference caused by calculating and optimizing certain loss terms at inappropriate stages. This allows the student model to focus more on the core learning objectives of the current stage during training, such as focusing on visual feature matching in the early stages and gradually introducing temporal response and memory consistency learning in the later stages. This method further optimizes the stability and efficiency of the distillation process, helping the student model converge faster and more accurately.
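A staged activation strategy and the linear ramp of the high-frequency weight can be sketched as follows. The ramp start at 30% of training follows the frequency-domain settings given later in this description; the activation points `resp_start` and `mem_start` and both function names are hypothetical values chosen only for illustration.

```python
def high_freq_weight(progress, start=0.3):
    """Linear ramp of w from 0 to 1 between `start` and the end of training.

    `progress` is the fraction of training completed, in [0, 1].
    """
    if progress <= start:
        return 0.0
    return min(1.0, (progress - start) / (1.0 - start))


def active_losses(progress, resp_start=0.2, mem_start=0.4):
    """Return the loss terms that take part in the joint loss.

    Perceptual loss is always active; response and memory losses are
    switched on later (the switch points are illustrative assumptions).
    """
    active = ["perceptual"]
    if progress >= resp_start:
        active.append("response")
    if progress >= mem_start:
        active.append("memory")
    return active


early = active_losses(0.1)       # only the perceptual loss
late = active_losses(0.5)        # all three losses
w_mid = high_freq_weight(0.65)   # halfway up the ramp
```

A loss-value-based or gradient-based rule could replace the fixed switch points without changing how the active list is consumed.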
[0132] Examples of loss-enabled and weight-adaptive strategies are shown in Table 1:
[0133] Table 1
[0134] In this example implementation, in addition to the conventional distillation loss used to make the visual output of the student model perceptually similar to that of a teacher model, at least one of the following losses is introduced: a) Action-visual response transfer characteristic alignment loss: For a preset action sequence, the visual response signals of the output videos of the student model and the teacher model in the region of interest are obtained, response curves are constructed, and the response amplitude A and response lag are calculated. The differences between the amplitude spectrum and the phase spectrum are minimized; b) Memory organization alignment loss: Within a preset time window, mid-range memory summary sets for the student model and the teacher model are generated respectively, and the overall statistical distribution distance and / or time-aligned summary instance distance between the two sets are calculated to minimize the distance. If both response loss and memory loss are introduced simultaneously, compared to a scheme that only introduces one, a synergistic technical effect of low-latency interactive feel and long-term memory consistency can be obtained simultaneously under the same inference budget.
[0135] Figure 5 is a schematic diagram of the overall framework of a dual-channel distillation system according to a specific embodiment of this disclosure. The framework includes the following parts: 1. Input Data Layer: The action sequence generation module integrates human manipulation, programmed test actions, and noise disturbances to generate standardized action sequences as a unified input for the teacher and student models, ensuring the consistency and comprehensiveness of the evaluation.
[0136] 2. Model Execution Layer: The teacher model (a high-quality model with frozen parameters) and the student model (a lightweight model to be trained) process the same action sequence in parallel, which reflects the core idea of "master demonstration and student practice" in distillation learning.
[0137] 3. Feature Output Layer: The teacher and student models generate two types of key outputs: visual frames (for perceptual alignment and interactive-feel alignment) and memory states (for memory organization alignment), providing the basic data for subsequent multi-channel loss calculation.
[0138] 4. Dual-channel distillation loss layer: Contains the following two parallel channels: Channel 1 (Perceptual Alignment): By comparing the visual output of teachers and students through a perceptual loss calculator, the traditional "painting skills" can be passed on.
[0139] Channel 2 (Mindset Alignment): Divided into two sub-channels, which align "interactive feel" through the response loss calculator and "memory organization method" through the memory loss calculator.
[0140] 5. Adaptive Optimization Layer: The weight controller combines the three types of losses into a joint loss based on the curriculum learning strategy and the adaptive adjustment algorithm, and the optimizer then updates the student model parameters.
[0141] As shown in Figure 5, the key technical points and data settings in the specific implementation scheme are as follows: 1. Action Sequence Generation: The input is the initial synchronization state and a random seed; the output is an action tensor A∈R^{T×d} of length T. Implementation suggestions: The programmed reference set includes: step (amplitude U[a_min,a_max], number of hold frames U[8,64]), pulse (width U[1,4]), and frequency sweep (start and end frequencies sampled log-uniformly from [1 / T_max, 0.25]), where T_max is the maximum sequence length (the longest action duration in frames).
[0142] The combined sampling ratio (human : programmed : noise) can be 6:3:1; the noise is band-limited white noise BN(f_c), which avoids high-frequency jitter, where f_c is the cutoff frequency.
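The programmed step and pulse templates and the 6:3:1 source mix can be sketched as below. This is a simplified single-channel illustration (d = 1), the frequency sweep and the band-limited noise are omitted, and all function names and the default amplitude range are hypothetical.

```python
import random


def step_action(length, rng, a_min=-1.0, a_max=1.0):
    """Step template: amplitude ~ U[a_min, a_max], held for U[8, 64] frames."""
    amp = rng.uniform(a_min, a_max)
    hold = rng.randint(8, 64)  # inclusive bounds
    onset = rng.randint(0, max(0, length - hold))
    return [amp if onset <= t < onset + hold else 0.0 for t in range(length)]


def pulse_action(length, rng, a_min=-1.0, a_max=1.0):
    """Pulse template: same amplitude range, width ~ U[1, 4] frames."""
    amp = rng.uniform(a_min, a_max)
    width = rng.randint(1, 4)
    onset = rng.randint(0, max(0, length - width))
    return [amp if onset <= t < onset + width else 0.0 for t in range(length)]


def sample_source(rng):
    """Pick the origin of the next sequence with ratio 6:3:1."""
    return rng.choices(["human", "programmed", "noise"], weights=[6, 3, 1])[0]


rng = random.Random(0)  # the random seed is one of the stated inputs
step_sequence = step_action(128, rng)
pulse_sequence = pulse_action(128, rng)
source = sample_source(rng)
```

Seeding the generator makes the same sequence reproducible for both the teacher and the student, which is what allows their outputs to be compared frame by frame.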
[0143] 2. Perceptual Loss Calculation: The input is video clips or features of T_Out1 and S_Out1; the output is a scalar, the perceptual loss. Here T represents the teacher model, S represents the student model, T_Out1 represents the visual frames output by the teacher model, S_Out1 represents the visual frames output by the student model, T_Out2 represents the memory states output by the teacher model, and S_Out2 represents the memory states output by the student model.
[0144] An LPIPS loss or a multi-layer VGG perceptual loss can optionally be used. The multi-layer VGG loss is a weighted sum over layers, of the form Σ_l w_l · d(φ_l(S_Out1), φ_l(T_Out1)), where l is the layer index of the VGG network, w_l is the weight coefficient of layer l, φ_l denotes layer l of the VGG network used as a feature extractor (that is, its activation values for the input image), and d is a distance between feature maps; the features are normalized to unit variance, and higher weights are assigned to higher layers. The core idea of this perceptual loss is to compare the high-level semantic features extracted by a perceptual model (VGG) instead of directly comparing pixel values (as MSE does), because human judgments of image similarity correspond more closely to distances in feature space than to distances in pixel space.
[0145] Enhanced temporal consistency: Sliding window matching (window W=8~16) is performed in the time dimension, and small-amplitude optical flow alignment is added to compensate for phase drift.
[0146] 3. Response Loss Calculation: The inputs are T_Out1, S_Out1, and the ROI; the output is the scalar L_resp. A differentiable implementation is emphasized here. Amplitude soft pooling: A = Σ_t softmax(r(t)/τ)_t · r(t), with τ annealed from 0.3 to 0.05.
[0147] The temperature τ controls the "softness" of the softmax distribution. When τ is large (e.g., 0.3), the differences among the values r(t)/τ are small, the weights at the time points tend toward uniform, the formula approximates a weighted average, and the gradient is distributed across all time steps; training is very stable, but the estimated amplitude A is too small (because it is a softened "average peak value"). When τ is small (e.g., 0.05), the differences among the values r(t)/τ are large and the weights are highly concentrated at the maximum; the formula approximates the true max, and the estimated A is very close to the true peak of the response curve, but the gradient then comes almost entirely from the maximum point while the gradients at other time steps are close to zero, which may make training unstable.
[0148] The purpose of annealing is to use a large τ (0.3) in the early stages of training so that the gradient flows across the entire response curve and the model learns stably; as training progresses, τ is gradually reduced to 0.05 on a cosine or linear schedule, so that the amplitude estimate becomes increasingly accurate. This ensures both early stability and final accuracy, similar to metal annealing, which allows time for structural adjustment before gradually locking in the optimal configuration.
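The effect of the temperature on the soft peak estimate can be checked with a small sketch. It implements a softmax-weighted average of the response curve together with a cosine schedule between the two temperatures stated above; the function names and the sample curve are hypothetical.

```python
import math


def softmax_pool(values, tau):
    """Soft maximum: sum_t softmax(values / tau)_t * values_t."""
    peak = max(values)  # subtract the max before exp for numerical stability
    weights = [math.exp((v - peak) / tau) for v in values]
    norm = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / norm


def cosine_tau(progress, tau_start=0.3, tau_end=0.05):
    """Cosine schedule from tau_start (progress 0) to tau_end (progress 1)."""
    return tau_end + 0.5 * (tau_start - tau_end) * (1.0 + math.cos(math.pi * progress))


curve = [0.0, 0.2, 0.9, 1.0, 0.4]                    # toy response curve, true peak 1.0
soft_early = softmax_pool(curve, cosine_tau(0.0))    # tau = 0.3, smoother estimate
soft_late = softmax_pool(curve, cosine_tau(1.0))     # tau = 0.05, close to the peak
```

On this toy curve the early estimate is noticeably below the true peak and the late estimate is close to it, which matches the trade-off between gradient spread and accuracy described above.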
[0149] Time delay δ: the first time at which the response curve reaches a set fraction of the amplitude A is located by linear interpolation at the crossing point, so that the interpolation weights are gradient-friendly; a parallel supervision signal δ' is obtained from the cross-correlation peak. The delay loss combines Huber(δ, δ') with a further term.
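The two delay estimates and the Huber comparison can be sketched in plain Python. The sketch uses a hard maximum for the amplitude and a half-peak level for brevity, restricts the cross-correlation search to non-negative lags, and uses a Huber threshold `k` of 1 frame; these choices and all function names are illustrative assumptions.

```python
def crossing_time(curve, level):
    """First (fractional) frame index at which `curve` rises to `level`."""
    if curve[0] >= level:
        return 0.0
    for t in range(1, len(curve)):
        lo, hi = curve[t - 1], curve[t]
        if lo < level <= hi:
            # Linear interpolation between the two frames around the crossing.
            return (t - 1) + (level - lo) / (hi - lo)
    return None  # the level is never reached


def xcorr_lag(action, response, max_lag):
    """Lag in 0..max_lag maximizing sum_t action[t] * response[t + lag]."""
    def score(lag):
        return sum(a * response[t + lag]
                   for t, a in enumerate(action) if t + lag < len(response))
    return max(range(max_lag + 1), key=score)


def huber(x, y, k=1.0):
    """Huber penalty: quadratic for small differences, linear for large ones."""
    d = abs(x - y)
    return 0.5 * d * d if d <= k else k * (d - 0.5 * k)


curve = [0.0, 0.0, 0.2, 0.6, 1.0]
delta = crossing_time(curve, 0.5 * max(curve))        # half-peak crossing
delta_ref = xcorr_lag([0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], max_lag=4)
penalty = huber(delta, delta_ref)
```

In a trainable version the interpolated crossing time would be computed on tensors so that gradients flow to the two neighboring frames, while the cross-correlation lag acts only as a reference value.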
[0150] Frequency domain: DFT frequency points K∈[16,64] logarithmic sampling; phase wrapping uses the principal value interval and phase unwrapping (threshold π); subband weights w(ω) linear schedule (enabled after 30% training).
[0151] 4. Memory Loss Calculation: The input is the time-window memory T_Out2 and S_Out2; the output is the scalar L_mem. Key engineering points: Global distribution distance: a Sinkhorn approximation of the Wasserstein-1 distance, with regularization ε∈[0.01,0.1] and 20~50 iteration steps; log-sum-exp is used for numerical stability.
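A log-domain Sinkhorn iteration of the kind referred to above can be sketched as follows. It assumes uniform weights on both memory sets and a Euclidean ground cost, and it returns the transport cost under the entropy-regularized plan; the function names are hypothetical and the pure-Python loops are for readability, not speed.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    peak = max(xs)
    return peak + math.log(sum(math.exp(x - peak) for x in xs))


def sinkhorn_cost(xs, ys, eps=0.05, iters=50):
    """Entropy-regularized approximation of the Wasserstein-1 distance.

    xs, ys: lists of vectors with uniform mass; eps is the regularization.
    """
    n, m = len(xs), len(ys)
    cost = [[math.dist(x, y) for y in ys] for x in xs]
    log_a, log_b = -math.log(n), -math.log(m)
    f, g = [0.0] * n, [0.0] * m  # dual potentials
    for _ in range(iters):
        f = [-eps * logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(n)])
             for j in range(m)]
    # Transport plan P_ij = a_i * b_j * exp((f_i + g_j - C_ij) / eps).
    return sum(math.exp(log_a + log_b + (f[i] + g[j] - cost[i][j]) / eps) * cost[i][j]
               for i in range(n) for j in range(m))


same = sinkhorn_cost([[0.0], [1.0]], [[0.0], [1.0]])   # identical sets
shifted = sinkhorn_cost([[0.0]], [[2.0]])              # one point moved by 2
```

Working with the potentials in the log domain avoids the underflow that the plain kernel exp(-C/ε) would cause when ε is as small as 0.01.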
[0152] Instance alignment: π can be set as a shared linear projection W∈R^{d×d} or a Hungarian matching (approximately differentiable scheme: Sinkhorn-Knopp assignment matrix). π is an alignment mapping used to solve the problem that "memory tokens in the teacher-student model may not correspond one-to-one".
[0153] Shared linear projection: A learnable matrix W projects the teacher's tokens into the student's space, after which the distance is calculated. This is the simplest solution and assumes a linear relationship between the two.
[0154] Hungarian matching: Within a fixed window, perform optimal bipartite graph matching on all student and teacher tokens to find the pairing scheme that minimizes the total distance. Because it is discrete and non-differentiable, the Sinkhorn-Knopp assignment matrix is used as an approximate differentiable substitute.
[0155] 5. Weight controller: Uncertainty weighting or error gating.
[0156] Uncertainty weighting: the joint loss takes a form such as Σ_i [ L_i / (2σ_i^2) + log σ_i ], where σ_i is a learnable parameter representing the uncertainty of the i-th loss term, and log σ_i is a regularization term that prevents σ_i from increasing without bound. The advantage is that no manual weight adjustment is required, since the model automatically discovers the relative importance of each loss during training. Alternatively, GradNorm (Gradient Normalization) can be used to keep the gradient norms balanced across multiple tasks.
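One widely used form of uncertainty weighting is sketched below. The exact expression used in this disclosure is not legible in the text, so the form Σ_i [ L_i / (2σ_i²) + log σ_i ], the parameterization by log σ_i, and the function name are assumptions made for illustration.

```python
import math


def uncertainty_weighted(losses, log_sigmas):
    """Return sum_i [ L_i / (2 * sigma_i^2) + log(sigma_i) ].

    Each sigma_i is parameterized by s_i = log(sigma_i), so it stays
    positive without any constraint on the learnable value s_i.
    """
    return sum(0.5 * math.exp(-2.0 * s) * loss + s
               for loss, s in zip(losses, log_sigmas))


# With sigma_i = 1 (log sigma_i = 0) every loss is simply halved.
equal = uncertainty_weighted([2.0, 4.0], [0.0, 0.0])

# A larger sigma down-weights its loss but pays the log(sigma) penalty.
relaxed = uncertainty_weighted([4.0], [math.log(2.0)])
```

For a fixed loss value L_i this expression is minimized at σ_i² = L_i, which is why a persistently large loss term ends up with a smaller effective weight while the log term keeps σ_i from growing indefinitely.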
[0157] Error gating: when EMA(|δ_S − δ_T|) > θ, the response loss weight λ_resp is updated as λ_resp ← λ_resp·(1+g), with g∈[0.2,0.5], and a cooldown period is set to avoid oscillation. EMA (Exponential Moving Average) means that, instead of looking directly at the time lag error of the current frame, the exponentially weighted average of its historical values is used; this filters out occasional noise peaks, so the weight is increased only when the time lag error is consistently large. θ is the trigger threshold; for example, setting it to 3 frames means that a moving average of the lag error above 3 frames indicates a problem with the alignment. The cooldown period is an interval that can be set after the weight has been increased, during which the weight is not increased further; this avoids continuous triggering, which could make the weight grow exponentially and cause training oscillations.
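The gating rule with its moving average and cooldown can be sketched as a small stateful helper. The class name `ErrorGate`, the EMA decay factor, and the cooldown length are hypothetical; the threshold of 3 frames and the gain range follow the values mentioned above.

```python
class ErrorGate:
    """Raise the response-loss weight when the lag error stays large."""

    def __init__(self, threshold=3.0, gain=0.3, cooldown=100, decay=0.9):
        self.threshold = threshold  # trigger threshold, in frames
        self.gain = gain            # g in the update weight * (1 + g)
        self.cooldown = cooldown    # steps to wait after a trigger
        self.decay = decay          # EMA decay factor (assumed value)
        self.ema = 0.0
        self.wait = 0

    def update(self, lag_error, weight):
        """Feed one lag-error sample and return the (possibly raised) weight."""
        self.ema = self.decay * self.ema + (1.0 - self.decay) * lag_error
        if self.wait > 0:           # still cooling down: never raise
            self.wait -= 1
            return weight
        if self.ema > self.threshold:
            self.wait = self.cooldown
            return weight * (1.0 + self.gain)
        return weight


gate = ErrorGate(threshold=3.0, gain=0.5, cooldown=2, decay=0.5)
weight = 1.0
for error in [10.0, 10.0, 10.0, 10.0]:   # persistently large lag error
    weight = gate.update(error, weight)
```

In this run the weight is raised at the first step, held during the two cooldown steps, and raised again at the fourth step, so it ends at 1.5 × 1.5 = 2.25 instead of growing at every step.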
[0158] Numerical stability: Global gradient clipping ≤ 1.0; Mixed precision training with loss scaling; Input / feature standardization (zero mean, unit variance); DFT uses Hanning window to suppress spectral leakage.
[0159] Figure 6 is a flowchart of a response loss calculation method in a specific embodiment of this disclosure. The specific steps of the flowchart are as follows: Step S610. Intelligent ROI selection.
[0160] The system offers three ROI selection strategies: fixed anchor points (such as the center of the screen), semantic anchor points (such as tracking a weapon's crosshair), and learnable selectors (neural networks automatically select the most sensitive areas). This step determines "where to measure feel."
[0161] Step S620. Visual signal extraction.
[0162] Within the selected ROI, one of several signal types is selected according to the specific application scenario: optical flow (motion velocity), keypoint displacement (positional change), brightness change (visual impact), or depth change (3D spatial sense). The signal extraction function ψ converts the image region into a time series.
[0163] Step S630. Construct the response curve.
[0164] The response curve r(t) following the action is constructed by calculation. To improve robustness, filtering is also applied to remove noise.
[0165] Step S640. Multidimensional index estimation.
[0166] Quantifying "feel" precisely from two dimensions: Time-domain metrics: amplitude A (response strength estimated by softmax pooling) and delay δ (delay time to reach half peak value).
[0167] Frequency domain metrics: Amplitude spectrum (frequency response intensity), phase spectrum (frequency phase delay), and group delay (more refined delay characteristics) are extracted through discrete Fourier transform.
[0168] Step S650. Loss calculation and alignment.
[0169] The time-domain loss and frequency-domain loss are calculated separately and then combined into a response loss. By minimizing the differences between the teacher and student models on these metrics, the precise transmission of "feel" can be achieved.
[0170] As shown in Figure 6, the key technical points and data settings in the specific implementation scheme are as follows: 1. Intelligent ROI selection: Fixed anchor point: Predefined screen coordinates (such as the center, a corner, or the crosshair position).
[0171] Semantic anchors: target boxes or key points (e.g., weapons, vehicle taillights, NPC heads) from detection / segmentation / keypoint models.
[0172] Learnable selector S: A lightweight network outputs K candidate ROIs and their confidence scores. Discrete approximate selection is performed using Gumbel-TopK (sampling k items without replacement), and area and overlap regularization are added (area ∈ [2%, 15%]; an IoU (Intersection over Union) penalty suppresses overlap).
[0173] 2. Visual signal extraction y(t) = ψ(frame, ROI): Optical flow: TV-L1 (a classic optical flow algorithm with a Total Variation + L1 penalty, robust to noise and occlusion, fast, and suitable for real-time scenarios) or RAFT (Recurrent All-Pairs Field Transforms, a high-precision optical flow network based on deep learning). The output is mean-pooled or energy-pooled within the ROI. Median filtering (window 3~5) is then applied to the extracted time series along the time axis to remove outliers caused by occlusion or fast motion, making the signal smoother and more robust.
[0174] Keypoint displacement: Points are selected using Shi-Tomasi / FAST and tracked with LK (Lucas-Kanade), and the average displacement magnitude of the points within the ROI is calculated. Shi-Tomasi and FAST are two corner detection algorithms used to find textured, easily trackable "feature points" within the ROI. LK tracking assumes locally constant optical flow between adjacent frames and uses least squares to calculate the displacement of each feature point; this method has low computational cost and good real-time performance. The average displacement magnitude is computed over all successfully tracked feature points within the ROI: the magnitude of each point's displacement vector is taken, and the magnitudes are averaged to obtain the scalar y(t).
[0175] Brightness / Depth: Normalize the mean or energy of the ROI and apply bilateral filtering for noise reduction. Bilateral filtering can remove noise while preserving edges and is more suitable for processing structured image regions than Gaussian filtering.
[0176] Different game scenarios suit different signal extraction functions ψ. For example, in racing games where cars travel at high speeds, optical flow is the best choice; in shooting games where the crosshair is small and precise, keypoint tracking is more suitable; and in explosion and skill effect scenes, changes in brightness are the most direct signal.
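The temporal median filtering applied to the extracted signal y(t) can be sketched as below. The window shrinks at the two ends of the series, which is one of several possible boundary treatments and is an assumption of this sketch; the function name is hypothetical.

```python
from statistics import median


def median_filter(signal, window=3):
    """Sliding median over `signal`; the window shrinks at both ends."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]


# A single-frame outlier (for example from occlusion) is removed.
raw = [1.0, 1.0, 9.0, 1.0, 1.0]
smoothed = median_filter(raw, window=3)
```

Unlike a moving average, the median discards an isolated spike entirely instead of spreading it over neighboring frames, which keeps the onset of a genuine response sharp.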
[0177] 3. Response curve r(t): Trigger point t_0: determined by marking "significant events" in the action template or by the input exceeding a threshold. Preprocessing: mean removal to eliminate baseline drift; median or Savitzky-Golay filtering for smoothing. Differentiable peak value: A = softmax_pool(r[t_0, t_0+Δ]), where the temperature τ follows cosine annealing to avoid gradient saturation.
[0178] 4. Indicator estimation: Time delay δ: the first time at which the response reaches a set fraction of the amplitude A (the fraction lies in [0.3,0.7]) is obtained by linear interpolation; parallel supervision δ' = argmax xcorr(a(t), y(t)); the delay loss combines Huber(δ, δ') with a further term to improve robustness.
[0179] Frequency domain: The DFT is performed after windowing (Hanning window); the K frequency points are sampled log-uniformly (K=32 is recommended); phase wrapping correction and phase unwrapping are performed; the group delay τ_g(ω) = −d∠H(ω)/dω is approximated by finite differences.
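The frequency-domain steps (windowing, DFT, phase unwrapping, finite-difference group delay) can be sketched with the standard library. The transfer function is estimated here as the ratio of the response spectrum to the action spectrum, frequency bins are taken linearly rather than log-uniformly for brevity, and all function names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft_bin(signal, k):
    """Single DFT coefficient at bin k."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))


def unwrap(phases):
    """Remove 2*pi jumps so that consecutive phases differ by less than pi."""
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        step -= 2.0 * math.pi * round(step / (2.0 * math.pi))
        out.append(out[-1] + step)
    return out


def transfer_phase(action, response, bins):
    """Unwrapped phase of H = Y / X (assumes X is non-zero at every bin)."""
    w = hann(len(action))
    x = [a * wi for a, wi in zip(action, w)]
    y = [r * wi for r, wi in zip(response, w)]
    return unwrap([cmath.phase(dft_bin(y, k) / dft_bin(x, k)) for k in bins])


def group_delay(phases, omegas):
    """Finite-difference approximation of -d(phase)/d(omega)."""
    return [-(phases[i + 1] - phases[i]) / (omegas[i + 1] - omegas[i])
            for i in range(len(phases) - 1)]


n = 32
action = [0.0] * n
response = [0.0] * n
action[5], response[8] = 1.0, 1.0          # response lags the action by 3 frames
bins = list(range(1, 9))
omegas = [2.0 * math.pi * k / n for k in bins]
delays = group_delay(transfer_phase(action, response, bins), omegas)
```

For this pure 3-frame lag every entry of `delays` comes out at about 3 frames; without the unwrapping step the phase jump between bins 5 and 6 would produce a large spurious value.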
[0180] 5. Loss Design:
[0181] where A_S denotes the response amplitude of the student model (estimated by softmax_pool), A_T denotes the response amplitude of the teacher model, δ_S denotes the response time delay of the student model (the time at which the response reaches the set fraction of the amplitude), δ_T denotes the response time delay of the teacher model, and the weighting coefficient sets the weight of the time delay loss relative to the amplitude loss, controlling the balance between "responsiveness" and "power".
[0182]
[0183] Here, the high-frequency weight w(ω) ramps linearly from 0 to 1 after 30% of the training process; the transfer function H(ω) describes, for a sinusoidal input of frequency ω, by how much the amplitude of the system's output is amplified (or attenuated) and by how much its phase lags relative to the input.
[0184] L_resp = L_time + γ·L_freq
[0185] where γ is the weight of the frequency-domain loss L_freq relative to the time-domain loss L_time. Its initial value is 0, which means that in the early stages of training only amplitude and time delay (the time domain) are supervised. After the basic feel is aligned, frequency-domain supervision is gradually added by raising γ from 0 to γ_max (e.g., 0.5) according to the curriculum.
[0186] Figure 7 is a flowchart of the memory loss calculation method in a specific embodiment of this disclosure. It shows how deep alignment of the "memory organization methods" of the teacher and student models is achieved, covering the complete pipeline from memory construction to two-layer alignment. The specific steps of the flowchart are as follows: Step S710. Input data stream.
[0187] The input sequence (including historical frames and actions) is fed into the teacher encoder and student encoder respectively, and the memory construction process begins.
[0188] Step S720. Memory pool construction.
[0189] The teacher and student models each construct their own memory systems: Short-term memory: Initial encoding of the current input.
[0190] Aggregator: Uses architectures such as Transformer or DeepSets to aggregate short-term memories.
[0191] Update strategy: Manage the memory pool through Top-k retention, exponential decay, or gating mechanisms.
[0192] Mid-range summary pool: the set of memories finally formed within the time window.
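A memory pool update that uses Top-k retention, exponential decay, and gating could look like the sketch below. It is a hedged illustration only: `update_pool`, the `(score, vector)` token layout, applying the decay to token scores, combining all three strategies in one update, and the default thresholds are assumptions. The decay factor 1/2 follows the value given later in the description.

```python
import math


def _distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update_pool(pool, new_vec, new_score, k=4, decay=0.5, merge_threshold=0.5):
    """Return the updated pool, a list of (score, vector) pairs.

    Assumed combination of the three strategies:
    decay    -> halve the score of every token already in the pool;
    gating   -> merge the new token into its nearest neighbour when its
                average distance to the pool is below merge_threshold;
    top-k    -> otherwise append it and keep the k highest-scoring tokens.
    """
    pool = [(score * decay, vec) for score, vec in pool]
    if pool:
        dists = [_distance(new_vec, vec) for _, vec in pool]
        if sum(dists) / len(dists) < merge_threshold:
            i = dists.index(min(dists))
            score, vec = pool[i]
            merged = [(x + y) / 2.0 for x, y in zip(vec, new_vec)]
            pool[i] = (max(score, new_score), merged)
            return pool
    pool.append((new_score, new_vec))
    pool.sort(key=lambda token: token[0], reverse=True)
    return pool[:k]


if __name__ == "__main__":
    pool = []
    for vec, score in [([0.0, 0.0], 1.0), ([0.1, 0.0], 0.8),
                       ([5.0, 5.0], 0.3), ([-5.0, -5.0], 0.9)]:
        pool = update_pool(pool, vec, score, k=2)
    print(pool)
```

Merging near-duplicates before the Top-k cut keeps the pool from being dominated by many similar tokens.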
[0193] Step S730. Black-box teacher adaptation.
[0194] For black-box teacher models that cannot access their internal state, an external summarizer scheme is provided to reconstruct the memory representation from the video output using a frozen pre-trained network (such as ResNet+Transformer) and an appropriate frame sampling strategy.
[0195] Step S740. Double-layer memory alignment.
[0196] Memory alignment is achieved at two complementary levels. Level 1, global mindset alignment: the entire memory pool is compressed into a single vector by a global aggregator. Several distribution distances are available: Wasserstein-1 distance, maximum mean discrepancy (MMD), Jensen-Shannon (JS) divergence, and energy distance. The chosen distance gives the global distribution loss, which ensures consistency of the macro-level "thinking patterns" of the teacher and student models.
[0197] Level 2, temporally aligned instance matching: several instance matching strategies are available: direct alignment, learnable projection, and Hungarian optimal matching. The chosen strategy gives the instance alignment loss, which ensures that memory details are aligned at each time step.
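Hungarian matching is not differentiable, so a soft assignment is often used in its place during training. The sketch below uses Sinkhorn normalization as such a stand-in. It assumes equal numbers of student and teacher tokens, a squared L2 cost, and costs that are modest relative to `epsilon` (very large ratios would underflow). The function names and the values of `epsilon` and `n_iters` are illustrative assumptions.

```python
import math


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_assignment(cost, epsilon=0.1, n_iters=50):
    """Approximate a doubly stochastic assignment matrix for a square cost matrix."""
    n = len(cost)
    p = [[math.exp(-c / epsilon) for c in row] for row in cost]
    for _ in range(n_iters):
        # Normalize rows, then columns; repeating this converges toward
        # a matrix whose rows and columns each sum to 1.
        p = [[v / sum(row) for v in row] for row in p]
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p


def instance_alignment_loss(student, teacher, epsilon=0.1):
    """Expected matching cost under the soft assignment, averaged over tokens."""
    cost = [[squared_l2(s, t) for t in teacher] for s in student]
    p = sinkhorn_assignment(cost, epsilon)
    n = len(student)
    return sum(p[i][j] * cost[i][j] for i in range(n) for j in range(n)) / n


if __name__ == "__main__":
    student = [[0.0, 0.0], [1.0, 1.0]]
    teacher = [[1.0, 1.0], [0.0, 0.0]]  # same tokens, permuted order
    print(instance_alignment_loss(student, teacher))
```

Because the assignment is soft, the loss stays small when the student holds the same memories as the teacher in a different order, which direct index-wise alignment would penalize.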
[0198] Step S750. Synthesis of memory loss.
[0199] The losses from the two levels are combined into a total memory loss through a weight-balancing mechanism (adjusting the β parameter).
[0200] This dual-level alignment mechanism ensures both the macro-level consistency of memory organization ("how to think") and the precise transmission of key details ("what to think"), thereby achieving true transmission of the "mindset".
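The two-level synthesis can be illustrated with a small sketch. The additive form (global loss plus β times instance loss) is an assumption. The mean-vector distance used for the global term is a deliberately simple stand-in for the Wasserstein or MMD distances named above, and the instance term uses direct time-step alignment. All function names are hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(pool):
    """Compress a pool of token vectors into a single mean vector."""
    n = len(pool)
    return [sum(token[d] for token in pool) / n for d in range(len(pool[0]))]


def global_loss(teacher_pool, student_pool):
    """Stand-in for a distribution distance: distance between pool means."""
    return squared_l2(mean_vector(teacher_pool), mean_vector(student_pool))


def instance_loss(teacher_pool, student_pool):
    """Direct alignment: mean squared distance between tokens at the same step."""
    pairs = list(zip(teacher_pool, student_pool))
    return sum(squared_l2(t, s) for t, s in pairs) / len(pairs)


def memory_loss(teacher_pool, student_pool, beta):
    """Assumed synthesis: global term plus beta-weighted instance term."""
    return (global_loss(teacher_pool, student_pool)
            + beta * instance_loss(teacher_pool, student_pool))


if __name__ == "__main__":
    teacher = [[0.0, 0.0], [2.0, 2.0]]
    student = [[1.0, 1.0], [1.0, 1.0]]
    print(global_loss(teacher, student), instance_loss(teacher, student))
    print(memory_loss(teacher, student, beta=0.5))
```

In this example the two pools have the same mean, so the global term is zero while the instance term is not. This shows why the second level is needed to transfer details.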
[0201] As shown in Figure 7, the key technical points and data settings in the specific implementation are as follows: 1. Memory construction: Short-term memory vector: taken from the intermediate layers of the backbone encoder (such as a Temporal UNet or Temporal Transformer) rather than from the final output layer; these intermediate layers contain both low-level visual features (texture, edges) and high-level semantic features (object category, motion intent), giving the highest information density. Dimension: consistent with the backbone network, typically 256~1024. A larger dimension carries more information but increases the computational overhead and makes alignment harder.
[0202] Aggregator: the aggregator's task is to compress the sequence of frames into a fixed-size representation, an aggregated "summary" that serves as a compressed representation of a segment of history.
[0203] Transformer-pooling: the Transformer takes the sequence as input tokens, to which positional encodings are added to distinguish temporal order. A CLS token (classification token) is inserted at the beginning of the sequence; after self-attention, the output vector of the CLS token aggregates the information of the entire sequence. DeepSets: aggregation in which ρ denotes an MLP (multilayer perceptron). Sinkhorn feature aggregation: construct a similarity cost matrix and solve for a differentiable doubly stochastic matrix to obtain a weighted aggregation.
[0204] Update strategy: Top-k: score all candidate tokens by activation energy or attention score and retain only the k highest-scoring tokens, discarding the rest. Decay: old tokens that were not selected are multiplied by 1 / 2 at each step, giving exponential decay. Gating: if the average distance from a new token to the existing tokens in the pool is below a threshold, it is merged into the nearest token, which keeps the summary pool from filling with large amounts of similar content and reduces redundancy.
[0205] 2. Global distribution alignment: Wasserstein-1: implemented with a Sinkhorn approximation. MMD: kernel function k(x,y)=exp(-||x-y||² / (2σ²)), with σ taken from the logarithmic grid {0.5, 1.0, 2.0}; the unbiased estimator is used to avoid self-similarity bias. Normalization: tokens first pass through LayerNorm (layer normalization); distances are standardized across batches by mean and variance to stabilize the dynamic range.
[0206] 3. Instance alignment: Projection: π(z) = Wz + b (shared or independent), with L2 distance. Hungarian matching: find the minimum-cost matching of the student / teacher (S / T) token graph within a fixed window; an approximately differentiable version can be obtained with a Sinkhorn assignment matrix P and a norm-based measure computed through P. Temporal consistency: add a smoothing regularization term, the norm of the difference between the projections π(·) at adjacent time steps, to avoid jittery alignment.
[0207] 4. Loss composition and weighting: the memory loss combines the global distribution loss and the instance alignment loss, with β following a curriculum schedule from 0 to 1. The loss terms all enter adaptive weighting, using GradNorm or uncertainty weighting. Early in training the weight of the global distribution term is raised; once the distributions are roughly aligned, the weight of the instance term is gradually increased to refine the details.
[0208] 5. Black-box summarizer implementation: Frame encoder: ResNet-18 / 50 with the classification head removed; output feature maps are reduced to vectors by global average pooling (GAP). Temporal aggregation: single-layer Transformer or GRU (gated recurrent unit), hidden dimension 128~256. Sampling rate: input frames are downsampled at a rate of 1 / 2 or 1 / 4 to keep temporal coverage while controlling overhead. Frozen parameters: the parameters are frozen to prevent the distillation target from shifting because of feature drift on the teacher side.
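The MMD option for global distribution alignment listed above can be sketched in plain Python. This is an illustrative sketch under assumptions: the RBF kernel and the bandwidth grid {0.5, 1.0, 2.0} follow the description, while averaging over bandwidths and the function names are choices made here. The unbiased estimator leaves out the i = j terms within each pool, so its value can be slightly negative when the two pools are nearly identical.

```python
import math


def rbf_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigmas=(0.5, 1.0, 2.0)):
    """Unbiased estimate of squared MMD, averaged over kernel bandwidths."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("need at least two tokens in each pool")
    total = 0.0
    for sigma in sigmas:
        # Within-pool terms skip i == j to avoid self-similarity bias.
        k_xx = sum(rbf_kernel(xs[i], xs[j], sigma)
                   for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
        k_yy = sum(rbf_kernel(ys[i], ys[j], sigma)
                   for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
        k_xy = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys) / (m * n)
        total += k_xx + k_yy - 2.0 * k_xy
    return total / len(sigmas)


if __name__ == "__main__":
    teacher = [[0.0], [0.1], [0.2]]
    near = [[0.05], [0.15], [0.25]]
    far = [[5.0], [5.1], [5.2]]
    print(mmd2_unbiased(teacher, near), mmd2_unbiased(teacher, far))
```

Using several bandwidths makes the distance sensitive to both fine and coarse differences between the two pools.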
[0209] Figure 8 is a schematic flowchart of a dual-channel distillation method according to a specific embodiment of this disclosure. The steps of the flowchart are as follows: Step S810. Initialize synchronization.
[0210] Starting from the beginning of the training batch, initial synchronization is performed first to ensure that the teacher and student models start from the same initial state.
[0211] Step S820. Action sequence sampling.
[0212] Action sequences are sampled, and a hybrid strategy (human control + proceduralization + noise) is used to generate diverse test actions.
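A hybrid sampling strategy of this kind can be sketched as drawing a fixed share of each batch from each source and tracking how many samples each source contributes. The bucket names, the example action names, the ratios, and the function names below are hypothetical; the counts are obtained by rounding, so ratios that do not divide the batch size evenly may yield a slightly different batch length.

```python
import random
from collections import Counter


def sample_actions(buckets, ratios, batch_size, rng):
    """Draw round(batch_size * ratio) actions from each bucket, then shuffle."""
    batch = []
    for name, ratio in ratios.items():
        count = round(batch_size * ratio)
        batch.extend((name, rng.choice(buckets[name])) for _ in range(count))
    rng.shuffle(batch)
    return batch


def coverage(batch):
    """Simple coverage metric: number of sampled actions per bucket."""
    return Counter(name for name, _ in batch)


if __name__ == "__main__":
    buckets = {
        "human": ["flick_180", "brake_into_turn"],   # recorded human control
        "procedural": ["sine_sweep", "step_input"],  # scripted test signals
        "noise": ["random_jitter"],                  # random perturbation
    }
    ratios = {"human": 0.5, "procedural": 0.25, "noise": 0.25}
    batch = sample_actions(buckets, ratios, 8, random.Random(0))
    print(coverage(batch))
```

Fixing the per-source share in every batch keeps any one source from dominating training, and the coverage counter makes a drift in the mix easy to spot.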
[0213] Step S830a. Forward propagation of the teacher model.
[0214] Step S830b. Forward propagation of the student model.
[0215] The teacher model forward propagation and the student model forward propagation are performed simultaneously, generating their respective output frame sequences and memory states.
[0216] The three parallel loss calculation channels are as follows: Perceptual loss channel: directly compare the visual output of the teacher and student models to calculate the perceptual loss.
[0217] Memory loss channel: By aggregating memory pool operations, the global distribution distance and instance distance are calculated separately and finally combined into memory loss.
[0218] Response loss channel: a pipeline of ROI selection → response curve extraction → index estimation computes the time-domain loss and the frequency-domain loss separately and combines them into the response loss.
[0219] Step S840. Adaptive weight calculation.
[0220] The adaptive weight calculation module dynamically adjusts the weights of each loss term to form a joint loss.
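One way to form such a joint loss is homoscedastic uncertainty weighting. The sketch below uses a commonly cited form in which each loss is scaled by 1 / (2σ²) and a log σ term keeps σ from growing without bound. The description does not give its exact formula, so this form is an assumption. In practice each σ would be a learnable parameter; plain floats are used here for illustration.

```python
import math


def uncertainty_weighted_loss(losses, sigmas):
    """Sum of L_i / (2 * sigma_i^2) + log(sigma_i) over all loss terms."""
    if len(losses) != len(sigmas):
        raise ValueError("one sigma is needed per loss term")
    return sum(loss / (2.0 * sigma ** 2) + math.log(sigma)
               for loss, sigma in zip(losses, sigmas))


if __name__ == "__main__":
    # Perceptual, response, and memory losses with every sigma set to 1.
    print(uncertainty_weighted_loss([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

With every σ equal to 1 the log terms vanish and each loss receives the same weight of 0.5. Raising the σ of a noisy term lowers that term's influence.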
[0221] Step S850. Backpropagation.
[0222] Step S860. Parameter update.
[0223] Perform backpropagation and parameter updates to complete one training step.
[0224] As shown in Figure 8, the key technical points and data settings in the specific implementation are as follows: 1. Initialize synchronization: reset the hidden state, random seed, and environment state of the teacher and student models so that they are consistent. The RNG (random number generator) is seeded in a fixed order to ensure repeatability. If environment simulation (physics / rendering) is involved, lock the same initial phase to eliminate exogenous differences.
[0225] 2. Action sequence sampling: The action library is divided into buckets according to template type; each batch is sampled from each bucket according to a set ratio. To avoid distribution offset, stratified sampling is used and coverage metrics (such as hit rate of each frequency band) are tracked.
[0226] 3. Forward propagation: Output: frames∈R^{T×H×W×C}; memories are the aggregate representations or intermediate hidden states at each step. Performance: enable mixed precision to reduce VRAM usage; enable clipping / distillation instrumentation during inference for student models.
[0227] 4. Perceptual loss: Multi-scale computation (e.g., 224 / 128 / 64) to improve scale robustness. Temporal consistency: inter-frame difference perceptual terms can be added to suppress flicker.
[0228] 5. Memory loss: Aggregation window length ∈[32,128]; the token dimension is aligned with the student backbone. The global loss is enabled first, with β gradually increasing from 0 to 1 so that the instance term does not introduce noise early in training.
[0229] 6. Response loss: ROI strategy: during training, the ROI can either be fixed or chosen by a learnable selector to improve adaptability. Frequency-domain term: disabled for the first 30% of training, when only the time-domain loss is kept; the frequency-domain amplitude spectrum is then phased in linearly; the phase spectrum and group delay are added last.
[0230] 7. Adaptive weighting: Uncertainty weighting: initial value 1, updated every N steps (e.g., 1000) to smooth out jitter. GradNorm: the target gradient norm g is taken as the average of the three gradient norms, and the learning rate is consistent with that of the main optimizer.
[0231] 8. Optimization and Stability: Optimizer: AdamW(lr=1e-4, β1=0.9, β2=0.999, wd=0.01); The basic idea of the Adam optimizer is to maintain an adaptive learning rate for each parameter, automatically adjusting it by tracking the first moment (mean) and second moment (variance) of the gradient. lr=1e-4 refers to the global learning rate (base step size); β1 refers to the first moment decay coefficient, β2 refers to the second moment decay coefficient, and wd represents the weight decay.
[0232] Gradient clipping: clip_grad_norm_(θ, 1.0) computes the L2 norm ||g||_2 of the gradients of all parameters (that is, the total length of the gradient vector). If ||g||_2 > 1.0, all gradients are scaled proportionally so that ||g||_2 = 1.0; if ||g||_2 ≤ 1.0, no processing is performed. This ensures that training does not collapse because of occasional large gradients. The learning rate uses cosine annealing with a 5% warm-up. During the warm-up phase (the first 5% of training steps), the learning rate increases linearly from 0 to lr=1e-4. This is because the parameters are random and the gradient direction is chaotic early in training, so a large learning rate can easily push the parameters into a very poor region. A small learning rate is used first to "test the waters", letting the model find a reasonable initial direction before accelerating. During the cosine annealing phase (the remaining 95%), the learning rate decays from lr to near 0 along a cosine curve. A cosine curve decays slowly at first, fastest in the middle, and slowly again near the end, which gives the model time to fine-tune within a good region before landing smoothly.
[0233] Logs and monitoring: key KPIs include the expected value of the time-lag error, the expected value of the amplitude error, the Wasserstein distance (the distribution distance between the memory summary pools of the teacher and student models), FPS (the inference frame rate of the student model), and GPU memory usage.
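The warm-up plus cosine annealing schedule and the global-norm clipping rule described under optimization and stability can be sketched without any framework. The function names are illustrative. The clipping function shows only the underlying arithmetic; it is not the PyTorch routine, which additionally adds a small constant to the norm for numerical safety.

```python
import math


def lr_at(step, total_steps, base_lr=1e-4, warmup_frac=0.05):
    """Linear warm-up from 0 to base_lr, then cosine decay toward 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients together if their joint L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads, norm


if __name__ == "__main__":
    for step in (0, 25, 50, 525, 1000):
        print(step, lr_at(step, 1000))
    print(clip_by_global_norm([3.0, 4.0]))
```

Scaling every gradient by the same factor preserves the update direction, which is why norm clipping is preferred here over clamping each component separately.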
[0234] Figure 9 is an enhanced core technology architecture diagram of a specific embodiment of this disclosure, giving a comprehensive view of the solution and the complete data flow from input to output. The architecture diagram includes the following parts: 1. Input data layer: this layer shows how the three sources of action sequences (human manipulation, procedural testing, and random perturbation) converge into the action sequence generator.
[0235] When processing video sequences output by the teacher model, frame sampling can be performed at 1 / 2 or 1 / 4 rate to balance information density and computational overhead.
[0236] 2. Model Execution Layer: Clearly distinguishes the different roles of the teacher model (a high-quality model with frozen parameters) and the student model (a lightweight model during training).
[0237] 3. Feature Extraction Layer: This layer details the output frames and memory states generated by the teacher and student models, which form the basis for all subsequent loss calculations.
[0238] 4. Dual-channel distillation core: this part shows the parallel processing architecture of "perceptual alignment" (conventional visual quality) and "mental alignment" (memory + feel). Memory organization alignment proceeds along two dimensions, global distribution alignment and instance alignment, while interactive feel alignment proceeds through a complete pipeline: ROI selection → signal extraction → response curve → time-domain / frequency-domain index → loss synthesis.
[0239] 5. Adaptive optimization layer: this layer shows how multiple loss terms are synthesized into a final joint loss through strategies such as adaptive weight adjustment and curriculum learning control.
[0240] Adaptive loss weights can use uncertainty weighting (based on the homoscedastic uncertainty of each loss) or GradNorm to adjust each weight dynamically. In addition, error-triggered gating can be designed: for example, when the moving average of the time-lag error exceeds a preset threshold (e.g., 3 frames), the weight of the response loss is temporarily increased by 20-50% until the error decreases.
[0241] The loss terms are not all activated at the same time. Generally, global distribution alignment is performed first, and instance alignment is added after the model has initially stabilized, to avoid early gradient noise. The frequency-domain loss is activated later still and can be phased in: for example, it is not activated during the first 30% of training, only the amplitude spectrum is aligned from 30% to 70%, and both the amplitude and phase spectra are aligned after 70%.
[0242] 6. Stabilization Enhancement Module: Includes several auxiliary modules to improve training stability and usability, such as error trigger gating, online calibration, and black box adapter.
[0243] In the black-box teacher scenario, the external summarizer can be composed of a frame-level feature encoder (such as ResNet-18) pre-trained and frozen on a large image dataset (such as ImageNet) and a lightweight temporal aggregator (such as single-layer Transformer-pooling or GRU).
[0244] After the model is launched, the drift of the actual response metrics can be estimated retrospectively from a large volume of player logs, and lightweight fine-tuning or control-mapping correction can then be performed, forming an "operations-experience" closed loop.
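The error-triggered gating and the staged activation of the frequency-domain terms mentioned for the adaptive optimization layer can be sketched as two small helpers. The class and function names, the moving-average window length, and the choice to treat the 30% and 70% boundaries as inclusive are assumptions. The 25% boost is one value inside the 20-50% range given in the description.

```python
from collections import deque


class ErrorGate:
    """Temporarily boost a loss weight while a moving-average error is too high."""

    def __init__(self, threshold, boost=0.25, window=3):
        self.threshold = threshold
        self.boost = boost
        self.history = deque(maxlen=window)

    def multiplier(self, error):
        """Record the latest error and return the weight multiplier to apply."""
        self.history.append(error)
        average = sum(self.history) / len(self.history)
        return 1.0 + self.boost if average > self.threshold else 1.0


def active_frequency_terms(progress):
    """Which frequency-domain terms are enabled at a training progress in [0, 1]."""
    return {
        "amplitude_spectrum": progress >= 0.3,
        "phase_spectrum": progress >= 0.7,
    }


if __name__ == "__main__":
    gate = ErrorGate(threshold=3.0)
    print([gate.multiplier(e) for e in (1.0, 5.0, 6.0, 1.0, 1.0)])
    print(active_frequency_terms(0.5))
```

Because the gate acts on a moving average, a single noisy measurement does not trigger the boost, and the boost is released once the error has stayed low for a few steps.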
[0245] To demonstrate the beneficial effects of the model distillation method in this example implementation, the following subjective and objective evaluations can be designed: 1. Objective indicators. Feel error: measure the time-lag error and the amplitude error on the standard test action set.
[0246] Frequency-domain distortion: compute the norm of the differences in the amplitude and phase spectra.
[0247] Memory capability: Calculate the Wasserstein / MMD distance of the memory summary pool distribution; test accuracy on long-range recall tasks (such as specific entity re-identification after N steps).
[0248] 2. Subjective blind testing
[0249] High-level players scored the responsiveness, camera tracking, and drift / shaking feel.
[0250] Testers evaluate the consistency of NPC long-range behavior.
[0251] On the same hardware (such as RTX 4060), perform A / B testing and report the comparison curves in three dimensions: FPS, quality, and feel.
[0252] 3. Ablation test
[0253] Remove each component in turn, such as individual loss terms or frequency-domain alignment, and quantify the contribution of each component to the final performance.
[0254] Compare the impact of different ROI selection strategies and different memory window lengths.
[0255] The following are some specific examples: Example 1: Racing game (feel-intensive). Application scenario: develop a real-time AI opponent for a racing game whose "driving feel" is highly consistent with that of a realistic but non-real-time "teacher AI driver".
[0256] Process demonstration: 1. Action Sequence: Design standardized test actions, such as "high-speed entry into a curve and emergency braking" and "rapid lane change after straight-line acceleration".
[0257] 2. ROI and Signal: Set the ROI at the vehicle's "taillight" or "front bumper", record the changes in optical flow in that area, and generate a response curve.
[0258] 3. Feel alignment: extract A (the maximum optical flow value during steering) and δ (the time from steering wheel rotation to half the peak optical flow value) from the curve. If the student model's δ is larger than the teacher model's, the response is "slow" and the loss function penalizes it; if the student model's A is smaller than the teacher model's, the steering is "weak" and is also penalized.
[0259] 4. Memory alignment: over a driving sequence spanning several laps, the distributions of the teacher and student AI's memory summaries for key track points (such as specific corners and landmarks) are compared and the memory loss is computed. This ensures that the student learns the teacher's "track memory method".
[0260] 5. Combined training: Combine the above losses to conduct end-to-end training.
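The feel-alignment step of this example can be sketched as follows: extract the amplitude A (peak of the response curve) and the time lag δ (time to reach half the peak) for both models, then penalize the differences. The function names, the hard maximum (the description elsewhere mentions a softmax_pool estimate, which would be differentiable), the symmetric L1 penalty, and the weight `lam` are assumptions made for illustration.

```python
def amplitude_and_lag(curve, dt, onset=0):
    """Return (A, delta): the peak value and the time to first reach half the peak.

    curve is a list of response samples, dt the sample spacing, and onset
    the index at which the input action starts.
    """
    segment = curve[onset:]
    peak = max(segment)
    half = 0.5 * peak
    for i, value in enumerate(segment):
        if value >= half:
            return peak, i * dt
    return peak, 0.0  # reached only if the peak is negative


def time_domain_loss(student_curve, teacher_curve, dt, lam=1.0):
    """Assumed form: |A_s - A_t| + lam * |delta_s - delta_t|."""
    a_s, d_s = amplitude_and_lag(student_curve, dt)
    a_t, d_t = amplitude_and_lag(teacher_curve, dt)
    return abs(a_s - a_t) + lam * abs(d_s - d_t)


if __name__ == "__main__":
    teacher = [0.0, 2.0, 6.0, 10.0, 8.0]  # optical-flow response, teacher
    student = [0.0, 1.0, 2.0, 4.0, 6.0]   # slower and weaker response
    print(amplitude_and_lag(teacher, dt=1.0), amplitude_and_lag(student, dt=1.0))
    print(time_domain_loss(student, teacher, dt=1.0))
```

In this example the student reaches half its peak one sample later than the teacher and peaks 4 units lower, so both the "slow response" and the "weak steering" terms contribute to the loss.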
[0261] Example 2: Distillation of "Feel" and "Tactical Memory" in Fast-Paced Action Games
[0262] Application scenario settings: • Game Background: A sci-fi action game with highly mobile and fast-paced PVE combat as its core.
[0263] • Technical Challenges: A "teacher model" running on a cloud server boasts cinematic visuals, accurate weapon recoil physics simulation, and complex enemy tactical AI capable of predicting player behavior. However, this model is computationally intensive, achieving only 15-20 FPS on mainstream gaming graphics cards (such as the RTX 4060), far from meeting the demands of real-time interaction. In contrast, a lightweight "baseline student model" designed solely for real-time performance can reach 120 FPS, but its controls are generally described by test players as "unstable" and "lacking in power," with enemy AI behavior being "rigid," completely lacking the tactical challenge offered by the teacher model.
[0264] • Objective: To train a new “student model” using a dual-channel distillation method, enabling it to run stably at over 100 FPS on an RTX 4060, while perfectly replicating the “responsiveness” and “tactical memory” of the teacher model.
[0265] Process demonstration: 1. Initialization and Action Sampling: A typical tactical scenario is selected, where the player controls the character to perform a difficult "180-degree turn and flick shot" maneuver, quickly locking onto and attacking a distant elite enemy. This action is precisely recorded as a standardized action sequence.
[0266] 2. Model forward propagation: The teacher model and the student model receive the exact same initial state (character position, enemy position, etc.) and action sequence, and propagate forward in parallel to generate their own video frame sequences (teacher_frames, student_frames) and internal memory states (teacher_memories, student_memories).
[0267] 3. Three-channel loss calculation (core module): • Perceptual loss channel: compute the LPIPS loss between teacher_frames and student_frames to ensure that the visual style and lighting effects of the student model are highly consistent with those of the teacher model.
[0268] • Response loss channel (feel alignment): • ROI Selection: The learnable ROI selector automatically and dynamically determines a 32x32 pixel ROI around the weapon's "holographic crosshair".
[0269] • Visual signal extraction: Extract the screen space displacement y(t) of the key point (crosshair center point) within the ROI as the visual signal.
[0270] • Indicator estimation and alignment: Time lag δ (responsiveness): the time from the mouse input until the crosshair displacement settles within a very small neighborhood of the target point. The lag measured for the teacher model is 8 ms, while that of the baseline student model is 25 ms (which players perceive as noticeably "sluggish"). The loss term drives the student model's lag toward 8 ms.
[0271] • Amplitude A (sense of power): the maximum displacement of the crosshair caused by recoil after firing. For the teacher model this is 15 pixels, which gives strong and controllable feedback. The baseline student model may produce only 5 pixels (feeling "soft") or an irregular 30 pixels ("excessively jittery"). The loss term penalizes this deviation.
[0272] • Frequency-domain characteristics (crispness): by comparing the response-curve spectra of the teacher and student models, the student model is made to reproduce the "crisp and efficient" shooting feedback of the teacher model rather than a "sluggish" texture.
[0273] • Memory loss pathway (tactical memory alignment): • Scenario Setting: After the player fires, the elite enemy immediately releases a "phase smoke grenade" and disappears from sight. The teacher model can accurately predict that the enemy will most likely reappear from behind cover on the left, based on the battlefield terrain and the enemy's attack pattern.
[0274] • Memory pool construction and alignment: within a time window of 64 frames, the teacher and student models each generate memory summary pools M_T and M_S.
[0275] • Global distribution alignment: by computing the Wasserstein distance W(M_T, M_S), the student model is forced to learn the teacher model's macroscopic understanding of the whole battle situation, such as "when smoke is present, the path with the highest probability of enemy attack is the left cover".
[0276] • Instance alignment: ensure that the student model's memory summary of the specific event "elite enemy disappears at point A" precisely matches the teacher model's corresponding memory in the vector space, so that key details are remembered.
[0277] 4. Adaptive optimization and updates: Curriculum learning: in the initial training phase, the perceptual loss and the global distribution loss are emphasized so that the student model first "looks right" and "thinks logically"; in the middle stage, the weights of the response loss and the instance alignment loss are gradually increased to fine-tune the "feel" and the "memory details".
[0278] Error-triggered gating: when the moving average of the time-lag error stays above 10 ms for multiple consecutive frames, the weight controller temporarily increases the weight of the response loss by 25%, forcing the model to prioritize the operation delay problem.
[0279] Parameter update: the final joint loss is backpropagated, and the parameters of the student model are updated by the AdamW optimizer.
[0280] 5. Result verification and data comparison: The evaluation metrics are as follows: • Average frame rate (RTX 4060): 18 FPS (teacher model) vs 120 FPS (student model). The frame rate drops slightly, by 5%, but remains far above the standard for smooth play; this small cost is exchanged for the quality of the core gaming experience.
[0281] • Flick-shot time-lag error (expected value): 0 ms (teacher model) vs 3 ms (student model), an 82% reduction in absolute error relative to the baseline student model's 17 ms (25 ms against the teacher's 8 ms). Responsiveness is greatly improved, almost reaching the level of the teacher model.
[0282] • Recoil amplitude error (expected value): 0 pixels (teacher model) vs 1.2 pixels (student model). The force feedback is realistic and controllable, solving the problem of the baseline solution being "soft" or "excessively jittery".
[0283] • Tactical prediction accuracy (Probe Task): 95% (Teacher model) vs 92% (Student model). The AI's tactical memory ability is at the same level as the teacher model, making its behavior more logical and threatening.
[0284] • Subjective blind test rating (professional gamers, N=20): N / A (teacher model) vs 8.9 / 10 (student model). Professional gamers commented: "Smooth operation, clear feedback, the AI can finally use its brain."
[0285] 6. Conclusions from the comparison of the proposed solutions: This example implementation successfully solves the fundamental problem of student models in traditional solutions being "similar in form but lacking in spirit" through a dual-channel approach of "feel" and "memory." At the cost of only a small sacrifice in frame rate, it achieves a significant improvement in core game experience (operation feel and AI intelligence), providing a complete and effective engineering solution for deploying high-quality world models on ordinary player hardware.
[0286] Example 3: Open-world NPCs (Long-range memory)
[0287] Application scenario: Distill a lightweight AI NPC model for open-world games, enabling it to possess long-term memory capabilities comparable to heavyweight teacher models.
[0288] Process demonstration: 1. Memory Window: Extend the coverage to include multiple conversations or multiple game days.
[0289] 2. Memory Alignment: The focus is on summaries of key plot information, player preferences, and historical interaction events.
[0290] 3. Task verifiability: introduce an additional probe task loss. For example, the teacher and student models can be asked whether they discussed a certain topic with the player a few days ago. The accuracy of the answers can be included in the joint optimization objective to ensure that memories are not only "stored accurately" but also "used correctly".
[0291] It should be noted that although the steps of the method in this disclosure are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and / or a step may be broken down into multiple steps.
[0292] Furthermore, this disclosure also provides a model distillation apparatus. As shown in Figure 10, the model distillation apparatus may include an action sequence input module 1010, a perceptual loss determination module 1020, a visual index determination module 1030, a response loss determination module 1040, a memory loss determination module 1050, and a model parameter update module 1060, wherein: the action sequence input module 1010 can be used to input action sequence samples into the teacher model and the student model respectively, to obtain the first video frame sample and the first memory information output by the teacher model, and the second video frame sample and the second memory information output by the student model. The perceptual loss determination module 1020 can be used to obtain the perceptual loss based on the image features of the first video frame sample and the image features of the second video frame sample. The visual index determination module 1030 can be used to extract the first time-domain index and the first frequency-domain index of the visual signal in the region of interest of the first video frame sample, and the second time-domain index and the second frequency-domain index of the visual signal in the region of interest of the second video frame sample. The response loss determination module 1040 can be used to obtain the time-domain loss based on the first time-domain index and the second time-domain index, obtain the frequency-domain loss based on the first frequency-domain index and the second frequency-domain index, and obtain the response loss based on the time-domain loss and the frequency-domain loss.
The memory loss determination module 1050 can be used to determine the global distribution loss of the teacher model and the student model, as well as the instance alignment loss at each time point, based on the first memory information and the second memory information, and to obtain the memory loss based on the global distribution loss and the instance alignment loss. The model parameter update module 1060 can be used to obtain a joint loss based on at least one of the perceptual loss, response loss, and memory loss, and update the model parameters in the student model based on the joint loss to obtain an interaction model.
[0293] The specific details of each module in the distillation apparatus of the above model have been described in detail in the corresponding method embodiment section, and will not be repeated here.
[0294] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to exemplary embodiments of this disclosure, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.
[0295] Figure 11 shows a schematic diagram of the structure of a computer system suitable for implementing the embodiments of the present disclosure.
[0296] It should be noted that the computer system 1100 of the electronic device shown in Figure 11 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments disclosed herein.
[0297] As shown in Figure 11, the computer system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 1102 or programs loaded from storage section 1108 into random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for system operation. The CPU 1101, ROM 1102, and RAM 1103 are interconnected via a bus 1104. An input / output (I / O) interface 1105 is also connected to the bus 1104.
[0298] The following components are connected to I / O interface 1105: an input section 1106 including a keyboard, mouse, etc.; an output section 1107 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 1108 including a hard disk, etc.; and a communication section 1109 including a network interface card such as a LAN card, modem, etc. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to I / O interface 1105 as needed. Removable media 1111, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., are installed on drive 1110 as needed so that computer programs read from them can be installed into storage section 1108 as needed.
[0299] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 1109, and / or installed from removable medium 1111. When the computer program is executed by central processing unit (CPU) 1101, it performs various functions defined in the system of this disclosure.
[0300] Exemplary embodiments of this disclosure also provide a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the distillation method of the above-described model.
[0301] In one implementation, the computer program product can be a tangible product containing a computer program, such as a computer-readable storage medium storing the computer program. The readable storage medium can be a storage medium based on electrical, magnetic, optical, electromagnetic, infrared, or other signals, including but not limited to: random access memory (RAM), read-only memory (ROM), magnetic tape, floppy disk, flash memory, hard disk drive (HDD), solid-state drive (SSD), etc. For example, the computer program product can be implemented as a non-volatile storage medium storing a computer program, such as read-only memory, NAND flash memory, etc.
[0302] In one implementation, the computer program product can be an intangible product containing a computer program. For example, the computer program product can be implemented as a virtual digital product, such as an executable file, installation package, or other digital file storing the computer program.
[0303] Computer program code can be written in one or more programming languages. Examples of programming languages include C, Java, and C++. Program code can execute entirely on the user's computing device, partially on the user's computing device, or as a standalone software package. It can also execute partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, such as a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via an internet connection provided by a mobile network operator).
[0304] Computer programs can be carried or transmitted via signals such as electrical, magnetic, optical, electromagnetic, and infrared signals. Electronic devices can convert signals carrying computer programs into digital signals and thereby run the computer programs. When a computer program runs on an electronic device, its code causes the electronic device (more specifically, the processor of the electronic device) to execute the method steps of various exemplary embodiments of this disclosure, such as the model distillation method described above.
[0305] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0306] It should be noted that although several modules of the apparatus for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of this disclosure, the features and functions of two or more modules described above can be embodied in one module. Conversely, the features and functions of one module described above can be further divided and embodied by multiple modules.
[0307] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein.
[0308] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. A model distillation method, characterized in that the method comprises:
inputting action sequence samples into a teacher model and a student model respectively, to obtain a first video frame sample and first memory information output by the teacher model, and a second video frame sample and second memory information output by the student model;
obtaining a perceptual loss based on the image features of the first video frame sample and the image features of the second video frame sample;
extracting a first time-domain index and a first frequency-domain index of the visual signal in the region of interest of the first video frame sample, and a second time-domain index and a second frequency-domain index of the visual signal in the region of interest of the second video frame sample;
obtaining a time-domain loss based on the first time-domain index and the second time-domain index, obtaining a frequency-domain loss based on the first frequency-domain index and the second frequency-domain index, and obtaining a response loss based on the time-domain loss and the frequency-domain loss;
determining, based on the first memory information and the second memory information, a global distribution loss between the teacher model and the student model and an instance alignment loss at each time point, and obtaining a memory loss based on the global distribution loss and the instance alignment loss; and
obtaining a joint loss based on at least one of the perceptual loss, the response loss, and the memory loss, and updating the model parameters of the student model based on the joint loss to obtain an interaction model.
2. The model distillation method according to claim 1, characterized in that the method further comprises:
acquiring control data for a game object, and sampling an action sequence of the game object based on basic action data of the game object, the control data, and noise data.
3. The model distillation method according to claim 1, characterized in that the step of obtaining the perceptual loss based on the image features of the first video frame sample and the image features of the second video frame sample comprises:
inputting the first video frame sample and the second video frame sample into a perception model, and extracting the image features of the first video frame sample and of the second video frame sample at multiple intermediate layers of the neural network of the perception model; and
obtaining a feature distance between the first video frame sample and the second video frame sample based on the feature differences of the image features at the multiple intermediate layers, and obtaining the perceptual loss based on the feature distance.
4. The model distillation method according to claim 3, characterized in that the step of obtaining the feature distance between the first video frame sample and the second video frame sample based on the feature differences of the image features at the multiple intermediate layers comprises:
applying unit-variance normalization to the image features of the first video frame sample and the second video frame sample at the multiple intermediate layers to obtain normalized features; and
calculating the feature differences of the normalized features at the multiple intermediate layers based on a preset sliding window, and obtaining the feature distance between the first video frame sample and the second video frame sample based on the feature differences.
5. The model distillation method according to claim 1, characterized in that the step of extracting the first time-domain index and the first frequency-domain index of the visual signal in the region of interest of the first video frame sample comprises:
determining the region of interest from the first video frame sample based on a preset region selection strategy;
extracting, based on the scene type, a target visual signal from multiple visual signals in the region of interest, and constructing a corresponding response curve based on the target visual signal; and
extracting the first time-domain index and the first frequency-domain index of the target visual signal based on the response curve corresponding to the target visual signal;
wherein the time-domain indices include amplitude and time delay, and the frequency-domain indices include amplitude spectrum, phase spectrum, and group delay.
6. The model distillation method according to claim 5, characterized in that the method further comprises:
if multiple regions of interest exist, performing weighted fusion of the target visual signals extracted from the respective regions of interest.
7. The model distillation method according to claim 1, characterized in that the step of determining the global distribution loss between the teacher model and the student model based on the first memory information and the second memory information comprises:
constructing a mid-range memory summary pool of the teacher model based on the first memory information, and constructing a mid-range memory summary pool of the student model based on the second memory information; and
compressing each mid-range memory summary pool into a single global memory vector by a global aggregator, and obtaining the global distribution loss based on the global memory vectors of the teacher model and the student model.
8. The model distillation method according to claim 7, characterized in that the first memory information includes the short-term memory vectors of the teacher model at various time points, and the step of constructing the mid-range memory summary pool of the teacher model based on the first memory information comprises:
aggregating, by an aggregator, the short-term memory vectors of the teacher model at the time points within a preset time window to obtain the memory information set of the preset time window, and obtaining the mid-range memory summary pool of the teacher model based on the memory information set of the preset time window.
9. The model distillation method according to claim 8, characterized in that the step of determining the instance alignment loss between the teacher model and the student model at each time point based on the first memory information and the second memory information comprises:
aligning, based on a preset instance matching strategy, the short-term memory vectors in the memory information sets of the teacher model and the student model, and determining the instance alignment loss between the teacher model and the student model at each time point according to the aligned short-term memory vectors.
10. The model distillation method according to claim 8, characterized in that the method further comprises:
if the teacher model is a black-box model, obtaining the mid-range memory summary pool of the teacher model through an external summarizer; wherein the external summarizer includes a cascaded structure of a frame-level feature encoder and a temporal aggregator, and the model parameters of the external summarizer are frozen during training.
11. The model distillation method according to claim 1, characterized in that the method further comprises:
adaptively adjusting the loss weight of at least one of the perceptual loss, the response loss, and the memory loss based on a preset weight adjustment strategy.
12. The model distillation method according to claim 11, characterized in that the method further comprises:
when the moving average of the time-domain lag difference between the first video frame sample and the second video frame sample is greater than or equal to a preset time-lag difference threshold, increasing the loss weight corresponding to the response loss.
13. The model distillation method according to claim 11, characterized in that the method further comprises:
determining the activation phases of the perceptual loss, the response loss, and the memory loss based on a preset activation strategy.
14. A model distillation apparatus, characterized in that the apparatus comprises:
an action sequence input module, used to input action sequence samples into a teacher model and a student model respectively, to obtain a first video frame sample and first memory information output by the teacher model, and a second video frame sample and second memory information output by the student model;
a perceptual loss determination module, used to obtain a perceptual loss based on the image features of the first video frame sample and the image features of the second video frame sample;
a visual index determination module, used to extract a first time-domain index and a first frequency-domain index of the visual signal in the region of interest of the first video frame sample, and a second time-domain index and a second frequency-domain index of the visual signal in the region of interest of the second video frame sample;
a response loss determination module, used to obtain a time-domain loss based on the first time-domain index and the second time-domain index, obtain a frequency-domain loss based on the first frequency-domain index and the second frequency-domain index, and obtain a response loss based on the time-domain loss and the frequency-domain loss;
a memory loss determination module, used to determine, based on the first memory information and the second memory information, a global distribution loss between the teacher model and the student model and an instance alignment loss at each time point, and to obtain a memory loss based on the global distribution loss and the instance alignment loss; and
a model parameter update module, used to obtain a joint loss based on at least one of the perceptual loss, the response loss, and the memory loss, and to update the model parameters of the student model based on the joint loss to obtain an interaction model.
15. An electronic device, characterized in that the electronic device comprises:
a processor; and
a memory for storing one or more programs that, when executed by the processor, cause the processor to implement the model distillation method according to any one of claims 1 to 13.
16. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the model distillation method according to any one of claims 1 to 13.
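Claim 5 names amplitude and time delay as time-domain indices, and amplitude spectrum, phase spectrum, and group delay as frequency-domain indices of a response curve. A minimal Python sketch of one way these indices and a response loss might be computed with NumPy is given below for illustration. The function names, the use of the peak of the curve to define amplitude and delay, the L1 comparison, and the weights `alpha` and `beta` are all assumptions and are not taken from the claims.

```python
import numpy as np


def time_domain_indices(curve, onset_frame=0):
    """Return (peak amplitude, delay in frames from onset to the peak).

    Assumption: amplitude and time delay are both read off the largest
    absolute value of the response curve.
    """
    curve = np.asarray(curve, dtype=float)
    peak = int(np.argmax(np.abs(curve)))
    return float(np.abs(curve[peak])), peak - onset_frame


def frequency_domain_indices(curve, frame_rate):
    """Return (amplitude spectrum, unwrapped phase spectrum, group delay)."""
    curve = np.asarray(curve, dtype=float)
    spectrum = np.fft.rfft(curve)
    freqs = np.fft.rfftfreq(curve.size, d=1.0 / frame_rate)
    amplitude = np.abs(spectrum)
    phase = np.unwrap(np.angle(spectrum))
    # Group delay is the negative derivative of phase with respect to
    # angular frequency (omega = 2 * pi * f), in seconds.
    group_delay = -np.gradient(phase, 2.0 * np.pi * freqs)
    return amplitude, phase, group_delay


def response_loss(teacher_curve, student_curve, frame_rate, alpha=1.0, beta=1.0):
    """Hypothetical response loss: weighted time-domain plus frequency-domain loss."""
    amp_t, delay_t = time_domain_indices(teacher_curve)
    amp_s, delay_s = time_domain_indices(student_curve)
    # Time-domain loss: amplitude gap plus delay gap converted to seconds.
    time_loss = abs(amp_t - amp_s) + abs(delay_t - delay_s) / frame_rate
    # Frequency-domain loss: mean absolute gap of each spectral index.
    freq_t = frequency_domain_indices(teacher_curve, frame_rate)
    freq_s = frequency_domain_indices(student_curve, frame_rate)
    freq_loss = sum(float(np.mean(np.abs(t - s))) for t, s in zip(freq_t, freq_s))
    return alpha * time_loss + beta * freq_loss
```

For example, a student response curve that is a delayed copy of the teacher's curve has the same amplitude spectrum but a different phase spectrum and group delay, so the loss is nonzero even though the per-frame pixel content matches after the shift. Phase and group delay are numerically unreliable at frequency bins where the amplitude is close to zero; a practical implementation would likely mask or down-weight such bins.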