An edge vision model optimization method and system based on a dynamic inference kernel

By employing a dynamic instance feature binding engine, an incremental parameter operation mechanism, and a self-supervised verification framework, the method achieves second-level model optimization on edge devices without data backflow. This addresses the small-sample detection accuracy bottleneck and the poor timeliness of on-site adaptation, improving detection accuracy and reducing adaptation costs.

CN121640109B · Active · Publication Date: 2026-06-23 · LINKER

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: LINKER
Filing Date: 2025-10-09
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing technologies face technical challenges such as limitations in small-sample detection accuracy on edge devices, reliance on and risks associated with data feedback, and poor timeliness for on-site adaptation, making it difficult to achieve second-level model evolution and low forgetting rates.

Method used

By employing a Dynamic Instance Feature Binding Engine (DSFBE), an edge-side incremental parameter operation mechanism, and a self-supervised instance validity verification framework, a closed-loop application paradigm of "instance input - real-time inference - local evolution" is constructed, and model optimization is achieved on edge devices through a dynamically deformable inference kernel.

Benefits of technology

It achieves zero data backflow, second-level model evolution, lightweight design and low forgetting, improves the detection accuracy of new categories by 25%, reduces adaptation costs by 90%, has a memory footprint of less than 1GB, and is compatible with embedded systems.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121640109B_ABST
Patent Text Reader

Abstract

The application discloses an edge vision model optimization method and system based on a dynamic inference kernel. The method comprises the following steps: first, a cross-modal task prototype is generated from a small number of example images of a new target; next, a parameter projection network based on the neural radiance field (NeRF) principle maps the prototype directly to a task-specific dynamic deformable inference kernel; finally, combined with self-supervised pseudo-labels, a parameter isolation update strategy binds the inference kernel to a specified layer of the pre-trained model as a hard update parameter, and only a small number of key model parameters undergo fast gradient updates. By constructing an edge-side closed loop of "example input - local evolution", the application achieves second-level model evolution without data backflow or manual labeling, significantly improves detection accuracy on new tasks, effectively suppresses forgetting of old knowledge, and offers low adaptation cost, fast response, and high security.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and computer vision, and in particular to an edge vision model optimization method and system based on a dynamic inference kernel, for fast and efficient optimization of vision models on edge computing devices.

Background Technology

[0002] With the rapid development of artificial intelligence technology, intelligent vision technology has been widely applied in many fields such as industrial quality inspection, security monitoring, autonomous driving, and smart retail. In these scenarios, models are typically deployed on edge devices with limited computing power, storage, and power consumption to meet the requirements of low latency and data privacy. However, traditional intelligent vision systems face significant challenges in edge applications.

[0003] The main pain points of existing technologies are as follows:

[0004] Limited sample size and accuracy bottleneck: Traditional object detection models, such as YOLOv8 and Faster R-CNN, typically rely on large-scale labeled datasets for training. When it's necessary to quickly adapt to a new category in the field (e.g., a new type of industrial product defect), only a small number of samples can often be collected. Under these limited sample conditions, fine-tuning the model results in a significant drop in average precision (AP), making it difficult to meet production requirements. Generally, at least 100 labeled samples are needed to achieve a satisfactory fine-tuning effect.

[0005] Data feedback dependency and risks: Existing incremental or continual learning frameworks, such as MARS, can handle new data but generally employ an "edge acquisition - cloud training - edge update" model. This model relies heavily on network connectivity and incurs a network latency of at least 200ms, which cannot meet the demands of real-time response. More importantly, feeding field-acquired data (such as industrial defect images containing trade secrets) back to the cloud poses a significant risk of data privacy breaches and generates substantial cloud computing costs.

[0006] Poor on-site adaptability and timeliness: In actual delivery scenarios, when a new target or task is encountered (such as discovering a previously unseen type of equipment defect during substation inspection), the traditional process requires "on-site data collection -> manual annotation -> data feedback -> cloud retraining -> model deployment". The entire response cycle averages more than 7 days, which cannot meet the requirements of modern production for real-time performance and agility.

[0007] In addition, some of the related technologies proposed by academia also have limitations:

[0008] Meta-learning solutions (such as MAML and Reptile): While the MAML algorithm is designed to quickly adapt to new tasks, it typically requires dozens of iterations (taking more than 10 seconds) to converge, making it difficult to implement in real-time for edge devices with limited computing power. Algorithms like Reptile, while adapting quickly, suffer from catastrophic forgetting of older tasks, with forgetting rates exceeding 15%.

[0009] Lightweight fine-tuning frameworks (such as TinyCLIP): While these frameworks compress the model, they still require a complete backpropagation process during fine-tuning. On typical embedded devices such as the Raspberry Pi 4B, the time taken for a single parameter update is still more than 5 seconds, which cannot meet the requirements for second-level evolution.

[0010] Parameter-free example inference (such as CLIP): The CLIP model relies on its strong pre-trained semantic alignment capabilities for zero-shot inference, but for fine-grained categories that were not seen during training and lack explicit textual descriptions (such as "crack No. 2 in industrial gear"), there is a semantic gap, and the detection accuracy of direct application is extremely low (e.g., 32%), making it unreliable in industrial scenarios.

[0011] Therefore, there is an urgent need for a new technical solution that requires no data backflow, can utilize a very small number of samples, achieves second-level model evolution at the edge, and effectively avoids forgetting old knowledge.

Summary of the Invention

[0012] This invention mainly addresses the technical problems of existing technologies, such as bottlenecks in small sample detection accuracy, dependence on data backflow, and poor timeliness of on-site adaptation. It provides a method and system for optimizing edge vision models based on dynamic inference kernels, which features zero data backflow adaptation, second-level evolution capability, lightweight design, and low forgetting.

[0013] The core of this invention lies in constructing a closed-loop application paradigm of "example input - real-time inference - local evolution", which is mainly supported by three core innovations: Dynamic Instance Feature Binding Engine (DSFBE), edge-end incremental parameter operation mechanism, and self-supervised example validity verification framework.

[0014] First, this invention proposes a Dynamic Instance Feature Binding Engine (DSFBE) for efficiently injecting information from a small number of instances into a pre-trained model. This engine constructs a three-stage processing pipeline:

[0015] Semantic-visual dual prototype generation: Extracting example visual features using a lightweight version of DINOv2, and combining it with the CLIP text encoder to generate cross-modal prototype pairs (P_v, P_t).

[0016] Deformable Inference Kernel Generation: Based on the Neural Radiance Field (NeRF) principle, prototype pairs are encoded into a task-specific deformable convolutional kernel K_task, which is bound to the ROIAlign layer of the model. The binding is achieved through a "dynamic parameter injection mechanism": K_task is parsed into a spatial offset matrix (Δx, Δy) and weight adjustment coefficients W; during the feature sampling stage of the ROIAlign layer, (Δx, Δy) is injected into the calculation of the sampling point coordinates, so that the original coordinates (x, y) become (x + Δx_{i,j}, y + Δy_{i,j}), and the weights W are applied to the feature values at the adjusted sampling points. Parameters are loaded in real time through memory-mapped files, without recompiling the network structure, and the binding time is ≤10ms.

[0017] Dual-branch inference routing: The globally pre-trained model and the example subgraph branch loaded with K_task run in parallel, and their outputs are fused through an attention gating mechanism. The fusion process consists of four steps: feature extraction, attention weight calculation, feature fusion (using a weighted fusion formula), and output correction (applying a batch normalization (BN) layer). This mechanism can improve the confidence of new category detection by 25%, with a total fusion time of ≤30ms.
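
The weighted fusion formula is not given in the text. As a minimal sketch of one common form of attention gating, the following assumes a scalar gate computed by a sigmoid over a linear score of the concatenated branch features; the function names, the gate parameters `gate_w` and `gate_b`, and the omission of the final BN correction are all illustrative assumptions, not the patented formula.

```python
import math

def sigmoid(z):
    """Logistic function used to squash the attention score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(global_feat, example_feat, gate_w, gate_b):
    """Fuse two equally sized feature vectors with a scalar attention gate.

    gate  = sigmoid(gate_w . [global_feat; example_feat] + gate_b)
    fused = gate * example_feat + (1 - gate) * global_feat
    The gate form is an assumption; the source only names "attention gating".
    """
    concat = global_feat + example_feat          # list concatenation
    score = sum(w * x for w, x in zip(gate_w, concat)) + gate_b
    gate = sigmoid(score)
    fused = [gate * e + (1.0 - gate) * g
             for g, e in zip(global_feat, example_feat)]
    return fused, gate

# With zero gate parameters the gate is 0.5 and both branches count equally.
global_feat = [1.0, 0.0]
example_feat = [0.0, 1.0]
fused, gate = gated_fusion(global_feat, example_feat, [0.0] * 4, 0.0)
```

A larger gate value shifts the fused output toward the example subgraph branch, which is the behaviour one would expect when a newly bound K_task is relevant to the input.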

[0018] Secondly, in order to avoid catastrophic forgetting while quickly adapting to new tasks, this invention designs an edge-end incremental parameter surgery mechanism and proposes a ternary parameter isolation update strategy:

[0019] Frozen layer (80% of parameters): The pre-trained visual backbone network (such as Swin-T) is fixed to retain its powerful general feature extraction capabilities.

[0020] Soft update layer (17% of parameters): The cross-modal fusion module is updated smoothly using exponential moving average (EMA) to ensure a smooth transition between old and new knowledge, resulting in a forgetting rate of <3%.

[0021] Hard update layer (3% of parameters): Only the deformable convolutional kernel K_task bound to the example and the final classification head are updated. Second-level updates are achieved through local gradient masking.

[0022] Furthermore, to achieve automated evolution without human intervention, this invention constructs a self-supervised example validity verification framework. This framework designs an entropy-diversity dual-threshold filtering mechanism and integrates an automatic annotation generator.

[0023] Entropy filtering: When the predicted entropy value is >1.8, it is determined to be a valid new category, avoiding repeated learning of known categories.

[0024] Feature diversity measure: Redundant examples with similarity > 0.9 are removed by local feature PCA projection to ensure that 5 examples can cover 85% of the feature space.

[0025] Automatic label generation: A pseudo-label generator based on Teacher-Student distillation generates high-quality labels for valid examples for model updates without manual annotation. The Teacher model is a pre-trained general-purpose model (such as DINOv2+FPN) deployed at the edge, while the Student model is the lightweight inference model of this system. The Student model is optimized using distillation loss L_distill=L1(B_stu,B_tea)+CrossEntropy(P_stu,P_tea).
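
The distillation loss L_distill above can be sketched as follows. The reduction used for the L1 term (mean over the four box coordinates) and the use of the Teacher distribution as soft targets in the cross-entropy term are assumptions; the text does not specify either.

```python
import math

def l1_loss(b_stu, b_tea):
    """Mean absolute difference between Student and Teacher box coordinates."""
    return sum(abs(s - t) for s, t in zip(b_stu, b_tea)) / len(b_stu)

def cross_entropy(p_stu, p_tea, eps=1e-12):
    """Cross-entropy of the Student distribution against Teacher soft targets."""
    return -sum(t * math.log(s + eps) for s, t in zip(p_stu, p_tea))

def distill_loss(b_stu, b_tea, p_stu, p_tea):
    """L_distill = L1(B_stu, B_tea) + CrossEntropy(P_stu, P_tea)."""
    return l1_loss(b_stu, b_tea) + cross_entropy(p_stu, p_tea)

# Hypothetical boxes (x1, y1, x2, y2) and two-class probabilities.
loss = distill_loss(
    b_stu=[10.0, 10.0, 50.0, 50.0],
    b_tea=[12.0, 10.0, 50.0, 54.0],
    p_stu=[0.5, 0.5],
    p_tea=[1.0, 0.0],
)
```

In this toy case the box term is (2 + 0 + 0 + 4) / 4 = 1.5 and the classification term is ln 2, so the loss is about 2.19.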

[0026] The specific solution of this invention is: an edge vision model optimization method based on dynamic inference kernel, comprising the following steps:

[0027] S1: Cross-modal task prototype generation:

[0028] For at least one example image containing a new target and its corresponding optional text description, N-dimensional visual features are extracted using a pre-trained visual model (such as Vision Transformer, ViT), and M-dimensional text features are extracted using a pre-trained text encoder (such as CLIP-Text). The N-dimensional visual features are clustered using a Gaussian Mixture Model (GMM-Clustering) to generate a visual prototype P_v representing the core visual attributes of the new target. The visual prototype P_v is combined with the text prototype P_t generated from the M-dimensional text features to form a cross-modal task prototype (P_v, P_t). Generally, N can be 2048, M can be 768, and P_v and P_t are concatenated to form a 2816-dimensional cross-modal task prototype.

[0029] S2: Dynamically deformable inference kernel generation:

[0030] The cross-modal task prototype (P_v, P_t) is taken as input and fed into a parametric projection network constructed based on the Neural Radiance Field (NeRF) principle. The parametric projection network maps the task prototype to a task-specific spatial grid parameter θ_grid and parses the dynamically deformable inference kernel K_task from θ_grid. K_task includes a spatial offset matrix (Δx, Δy) for adjusting the feature sampling position and a weight coefficient matrix W for adjusting the feature values.

[0031] S3: Self-supervised example validation and pseudo-label generation:

[0032] The example image is input into the visual model currently deployed on the edge device, and the entropy value H of its predicted category probability distribution is calculated. When H is greater than a preset entropy threshold (e.g., 1.8), the example image is determined to be a valid new category example image. Pseudo-labels are generated for the valid new category example image using the Teacher-Student distillation framework. Here, the Teacher model is a high-performance but slow inference general model pre-built at the edge, and the Student model is a lightweight edge visual model to be optimized. The Teacher model outputs high-confidence bounding box coordinates B_tea and category c_tea as pseudo-labels.

[0033] S4: Incremental updates based on parameter isolation strategy:

[0034] Based on the predefined parameter mask matrix M, the parameters of the Student model are divided into a frozen layer (M[i]=0), a soft update layer (M[i]=0.3), and a hard update layer (M[i]=1). The dynamic deformable inference kernel K_task generated in step S2 is used as the hard update parameter and bound to the region of interest alignment (ROIAlign) layer of the Student model. Based on the pseudo-labels B_tea and c_tea generated in step S3 and the prediction results B_stu and P_stu of the Student model for the example image, the overall loss function L(θ;D_new) is calculated according to the following formula:

[0035] L(θ;D_new)=α·L_cls(P_stu,c_tea)+β·L_reg(B_stu,B_tea);

[0036] Where L_cls is the classification loss, L_reg is the regression loss, and α and β are preset weight coefficients;

[0037] The parameters θ of the Student model are updated according to the following update rules:

[0038] Δθ = -η·M⊙∇_θ L(θ;D_new), θ ← θ + Δθ;

[0039] Where Δθ is the parameter update amount, η is the learning rate, ∇_θ L(θ;D_new) is the gradient of the loss function, and ⊙ denotes element-wise multiplication with the mask matrix M; updates are performed only on the parameters of the soft update layer and the hard update layer (M[i]≠0), so as to quickly adapt to the new target while retaining general knowledge.
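
One plausible reading of the masked update in step S4 is that each parameter's gradient is scaled by its mask value before the learning-rate step, so frozen parameters (M[i]=0) never move. The sketch below illustrates that reading on a toy parameter vector; the function name and the numeric values are hypothetical.

```python
def masked_update(theta, grad, mask, lr):
    """Apply theta[i] <- theta[i] - lr * mask[i] * grad[i] element-wise.

    mask[i] = 0   -> frozen parameter, left unchanged
    mask[i] = 0.3 -> soft update, damped step
    mask[i] = 1   -> hard update, full step
    """
    return [t - lr * m * g for t, g, m in zip(theta, grad, mask)]

# One parameter from each of the three groups, with identical gradients.
theta = [1.0, 1.0, 1.0]
grad = [0.5, 0.5, 0.5]
mask = [0.0, 0.3, 1.0]
new_theta = masked_update(theta, grad, mask, lr=0.1)
```

Here the frozen parameter stays at 1.0, the soft parameter moves to 0.985, and the hard parameter moves to 0.95, which shows how the same gradient produces three different step sizes.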

[0040] Preferably, in step S2, the parametric projection network contains at least three fully connected layers. The parametric projection network maps the input cross-modal task prototype to a spatial grid parameter θ_grid with a dimension of 4×4×(3+1), where 3 represents the convolution kernel weight dimension and 1 represents the offset dimension; the parsed spatial offset matrix (Δx,Δy) has a size of 4×4×2.
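
As a rough illustration of the projection described above, the following sketch maps a 2816-dimensional prototype through three fully connected layers to a 64-value vector and reshapes it into a 4×4×(3+1) grid. The hidden width of 256, the ReLU activations, the random initialisation, and the channel-last layout are assumptions made for the sketch. How the single offset channel relates to the 4×4×2 matrix (Δx,Δy) is not specified in the text, so the sketch stops at the channel split.

```python
import random

def linear(x, weights, bias):
    """Fully connected layer: one dot product per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def init_layer(n_in, n_out, rng):
    """Random weights in [-1/sqrt(n_in), 1/sqrt(n_in)] and zero bias (illustrative)."""
    scale = 1.0 / n_in ** 0.5
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def project(prototype, layers):
    """Run the prototype through the layers, with ReLU between them."""
    h = prototype
    for i, (w, b) in enumerate(layers):
        h = linear(h, w, b)
        if i < len(layers) - 1:
            h = relu(h)
    return h

def reshape_grid(flat, size=4, channels=4):
    """Reshape a flat vector into a size x size grid of channel lists."""
    return [[flat[(r * size + c) * channels:(r * size + c + 1) * channels]
             for c in range(size)] for r in range(size)]

rng = random.Random(0)
layers = [init_layer(2816, 256, rng),
          init_layer(256, 256, rng),
          init_layer(256, 64, rng)]
prototype = [rng.gauss(0.0, 1.0) for _ in range(2816)]

theta_grid = reshape_grid(project(prototype, layers))        # 4 x 4 x (3+1)
weights = [[cell[:3] for cell in row] for row in theta_grid]  # 3 weight channels
offsets = [[cell[3:] for cell in row] for row in theta_grid]  # 1 offset channel
```

Because the whole mapping is a single forward pass through a small MLP, it involves no gradient computation, which is consistent with the short generation times claimed elsewhere in the text.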

[0041] Preferably, in step S3, the entropy value H is calculated using the following formula:

[0042] H = -∑(p_i × log(p_i)); where p_i is the model's predicted probability for the i-th category.
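
A minimal sketch of the entropy filter follows. The logarithm base is not stated in the text; the natural logarithm is assumed here. Under that assumption the maximum entropy over K categories is ln K, so a threshold of 1.8 can only be exceeded when the model has at least 7 known categories (ln 6 ≈ 1.79, ln 7 ≈ 1.95).

```python
import math

def prediction_entropy(probs):
    """Shannon entropy H = -sum(p_i * log(p_i)), skipping zero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def is_new_category(probs, threshold=1.8):
    """Flag an example as a valid new-category example when H exceeds the threshold."""
    return prediction_entropy(probs) > threshold

# A confident prediction over 10 classes versus a uniform (maximally uncertain) one.
confident = [0.91] + [0.01] * 9
uniform = [0.1] * 10

confident_flag = is_new_category(confident)   # low entropy, known category
uniform_flag = is_new_category(uniform)       # entropy ln(10), treated as new
```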

[0043] Preferably, step S3 further includes a feature diversity measurement step: performing principal component analysis (PCA) to reduce the dimensionality of the local feature vectors of multiple valid new category examples, and calculating the cosine similarity S between the features after dimensionality reduction. When S is less than a preset similarity threshold (e.g., 0.9), the example is retained to avoid sample redundancy.
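
The redundancy check can be sketched as a greedy filter over cosine similarity. The sketch assumes the feature vectors have already been reduced by PCA (that step is omitted here) and that examples are processed in input order, keeping an example only if its similarity to every previously kept example is below the threshold; the greedy ordering is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_redundant(features, threshold=0.9):
    """Return indices of examples kept after removing near-duplicates."""
    kept = []
    for idx, feat in enumerate(features):
        if all(cosine_similarity(feat, features[k]) < threshold for k in kept):
            kept.append(idx)
    return kept

# The second vector is nearly parallel to the first, so it is dropped.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = filter_redundant(features)
```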

[0044] Preferably, in step S4, the parameter mask matrix M defines no more than 3% of the model parameters as hard update layers and 10%-20% of the parameters as soft update layers; the soft update layers use exponential moving average (EMA) for smooth updates, and the hard update layers use the RMSprop optimizer for gradient updates.

[0045] An edge vision model optimization system based on dynamic inference kernel includes:

[0046] The cross-modal task prototype generation module is used to extract visual and text features from at least one example image containing a new target and its corresponding optional text description, and generate a cross-modal task prototype (P_v, P_t) through processing such as Gaussian mixture model clustering.

[0047] The dynamic deformable inference kernel generation module is configured with a parameter projection network based on the Neural Radiance Field (NeRF) principle, which is used to receive the cross-modal task prototype and map it to generate a dynamic deformable inference kernel K_task containing a spatial offset matrix (Δx,Δy) and a weight coefficient matrix W.

[0048] The self-supervised verification and labeling module is used to calculate the predicted entropy value of the input example to verify whether it is a new category, and uses the built-in Teacher-Student distillation framework to automatically generate pseudo-labels (B_tea, c_tea) for supervised learning for the verified examples.

[0049] The parameter isolation update module is used to classify the parameters of the Student model to be optimized into three categories: frozen, soft update, and hard update according to the predefined parameter mask matrix M; and bind the dynamic deformable inference kernel K_task as a hard update parameter to a specified layer of the model, calculate the loss according to the pseudo label, and perform gradient update only on non-frozen parameters to complete the optimization of the model.

[0050] Preferably, the parameter projection network in the dynamic deformable inference kernel generation module takes the cross-modal task prototype as input and outputs the task-specific spatial grid parameter θ_grid. The module further parses the dynamic deformable inference kernel K_task from θ_grid.

[0051] Preferably, in the parameter isolation update module, the learning rate of the hard update parameter (e.g., 0.01) is higher than that of the soft update parameter (e.g., 0.0001), and the hard update parameter is updated directly through gradient descent, while the soft update parameter is updated through exponential moving average (EMA).
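
The EMA rule for soft update parameters is not written out in the text. A minimal sketch of the usual form is shown below, blending the old value with a newly computed candidate value; the decay constant and the function name are assumptions.

```python
def ema_update(theta_old, theta_candidate, decay=0.99):
    """Exponential moving average: decay * old + (1 - decay) * candidate."""
    return [decay * o + (1.0 - decay) * c
            for o, c in zip(theta_old, theta_candidate)]

# With decay 0.9, each parameter moves 10% of the way toward its candidate.
soft = ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9)
```

A decay close to 1 keeps the soft update layer near its pre-trained values, which matches the stated goal of a smooth transition between old and new knowledge.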

[0052] From a system architecture perspective, this system can be divided into four hierarchical functional layers: input layer, DSFBE engine, edge evolution layer, and inference decision layer.

[0053] Input layer: Supports real-time video streaming via USB / industrial camera, or batch import of sample images in JPEG / PNG format. Text descriptions (e.g., "gear tooth surface crack") can be included.

[0054] The DSFBE engine includes:

[0055] Example processor: Extracts 2048-dimensional visual features and 768-dimensional text embeddings, aligning cross-modal sequences via Dynamic Time Warping (DTW).

[0056] Prototype generator: Generates 32 cross-modal prototypes by clustering example features, and uses a topology protection mechanism to prevent feature drift.

[0057] Inference kernel generator: Encodes prototype pairs into 4×4 deformable convolution kernels (parameters only 128KB) and loads them directly into the inference network.

[0058] The edge evolution layer includes:

[0059] Parameter isolator: The frozen layer, soft update layer, and hard update layer are divided by a three-valued mask matrix M (M[i]∈{0,0.3,1}) with the same dimension as the model parameters. This mask matrix is determined through offline sensitivity analysis.

[0060] Incremental optimizer: The RMSprop algorithm is used, with a learning rate of 0.01 for hard update layers and 0.0001 for other layers, and ≤5 iterations per update.

[0061] Hardware acceleration unit: Adapted to NPU operators such as Ascend 310 and NVIDIA Jetson AGX Orin, achieving update time ≤200ms.
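
For the incremental optimizer described above, the following is a minimal sketch of a standard RMSprop step run for 5 iterations at the hard-layer learning rate of 0.01. The smoothing constant 0.99, the epsilon value, and the toy quadratic objective are assumptions chosen for illustration.

```python
import math

def rmsprop_step(theta, grad, sq_avg, lr, alpha=0.99, eps=1e-8):
    """One RMSprop step.

    sq_avg <- alpha * sq_avg + (1 - alpha) * grad^2
    theta  <- theta - lr * grad / (sqrt(sq_avg) + eps)
    """
    new_sq = [alpha * s + (1.0 - alpha) * g * g for s, g in zip(sq_avg, grad)]
    new_theta = [t - lr * g / (math.sqrt(s) + eps)
                 for t, g, s in zip(theta, grad, new_sq)]
    return new_theta, new_sq

# Toy objective f(theta) = (theta - 3)^2, at most 5 iterations per update.
theta, sq_avg = [0.0], [0.0]
for _ in range(5):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, sq_avg = rmsprop_step(theta, grad, sq_avg, lr=0.01)
```

Because RMSprop normalises each step by the running gradient magnitude, five iterations move the parameter only a bounded distance toward the optimum, which limits how far a single update can drift from the pre-trained weights.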

[0062] The inference decision layer includes:

[0063] Dual-branch fusion: The outputs of the global model branch and the example subgraph branch are fused using attention-gated weighted fusion.

[0064] Uncertainty estimator: Calculates the predicted entropy value and feature outlier, and automatically triggers example collection prompts (such as "New category detected, suggest taking 3 examples").

[0065] This invention enables zero-data backflow adaptation: model optimization can be completed directly on an edge device using only 5-10 sample images collected on-site, without any cloud intervention or manual annotation.

[0066] Second-level capability evolution: After receiving a new example image, the model parameters can be updated within 200 milliseconds (ms), which improves the average precision (AP) of the model for new categories by more than 15%.

[0067] Lightweight and low forgetting rate: Only about 3% of the key parameters in the model are updated, ensuring that the forgetting rate of old tasks is less than 3% while adapting to new tasks, and the memory usage of the entire system on edge devices does not exceed 1GB, making it suitable for embedded systems.

[0068] The substantial effects of this invention are:

[0069] Achieving true edge-closed-loop evolution: This invention pioneered the construction of an application paradigm of "example input - real-time inference - local evolution," where all computations are completed on edge devices, completely eliminating dependence on the cloud and fundamentally solving the problems of network latency, data privacy, and cloud costs.

[0070] This significantly reduces adaptation costs and time: requiring only 5-10 sample images taken by on-site personnel, without any professional annotation, the system can automatically complete model optimization within 200ms. Compared to the traditional solution's time-consuming manual annotation and retraining cycle of several days, this invention reduces on-site adaptation manpower costs by more than 90%, and compresses time costs from days to seconds.

[0071] Significantly improves the performance and efficiency of small-sample learning: By dynamically generating task-specific inference kernels and combining them with a precise parameter isolation and update strategy, this invention can improve the detection AP of new categories by more than 15% with only a few samples, while ensuring that the forgetting rate of the original task is less than 3%, thus achieving an effective balance between "fast learning" and "long-term memory".

[0072] The solution is highly versatile and portable: the optimization method of this invention can be used as a "plugin" or "adapter" to be combined with various mainstream pre-trained visual models, and the system design is lightweight and compatible with mainstream embedded hardware (such as NVIDIA Jetson series), with broad application prospects.

Attached Figure Description

[0073] Figure 1 is a flowchart of an edge vision model optimization method based on a dynamic inference kernel according to the present invention.

Detailed Implementation

[0074] The technical solution of the present invention will be further described in detail below through embodiments and in conjunction with the accompanying drawings.

[0075] Example: This example provides an edge vision model optimization system based on a dynamic inference kernel. The system is deployed on an edge computing device, such as a terminal consisting of an industrial smart camera and an edge computing box (e.g., Intel NUC + NVIDIA RTX A2000). As shown in Figure 1, the optimization method includes four steps: cross-modal task prototype generation, dynamic deformable inference kernel generation, self-supervised example verification and pseudo-label generation, and incremental update based on parameter isolation strategy. The system also includes the following modules: cross-modal task prototype generation module, dynamic deformable inference kernel generation module, self-supervised verification and labeling module, and parameter isolation update module.

[0076] The function of the cross-modal task prototype generation module is to receive new task examples from external input and transform them into highly condensed feature representations that the model can understand—that is, cross-modal task prototypes.

[0077] In one specific implementation, when the user inputs N=5 example images of a new category (e.g., "novel insulator crack"), the cross-modal task prototype generation module performs the following operations:

[0078] Feature Extraction: For each example image, a powerful pre-trained visual model, preferably a lightweight version of Vision Transformer (ViT) (such as DINOv2), is used to extract its global features. ViT effectively captures the contextual information of the image by dividing the image into multiple patches and utilizing the self-attention mechanism of the Transformer architecture. In this embodiment, the extracted visual features are 2048-dimensional vectors. If the user also inputs the text description "insulator crack," the CLIP model's text encoder (CLIP-Text) is used to convert it into a 768-dimensional text feature vector.

[0079] Visual Prototype Generation: With only a few samples, the visual features may be noisy or come from different viewpoints. To obtain a stable and representative visual center, the cross-modal task prototype generation module uses a Gaussian Mixture Model (GMM-Clustering) to cluster the five 2048-dimensional visual feature vectors. GMM can fit complex data distributions with a weighted sum of multiple Gaussian distributions, and compared to hard clustering methods such as K-Means, it is better able to capture the uncertainty of features. The result of clustering is one or more visual prototypes P_v, which are the centers of the distribution and represent the core visual patterns of the new category.

[0080] Prototype Combination: The generated visual prototype P_v is concatenated with the textual prototype P_t (if any) to form a 2048+768=2816-dimensional cross-modal task prototype (P_v, P_t). This prototype contains both the "appearance" (visual information) and "definition" (semantic information) of the new target, providing comprehensive guidance for the subsequent generation of customized inference kernels.
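
The prototype combination above can be illustrated with a simplified sketch. It assumes a single-component Gaussian mixture, whose maximum-likelihood mean is just the sample mean of the five feature vectors; the embodiment may use more components. The random vectors stand in for real ViT and CLIP-Text outputs and are purely hypothetical.

```python
import random

N_VISUAL, M_TEXT = 2048, 768

def mean_prototype(features):
    """Sample mean per dimension (the mean of a single-component Gaussian fit)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

random.seed(0)
# Placeholder features: five example images and one text description.
visual_feats = [[random.gauss(0.0, 1.0) for _ in range(N_VISUAL)]
                for _ in range(5)]
text_feat = [random.gauss(0.0, 1.0) for _ in range(M_TEXT)]

p_v = mean_prototype(visual_feats)   # visual prototype P_v, 2048 dimensions
p_t = text_feat                      # text prototype P_t, 768 dimensions
task_prototype = p_v + p_t           # concatenation, 2048 + 768 = 2816 dimensions
```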

[0081] To describe this process more formally, the following key formulas and model definitions are introduced:

[0082] Formula: P_v=GMM-Clustering(ViT(x_i)),P_t=CLIP-Text(y_i);

[0083] Usage steps:

[0084] For the input example image x_i, extract the image features using a Vision Transformer (ViT).

[0085] The extracted visual features are processed using Gaussian mixture model clustering (GMM-Clustering) to generate a visual prototype P_v.

[0086] The input text description y_i is processed by the CLIP model's text encoder (CLIP-Text) to generate the text prototype P_t.

[0087] Related model description:

[0088] GMM-Clustering (Gaussian Mixture Model Clustering): Gaussian mixture model clustering generates visual prototypes by fitting feature distributions through multiple Gaussian probability density functions.

[0089] ViT (Vision Transformer): A visual Transformer model used to extract image patch embedding features.

[0090] CLIP-Text (CLIP Text Encoder): A text encoder for the CLIP model, based on the Transformer architecture, that converts text descriptions into 768-dimensional semantic embedding vectors.

[0091] The core innovation of this invention lies in the dynamic deformable inference kernel generation module. Its function is to "compile" the abstract task prototype into concrete model parameters that can be directly used to modify the model's behavior—the dynamic deformable inference kernel K_task.

[0092] This module draws inspiration from Neural Radiance Fields (NeRF). NeRF was originally used to synthesize novel views of 3D scenes from multi-view 2D images, its core being the mapping of spatial coordinates (x, y, z) to color and density. This invention transfers this principle: treating a high-dimensional task prototype vector as a "scene description," and "rendering" it into low-dimensional convolutional kernel parameters with a specific spatial structure.

[0093] Parametric Projection Network: The Dynamic Deformable Inference Kernel Generation Module incorporates a lightweight Multilayer Perceptron (MLP), namely the parametric projection network. This network contains three fully connected layers with a hidden layer dimension of 256. Its input is a 2816-dimensional cross-modal task prototype.

[0094] Spatial parameter mapping: The network outputs a one-dimensional vector, whose dimensions are designed to be reshaped into a 4×4×(3+1)=64-dimensional spatial grid parameter θ_grid. Here, 4×4 corresponds to the size of the convolutional kernel that will be used for subsequent operations on the feature map, 3 represents the weight at each position, and 1 represents the offset. This θ_grid can be understood as a "radiance field" of the task prototype in the parameter space.

[0095] Inference kernel parsing: The dynamic deformable inference kernel generation module 120 parses two parts from θ_grid:

[0096] Spatial offset matrix (Δx, Δy): A 4×4×2 matrix used to finely adjust the position of standard grid sampling points during subsequent feature sampling.

[0097] Weight coefficient matrix W: A 4×4×C_in×C_out matrix (for simplification, it can be regarded as 4×4×3), used to weight the feature values sampled after the position is adjusted.

[0098] Together, these two constitute the dynamically deformable inference kernel K_task. This process is a pure forward computation, which takes very little time (≤10ms).
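
To make the role of the offset matrix and the weight coefficients concrete, the sketch below shifts each sampling point (x, y) to (x + Δx, y + Δy), reads the feature map there by bilinear interpolation, and scales the result by the corresponding weight. It uses a single-channel feature map and one scalar weight per grid position, which is a simplification of the 4×4×C_in×C_out case; all names and values are illustrative.

```python
import math

def bilinear(feature, x, y):
    """Bilinearly interpolate a 2D feature map (list of rows) at point (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that shifted points never index outside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = feature[y0][x0] * (1 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1 - fx) + feature[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def deformable_sample(feature, grid, dx, dy, weight):
    """Sample each grid point at its offset position and apply its weight."""
    return [[weight[i][j] * bilinear(feature, x + dx[i][j], y + dy[i][j])
             for j, (x, y) in enumerate(row)]
            for i, row in enumerate(grid)]

# Toy 3x3 feature map whose value at (x, y) is 10 * y + x.
feature = [[10 * r + c for c in range(3)] for r in range(3)]
grid = [[(0.0, 0.0), (1.0, 0.0)],
        [(0.0, 1.0), (1.0, 1.0)]]
dx = [[0.5, 0.0], [0.0, 0.5]]
dy = [[0.0, 0.0], [0.5, 0.0]]
weight = [[1.0, 2.0], [1.0, 1.0]]
sampled = deformable_sample(feature, grid, dx, dy, weight)
```

Since the toy map is linear in x and y, bilinear interpolation is exact here, so each output can be checked by hand: for instance the point (0, 1) shifted by (0, 0.5) reads 10 × 1.5 + 0 = 15.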

[0099] The core operations of this module can be defined using the following key formulas and models:

[0100] Formula: K_task=NeRF-Encode(P_v,P_t,θ_grid)

[0101] Usage steps:

[0102] Obtain the visual prototype P_v and the text prototype P_t, as well as the spatial grid parameter θ_grid.

[0103] The above three elements are input into the Neural Radiance Field Encoding (NeRF-Encode) module.

[0104] The NeRF-Encode module processes the data to generate a task-specific deformable convolution kernel K_task.

[0105] Related model description:

[0106] NeRF-Encode (Neural Radiance Field Encoding): Based on voxel rendering encoding using the Neural Radiance Field, it projects cross-modal prototypes into spatial grid parameters θ_grid to generate dynamic convolutional kernels. The NeRF-Encode used in this invention is a lightweight improved version (70% reduction in parameters), and its network structure includes 3 fully connected layers (256 hidden dimensions) + 1 output layer. The input is a concatenated vector of cross-modal prototypes (P_v, P_t) (dimension 2048+768=2816), and the output is a spatial grid parameter θ_grid (dimension 4×4×(3+1)=64, where 3 is the convolution kernel weight and 1 is the offset). The projection process is as follows: P_v and P_t are concatenated into an input vector V; a three-dimensional voxel feature field Φ(V) is obtained through NeRF network mapping (each voxel contains "feature value + density" information); Φ(V) is voxel rendered along the spatial dimension (2×2×2 voxels) to obtain the two-dimensional grid parameter θ_grid (containing the weight and offset of each convolution kernel position); the dynamic convolution kernel is generated by parsing a 4×4 weight matrix W (size 4×4×C_in×C_out) and an offset matrix Δ (size 4×4×2) from θ_grid, where C_in and C_out are the number of input / output channels (adapting to the feature dimension of the ROIAlign layer).

[0107] The self-supervised validation and labeling module automatically determines the validity of input samples and generates high-quality training labels for them without human intervention.

[0108] Example Validation (Entropy Filtering): When a new image is input, it is first predicted using the current model, resulting in a probability distribution P=[p_1,...,p_K] covering all known categories. Then, the Shannon entropy of this distribution is calculated as H=-∑(p_i×log(p_i)). The entropy measures the uncertainty of the prediction. If the model's prediction for the image is highly uncertain (i.e., the probability is uniformly distributed across multiple categories), the entropy H will be high. This invention sets a threshold, such as 1.8. When H>1.8, the system determines that the image likely belongs to a new category not seen by the model, and is a "valid" example that should be used for model evolution. This effectively avoids using difficult or repetitive samples of known categories for unnecessary updates.
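The entropy filter in [0108] can be sketched in a few lines of Python. The sketch assumes the natural logarithm (the text does not state the base) and treats zero-probability terms as contributing nothing; the function names are hypothetical.

```python
import math

ENTROPY_THRESHOLD = 1.8  # example threshold given in the text


def shannon_entropy(probs):
    """H = -sum(p_i * log(p_i)) with natural log; p_i == 0 terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_valid_new_example(probs, threshold=ENTROPY_THRESHOLD):
    """Return True when the prediction entropy exceeds the threshold."""
    return shannon_entropy(probs) > threshold


uniform = [0.1] * 10             # maximally uncertain over 10 known classes
confident = [0.91] + [0.01] * 9  # one dominant known class

uniform_is_valid = is_valid_new_example(uniform)      # high entropy
confident_is_valid = is_valid_new_example(confident)  # low entropy
```

With the natural logarithm, the entropy over K classes is at most ln K, so a threshold of 1.8 can only be exceeded when the model has at least 7 known categories (ln 6 ≈ 1.79, ln 7 ≈ 1.95). If a different logarithm base is intended, the threshold would need to be rescaled accordingly.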

[0109] Example Diversity Validation (Optional): To avoid the five valid input example images being highly similar (e.g., taken from the same location and angle), the self-supervised validation and annotation module can also perform diversity measurement. Local features are extracted from each image, and after dimensionality reduction using Principal Component Analysis (PCA), the cosine similarity S between feature vectors is calculated. If S > 0.9 for two images, they are considered redundant samples, and only one is retained. This ensures that the small number of samples used for optimization covers as many feature variations as possible.
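A minimal sketch of the redundancy check in [0109] follows. It assumes the feature vectors have already been extracted and reduced by PCA, and it uses a greedy keep-first policy; the text only says that one of two redundant images is retained, so which one is kept is an assumption here.

```python
import math

SIMILARITY_THRESHOLD = 0.9  # example threshold given in the text


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_redundant(features, threshold=SIMILARITY_THRESHOLD):
    """Return indices of samples kept after greedy redundancy removal.

    A sample is dropped when its similarity to any already kept sample
    exceeds the threshold.
    """
    kept = []
    for i, feat in enumerate(features):
        if all(cosine_similarity(feat, features[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical PCA-reduced feature vectors for three example images.
features = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],
]
kept_indices = filter_redundant(features)
```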

[0110] Pseudo-label generation (Teacher-Student distillation): For valid examples that pass the validation, the self-supervised validation and annotation module 130 initiates the Teacher-Student framework to generate labels.

[0111] Teacher model: A larger, more accurate general vision model (such as DINOv2+FPN) pre-built at the edge. It has a slow inference speed and is not used for real-time detection, but it has strong feature extraction capabilities.

[0112] Student model: A lightweight model currently deployed for real-time inference that needs optimization.

[0113] The example image is input into the Teacher model to obtain its predicted bounding box B_tea and class c_tea. Predictions with a confidence score higher than 0.85 are selected as pseudo-labels, serving as the "gold standard." This process eliminates all manual annotation work.
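The confidence filter in [0113] amounts to keeping only Teacher predictions above 0.85. The sketch below illustrates this; the `Detection` container, the box format, and the sample values are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

CONFIDENCE_THRESHOLD = 0.85  # threshold given in the text


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format
    label: str
    score: float


def select_pseudo_labels(
    detections: List[Detection], threshold: float = CONFIDENCE_THRESHOLD
) -> List[Detection]:
    """Keep only Teacher predictions whose confidence exceeds the threshold."""
    return [d for d in detections if d.score > threshold]


# Hypothetical Teacher output for one example image.
teacher_output = [
    Detection((10.0, 12.0, 80.0, 90.0), "black_plum", 0.93),
    Detection((15.0, 20.0, 60.0, 70.0), "black_plum", 0.61),
]
pseudo_labels = select_pseudo_labels(teacher_output)
```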

[0114] The parameter isolation update module is responsible for performing the final model parameter update. To balance rapid adaptation and avoid forgetting, this module adopts a refined update strategy of "ternary parameter isolation".

[0115] Parameter partitioning: During system initialization, an offline sensitivity analysis (analyzing the impact of each parameter gradient on the loss of the new task) generates a mask matrix M with the same dimensionality as the Student model parameters. This matrix partitions the model parameters into three parts:

[0116] Frozen layers (M[i]=0, approximately 80% of parameters): These are mainly the backbone network of the model (such as the first few layers of Swin-T), responsible for extracting general low-level features. These parameters remain unchanged during updates, which preserves the model's generalization ability and is key to avoiding catastrophic forgetting.

[0117] Soft update layer (M[i]=0.3, approximately 17% of parameters): This mainly consists of the fusion layer for the neck and part of the head of the model. These parameters have a certain impact on both new and old tasks. An exponential moving average (EMA) is used for smooth updates, resulting in a slower update rate and a stable transition.

[0118] Hard update layer (M[i]=1, approximately 3% of parameters): This mainly consists of the final classification head of the model and the parts directly related to the new task. In this invention, the dynamic deformable inference kernel K_task generated by module 120 is directly regarded as a hard update parameter and bound to the ROIAlign layer of the Student model.
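One way the three-way partition could be derived from sensitivity scores is sketched below. The text does not say how the offline sensitivity analysis maps to mask values, so ranking individual parameters by a per-parameter score is purely an assumption for illustration; the default fractions follow the approximate 3% / 17% / 80% split described above.

```python
def build_mask(sensitivities, hard_frac=0.03, soft_frac=0.17):
    """Assign mask values 1.0 / 0.3 / 0.0 by descending sensitivity rank.

    The most sensitive parameters become hard update parameters, the next
    group become soft update parameters, and the rest are frozen.
    """
    n = len(sensitivities)
    n_hard = round(n * hard_frac)
    n_soft = round(n * soft_frac)
    order = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    mask = [0.0] * n
    for rank, idx in enumerate(order):
        if rank < n_hard:
            mask[idx] = 1.0
        elif rank < n_hard + n_soft:
            mask[idx] = 0.3
    return mask


# Hypothetical sensitivity scores for 100 parameters.
scores = [float(i) for i in range(100)]
mask = build_mask(scores)
```

In practice the partition described in the text follows layer boundaries (backbone, neck, head), so a per-layer score would be a closer match than this per-parameter ranking.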

[0119] Loss Calculation and Gradient Update: Substitute the pseudo-labels (B_tea, c_tea) and the predictions from the Student model (B_stu, P_stu) into the loss function L(θ; D_new) = α·L_cls + β·L_reg. Here, L_cls preferably uses cross-entropy loss, and L_reg preferably uses smoothed L1 loss. The weight coefficients are determined experimentally, for example, α = 1.0, β = 0.5. Then, calculate the gradient ∇L of the loss function with respect to all parameters.
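The weighted loss in [0119] can be illustrated for a single detection as follows. This is a sketch under assumptions: the classification term is plain cross-entropy on the Student's class probabilities, the regression term is a standard smooth L1 averaged over the four box coordinates with a transition point of 1.0, and the sample values are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class (clamped for stability)."""
    eps = 1e-12
    return -math.log(max(probs[target_index], eps))


def smooth_l1(pred_box, target_box, delta=1.0):
    """Smooth L1: quadratic below delta, linear above, averaged over coordinates."""
    total = 0.0
    for p, t in zip(pred_box, target_box):
        diff = abs(p - t)
        if diff < delta:
            total += 0.5 * diff * diff / delta
        else:
            total += diff - 0.5 * delta
    return total / len(pred_box)


def total_loss(p_stu, c_tea, b_stu, b_tea, alpha=1.0, beta=0.5):
    """L = alpha * L_cls + beta * L_reg, with the example weights from the text."""
    return alpha * cross_entropy(p_stu, c_tea) + beta * smooth_l1(b_stu, b_tea)


# Hypothetical Student prediction and Teacher pseudo-label.
reg = smooth_l1([0.0, 0.0, 10.0, 10.0], [0.5, 0.0, 10.0, 12.0])
loss = total_loss(
    p_stu=[0.2, 0.7, 0.1], c_tea=1,
    b_stu=[0.0, 0.0, 10.0, 10.0], b_tea=[0.5, 0.0, 10.0, 12.0],
)
```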

[0120] Mask update: Finally, apply the update rule Δθ=M·η·∇L(θ;D_new). Due to the mask M, only parameters with M[i] ≠ 0 (those in the soft and hard update layers) are updated. Hard update layers use a higher learning rate (e.g., 0.01) and the RMSprop optimizer for fast adjustments, while soft update layers use a lower learning rate (e.g., 0.0001) for a smooth transition. The entire update process involves very few iterations (≤5 times), achieving update times of ≤200ms on NPUs such as the NVIDIA Jetson AGX Orin.

[0121] The core mathematical principle of parameter updating can be described by the following formula:

[0122] Formula: Δθ=M·η·∇L(θ;D_new)

[0123] Usage steps:

[0124] Determine the local gradient mask M, the learning rate η, and the gradient of the loss function ∇L(θ;D_new) based on the new example dataset D_new.

[0125] Multiply the local gradient mask M, the learning rate η, and the gradient of the loss function ∇L(θ;D_new) together.

[0126] The result obtained is the parameter update amount Δθ.

[0127] Detailed explanation of each part of the formula:

[0128] M (Local Gradient Mask): Set to 1 only for the hard update layer parameters (approximately 3% of all parameters), and set to 0.3 (soft update layers) or 0 (frozen layers) for the rest according to the strategy.

[0129] η (learning rate): 0.01 for hard update layers, and 0.0001 for soft update layers.

[0130] ∇L(θ;D_new) (Loss Function Gradient): The loss function L is a weighted sum of the classification loss and the regression loss: L(θ;D_new)=α·L_cls+β·L_reg, where L_cls is the improved cross-entropy loss and L_reg is the smoothed L1 loss. The weight coefficients α and β are determined through grid search, with optimal values of α=1.0 and β=0.5.
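The masked update rule can be sketched as a single plain gradient step. The sketch assumes that the update amount is subtracted from the parameters (descent direction), which the text does not state explicitly, and it does not model the RMSprop optimizer or the exponential moving average mentioned for the hard and soft update layers; the function name and sample values are hypothetical.

```python
def masked_update(theta, grads, mask, lr_hard=0.01, lr_soft=0.0001):
    """Apply delta = M * eta * grad per parameter and subtract it from theta.

    Parameters with mask 1.0 use the hard-layer learning rate; all others use
    the soft-layer rate. Frozen parameters (mask 0.0) receive a zero update.
    """
    new_theta = []
    for value, grad, m in zip(theta, grads, mask):
        eta = lr_hard if m == 1.0 else lr_soft
        delta = m * eta * grad
        new_theta.append(value - delta)  # assumed descent direction
    return new_theta


theta = [1.0, 1.0, 1.0]
grads = [2.0, 2.0, 2.0]
mask = [0.0, 0.3, 1.0]  # frozen, soft update, hard update
updated = masked_update(theta, grads, mask)
```

With identical gradients, the frozen parameter is unchanged, the soft update parameter moves by 0.00006, and the hard update parameter moves by 0.02, which shows how the mask and the differential learning rate together concentrate the change in the hard update layer.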

[0131] The following is the flowchart of the method of the present invention.

[0132] Step A1: The system enters standby mode and loads the pre-trained Student and Teacher models, as well as the predefined parameter mask matrix M.

[0133] Step A2: Receive N (e.g., N=3-5) new category example images and optional text descriptions from the user.

[0134] Step A3: Perform cross-modal task prototype generation. Extract visual and textual features, generate stable visual prototypes through GMM clustering, and combine them into a cross-modal task prototype (P_v, P_t).

[0135] Step A4: Generate a dynamic deformable inference kernel. The task prototype is forward computed through a NeRF-based parametric projection network to generate a task-specific dynamic deformable inference kernel K_task.

[0136] Step A5: Perform self-supervised example validation and pseudo-label generation. The input examples are filtered for entropy and diversity to remove invalid and redundant samples. Then, the Teacher-Student framework is used to generate high-precision pseudo-labels (B_tea, c_tea) for the valid samples.

[0137] Step A6: Perform incremental updates based on a parameter isolation strategy. Bind K_task to the ROIAlign layer of the Student model. Calculate the loss based on the pseudo-labels and apply the mask matrix M and the differential learning rate, performing gradient updates only on the hard and soft update layers with a small number of iterations.

[0138] Step A7: Update complete. The Student model now has high-precision detection capabilities for the new category. The system switches to real-time inference mode and uses the optimized model for online detection.

[0139] Application Scenario Example 1: Real-time Adaptation for Fresh Produce Sorting

[0140] The intelligent camera system of this invention was deployed on an automated fresh produce sorting line. The initial model was able to identify apples, bananas, and oranges.

[0141] A new task arises: One day, a new batch of "black plums" arrives on the assembly line. The system has never seen this type of fruit before, and the initial detection accuracy is only about 50%.

[0142] Example data collection: The operator takes three "black plums" off the production line, captures three example images in front of the camera, each from a different angle, and inputs them into the system.

[0143] Second-level evolution: The system automatically executes the complete process of S203-S206 mentioned above.

[0144] Extract the visual features of the "black plum" and generate a prototype.

[0145] Generate a dynamic inference kernel K_task specifically for recognizing the texture, color, and shape of the "black plum".

[0146] The Teacher model generates accurate bounding box pseudo-labels for these three images.

[0147] The parameter isolation update module completed model optimization in less than one minute (including multiple iterations and hardware processing time).

[0148] Results Verification: The optimized model was immediately put into use, and the accuracy rate for detecting "black plums" on the production line surged from 50% to 92%. The entire process required no line downtime and no data experts, achieving seamless integration with the production workflow. At night, the system automatically integrates all new examples from the day for a more comprehensive incremental learning, further improving sorting efficiency by 30% the following day.

[0149] Application Scenario Example 2: Substation Equipment Defect Detection

[0150] In unattended substations, the system of this invention is deployed on inspection robots or fixed cameras to monitor the status of critical equipment (such as insulators, transformers, switchgear, etc.) in real time. The initial model library contains common equipment types and known defect patterns.

[0151] A new type of defect was discovered: With equipment aging or environmental changes, a novel type of insulator series crack emerged, a crack pattern not previously observed in the initial training data. During inspection, the inspection robot equipped with this invention consistently exhibited low detection confidence in this area, with an initial false negative rate as high as 28%.

[0152] Rapid on-site adaptation: After receiving the alert, maintenance personnel arrived at the site and confirmed the existence of the new defect. Using handheld devices or a remote-controlled inspection robot, they captured five clear images of the new crack from different angles and lighting conditions. Due to the isolation of the substation's internal network, the maintenance personnel imported these five sample images into the invention's system deployed locally on the robot via USB drive.

[0153] Autonomous evolution and inference kernel generation: After receiving the images, the system immediately executes the optimization process locally.

[0154] The system identified this as a "high entropy" event and determined it to be a valid new category.

[0155] Within 200ms, the system generates a dedicated dynamic inference kernel K_task for the fine-grained category of "novel insulator series crack".

[0156] The Teacher-Student framework automatically generated pixel-level pseudo-labels for these 5 images.

[0157] The parameter isolation update module completes the fine-tuning of the local model within seconds.

[0158] Performance improvements and continuous optimization:

[0159] Immediate results: After the model was updated, the inspection robot immediately re-inspected the same area, and the missed detection rate of the new type of crack was reduced from 28% to 5%, with a false alarm rate of less than 1%, achieving a high reliability standard for industrial applications.

[0160] Continuous learning: During subsequent routine inspections, the system automatically aggregates newly detected, high-confidence images of this type of defect as new samples. For example, during low-load periods at night, the system automatically integrates all new examples collected that day for an incremental learning round, further consolidating and optimizing model performance and achieving continuous evolution without further intervention from maintenance personnel.

[0161] The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art to which this invention pertains may make various modifications or additions to the described specific embodiments or use similar methods to substitute them, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

[0162] Although this document uses a variety of terms, the possibility of using other terms is not excluded. These terms are used merely for the convenience of describing and explaining the essence of the invention; interpreting them as any additional limitation would contradict the spirit of the invention.

Claims

1. A method for optimizing edge vision models based on dynamic inference kernels, characterized in that it includes the following steps:
S1: Cross-modal task prototype generation: For at least one example image containing a new target and its corresponding optional text description, N-dimensional visual features are extracted using a pre-trained visual model, and M-dimensional text features are extracted using a pre-trained text encoder. The N-dimensional visual features are clustered using a Gaussian mixture model to generate a visual prototype P_v representing the core visual attributes of the new target; the visual prototype P_v is combined with the text prototype P_t generated from the M-dimensional text features to form a cross-modal task prototype.
S2: Dynamically deformable inference kernel generation: The cross-modal task prototype is used as input and fed into a parametric projection network constructed based on the principle of neural radiance fields. The parametric projection network maps the task prototype to a task-specific spatial grid parameter θ_grid, and parses the dynamically deformable inference kernel K_task from the θ_grid; the K_task includes a spatial offset matrix for adjusting the feature sampling position and a weight coefficient matrix W for adjusting the feature values.
S3: Self-supervised example validation and pseudo-label generation: The example image is input into the visual model currently deployed on the edge device, and the entropy value H of its predicted category probability distribution is calculated. When H is greater than the preset entropy threshold, the example image is determined to be a valid new category example image. The Teacher-Student distillation framework is used to generate pseudo-labels for valid new category example images. The Teacher model is a high-performance general model pre-built at the edge, and the Student model is a lightweight edge vision model to be optimized. The Teacher model outputs high-confidence bounding box coordinates B_tea and category c_tea as pseudo-labels.
S4: Incremental updates based on parameter isolation strategy: Based on the predefined parameter mask matrix M, the parameters of the Student model are divided into a frozen layer, a soft update layer, and a hard update layer. The dynamically deformable inference kernel K_task generated in step S2 is used as a hard update parameter and bound to the region of interest alignment layer of the Student model; based on the pseudo-labels B_tea and c_tea generated in step S3 and the prediction results B_stu and P_stu of the Student model for the example image, the overall loss function L(θ;D_new) is calculated according to the following formula: L(θ;D_new)=α·L_cls(P_stu,c_tea)+β·L_reg(B_stu,B_tea); where L_cls is the classification loss, L_reg is the regression loss, and α and β are preset weight coefficients; the parameters θ of the Student model are updated according to the following update rule: Δθ=M·η·∇L(θ;D_new); where Δθ is the parameter update amount, η is the learning rate, and ∇L(θ;D_new) is the gradient of the loss function; updates are performed only on the parameters of the soft and hard update layers.

2. The edge vision model optimization method based on dynamic inference kernel according to claim 1, characterized in that, in step S2, the parametric projection network contains at least three fully connected layers. The parametric projection network maps the input cross-modal task prototype to a spatial grid parameter θ_grid with a dimension of 4×4×(3+1), where 3 represents the convolution kernel weight dimension and 1 represents the offset dimension; the parsed spatial offset matrix has a size of 4×4×2.

3. The edge vision model optimization method based on dynamic inference kernel according to claim 1, characterized in that, in step S3, the entropy value H is calculated using the following formula: H = -∑(p_i × log(p_i)); where p_i is the model's predicted probability for the i-th category.

4. The edge vision model optimization method based on dynamic inference kernel according to claim 1 or 3, characterized in that step S3 also includes a feature diversity measurement step: performing principal component analysis to reduce the dimensionality of the local feature vectors of multiple valid new category examples, and calculating the cosine similarity S between the features after dimensionality reduction. When S is less than a preset similarity threshold, the example is retained.

5. The edge vision model optimization method based on dynamic inference kernel according to claim 1, characterized in that, in step S4, the parameter mask matrix M defines no more than 3% of the model parameters as hard update layers and 10%-20% of the parameters as soft update layers; the soft update layers use exponential moving average for smooth updates, and the hard update layers use the RMSprop optimizer for gradient updates.

6. An edge vision model optimization system based on a dynamic inference kernel, characterized in that it includes:
A cross-modal task prototype generation module, used to extract visual and text features from at least one example image containing a new target and its corresponding optional text description, and to generate a cross-modal task prototype (P_v, P_t) through processing such as Gaussian mixture model clustering.
A dynamic deformable inference kernel generation module, configured with a parameter projection network based on the principle of neural radiance fields, which is used to receive the cross-modal task prototype and map it to generate a dynamic deformable inference kernel K_task containing a spatial offset matrix (Δx,Δy) and a weight coefficient matrix W.
A self-supervised verification and labeling module, used to calculate the predicted entropy value of the input example to verify whether it is a new category, and to automatically generate pseudo-labels (B_tea, c_tea) for supervised learning for the verified examples using the built-in Teacher-Student distillation framework.
A parameter isolation update module, used to divide the parameters of the Student model to be optimized into three categories (frozen, soft update, and hard update) based on the predefined parameter mask matrix M. The dynamic deformable inference kernel K_task is then bound to a specified layer of the model as a hard update parameter. The loss is calculated based on the pseudo-labels, and gradient updates are performed only on the non-frozen parameters to complete the model optimization.

7. The edge vision model optimization system based on dynamic inference kernel according to claim 6, characterized in that the parameter projection network in the dynamic deformable inference kernel generation module takes the cross-modal task prototype as input and outputs the task-specific spatial grid parameter θ_grid. The module further parses the dynamic deformable inference kernel K_task from θ_grid.

8. The edge vision model optimization system based on dynamic inference kernel according to claim 6, characterized in that, in the parameter isolation update module, the learning rate of hard update parameters is higher than that of soft update parameters, and hard update parameters are updated directly through gradient descent, while soft update parameters are updated through exponential moving average.