A real-time image processing method and system based on edge computing
By constructing a circular queue at edge nodes and using deformable convolution kernels for adaptive feature enhancement, a spatial attention mask is generated, and the feature extraction rate and dimension mapping are dynamically adjusted. This solves the real-time and stability problems of image processing in edge computing, and achieves high efficiency in feature extraction and continuous execution of the processing flow.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GANSU IND VOCATIONAL & TECH COLLEGE
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
- Estimated Expiration
- Not applicable · inactive patent
AI Technical Summary
In existing technologies, real-time image processing in edge computing suffers from several problems: fixed convolutional structures cannot adapt to the deformed regions of the target in the image; feature extraction is not targeted; single-frame processing is independent and has no temporal correlation; feature storage is disordered, which leads to a decrease in the real-time performance of image processing; feature dimensions do not match after cloud model updates; there is no dynamic control method for edge processing thread load; and queue depth imbalance can easily cause processing congestion.
A circular queue is built at the edge nodes, deformable convolutional kernels are used to adaptively enhance the intermediate feature maps, spatial attention masks are generated to filter out low-activation regions, feature dimensions are mapped through a projection network, the feature extraction rate is dynamically adjusted, and the depth of the processing thread queue is monitored to balance the load.
It improves the fit of feature extraction, maintains the temporal arrangement of features across multiple frames, removes invalid information, ensures that edge processing is consistent with the feature dimensions of the cloud model, avoids processing congestion, and guarantees the stability and real-time performance of the image processing workflow.
Figure CN122195686A_ABST
Description
Technical Field
[0001] This invention relates to the field of edge computing intelligent image processing technology, and in particular to a real-time image processing method and system based on edge computing.
Background Technology
[0002] Traditional real-time image processing at the edge often directly transmits the original image or shallow features. Edge nodes use fixed-structure convolution to extract features without setting a temporally ordered storage structure, and feature extraction is only processed independently for a single frame. Cloud-based recognition models use fixed feature dimensions, and the edge only generates and transmits feature data according to fixed specifications without setting feature filtering and dimension adaptation mechanisms. Edge processing threads extract features at a fixed rate.
[0003] Fixed convolutional structures cannot adapt to deformed regions of targets in images, resulting in weak feature extraction targeting. Single-frame processing lacks temporal correlation, and feature storage is disordered. No spatial filtering is performed on features, leading to a large number of low-activation redundant features participating in transmission and consuming edge transmission bandwidth. After cloud model updates, feature dimensions change, causing a mismatch between edge features and cloud model dimensions, hindering subsequent recognition. The lack of dynamic load control for edge processing threads and queue depth imbalances easily cause processing congestion, degrading the real-time performance of image processing.
[0004] Multi-scale intermediate features cannot be stored and adaptively enhanced in an ordered manner according to timestamps, and low-activation regions in image features cannot be effectively removed. It is also impossible to proactively acquire the latest feature dimensions of the cloud model and complete local feature mapping adaptation, and it is impossible to dynamically adjust the feature extraction rate based on the edge thread queue depth.
Summary of the Invention
[0005] The purpose of this invention is to address the shortcomings of existing technologies by proposing a real-time image processing method and system based on edge computing.
[0006] To achieve the above objectives, the present invention adopts the following technical solution: a real-time image processing method based on edge computing, comprising: receiving a real-time captured image stream and adding a timestamp and a source device identifier to each image as it enters the edge node processing pipeline; applying spatial pyramid pooling to the labeled image stream to generate intermediate feature maps containing multi-scale information; building a circular queue inside the edge node and storing the intermediate feature maps in the circular queue in timestamp order; taking an intermediate feature map from the head of the circular queue and adaptively enhancing it using deformable convolution kernels to form an enhanced feature map; generating a spatial attention mask based on the activation intensity of different regions within the enhanced feature map; performing weighted filtering on the enhanced feature map using the spatial attention mask to remove features in low-activation regions and obtain a condensed feature package; querying the preset cloud model update log to obtain the feature dimension requirements on which the latest cloud recognition model depends; comparing the dimensions of the condensed feature package with those requirements and, if they do not match, mapping the condensed feature package to the target dimension through a projection network to form a feature vector to be transmitted; encapsulating the feature vector to be transmitted, the corresponding timestamp, and the source device identifier into a transmission data unit; and monitoring the queue depth of multiple processing threads within the edge node and dynamically adjusting the rate at which intermediate feature maps are retrieved from the circular queue based on the queue depth.
[0007] As a further aspect of the present invention, the specific steps of applying spatial pyramid pooling to the labeled image stream to generate an intermediate feature map containing multi-scale information include: defining a set of pooling window sizes from small to large, where the pooling window sizes correspond to feature scales from local to global; for a received single-frame image, performing sliding average pooling over the entire image with the smallest pooling window size to obtain the feature response map with the richest detail; for the same frame, repeating the sliding average pooling with each larger pooling window size in turn, each operation generating a feature response map at the corresponding scale; concatenating all feature response maps along the channel dimension in ascending order of scale; and applying channel normalization to the concatenated multidimensional feature map to eliminate the differences in value range between scales. The normalized result is the intermediate feature map containing multi-scale information.
[0008] As a further aspect of the present invention, the specific steps of using deformable convolution kernels for adaptive feature enhancement include: An intermediate feature map is taken from the circular queue and fed into a lightweight offset prediction network. The offset prediction network outputs an offset field with the same shape as the intermediate feature map; each value in the offset field represents the offset vector applied to the convolution kernel's sampling position at the corresponding location. A standard-sized square convolution kernel is prepared, whose weight parameters are fixed after initialization. A deformable convolution kernel is formed by combining the standard-sized square convolution kernel with the predicted offset field; its actual sampling position is the original grid position plus the offset field. The extracted intermediate feature map is convolved using the deformable convolution kernel. During the convolution operation, the kernel weights remain unchanged, but the pixel position to which each weight is applied changes adaptively according to the offset field. The output feature map after convolution is added element-wise to the original intermediate feature map to form a residual connection. The result of the addition is the enhanced feature map.
[0009] As a further aspect of the present invention, the specific steps for generating the spatial attention mask include: The absolute average value of the enhanced feature map is calculated along the channel dimension to obtain a two-dimensional single-channel average activation map; Gaussian smoothing filtering is applied to the single-channel average activation map to eliminate local extrema caused by noise. Calculate the global average value of the single-channel average activation map after Gaussian smoothing filtering, and use it as the activation threshold; The activation value at each position in the single-channel average activation map is compared with the activation threshold. If it is greater than the activation threshold, a mask value of one is generated at the corresponding position; otherwise, a mask value of zero is generated. The generated binarized mask is convolved with a preset dilation kernel. Regions with a value of one are appropriately dilated to ensure that the target edge region is completely covered. The resulting binary image after the dilation operation is the spatial attention mask.
[0010] As a further aspect of the present invention, the specific steps of mapping the condensed feature package to the target dimension through a projection network to form the feature vector to be transmitted include: Read the latest feature dimensions required by the recognition model from the cloud model update log, and denote them as the target feature dimensions; Obtain the current feature dimension of the condensed feature package, denoted as the source feature dimension; The condensed feature package is input into the first fully connected layer of the projection network, which transforms the source feature dimension into an intermediate hidden layer dimension. A nonlinear activation function is applied to the feature vectors transformed to the intermediate hidden layer dimension to introduce nonlinear transformation capability. The feature vector processed by the nonlinear activation function is input into the second fully connected layer of the projection network. The second fully connected layer transforms the dimensions of the intermediate hidden layers into the dimensions of the target features. The feature vector output by the second fully connected layer is standardized to conform to a standard normal distribution. The standardized feature vector is the feature vector to be transmitted.
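The projection steps above can be illustrated with a minimal sketch in plain Python. The class name `ProjectionNetwork`, the helper `to_transmit`, the choice of ReLU as the nonlinear activation, the random initialization, and the per-vector standardization (zero mean and unit variance across the vector's elements) are assumptions made for illustration; the text does not fix any of them, and in the described method the weights would come from the cloud rather than from a random generator.

```python
import math
import random
from typing import List

Vector = List[float]


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    """Assumed nonlinear activation (the text only says 'nonlinear')."""
    return [max(0.0, v) for v in x]


def standardize(x: Vector, eps: float = 1e-8) -> Vector:
    """Shift to zero mean and scale to unit variance across the vector."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / (std + eps) for v in x]


class ProjectionNetwork:
    """Two fully connected layers: source -> hidden -> target dimension."""

    def __init__(self, source_dim: int, hidden_dim: int, target_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)  # placeholder weights, for illustration only
        self.w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(source_dim)) for _ in range(source_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden_dim)) for _ in range(hidden_dim)]
                   for _ in range(target_dim)]
        self.b2 = [0.0] * target_dim

    def __call__(self, x: Vector) -> Vector:
        hidden = relu(linear(x, self.w1, self.b1))
        return standardize(linear(hidden, self.w2, self.b2))


def to_transmit(package: Vector, target_dim: int, net: ProjectionNetwork) -> Vector:
    """Project only when the source dimension differs from the target dimension."""
    if len(package) == target_dim:
        return list(package)
    return net(package)
```

For example, `to_transmit(package, 4, ProjectionNetwork(8, 16, 4))` maps an 8-element package to a 4-element vector, while a package that already has 4 elements is passed through unchanged.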
[0011] As a further aspect of the present invention, the training and updating steps of the projection network are completed in the cloud, specifically including: The cloud periodically collects transmission data units from multiple edge nodes and extracts the feature vector to be transmitted and its corresponding timestamp from the transmission data units; The cloud takes the collected feature vectors to be transmitted as input and feeds them into the latest recognition model to obtain the recognition result; The cloud uses the recognition results obtained from the latest recognition model and the actual annotation results to calculate the model loss, and updates the parameters of the latest recognition model through the backpropagation algorithm; During the process of updating the latest recognition model parameters, the parameters of the fixed projection network remain unchanged, and the gradient of the feature vector to be transmitted with respect to the projection network parameters is calculated. The gradients from multiple batches of training data are aggregated in the cloud, and the aggregated gradients are used to update the parameters of the projection network, generating updated projection network parameters. The cloud sends the updated projection network parameters and update logs to all edge nodes, which then use them to replace their local projection network parameters.
[0012] As a further aspect of the present invention, the specific steps of dynamically adjusting the rate of retrieving intermediate feature maps from the circular queue based on the queue depth include: Periodically sample the task queues of all image processing threads within the edge nodes and calculate the number of tasks waiting to be processed in each task queue; The number of waiting tasks in each task queue is compared with a preset queue length threshold, and the number of task queues that exceed the queue length threshold is counted and recorded as the number of overloaded queues. Calculate the ratio of the number of overloaded queues to the total number of task queues to obtain the instantaneous overload rate; Query a historical overload rate record table, which records the instantaneous overload rate for multiple sampling periods over a recent period; Calculate the moving average of all instantaneous overload rates in the historical overload rate record table and use it as the reference value for steady-state overload rate; The instantaneous overload rate obtained in the current sampling period is compared with the steady-state overload rate reference value, and the difference between the instantaneous overload rate and the steady-state overload rate is calculated. Based on the sign and magnitude of the difference, a global rate control parameter is adjusted, which directly determines the time interval for retrieving intermediate feature maps from the circular queue. Based on the adjusted global rate control parameters, the operation of retrieving intermediate feature maps from the circular queue is controlled so that the frequency of the retrieval operation matches the processing capacity of the processing thread.
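The rate adjustment loop described above can be sketched as follows. The class name `RateController`, the multiplicative update rule, the gain, and the interval bounds are hypothetical choices; the text only states that the sign and magnitude of the difference drive a global rate control parameter that sets the retrieval interval.

```python
from collections import deque
from typing import Deque, Sequence


class RateController:
    """Adjusts the retrieval interval from sampled task-queue depths."""

    def __init__(self, queue_threshold: int, history_size: int = 10,
                 base_interval: float = 0.02, gain: float = 1.0,
                 min_interval: float = 0.005, max_interval: float = 0.5) -> None:
        self.queue_threshold = queue_threshold
        self.history: Deque[float] = deque(maxlen=history_size)  # recent overload rates
        self.interval = base_interval                            # seconds between retrievals
        self.gain = gain
        self.min_interval = min_interval
        self.max_interval = max_interval

    def update(self, queue_depths: Sequence[int]) -> float:
        # Instantaneous overload rate: share of queues above the length threshold.
        overloaded = sum(1 for depth in queue_depths if depth > self.queue_threshold)
        instantaneous = overloaded / len(queue_depths)
        # Steady-state reference: moving average of the recorded history.
        reference = (sum(self.history) / len(self.history)
                     if self.history else instantaneous)
        difference = instantaneous - reference
        # Positive difference: load is rising, so lengthen the interval (slow down).
        # Negative difference: load is falling, so shorten it (speed up).
        scaled = self.interval * (1.0 + self.gain * difference)
        self.interval = min(max(scaled, self.min_interval), self.max_interval)
        self.history.append(instantaneous)
        return self.interval
```

A caller would invoke `update` once per sampling period and wait for the returned interval before taking the next intermediate feature map from the circular queue.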
[0013] As a further aspect of the present invention, before the step of counting the number of task queues exceeding the queue length threshold and recording them as the number of overloaded queues, a step of classifying and managing the task queues is included: Define task priorities for different types of image processing tasks, with three levels: high, medium, and low. Each task priority is assigned an independent physical processing thread group and a corresponding task queue; When sampling task queues, different sampling frequencies are used for task queues with different priorities. The sampling frequency is highest for high-priority task queues and lowest for low-priority task queues. When counting the number of overloaded queues, different queue length thresholds are set for task queues with different priorities. The queue length threshold for high-priority task queues is the smallest, and the queue length threshold for low-priority task queues is the largest. When calculating the instantaneous overload rate, the overload queues of different priorities are weighted and counted. The high-priority overload queues have the largest weight, and the low-priority overload queues have the smallest weight.
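A minimal sketch of the priority-weighted overload rate follows. The threshold and weight values, the function name `weighted_overload_rate`, and the normalization by the total weight of all queues are assumptions for illustration; the text specifies only the ordering of thresholds and weights across priorities.

```python
from typing import Dict, List

# Hypothetical settings: high priority has the smallest threshold and largest weight.
THRESHOLDS: Dict[str, int] = {"high": 4, "medium": 8, "low": 16}
WEIGHTS: Dict[str, float] = {"high": 3.0, "medium": 2.0, "low": 1.0}


def weighted_overload_rate(depths: Dict[str, List[int]],
                           thresholds: Dict[str, int] = THRESHOLDS,
                           weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted share of overloaded queues, in the range [0, 1]."""
    overloaded_weight = 0.0
    total_weight = 0.0
    for priority, queue_depths in depths.items():
        for depth in queue_depths:
            total_weight += weights[priority]
            if depth > thresholds[priority]:
                overloaded_weight += weights[priority]
    return overloaded_weight / total_weight if total_weight else 0.0
```

With these settings, an overloaded high-priority queue raises the rate three times as much as an overloaded low-priority queue.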
[0014] As a further aspect of the present invention, after receiving the transmission data unit in the cloud, the method further includes executing: The cloud analyzes the transmitted data unit and separates the feature vector to be transmitted, timestamp, and source device identifier; Based on the source device identifier, retrieve the feature distribution template stored in the historical interaction records of edge nodes in the cloud; Align the feature vector to be transmitted with the retrieved feature distribution template to eliminate feature offset between different edge nodes; The aligned feature vector to be transmitted is input into a gated recurrent unit network. The gated recurrent unit network combines the state vector of the edge node in the previous processing request to perform temporal context modeling on the current feature. The final state vector output by the gated recurrent unit network is input into the head of the multi-task learning network in the cloud. The cloud-based multi-task learning network head simultaneously performs three sub-tasks: target classification, position fine-tuning, and behavior prediction, generating comprehensive cloud-based analysis results. The specific steps for aligning the feature vector to be transmitted with the retrieved feature distribution template include: Read the feature vector to be transmitted and calculate its mean vector and covariance matrix; Read the feature distribution template pre-stored for the current source device identifier. The feature distribution template contains the mean vector and covariance matrix of the historical feature vectors of the edge nodes. Calculate the difference vector between the mean vector of the feature vector to be transmitted and the mean vector of the feature distribution template; The difference vector is whitened by a transformation matrix constructed from the square root of the inverse of the covariance matrix of the feature distribution template. 
Add the whitening-transformed difference vector to the feature vector to be transmitted to obtain an intermediate alignment vector; Finally, the intermediate alignment vector is rescaled so that its covariance matrix is approximately the same as the covariance matrix of the feature distribution template. The rescaled vector is the feature vector to be transmitted after the alignment operation is completed.
[0015] As a further aspect of the present invention, the present invention also includes a real-time image processing system based on edge computing, the system including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein when the processor executes the computer program, it implements the steps of the real-time image processing method based on edge computing as described above.
[0016] Compared with the prior art, the advantages and positive effects of the present invention are as follows: A circular queue is constructed within the edge nodes, storing multi-scale intermediate feature maps in timestamp order. Adaptive feature enhancement is performed on the head feature map of the queue using deformable convolutional kernels. A spatial attention mask is generated based on the activation intensity of the enhanced feature map regions. This mask is used to weight and filter the enhanced feature maps, eliminating low-activation features. The deformable convolution adjusts the receptive field to fit the deformable regions of the image target, improving the fit of feature extraction. The circular queue maintains the temporal arrangement of features across multiple frames, preventing temporal disorder. The spatial attention mask accurately locates high-value feature regions, eliminating low-activation features with no effective information and reducing redundant feature data.
[0017] The system queries the cloud model update log to obtain the feature dimension requirements of the latest recognition model, compares the condensed feature package dimensions with the target dimensions, and completes the dimension mapping through a projection network. This ensures that the feature dimensions generated at the edge are consistent with the cloud model, eliminating the dimension incompatibility issues caused by cloud model iteration. It monitors the queue depth of multiple processing threads at the edge nodes and dynamically adjusts the feature retrieval rate of the circular queue. This balances the load of different processing threads, avoids congestion caused by excessive queue depth in a single thread, maintains the stable operation of the edge processing pipeline, and ensures the continuous execution of the image processing workflow.
Attached Figure Description
[0018] Figure 1 is a flowchart of the real-time image processing method based on edge computing described in this invention; Figure 2 is a flowchart for generating intermediate feature maps using spatial pyramid pooling; Figure 3 is a flowchart for adaptive feature enhancement using deformable convolution kernels.
Detailed Implementation
[0019] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0020] In the description of this invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicating orientation or positional relationships, are based on the orientation or positional relationships shown in the accompanying drawings and are only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the invention. Furthermore, in the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.
[0021] Referring to Figure 1, this invention provides a real-time image processing method based on edge computing, the overall implementation of which is as follows: The real-time captured image stream is received, and a timestamp and source device identifier are appended to it as it enters the edge node processing pipeline. Spatial pyramid pooling is applied to the image stream with the appended label information to generate intermediate feature maps containing multi-scale information. A circular queue is built inside the edge node, and the intermediate feature maps are stored in the circular queue in the order of their timestamps. The intermediate feature map is taken from the head of the circular queue and adaptively enhanced using deformable convolutional kernels to form an enhanced feature map. A spatial attention mask is generated based on the activation intensity of different regions within the enhanced feature map. The enhanced feature map is weighted and filtered using the spatial attention mask to remove features from low-activation regions, resulting in a condensed feature package. The preset cloud model update log is queried to obtain the feature dimension requirements on which the latest cloud recognition model depends. The dimension of the condensed feature package is compared with the feature dimension requirements of the latest cloud recognition model. If the dimensions do not match, the condensed feature package is mapped to the target dimension through a projection network to form a feature vector to be transmitted. The feature vector to be transmitted, the corresponding timestamp, and the source device identifier are encapsulated into a transmission data unit. The queue depth of multiple processing threads inside the edge node is monitored, and the rate at which intermediate feature maps are taken from the circular queue is dynamically adjusted according to the queue depth.
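The circular queue at the center of this pipeline can be sketched as a fixed-capacity ring buffer. The names `FeatureEntry` and `CircularFeatureQueue` are hypothetical, and two behaviors are assumptions rather than statements from the text: entries are assumed to arrive already in timestamp order, and a full queue is assumed to overwrite its oldest entry.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FeatureEntry:
    timestamp: float   # capture time attached at pipeline entry
    device_id: str     # source device identifier
    feature_map: Any   # intermediate feature map (opaque in this sketch)


class CircularFeatureQueue:
    """Fixed-capacity ring buffer; when full, the oldest entry is overwritten."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[FeatureEntry]] = [None] * capacity
        self._head = 0   # index of the oldest entry
        self._size = 0   # number of stored entries

    def __len__(self) -> int:
        return self._size

    def push(self, entry: FeatureEntry) -> None:
        capacity = len(self._slots)
        tail = (self._head + self._size) % capacity
        self._slots[tail] = entry
        if self._size == capacity:
            # The write landed on the oldest slot, so the head moves forward.
            self._head = (self._head + 1) % capacity
        else:
            self._size += 1

    def pop_head(self) -> Optional[FeatureEntry]:
        """Remove and return the oldest entry, or None if the queue is empty."""
        if self._size == 0:
            return None
        entry = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return entry
```

Because both `push` and `pop_head` only move indices, neither operation shifts stored entries, which keeps the cost per frame constant regardless of queue capacity.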
[0022] In one embodiment of the present invention, referring to Figure 2, a set of pooling window sizes, ranging from small to large, is defined, with each size corresponding to a feature scale from local to global. For a received single-frame image, sliding pooling is performed on the entire image using the smallest pooling window size, employing average pooling to obtain the feature response map with the richest details. For the same frame image, sliding average pooling is repeated using progressively larger pooling window sizes, generating a feature response map at the corresponding scale each time. All feature response maps at different scales are then concatenated along the channel dimension in ascending order of scale. The resulting multi-dimensional feature map is then subjected to channel normalization to eliminate differences in value range between scales. The normalized result is the intermediate feature map containing multi-scale information.
[0023] In practice, a set of pooling window sizes, ranging from small to large, is defined. These pooling window sizes correspond to feature scales ranging from local to global. The specific values of the pooling window sizes can be adjusted based on the image resolution and the desired feature granularity. For example, in one configuration, the pooling window size sequence is set to [4×4, 8×8, 16×16], used to extract fine-grained texture, medium-scale structure, and macroscopic contour information, respectively. This set of sizes ensures that feature extraction covers different levels from microscopic details to global semantics. For the received single-frame image, sliding pooling is performed on the entire image using the smallest pooling window size. The pooling method is average pooling, resulting in the most detailed feature response map. During the sliding process, the pooling window traverses every possible position of the image with a fixed step size, generating a dense response map in which each pixel value reflects the average response within a local region. For the same frame, the sliding average pooling operation is repeated with each larger pooling window size. Each operation generates a feature response map at the corresponding scale. As the pooling window increases, the spatial resolution of the response map gradually decreases, but the receptive field represented by each response value expands, capturing a wider range of structural information and forming a series of multi-scale feature representations.
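The multi-scale pooling step can be sketched in plain Python for a single-channel image. This sketch assumes a step size of 1 and clips windows at the image border so that every response map keeps the input's height and width, which allows the maps to be stacked along the channel dimension directly; the text does not specify the step size or border handling, and the function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def average_pool_same(image: Grid, window: int) -> Grid:
    """Stride-1 average pooling; windows are clipped at the image border."""
    height, width = len(image), len(image[0])
    half = window // 2
    out: Grid = []
    for i in range(height):
        row = []
        for j in range(width):
            r0, r1 = max(0, i - half), min(height, i - half + window)
            c0, c1 = max(0, j - half), min(width, j - half + window)
            total = sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
            row.append(total / ((r1 - r0) * (c1 - c0)))
        out.append(row)
    return out


def pyramid_features(image: Grid, windows: List[int]) -> List[Grid]:
    """One response map per window size, stacked in ascending order of scale."""
    return [average_pool_same(image, w) for w in sorted(windows)]
```

Calling `pyramid_features(image, [4, 8, 16])` returns three response maps, one per window size in the example configuration, ordered from the finest scale to the coarsest.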
[0024] In some embodiments, feature response maps of different scales are concatenated along the channel dimension in ascending order of scale. The concatenated multidimensional feature map retains information from the original image at different levels of abstraction, enabling subsequent processing modules to utilize both local details and global context simultaneously, thus enhancing the robustness and richness of feature representation. Channel normalization is then applied to the concatenated multidimensional feature map to eliminate differences in value range between scales. The result of this normalization is the intermediate feature map containing multi-scale information. Channel normalization can be implemented using the following formula:
$$\hat{x}_{c,i,j} = \frac{x_{c,i,j} - \mu_c}{\sigma_c + \epsilon}$$
[0025] where $x_{c,i,j}$ is the original value of the concatenated multidimensional feature map at the c-th channel and spatial position (i,j); $\mu_c$ is the mean of the feature values over all spatial locations of that channel; $\sigma_c$ is the standard deviation of that channel; $\epsilon$ is a very small constant set to prevent division by zero; and $\hat{x}_{c,i,j}$ is the normalized feature value. After normalization, the distribution of feature values in each channel tends to be consistent, eliminating the differences in data range caused by the different scales.
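The per-channel normalization described here (subtract the channel mean, divide by the channel standard deviation plus a small constant) can be sketched as follows. The function name `channel_normalize` and the default value of `eps` are assumptions for illustration.

```python
import math
from typing import List

Grid = List[List[float]]


def channel_normalize(features: List[Grid], eps: float = 1e-5) -> List[Grid]:
    """Normalize each channel to zero mean and approximately unit variance."""
    normalized: List[Grid] = []
    for channel in features:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        # eps keeps the division defined when a channel is constant (std == 0).
        normalized.append([[(v - mean) / (std + eps) for v in row] for row in channel])
    return normalized
```

A constant channel has zero standard deviation, so it maps to all zeros instead of raising a division error.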
[0026] In one embodiment of the present invention, referring to Figure 3, an intermediate feature map is taken from the circular queue and input into a lightweight offset prediction network. The offset prediction network outputs an offset field with the same shape as the intermediate feature map. Each value in the offset field represents the offset vector applied to the convolutional kernel's sampling position at the corresponding location. A standard-sized square convolutional kernel is prepared, and its weight parameters are fixed after initialization. The standard-sized square convolutional kernel is combined with the predicted offset field to form a deformable convolutional kernel. The actual sampling position of the deformable convolutional kernel is determined by the original grid position plus the offset field. The deformable convolutional kernel is used to perform a convolution operation on the retrieved intermediate feature map. During the convolution operation, the weights of the convolutional kernel remain unchanged, but the pixel position to which each weight is applied changes adaptively according to the offset field. The output feature map after the convolution operation is added element-wise to the original intermediate feature map to form a residual connection. The result of the addition is the enhanced feature map.
[0027] The absolute average value of the enhanced feature map is calculated along the channel dimension to obtain a two-dimensional single-channel average activation map. Gaussian smoothing filtering is applied to the single-channel average activation map to eliminate local extrema caused by noise. The global average value of the Gaussian smoothed single-channel average activation map is calculated and used as the activation threshold. The activation value at each position in the single-channel average activation map is compared with the activation threshold. If it is greater than the activation threshold, a mask value of one is generated at the corresponding position; otherwise, a mask value of zero is generated. The generated binarized mask is convolved with a preset dilation kernel, and regions with a value of one are appropriately dilated to ensure that the target edge region is completely covered. The binary map obtained after the dilation operation is the spatial attention mask.
[0028] In practice, an intermediate feature map is retrieved from the circular queue and input into a lightweight offset prediction network. This network can consist of two convolutional layers. The first layer uses a small kernel to extract local features, and the second layer outputs an offset field with the same shape as the intermediate feature map. Each value in the offset field is a two-dimensional vector representing the horizontal and vertical displacement required by the kernel at the corresponding position. A standard-sized square convolutional kernel is prepared, with its weights fixed after initialization (e.g., a 3×3 square kernel). The initial weights are set to a regular uniform or Gaussian distribution and are not updated in subsequent processing to ensure the stability of the convolution operation. The standard-sized square kernel is combined with the predicted offset field to form a deformable convolutional kernel. The actual sampling position of the deformable convolutional kernel is determined by the original regular grid position plus the offset field, allowing the receptive field of the kernel to adaptively focus on important regions based on the image content, thereby capturing the geometric deformation features of non-rigid targets.
[0029] In some embodiments, a deformable convolutional kernel is used to perform a convolution operation on the extracted intermediate feature map. During the convolution operation, the weights of the convolutional kernel remain unchanged, but the pixel position where each weight is applied changes adaptively according to the offset field. That is, for each position in the output feature map, the sampling point of the convolutional kernel is no longer a fixed regular grid, but a coordinate position offset according to the offset field. The feature value at the non-integer coordinates is obtained by bilinear interpolation for calculation. The output feature map after the convolution operation is added element-wise to the original intermediate feature map to achieve residual connection. The result of the addition is the enhanced feature map. The introduction of residual connection avoids feature degradation, preserves the low-level information of the original intermediate feature map, and incorporates the geometrically adaptive features extracted by deformable convolution.
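The offset sampling, bilinear interpolation, and residual addition can be sketched for a single-channel feature map. This sketch follows one reading of the text, in which each output position has a single (dy, dx) offset shared by all kernel taps; common deformable convolution implementations instead predict a separate offset per tap. The function names are hypothetical, the offsets are supplied directly rather than predicted by a network, and samples outside the map are assumed to contribute zero.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]
OffsetField = List[List[Tuple[float, float]]]


def bilinear_sample(fmap: Grid, y: float, x: float) -> float:
    """Interpolated value at a fractional coordinate; outside the map counts as 0."""
    height, width = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < height and 0 <= xx < width:
                value += wy * wx * fmap[yy][xx]
    return value


def deformable_conv_residual(fmap: Grid, kernel: Grid, offsets: OffsetField) -> Grid:
    """Fixed-weight convolution at offset positions, plus a residual connection."""
    height, width = len(fmap), len(fmap[0])
    radius = len(kernel) // 2
    out: Grid = []
    for i in range(height):
        row = []
        for j in range(width):
            dy, dx = offsets[i][j]
            acc = 0.0
            for ki, kernel_row in enumerate(kernel):
                for kj, weight in enumerate(kernel_row):
                    # Regular grid position shifted by the predicted offset.
                    acc += weight * bilinear_sample(
                        fmap, i + ki - radius + dy, j + kj - radius + dx)
            row.append(fmap[i][j] + acc)  # element-wise residual addition
        out.append(row)
    return out
```

With all offsets set to zero the function reduces to an ordinary fixed-grid convolution plus the residual, which makes it easy to check the offset path in isolation.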
[0030] In practice, the absolute average value of the enhanced feature map is calculated along the channel dimension to obtain a two-dimensional single-channel average activation map. Each pixel value in the single-channel average activation map reflects the average activation intensity of the corresponding spatial location across all channels, which can preliminarily characterize the feature saliency of different regions. Gaussian smoothing filtering is applied to the single-channel average activation map to eliminate local extrema caused by noise. The kernel size and standard deviation of the Gaussian smoothing filter can be set according to the image resolution. For example, a 5×5 Gaussian kernel can be used to convolve the average activation map to suppress spurious responses caused by high-frequency noise. The global average value of the single-channel average activation map after Gaussian smoothing is calculated and used as the activation threshold. The global average value is obtained by summing all pixel values of the single-channel average activation map and dividing by the total number of pixels, representing the average activation level of the entire image, and serving as a benchmark for distinguishing high and low activation regions.
[0031] The process involves comparing the activation value at each location in the single-channel average activation map with an activation threshold. If the value is greater than the threshold, a mask value of one is generated at the corresponding location; otherwise, a mask value of zero is generated, resulting in a binary initial mask. Regions with a value of one represent high-activation areas, i.e., the locations of potential targets or salient structures. The generated binary mask is then convolved with a preset dilation kernel, appropriately dilating the regions with values of one. The size of the dilation kernel can be set according to the minimum expected size of the target; for example, a 3×3 all-one matrix can be used as the dilation kernel. The convolution operation expands the high-activation areas outward by several pixels. The resulting binary map after the dilation operation is the spatial attention mask, which more comprehensively covers the target boundary and adjacent related regions.
[0032] In practical implementation, the activation threshold comparison operation in the spatial attention mask generation process can be explicitly expressed by the following formula:
$$M_0(i,j) = \begin{cases} 1, & \bar{A}(i,j) > T \\ 0, & \text{otherwise} \end{cases}$$
[0033] where $\bar{A}(i,j)$ is the activation value at position $(i,j)$ of the single-channel average activation map after Gaussian smoothing filtering, $T$ is the calculated global average activation threshold, and $M_0(i,j)$ is the value of the generated initial binary mask at position $(i,j)$. The formula defines the binary mask generation logic based on threshold comparison, with the mask taking the value one or zero at each position.
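The mask generation steps (channel averaging, Gaussian smoothing, global-mean thresholding, dilation) can be sketched as follows. This is an illustrative sketch under stated assumptions: the "absolute average value" is read as the mean of absolute activations, borders are zero-padded, the standard deviation of 1.0 is a placeholder, and all function names are hypothetical.

```python
import math

def channel_abs_mean(feat):
    """feat[c][i][j] -> 2-D map of the mean absolute activation across channels."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[sum(abs(feat[k][i][j]) for k in range(c)) / c for j in range(w)]
            for i in range(h)]

def gaussian_kernel(size=5, sigma=1.0):
    """Normalised size x size Gaussian kernel."""
    r = size // 2
    k = [[math.exp(-((a - r) ** 2 + (b - r) ** 2) / (2 * sigma ** 2))
          for b in range(size)] for a in range(size)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def conv2d_same(img, kernel):
    """Zero-padded convolution that keeps the input size (kernels here are symmetric)."""
    h, w, k = len(img), len(img[0]), len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(k):
                for b in range(k):
                    y, x = i + a - r, j + b - r
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += kernel[a][b] * img[y][x]
    return out

def spatial_attention_mask(feat, smooth_size=5, sigma=1.0, dilate_size=3):
    act = conv2d_same(channel_abs_mean(feat), gaussian_kernel(smooth_size, sigma))
    h, w = len(act), len(act[0])
    threshold = sum(map(sum, act)) / (h * w)        # global mean as activation threshold
    initial = [[1 if v > threshold else 0 for v in row] for row in act]
    ones = [[1] * dilate_size for _ in range(dilate_size)]
    hits = conv2d_same(initial, ones)               # number of ones under the all-one kernel
    return [[1 if v > 0 else 0 for v in row] for row in hits]
```

Convolving the binary mask with an all-one kernel and re-binarising the result is one way to realise the dilation described in the text; any position whose neighbourhood contains at least one high-activation pixel becomes one.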
[0034] In one embodiment of the present invention, the feature dimension required by the latest recognition model is read from the cloud model update log and denoted as the target feature dimension; the current feature dimension of the condensed feature package is obtained and denoted as the source feature dimension; the condensed feature package is input into the first fully connected layer of the projection network, and the first fully connected layer transforms the source feature dimension into an intermediate hidden layer dimension; a nonlinear activation function is applied to the feature vector transformed into the intermediate hidden layer dimension to introduce nonlinear transformation capability; the feature vector processed by the nonlinear activation function is input into the second fully connected layer of the projection network, and the second fully connected layer transforms the intermediate hidden layer dimension into the target feature dimension; the feature vector output by the second fully connected layer is standardized to conform to a standard normal distribution, and the standardized feature vector is the feature vector to be transmitted.
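The projection path described above (first fully connected layer, nonlinear activation, second fully connected layer, standardization) can be sketched as follows. This is a minimal sketch, not the trained network: ReLU is assumed as the activation, the random initialisation in `make_projection` stands in for parameters that would be distributed by the cloud, and the zero-variance guard is an addition of this sketch.

```python
import math
import random

def linear(x, weights, bias):
    """Fully connected layer: one output component per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def standardize(x):
    """Shift and scale the components to zero mean and unit variance."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    if sigma == 0.0:  # guard added for this sketch: constant vector
        return [0.0] * len(x)
    return [(v - mu) / sigma for v in x]

def make_projection(source_dim, hidden_dim, target_dim, seed=0):
    """Hypothetical helper: random parameters in place of cloud-trained ones."""
    rng = random.Random(seed)

    def layer(n_out, n_in):
        s = 1.0 / math.sqrt(n_in)
        return [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]

    return {"w1": layer(hidden_dim, source_dim), "b1": [0.0] * hidden_dim,
            "w2": layer(target_dim, hidden_dim), "b2": [0.0] * target_dim}

def project(feature, params):
    """Map a condensed feature package to the feature vector to be transmitted."""
    hidden = relu(linear(feature, params["w1"], params["b1"]))
    return standardize(linear(hidden, params["w2"], params["b2"]))
```

For instance, `make_projection(512, 768, 1024)` would build parameters for a 512-dimensional source, a 768-dimensional hidden layer, and a 1024-dimensional target.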
[0035] The cloud periodically collects transmission data units from multiple edge nodes, extracting the feature vectors to be transmitted and their corresponding timestamps from these data units. The cloud then uses these collected feature vectors as input to the latest recognition model to obtain the recognition result. The cloud calculates the model loss using the recognition result from the latest model and the actual annotation result, updating the parameters of the latest recognition model through backpropagation. During this update, the parameters of the projection network remain unchanged, and the gradient of the feature vector to be transmitted with respect to the projection network parameters is calculated. The cloud aggregates gradients from multiple batches of training data and uses these aggregated gradients to update the projection network parameters, generating updated projection network parameters. The cloud then distributes the updated projection network parameters and update logs to all edge nodes, which replace their local projection network parameters with these updated parameters.
[0036] In practical implementation, the feature dimensions required by the latest recognition model are read from the cloud model update log, denoted as the target feature dimension. The target feature dimension is specified by the configuration file released by the cloud. For example, if the latest recognition model in the cloud requires an input feature dimension of 1024, then this value is the target feature dimension. The current feature dimension of the condensed feature package is obtained, denoted as the source feature dimension. The condensed feature package is a compressed feature representation obtained by edge nodes performing spatial attention weighted filtering on the enhanced feature map. Its dimension depends on the network structure design of the front-end processing, and can be, for example, 512 or 256 dimensions. The condensed feature package is input into the first fully connected layer of the projection network. The first fully connected layer transforms the source feature dimension into an intermediate hidden layer dimension. The setting of the intermediate hidden layer dimension needs to balance expressive power and computational complexity. A common configuration is between the source feature dimension and the target feature dimension. For example, when the source feature dimension is 512 and the target feature dimension is 1024, the intermediate hidden layer dimension can be set to 768 to achieve a smooth transition of features and sufficient nonlinear transformation.
[0037] In some embodiments, a nonlinear activation function is applied to the feature vector transformed to the intermediate hidden layer dimension to introduce nonlinear transformation capability. Commonly used nonlinear activation functions include ReLU or its variants, which are used to enhance the fitting ability of the projection network to complex feature distributions and avoid the limitations of linear transformation. The feature vector processed by the nonlinear activation function is input into the second fully connected layer of the projection network. The second fully connected layer transforms the intermediate hidden layer dimension into the target feature dimension, that is, it finally maps the feature vector into the input space required by the cloud model, ensuring that the feature representation is compatible with the input interface of the latest cloud recognition model. The feature vector output by the second fully connected layer is standardized to conform to a standard normal distribution. The standardized feature vector is the feature vector to be transmitted. The standardization operation can be implemented using the following formula:
$$\hat{z}_i = \frac{z_i - \mu}{\sigma}$$
[0038] where $z_i$ is the $i$-th component of the feature vector output by the second fully connected layer, $\mu$ is the mean of all components of the feature vector, $\sigma$ is the standard deviation of all components of the feature vector, and $\hat{z}_i$ is the standardized $i$-th component. After standardization, the components of the feature vector have a mean of 0 and a variance of 1, which helps improve the training stability and convergence speed of the cloud-based model. Refer to Table 1, which shows the optional settings for the intermediate hidden layer dimension of the projection network under different source and target feature dimension configurations.
Table 1: Optional settings for the hidden layer dimension of the projection network
In practice, the cloud periodically collects transmission data units from multiple edge nodes, extracting the feature vectors to be transmitted and their corresponding timestamps from these units. The collection cycle can be dynamically adjusted based on system load and model update frequency, for example collecting every 24 hours or triggering the collection process after a certain number of samples has accumulated. The cloud uses the collected feature vectors to be transmitted as input to the latest recognition model to obtain the recognition result. The type of recognition result depends on the specific task of the cloud model, such as a target category label, detection box coordinates, or a behavior probability distribution. The cloud uses the recognition result obtained from the latest recognition model and the actual annotation result to calculate the model loss, and updates the parameters of the latest recognition model through the backpropagation algorithm. The choice of loss function depends on the recognition task; for example, cross-entropy loss is commonly used for classification tasks, while classification loss and regression loss are often combined for detection tasks.
[0039] It should be understood that, while the parameters of the latest recognition model are being updated, the parameters of the projection network are held fixed, and the gradient of the feature vector to be transmitted with respect to the projection network parameters is calculated. This gradient calculation applies only to the projection network and does not affect the update path of the main parameters of the latest recognition model, thereby achieving modular parameter optimization. The cloud aggregates gradients from multiple batches of training data and uses the aggregated gradients to update the parameters of the projection network, generating updated projection network parameters. Gradient aggregation can use a moving average or a cumulative method to ensure the stability and representativeness of the update direction. The cloud distributes the updated projection network parameters and update logs to all edge nodes, and the edge nodes replace their local projection network parameters with the updated ones. The update logs record the new target feature dimension requirements and the projection network version information. The edge nodes adjust their local processing flow according to the log content to maintain synchronization with the cloud.
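The moving-average option for gradient aggregation can be sketched as below. This is only an illustration of the aggregation and update arithmetic: gradients are assumed to be already flattened into lists of floats, and the momentum value, learning rate, and function names are hypothetical choices not given in the source.

```python
def aggregate_gradients(batch_grads, momentum=0.9):
    """Exponential moving average over per-batch gradient vectors."""
    agg = list(batch_grads[0])
    for grad in batch_grads[1:]:
        agg = [momentum * a + (1.0 - momentum) * g for a, g in zip(agg, grad)]
    return agg

def update_projection(params, agg_grad, learning_rate=0.01):
    """One gradient step on the flattened projection network parameters."""
    return [p - learning_rate * g for p, g in zip(params, agg_grad)]
```

The updated parameter list would then be packaged with the update log and distributed to the edge nodes.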
[0040] In one embodiment of the present invention, the task queues of all image processing threads within the edge node are periodically sampled, and the number of tasks waiting to be processed in each task queue is calculated. The number of waiting tasks in each task queue is compared with a preset queue length threshold, and the number of task queues exceeding the queue length threshold is counted and recorded as the number of overloaded queues. The proportion of the number of overloaded queues to the total number of task queues is calculated to obtain the instantaneous overload rate. A historical overload rate record table is queried, which records the instantaneous overload rate of multiple sampling periods in a recent period. The moving average of all instantaneous overload rates in the historical overload rate record table is calculated and used as the steady-state overload rate reference value. The instantaneous overload rate obtained in the current sampling period is compared with the steady-state overload rate reference value, and the difference between the instantaneous overload rate and the steady-state overload rate is calculated. Based on the sign and magnitude of the difference, a global rate control parameter is adjusted, which directly determines the time interval for retrieving intermediate feature maps from the circular queue. Based on the adjusted global rate control parameter, the operation of retrieving intermediate feature maps from the circular queue is controlled so that the frequency of the retrieval operation matches the processing capacity of the processing thread.
[0041] Task priorities are defined for different types of image processing tasks, with three levels: high, medium, and low. Each task priority is assigned an independent physical processing thread group and a corresponding task queue. When sampling task queues, different sampling frequencies are used for different priority task queues, with the highest sampling frequency for high-priority task queues and the lowest for low-priority task queues. When counting the number of overloaded queues, different queue length thresholds are set for different priority task queues, with the lowest threshold for high-priority task queues and the highest threshold for low-priority task queues. When calculating the instantaneous overload rate, overloaded queues of different priorities are weighted, with the highest weight for high-priority overloaded queues and the lowest weight for low-priority overloaded queues.
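The priority-weighted overload rate can be sketched as follows. The threshold and weight values are placeholders chosen only to respect the ordering described above (lowest threshold and highest weight for high priority), and dividing by the total weight to keep the result in [0, 1] is an assumption of this sketch.

```python
# Hypothetical configuration: only the ordering of the values follows the text.
PRIORITY_CONFIG = {
    "high":   {"threshold": 4,  "weight": 3.0},
    "medium": {"threshold": 8,  "weight": 2.0},
    "low":    {"threshold": 16, "weight": 1.0},
}

def weighted_overload_rate(queue_depths, config=PRIORITY_CONFIG):
    """queue_depths: iterable of (priority, number_of_waiting_tasks) pairs."""
    overloaded = 0.0
    total = 0.0
    for priority, depth in queue_depths:
        weight = config[priority]["weight"]
        total += weight
        if depth > config[priority]["threshold"]:
            overloaded += weight  # an overloaded queue counts with its priority weight
    return overloaded / total if total else 0.0
```

With these placeholder values, a single overloaded high-priority queue raises the rate three times as much as a single overloaded low-priority queue.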
[0042] In practice, the task queues of all image processing threads within the edge node are periodically sampled. The number of tasks waiting to be processed in each task queue is calculated. The sampling period can be set according to the system's real-time requirements, such as once per second or once after processing several frames of images, to obtain the continuous trend of queue status changes. The number of waiting tasks in each task queue is compared with a preset queue length threshold. The queue length threshold is determined based on the computing power of the processing threads and the average task time, and is used to determine whether the queue is congested. The number of task queues exceeding the queue length threshold is counted and recorded as the number of overloaded queues. The number of overloaded queues directly reflects the current instantaneous load pressure of the edge node. The proportion of the number of overloaded queues to the total number of task queues is calculated to obtain the instantaneous overload rate. The instantaneous overload rate quantifies the severity of the current system overload, and its value range is [0,1]. A historical overload rate record table is queried. The historical overload rate table records the instantaneous overload rate for multiple sampling periods over a recent period. For example, the instantaneous overload rate data of the most recent 100 sampling points is saved for analyzing the long-term trend of load changes. Calculate the moving average of all instantaneous overload rates in the historical overload rate record table, and use this as the steady-state overload rate reference value. The steady-state overload rate reference value represents the average load level of the system within the normal fluctuation range. 
Compare the instantaneous overload rate obtained in the current sampling period with the steady-state overload rate reference value, and calculate the difference between the instantaneous overload rate and the steady-state overload rate. The magnitude and direction of the difference indicate the degree to which the current load deviates from the steady state. Based on the sign and magnitude of the difference, adjust a global rate control parameter. This global rate control parameter directly determines the time interval for retrieving intermediate feature maps from the circular queue. The adjustment logic can be implemented using the following formula:
$$r_{\text{new}} = r_{\text{cur}} - \alpha \, \Delta$$
[0043] where $r_{\text{cur}}$ is the current global rate control parameter, $\Delta$ is the difference between the instantaneous overload rate and the steady-state overload rate reference value, $\alpha$ is the adjustment coefficient used to control the adjustment range, and $r_{\text{new}}$ is the adjusted global rate control parameter. Based on the adjusted parameter, the operation of retrieving intermediate feature maps from the circular queue is controlled so that the frequency of retrieval matches the processing capacity of the processing threads. When the instantaneous overload rate is higher than the steady-state overload rate reference value ($\Delta > 0$), the global rate control parameter is decreased to reduce the retrieval rate; conversely, when it is lower ($\Delta < 0$), the global rate control parameter is increased to improve throughput.
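The sampling, comparison, and adjustment loop can be sketched as a small controller. This sketch assumes an additive update rule, treats the global rate control parameter as retrievals per second (so the retrieval interval is its reciprocal), computes the reference value before recording the current sample, and clamps the rate to a range; the class name and all default values are hypothetical.

```python
from collections import deque

class RateController:
    def __init__(self, rate=30.0, gain=10.0, history=100, min_rate=1.0, max_rate=60.0):
        self.rate = rate                      # global rate control parameter
        self.gain = gain                      # adjustment coefficient
        self.history = deque(maxlen=history)  # recent instantaneous overload rates
        self.min_rate = min_rate
        self.max_rate = max_rate

    def update(self, queue_depths, length_threshold):
        """Sample all task queues once and adjust the retrieval rate."""
        overloaded = sum(1 for depth in queue_depths if depth > length_threshold)
        instantaneous = overloaded / len(queue_depths)
        # steady-state reference: moving average of the recorded history
        if self.history:
            reference = sum(self.history) / len(self.history)
        else:
            reference = instantaneous
        delta = instantaneous - reference
        # positive delta (above steady state) lowers the rate, negative raises it
        self.rate = min(max(self.rate - self.gain * delta, self.min_rate), self.max_rate)
        self.history.append(instantaneous)
        return self.rate

    def interval(self):
        """Time in seconds between retrievals from the circular queue."""
        return 1.0 / self.rate
```

A caller would invoke `update` once per sampling period and wait `interval()` seconds between retrievals of intermediate feature maps.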
[0044] In some embodiments, task priorities are defined for different types of image processing tasks. Task priorities are categorized into high, medium, and low levels. High-priority tasks correspond to scenarios with high real-time requirements, such as emergency event detection; medium-priority tasks correspond to routine monitoring and analysis; and low-priority tasks correspond to offline or background processing tasks. Each task priority is assigned an independent physical processing thread group and a corresponding task queue. Processing thread groups of different priorities do not interfere with each other, ensuring that high-priority tasks can obtain computing resources in a timely manner. When sampling task queues, different sampling frequencies are used for task queues of different priorities. The sampling frequency for high-priority task queues is the highest, for example, once every 0.1 seconds, while the sampling frequency for low-priority task queues is the lowest, for example, once every 1 second, to quickly respond to changes in the load of high-priority tasks. When counting the number of overloaded queues, different queue length thresholds are set for task queues of different priorities. The queue length threshold for high-priority task queues is the smallest, and the queue length threshold for low-priority task queues is the largest, to accommodate the different latency sensitivities of different priority tasks. When calculating the instantaneous overload rate, the overload queues of different priorities are weighted and counted. The high-priority overload queues have the largest weight and the low-priority overload queues have the smallest weight. This makes the overload situation of the high-priority queues have a greater impact on the instantaneous overload rate. Therefore, when adjusting the global rate control parameters, the stability of high-priority tasks is prioritized. See Table 2.
[0045] Table 2: Configuration parameters for task queues of different priorities.
In one embodiment of the present invention, the cloud parses the transmission data unit, separating the feature vector to be transmitted, the timestamp, and the source device identifier; based on the source device identifier, it retrieves the feature distribution template stored in the historical interaction records of the edge nodes in the cloud; it aligns the feature vector to be transmitted with the retrieved feature distribution template to eliminate feature offset between different edge nodes; it inputs the aligned feature vector to be transmitted into a gated recurrent unit network, which combines the state vector of the edge node's previous processing request to perform temporal context modeling of the current feature; the final state vector output by the gated recurrent unit network is input into the head of the multi-task learning network in the cloud; the head of the multi-task learning network in the cloud simultaneously performs three sub-tasks: target classification, position fine-tuning, and behavior prediction, generating a comprehensive cloud analysis result.
[0046] The alignment operation proceeds as follows: the feature vector to be transmitted is read, and its mean vector and covariance matrix are calculated; the feature distribution template pre-stored for the current source device identifier is read, which contains the mean vector and covariance matrix of the historical feature vectors of the edge node; the difference vector between the mean vector of the feature vector to be transmitted and the mean vector of the feature distribution template is calculated; a whitening transformation is performed on the difference vector, with the transformation matrix constructed from the square root of the inverse of the covariance matrix of the feature distribution template; the whitened difference vector is added to the feature vector to be transmitted to obtain an intermediate alignment vector; finally, the intermediate alignment vector is rescaled so that its covariance matrix is approximately the same as the covariance matrix of the feature distribution template, and the rescaled vector is the feature vector to be transmitted after the alignment operation.
[0047] In practical implementation, the cloud parses the transmitted data unit, separating the feature vector to be transmitted, the timestamp, and the source device identifier. After the transmitted data unit arrives at the cloud via an encrypted channel, the decryption module extracts the structured fields. The feature vector to be transmitted is used for subsequent analysis, the timestamp is used for temporal correlation, and the source device identifier serves as the index key. Based on the source device identifier, the cloud retrieves the feature distribution template stored in the historical interaction records of the edge nodes. The cloud maintains a distributed database that stores the feature distribution parameters statistically analyzed by different edge nodes over long-term operation, with each source device identifier corresponding to one record. The feature vector to be transmitted is aligned with the retrieved feature distribution template to eliminate feature offsets between different edge nodes. Feature offsets may originate from camera perspective, lighting conditions, or hardware differences. The alignment operation makes the distribution of features from different sources converge. The aligned feature vector to be transmitted is input into a gated recurrent unit network (GRU). The GRU combines the state vector of the edge node's previous processing request to perform temporal context modeling for the current feature. The state vector caches the hidden states from historical feature processing, used to capture cross-frame motion trends and behavioral continuity. The final state vector output by the gated recurrent unit network is input into the head of the multi-task learning network in the cloud. 
The head of the multi-task learning network shares the underlying feature extraction results and simultaneously performs three sub-tasks: target classification, position fine-tuning, and behavior prediction, generating comprehensive cloud analysis results. The analysis results can be used to trigger alarms or update monitoring logs.
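The per-device temporal modeling can be sketched with a single gated recurrent unit step and a state cache keyed by source device identifier. This is an illustrative sketch using the standard GRU equations; the parameter dictionary layout, the zero initial state for unseen devices, and the function names are assumptions, and the multi-task head is omitted.

```python
import math

def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def _affine(w, x, u, h, b):
    """Compute W x + U h + b row by row."""
    return [sum(wi * xi for wi, xi in zip(w_row, x)) +
            sum(ui * hi for ui, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(w, u, b)]

def gru_step(x, h, p):
    """One GRU update: x is the aligned feature vector, h the previous state."""
    z = [_sigmoid(v) for v in _affine(p["Wz"], x, p["Uz"], h, p["bz"])]  # update gate
    r = [_sigmoid(v) for v in _affine(p["Wr"], x, p["Ur"], h, p["br"])]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in _affine(p["Wh"], x, p["Uh"], rh, p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def process_request(device_id, feature, params, states, hidden_dim):
    """Combine the device's cached state with the current feature, then cache the result."""
    h_prev = states.get(device_id, [0.0] * hidden_dim)
    h_new = gru_step(feature, h_prev, params)
    states[device_id] = h_new
    return h_new
```

The returned state vector is what would be passed to the head of the multi-task learning network.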
[0048] In practice, the feature vector to be transmitted is read, and its mean vector and covariance matrix are calculated. The mean vector is obtained by averaging the feature vector across its dimensions, and the covariance matrix reflects the linear correlation between the dimensions. The feature distribution template pre-stored for the current source device identifier is read. This template contains the mean vector and covariance matrix of the historical feature vectors of the edge node. These parameters are statistically derived from a large amount of feature data transmitted by the edge node in the past, representing the inherent feature distribution characteristics of the node. The difference vector between the mean vector of the feature vector to be transmitted and the mean vector of the feature distribution template is calculated. This difference vector represents the offset of the center position of the current feature. A whitening transformation is performed on the difference vector. The transformation matrix is constructed based on the square root of the inverse of the covariance matrix of the feature distribution template. The purpose of the whitening transformation is to remove the correlation of the feature distribution template and normalize the variance. The transformation process can be expressed by the following formula:
$$\tilde{d} = \Sigma_T^{-1/2} \left( \mu_x - \mu_T \right)$$
[0049] where $\mu_x$ is the mean vector of the feature vector to be transmitted, $\mu_T$ is the mean vector of the feature distribution template, $\Sigma_T$ is the covariance matrix of the feature distribution template, $\Sigma_T^{-1/2}$ is the square root of the inverse of that covariance matrix, and $\tilde{d}$ is the difference vector after the whitening transformation. The whitened difference vector is added to the feature vector to be transmitted to obtain an intermediate alignment vector. The addition operation superimposes the corrected offset onto the original feature, achieving initial alignment. The intermediate alignment vector is then rescaled so that its covariance matrix approximates the covariance matrix of the feature distribution template. The rescaled vector is the feature vector to be transmitted after alignment, and the scaling factor is determined based on the eigenvalues of the covariance matrix of the feature distribution template.
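The alignment steps can be sketched under a strong simplifying assumption: the template covariance matrix is treated as diagonal, so its inverse square root and its eigenvalues reduce to per-dimension variances. The sketch also assumes that statistics are computed over a small batch of feature vectors from one edge node and that rescaling is performed about the batch mean; the function names are hypothetical.

```python
import math

def mean_and_var(batch):
    """Per-dimension mean and (population) variance of a batch of vectors."""
    n, d = len(batch), len(batch[0])
    mu = [sum(v[k] for v in batch) / n for k in range(d)]
    var = [sum((v[k] - mu[k]) ** 2 for v in batch) / n for k in range(d)]
    return mu, var

def align(batch, template_mu, template_var):
    mu, var = mean_and_var(batch)
    # whitened mean difference; with a diagonal covariance the inverse
    # square root is simply 1 / sqrt(variance) per dimension
    d_w = [(m - t) / math.sqrt(s) for m, t, s in zip(mu, template_mu, template_var)]
    # intermediate alignment vectors: add the whitened difference
    inter = [[x + d for x, d in zip(v, d_w)] for v in batch]
    center = [m + d for m, d in zip(mu, d_w)]
    # rescale about the mean so each variance matches the template variance
    scale = [math.sqrt(s / v) if v > 0 else 1.0 for s, v in zip(template_var, var)]
    return [[c + k * (x - c) for x, c, k in zip(v, center, scale)] for v in inter]
```

A full-covariance version would replace the per-dimension division with a matrix inverse square root obtained from an eigendecomposition of the template covariance.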
[0050] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any other way. Any person skilled in the art may make changes or modifications to the above-disclosed technical content to create equivalent embodiments that can be applied to other fields. However, any simple modifications, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the protection scope of the present invention.
Claims
1. A real-time image processing method based on edge computing, characterized in that it comprises the following steps:
receiving real-time captured image streams and adding timestamps and source device identifiers to the images as they enter the edge node processing pipeline;
applying spatial pyramid pooling to the image stream with the added label information to generate intermediate feature maps containing multi-scale information;
building a circular queue inside the edge node and storing the intermediate feature maps in the circular queue in timestamp order;
taking the intermediate feature maps out from the head of the circular queue and adaptively enhancing them using deformable convolution kernels to form enhanced feature maps;
generating a spatial attention mask based on the activation intensity of different regions within the enhanced feature map;
performing weighted filtering on the enhanced feature maps using the spatial attention mask to remove features in low-activation regions and obtain a condensed feature package;
querying the preset cloud model update log to obtain the feature dimension requirement on which the latest cloud recognition model depends;
comparing the dimension of the condensed feature package with the feature dimension requirement of the latest cloud recognition model and, if the dimensions do not match, mapping the condensed feature package to the target dimension through a projection network to form a feature vector to be transmitted;
encapsulating the feature vector to be transmitted, the corresponding timestamp, and the source device identifier into a transmission data unit; and
monitoring the queue depth of multiple processing threads within the edge node and dynamically adjusting the rate at which intermediate feature maps are retrieved from the circular queue based on the queue depth.
2. The real-time image processing method based on edge computing as described in claim 1, characterized in that the specific steps for applying spatial pyramid pooling to the image stream with the added label information to generate intermediate feature maps containing multi-scale information include:
defining a set of pooling window sizes from small to large, where the pooling window sizes correspond to feature scales from local to global;
for the received single-frame image, performing sliding average pooling over the entire image using the smallest pooling window size to obtain the feature response map with the richest details;
for the same frame, repeating the sliding average pooling operation with each larger pooling window size in turn, each operation generating a feature response map of the corresponding scale;
concatenating all feature response maps of different scales along the channel dimension in ascending order of scale; and
applying channel normalization to the multi-dimensional feature map formed by the concatenation to eliminate magnitude differences between scales, the normalized result being the intermediate feature map containing multi-scale information.
3. The real-time image processing method based on edge computing as described in claim 1, characterized in that the specific steps for adaptive feature enhancement using deformable convolution kernels include:
taking an intermediate feature map from the circular queue and feeding it into a lightweight offset prediction network;
outputting, by the offset prediction network, an offset field with the same shape as the intermediate feature map, where each value in the offset field represents the displacement vector by which the sampling point of the convolution kernel is offset at the corresponding position;
preparing a standard-sized square convolution kernel whose weight parameters are fixed after initialization;
combining the standard-sized square convolution kernel with the predicted offset field to form a deformable convolution kernel, whose actual sampling position is determined by the original grid position plus the offset field;
convolving the extracted intermediate feature map using the deformable convolution kernel, where during the convolution operation the weights of the convolution kernel remain unchanged but the pixel position to which each weight is applied changes adaptively according to the offset field; and
adding the output feature map of the convolution element-wise to the original intermediate feature map to achieve a residual connection, the result of the addition being the enhanced feature map.
4. The real-time image processing method based on edge computing as described in claim 3, characterized in that the specific steps for generating the spatial attention mask include:
calculating the absolute average value of the enhanced feature map along the channel dimension to obtain a two-dimensional single-channel average activation map;
applying Gaussian smoothing filtering to the single-channel average activation map to eliminate local extrema caused by noise;
calculating the global average value of the single-channel average activation map after Gaussian smoothing filtering and using it as the activation threshold;
comparing the activation value at each position in the single-channel average activation map with the activation threshold, generating a mask value of one at the corresponding position if the activation value is greater than the activation threshold and a mask value of zero otherwise; and
convolving the generated binary mask with a preset dilation kernel so that regions with a value of one are appropriately dilated to ensure that the target edge region is completely covered, the binary map resulting from the dilation operation being the spatial attention mask.
5. The real-time image processing method based on edge computing as described in claim 1, characterized in that the specific steps of mapping the condensed feature package to the target dimension through a projection network to form the feature vector to be transmitted include:
reading the feature dimension required by the latest recognition model from the cloud model update log and denoting it as the target feature dimension;
obtaining the current feature dimension of the condensed feature package and denoting it as the source feature dimension;
inputting the condensed feature package into the first fully connected layer of the projection network, which transforms the source feature dimension into an intermediate hidden layer dimension;
applying a nonlinear activation function to the feature vector transformed to the intermediate hidden layer dimension to introduce nonlinear transformation capability;
inputting the feature vector processed by the nonlinear activation function into the second fully connected layer of the projection network, which transforms the intermediate hidden layer dimension into the target feature dimension; and
standardizing the feature vector output by the second fully connected layer so that it conforms to a standard normal distribution, the standardized feature vector being the feature vector to be transmitted.
6. The real-time image processing method based on edge computing as described in claim 5, characterized in that the training and updating steps of the projection network are completed in the cloud and specifically include: the cloud periodically collecting transmission data units from multiple edge nodes and extracting the feature vector to be transmitted and its corresponding timestamp from each transmission data unit; the cloud feeding the collected feature vectors to be transmitted into the latest recognition model to obtain recognition results; the cloud calculating the model loss from the recognition results obtained from the latest recognition model and the actual annotation results, and updating the parameters of the latest recognition model through the backpropagation algorithm; while the parameters of the latest recognition model are being updated, keeping the parameters of the projection network fixed and calculating the gradient of the feature vector to be transmitted with respect to the projection network parameters; aggregating the gradients from multiple batches of training data in the cloud and using the aggregated gradients to update the parameters of the projection network, thereby generating updated projection network parameters; and the cloud sending the updated projection network parameters and update logs to all edge nodes, which use them to replace their local projection network parameters.
7. The real-time image processing method based on edge computing as described in claim 1, characterized in that the specific steps for dynamically adjusting the rate of retrieving intermediate feature maps from the circular queue based on the queue depth include: periodically sampling the task queues of all image processing threads within the edge nodes and calculating the number of tasks waiting to be processed in each task queue; comparing the number of waiting tasks in each task queue with a preset queue length threshold, and counting the number of task queues that exceed the queue length threshold as the number of overloaded queues; calculating the ratio of the number of overloaded queues to the total number of task queues to obtain the instantaneous overload rate; querying a historical overload rate record table, which records the instantaneous overload rates of multiple recent sampling periods; calculating the moving average of all instantaneous overload rates in the historical overload rate record table and using it as the steady-state overload rate reference value; comparing the instantaneous overload rate obtained in the current sampling period with the steady-state overload rate reference value and calculating the difference between them; adjusting a global rate control parameter based on the sign and magnitude of the difference, the global rate control parameter directly determining the time interval for retrieving intermediate feature maps from the circular queue; and controlling the operation of retrieving intermediate feature maps from the circular queue based on the adjusted global rate control parameter, so that the frequency of the retrieval operation matches the processing capacity of the processing threads.
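By way of a hedged example, the rate adjustment loop of claim 7 might be sketched as below. The class name `RateController`, the multiplicative proportional update, the `gain` value, and the interval limits are all assumptions; the claim only requires that the adjustment depend on the sign and magnitude of the difference. The sketch also assumes that the moving average is taken over earlier sampling periods, before the current rate is recorded.

```python
from collections import deque


class RateController:
    """Adjusts the retrieval interval of the circular queue from sampled queue depths."""

    def __init__(self, queue_length_threshold, history_size=10,
                 base_interval=0.02, gain=0.5,
                 min_interval=0.005, max_interval=0.5):
        self.threshold = queue_length_threshold
        self.history = deque(maxlen=history_size)   # historical overload rate record table
        self.interval = base_interval               # global rate control parameter (seconds)
        self.gain = gain
        self.min_interval = min_interval
        self.max_interval = max_interval

    def update(self, queue_depths):
        """Take one sample of all task queue depths and return the new interval."""
        overloaded = sum(1 for depth in queue_depths if depth > self.threshold)
        instantaneous = overloaded / len(queue_depths)

        # Steady-state reference: moving average over the recorded periods.
        if self.history:
            reference = sum(self.history) / len(self.history)
        else:
            reference = instantaneous

        # Positive difference: load is rising, so retrieve less often (longer interval).
        difference = instantaneous - reference
        scaled = self.interval * (1.0 + self.gain * difference)
        self.interval = min(self.max_interval, max(self.min_interval, scaled))

        self.history.append(instantaneous)
        return self.interval


controller = RateController(queue_length_threshold=5)
first = controller.update([1, 2, 3, 4])      # no queue overloaded
second = controller.update([10, 10, 1, 1])   # half of the queues overloaded
third = controller.update([0, 0, 0, 0])      # load drops below the recent average
```

In this run the interval lengthens when half of the queues become overloaded and shortens again once the instantaneous rate falls below the moving average.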
8. The real-time image processing method based on edge computing as described in claim 7, characterized in that, before the number of task queues exceeding the queue length threshold is counted and recorded as the number of overloaded queues, the method further includes a step of classifying and managing the task queues: defining task priorities for different types of image processing tasks at three levels, namely high, medium, and low, each task priority being assigned an independent physical processing thread group and a corresponding task queue; when sampling the task queues, using different sampling frequencies for task queues of different priorities, the sampling frequency being highest for high-priority task queues and lowest for low-priority task queues; when counting the number of overloaded queues, setting different queue length thresholds for task queues of different priorities, the queue length threshold being smallest for high-priority task queues and largest for low-priority task queues; and when calculating the instantaneous overload rate, weighting the overloaded queues of different priorities, the weight being largest for high-priority overloaded queues and smallest for low-priority overloaded queues.
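The priority-weighted overload rate of claim 8 could, for example, be computed as in the following sketch. The threshold and weight values in `PRIORITY_CONFIG` are hypothetical, and normalizing by the total weight (so that the rate stays between zero and one) is an assumption; the per-priority sampling frequencies are omitted for brevity.

```python
# Hypothetical per-priority settings: a smaller threshold and a larger weight
# for higher priority, as the claim requires; the numbers are illustrative.
PRIORITY_CONFIG = {
    "high": {"threshold": 4, "weight": 3.0},
    "medium": {"threshold": 8, "weight": 2.0},
    "low": {"threshold": 16, "weight": 1.0},
}


def weighted_overload_rate(queue_depths, config=PRIORITY_CONFIG):
    """queue_depths maps a priority name to a list of waiting-task counts."""
    overloaded_weight = 0.0
    total_weight = 0.0
    for priority, depths in queue_depths.items():
        settings = config[priority]
        for depth in depths:
            total_weight += settings["weight"]
            if depth > settings["threshold"]:
                overloaded_weight += settings["weight"]
    # Assumed normalization: weighted share of overloaded queues.
    return overloaded_weight / total_weight if total_weight else 0.0


# One queue per priority: the high and low queues exceed their thresholds.
rate = weighted_overload_rate({"high": [5], "medium": [3], "low": [20]})
```

Here the overloaded weight is 3 + 1 out of a total of 6, so an overloaded high-priority queue contributes three times as much to the rate as an overloaded low-priority queue.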
9. The real-time image processing method based on edge computing as described in claim 1, characterized in that, after the cloud receives the transmitted data unit, the method further includes: the cloud parsing the transmitted data unit and separating out the feature vector to be transmitted, the timestamp, and the source device identifier; retrieving, based on the source device identifier, the feature distribution template stored in the cloud in the historical interaction records of the edge node; aligning the feature vector to be transmitted with the retrieved feature distribution template to eliminate feature offset between different edge nodes; inputting the aligned feature vector to be transmitted into a gated recurrent unit network, which combines the state vector of the edge node from the previous processing request to perform temporal context modeling on the current feature; and inputting the final state vector output by the gated recurrent unit network into the cloud-based multi-task learning network head, which simultaneously performs three sub-tasks, namely target classification, position fine-tuning, and behavior prediction, to generate a comprehensive cloud-based analysis result;
wherein the specific steps for aligning the feature vector to be transmitted with the retrieved feature distribution template include: reading the feature vector to be transmitted and calculating its mean vector and covariance matrix; reading the feature distribution template pre-stored for the current source device identifier, the feature distribution template containing the mean vector and covariance matrix of the historical feature vectors of the edge node; calculating the difference vector between the mean vector of the feature vector to be transmitted and the mean vector of the feature distribution template; whitening the difference vector with a transformation matrix constructed from the square root of the inverse of the covariance matrix of the feature distribution template; adding the whitened difference vector to the feature vector to be transmitted to obtain an intermediate alignment vector; and finally rescaling the intermediate alignment vector so that its covariance matrix is approximately the same as the covariance matrix of the feature distribution template, the rescaled vector being the feature vector to be transmitted after the alignment operation is completed.
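For illustration only, the alignment steps of claim 9 can be written in symbols as follows. The notation is an assumption of this sketch: $x_i$ ($i = 1, \dots, N$) denotes the feature vectors to be transmitted in one request, $(\mu_t, \Sigma_t)$ the mean vector and covariance matrix stored in the feature distribution template, and $A$ one possible rescaling matrix. The claim does not specify the form of $A$ or the point about which the rescaling is applied.

```latex
\begin{aligned}
\mu_x &= \tfrac{1}{N}\sum_{i=1}^{N} x_i, &
\Sigma_x &= \tfrac{1}{N}\sum_{i=1}^{N} (x_i-\mu_x)(x_i-\mu_x)^{\top} \\
d &= \mu_x - \mu_t, &
\tilde d &= \Sigma_t^{-1/2}\, d \\
z_i &= x_i + \tilde d, &
\mu_z &= \mu_x + \tilde d \\
\hat x_i &= \mu_z + A\,(z_i - \mu_z), &
A &= \Sigma_t^{1/2}\,\Sigma_x^{-1/2}
\end{aligned}
```

Under this assumed choice, and provided $\Sigma_x$ is invertible, the covariance of the rescaled vectors is $A\,\Sigma_x A^{\top} = \Sigma_t$. With a regularized or estimated $\Sigma_x^{-1/2}$ the match is only approximate, consistent with the wording of the claim.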
10. A real-time image processing system based on edge computing, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the real-time image processing method based on edge computing as described in any one of claims 1 to 9.