Method and system for intelligent recognition of surveillance images based on image processing

Sparse weights are constructed from the temporal variation potential energy and the spatiotemporal structure collaborative reliability, and are combined with a weighted tensor robust principal component analysis algorithm. This addresses the low computational efficiency of surveillance image recognition in high-resolution, dynamic scenes and enables efficient intelligent recognition of surveillance images.

CN122244575A · Pending · Publication Date: 2026-06-19 · 广东九安智能科技股份有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: 广东九安智能科技股份有限公司
Filing Date: 2026-04-16
Publication Date: 2026-06-19

Abstract

This invention relates to the field of image processing technology, and more specifically, to a method and system for intelligent recognition of surveillance images based on image processing. The method includes: acquiring the grayscale temporal sequence of each pixel in a surveillance image within a preset N frames; constructing the temporal variation potential energy of the pixel, wherein the temporal variation potential energy is calculated by multiplying the mean square error of the grayscale temporal sequence by the logarithm of the absolute value of the difference between each grayscale value and the mean in the grayscale temporal sequence; and constructing the spatiotemporal structure collaborative reliability of the pixel. This invention utilizes generated sparse weights to weight the L1 norm of the foreground sparse tensor, guiding the tensor decomposition algorithm to apply different penalty levels to different regions during iteration. This significantly reduces computational complexity while maintaining high-resolution video processing accuracy, and improves the robustness and convergence speed of the algorithm in dynamic and complex environments.

Description

Technical Field

[0001] This invention relates to the field of image processing technology. More specifically, this invention relates to a method and system for intelligent recognition of surveillance images based on image processing.

Background Technology

[0002] Intelligent image recognition in surveillance systems refers to the process of automatically analyzing, extracting features, and understanding content from continuous video streams collected by surveillance equipment using computer vision and digital image processing technologies. This technology has core application value in fields such as public safety, intelligent transportation, and smart park management, enabling automatic early warning of intrusion targets, abnormal behaviors, and critical events. With the increasing density of urban surveillance networks, the massive amount of video data makes traditional manual patrol monitoring inefficient and prone to missed reports. Therefore, developing intelligent image recognition technology for surveillance systems that can operate automatically around the clock and accurately perceive environmental changes is of urgent practical significance for improving emergency response speed and reducing social governance costs.

[0003] In existing technologies, the Tensor Robust Principal Component Analysis (TRPCA) algorithm utilizes the spatiotemporal correlation of video data to construct a high-dimensional tensor model. Through low-rank and sparse decomposition techniques, it separates the video stream into a low-rank background layer and a sparse foreground target layer, thereby achieving the identification and extraction of monitored targets. It exhibits good robustness in target detection against static backgrounds. However, real-world surveillance image scenes typically feature high resolution, large data throughput, and dynamic environmental changes (such as sudden changes in lighting or swaying trees). Directly using the original TRPCA algorithm for identification results in extremely high computational time costs and low processing efficiency due to the high frequency of tensor singular value decomposition (t-SVD) and complex iterative loop operations involved in its core solution process. This makes it unsuitable for the stringent requirements of real-time intelligent identification of high-definition video streams in surveillance systems.

Summary of the Invention

[0004] This invention provides a method and system for intelligent recognition of surveillance images based on image processing. It aims to solve the problem that actual surveillance image scenes in related technologies usually have characteristics such as high resolution, large data throughput, and dynamic environmental changes (such as sudden changes in lighting and swaying trees). When the original TRPCA algorithm is used directly for recognition, the core solution process involves high-frequency tensor singular value decomposition (t-SVD) and complex iterative loop operations, resulting in extremely high computation time cost and low processing efficiency.

[0005] In a first aspect, the present invention provides an intelligent recognition method for surveillance images based on image processing, comprising: acquiring the grayscale temporal sequence of each pixel in the surveillance image within a preset N frames; constructing the temporal variation potential energy of the pixel, wherein the temporal variation potential energy is calculated by multiplying the mean square error of the grayscale temporal sequence by the logarithm of the absolute value of the difference between each grayscale value in the sequence and the sequence mean; constructing the spatiotemporal structure collaborative reliability of the pixel, wherein the spatiotemporal structure collaborative reliability is calculated by dividing the sum of the products of the temporal variation potential energy of each pixel in the local neighborhood centered on the target pixel and the temporal variation potential energy of the target pixel by the sum of the absolute values of the differences between the temporal variation potential energy of each pixel in the local neighborhood and that of the target pixel; generating sparse weights based on the spatiotemporal structure collaborative reliability, wherein the sparse weights are inversely proportional to the spatiotemporal structure collaborative reliability; and separating the foreground and background of the surveillance image using a weighted tensor robust principal component analysis (TRPCA) algorithm to obtain the foreground target, wherein the weighted TRPCA algorithm uses the sparse weights to weight the L1 norm of the foreground sparse tensor.
By constructing a temporal variation potential energy that combines mean square error and logarithmic gain, and by using the spatiotemporal structure collaborative reliability based on neighborhood consistency, the method distinguishes real targets from dynamic background noise (such as swaying leaves). A sparse weight matrix generated from this reliability adaptively weights the foreground sparse term in the tensor decomposition, thereby addressing the high computational complexity and poor adaptability to dynamic environments of the traditional TRPCA algorithm while maintaining high accuracy.

[0006] Furthermore, the method for generating sparse weights includes: taking the ratio of the maximum spatiotemporal structure collaborative reliability over all pixels to the sum of the target pixel's spatiotemporal structure collaborative reliability and a preset parameter, and multiplying this ratio by a logarithmic term of the ratio of that maximum value to the global mean of the spatiotemporal structure collaborative reliability over all pixels. This method considers both the reliability extremum and the global mean, and uses a logarithmic function to construct an inverse proportional relationship, so that regions belonging to salient foreground targets receive minimal weights (preserving details), while background or noise regions receive extremely large weights (forced to zero).

[0007] Furthermore, the alternating direction method of multipliers (ADMM) is employed to solve the objective function of the weighted tensor robust principal component analysis algorithm. Compared to other solvers, ADMM exhibits better convergence and numerical stability when handling problems with L1 norm and nuclear norm constraints, allowing the system to reach a solution quickly when performing large-scale tensor operations and thus meet the timeliness requirements of the monitoring system.

[0008] Furthermore, the method also includes: converting the original acquired RGB color image into a grayscale image before acquiring the grayscale time series, and performing Gaussian filtering on the grayscale image. Performing color space conversion and Gaussian filtering preprocessing before feature extraction effectively removes random thermal noise generated by the image sensor and high-frequency noise caused by environmental factors. This smoothing step prevents noise from being erroneously amplified as motion features in subsequent time-domain potential energy calculations, thereby improving the signal-to-noise ratio and accuracy of subsequent feature analysis from the source.

[0009] Furthermore, the formula for calculating the temporal variation potential energy is as follows: [formula missing]. The quantities in the formula are: the temporal variation potential energy at a given pixel coordinate; the mean square error of the grayscale temporal sequence; N, the total number of sampled frames; the index of the current frame; the grayscale value of the pixel at that coordinate in the indexed frame; the average gray level of the pixel at that coordinate over the N frames; and the natural constant. The temporal variation potential energy combines the mean square error (MSE) with a logarithmic gain term. The MSE term suppresses high-frequency noise in bright areas with large mean values, while the logarithmic difference term uses the fact that the derivative of the logarithmic function decreases as its argument increases, amplifying the signal in dark areas or when the difference is large. This combination makes the algorithm insensitive to light intensity, so that it captures the temporal fluctuations of moving targets in both bright and dark areas and overcomes the failure of single indicators in unevenly lit scenes.

[0010] Furthermore, the formula for calculating the spatiotemporal structure collaborative reliability is as follows: [formula missing]. The quantities in the formula are: the spatiotemporal structure collaborative reliability at a given pixel coordinate; the temporal variation potential energy at that coordinate; the set of pixels within a local neighborhood window centered on that pixel; the temporal variation potential energy of each pixel in that set; and a preset hyperparameter that prevents the denominator from being zero. This formula mathematically strengthens the synergy between the center pixel and its neighboring pixels: when a pixel and its neighborhood have similarly high temporal variation potential energy, reflecting the spatial continuity of a real object, the reliability value increases rapidly and non-linearly, while isolated noise points, lacking neighborhood support, have their reliability values suppressed. This addresses the difficulty traditional methods have in distinguishing dynamic background clutter from real moving targets.

[0011] Furthermore, the local neighborhood is an area of size [missing value] centered on the target pixel.

[0012] Furthermore, the method for calculating the logarithmic term includes: adding the natural constant to the absolute value and then taking the natural logarithm. This treatment avoids divergence or negative values when the absolute value is close to 0, and improves the stability of the algorithm under extreme data conditions.

[0013] Furthermore, the value of N ranges from 20 to 50.

[0014] In a second aspect, a monitoring image intelligent recognition system based on image processing is also provided, including a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program to implement the monitoring image intelligent recognition method based on image processing described in any of the above embodiments.

[0015] Beneficial effects: Pre-calculated temporal variation potential energy and spatiotemporal structure collaborative reliability are introduced to generate sparse weights. The temporal variation potential energy, which combines mean square error and logarithmic terms, amplifies the signal characteristics of weak moving targets while suppressing background fluctuations caused by illumination. The spatiotemporal structure collaborative reliability uses the consistency of spatial neighborhoods to filter out isolated random noise, such as sensor noise. Finally, the generated sparse weights are used to weight the L1 norm of the foreground sparse tensor, guiding the tensor decomposition algorithm to apply different penalty levels to different regions during iteration. This significantly reduces computational complexity while maintaining high-resolution video processing accuracy, and improves the algorithm's robustness and convergence speed in dynamic and complex environments.

Attached Figure Description

[0016] Figure 1 is a schematic flowchart of target recognition according to an embodiment of the present invention; Figure 2 is a schematic illustration of a confidence feature map according to an embodiment of the present invention; Figure 3 is a schematic diagram of the target recognition results according to an embodiment of the present invention.

Detailed Implementation

[0017] The specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0018] As shown in Figures 1 to 3, S101: Acquire surveillance images and perform preprocessing.

[0019] In this embodiment, the system captures video streams in real time using high-definition digital surveillance cameras deployed in the monitoring area. To accommodate subsequent tensor operations and reduce data dimensionality, the acquired RGB color images are first converted to grayscale images. Considering that random thermal noise generated by the sensor and high-frequency noise caused by ambient light can interfere with subsequent feature extraction, a Gaussian filter is used to smooth the grayscale images. Then, the N continuously acquired frames are stacked in chronological order to form a three-dimensional tensor whose first two dimensions are the height and width of the image and whose third dimension is the frame index.

[0020] In this embodiment, N is preferably between 20 and 50, for example, 25 frames. If N is too small, the amount of information in the time dimension is insufficient, making it difficult to reflect the long-term stability of the background; if N is too large, the computational burden increases and the response to fast-moving targets lags. The preferred kernel size for the Gaussian filter is [missing value] or [missing value], to balance noise reduction and edge preservation.
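The preprocessing in S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the BT.601 luma weights, the odd kernel size of 5, the sigma of 1.0, and all function names are assumptions, since the source does not legibly state the kernel size or the grayscale conversion.

```python
import numpy as np

def to_gray(rgb):
    """RGB (H, W, 3) -> grayscale (H, W) using ITU-R BT.601 luma weights (assumed)."""
    return rgb[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def gaussian_kernel1d(size=5, sigma=1.0):
    """Normalized 1-D Gaussian kernel; size is assumed to be odd."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, size=5, sigma=1.0):
    """Separable Gaussian smoothing with edge-replicated borders."""
    k = gaussian_kernel1d(size, sigma)
    padded = np.pad(img, size // 2, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def stack_frames(rgb_frames, size=5, sigma=1.0):
    """Stack N preprocessed frames in chronological order into an (H, W, N) tensor."""
    return np.stack([gaussian_blur(to_gray(f), size, sigma) for f in rgb_frames], axis=2)
```

Calling `stack_frames` on a list of 25 frames yields a tensor whose third axis is the frame index, matching the example value of N given in this embodiment.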

[0021] S102: Construct the temporal change potential energy of a pixel in the time dimension.

[0022] In one embodiment, to quantify the activity of pixels over time and distinguish between dynamic foreground and static background, a temporal variation potential energy index needs to be constructed. For the surveillance image at any given time, the grayscale values of the pixel at a given coordinate in the preceding N frames, including the current frame, are collected into a sequence, denoted the grayscale temporal sequence.

[0023] Specifically, the temporal variation potential energy is constructed, and the calculation formula is as follows: [formula missing]. The quantities in the formula are: the temporal variation potential energy at a given pixel coordinate; the mean square error of the grayscale temporal sequence; N, the total number of sampled frames; the index of the current frame; the grayscale value of the pixel at that coordinate in the indexed frame; the average gray level of the pixel at that coordinate over the N frames; and the natural constant.

[0024] As can be seen from the above formula, this index combines mean square error and logarithmic gain. The mean square error term can counteract the effect of light intensity: in bright areas, where the mean is larger, it suppresses the high noise of bright areas; in dark areas, where the mean is smaller, it amplifies the term's value. The logarithmic term uses the property that the derivative of the logarithmic function decreases as its argument increases. When a pixel belongs to a moving target, the large difference between its grayscale value and the mean causes both the square term and the logarithmic term to increase simultaneously, so the calculated temporal variation potential energy increases significantly. This amplifies the signal characteristics of weak moving targets while suppressing background fluctuation interference caused by slow changes in illumination.
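The exact formula is not legible in this text, so the sketch below implements only one possible reading of the verbal description: the per-pixel mean square error multiplied by the sum over frames of ln(|I_t - mean| + e). The choice of a sum rather than a mean, the absence of any illumination normalization, and the function name are assumptions.

```python
import numpy as np

def temporal_potential_energy(X):
    """X: (H, W, N) grayscale tensor -> (H, W) potential-energy map (assumed form)."""
    X = np.asarray(X, dtype=np.float64)
    mu = X.mean(axis=2, keepdims=True)      # per-pixel temporal mean
    dev = X - mu                            # deviation of each frame from the mean
    mse = (dev ** 2).mean(axis=2)           # mean square error of the sequence
    # Logarithmic gain: adding e keeps every log term at or above 1.
    log_gain = np.log(np.abs(dev) + np.e).sum(axis=2)
    return mse * log_gain
```

Under this reading a perfectly static pixel has zero energy because its mean square error is zero, while a pixel that fluctuates strongly is boosted by both factors at once.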

[0025] S103: Construct spatiotemporal structural collaborative reliability by combining spatial neighborhood information.

[0026] In this embodiment, since random noise (such as sensor noise or swaying leaves) is usually spatially isolated, while real monitoring targets (such as vehicles and pedestrians) have spatial structural continuity, relying solely on temporal indicators may lead to misjudgments. To filter out false noise, it is necessary to construct a spatiotemporal structural collaborative reliability by combining spatial neighborhood information.

[0027] Specifically, based on the temporal variation potential energy obtained from the above steps, a potential energy map is constructed. For each pixel in the potential energy map, a local window of size [missing value] centered on the pixel is defined, and the temporal variation potential energy of all pixels within the local window is extracted to construct the spatiotemporal structure collaborative reliability. The calculation formula is as follows: [formula missing]. The quantities in the formula are: the spatiotemporal structure collaborative reliability at a given pixel coordinate; the temporal variation potential energy at that coordinate; the set of pixels within a local neighborhood window centered on that pixel; the temporal variation potential energy of each pixel in that set; and a preset hyperparameter that prevents the denominator from being zero, set to [value missing] in this embodiment.

[0028] As the formula shows, this index reflects the consistency in motion activity between the central pixel and its neighboring pixels. When a pixel belongs to a real target entity, its temporal variation potential energy is relatively high and, because of spatial continuity, that of its neighboring pixels is also relatively high, resulting in a larger product in the numerator; at the same time, the difference between the two is small, so the denominator is small. Together these cause the calculated reliability to increase rapidly and non-linearly, distinguishing connected regions generated by real targets from isolated noise points. Following the above method, the spatiotemporal structure collaborative reliability of all pixels can be obtained, thus constructing a confidence matrix.
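The ratio described in claim 1 (sum of neighbor-center products divided by the sum of absolute neighbor-center differences, plus a small constant) can be sketched as follows. The 3 x 3 window, the value of `eps`, the edge-replicated border, and the inclusion of the center pixel in its own neighborhood are assumptions, since the window size and hyperparameter value are missing from this text.

```python
import numpy as np

def collaborative_reliability(E, k=3, eps=1e-6):
    """E: (H, W) potential-energy map -> (H, W) reliability map (window size k assumed)."""
    E = np.asarray(E, dtype=np.float64)
    H, W = E.shape
    P = np.pad(E, k // 2, mode="edge")
    num = np.zeros_like(E)
    den = np.zeros_like(E)
    for dy in range(k):
        for dx in range(k):
            nb = P[dy:dy + H, dx:dx + W]   # neighbor map at one window offset
            num += nb * E                  # product of neighbor and center energy
            den += np.abs(nb - E)          # absolute energy difference
    return num / (den + eps)
```

With these choices an isolated high-energy pixel surrounded by zeros scores about 1/8 of its squared energy, whereas the center of a uniformly high-energy block has a near-zero denominator and therefore a very large score.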

[0029] S104: Construct sparse weights based on confidence features.

[0030] In this embodiment, to guide the subsequent tensor decomposition algorithm in applying different sparsity penalties to different regions, the feature analysis at the front end needs to be transformed into a basis for weight allocation within the algorithm. Based on the spatiotemporal structure collaborative confidence obtained in step S103, sparse weights are constructed.

[0031] First, the spatiotemporal structure collaborative reliability of all pixels in the feature map is normalized using min-max normalization. Then, the sparse weights are calculated using the following formula: [formula missing]. The quantities in the formula are: the sparse weight at a given pixel coordinate used during the algorithm's iterations; the maximum value in the confidence matrix; the global mean of the confidence matrix; the spatiotemporal structure collaborative reliability at that coordinate; a preset hyperparameter that prevents the denominator from being zero, set to [value missing] in this embodiment; and the natural constant.

[0032] As the formula shows, sparse weights are inversely proportional to the spatiotemporal structure collaborative reliability. The more salient the features at a location (i.e., the more significant the foreground), the closer its reliability is to the maximum value; the denominator increases and the calculated sparse weight becomes smaller. In the subsequent optimization model, a smaller weight means a reduced sparsity penalty at that location, allowing it to retain more foreground pixel information; conversely, for background regions with extremely low spatiotemporal structure collaborative reliability, the sparse weights are extremely large, forcing that location to converge to zero quickly.
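A minimal sketch of the weight construction in S104, following the verbal description in claim 2. The placement of the natural constant inside the logarithm, the value of `eps`, and the fallback to uniform weights for a constant map are assumptions; the original formula and hyperparameter value are not legible here.

```python
import numpy as np

def sparse_weights(C, eps=1e-3):
    """C: (H, W) reliability map -> (H, W) sparse weights (assumed formula)."""
    C = np.asarray(C, dtype=np.float64)
    span = C.max() - C.min()
    if span == 0:
        # Degenerate constant map: fall back to uniform weights (our choice).
        return np.ones_like(C)
    Cn = (C - C.min()) / span                      # min-max normalization to [0, 1]
    c_max, c_mean = Cn.max(), Cn.mean()
    log_term = np.log(c_max / c_mean + np.e)       # logarithmic term, assumed ln(ratio + e)
    return c_max / (Cn + eps) * log_term           # inverse relation to reliability
```

High-reliability pixels end up with weights near the logarithmic term, while pixels with near-zero reliability receive weights roughly `1 / eps` times larger.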

[0033] S105: Target recognition using weighted tensor robust principal component analysis algorithm.

[0034] In one embodiment, the matrix formed by the sparse weights of all pixels is denoted the sparse weight matrix. The traditional Tensor Robust Principal Component Analysis (TRPCA) algorithm is improved by weighting. The objective function of the weighted tensor optimization model is as follows: [formula missing]. The quantities in the formula are: the objective function; the low-rank tensor representing the static background; the sparse tensor representing the foreground target; the nuclear norm of the low-rank tensor; the weighted L1 norm of the sparse tensor; and the Hadamard product.
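For orientation, a weighted TRPCA objective consistent with the quantities listed above can be written in the standard form below. All symbols are assumed notation rather than the source's own: X is the observed video tensor, L the low-rank background, S the sparse foreground, W the sparse weight matrix applied to every frame, and lambda a trade-off parameter that appears in standard TRPCA but is not confirmed by this text.

```latex
\min_{\mathcal{L},\,\mathcal{S}} \; \|\mathcal{L}\|_{*} + \lambda \,\|\mathcal{W} \odot \mathcal{S}\|_{1}
\quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S}
```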

[0035] Specifically, the alternating direction method of multipliers (ADMM) is used to solve the above model. In the step that updates the sparse tensor, the sparse weights dynamically adjust the threshold of the soft-thresholding operator. In background areas, where the sparse weights are large, the threshold is extremely high, so noise and background residuals are filtered out directly. In foreground areas, where the sparse weights are small, the threshold is low, so complete target details are preserved.
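The source does not spell out the update rules, so the following is a sketch of the commonly used ADMM scheme for TRPCA (singular value thresholding in the Fourier domain under the t-SVD, followed by element-wise shrinkage), with the shrinkage threshold scaled per pixel by the sparse weights. The parameters `lam`, `mu`, `rho`, `mu_max`, `tol`, and `iters` and all function names are assumptions.

```python
import numpy as np

def weighted_soft_threshold(T, W, tau):
    """Element-wise shrinkage with per-pixel threshold tau * W (W is H x W)."""
    thresh = tau * W[..., None]            # broadcast the weight map over frames
    return np.sign(T) * np.maximum(np.abs(T) - thresh, 0.0)

def tensor_svt(T, tau):
    """Singular value thresholding of each frontal slice in the FFT domain."""
    F = np.fft.fft(T, axis=2)
    out = np.empty_like(F)
    for k in range(T.shape[2]):
        U, s, Vh = np.linalg.svd(F[:, :, k], full_matrices=False)
        out[:, :, k] = (U * np.maximum(s - tau, 0.0)) @ Vh
    return np.real(np.fft.ifft(out, axis=2))

def weighted_trpca(X, W, lam=None, mu=1e-3, rho=1.2, mu_max=1e10, tol=1e-6, iters=200):
    """Split X (H, W, N) into low-rank L and weighted-sparse S with X = L + S."""
    X = np.asarray(X, dtype=np.float64)
    H, Wd, N = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(H, Wd) * N)   # common TRPCA default (assumed)
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                      # Lagrange multiplier tensor
    for _ in range(iters):
        L = tensor_svt(X - S + Y / mu, 1.0 / mu)
        S = weighted_soft_threshold(X - L + Y / mu, W, lam / mu)
        R = X - L - S                         # constraint residual
        Y = Y + mu * R
        mu = min(rho * mu, mu_max)
        if np.max(np.abs(R)) < tol:
            break
    return L, S
```

A pixel with a large weight needs a much larger residual before anything survives in `S`, which is the per-region penalty behavior described in this paragraph.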

[0036] After solving, the sparse tensor is output and binarized to generate a mask image of the moving foreground target. Finally, a connected component analysis algorithm is used to label the regions in the mask, completing the intelligent identification of intruding objects, vehicles, or pedestrians in the surveillance image.
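The binarization and labeling step can be sketched as below for a single frame of the sparse tensor. The fixed threshold, the use of 8-connectivity, and the function names are assumptions; the source names neither the binarization rule nor the connectivity.

```python
import numpy as np
from collections import deque

def binarize(S_frame, thresh):
    """Foreground mask: pixels whose sparse-component magnitude exceeds thresh."""
    return np.abs(S_frame) > thresh

def label_components(mask):
    """Label 8-connected foreground regions with breadth-first search."""
    H, W = mask.shape
    labels = np.zeros((H, W), dtype=np.int32)
    count = 0
    for y in range(H):
        for x in range(W):
            if mask[y, x] and labels[y, x] == 0:
                count += 1                      # start a new region
                labels[y, x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny in range(max(cy - 1, 0), min(cy + 2, H)):
                        for nx in range(max(cx - 1, 0), min(cx + 2, W)):
                            if mask[ny, nx] and labels[ny, nx] == 0:
                                labels[ny, nx] = count
                                queue.append((ny, nx))
    return labels, count
```

Each positive label in the returned map corresponds to one candidate target region, whose bounding box or area can then be reported.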

[0037] This invention also provides an intelligent image recognition system for surveillance cameras based on image processing. The system includes a processor and a memory, the memory storing computer program instructions. When the processor executes the computer program instructions, it implements the intelligent image recognition method for surveillance cameras based on image processing according to the first aspect of this invention.

[0038] The system also includes other components well known to those skilled in the art, such as communication buses and communication interfaces, the settings and functions of which are known in the art and therefore will not be described in detail here.

[0039] In this invention, the aforementioned memory can be any tangible medium containing or storing a program that can be used or combined with an instruction execution system, apparatus, or device. For example, a computer-readable storage medium can be any suitable magnetic or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), etc., or any other medium that can be used to store desired information and can be accessed by an application, module, or both. Any such computer storage medium can be part of a device or accessible to or connected to a device. Any application or module described in this invention can be implemented using computer-readable / executable instructions stored or otherwise maintained on such a computer-readable medium.

[0040] The embodiments described above are merely examples of several implementations of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention.

Claims

1. A method for intelligent recognition of surveillance images based on image processing, characterized in that the method comprises: acquiring, for each pixel in the surveillance image, its grayscale temporal sequence within a preset N frames; constructing the temporal variation potential energy of the pixel, wherein the temporal variation potential energy is calculated by multiplying the mean square error of the grayscale temporal sequence by the logarithm of the absolute value of the difference between each grayscale value in the sequence and the sequence mean; constructing the spatiotemporal structure collaborative reliability of the pixel, wherein the spatiotemporal structure collaborative reliability is calculated by dividing the sum of the products of the temporal variation potential energy of each pixel in the local neighborhood centered on the target pixel and the temporal variation potential energy of the target pixel by the sum of the absolute values of the differences between the temporal variation potential energy of each pixel in the local neighborhood and that of the target pixel; generating sparse weights based on the spatiotemporal structure collaborative reliability, wherein the sparse weights are inversely proportional to the spatiotemporal structure collaborative reliability; and separating the foreground and background of the surveillance image using a weighted tensor robust principal component analysis algorithm to obtain the foreground target, wherein the weighted tensor robust principal component analysis algorithm uses the sparse weights to weight the L1 norm of the foreground sparse tensor.

2. The intelligent recognition method for surveillance images according to claim 1, characterized in that the method for generating sparse weights includes multiplying the ratio of the maximum spatiotemporal structure collaborative reliability over all pixels to the sum of the target pixel's spatiotemporal structure collaborative reliability and a preset parameter by the logarithm of the ratio of that maximum value to the global mean of the spatiotemporal structure collaborative reliability over all pixels.

3. The intelligent recognition method for surveillance images according to claim 1, characterized in that the objective function of the weighted tensor robust principal component analysis algorithm is solved using the alternating direction method of multipliers.

4. The intelligent recognition method for surveillance images according to claim 1, characterized in that the method further includes: converting the originally acquired RGB color image into a grayscale image before acquiring the grayscale temporal sequence, and performing Gaussian filtering on the grayscale image.

5. The intelligent recognition method for surveillance images according to claim 1, characterized in that the formula for calculating the temporal variation potential energy is: [formula missing]; the quantities in the formula are: the temporal variation potential energy at a given pixel coordinate; the mean square error of the grayscale temporal sequence; N, the total number of sampled frames; the index of the current frame; the grayscale value of the pixel at that coordinate in the indexed frame; the average gray level of the pixel at that coordinate over the N frames; and the natural constant.

6. The intelligent recognition method for surveillance images according to claim 1, characterized in that the formula for calculating the spatiotemporal structure collaborative reliability is: [formula missing]; the quantities in the formula are: the spatiotemporal structure collaborative reliability at a given pixel coordinate; the temporal variation potential energy at that coordinate; the set of pixels within a local neighborhood window centered on that pixel; the temporal variation potential energy of each pixel in that set; and a preset hyperparameter that prevents the denominator from being zero.

7. The intelligent recognition method for surveillance images according to claim 1, characterized in that the local neighborhood is an area of size [missing value] centered on the target pixel.

8. The intelligent recognition method for surveillance images according to claim 1 or 2, characterized in that the method for calculating the logarithmic term includes: adding the natural constant to the absolute value and then taking the natural logarithm.

9. The intelligent recognition method for surveillance images according to claim 8, characterized in that the value of N ranges from 20 to 50.

10. An intelligent recognition system for surveillance images based on image processing, comprising a processor and a memory, characterized in that the memory stores a computer program, and the processor executes the computer program to implement the intelligent recognition method for surveillance images as described in any one of claims 1-9.