A machine learning based visual target tracking method
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN LIGUAN DIGITAL TECHNOLOGY CO LTD
- Filing Date
- 2025-02-11
- Publication Date
- 2026-06-12
AI Technical Summary
Traditional visual target tracking methods suffer from severe computational resource waste, low accuracy, and inability to effectively express the spatial and temporal relationships between targets in large-scale monitoring scenarios. Furthermore, they have a small monitoring range and cannot coordinate information from multiple cameras, affecting system applicability and response speed.
Based on video acquisition from multiple cameras, frame filtering and target detection are performed to construct a monitoring graph structure. Anomaly detection is then performed using a pre-trained anomaly detection model, and machine learning methods are used to optimize computation and detection accuracy.
It improves monitoring efficiency and accuracy, enabling real-time monitoring and automated anomaly detection over a wide area, thus enhancing the safety and response speed of public places.
Figure CN120071252B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of anomaly monitoring technology, and more specifically to a visual target tracking method based on machine learning.
Background Technology
[0002] Traditional methods typically process all video frames without filtering out irrelevant or unimportant ones. A large amount of unimportant information is therefore processed, which wastes computational resources and reduces the overall efficiency of the system. In large-scale monitoring scenarios, processing time and computational burden may increase significantly, affecting real-time performance and accuracy. Object detection in traditional methods often relies on simple image processing algorithms or manually designed features, making it susceptible to complex environmental conditions (such as changes in lighting, occlusion, and rapid target movement), which results in inaccurate detections and a high risk of missed or false detections. Lacking support from deep learning and advanced object detection models, traditional methods suffer from low detection accuracy and cannot meet the high accuracy and robustness that modern surveillance systems demand. Traditional methods also often lack effective ways to express the spatial and temporal relationships between targets within a monitored area. Even when target detection is possible, it is difficult to fully understand the interactions and relationships between multiple targets, which fragments the analysis of surveillance data and hinders in-depth analysis and prediction of target behavior. Finally, traditional methods typically rely on a single camera or limited surveillance equipment, resulting in a small monitoring range and an inability to coordinate information from multiple cameras. Consequently, it is difficult to conduct real-time monitoring and comprehensive analysis of large areas, and the synergy of multiple cameras cannot be fully utilized, which limits the applicability and response speed of the system in large-scale monitoring.
Summary of the Invention
[0003] The technical problem to be solved by the present invention is to overcome the shortcomings of the prior art and provide a visual target tracking method based on machine learning.
[0004] The technical solution adopted to solve the above-mentioned technical problems is: a visual target tracking method based on machine learning, including:
[0005] Multiple surveillance cameras capture video from multiple monitored areas to obtain multiple surveillance videos. The surveillance videos are then divided into frames to obtain the surveillance video frame sequence corresponding to each surveillance video.
[0006] The monitored video frame sequence is filtered to obtain a set of monitored video frames;
[0007] The target detection is performed on the set of surveillance video frames based on a pre-trained target detection model to obtain the target detection result for each surveillance video frame.
[0008] Based on the target detection results of each monitoring video frame, a monitoring graph structure corresponding to the multiple monitoring areas is constructed.
[0009] Anomaly detection is performed on the monitoring graph structure based on a pre-trained anomaly detection model to obtain the anomaly detection results corresponding to the monitoring area.
[0010] Preferably, the target detection result set includes the detection box coordinates of the detected target, the target detection result confidence score, and the target category, wherein the target category includes objects, people, and environment, and the monitoring graph structure includes a node set and an edge set, wherein the nodes in the node set correspond to the monitoring area, and the edges in the edge set correspond to spatial adjacency relationships, wherein the spatial adjacency relationships are used to indicate that the detected target moves from one monitoring area to another monitoring area.
[0011] Preferably, the monitored video frame sequence is subjected to target filtering to obtain a set of monitored video frames, including:
[0012] The monitoring video frame sequence is traversed, and the background of the initial monitoring video frame in the monitoring video frame sequence is modeled based on Gaussian distribution to obtain the background model corresponding to the initial monitoring video frame.
[0013] When traversing to the current monitoring video frame, the background model compares each pixel of the current monitoring video frame one by one to classify the pixels of the current monitoring video frame and determine the type of the pixels. The types of pixels include background pixels and foreground pixels.
[0014] The number of foreground points identified in the current monitoring video frame is counted to obtain the number of foreground points. Based on the number of foreground points, the retention coefficient of the current monitoring video frame is calculated to obtain the retention coefficient of the current monitoring video frame.
[0015] The retention coefficient of the current monitoring video frame is compared with a preset retention threshold. If the retention coefficient of the current monitoring video frame is greater than the preset retention threshold, the current monitoring video frame is added to the monitoring video frame set.
[0016] The background model is updated based on the current monitoring video frame to obtain a new background model. The above operation is repeated until all monitoring video frames in the monitoring video frame sequence have been traversed to obtain the monitoring video frame set corresponding to the monitoring video frame sequence.
[0017] Preferably, the background model is as follows:
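[0018] $$P(x_{j,t}) = \sum_{i=1}^{K} \omega_{i,t}\,\eta\left(x_{j,t};\ \mu_{i,j,t},\ \Sigma_{i,j,t}\right)$$
[0019] Where $x_{j,t}$ represents the pixel value of the j-th pixel at time t in the surveillance video frame; $P$ represents the background distribution of the pixels; $K$ denotes the number of Gaussian distributions in the background model; $\omega_{i,t}$ represents the weight value of the i-th Gaussian distribution at time t in the background model; $\mu_{i,j,t}$ represents the mean of the i-th Gaussian distribution of the j-th pixel at time t in the surveillance video frame; and $\Sigma_{i,j,t}$ represents the covariance matrix of the i-th Gaussian distribution of the j-th pixel at time t in the surveillance video frame, where $\mu_{i,j,t} = (\mu^{R}_{i,j,t}, \mu^{G}_{i,j,t}, \mu^{B}_{i,j,t})$ and $\Sigma_{i,j,t} = \mathrm{diag}\left((\sigma^{R}_{i,j,t})^2, (\sigma^{G}_{i,j,t})^2, (\sigma^{B}_{i,j,t})^2\right)$. Here $\mu^{R}_{i,j,t}$, $\mu^{G}_{i,j,t}$, and $\mu^{B}_{i,j,t}$ represent the average pixel value of the j-th pixel at time t in the RGB color space, comprising the R, G, and B components; $\sigma^{R}_{i,j,t}$, $\sigma^{G}_{i,j,t}$, and $\sigma^{B}_{i,j,t}$ represent the standard deviations of the pixel values of the j-th pixel at time t in the RGB color space; and $\eta$ represents the probability density function of a Gaussian distribution.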
[0020] The formula for calculating the retention coefficient is as follows:
[0021]
[0022] Where τ represents the retention coefficient, P represents the background distribution of pixels, and m*n represents the total number of pixels in the surveillance video frame;
[0023] The update formula for the background model is as follows:
[0024]
[0025] Where $M_{i,t}$ indicates whether a pixel is a foreground point: $M_{i,t} = 1$ indicates that the pixel is a foreground pixel, and $M_{i,t} = 0$ indicates that the pixel is a background pixel; $X_t$ represents the pixel value of a pixel; and $\alpha$ and $\rho$ represent preset weight coefficients.
[0026] Preferably, the target detection model extracts features of different scales from the surveillance video frame using three efficient convolutional modules, denoted as C1, C2, and C3, respectively. Multi-scale feature fusion is employed to perform target recognition on features of different scales. Specifically, for the smallest scale feature C3, a 1×1 convolutional operation is first used to transform C3, followed by upsampling to obtain a feature of the same size as C2. This feature is then fused with C2 and input into an efficient convolutional module to extract the fused feature. Next, an upsampling operation maps the fused feature to the same size as C1, and it is concatenated with C1 at the channel level. Finally, an efficient convolutional module is used to extract the final feature, which is then input into a classification module to obtain the target detection result set of the surveillance video frame.
[0027] Preferably, the efficient convolutional module uses two different branches to process the input features. The first branch first extracts the inter-channel dependency features of the input through a 3×3 depthwise convolutional kernel, and then extracts the spatial dependency features of the input through a 1×1 pixel-level convolutional operation. The second branch first extracts the spatial dependency features of the input through a 1×1 pixel-level convolutional operation, then extracts the inter-channel dependency features of the input features using a 3×3 depthwise convolutional operation, then weights each channel through an attention module to extract the weighted features, and finally extracts the spatial dependency features of the weighted features through a 1×1 pixel-level convolutional operation. The features of the two branches are integrated through channel-level concatenation and then further integrated through a 3×3 depthwise convolutional operation and a 1×1 pixel-level convolutional operation.
[0028] Preferably, the anomaly detection model employs a graph neural network. The anomaly detection model includes a graph convolutional network, a convolutional network, a hybrid module, and an anomaly judgment module. The graph convolutional network is used to perform convolution operations on the monitoring graph structure to obtain the temporal correlation features of the monitoring graph structure. The convolutional network is used to perform convolution operations on the state feature matrix of the monitoring graph structure during the preset time period to obtain the periodic features of the monitoring graph structure. The hybrid module is used to fuse the temporal correlation features and periodic features of the monitoring graph structure to obtain an anomaly probability result. The anomaly judgment module is used to determine whether the monitoring graph structure is abnormal based on the anomaly probability result.
[0029] Preferably, the graph convolutional network sequentially includes an input layer, a first graph convolutional layer, a second graph convolutional layer, a first fully connected layer, and an output layer. The convolutional network sequentially includes a first convolutional layer, a spatial attention layer, a second convolutional layer, a pooling layer, and a second fully connected layer. The spatial attention layer performs average pooling and max pooling on the input features to obtain average pooling vectors and max pooling vectors. The average pooling vectors and max pooling vectors are then processed by the convolutional layer and a Sigmoid activation function to obtain spatial attention output weights. The spatial attention output weights are then multiplied element-wise with the input features to obtain a spatial attention vector. The hybrid module concatenates the temporal correlation features and the periodic features to obtain a concatenated vector feature. A fully connected layer with one neuron extracts the temporal features of the concatenated vector feature, and a Sigmoid activation function outputs the anomaly probability of the temporal features.
[0030] The beneficial effects of the present invention are as follows:
(1) By screening the surveillance video frames, the present invention removes irrelevant or unimportant frames, reducing the amount of computation and streamlining subsequent processing. By extracting the effective frame sequence related to target tracking, the system can concentrate on important information, thereby improving tracking efficiency and accuracy. Moreover, by using a pre-trained target detection model, the targets in each surveillance video frame are accurately identified and located. The target detection results are used to construct the subsequent monitoring graph structure, capture the dynamic behavior of the targets, and support further target analysis and behavior recognition.
(2) By constructing the monitoring graph structure from the target detection results of each surveillance video frame, the present invention provides a structured representation of the relationships between targets in the monitored area, which clearly shows the spatial and temporal correlation between targets. Furthermore, by using a pre-trained anomaly detection model to detect anomalies in the monitoring graph structure, the system can automatically identify events or behaviors that do not conform to normal behavior. For example, in public safety monitoring, it can automatically detect abnormal behavior and raise real-time alarms.
(3) By combining multi-camera monitoring, target detection, and anomaly detection, the present invention achieves real-time monitoring and automated anomaly detection over a large monitoring area, so that potential security threats are discovered promptly and responded to quickly. This is crucial for improving the security of public places or important areas, especially under unattended operation, where it can significantly improve the efficiency and accuracy of security monitoring.
Attached Figure Description
[0031] Figure 1 is a schematic diagram of the overall method steps in one embodiment of the present invention.
Detailed Implementation
[0032] Example 1: As shown in Figure 1, the present invention proposes a machine learning-based visual target tracking method, comprising:
[0033] S1. Based on multiple surveillance cameras, video is captured from multiple surveillance areas to obtain multiple surveillance videos. The surveillance videos are then divided into frames to obtain the surveillance video frame sequence corresponding to each surveillance video.
[0034] S2. Filter the monitored video frame sequence to obtain a set of monitored video frames;
[0035] S3. Perform target detection on the set of surveillance video frames based on the pre-trained target detection model to obtain the target detection result for each surveillance video frame;
[0036] S4. Construct a monitoring graph structure corresponding to multiple monitoring areas based on the target detection results of each monitoring video frame;
[0037] S5. Based on the pre-trained anomaly detection model, perform anomaly detection on the monitoring graph structure to obtain the anomaly detection results corresponding to the monitoring area.
[0038] In this invention, surveillance cameras are devices used for real-time video acquisition, commonly used in public safety monitoring, traffic management, and other fields. They can capture video signals from designated areas. The monitoring area refers to the specific spatial range covered by the surveillance camera; each camera typically monitors one or more areas, such as shopping malls, streets, and parking lots. Target filtering is the process of identifying and classifying different objects in video frames, aiming to filter out targets of interest from the video frames to provide effective data for subsequent analysis. The monitoring graph structure is a graphical data structure built based on the targets, objects, or events detected within the monitoring area. It represents the relationship between the monitoring area and targets as nodes and edges, where each node represents a target or monitoring area, and each edge represents the relationship between different nodes. After performing anomaly detection, the system outputs the anomaly detection results for each monitoring area, usually presented as a marker or report, indicating whether an abnormal event occurred in a certain monitoring area, and the possible types or severity of these events.
[0039] Example 2: The present invention proposes a machine learning-based visual target tracking method. Compared with Example 1, this example further includes: a target detection result set including the detection box coordinates of the detected target, the target detection result confidence score, and the target category, wherein the target category includes objects, people, and environment; a monitoring graph structure including a node set and an edge set, wherein the nodes in the node set correspond to the monitoring area, and the edges in the edge set correspond to the spatial adjacency relationship, wherein the spatial adjacency relationship is used to indicate that the detected target moves from one monitoring area to another monitoring area.
[0040] In this embodiment, the detection box coordinates refer to the position of the target object in the image as identified by the target detection algorithm. The target is usually represented by a rectangular box (i.e., the detection box), and the coordinates of the box give the positions of the four vertices of the rectangle. The target detection result confidence is the probability value output by the target detection model, which represents the degree of confidence in detecting a certain target. For example, a confidence of 0.95 means that the model assigns a 95% probability to the object in the box belonging to a certain category. High confidence usually indicates that the target detection result is relatively reliable. The target category refers to the category or type to which the detected target belongs. For example, a target can be an "object" (such as a car or a suitcase), a "person" (such as a pedestrian or a specific person), or an "environment" (such as a wall or a ceiling). These categories help the system distinguish and process different types of targets. Spatial adjacency represents the spatial relationship between different monitoring areas. Specifically, it describes the scenario in which a detected target moves from one monitoring area to another. For example, if two areas covered by surveillance cameras (such as two rooms or two streets) are connected to each other, then there is a spatial adjacency relationship between them. When a target moves from one area to another, the spatial adjacency relationship helps to understand the target's path and behavior. This adjacency relationship covers not only physical connections but may also capture the target's behavior during dynamic monitoring. For example, if a person is detected in monitoring area A and then appears in the adjacent area B, the movement of this target from area A to area B can be described by spatial adjacency.
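The node set and edge set described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes every detection has already been associated with a target identifier across cameras (the text does not say how targets are re-identified), and the names `Sighting` and `build_monitoring_graph` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One detection of a target in a monitoring area (hypothetical record)."""
    target_id: str  # assumed to come from cross-camera re-identification
    area: str       # monitoring area in which the target was detected
    time: float     # timestamp of the video frame, in seconds


def build_monitoring_graph(sightings):
    """Return (nodes, edges): nodes are areas, edges count observed moves."""
    nodes = sorted({s.area for s in sightings})
    by_target = defaultdict(list)
    for s in sightings:
        by_target[s.target_id].append(s)

    edges = defaultdict(int)
    for track in by_target.values():
        track.sort(key=lambda s: s.time)
        # A move is two consecutive sightings of one target in different areas.
        for prev, cur in zip(track, track[1:]):
            if prev.area != cur.area:
                edges[(prev.area, cur.area)] += 1
    return nodes, dict(edges)


sightings = [
    Sighting("p1", "A", 0.0),
    Sighting("p1", "A", 1.0),
    Sighting("p1", "B", 2.0),
    Sighting("p2", "B", 0.5),
    Sighting("p2", "C", 3.0),
    Sighting("p1", "C", 4.0),
]
nodes, edges = build_monitoring_graph(sightings)
print(nodes)   # monitoring areas
print(edges)   # directed area-to-area moves with counts
```

Counting moves per directed edge is a design choice of this sketch; a plain set of edges would also satisfy the description.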
[0041] In an optional embodiment, target filtering is performed on the sequence of monitored video frames to obtain a set of monitored video frames, including:
[0042] A1. Traverse the sequence of surveillance video frames and model the background of the initial surveillance video frame in the sequence based on Gaussian distribution to obtain the background model corresponding to the initial surveillance video frame.
[0043] A2. When traversing to the current monitoring video frame, the background model compares each pixel of the current monitoring video frame one by one to classify the pixels of the current monitoring video frame and determine the type of the pixels. The types of pixels include background pixels and foreground pixels.
[0044] A3. Count the number of foreground points in the current monitoring video frame to obtain the number of foreground points. Calculate the retention coefficient of the current monitoring video frame based on the number of foreground points to obtain the retention coefficient of the current monitoring video frame.
[0045] A4. Compare the retention coefficient of the current monitoring video frame with the preset retention threshold. If the retention coefficient of the current monitoring video frame is greater than the preset retention threshold, then add the current monitoring video frame to the monitoring video frame set.
[0046] A5. Update the background model based on the current monitoring video frame to obtain a new background model, and repeat the above operation until all monitoring video frames in the monitoring video frame sequence have been traversed to obtain the monitoring video frame set corresponding to the monitoring video frame sequence.
[0047] It should be noted that the Gaussian distribution, also known as the normal distribution, is a probability distribution in statistics. Background modeling uses the Gaussian distribution to describe background pixels in video frames. These background pixels typically follow certain statistical laws, and their range of values can be estimated using the Gaussian distribution. Background modeling extracts background information from video frame sequences and thereby distinguishes dynamic foreground elements (i.e., changing or moving objects). This method is commonly used to separate dynamic targets, such as moving pedestrians or vehicles, from static environments. Background pixels are pixels in an image that belong to the background region; they remain stable across multiple frames of a surveillance video. Foreground pixels are pixels in an image that belong to the foreground region; they usually change dynamically and represent moving objects or events in the image. The retention coefficient is a coefficient that determines the degree of influence of the current video frame on the background model based on the number of foreground points. A higher retention coefficient indicates that the current frame contains more foreground information (such as people or objects passing by), while a lower one indicates that the frame may contain only background information. The retention coefficient is used to adjust the weight of the video frame in the background model update. The background model update adjusts the background model according to the content of the current video frame (especially the change in foreground points) so that it better matches the current environment. Usually, the background model is updated gradually to adapt to changes in the environment (such as changes in lighting, seasonal changes, or long-term scene changes).
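Steps A1 to A5 can be sketched as a short loop. This is a simplified illustration under stated assumptions: each pixel is modeled by a single Gaussian rather than a mixture, frames are flat lists of grayscale values, the retention coefficient is taken to be the fraction of foreground pixels (the published text omits the formula), and the parameter names and default values (`retain_threshold`, `k`, `rho`, `init_var`, `min_var`) are hypothetical.

```python
def filter_frames(frames, retain_threshold=0.05, k=2.5, rho=0.05,
                  init_var=25.0, min_var=4.0):
    """Return the indices of frames whose foreground share exceeds the threshold."""
    if not frames:
        return []
    # A1: initialise the background model from the first frame.
    mean = [float(v) for v in frames[0]]
    var = [init_var] * len(mean)
    kept = []
    for index, frame in enumerate(frames[1:], start=1):
        # A2: a pixel is foreground if it lies more than k standard
        # deviations away from the background mean.
        is_fg = [(x - m) ** 2 > (k * k) * v for x, m, v in zip(frame, mean, var)]
        # A3: retention coefficient, assumed to be the foreground fraction.
        tau = sum(is_fg) / len(frame)
        # A4: keep the frame if the coefficient exceeds the preset threshold.
        if tau > retain_threshold:
            kept.append(index)
        # A5: update the model using pixels classified as background.
        for j, (x, fg) in enumerate(zip(frame, is_fg)):
            if not fg:
                diff = x - mean[j]
                mean[j] += rho * diff
                var[j] = max(min_var, (1 - rho) * var[j] + rho * diff * diff)
    return kept


background = [100] * 10
moving = [200, 200, 200] + [100] * 7   # three pixels covered by a moving object
frames = [background, background, moving, background]
kept_indices = filter_frames(frames)
print(kept_indices)
```

The variance floor `min_var` is an addition of this sketch; without it, a perfectly static scene would shrink the variance until sensor noise is classified as foreground.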
[0048] In an optional embodiment, the background model is as follows:
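[0049] $$P(x_{j,t}) = \sum_{i=1}^{K} \omega_{i,t}\,\eta\left(x_{j,t};\ \mu_{i,j,t},\ \Sigma_{i,j,t}\right)$$
[0050] Where $x_{j,t}$ represents the pixel value of the j-th pixel at time t in the surveillance video frame; $P$ represents the background distribution of the pixels; $K$ denotes the number of Gaussian distributions in the background model; $\omega_{i,t}$ represents the weight value of the i-th Gaussian distribution at time t in the background model; $\mu_{i,j,t}$ represents the mean of the i-th Gaussian distribution of the j-th pixel at time t in the surveillance video frame; and $\Sigma_{i,j,t}$ represents the covariance matrix of the i-th Gaussian distribution of the j-th pixel at time t in the surveillance video frame, where $\mu_{i,j,t} = (\mu^{R}_{i,j,t}, \mu^{G}_{i,j,t}, \mu^{B}_{i,j,t})$ and $\Sigma_{i,j,t} = \mathrm{diag}\left((\sigma^{R}_{i,j,t})^2, (\sigma^{G}_{i,j,t})^2, (\sigma^{B}_{i,j,t})^2\right)$. Here $\mu^{R}_{i,j,t}$, $\mu^{G}_{i,j,t}$, and $\mu^{B}_{i,j,t}$ represent the average pixel value of the j-th pixel at time t in the RGB color space, comprising the R, G, and B components; $\sigma^{R}_{i,j,t}$, $\sigma^{G}_{i,j,t}$, and $\sigma^{B}_{i,j,t}$ represent the standard deviations of the pixel values of the j-th pixel at time t in the RGB color space; and $\eta$ represents the probability density function of a Gaussian distribution.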
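As a rough illustration of how such a background distribution could be evaluated for one pixel, the sketch below assumes the model is a weighted sum of Gaussian densities with a diagonal RGB covariance, which is what the symbol definitions suggest. The function names and the example component values are hypothetical.

```python
import math


def gaussian_pdf_diag(x, mean, std):
    """Density of a Gaussian with diagonal covariance at the point x."""
    density = 1.0
    for xc, mc, sc in zip(x, mean, std):
        z = (xc - mc) / sc
        density *= math.exp(-0.5 * z * z) / (sc * math.sqrt(2.0 * math.pi))
    return density


def background_probability(x, components):
    """Mixture density: sum of weight * Gaussian density over all components.

    components: list of (weight, mean_rgb, std_rgb); weights should sum to 1.
    """
    return sum(w * gaussian_pdf_diag(x, mu, sd) for w, mu, sd in components)


# Two illustrative background modes for one pixel (values are made up).
components = [
    (0.7, (100.0, 100.0, 100.0), (5.0, 5.0, 5.0)),
    (0.3, (180.0, 180.0, 180.0), (8.0, 8.0, 8.0)),
]
p_background = background_probability((101.0, 99.0, 100.0), components)
p_foreground = background_probability((20.0, 200.0, 30.0), components)
print(p_background > p_foreground)
```

A pixel value close to one of the modes receives a much higher density than a value far from both, which is what lets the model separate background pixels from foreground pixels.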
[0051] The formula for calculating the retention coefficient is as follows:
[0052]
[0053] Where τ represents the retention coefficient, P represents the background distribution of pixels, and m*n represents the total number of pixels in the surveillance video frame;
[0054] The update formula for the background model is as follows:
[0055]
[0056] Where $M_{i,t}$ indicates whether a pixel is a foreground point: $M_{i,t} = 1$ indicates that the pixel is a foreground pixel, and $M_{i,t} = 0$ indicates that the pixel is a background pixel; $X_t$ represents the pixel value of a pixel; and $\alpha$ and $\rho$ represent preset weight coefficients.
[0057] In an optional embodiment, the target detection model extracts features at different scales from the surveillance video frame using three efficient convolutional modules, denoted as C1, C2, and C3, respectively. Multi-scale feature fusion is employed to perform target recognition on features at different scales. Specifically, for the smallest scale feature C3, a 1×1 convolutional operation is first used to transform the smallest scale feature C3. Then, an upsampling operation is performed to obtain a feature of the same size as C2. This feature is then fused with C2 and input into an efficient convolutional module to extract the fused feature. Next, an upsampling operation is used to map the fused feature to the same size as C1, and it is concatenated with C1 at the channel level. Finally, an efficient convolutional module is used to extract the final feature, and this feature is input into a classification module to obtain a set of target detection results for the surveillance video frame.
[0058] It should be noted that efficient convolutional modules refer to convolutional layer designs that improve computational efficiency by optimizing convolutional operations. Efficient convolutional modules typically improve the performance of traditional convolutional layers by reducing computational cost or increasing computational speed. For example, using depthwise separable convolution, grouped convolution, attention mechanisms, etc., can significantly reduce computational cost and improve model efficiency while maintaining performance. Multi-scale feature fusion refers to combining information from different scales (i.e., features of different resolutions or levels) to form a unified feature representation. Upsampling is an operation that increases the resolution of an image or feature map, usually used to recover details of smaller images. Channel concatenation refers to the operation of stitching two or more feature maps along the channel (i.e., depth) dimension.
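The top-down fusion of C3, C2, and C1 can be traced with plain nested lists to show how the shapes evolve. This sketch makes several assumptions not stated in the text: adjacent scales differ by a factor of two, upsampling is nearest-neighbor, "fused with C2" means channel concatenation, and the efficient convolutional module is replaced by an identity placeholder (`block`). All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    return [
        [[row[x // 2] for x in range(2 * len(row))]
         for row in channel for _ in range(2)]
        for channel in fmap
    ]


def conv1x1(fmap, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output o."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[i][y][x] for i, w in enumerate(row_w))
          for x in range(width)] for y in range(height)]
        for row_w in weights
    ]


def concat_channels(a, b):
    """Channel-level concatenation of two maps with equal height and width."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def constant_map(channels, height, width, value):
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def fuse(c1, c2, c3, w_reduce, block=lambda f: f):
    """Top-down fusion; `block` stands in for the efficient convolutional module."""
    p3 = upsample2x(conv1x1(c3, w_reduce))    # transform C3, match C2's size
    p2 = block(concat_channels(p3, c2))       # fuse with C2
    p1 = concat_channels(upsample2x(p2), c1)  # match C1's size, concat with C1
    return block(p1)                          # final feature for the classifier


c1 = constant_map(2, 8, 8, 1.0)   # largest scale
c2 = constant_map(4, 4, 4, 2.0)
c3 = constant_map(8, 2, 2, 3.0)   # smallest scale
w_reduce = [[0.125] * 8 for _ in range(4)]  # 8 -> 4 channels
fused = fuse(c1, c2, c3, w_reduce)
print(shape(fused))
```

With these inputs the final feature has 10 channels at the resolution of C1: four derived from C3, four from C2, and two from C1.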
[0059] In an optional embodiment, the efficient convolution module employs two distinct branches to process the input features. The first branch first extracts inter-channel dependency features using a 3×3 depthwise convolutional kernel, and then extracts spatial dependency features using a 1×1 pixel-level convolutional operation. The second branch first extracts spatial dependency features using a 1×1 pixel-level convolutional operation, then extracts inter-channel dependency features using a 3×3 depthwise convolutional operation, and then weights each channel using an attention module to extract weighted features. Finally, it extracts spatial dependency features of the weighted features using a 1×1 pixel-level convolutional operation. The features from the two branches are integrated through channel-level concatenation, and then further integrated using a 3×3 depthwise convolutional operation and a 1×1 pixel-level convolutional operation.
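Both branches are built from two primitive operations, a 3×3 depthwise convolution and a 1×1 pixel-level convolution. The sketch below implements only these two primitives and the first branch's ordering; the attention module and the final merging step are omitted. Zero padding and the absence of bias terms are assumptions, and the function names are hypothetical. `parameter_counts` uses the standard weight-count formulas to show why the combination is cheaper than a full 3×3 convolution.

```python
def depthwise_conv3x3(fmap, kernels):
    """Apply one 3x3 kernel per channel with zero padding (H and W are kept)."""
    out = []
    for channel, kernel in zip(fmap, kernels):
        h, w = len(channel), len(channel[0])
        result = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernel[dy + 1][dx + 1] * channel[yy][xx]
                result[y][x] = acc
        out.append(result)
    return out


def pointwise_conv(fmap, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wt * fmap[i][y][x] for i, wt in enumerate(row))
          for x in range(w)] for y in range(h)]
        for row in weights
    ]


def parameter_counts(c_in, c_out, k=3):
    """Weights (no bias) of a standard k x k conv vs. depthwise + pointwise."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable


ones = [[1.0] * 3 for _ in range(3)]
fmap = [ones, ones]                      # 2 channels, 3x3
box = [[1.0] * 3 for _ in range(3)]      # sums the 3x3 neighbourhood
dw = depthwise_conv3x3(fmap, [box, box])
branch1 = pointwise_conv(dw, [[1.0, 1.0]])  # first branch: depthwise, then 1x1
print(branch1[0][1][1], parameter_counts(64, 128))
```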
[0060] In an optional embodiment, the anomaly detection model employs a graph neural network. The anomaly detection model includes a graph convolutional network, a convolutional network, a hybrid module, and an anomaly judgment module. The graph convolutional network is used to perform convolution operations on the monitoring graph structure to obtain the temporal correlation features of the monitoring graph structure. The convolutional network is used to perform convolution operations on the state feature matrix of the monitoring graph structure over a preset time period to obtain the periodic features of the monitoring graph structure. The hybrid module is used to fuse the temporal correlation features and periodic features of the monitoring graph structure to obtain the anomaly probability result. The anomaly judgment module is used to determine whether the monitoring graph structure is abnormal based on the anomaly probability result.
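The hybrid module and the anomaly judgment module can be sketched in a few lines, assuming the fusion is a concatenation followed by a single-neuron fully connected layer with a Sigmoid activation, as the description of the layer structure states. The decision threshold of 0.5, the example weights, and the function names are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def anomaly_probability(temporal_features, periodic_features, weights, bias):
    """Concatenate both feature vectors, apply one neuron, squash to (0, 1)."""
    fused = list(temporal_features) + list(periodic_features)
    if len(fused) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return sigmoid(z)


def is_abnormal(probability, threshold=0.5):
    """Anomaly judgment; the 0.5 threshold is an assumption, not from the text."""
    return probability >= threshold


p = anomaly_probability([0.2, 0.8], [0.5], weights=[1.0, -1.0, 2.0], bias=0.0)
print(p, is_abnormal(p))
```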
[0061] It should be noted that graph neural networks (GNNs) are a type of neural network model used to process graph-structured data. In traditional neural networks, the data is typically a fixed two-dimensional matrix (such as the pixel values of an image), while GNNs can process graph data composed of nodes and edges. In monitoring systems, the monitoring graph structure may represent the relationships between different monitoring points. With a GNN, the network can learn the relationships between nodes and how they change over time in order to extract valuable features from the graph. Graph convolutional networks (GCNs) are convolutional neural networks specifically designed for graph data. Unlike traditional convolutional neural networks (CNNs), GCNs capture the structural information of the graph by performing convolution operations on nodes and their neighboring nodes. GCNs propagate information through the adjacency relationships of nodes, thereby learning the features of nodes in the graph and extracting the spatial structural features of graph data. Here, the GCN performs convolution operations on the monitoring graph structure to capture its temporal correlation features. The anomaly probability result is a probability value calculated by the model that indicates whether a given monitoring graph structure is abnormal.
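One graph convolution step over the monitoring areas can be illustrated as below. The text does not specify the propagation rule, so this sketch assumes the widely used form ReLU(Â·H·W), where Â is the adjacency matrix with self-loops and symmetric degree normalization. The function names and the three-area example are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj_norm, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    hidden = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Three monitoring areas in a row: A - B - C.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]   # one target currently in area A
weights = [[1.0]]                  # identity weight for readability
out = gcn_layer(normalized_adjacency(adj), features, weights)
print(out)
```

After one step, the activity in area A has spread to its neighbor B but not to C; stacking a second graph convolutional layer, as the network in this embodiment does, would let it reach C.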
[0062] In an optional embodiment, the graph convolutional network sequentially includes an input layer, a first graph convolutional layer, a second graph convolutional layer, a first fully connected layer, and an output layer. The convolutional network sequentially includes a first convolutional layer, a spatial attention layer, a second convolutional layer, a pooling layer, and a second fully connected layer. The spatial attention layer performs average pooling and max pooling on the input features to obtain average pooling vectors and max pooling vectors. The average pooling vectors and max pooling vectors are then processed by the convolutional layer and the Sigmoid activation function to obtain spatial attention output weights. The spatial attention output weights are multiplied element-wise with the input features to obtain a spatial attention vector. The hybrid module concatenates the temporal correlation features and the periodic features to obtain a concatenated vector feature. A fully connected layer with one neuron extracts the temporal features of the concatenated vector feature, and the Sigmoid activation function outputs the anomaly probability of the temporal features.
[0063] It should be noted that the fully connected layer is one of the most common layers in neural networks. Each neuron is connected to all neurons in the previous layer, hence the name "fully connected". Its function is to linearly transform the features from the previous layer and perform non-linear mapping through the activation function, ultimately outputting activated features. The output of the spatial attention layer is a weight vector representing different spatial locations (image regions or nodes in the graph). These weight values are used to emphasize important parts of the input feature map. It determines the features that the model should focus on at each location and strengthens the feature representation of important regions by multiplying it with the original features.
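The spatial attention layer can be sketched as follows. Two details are assumptions, since the text leaves them open: pooling is taken across the channel axis at each spatial location, and the convolution over the two pooled maps is 1×1, so it reduces to a weighted sum per location. The function name and parameters `w_avg`, `w_max`, and `bias` are hypothetical.

```python
import math


def spatial_attention(fmap, w_avg=1.0, w_max=1.0, bias=0.0):
    """Spatial attention on a [C][H][W] feature map.

    Returns the re-weighted feature map and the per-location weights.
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    weights = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [fmap[c][y][x] for c in range(channels)]
            avg_pool = sum(values) / channels   # average pooling over channels
            max_pool = max(values)              # max pooling over channels
            z = w_avg * avg_pool + w_max * max_pool + bias  # 1x1 convolution
            weights[y][x] = 1.0 / (1.0 + math.exp(-z))      # Sigmoid
    # Element-wise multiplication of the weights with the input features.
    attended = [
        [[fmap[c][y][x] * weights[y][x] for x in range(width)]
         for y in range(height)]
        for c in range(channels)
    ]
    return attended, weights


fmap = [[[0.0, 4.0]],   # channel 0, one row with two locations
        [[0.0, 2.0]]]   # channel 1
attended, attn = spatial_attention(fmap)
print(attn)
```

The location with stronger activations receives a weight close to 1, while the inactive location receives the neutral Sigmoid value of 0.5.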
[0064] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited thereto. Various changes can be made within the scope of knowledge possessed by those skilled in the art without departing from the spirit of the present invention.
Claims
1. A visual target tracking method based on machine learning, characterized in that it comprises:
capturing video of multiple monitoring areas with multiple surveillance cameras to obtain multiple surveillance videos;
dividing each surveillance video into frames to obtain the surveillance video frame sequence corresponding to that surveillance video;
filtering the surveillance video frame sequence to obtain a surveillance video frame set;
performing target detection on the surveillance video frame set based on a pre-trained target detection model to obtain the target detection result of each surveillance video frame;
constructing, based on the target detection result of each surveillance video frame, a monitoring graph structure corresponding to the multiple monitoring areas; and
performing anomaly detection on the monitoring graph structure based on a pre-trained anomaly detection model to obtain the anomaly detection result corresponding to the monitoring areas;
wherein the anomaly detection model employs a graph neural network and includes a graph convolutional network, a convolutional network, a hybrid module, and an anomaly judgment module; the graph convolutional network performs convolution operations on the monitoring graph structure to obtain its temporal correlation features; the convolutional network performs convolution operations on the state feature matrix of the monitoring graph structure over a preset time period to obtain its periodic features; the hybrid module fuses the temporal correlation features and the periodic features of the monitoring graph structure to obtain an anomaly probability result; and the anomaly judgment module determines whether the monitoring graph structure is abnormal based on the anomaly probability result;
wherein the monitoring graph structure includes a node set and an edge set; the nodes in the node set correspond to the monitoring areas, the edges in the edge set correspond to spatial adjacency relationships, and a spatial adjacency relationship indicates that a detected target moves from one monitoring area to another.
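For illustration, a monitoring graph structure of this kind can be sketched as below. The sketch assumes that each detection has already been associated with a target identity across cameras, and it stores edges as directed pairs with a transition count. The function name, the sighting tuple layout, and the counting scheme are assumptions of the sketch, not details given in the claim.

```python
from collections import defaultdict


def build_monitoring_graph(areas, sightings):
    """areas: iterable of monitoring-area identifiers (the node set).
    sightings: list of (timestamp, target_id, area_id) tuples.
    Returns (nodes, edges), where edges maps (from_area, to_area) to the
    number of times a target was seen moving between the two areas."""
    nodes = set(areas)
    edge_counts = defaultdict(int)
    last_area = {}  # most recent area in which each target was seen
    for _, target_id, area_id in sorted(sightings, key=lambda s: s[0]):
        if area_id not in nodes:
            raise ValueError(f"unknown monitoring area: {area_id!r}")
        previous = last_area.get(target_id)
        if previous is not None and previous != area_id:
            edge_counts[(previous, area_id)] += 1  # target moved between areas
        last_area[target_id] = area_id
    return nodes, dict(edge_counts)


# Example with made-up sightings of two targets in three areas.
nodes, edges = build_monitoring_graph(
    ["A", "B", "C"],
    [(0, "p1", "A"), (1, "p1", "B"), (2, "p2", "A"),
     (3, "p1", "B"), (4, "p2", "B"), (5, "p1", "C")],
)
```

A state feature matrix for the preset time period could then be assembled from per-node statistics, which the claim leaves unspecified.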
2. The visual target tracking method based on machine learning according to claim 1, characterized in that the target detection result includes the detection box coordinates of the detected target, the confidence score of the target detection result, and the target category, wherein the target category includes objects, people, and environment.
3. The visual target tracking method based on machine learning according to claim 2, characterized in that filtering the surveillance video frame sequence to obtain the surveillance video frame set includes:
traversing the surveillance video frame sequence, and modeling the background of the initial surveillance video frame in the sequence based on a Gaussian distribution to obtain the background model corresponding to the initial surveillance video frame;
when the traversal reaches the current surveillance video frame, comparing each pixel of the current surveillance video frame with the background model one by one to classify the pixel as either a background pixel or a foreground pixel;
counting the foreground pixels identified in the current surveillance video frame to obtain the number of foreground points;
calculating the retention coefficient of the current surveillance video frame based on the number of foreground points;
comparing the retention coefficient of the current surveillance video frame with a preset retention threshold, and adding the current surveillance video frame to the surveillance video frame set if its retention coefficient is greater than the preset retention threshold;
updating the background model based on the current surveillance video frame to obtain a new background model; and
repeating the above operations until all surveillance video frames in the sequence have been traversed, to obtain the surveillance video frame set corresponding to the surveillance video frame sequence.
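The filtering loop of this claim can be sketched in simplified form. The sketch makes several assumptions that the claim does not state: frames are grayscale, the background model keeps one Gaussian mean per pixel with a fixed standard deviation, the retention coefficient is taken to be the fraction of foreground pixels, and the update is a running average with separate rates for background and foreground pixels. All names and default values are hypothetical.

```python
def filter_frames(frames, retention_threshold=0.1, std=10.0, k=2.5,
                  alpha=0.05, beta=0.01):
    """Return the indices of the frames kept by the filter.

    frames: list of equal-length lists of grayscale pixel values.
    A pixel counts as foreground when it lies more than k * std away
    from its background mean (assumed classification rule).
    """
    if not frames:
        return []
    # Background model initialised from the first frame (per-pixel mean).
    background = [float(v) for v in frames[0]]
    kept = []
    for index in range(1, len(frames)):
        frame = frames[index]
        foreground = [abs(v - m) > k * std for v, m in zip(frame, background)]
        # Assumed retention coefficient: share of foreground pixels.
        retention = sum(foreground) / len(frame)
        if retention > retention_threshold:
            kept.append(index)
        # Assumed update: background pixels adapt at rate alpha,
        # foreground pixels at the slower rate beta.
        background = [m + (beta if fg else alpha) * (v - m)
                      for v, m, fg in zip(frame, background, foreground)]
    return kept


# Example: only the third frame contains a large change.
example_frames = [[100] * 4, [101] * 4, [100, 100, 200, 200], [100] * 4]
kept_indices = filter_frames(example_frames)
```

Updating foreground pixels more slowly keeps a passing target from being absorbed into the background too quickly; that choice is part of the sketch, not of the claim.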
4. The visual target tracking method based on machine learning according to claim 3, characterized in that the background model is as follows:

$$P(X_{i,t}) = \sum_{k} \omega_{k,t}\, \eta\left(X_{i,t};\ \mu_{i,k,t},\ \Sigma_{i,k,t}\right)$$

where $X_{i,t}$ denotes the pixel value of the $i$-th pixel of the surveillance video frame at time $t$; $P(X_{i,t})$ denotes the background distribution of the pixel; $\omega_{k,t}$ denotes the weight of the $k$-th Gaussian distribution in the background model at time $t$; $\mu_{i,k,t}$ denotes the mean of the $k$-th Gaussian distribution of the $i$-th pixel at time $t$; and $\Sigma_{i,k,t}$ denotes the covariance matrix of the $k$-th Gaussian distribution of the $i$-th pixel at time $t$, where $\mu_{i,k,t} = (\mu_R, \mu_G, \mu_B)^{T}$ and $\Sigma_{i,k,t} = \mathrm{diag}(\sigma_R^2, \sigma_G^2, \sigma_B^2)$; $\mu_R$, $\mu_G$, and $\mu_B$ denote the mean pixel values of the R, G, and B components of the $i$-th pixel at time $t$ in the RGB color space; $\sigma_R$, $\sigma_G$, and $\sigma_B$ denote the standard deviations of the pixel values of the R, G, and B components in the RGB color space; and $\eta$ denotes the probability density function of a Gaussian distribution, where

$$\eta(X;\ \mu,\ \Sigma) = \frac{1}{(2\pi)^{3/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu)^{T}\,\Sigma^{-1}\,(X-\mu)\right);$$

the formula for calculating the retention coefficient is as follows: [formula not reproduced]; where the defined quantities are, in order, the retention coefficient, the background distribution of the pixels, and the total number of pixels in the surveillance video frame;
the update formula for the background model is as follows: [formula not reproduced]; where the defined quantities are, in order, an indicator of whether a pixel is a foreground point (one value indicating that the pixel is a foreground point, the other that it is a background pixel), the pixel value of the pixel, and two preset weighting coefficients.
5. The visual target tracking method based on machine learning according to claim 2, characterized in that the target detection model extracts features at three different scales from the surveillance video frames using three efficient convolutional modules, and identifies targets from these features by multi-scale feature fusion: first, a 1×1 convolution performs a feature transformation on the smallest-scale features, which are then upsampled to the same size as the features of the next scale and fused with them; the result is fed into an efficient convolutional module to extract the fused features; next, an upsampling operation maps the fused features to the same size as the features of the remaining scale, with which they are concatenated at the channel level; finally, an efficient convolutional module extracts the final features, which are input into a classification module to obtain the target detection result set of the surveillance video frames.
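The multi-scale fusion path can be illustrated with a minimal sketch on nested lists. It assumes three feature maps whose spatial sizes differ by a factor of two (called f_large, f_mid, and f_small here, which are hypothetical names), nearest-neighbour upsampling, and element-wise addition for the first fusion step. The efficient convolutional module is replaced by a pass-through placeholder, so the sketch shows only the data flow and the resulting shapes.

```python
def conv1x1(features, weight):
    """1x1 convolution: weight[o][c] mixes the input channels at each pixel."""
    height, width = len(features[0]), len(features[0][0])
    return [[[sum(w_c * features[c][y][x] for c, w_c in enumerate(row))
              for x in range(width)] for y in range(height)] for row in weight]


def upsample_nearest(features, factor=2):
    """Repeat every pixel factor times along both spatial axes."""
    return [[[value for value in row for _ in range(factor)]
             for row in channel for _ in range(factor)]
            for channel in features]


def add(a, b):
    """Element-wise sum of two feature maps of identical shape."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def fuse_scales(f_large, f_mid, f_small, reduce_weight, module=lambda f: f):
    """Top-down fusion; `module` stands in for the efficient convolutional module."""
    top = upsample_nearest(conv1x1(f_small, reduce_weight))  # match f_mid
    fused = module(add(top, f_mid))                          # assumed fusion: addition
    merged = upsample_nearest(fused) + f_large               # channel-level concatenation
    return module(merged)


# Example shapes: 4 x 1 x 1, 2 x 2 x 2 and 1 x 4 x 4 (channels x height x width).
f_small = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
f_mid = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
f_large = [[[9.0] * 4 for _ in range(4)]]
final = fuse_scales(f_large, f_mid, f_small,
                    reduce_weight=[[1, 0, 0, 0], [0, 1, 0, 0]])
```

The result has three channels at the largest spatial size, which is the input a classification module would receive in this sketch.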
6. The visual target tracking method based on machine learning according to claim 5, characterized in that the efficient convolution module employs two distinct branches to process the input features. The first branch first extracts inter-channel dependency features using a 3×3 depthwise convolution operation, followed by a 1×1 pixel-level convolution operation to extract spatial dependency features. The second branch first extracts spatial dependency features using a 1×1 pixel-level convolution operation, then extracts inter-channel dependency features using a 3×3 depthwise convolution operation; an attention module then weights each channel to extract weighted features, and finally a 1×1 pixel-level convolution operation extracts the spatial dependency features of the weighted features. The features from both branches are integrated through channel-level concatenation, and further integrated using a 3×3 depthwise convolution operation and a 1×1 pixel-level convolution operation.
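The two-branch structure of the efficient convolution module can be sketched on nested lists as follows. The form of the attention module is not specified in the claim, so the sketch substitutes a simple gate (the Sigmoid of each channel's mean); that gate, the parameter dictionary layout, and all names are assumptions made for illustration, and activation functions and normalization are omitted.

```python
import math


def depthwise3x3(features, kernels):
    """Apply kernels[c] (3 x 3, zero padding) to channel c only."""
    out = []
    for channel, kernel in zip(features, kernels):
        h, w = len(channel), len(channel[0])
        out.append([[sum(kernel[dy][dx] * channel[y + dy - 1][x + dx - 1]
                         for dy in range(3) for dx in range(3)
                         if 0 <= y + dy - 1 < h and 0 <= x + dx - 1 < w)
                     for x in range(w)] for y in range(h)])
    return out


def pointwise(features, weight):
    """1x1 (pixel-level) convolution: weight[o][c] mixes channels at each pixel."""
    h, w = len(features[0]), len(features[0][0])
    return [[[sum(wc * features[c][y][x] for c, wc in enumerate(row))
              for x in range(w)] for y in range(h)] for row in weight]


def channel_attention(features):
    """Hypothetical gate: scale each channel by the Sigmoid of its mean."""
    out = []
    for channel in features:
        mean = sum(map(sum, channel)) / (len(channel) * len(channel[0]))
        gate = 1.0 / (1.0 + math.exp(-mean))
        out.append([[gate * v for v in row] for row in channel])
    return out


def efficient_conv_module(x, p):
    """p: dict of kernels and weights for the two branches and the merge stage."""
    branch1 = pointwise(depthwise3x3(x, p["dw1"]), p["pw1"])
    branch2 = depthwise3x3(pointwise(x, p["pw2a"]), p["dw2"])
    branch2 = pointwise(channel_attention(branch2), p["pw2b"])
    merged = branch1 + branch2  # channel-level concatenation
    return pointwise(depthwise3x3(merged, p["dw_out"]), p["pw_out"])


# Example with identity kernels, so only the attention gate changes values.
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
params = {
    "dw1": [IDENTITY], "pw1": [[1.0]],
    "pw2a": [[1.0]], "dw2": [IDENTITY], "pw2b": [[1.0]],
    "dw_out": [IDENTITY, IDENTITY], "pw_out": [[1.0, 0.0], [0.0, 1.0]],
}
module_out = efficient_conv_module([[[1.0, 1.0], [1.0, 1.0]]], params)
```

With identity kernels the first output channel reproduces the input and the second is the input scaled by the gate, which makes the data flow easy to check by hand.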
7. The visual target tracking method based on machine learning according to claim 6, characterized in that the graph convolutional network sequentially includes an input layer, a first graph convolutional layer, a second graph convolutional layer, a first fully connected layer, and an output layer. The convolutional network sequentially includes a first convolutional layer, a spatial attention layer, a second convolutional layer, a pooling layer, and a second fully connected layer. The spatial attention layer performs average pooling and max pooling on the input features to obtain an average pooling vector and a max pooling vector. These two vectors are processed by a convolutional layer and a Sigmoid activation function to obtain the spatial attention output weights, which are multiplied element-wise with the input features to obtain the spatial attention vector. The hybrid module concatenates the temporal correlation features and the periodic features to obtain a concatenated feature vector. A fully connected layer with one neuron extracts the temporal features of the concatenated feature vector, and a Sigmoid activation function outputs the anomaly probability of those temporal features.