Target detection method, device, apparatus and storage medium

By dividing the differential image dataset and performing multi-model detection, combined with frame filtering and interpolation, the problem of low target detection accuracy in existing technologies is solved, achieving higher detection accuracy.

CN116310993B · Active · Publication Date: 2026-06-19 · CHENGDU BOE SMART TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHENGDU BOE SMART TECH CO LTD
Filing Date: 2023-03-27
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

The target detection process in existing technologies is relatively simple, resulting in low detection accuracy.

Method used

By dividing the differential image dataset, a target image dataset and a background image dataset are constructed. Different detection models are then used to detect these images. Combined with filtering and frame interpolation, accurate target detection results are obtained.

Benefits of technology

It improves the accuracy of target detection, can extract features of each target type in the differential image dataset in a more granular way, filter out false detection results and supplement lost frames, and achieves more accurate target detection.



Abstract

This application discloses a target detection method, apparatus, device, and storage medium. The method includes: acquiring and processing a video to be detected containing a dynamic background to construct a differential image dataset; dividing the differential image dataset to obtain a target image dataset and a background image dataset; passing the target image dataset through a trained first detection model to obtain a first detection result, and passing the background image dataset through a trained second detection model to obtain a second detection result; filtering and performing frame interpolation on the second detection result to obtain a second target detection result; and obtaining the detection result of the target relative to the background information based on the first and second target detection results. This scheme can filter out false detections and supplement lost frame results after detecting different types of targets, so as to accurately determine the detection result of the target relative to the background information, thereby improving the accuracy of target detection.

Description

Technical Field

[0001] This invention generally relates to the field of image processing technology, and specifically to a target detection method, apparatus, device, and storage medium.

Background Technology

[0002] With the continuous development of image processing technology, target detection, as the foundation of practical technologies such as stereo vision, motion analysis, and data fusion, has been widely applied in various fields, including industrial quality inspection, intelligent navigation, assisted driving, map and terrain registration, and natural resource analysis. In the application of aluminum sheet quality inspection, videos of the aluminum sheets can be captured. To ensure the quality of the aluminum sheets before they leave the factory or during use, the detection of targets in the videos is particularly important.

[0003] Currently, related technologies can process the video to be detected to obtain a difference image, and then use a convolutional neural network to perform target detection, thereby realizing the detection of moving targets in the video. However, this approach is rather one-dimensional in the target detection process, resulting in low accuracy in target detection.

Summary of the Invention

[0004] In view of the aforementioned defects or deficiencies in the existing technology, it is desirable to provide a target detection method, apparatus, device, and storage medium that can specifically detect different categories of targets in a differential image dataset, thereby improving the detection accuracy of the target. The technical solution is as follows:

[0005] According to one aspect of this application, a target detection method is provided, the method comprising:

[0006] A video to be detected containing a dynamic background is acquired and processed to construct a differential image dataset;

[0007] The differential image dataset is divided into a target image dataset containing the target to be detected and a background image dataset containing background information;

[0008] The target image dataset is processed by a trained first detection model to obtain a first detection result, and the background image dataset is processed by a trained second detection model to obtain a second detection result.

[0009] The second detection result is filtered and subjected to frame interpolation to obtain the second target detection result;

[0010] Based on the first detection result and the second target detection result, the first target detection result is obtained;

[0011] Based on the second target detection result, the first target detection result is subjected to coordinate transformation processing to obtain the detection result of the target to be detected relative to the background information.

[0012] According to another aspect of this application, a target detection device is provided, the device comprising:

[0013] The dataset construction module is used to acquire a video to be detected containing a dynamic background and process the video to be detected to construct a differential image dataset.

[0014] The segmentation processing module is used to divide the differential image dataset to obtain a target image dataset containing the target to be detected and a background image dataset containing background information;

[0015] The detection module is used to perform target detection processing on the target image dataset through a trained first detection model to obtain a first detection result, and to perform background detection processing on the background image dataset through a trained second detection model to obtain a second detection result;

[0016] The filtering and processing module is used to filter and perform frame interpolation on the second detection result to obtain the second target detection result;

[0017] The result determination module is used to obtain a first target detection result based on the first detection result and the second target detection result;

[0018] The transformation processing module is used to perform coordinate transformation processing on the first target detection result based on the second target detection result to obtain the detection result of the target to be detected relative to the background information.

[0019] According to another aspect of this application, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the target detection method as described above.

[0020] According to another aspect of this application, a computer-readable storage medium is provided having a computer program stored thereon for implementing the target detection method as described above.

[0021] In the target detection method, apparatus, device, and storage medium provided in the embodiments of this application, a video to be detected containing a dynamic background is acquired and processed to construct a differential image dataset. The differential image dataset is then divided to obtain a target image dataset containing the target to be detected and a background image dataset containing background information. The target image dataset is processed by a trained first detection model to obtain a first detection result, and the background image dataset is processed by a trained second detection model to obtain a second detection result. The second detection result is then filtered and subjected to frame interpolation to obtain a second target detection result. The first target detection result is obtained based on the first detection result and the second target detection result. Finally, coordinate transformation is performed on the first target detection result based on the second target detection result to obtain the detection result of the target to be detected relative to the background information. Compared with existing technologies, on the one hand, dividing the differential image dataset into a target image dataset and a background image dataset allows the features corresponding to each target type in the differential image dataset to be extracted with finer granularity, so that the target and the background information can be identified from more detailed features. Furthermore, by employing separate detection models (the first detection model and the second detection model), different categories of targets in the differential image dataset are detected in a targeted manner, effectively improving the accuracy of target detection. On the other hand, filtering and interpolating the second detection result filters out false detections and supplements missing frames in the differential images, so that the second target detection result is determined from more comprehensive features. The first target detection result can then be obtained accurately from the first detection result and the second target detection result, and coordinate transformation of the first target detection result based on the second target detection result yields an accurate detection result of the target relative to the background information. This also significantly improves the target detection accuracy of the method provided in this application compared with existing technologies.

[0022] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0023] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0024] Figure 1 is a system architecture diagram of the target detection system provided in an embodiment of this application;

[0025] Figure 2 is a schematic flowchart of the target detection method provided in an embodiment of this application;

[0026] Figure 3 is a schematic structural diagram of a method for determining the detection result of a target to be detected relative to background information, provided in an embodiment of this application;

[0027] Figure 4 is a flowchart of the method for training a sub-detection model provided in an embodiment of this application;

[0028] Figure 5 is a schematic structural diagram of a difference image containing point-type targets and background targets provided in an embodiment of this application;

[0029] Figure 6 is a schematic structural diagram of a difference image containing background targets and stain-type targets provided in an embodiment of this application;

[0030] Figure 7 is a schematic diagram of the detection result of the target to be detected relative to background information, provided in an embodiment of this application;

[0031] Figure 8 is a flowchart of the method for determining the detection result of a target relative to background information, provided in an embodiment of this application;

[0032] Figure 9 is a schematic diagram of the target detection device provided in an embodiment of this application;

[0033] Figure 10 is a schematic diagram of the target detection device provided in an embodiment of this application;

[0034] Figure 11 is a schematic structural diagram of a computer device provided in an embodiment of this application.

Detailed Implementation

[0035] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.

[0036] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0037] With the research and advancement of artificial intelligence (AI) technology, AI is being studied and applied in various fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, drones, robots, smart healthcare, smart customer service, and industrial quality inspection. It is believed that with the development of technology, AI will be applied in more fields and play an increasingly important role.

[0038] The solutions provided in this application involve technologies such as neural networks in artificial intelligence, which are specifically illustrated in the following embodiments.

[0039] Currently, one approach in related technologies involves processing the video to be detected to obtain a difference image, and then using a convolutional neural network to perform target detection on the difference image, thereby achieving the detection of moving targets in the video. However, this approach is rather one-dimensional in its target detection process, resulting in low accuracy in target detection.

[0040] To address the aforementioned shortcomings, this application provides a target detection method, apparatus, device, and storage medium. Compared with existing technologies, on the one hand, dividing the differential image dataset into a target image dataset and a background image dataset allows the features corresponding to each target type in the differential image dataset to be extracted with finer granularity, so that the target and the background information can be identified from more detailed features. Furthermore, by employing separate detection models (a first detection model and a second detection model), different categories of targets in the differential image dataset can be detected in a targeted manner, effectively improving the target detection accuracy. On the other hand, filtering and interpolating the second detection result filters out false detections and supplements missing frames in the differential images, so that the second target detection result is determined from more comprehensive features. The first target detection result can then be obtained accurately from the first detection result and the second target detection result. Coordinate transformation of the first target detection result based on the second target detection result then yields an accurate detection result of the target relative to the background information. This also significantly improves the target detection accuracy of the method provided in this application compared with existing technologies.

[0041] Figure 1 is an architecture diagram of the implementation environment of a target detection method provided in an embodiment of this application. As shown in Figure 1, the implementation environment includes terminal 10 and server 20.

[0042] In the field of object detection, the process of detecting the target region of an object to be identified in a video can be performed either on terminal 10 or on server 20. For example, by acquiring the video to be detected through terminal 10, the video can be processed locally on terminal 10 to construct a differential image dataset, and then the target region to be identified can be detected based on the differential image dataset to obtain the detection result of the target relative to the background information; alternatively, the video to be detected can be sent to server 20, so that server 20 can acquire the video, perform target region detection based on the differential image dataset, obtain the detection result of the target relative to the background information, and then send the detection result to terminal 10 to realize the detection of the target to be identified in the video.

[0043] In addition, the terminal 10 may run an operating system, which may include, but is not limited to, Android, iOS, Linux, Unix, Windows, etc. It may also include a user interface (UI) layer, which can provide the display of the video to be detected and the display of the detection results of the target relative to the background information. In addition, the video to be detected required for target detection can be sent to the server 20 based on the application programming interface (API).

[0044] Optionally, terminal 10 can be a terminal device in various AI application scenarios. For example, terminal 10 can be a laptop, tablet, desktop computer, in-vehicle terminal, intelligent voice interaction device, smart home appliance, mobile device, aircraft, etc. Mobile devices can be various types of terminals such as smartphones, portable music players, personal digital assistants, dedicated messaging devices, portable gaming devices, etc. This application embodiment does not specifically limit this.

[0045] Server 20 can be a single server, a server cluster or distributed system consisting of several servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.

[0046] Terminal 10 and server 20 establish a communication connection via a wired or wireless network. Optionally, the aforementioned wireless or wired network uses standard communication technologies and / or protocols. The network is typically the Internet, but can also be any network, including but not limited to a Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), mobile, wired or wireless network, private network or virtual private network, and any combination thereof.

[0047] For ease of understanding and explanation, the target detection method, apparatus, device, and storage medium provided in the embodiments of this application are described in detail below with reference to Figures 2 to 11.

[0048] Figure 2 is a schematic flowchart of a target detection method according to an embodiment of this application. The method can be executed by a computer device, which can be server 20 or terminal 10 in the system shown in Figure 1, or a combination of terminal 10 and server 20. As shown in Figure 2, the method includes:

[0049] S101. Obtain the video to be detected containing a dynamic background and process the video to be detected to construct a differential image dataset.

[0050] The aforementioned video to be detected, which includes a dynamic background, can be any video requiring target detection. This video may include both a dynamic background and the target to be detected. The dynamic background is the fundamental content of the video, used to identify the background information of each frame. The target to be detected refers to the image information in each frame of the video, excluding the background information. For example, in industrial applications, the dynamic background could be a moving disk, and the target to be detected could be dots and stains; dots could be spots, marks, etc.

[0051] In this embodiment of the application, taking a moving disk as the dynamic background as an example, the video to be detected can be obtained by calling an image acquisition device to acquire images of the moving disk in real time; it can also be obtained by extracting frames from a pre-acquired original video; it can also be obtained through the cloud, or through a database or blockchain, or by importing the video to be detected through an external device.

[0052] Optionally, the video to be detected can be a video captured in various scenarios, such as a video of a scene with a dynamic background captured under different light intensities and angles. The video to be detected can be in video image format or 3D point cloud image format.

[0053] In one possible implementation, the image acquisition device can be a camera or a video camera, or it can be a radar device such as a lidar or millimeter-wave radar. The image acquisition device can be located in a production workshop. The camera can be a monocular camera, a binocular camera, a depth camera, a 3D camera, etc.

[0054] In another possible implementation, the original video including the dynamic background can be pre-captured, and then the original video can be processed by frame extraction. By extracting several frames at certain intervals, the video to be detected containing the dynamic background can be obtained.
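The frame extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_frames`, the representation of a video as a Python sequence of frames, and the fixed sampling interval are all assumptions made for the example.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def extract_frames(original_video: Sequence[Frame], interval: int) -> List[Frame]:
    """Keep every `interval`-th frame of the original video, starting at frame 0.

    `original_video` is assumed to be an in-memory sequence of decoded frames;
    the patent does not specify how the interval is chosen.
    """
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return list(original_video[::interval])


# Usage: sample a 10-frame video every 3 frames.
video_to_detect = extract_frames(list(range(10)), 3)
```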

[0055] The process of constructing a differential image dataset from the video to be detected can be understood as converting video data into image data. After the video to be detected is acquired, the inter-frame differencing method can be used to process it and construct the differential image dataset. Inter-frame differencing obtains the contour of a moving target by performing a difference operation on two or more adjacent frames of the video to be detected. Subtracting two frames gives the absolute value of the brightness difference between them; whether this difference exceeds a threshold is used to analyze the motion characteristics of the video and determine whether there are moving objects in the video to be detected. Frame-by-frame differencing of the video to be detected is equivalent to applying a high-pass filter in the time domain.

[0056] It should be noted that the difference image dataset consists of multiple difference images obtained by subtracting adjacent frames of the video to be detected. A difference image is formed by subtracting images of the target scene at consecutive time points. In a general sense, a difference image is defined as the difference between the images of the target scene at time points t_k and t_{k+L}. Subtracting images of the target scene at adjacent time points reveals how the scene changes over time. When a moving object is present in the video, there will be a difference in grayscale between two (or three) adjacent frames. The absolute value of the grayscale difference between two frames is calculated. Stationary objects appear as zeros in the difference image, while moving objects, especially their outlines, show non-zero grayscale values due to changes in grayscale. When the absolute value exceeds a certain threshold, the object can be identified as a moving target, thus achieving target detection.

[0057] In this embodiment of the application, after obtaining the video to be detected, a difference operation can be performed between the pixels of each two or more adjacent frames in the video to be detected to obtain the operation result. The operation result is compared with a preset threshold. When it is greater than the preset threshold, the pixel value is set to 1, otherwise it is set to 0, thereby obtaining the difference image dataset.
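The thresholded inter-frame difference described above can be sketched as below. This is an illustrative sketch under stated assumptions: frames are modeled as grayscale images stored as nested lists of integers, only two adjacent frames are differenced at a time, and the names `difference_image` and `build_difference_dataset` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale image: rows of pixel intensities


def difference_image(prev_frame: Image, next_frame: Image, threshold: int) -> Image:
    """Binary difference image: 1 where |next - prev| > threshold, else 0."""
    same_shape = len(prev_frame) == len(next_frame) and all(
        len(a) == len(b) for a, b in zip(prev_frame, next_frame)
    )
    if not same_shape:
        raise ValueError("frames must have the same shape")
    return [
        [1 if abs(b - a) > threshold else 0 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def build_difference_dataset(frames: List[Image], threshold: int) -> List[Image]:
    """Difference every pair of adjacent frames in the video to be detected."""
    return [
        difference_image(frames[i], frames[i + 1], threshold)
        for i in range(len(frames) - 1)
    ]


# Usage: three 2x2 frames; one pixel brightens and then returns to its old value.
frames = [
    [[10, 10], [10, 10]],
    [[10, 200], [10, 10]],
    [[10, 10], [10, 12]],
]
dataset = build_difference_dataset(frames, threshold=25)
```

Stationary pixels map to 0 and pixels whose brightness changes by more than the threshold map to 1, so a video of N frames yields N - 1 difference images in this two-frame variant.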

[0058] S102. Divide the differential image dataset to obtain a target image dataset containing the target to be detected and a background image dataset containing background information.

[0059] It should be noted that the target image dataset refers to a difference image dataset containing the target to be detected, while the background image dataset refers to a difference image dataset containing background information. The target to be detected can include various target types, such as dots and stains.

[0060] After the differential image dataset is obtained, it can be divided according to the feature types it contains. For each differential image in the differential image dataset, if the feature type contained in the differential image is background information, the differential image is assigned to the background image dataset; if the feature type contained in the differential image is the target to be detected, the differential image is assigned to the target image dataset.

[0061] Optionally, in determining the types of features contained in the difference images, a neural network model can be used to extract features and classify each difference image in the difference image dataset.
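The division step can be sketched as a simple routing loop. The classifier is passed in as a callable because the patent only says a neural network model can be used for it; the function name `divide_dataset`, the `classify` parameter, and the string labels `"target"` and `"background"` are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def divide_dataset(
    difference_images: Sequence[Image],
    classify: Callable[[Image], str],
) -> Tuple[List[Image], List[Image]]:
    """Route each difference image to the target or background image dataset.

    `classify` stands in for the feature-type classifier and is assumed to
    return either "target" or "background".
    """
    target_dataset: List[Image] = []
    background_dataset: List[Image] = []
    for image in difference_images:
        feature_type = classify(image)
        if feature_type == "background":
            background_dataset.append(image)
        elif feature_type == "target":
            target_dataset.append(image)
        else:
            raise ValueError(f"unknown feature type: {feature_type!r}")
    return target_dataset, background_dataset


# Usage with a toy classifier that reads a label stored alongside each image.
images = [{"id": 0, "kind": "target"}, {"id": 1, "kind": "background"}]
targets, backgrounds = divide_dataset(images, lambda img: img["kind"])
```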

[0062] S103. The target image dataset is processed by the trained first detection model to obtain the first detection result, and the background image dataset is processed by the trained second detection model to obtain the second detection result.

[0063] The first detection model mentioned above may include at least two sub-detection models, each of which can be a pre-trained and independently existing model. Each sub-detection model detects a different target type when detecting a target in the target image dataset. The first detection result refers to the processing result obtained by feature extraction and detection processing on the target image dataset, used to identify the detection information of the difference images in the target image dataset, so as to quickly obtain the target type information and target attribute characteristics of the target. The first detection result may include target candidate boxes, the probability value that the target candidate box is the target, and the probability value that the category is the target. The second detection result refers to the processing result obtained by feature extraction and detection processing on the background image dataset, used to identify the detection information of the difference images in the background image dataset. The second detection result may include background detection boxes, the probability value that the background detection box is background information, and the probability value that the category is background information.

[0064] In this embodiment, during the target detection processing of the target image dataset through the first detection model, the target type of the target to be detected can be obtained first. According to the target type, the target image dataset is divided into at least two sub-datasets corresponding to the target type. The at least two sub-datasets are then processed by at least two sub-detection models corresponding to the target type to obtain the sub-detection results corresponding to each sub-detection model. The sub-detection results corresponding to each sub-detection model are then used as the first detection result.

[0065] The target to be detected can have at least two target types, and the number of target types is the same as the number of sub-detection models.
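The per-type dispatch described in this embodiment can be sketched as follows. The sketch assumes each image in the target image dataset already carries its target type, and it models each sub-detection model as a plain callable; `run_first_detection_model` and the type labels are hypothetical names, not taken from the patent.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def run_first_detection_model(
    labeled_images: Sequence[Tuple[str, Any]],
    sub_models: Dict[str, Callable[[Any], Any]],
) -> Dict[str, List[Any]]:
    """Split the target image dataset by target type and run each sub-dataset
    through the sub-detection model registered for that type.

    There is exactly one sub-model per target type, matching the constraint
    that the number of target types equals the number of sub-detection models.
    """
    # Divide the target image dataset into one sub-dataset per target type.
    sub_datasets: Dict[str, List[Any]] = {t: [] for t in sub_models}
    for target_type, image in labeled_images:
        if target_type not in sub_datasets:
            raise KeyError(f"no sub-detection model for target type {target_type!r}")
        sub_datasets[target_type].append(image)
    # The collected sub-detection results together form the first detection result.
    return {
        target_type: [sub_models[target_type](image) for image in images]
        for target_type, images in sub_datasets.items()
    }


# Usage with toy stand-ins for the two sub-detection models.
sub_models = {
    "point": lambda image: ("point_result", image),
    "stain": lambda image: ("stain_result", image),
}
first_detection_result = run_first_detection_model(
    [("point", "a"), ("stain", "b"), ("point", "c")], sub_models
)
```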

[0066] It should be noted that each sub-detection model in the first detection model described above is a neural network model that takes a difference image from the target image dataset as input, outputs the recognition result of the target to be detected, and has the ability to identify the type of the target to be detected and predict the sub-detection result. The sub-detection model is responsible for establishing the relationship between the difference image in the target image dataset and the sub-detection result, and its model parameters are already in an optimal state. The sub-detection model may include, but is not limited to, convolutional layers, fully connected layers, and activation functions. There may be one or more convolutional layers and one or more fully connected layers.

[0067] The aforementioned second detection model is a neural network model that takes a difference image from a background image dataset as input, outputs the recognition result of background information, and has the ability to identify the type of background information and predict the second detection result. Different sub-detection models detect different target types and have different internal network structures. The internal network structure of the second detection model also differs from that of the different sub-detection models.

[0068] For example, taking a moving disk as the dynamic background, and assuming the target types to be detected include point-type targets and stain-type targets, after the video to be detected containing the dynamic background is acquired, the inter-frame difference method is used to process it and construct a difference image dataset. The difference image dataset is then divided to obtain a target image dataset containing the target and a background image dataset containing background information. Based on the target type, the target image dataset is further divided into a point-type target image dataset and a stain-type target image dataset. The point-type target image dataset is processed by a pre-trained first sub-detection model corresponding to point-type targets to obtain the first sub-detection result. Similarly, the stain-type target image dataset is processed by a pre-trained second sub-detection model corresponding to stain-type targets to obtain the second sub-detection result. Finally, the background image dataset is processed by a pre-trained second detection model to obtain the second detection result.

[0069] As shown in Figure 3, taking two sub-detection models and two target types of the target to be detected as an example, once the video to be detected 3-1 containing a dynamic background is obtained, the inter-frame difference method is used to construct the difference image dataset 3-2. The difference image dataset 3-2 is then divided according to the target type to obtain the first-type target image dataset 3-3, the second-type target image dataset 3-4, and the background image dataset 3-5. The first-type target image dataset 3-3 undergoes target detection through the first sub-detection model 3-6 to obtain the first sub-detection result 3-9, and the second-type target image dataset 3-4 undergoes target detection through the second sub-detection model 3-7 to obtain the second sub-detection result 3-10. The background image dataset 3-5 is processed by the second detection model 3-8 to detect background targets, resulting in the second detection result 3-11. The second detection result 3-11 is then filtered and subjected to frame interpolation to obtain the second target detection result 3-12. Based on the first sub-detection result 3-9, the second sub-detection result 3-10, and the second target detection result 3-12, the first target detection result 3-13 is obtained. Finally, based on the second target detection result 3-12, the first target detection result 3-13 is subjected to coordinate transformation to obtain the relative position coordinates 3-14 of the detection result relative to the center point of the moving background.
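The final coordinate transformation can be illustrated with a small sketch. The patent text here does not give the formula, so the example assumes axis-aligned boxes in (x_min, y_min, x_max, y_max) form and defines the relative position as the offset of the target box center from the background box center; the names `box_center` and `relative_position` are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def box_center(box: Box) -> Tuple[float, float]:
    """Center point of an axis-aligned box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def relative_position(target_box: Box, background_box: Box) -> Tuple[float, float]:
    """Offset of the target center from the center point of the moving background."""
    target_x, target_y = box_center(target_box)
    background_x, background_y = box_center(background_box)
    return (target_x - background_x, target_y - background_y)


# Usage: a target 20 pixels to the right of the background (disk) center.
offset = relative_position((110, 90, 130, 110), (0, 0, 200, 200))
```

Expressing target positions relative to the background center keeps them comparable across frames even though the background itself moves in image coordinates.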

[0070] Optionally, the first sub-detection model can be a CenterNet model, the second sub-detection model can be a Faster-RCNN model, and the second detection model can be a YOLOX model.

[0071] It should be noted that for point-type image datasets, detection is performed using the CenterNet model, which is based on heatmap feature extraction. The CenterNet model treats each target to be detected as its center point and extends outward from this point, according to the size of the target region, to generate a Gaussian heatmap. Because the point-type targets in the image dataset used in this embodiment are small and relatively discrete, the heatmaps generated by the CenterNet model overlap little, which improves the detection accuracy for small targets such as point-type targets.
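The Gaussian heatmap idea can be sketched in a few lines. This is a simplified illustration rather than the CenterNet implementation: each target is given as a center point plus a standard deviation `sigma` that is assumed to have been derived from the target region size, and overlapping Gaussians are merged with an element-wise maximum. The function name `gaussian_heatmap` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_heatmap(
    height: int,
    width: int,
    centers: Sequence[Tuple[int, int, float]],
) -> List[List[float]]:
    """Render one Gaussian peak per target center.

    Each entry of `centers` is (center_x, center_y, sigma). Where peaks
    overlap, the larger value is kept.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for center_x, center_y, sigma in centers:
        for y in range(height):
            for x in range(width):
                squared_distance = (x - center_x) ** 2 + (y - center_y) ** 2
                value = math.exp(-squared_distance / (2 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


# Usage: a single small point-type target in the middle of a 5x5 map.
heatmap = gaussian_heatmap(5, 5, [(2, 2, 1.0)])
```

With small sigmas and well-separated centers, each peak decays to near zero before reaching its neighbors, which is the low-overlap property the paragraph above relies on for point-type targets.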

[0072] For stain-type image datasets, the Faster-RCNN model is used for detection. Because the Faster-RCNN model uses a two-stage object detection framework, it can extract target features more accurately, and its anchor-based bounding box regression further improves the accuracy of the detection results. Therefore, this application uses this network to train on and detect stain-type targets, which are relatively large and have obvious features.

[0073] For the background (disk) image dataset, the YOLOX model is used for detection. Because this network is a single-stage object detection network, it can complete the object detection task faster and more accurately than previous versions of YOLO.

[0074] In this embodiment, the target image dataset is divided into at least two sub-datasets corresponding to the target type. Different detection models are selected for different datasets and different target type features, resulting in more granular detection results and improving the accuracy of target detection.

[0075] S104. The second detection result is filtered and frame-filled to obtain the second target detection result.

[0076] S105. Based on the first detection result and the second target detection result, the first target detection result is obtained.

[0077] S106. Based on the second target detection result, perform coordinate transformation on the first target detection result to obtain the detection result of the target to be detected relative to the background information.

[0078] After the first and second detection results are obtained, the second detection result can be filtered and frame-filled to obtain the second target detection result. Specifically, it is first determined whether each second detection result meets the network post-processing conditions. The second detection results in the background image dataset that meet the network post-processing conditions are selected and used as second target detection results. The difference images whose second detection results do not meet the network post-processing conditions are treated as frames to be filled. For each frame to be filled, its detection result is supplemented based on the second detection results of the frames before and after it in the background image dataset, yielding the second target detection result of that frame.

[0079] It should be noted that the above network post-processing conditions concern whether the background target confidence and the background target region area in the second detection result meet the requirements. The second target detection result is a second detection result that meets the network post-processing conditions.

[0080] After the second target detection result is obtained, the first detection results that do not meet the preset conditions can be filtered out, and the first detection results that meet the preset conditions are retained and used as the first target detection result. Then, coordinate transformation is performed on the first target detection result based on the second target detection result to obtain the detection result of the target relative to the background information.

[0081] The first target detection result is a first detection result that meets the preset conditions. The preset conditions may include determining whether the confidence of the detected target in the first detection result is less than a preset confidence threshold, and filtering out the first detection results whose confidence is below that threshold. The preset confidence threshold can be customized according to actual needs. The detection result of the target relative to the background information may be the position coordinates of the target relative to the background information.
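The confidence-based preset condition can be sketched as a simple filter. The dictionary layout (`"box"`, `"score"`) and the function name `filter_by_confidence` are hypothetical and chosen only for illustration; the threshold value is likewise an example.

```python
def filter_by_confidence(detections, threshold):
    """Keep detections whose confidence is not below the threshold."""
    return [d for d in detections if d["score"] >= threshold]

# Hypothetical first detection results for one difference image.
first_detections = [
    {"box": (10, 10, 14, 14), "score": 0.92},
    {"box": (40, 42, 43, 45), "score": 0.31},
    {"box": (70, 15, 90, 30), "score": 0.77},
]
first_target_detections = filter_by_confidence(first_detections, threshold=0.5)
```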

[0082] This application provides a target detection method. Compared with existing technologies, on the one hand, dividing the differential image dataset into a target image dataset and a background image dataset makes it possible to extract the features corresponding to each target type in the differential image dataset at a finer granularity, so that the target and the background information are identified from more detailed features. Furthermore, by employing separate detection models, namely a first detection model and a second detection model, different categories of targets in the differential image dataset are each detected by a dedicated model, effectively improving target detection accuracy. On the other hand, by filtering and frame-filling the second detection results, false detections can be filtered out and missing frames in the differential images can be supplemented, so that the second target detection result is determined from more comprehensive features. Based on the first detection result and the second target detection result, the first target detection result is accurately obtained. Coordinate transformation can then be performed on the first target detection result based on the second target detection result, yielding an accurate detection result of the target relative to the background information. The method provided in this application therefore significantly improves target detection accuracy compared with existing technologies.

[0083] Another embodiment of this application provides a specific implementation for training the sub-detection model. As shown in Figure 4, the training method includes:

[0084] S201. Obtain a sample video containing a dynamic background.

[0085] S202. Process the sample videos to construct a sample difference image dataset.

[0086] S203. Divide the sample difference image dataset to obtain a sample detection target image dataset containing the detection target and a sample background image dataset containing background information; the sample detection target image dataset is labeled with the detection target recognition result.

[0087] S204. Input the sample detection target image dataset into the sub-detection model to be trained for target detection processing to obtain the sub-prediction results of the target to be detected.

[0088] S205. Based on the sub-prediction results and the detection target identification results, calculate the loss function, minimize the loss function, and use an iterative algorithm to iteratively adjust the parameters of the sub-detection model to be trained to obtain the sub-detection model.

[0089] Specifically, sample videos containing dynamic backgrounds can be acquired. These sample videos may include both the dynamic background and the target to be detected, and can be selected from historical surveillance videos containing dynamic backgrounds. The sample videos are then processed using the inter-frame difference method: difference operations are performed on two or more adjacent frames in the sample video to obtain a sample difference image dataset. This sample difference image dataset is then divided into a sample target image dataset containing the target and a sample background image dataset containing background information. The sample target image dataset is labeled with the target recognition result, which can be a manually labeled target region.
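The inter-frame difference step can be sketched as follows. This is a minimal pure-Python illustration on grayscale frames stored as nested lists; taking the absolute per-pixel difference of each adjacent pair is an assumption of this sketch, and the function names are hypothetical.

```python
def frame_difference(prev_frame, next_frame):
    """Absolute per-pixel difference of two equally sized grayscale frames."""
    return [
        [abs(a - b) for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]

def build_difference_dataset(frames):
    """Difference every pair of adjacent frames in acquisition order."""
    return [
        frame_difference(frames[i], frames[i + 1])
        for i in range(len(frames) - 1)
    ]

# Three tiny 2x2 frames: a bright pixel appears, then a second one.
frames = [
    [[0, 0], [0, 0]],
    [[0, 5], [0, 0]],
    [[0, 5], [7, 0]],
]
difference_images = build_difference_dataset(frames)
```

A video with N frames yields N - 1 difference images under this pairing; pixels that do not change between frames drop to zero, which is what suppresses static content.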

[0090] The sample target image dataset is input into the sub-detection model to be trained for target detection processing to obtain the sub-prediction results of the detected target. Based on the sub-prediction results and the target recognition results, the loss function is calculated. According to the minimization of the loss function, the parameters of the sub-detection model to be trained are iteratively adjusted using an iterative algorithm to obtain the sub-detection model.

[0091] The iterative training of the sub-detection model described above can be understood as updating the parameters in the sub-detection model, which may include updating matrix parameters such as the weight matrix and bias matrix. These weight and bias matrices include, but are not limited to, the matrix parameters in the convolutional layers, normalization layers, deconvolutional layers, feedforward layers, and fully connected layers of the sub-detection model.

[0092] In this context, the loss between the sub-prediction result and the target recognition result is used to iteratively train the sub-detection model. This can be achieved by adjusting the parameters of the model to make it converge, based on the loss function, if the model has not converged. Convergence of the sub-detection model can be defined as the difference between its output and the target recognition result on the sample target image dataset being less than a preset threshold, or the rate of change of the difference between the output and the target recognition result approaching a lower value. The sub-detection model is considered convergent when the calculated loss function is small, or when the difference between the calculated loss function and the loss function from the previous iteration approaches 0.
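The two convergence criteria described above (a small loss, or a near-zero change in loss between iterations) can be sketched as a single check. The function name and both threshold parameters are hypothetical; the source leaves the actual values to the implementer.

```python
def has_converged(loss_history, loss_threshold, delta_threshold):
    """Return True if the latest loss is small or has stopped changing."""
    if not loss_history:
        return False
    current = loss_history[-1]
    # Criterion 1: the loss itself is below a preset threshold.
    if current < loss_threshold:
        return True
    # Criterion 2: the change from the previous iteration approaches 0.
    if len(loss_history) >= 2:
        return abs(current - loss_history[-2]) < delta_threshold
    return False
```

In a training loop, this check would run after each iteration, and parameter updates would stop once it returns True.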

[0093] The loss function mentioned above can be the cross-entropy loss function, the normalized cross-entropy loss function, focal loss, or the like. The iterative methods mentioned above can include gradient descent, Newton's method, and quasi-Newton methods for optimizing the loss function. It should be noted that there are no restrictions on the iterative method used for iterative optimization.
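For reference, the standard binary forms of two of the losses named above can be written out for a single prediction. This is a sketch of the textbook definitions, not of the loss actually used by any of the models in this application; the defaults `gamma=2.0` and `alpha=0.25` are commonly used focal loss settings and do not come from the source.

```python
import math

def binary_cross_entropy(p, label, eps=1e-12):
    """Cross-entropy for one predicted probability p and a 0/1 label."""
    p_t = p if label == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))

def binary_focal_loss(p, label, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss: cross-entropy down-weighted for easy examples."""
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=1` the focal loss for a positive sample reduces to plain cross-entropy; larger `gamma` shrinks the contribution of well-classified samples.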

[0094] It should be noted that the sub-detection models for different target types mentioned above can be trained independently. The training samples used by each sub-detection model during the training process can be a sample detection target image dataset, and the corresponding labeled detection target recognition result is the target to be detected for that target type; or the sample detection target image dataset can be divided into datasets corresponding to each target type, and the corresponding labeled detection target recognition result is the target to be detected for that target type.

[0095] In this embodiment, a disk is taken as the background target, point targets and stain targets are taken as the detection targets, the CenterNet model is used for training on point targets, and the Faster-RCNN model is used for training on stain targets. After the sample difference image dataset is obtained, it is divided to obtain a sample detection target image dataset containing the detection targets and a sample disk image dataset containing disk information. The sample detection target image dataset is then divided according to target type to obtain a sample point target dataset and a sample stain target dataset. As shown in Figures 5 and 6, Figure 5 is a schematic diagram of a difference image including a background target and point targets, and Figure 6 is a schematic diagram of a difference image including a background target and stain targets.

[0096] When training the CenterNet model using a sample point class target dataset, the differenced images of the sample point class target dataset are first preprocessed. This preprocessing involves random cropping and data augmentation to obtain a point class augmentation dataset. This augmentation dataset is then normalized for both data and size, resulting in a normalized point class dataset. This normalized dataset is then processed by the CenterNet model to be trained. First, features are extracted from the backbone network of the CenterNet model. These features are then upsampled through a deconvolutional network to obtain deconvolutional features. These deconvolutional features are then used for prediction through three prediction heads to obtain prediction results. These prediction results can include the target center point location, the width and height of the candidate bounding box, and the predicted class. Each prediction head can include a normalization layer, a fully connected layer, and an activation function. Finally, a loss function is calculated based on the labeled point class target recognition results and the output results. The parameters of the CenterNet model to be trained are iteratively adjusted using an iterative algorithm to minimize the loss function, resulting in the final CenterNet model.

[0097] When training the Faster-RCNN model using a sample stain-type target dataset, the difference images of the sample stain-type target dataset are first preprocessed. Noise filtering and data augmentation are performed to obtain an augmented stain-type dataset. This augmented dataset is then normalized for both data and size, resulting in a normalized stain-type dataset. This normalized dataset is then processed by the Faster-RCNN model to be trained. The anchor parameters of the region proposal network are jointly set based on the downsampling ratio of the Faster-RCNN model's backbone network and the statistical results of the stain target candidate boxes. Feature extraction is performed through the backbone network of the Faster-RCNN model to obtain a feature map. This feature map is then processed through the Region Proposal Network (RPN) in the Faster-RCNN model to obtain predicted candidate boxes. Finally, the original image and the feature map are mapped using the Region of Interest Pooling (RoI Pooling) layer to obtain the predicted candidate box features. Classification is then performed based on these predicted candidate box features to obtain the output result. The loss function is calculated based on the labeled stain-type target recognition results and output results. The parameters of the Faster-RCNN model to be trained are iteratively adjusted according to the minimization of the loss function, so as to obtain the Faster-RCNN model.

[0098] When training the YOLOX model using a sample disk image dataset, the difference images of the sample disk image dataset are first preprocessed. Data augmentation (random noise, random horizontal flipping, and random vertical flipping) is then applied to obtain an augmented disk dataset. This augmented dataset is then normalized in both data and size to obtain a normalized disk dataset. This normalized disk dataset is then used by the YOLOX model to be trained for detection, yielding the output results. Based on the labeled disk target recognition results and the output results, a loss function is calculated. By minimizing the loss function, an iterative algorithm is used to adjust the parameters of the YOLOX model to be trained, resulting in the YOLOX model.
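The three augmentations named above (random noise, random horizontal flipping, random vertical flipping) can be sketched in pure Python on a grayscale image stored as nested lists. The flip probability, the noise amplitude, the 0 to 255 pixel range, and all function names are assumptions of this sketch.

```python
import random

def horizontal_flip(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]

def vertical_flip(image):
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(image)]

def add_random_noise(image, rng, amplitude):
    """Add uniform integer noise and clamp to the 0..255 range."""
    return [
        [min(255, max(0, px + rng.randint(-amplitude, amplitude))) for px in row]
        for row in image
    ]

def augment(image, rng, flip_prob=0.5, noise_amplitude=5):
    """Apply noise, then each flip independently with probability flip_prob."""
    out = add_random_noise(image, rng, noise_amplitude)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    if rng.random() < flip_prob:
        out = vertical_flip(out)
    return out

augmented = augment([[10, 20], [30, 40]], random.Random(0))
```

When an image is flipped, its labeled bounding boxes would need the same flip applied; that bookkeeping is omitted here.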

[0099] Further, as shown in Figure 7, after the video to be detected containing a dynamic background is acquired, it is processed using the inter-frame difference method to construct a difference image dataset. The difference image dataset is then divided to obtain a detection target image dataset containing the target to be detected and a background image dataset containing background information. According to target type, the detection target image dataset is divided into a point target image dataset and a stain target image dataset. The point target image dataset is then processed by the pre-trained CenterNet model corresponding to point targets: the backbone network of the CenterNet model extracts features, the features are upsampled through a deconvolution network to obtain deconvolution features, and the deconvolution features are passed through three prediction heads to obtain the first detection result for the point targets. This first detection result includes the point target prediction box, the probability value of the prediction box, and the probability value of the point class.

[0100] For the stain target image dataset, feature maps are extracted using the backbone network of the pre-trained Faster-RCNN model corresponding to stain-type targets. These feature maps are then processed by the Region Proposal Network (RPN) in the Faster-RCNN model to obtain predicted candidate boxes. Next, a Region of Interest (RoI) pooling layer maps the original image and the feature map to obtain the predicted candidate box features. Finally, classification is performed based on these features to obtain the first detection result for the stain-type targets. This first detection result includes the predicted bounding box of the stain-type target, the probability value of the predicted bounding box being a stain-type target, and the probability value of the stain class.

[0101] Taking a disk as the background and a disk image dataset as the background image dataset as an example, the disk image dataset is processed by the trained YOLOX model to obtain the second detection result.

[0102] After obtaining the first detection results corresponding to each different target type, joint discrimination processing can be performed on them and output synchronously. The second detection results are then filtered and frame-filled to obtain the second target detection results. Then, the detection results are post-processed according to the second target detection results and the first detection results corresponding to different target types to obtain the first target detection results. Based on the second target detection results, the first target detection results are subjected to coordinate transformation processing to obtain the relative position coordinates of the detection results relative to the center point of the moving background.
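One plausible reading of the coordinate transformation step is a translation that expresses each target's center relative to the center of the detected background box. The sketch below assumes boxes in `(x1, y1, x2, y2)` form and a pure translation with no rotation or scaling; both the box format and the function names are assumptions, since the source does not spell out the transformation.

```python
def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def to_background_coords(target_box, background_box):
    """Target center expressed relative to the background center."""
    tx, ty = box_center(target_box)
    bx, by = box_center(background_box)
    return (tx - bx, ty - by)

# A target near the top-left of a background box centered at (50, 50).
relative_position = to_background_coords((10, 10, 20, 20), (0, 0, 100, 100))
```

Because the background moves between frames, coordinates expressed this way stay comparable across frames even when the image coordinates of the target change.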

[0103] In this embodiment, target detection is performed on a video containing a dynamic background by using multiple target detection networks. This allows for more granular detection of the target and background information, improving the accuracy of target detection. It also provides precise data guidance for subsequent post-processing based on the first and second detection results, thereby accurately obtaining the detection result of the target relative to the background information.

[0104] In one embodiment, a specific implementation is also provided for filtering and frame-filling the second detection result to obtain the second target detection result after the first and second detection results have been obtained. As shown in Figure 8, the method includes:

[0105] S301. Based on the background detection box in the second detection result, determine the background target confidence and background target area corresponding to the background detection box. The background target confidence is used to characterize the probability value that the image within the background detection box is background information.

[0106] It should be noted that the first detection result mentioned above is obtained by performing target detection processing on the target image dataset, and may include target candidate boxes, probability values of the target candidate boxes being the target to be detected, and probability values of the category being the target to be detected. The second detection result is obtained by performing background detection on the background image dataset, and may include background detection boxes, probability values of the background detection boxes being background information, and probability values of the category being background information. The background target region area refers to the area of the background detection box. Specifically, each difference image in the background image dataset, after detection processing, yields a corresponding second detection result.

[0107] After the second detection result is obtained, the area of the background detection box in the second detection result can be calculated to obtain the background target region area. Then, the probability value of the background detection box being background information and the probability value of the category being background information are used to obtain the background target confidence.

[0108] S302. For each second detection result, determine whether the background detection box meets the network post-processing conditions based on the background target confidence and the background target region area, and obtain the corresponding judgment result.

[0109] After the second detection result corresponding to each difference image in the background image dataset is obtained, the intersection over union (IoU) between any two background detection boxes in each second detection result can be determined. It is then determined whether the IoU is greater than a preset IoU threshold. If it is greater than the preset IoU threshold, the background detection box with the higher background target confidence of the two is determined to meet the network post-processing conditions; if it is not greater than the preset IoU threshold, any of the two background detection boxes whose background target region area meets the preset area condition is determined to meet the network post-processing conditions. The preset IoU threshold can be customized according to actual needs.

[0110] It should be noted that the intersection over union (IoU) is the ratio of the area of the intersection region to the area of the union region. In this embodiment, for any two background detection boxes, the areas of their intersection region and their union region can be calculated, and the ratio of the two gives the IoU of that pair of background detection boxes. Target filtering can then be performed based on whether the IoU is greater than a preset IoU threshold $T_{IoU}$. When the IoU is greater than $T_{IoU}$, the background target confidence scores of the two background detection boxes are determined and compared; the background detection box with the lower background target confidence is removed and the one with the higher confidence is retained. The background detection box with the higher background target confidence is determined to meet the network post-processing conditions, while the one with the lower confidence is determined not to meet them.
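The intersection over union defined above can be computed as follows for axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function names are assumptions of this sketch.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(box_a, box_b):
    """Intersection area divided by union area of two boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0
```

For example, two 10 by 10 boxes offset by 5 pixels in each direction share a 5 by 5 intersection, giving an IoU of 25 / 175.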

[0111] When the IoU is not greater than the preset IoU threshold $T_{IoU}$, the background target region areas of the two background detection boxes are calculated, and it is determined whether each area meets the preset area condition. A background detection box that meets the preset area condition is determined to meet the network post-processing conditions. The preset area condition is whether the background target region area is greater than $T_{area}$ times the total area of the difference image: a background detection box whose background target region area is greater than $T_{area}$ times the total area of the difference image is determined to meet the preset area condition, and one whose area is not greater than $T_{area}$ times the total area of the difference image is determined not to meet it. The multiplier $T_{area}$ can be customized according to actual needs; for example, $0.2 < T_{area} < 0.3$.

[0112] For example, suppose the second detection result obtained by performing background detection on a difference image in the background image dataset includes three background detection boxes: the first detection box, the second detection box, and the third detection box. The IoU can then be calculated for each pair of these three detection boxes, that is, between the first and second detection boxes, between the first and third detection boxes, and between the second and third detection boxes. For example, the IoU between the first and second detection boxes is A1, the IoU between the first and third detection boxes is A2, and the IoU between the second and third detection boxes is A3. The preset IoU threshold can be M.

[0113] For the first and second detection boxes, when A1 > M, the background target confidence scores a1 and a2 of the first and second detection boxes are obtained and compared. If a1 > a2, it is determined that the second detection box does not meet the network post-processing conditions, while the first detection box does. When A1 ≤ M, the background target region areas b1 and b2 of the first and second detection boxes are obtained, and it is determined whether b1 and b2 are each greater than $T_{area}$ times the total area of the difference image. If the area b1 of the first detection box is greater than $T_{area}$ times the total area of the difference image and the area b2 of the second detection box is not, then the second detection box is determined not to meet the network post-processing conditions, while the first detection box does. Similarly, the same steps are applied to the first and third detection boxes and to the second and third detection boxes, thereby obtaining the background detection boxes that meet the network post-processing conditions.
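The pairwise decision in this worked example can be sketched directly on the scalar quantities it names. The function name `judge_pair` and its parameter names are hypothetical, and keeping the first box when the two confidences are exactly equal is an assumption, since the source does not address ties.

```python
def judge_pair(iou_value, conf1, conf2, area1, area2,
               image_area, iou_threshold, t_area):
    """Return (first_ok, second_ok) for one pair of background boxes."""
    if iou_value > iou_threshold:
        # Heavily overlapping boxes: keep only the more confident one.
        first_wins = conf1 >= conf2  # tie goes to the first box (assumption)
        return first_wins, not first_wins
    # Otherwise each box is judged on its own area.
    min_area = t_area * image_area
    return area1 > min_area, area2 > min_area

# A1 > M with a1 > a2: only the first box passes.
overlap_case = judge_pair(0.8, 0.9, 0.6, 3000, 2800, 10000, 0.5, 0.25)
# A1 <= M: b1 exceeds 0.25 * 10000 = 2500, b2 does not.
area_case = judge_pair(0.1, 0.9, 0.6, 3000, 1000, 10000, 0.5, 0.25)
```

Applying this to every pair of boxes in a second detection result, and keeping boxes that are never rejected, would be one way to realize the filtering; how pairwise verdicts are combined across three or more boxes is not specified in the source.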

[0114] In this embodiment, by determining whether each background detection box meets the network post-processing conditions based on the background target confidence and the background target region area, targets that may have been falsely detected can be accurately filtered out, so that the background detection boxes that meet the network post-processing conditions are retained more accurately.

[0115] S303. Based on the judgment result, the second detection result is filtered and frame-filling processed to determine the second target detection result.

[0116] The aforementioned second target detection result is the second detection result that meets the network post-processing conditions. After obtaining the judgment result, when the judgment result indicates that the background detection box meets the network post-processing conditions, the second detection result corresponding to the background detection box that meets the network post-processing conditions is selected and used as the second target detection result; when the judgment result indicates that the background detection box does not meet the network post-processing conditions, the difference image to be frame-filled is determined based on the second detection result corresponding to the background detection box that does not meet the network post-processing conditions, and frame-filling processing is performed on it to obtain the second target detection result of the difference image to be frame-filled. Here, the background detection box that does not meet the network post-processing conditions can be understood as a dropped frame image, which needs to be frame-filled.

[0117] As one possible approach, the difference images in the background image dataset can be taken one at a time as the current frame difference image, and it can be determined whether the second detection result of the current frame difference image meets the network post-processing conditions. If it does, the detected box can be taken as the background target, and the corresponding second detection result can be used as the second target detection result. If the second detection result of the current frame difference image does not meet the network post-processing conditions, it can be understood that no background detection box meeting the network post-processing conditions has been detected. In this case, the difference image in the background image dataset corresponding to the background detection box that does not meet the network post-processing conditions is treated as the difference image to be supplemented. At least three consecutive difference images are obtained from the background image dataset, and joint decision processing is performed on them to determine the second target detection result of the difference image to be supplemented.

[0118] As another possible approach, the second detection result of the current frame difference image can be compared with the second detection result of the previous frame difference image to determine whether their intersection over union (IoU) is greater than a threshold. If the IoU between the current frame difference image and the previous frame difference image is less than the threshold, the difference between the two is large, and the second detection result of the current frame difference image does not meet the network post-processing conditions. At least three consecutive difference images are then obtained from the background image dataset, and joint decision processing is performed on them to determine the second target detection result. If the IoU between the current frame difference image and the previous frame difference image is greater than the threshold, the difference between the two is small, and the second detection result of the current frame difference image meets the network post-processing conditions.

[0119] The aforementioned three consecutive differential images include the differential image to be supplemented and the differential images of the frames before and after the differential image to be supplemented. The differential images of the frames before and after the differential image include the differential image of the previous frame and the differential image of the next frame. The acquisition time of the differential image of the previous frame is before the acquisition time of the differential image to be supplemented, and the acquisition time of the differential image of the next frame is after the acquisition time of the differential image to be supplemented. At least one frame of the differential image of the previous frame satisfies the network post-processing conditions, and at least one frame of the differential image of the next frame satisfies the network post-processing conditions.

[0120] Specifically, when jointly deciding on the second target detection result of the difference image to be supplemented based on at least three frames of difference images, the second detection result corresponding to each difference image in the preceding and following frames can be obtained, and the corresponding background target region area is determined from it. Then, the second intermediate detection results in the preceding and following difference images that meet the preset conditions are selected, and their background center coordinates and background target region areas are determined. Following the acquisition time sequence, the background center coordinates and background target region area of the difference image to be supplemented are filled in according to those of the second intermediate detection results, yielding the second target detection result of the difference image to be supplemented.

[0121] After the second detection result corresponding to each of the preceding and following difference images is determined, the area of the background target region in each of these images is obtained from the second detection result, and it is judged whether that area is greater than T_area times the total area of the difference image. A background detection box whose background target region area is greater than T_area times the total area of its difference image is determined to meet the preset area condition; a background detection box whose background target region area is not greater than T_area times the total area is determined not to meet the preset area condition. The multiplier T_area can be set according to actual needs, for example 0.2 < T_area < 0.3. Then, the second detection results that meet the preset area condition are selected from the preceding and following difference images as the second intermediate detection results, and their background center coordinates and background target region areas are determined. Along the time axis of the acquisition sequence, the background center coordinates of the second intermediate detection results are recorded in order and connected to form a motion trajectory. Based on this trajectory, the background center coordinates of the difference image to be supplemented (i.e., the difference image in which the target was lost) are filled in, and its background target region area is filled in from the background target region areas of the previous-frame and next-frame difference images, for example as the average of the two.
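The area condition and the selection of second intermediate detection results can be sketched as follows. This is a minimal illustration, not the patented implementation: the `BackgroundDetection` record, the function names, and the default `t_area=0.25` (chosen from inside the 0.2 to 0.3 range given as an example) are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BackgroundDetection:
    """Hypothetical record for one background detection box."""
    frame_index: int   # position of the frame on the acquisition time axis
    cx: float          # background center x, in pixels
    cy: float          # background center y, in pixels
    box_area: float    # area of the background target region, in pixels^2


def meets_area_condition(det: BackgroundDetection,
                         image_area: float,
                         t_area: float = 0.25) -> bool:
    """True when the box area exceeds t_area times the total image area."""
    return det.box_area > t_area * image_area


def select_intermediate(dets: List[BackgroundDetection],
                        image_area: float,
                        t_area: float = 0.25) -> List[BackgroundDetection]:
    """Keep detections that meet the area condition, ordered by time.

    The ordered centers (cx, cy) of the returned detections play the role
    of the motion trajectory described in the text.
    """
    kept = [d for d in dets if meets_area_condition(d, image_area, t_area)]
    return sorted(kept, key=lambda d: d.frame_index)
```

Sorting by `frame_index` stands in for arranging the results along the acquisition time axis; a real system would likely sort by timestamps instead.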

[0122] Take the case in which the background image dataset is a disk image dataset. When no disk detection box that meets the network post-processing conditions is detected in the current difference image of the disk image dataset, a joint decision can be made based on N consecutive difference images. Specifically, the moving-disk detection results of the N consecutive difference images are obtained, the corresponding disk target region area is determined from each moving-disk detection result, and it is judged whether the disk target region area meets the target region area condition, i.e., whether the disk target region area is greater than T_area times the total area of the difference image. A disk detection box whose disk target region area is greater than T_area times the total area of the difference image is determined to meet the preset target region area condition, and a disk detection box whose disk target region area is not greater than T_area times the total area is determined not to meet it. Then, following the acquisition time sequence, the center point coordinates of the moving disk are recorded in order along the time axis and connected to form a motion trajectory. Based on this trajectory, the center point coordinates of the moving disk in the difference image with the lost target (the difference image to be supplemented) are filled in. The disk target region areas of the previous-frame and next-frame difference images of the difference image to be supplemented are obtained, and their average is calculated to give the disk target region area of the difference image to be supplemented. Here, total number of frames per second > N ≥ 3, where the total number of frames per second refers to the number of difference image frames transmitted per second, for example 30.

[0123] For example, when the difference image to be supplemented is the third frame among the difference images of the background image dataset, N consecutive difference images are obtained from the background image dataset, and joint decision processing is performed on these N difference images to determine the second target detection result, where total number of frames per second > N ≥ 3. Assume that three consecutive difference images are obtained from the background image dataset, namely the second, third, and fourth frames. The corresponding second detection results are obtained, and the corresponding background target region areas are determined from them. When the background target region areas of the second and fourth frames both meet the preset condition, the second detection results of the second and fourth frames are used as the second intermediate detection results. Following the acquisition time order, the background center coordinates and background target region areas of the second and fourth frames are recorded in sequence along the time axis to form a trajectory. The background center coordinates and background target region area of the third frame are then filled in according to this trajectory, giving the second target detection result of the third frame.
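The frame 2, 3, 4 example can be sketched in code. The text only says the missing center is filled in "according to the trajectory"; linear interpolation between the two neighbouring centers is an assumption made here for illustration, as are the `BgResult` type and the `supplement_frame` name. Averaging the two areas follows the example given in the description.

```python
from typing import NamedTuple


class BgResult(NamedTuple):
    """Hypothetical second (background) detection result for one frame."""
    frame_index: int
    cx: float      # background center x, in pixels
    cy: float      # background center y, in pixels
    area: float    # background target region area, in pixels^2


def supplement_frame(prev: BgResult, nxt: BgResult,
                     missing_index: int) -> BgResult:
    """Fill in the result for a lost frame lying between prev and nxt.

    Center: linear interpolation along the prev -> nxt segment (assumption).
    Area: mean of the previous-frame and next-frame areas.
    """
    if not prev.frame_index < missing_index < nxt.frame_index:
        raise ValueError("missing_index must lie strictly between the frames")
    t = (missing_index - prev.frame_index) / (nxt.frame_index - prev.frame_index)
    cx = prev.cx + t * (nxt.cx - prev.cx)
    cy = prev.cy + t * (nxt.cy - prev.cy)
    area = (prev.area + nxt.area) / 2.0
    return BgResult(missing_index, cx, cy, area)


# Frame 3 is lost; frames 2 and 4 both met the preset area condition.
frame2 = BgResult(2, 100.0, 200.0, 3000.0)
frame4 = BgResult(4, 104.0, 198.0, 3200.0)
frame3 = supplement_frame(frame2, frame4, 3)
```

With more than two valid neighbours, a curve fit over the whole trajectory could replace the two-point interpolation; the description leaves that choice open.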

[0124] In this embodiment, by filtering and frame interpolation of the second detection results, false detections can be filtered out, and the missing frame differential image can be supplemented. This combines more comprehensive features to determine the second target detection result, improving the accuracy of the second target detection result and providing accurate data guidance for subsequent determination of the detection result of the target relative to the background information.

[0125] Furthermore, after using different sub-detection models to process the target dataset and obtain sub-detection results corresponding to different target types, and outputting them, it can be determined whether the target to be detected is located outside the region where the background information is located based on the first detection result and the second target detection result. Then, the first detection result where the target to be detected is located outside the region where the background information is located is filtered out to obtain the filtered first detection result. Based on the filtered first detection result, the detection target confidence corresponding to the target candidate box is determined. The detection target confidence is used to characterize the probability value that the image within the target candidate box is the target to be detected. The first detection result where the detection target confidence is less than the preset confidence threshold is filtered out to obtain the first target detection result.

[0126] Specifically, a coarse discrimination method can be used to determine whether a target to be detected is located outside the area where the background information is located. The first detection result of the target is obtained first, and the maximum and minimum values of the distance between the target region in the first detection result and the background center coordinates are determined. If the maximum and minimum values are not within a threshold range, it is determined that the target is located outside the area where the background information is located. If the maximum and minimum values are within the threshold range, it is determined that the target is located within the area where the background information is located. The first detection results in which the target is located outside the area where the background information is located are then filtered out, giving the filtered first detection result. Next, the filtered first detection result is filtered by target confidence: the probability value that the candidate box contains a target and the probability value that the category is the target are determined from the filtered first detection result, and the two values are multiplied to give the target confidence corresponding to the candidate box. It is then judged whether the target confidence is less than a preset confidence threshold, and the first detection results whose target confidence is less than the preset confidence threshold are filtered out, giving the first target detection result.
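The two filtering stages can be sketched as below. Several details are assumptions made for illustration: the distance between the target region and the background center is measured from the four corners of an axis-aligned box, detections are plain dictionaries with hypothetical keys `box`, `objectness`, and `class_prob`, and the threshold range is passed in as a `(low, high)` pair.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def corner_distance_range(box: Box,
                          center: Tuple[float, float]) -> Tuple[float, float]:
    """Min and max distance from the box corners to the background center."""
    x_min, y_min, x_max, y_max = box
    cx, cy = center
    dists = [math.hypot(x - cx, y - cy)
             for x in (x_min, x_max) for y in (y_min, y_max)]
    return min(dists), max(dists)


def inside_background(box: Box, center: Tuple[float, float],
                      dist_range: Tuple[float, float]) -> bool:
    """Coarse test: both extreme distances must fall in the threshold range."""
    low, high = dist_range
    d_min, d_max = corner_distance_range(box, center)
    return low <= d_min and d_max <= high


def filter_first_results(dets: List[Dict], center: Tuple[float, float],
                         dist_range: Tuple[float, float],
                         conf_threshold: float) -> List[Dict]:
    """Drop targets outside the background, then drop low-confidence ones."""
    kept = []
    for det in dets:
        if not inside_background(det["box"], center, dist_range):
            continue  # target lies outside the background region
        confidence = det["objectness"] * det["class_prob"]
        if confidence < conf_threshold:
            continue  # below the preset confidence threshold
        kept.append({**det, "confidence": confidence})
    return kept
```

The order of the two stages mirrors the text: location filtering first, confidence filtering second, so that confidence is only computed for boxes that survive the coarse test.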

[0127] For example, taking the target to be detected as including point targets and stain targets, and the background information as a disk, after processing the point target image dataset through the CenterNet model to obtain the first detection result of the point targets, and processing the stain target image dataset through the trained Faster-RCNN model corresponding to the stain targets to obtain the first detection result of the stain targets, we can first determine whether the target to be detected is located in the area where the disk information is located, filter out the first detection results where the target to be detected is located outside the area where the disk information is located, and obtain the filtered first detection result. Then, we obtain the detection target confidence of the filtered first detection result, and then determine whether the detection target confidence is less than a preset confidence threshold, filter out the first detection results where the detection target confidence is less than the preset confidence threshold, and obtain the first target detection result.

[0128] After obtaining the first target detection result and the second target detection result, it is necessary to perform coordinate transformation on the first target detection result. A rectangular coordinate system is established with the background center coordinates corresponding to the second target detection result as the origin and the horizontal direction to the right as the positive x-axis. Then, according to the rectangular coordinate system, the first target detection result is transformed based on the second target detection result to obtain the detection result of the target relative to the background information. This detection result of the target relative to the background information is the position coordinate of the target relative to the background center coordinates, which can be expressed by the following coordinate transformation formula:

[0129]

[0130] Where (x1,y1) are the pixel coordinates of the target to be detected when the coordinate system is established with the top left corner vertex of the difference image as the origin before coordinate transformation. They can be obtained by the first detection model for target detection processing. (x0,y0) are the coordinates of the background center. These coordinates can be obtained by the second detection model for background detection processing. (x2,y2) are the position coordinates output after coordinate transformation.
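A minimal sketch of such a transformation, using the symbols defined above, is given here. The x component follows directly from moving the origin to (x0, y0) with the x-axis pointing right. The direction of the y-axis in the new coordinate system is not stated in the surrounding text, so it is exposed as the hypothetical parameter `y_axis_up`; the function name is likewise an assumption.

```python
from typing import Tuple


def to_background_frame(target_xy: Tuple[float, float],
                        center_xy: Tuple[float, float],
                        y_axis_up: bool = True) -> Tuple[float, float]:
    """Convert pixel coordinates (x1, y1) to coordinates (x2, y2) relative
    to the background center (x0, y0).

    Pixel coordinates have their origin at the top-left corner of the
    difference image, with y increasing downward.
    """
    x1, y1 = target_xy
    x0, y0 = center_xy
    x2 = x1 - x0
    # Assumption: with y_axis_up=True the new y-axis points upward, which
    # flips the sign relative to image rows; otherwise it keeps pointing down.
    y2 = (y0 - y1) if y_axis_up else (y1 - y0)
    return x2, y2


# A target 20 px to the right of and 20 px above the background center.
relative = to_background_frame((120.0, 80.0), (100.0, 100.0))
```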

[0131] In addition, in this embodiment, the results obtained by the coordinate transformation formula can be further combined with the attributes of the background information to determine whether the target to be detected is located in the area where the background information is located. When the background is a disk, the attributes of the background information may include the disk size, the size of the inscribed circle, etc.
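As one possible reading of this check, the sketch below tests whether a transformed position falls inside a disk described only by its radius. The `disk_radius` parameter and the function name are assumptions; the description does not specify how the disk size or the inscribed circle enters the decision.

```python
import math
from typing import Tuple


def target_on_disk(relative_xy: Tuple[float, float],
                   disk_radius: float) -> bool:
    """True when the position, already expressed relative to the disk
    center, lies within disk_radius of the origin (assumed criterion)."""
    x2, y2 = relative_xy
    return math.hypot(x2, y2) <= disk_radius
```

Because the position is already relative to the background center, the test reduces to a single distance comparison against the origin.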

[0132] In this embodiment, multiple target detection networks are used to detect targets of different categories in the differential image, which alleviates the detection performance limitations of a single target detection network. This approach can comprehensively and specifically determine the detection results of different categories, filter out redundant information in video target detection, and filter out false detections by jointly discriminating the detection results of different categories. The detection results are accurately converted into relative position information, which solves the problem that moving backgrounds are difficult to eliminate by differential after alignment, thereby improving the accuracy of relative position information determination.

[0133] It should be noted that although the operations of the method of the present invention are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all of the operations shown must be performed to achieve the desired result. On the contrary, the steps depicted in the flowchart may be performed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.

[0134] On the other hand, Figure 9 is a schematic diagram of a target detection device provided in an embodiment of this application. The device can be a component within a terminal or server. As shown in Figure 9, the device includes:

[0135] The dataset construction module 710 is used to acquire and process the video to be detected, which contains a dynamic background, to construct a differential image dataset.

[0136] The segmentation processing module 720 is used to segment the differential image dataset to obtain a detection target image dataset containing the target to be detected and a background image dataset containing background information.

[0137] The detection module 730 is used to perform target detection processing on the target image dataset through a trained first detection model to obtain a first detection result, and to perform background detection processing on the background image dataset through a trained second detection model to obtain a second detection result;

[0138] The filtering processing module 740 is used to filter and perform frame interpolation on the second detection result to obtain the second target detection result;

[0139] The result determination module 750 is used to obtain the first target detection result based on the first detection result and the second target detection result;

[0140] The transformation processing module 760 is used to perform coordinate transformation processing on the first target detection result based on the second target detection result to obtain the detection result of the target to be detected relative to the background information.

[0141] In some embodiments, the detection module 730 is specifically used for:

[0142] Obtain the target type of the target to be detected;

[0143] Based on the target type, the target image dataset is divided into at least two sub-datasets corresponding to the target type;

[0144] The at least two sub-datasets are processed by at least two sub-detection models corresponding to the target types to obtain the sub-detection result corresponding to each sub-detection model, and the sub-detection results corresponding to the sub-detection models are used as the first detection result.
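The three steps above amount to grouping the target images by type and routing each group to its own model. The sketch below shows that routing with stand-in callables; the sample dictionary keys `target_type` and `image`, the function names, and the `detectors` mapping are all hypothetical, and real sub-detection models would replace the callables.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


def split_by_target_type(samples: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Divide the target image dataset into one sub-dataset per target type."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for sample in samples:
        groups[sample["target_type"]].append(sample)
    return dict(groups)


def run_sub_detectors(samples: List[Dict[str, Any]],
                      detectors: Dict[str, Callable[[Any], Any]]) -> Dict[str, List[Any]]:
    """Run each sub-dataset through the sub-detection model for its type.

    The combined per-type outputs play the role of the first detection result.
    """
    results: Dict[str, List[Any]] = {}
    for target_type, subset in split_by_target_type(samples).items():
        if target_type not in detectors:
            raise KeyError(f"no sub-detection model for type {target_type!r}")
        model = detectors[target_type]
        results[target_type] = [model(s["image"]) for s in subset]
    return results
```

In the example of the description, the keys would be the point and stain target types, with a CenterNet model and a Faster-RCNN model as the two callables.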

[0145] In some embodiments, as shown in Figure 10, the above-mentioned filtering processing module 740 includes:

[0146] The first determining unit 741 is used to determine the background target confidence and background target region area corresponding to the background detection box based on the background detection box in the second detection result. The background target confidence is used to characterize the probability value that the image in the background detection box is background information.

[0147] The judgment unit 742 is used to judge, for each second detection result, whether the background detection box meets the network post-processing conditions based on the background target confidence and the background target region area, and to obtain the corresponding judgment result.

[0148] The second determining unit 743 is used to filter and perform frame interpolation processing on the second detection result based on the judgment result, and determine the second target detection result;

[0149] In some embodiments, the judgment unit 742 is specifically used for:

[0150] Based on the second detection result, determine the intersection-over-union ratio (IoU) between any two background detection boxes in the second detection result;

[0151] Determine whether the intersection-over-union ratio is greater than a preset intersection-over-union threshold;

[0152] When the intersection-over-union ratio is greater than the preset intersection-over-union threshold, the background detection box with the higher background target confidence of the two background detection boxes is determined to meet the network post-processing conditions;

[0153] When the intersection-over-union ratio is not greater than the preset intersection-over-union threshold, the background detection box of the two whose background target region area meets the preset area condition is determined to meet the network post-processing conditions.
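The judgment described in this list can be sketched as follows. The box format, the dictionary keys `box` and `confidence`, and the default thresholds (`iou_threshold=0.5`, `t_area=0.25`) are assumptions for illustration; the area condition is taken to be the same "greater than T_area times the total image area" test used in the method embodiment.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_area(box: Box) -> float:
    """Area of an axis-aligned box."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def boxes_meeting_conditions(det_a: Dict, det_b: Dict, image_area: float,
                             iou_threshold: float = 0.5,
                             t_area: float = 0.25) -> List[Dict]:
    """Return the detections of a pair that pass the post-processing check.

    High overlap: keep only the box with the higher background confidence.
    Low overlap: keep each box whose area meets the area condition.
    """
    if iou(det_a["box"], det_b["box"]) > iou_threshold:
        return [max((det_a, det_b), key=lambda d: d["confidence"])]
    return [d for d in (det_a, det_b)
            if box_area(d["box"]) > t_area * image_area]
```

The high-overlap branch behaves like a single step of non-maximum suppression, while the low-overlap branch falls back to the area test.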

[0154] In some embodiments, the second determining unit 743 is specifically used for:

[0155] When the judgment result is used to characterize that the background detection box meets the network post-processing conditions, the second detection result corresponding to the background detection box that meets the network post-processing conditions is taken as the second target detection result.

[0156] When the judgment result indicates that the background detection box does not meet the network post-processing conditions, the difference image to be supplemented is determined from the second detection result corresponding to the background detection box that does not meet the network post-processing conditions, and frame interpolation processing is performed to obtain the second target detection result of the difference image to be supplemented.

[0157] In some embodiments, the second determining unit 743 described above is further configured to:

[0158] The difference image in the background image dataset corresponding to the background detection box that does not meet the network post-processing conditions is used as the difference image of the frame to be supplemented.

[0159] Obtain at least three consecutive difference images from the background image dataset; the at least three consecutive difference images include the difference image to be supplemented and the difference images of the frames before and after the difference image to be supplemented, the difference images of the frames before and after the difference image include the difference image of the frame before and the difference image of the frame after; the acquisition time of the difference image of the frame before is before the acquisition time of the difference image to be supplemented, and the acquisition time of the difference image of the frame after is after the acquisition time of the difference image to be supplemented; at least one frame of the difference image of the frame before satisfies the network post-processing condition, and at least one frame of the difference image of the frame after satisfies the network post-processing condition.

[0160] Obtain the second detection result corresponding to each of the preceding and following difference images, and determine the area of the corresponding background target region based on the second detection result;

[0161] Select the second intermediate detection results in the preceding and following difference images whose background target region area meets the preset conditions, and determine the background center coordinates and the background target region area of the second intermediate detection results;

[0162] According to the acquisition time sequence, the background center coordinates and background target region area of the difference image to be supplemented are filled in based on the background center coordinates and background target region area in the second intermediate detection results, so as to obtain the second target detection result of the difference image to be supplemented.

[0163] In some embodiments, the result determination module 750 is specifically used for:

[0164] Based on the first detection result and the second target detection result, determine whether the target to be detected is located outside the area where the background information is located;

[0165] Filter out the first detection results in which the target to be detected is located outside the area where the background information is located, to obtain the filtered first detection result;

[0166] Based on the first detection result after filtering, the detection target confidence score corresponding to the detection target candidate box is determined. The detection target confidence score is used to characterize the probability value that the image within the detection target candidate box is the target to be detected.

[0167] Filter out the first detection results whose detection target confidence is less than the preset confidence threshold, to obtain the first target detection result.

[0168] In some embodiments, the transformation processing module 760 is specifically used for:

[0169] Establish a rectangular coordinate system with the center of the background as the origin and the horizontal rightward direction as the positive x-axis;

[0170] According to the rectangular coordinate system, the first target detection result is transformed based on the second target detection result to obtain the detection result of the target to be detected relative to the background information.

[0171] It is understood that the functions of each functional module of the target detection device in this embodiment can be specifically implemented according to the methods in the above method embodiments. The specific implementation process can be referred to the relevant descriptions in the above method embodiments, and will not be repeated here.

[0172] In summary, the target detection apparatus provided in this application, on the one hand, by dividing the differential image dataset into a target image dataset and a background image dataset, can extract features corresponding to each target type in the differential image dataset with finer granularity. This allows for the identification of the target and background information based on more detailed features. Furthermore, by employing detection models such as the first detection model and the second detection model, it can specifically detect targets of different categories in the differential image dataset, effectively improving the accuracy of target detection. On the other hand, by filtering and interpolating the second detection results, false detections can be filtered out, and missing frames in the differential images can be supplemented. This combines more comprehensive features to determine the second target detection result. Based on the first and second target detection results, the first target detection result is accurately obtained. Subsequently, the coordinate transformation of the first target detection result can be performed based on the second target detection result, resulting in an accurate detection result of the target relative to the background information. This also significantly improves the target detection accuracy of the method provided in this application compared to existing technologies.

[0173] Referring now to Figure 11, Figure 11 is a schematic diagram of the computer system structure of the terminal device according to an embodiment of this application.

[0174] As shown in Figure 11, the computer system 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 302 or programs loaded from storage section 308 into random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the system 300. The CPU 301, ROM 302, and RAM 303 are interconnected via a bus 304. An input / output (I / O) interface 305 is also connected to the bus 304.

[0175] The following components are connected to I / O interface 305: an input section 306 including a keyboard, mouse, etc.; an output section 307 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 308 including a hard disk, etc.; and a communication section 309 including a network interface card such as a LAN card, modem, etc. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to I / O interface 305 as needed. A removable medium 311, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on drive 310 as needed so that computer programs read from it can be installed into storage section 308 as needed.

[0176] Specifically, according to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a machine-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 309, and / or installed from removable medium 311. When the computer program is executed by central processing unit (CPU) 301, it performs the functions defined above in the system of this application.

[0177] It should be noted that the computer-readable medium shown in this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media can also be any computer-readable medium other than computer-readable storage media, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.

[0178] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0179] The units or modules described in the embodiments of this application can be implemented in software or hardware. The described units or modules can also be provided in a processor; for example, this can be described as: a processor including a dataset construction module, a segmentation processing module, a detection module, a filtering processing module, a result determination module, and a transformation processing module. The names of these units or modules do not in themselves limit the unit or module; for example, the dataset construction module can also be described as "used to acquire a video to be detected containing a dynamic background, process the video to be detected, and construct a differential image dataset."

[0180] In another aspect, this application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist independently without being assembled into the electronic device. The computer-readable storage medium stores one or more programs which, when executed by one or more processors, perform the target detection method described in this application:

[0181] A video to be detected containing a dynamic background is acquired and processed to construct a differential image dataset;

[0182] The differential image dataset is divided into a detection target image dataset containing the target to be detected and a background image dataset containing background information.

[0183] The target image dataset is processed by a trained first detection model to obtain a first detection result, and the background image dataset is processed by a trained second detection model to obtain a second detection result.

[0184] The second detection result is filtered and subjected to frame interpolation processing to obtain the second target detection result;

[0185] Based on the first detection result and the second target detection result, the first target detection result is obtained;

[0186] Based on the second target detection result, the first target detection result is subjected to coordinate transformation processing to obtain the detection result of the target to be detected relative to the background information.

[0187] In summary, the target detection method, apparatus, device, and storage medium provided in the embodiments of this application acquire a video to be detected containing a dynamic background and process the video to be detected to construct a differential image dataset. The differential image dataset is then divided to obtain a target image dataset containing the target to be detected and a background image dataset containing background information. The target image dataset is processed by a trained first detection model to obtain a first detection result, and the background image dataset is processed by a trained second detection model to obtain a second detection result. The second detection result is then filtered and subjected to frame interpolation to obtain a second target detection result. Based on the first detection result and the second target detection result, the first target detection result is obtained. Finally, based on the second target detection result, the first target detection result is subjected to coordinate transformation to obtain the detection result of the target to be detected relative to the background information. Compared with existing technologies, the technical solution in this application, on the one hand, divides the differential image dataset into a target image dataset and a background image dataset, so that the features corresponding to each target type in the differential image dataset can be extracted at a finer granularity and the target and the background information can be identified from more detailed features. Furthermore, by employing separate detection models, namely the first detection model and the second detection model, the different categories of targets in the differential image dataset are detected in a targeted manner, which effectively improves the accuracy of target detection. On the other hand, by filtering and interpolating the second detection results, false detections can be filtered out and the lost frames of the differential images can be supplemented, so that the second target detection result is determined from more comprehensive features. The first target detection result is then obtained from the first detection result and the second target detection result, and coordinate transformation processing can be performed on the first target detection result based on the second target detection result, giving an accurate detection result of the target relative to the background information. The target detection accuracy of the method provided in this application is thereby improved compared with existing technologies.

[0188] The above description is merely a preferred embodiment of this application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by interchanging the above features with (but not limited to) technical features with similar functions disclosed in this application.

Claims

1. A target detection method, characterized by comprising: acquiring a video to be detected containing a dynamic background, and performing frame extraction on the video to be detected to construct a differential image dataset; dividing the differential image dataset to obtain a target image dataset containing the target to be detected and a background image dataset containing background information; processing the target image dataset with a trained first detection model to obtain a first detection result, and processing the background image dataset with a trained second detection model to obtain a second detection result; determining, based on the background detection box in the second detection result, the background target confidence and the background target region area corresponding to the background detection box, wherein the background target confidence characterizes the probability that the image within the background detection box is background information; for each second detection result, determining, based on the background target confidence and the background target region area, whether the background detection box meets the network post-processing conditions, and obtaining a corresponding judgment result; based on the judgment result, filtering and performing frame interpolation on the second detection result to determine a second target detection result; obtaining a first target detection result based on the first detection result and the second target detection result; establishing a rectangular coordinate system with the center of the background as the origin and the horizontal rightward direction as the positive x-axis; and, according to the rectangular coordinate system, performing coordinate transformation on the first target detection result based on the second target detection result to obtain the detection result of the target to be detected relative to the background information.

2. The method of claim 1, wherein the first detection model includes at least two sub-detection models, and processing the target image dataset with the trained first detection model to obtain the first detection result comprises: obtaining the target type of the target to be detected; dividing, based on the target type, the target image dataset into at least two sub-datasets corresponding to the target type; and processing the at least two sub-datasets with the at least two sub-detection models corresponding to the target type to obtain a sub-detection result corresponding to each sub-detection model, and taking the sub-detection results corresponding to the sub-detection models as the first detection result.
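The division by target type and the dispatch to sub-detection models described above can be sketched as follows. This is only an illustration under stated assumptions: each sample is a hypothetical (target_type, image_id) pair, and each sub-detection model is stood in for by a plain callable; this application does not specify either representation, and `detect_by_type` is a hypothetical name.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical sample layout: (target_type, image_id).
Sample = Tuple[str, str]
# Hypothetical stand-in for a trained sub-detection model.
SubModel = Callable[[str], dict]


def detect_by_type(samples: Iterable[Sample],
                   sub_models: Dict[str, SubModel]) -> Dict[str, List[dict]]:
    """Split samples into per-type sub-datasets and run the matching sub-model."""
    sub_datasets: Dict[str, List[str]] = defaultdict(list)
    for target_type, image_id in samples:
        sub_datasets[target_type].append(image_id)
    # The per-type sub-detection results together form the first detection result.
    return {
        target_type: [sub_models[target_type](image_id) for image_id in image_ids]
        for target_type, image_ids in sub_datasets.items()
    }
```

A target type with no registered sub-model raises a `KeyError` in this sketch; a real system would need to decide how to handle that case.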

3. The method of claim 1, wherein determining, based on the background target confidence and the background target region area, whether the background detection box meets the network post-processing conditions, and obtaining the corresponding judgment result, comprises: determining, based on the second detection result, the intersection-over-union (IoU) between any two background detection boxes in the second detection result; determining whether the IoU is greater than a preset IoU threshold; when the IoU is greater than the preset IoU threshold, determining that, of the two background detection boxes, the background detection box with the higher background target confidence meets the network post-processing conditions; and when the IoU is not greater than the preset IoU threshold, determining that, of the two background detection boxes, the background detection box whose background target region area meets a preset area condition meets the network post-processing conditions.
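The pairwise judgment above can be sketched as follows. The sketch assumes axis-aligned boxes in pixel coordinates and interprets the preset area condition as a minimum area; the threshold values, the minimum-area interpretation, and the names `BackgroundBox`, `iou`, and `select_pair` are all assumptions for illustration and are not specified in this application.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackgroundBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    confidence: float  # background target confidence

    @property
    def area(self) -> float:
        """Background target region area of the box."""
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def iou(a: BackgroundBox, b: BackgroundBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def select_pair(a: BackgroundBox, b: BackgroundBox,
                iou_threshold: float = 0.5,   # assumed value
                min_area: float = 100.0) -> List[BackgroundBox]:  # assumed condition
    """Return the boxes of a pair that meet the post-processing conditions."""
    if iou(a, b) > iou_threshold:
        # Heavily overlapping boxes: keep only the more confident one.
        return [a if a.confidence >= b.confidence else b]
    # Otherwise keep each box whose area meets the (assumed) area condition.
    return [box for box in (a, b) if box.area >= min_area]
```

Boxes that are not returned for any pair would then be treated as not meeting the network post-processing conditions.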

4. The method of claim 1, wherein filtering and performing frame interpolation on the second detection result based on the judgment result to determine the second target detection result comprises: when the judgment result indicates that the background detection box meets the network post-processing conditions, taking the second detection result corresponding to the background detection box that meets the network post-processing conditions as the second target detection result; and when the judgment result indicates that the background detection box does not meet the network post-processing conditions, determining, based on the second detection result corresponding to the background detection box that does not meet the network post-processing conditions, the differential image to be frame-interpolated, and performing frame interpolation to obtain the second target detection result of the differential image to be frame-interpolated.

5. The method of claim 4, wherein determining the differential image to be frame-interpolated and performing frame interpolation to obtain the second target detection result of the differential image to be frame-interpolated comprises: taking the differential image in the background image dataset that corresponds to the background detection box not meeting the network post-processing conditions as the differential image to be frame-interpolated; obtaining at least three consecutive differential images from the background image dataset, wherein the at least three consecutive differential images include the differential image to be frame-interpolated and preceding and following frame differential images; the preceding and following frame differential images include a preceding frame differential image, whose acquisition time is before that of the differential image to be frame-interpolated, and a following frame differential image, whose acquisition time is after that of the differential image to be frame-interpolated; and at least one preceding frame differential image meets the network post-processing conditions and at least one following frame differential image meets the network post-processing conditions; obtaining the second detection result corresponding to each differential image among the preceding and following frame differential images, and determining the corresponding background target region area based on the second detection result; selecting, from the second detection results of the preceding and following frame differential images, those whose background target region area meets a preset condition as second intermediate detection results, and determining the background center coordinates and the background target region area of the second intermediate detection results; and, in order of acquisition time, supplementing the background center coordinates and the background target region area of the differential image to be frame-interpolated based on the background center coordinates and the background target region area in the second intermediate detection results, to obtain the second target detection result of the differential image to be frame-interpolated.
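One way to supplement the missing frame from its neighbours is sketched below. This application states only that the background center coordinates and region area are supplemented in order of acquisition time; linear interpolation between the nearest valid preceding and following frames is an assumption made here, as are the names `BackgroundResult` and `interpolate_missing`. Acquisition times are assumed to be strictly increasing.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BackgroundResult:
    time: float      # acquisition time of the differential image
    center_x: float  # background center coordinates
    center_y: float
    area: float      # background target region area


def interpolate_missing(results: Sequence[Optional[BackgroundResult]],
                        times: Sequence[float],
                        index: int) -> BackgroundResult:
    """Fill the result at `index` from the nearest valid neighbours.

    `results[i]` is None when frame i did not meet the post-processing
    conditions. Linear interpolation over time is an assumed choice.
    """
    prev = next((r for r in reversed(results[:index]) if r is not None), None)
    nxt = next((r for r in results[index + 1:] if r is not None), None)
    if prev is None or nxt is None:
        raise ValueError("need a valid result both before and after the frame")
    # Interpolation weight in [0, 1]; assumes nxt.time > prev.time.
    w = (times[index] - prev.time) / (nxt.time - prev.time)

    def lerp(p: float, n: float) -> float:
        return p + w * (n - p)

    return BackgroundResult(
        time=times[index],
        center_x=lerp(prev.center_x, nxt.center_x),
        center_y=lerp(prev.center_y, nxt.center_y),
        area=lerp(prev.area, nxt.area),
    )
```

The requirement that at least one preceding and one following frame meet the post-processing conditions corresponds to the `ValueError` check in the sketch.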

6. The method of claim 1, wherein obtaining the first target detection result based on the first detection result and the second target detection result comprises: determining, based on the first detection result and the second target detection result, whether the target to be detected is located outside the area where the background information is located; filtering out the first detection results in which the target to be detected is located outside the area where the background information is located, to obtain filtered first detection results; determining, based on the filtered first detection results, the detection target confidence corresponding to the detection target candidate box, wherein the detection target confidence characterizes the probability that the image within the detection target candidate box is the target to be detected; and filtering out the first detection results whose detection target confidence is less than a preset confidence threshold, to obtain the first target detection result.
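The two filtering stages above can be sketched as follows. The sketch assumes that a target counts as located inside the background area when the center of its candidate box falls within the background box; this application does not define the test, so that rule, the threshold value, and the names `Detection`, `center_inside`, and `filter_first_results` are assumptions.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    confidence: float


def center_inside(target: Detection, background: Detection) -> bool:
    """Assumed test: the target's box center lies within the background box."""
    cx = (target.x_min + target.x_max) / 2.0
    cy = (target.y_min + target.y_max) / 2.0
    return (background.x_min <= cx <= background.x_max
            and background.y_min <= cy <= background.y_max)


def filter_first_results(candidates: List[Detection],
                         background: Detection,
                         conf_threshold: float = 0.5) -> List[Detection]:
    """Drop targets outside the background area, then low-confidence targets."""
    inside = [d for d in candidates if center_inside(d, background)]
    return [d for d in inside if d.confidence >= conf_threshold]
```

Only candidates with confidence strictly less than the threshold are removed, matching the wording of the claim.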

7. A target detection apparatus, characterized by comprising: a dataset construction module, used to acquire a video to be detected containing a dynamic background and perform frame extraction on the video to be detected to construct a differential image dataset; a segmentation processing module, used to divide the differential image dataset to obtain a target image dataset containing the target to be detected and a background image dataset containing background information; a detection module, used to perform target detection on the target image dataset through a trained first detection model to obtain a first detection result, and to perform background detection on the background image dataset through a trained second detection model to obtain a second detection result; a filtering processing module, used to determine, based on the background detection box in the second detection result, the background target confidence and the background target region area corresponding to the background detection box, wherein the background target confidence characterizes the probability that the image within the background detection box is background information; to determine, for each second detection result, whether the background detection box meets the network post-processing conditions according to the background target confidence and the background target region area, and obtain a corresponding judgment result; and, based on the judgment result, to filter and perform frame interpolation on the second detection result to determine a second target detection result; a result determination module, used to obtain a first target detection result based on the first detection result and the second target detection result; and a transformation processing module, used to establish a rectangular coordinate system with the center coordinates of the background as the origin and the horizontal rightward direction as the positive x-axis, and, according to the rectangular coordinate system, to perform coordinate transformation on the first target detection result based on the second target detection result to obtain the detection result of the target to be detected relative to the background information.

8. A computer device, characterized in that the computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor is configured to implement the target detection method as described in any one of claims 1-6 when executing the program.

9. A computer-readable storage medium, characterized in that it stores a computer program for implementing the target detection method as described in any one of claims 1-6.