Target detection method, device, processor and storage medium

By decoupling object detection into parallel operation of the NPU detection pipeline and the CPU motion prediction pipeline, the real-time problem of high-definition video detection caused by insufficient NPU computing power is solved, and high frame rate real-time processing and spatiotemporal consistency of detection results are achieved.

CN122244772A · Pending · Publication Date: 2026-06-19 · PHYTIUM TECH CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
PHYTIUM TECH CO LTD
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The existing NPU computing power is insufficient to meet the real-time requirements of frame-by-frame detection of high-definition video, and the synchronous pipeline design leads to a decrease in overall processing power, making it difficult to achieve the requirements of high frame rate real-time processing.

Method used

Object detection is decoupled into an NPU detection pipeline (processing keyframes) and a motion prediction pipeline (processing non-keyframes). The two pipelines work in parallel. The NPU performs object detection on keyframes, and the CPU performs motion prediction on non-keyframes to generate target video frames.

Benefits of technology

It achieves high frame rate real-time processing while ensuring detection accuracy, solves the continuity problem of frame skipping detection, ensures the spatiotemporal consistency of detection results, and improves the utilization of hardware resources.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a target detection method, apparatus, processor, and storage medium. The method is applied to a CPU and includes: sequentially performing target detection on each first-type video frame using an NPU to obtain target detection results for the target object in each first-type video frame; acquiring motion features of the target object based on the first-type video frames preceding each second-type video frame; sequentially performing target prediction on each second-type video frame based on the motion features of the target object to obtain target prediction results for the target object in each second-type video frame; and generating target video frames for the target object based on the target detection results and target prediction results. Target detection is decoupled into an NPU detection pipeline and a CPU motion prediction pipeline that run in parallel, which maintains detection accuracy while achieving high frame rate real-time processing. For skipped video frames, target prediction is performed based on historical video frames, which solves the continuity problem of skipped-frame detection and ensures spatiotemporal consistency of the detection results.
Description

Technical Field

[0001] This application relates to the field of computer technology, and more specifically, to a target detection method, apparatus, processor, and storage medium.

Background Technology

[0002] With the rapid development of deep learning technology, neural network-based target detection algorithms (such as the YOLO series) have made significant progress in both accuracy and speed.

[0003] In related technologies, some processors integrate a Neural Processing Unit (NPU). NPUs often adopt a synchronous pipeline design, that is, "acquisition → detection → display" is executed serially to perform target detection.

[0004] However, the NPU's computing power is insufficient to meet the real-time requirements of frame-by-frame detection in high-definition video, and the aforementioned synchronous pipeline design leads to an overall decrease in processing power, making it difficult to achieve the requirements of high frame rate real-time processing.

Summary of the Invention

[0005] In view of this, embodiments of this application provide a target detection method, apparatus, processor, and storage medium to solve the problems that the computing power of existing NPUs is insufficient to meet the real-time requirements of frame-by-frame detection of high-definition videos, and that the synchronous pipeline design leads to a decrease in overall processing power, making it difficult to achieve the requirements of high frame rate real-time processing.

[0006] In a first aspect, embodiments of this application provide a target detection method applied to a central processing unit (CPU) in a processor, wherein the processor further includes a neural processing unit (NPU), and the CPU is communicatively connected to the NPU. The method includes: determining, based on a preset frame skipping ratio, multiple first-type video frames and multiple second-type video frames from the original video frames of the target object; performing target detection on the first type of video frame using the NPU to obtain the target detection result of the target object in the first type of video frame; obtaining the motion features of the target object based on the first type of video frames preceding the second type of video frames; performing target prediction on the second type of video frame based on the motion features of the target object to obtain the target prediction result of the target object in the second type of video frame; and generating a target video frame of the target object based on the target detection results and the target prediction results.

[0007] In an optional implementation, obtaining the motion features of the target object based on the first type of video frames preceding the second type of video frames includes: Determine multiple target video frames preceding the second type of video frames from multiple first type of video frames; If the number of multiple target video frames is greater than or equal to a preset number threshold, the motion characteristics of the target object are obtained based on the motion trajectory of the target object in the multiple target video frames.

[0008] In an optional implementation, the method further includes: If the number of multiple target video frames is less than the preset number threshold, then the second type of video frame is adjusted to the first type of video frame.

[0009] In an optional implementation, the step of predicting the target object in the second type of video frame based on the motion characteristics of the target object, to obtain the target prediction result of the target object in the second type of video frame, includes: If the motion characteristics of the target object satisfy the first parameter condition, then based on the motion characteristics of the target object, a linear prediction model is used to predict the target in the second type of video frame to obtain the target prediction result. If the motion characteristics of the target object satisfy the second parameter condition, then based on the motion characteristics of the target object, a quadratic polynomial prediction model is used to predict the target in the second type of video frame, and the target prediction result is obtained.

[0010] In an optional implementation, the method further includes: if the motion characteristics of the target object satisfy neither the first parameter condition nor the second parameter condition, adjusting the second type of video frame to the first type of video frame.

[0011] In an optional implementation, the method further includes: Place multiple video frames of the first type into the target queue; The step of performing target detection on the first type of video frames using the NPU to obtain the target detection result of the target object in the first type of video frames includes: The NPU retrieves the first type of video frame from the target queue and performs target detection on the retrieved first type of video frame to obtain the target detection result.

[0012] In an optional implementation, the method further includes: The queue length of the target queue is obtained based on the number of the first type of video frames in the target queue. If the queue length is greater than the first length threshold, the preset frame skipping ratio is increased to obtain the first frame skipping ratio; Based on the first frame skipping ratio, a first video frame is determined from the first type of video frames in the target queue and adjusted to the second type of video frame.

[0013] In an optional implementation, the method further includes: If the queue length is less than the second length threshold, then the preset frame skipping ratio is reduced to obtain the second frame skipping ratio; Based on the second frame skipping ratio, a second video frame is determined from the second type of video frames to be predicted and adjusted to the first type of video frame.

[0014] In an optional implementation, the method further includes: Obtain the average detection speed of the NPU for the detected first type of video frames; Based on the average detection speed, the first length threshold and the second length threshold are obtained.
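The queue-driven adjustment described in paragraphs [0012] to [0014] can be illustrated with a minimal Python sketch. The text does not say how the two length thresholds are derived from the average detection speed, so the formula in `length_thresholds`, the delay budget `max_delay_ms`, and the clamping bounds `min_skip` and `max_skip` are all assumptions made for illustration. The sketch only updates the ratio; re-labelling frames that are already queued is not shown.

```python
def length_thresholds(avg_detect_ms, max_delay_ms=200):
    """Derive (first, second) queue-length thresholds from NPU speed.

    Assumption (not from the source): the first (upper) threshold is the
    number of frames the NPU can clear within max_delay_ms, and the second
    (lower) threshold is a quarter of that value.
    """
    first = max(2, int(max_delay_ms // avg_detect_ms))
    second = max(1, first // 4)
    return first, second


def adjust_skip(skipped_per_key, queue_length, first, second,
                min_skip=1, max_skip=5):
    """Return the updated number of skipped frames per key frame (1:k)."""
    if queue_length > first:
        # NPU is falling behind: skip more frames per key frame.
        return min(skipped_per_key + 1, max_skip)
    if queue_length < second:
        # NPU has spare capacity: detect more frames directly.
        return max(skipped_per_key - 1, min_skip)
    return skipped_per_key


first, second = length_thresholds(avg_detect_ms=50)
print(first, second)                         # 4 1
print(adjust_skip(2, 6, first, second))      # 3
print(adjust_skip(2, 0, first, second))      # 1
```

Clamping the ratio keeps a long backlog from pushing the pipeline into predicting almost every frame; whether the application bounds the ratio this way is not stated.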

[0015] In an optional implementation, the method further includes: at preset intervals, performing target detection on the second type of video frame to be predicted using the NPU to obtain the detection result of the target object in that frame; obtaining a prediction confidence based on that detection result and the target prediction result of the target object in the same frame; and, if the prediction confidence is less than a preset confidence threshold, adjusting the second type of video frame that follows the frame to be predicted to the first type of video frame.
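As a hedged illustration of the periodic check in paragraph [0015], the sketch below scores a prediction against an NPU detection of the same frame. The text does not define how the prediction confidence is computed; using intersection over union (IoU) of the two boxes, the `(x, y, w, h)` box layout, and the default threshold of 0.5 are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def should_promote_next(detected_box, predicted_box, threshold=0.5):
    """True if the next non-key frame should be sent to the NPU instead.

    Assumption: IoU between detection and prediction stands in for the
    'prediction confidence' named in the text.
    """
    return iou(detected_box, predicted_box) < threshold


print(should_promote_next((0, 0, 10, 10), (0, 0, 10, 10)))  # False
print(should_promote_next((0, 0, 10, 10), (5, 0, 10, 10)))  # True
```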

[0016] In an optional implementation, generating a target video frame of the target object based on the target detection result and the target prediction result includes: Based on the target detection results, the first type of video frames are fused to obtain a detected fused video frame; Based on the target prediction result, the second type of video frames are fused to obtain the predicted fused video frames; The detected fused video frame and the predicted fused video frame are used as the target video frame.

[0017] In a second aspect, embodiments of this application also provide a target detection device, applied to a central processing unit (CPU) in a processor, wherein the processor further includes a neural processing unit (NPU), and the CPU is communicatively connected to the NPU. The device includes: a determination module, used to determine multiple first-type video frames and multiple second-type video frames from the original video frames of the target object according to a preset frame skipping ratio; a processing module, used to perform target detection on the first type of video frame through the NPU to obtain the target detection result of the target object in the first type of video frame; an acquisition module, used to acquire the motion features of the target object based on the first type of video frames preceding the second type of video frames; the processing module being further configured to perform target prediction on the second type of video frame based on the motion characteristics of the target object to obtain the target prediction result of the target object in the second type of video frame; and a generation module, used to generate target video frames of the target object based on the target detection results and the target prediction results.

[0018] In a third aspect, embodiments of this application also provide a processor, including: a central processing unit (CPU), a neural processing unit (NPU), and a storage unit. The CPU is communicatively connected to the NPU, and the CPU, the NPU, and the storage unit are also communicatively connected. The storage unit stores machine-readable instructions executable by the CPU. When the processor is running, the CPU executes the machine-readable instructions to perform the method described in any implementation of the first aspect.

[0019] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing a computer program which, when executed by a central processing unit (CPU), performs the method described in any implementation of the first aspect.

[0020] This application provides a target detection method, apparatus, processor, and storage medium. The method is applied to a CPU and includes: sequentially performing target detection on each first-type video frame using an NPU to obtain target detection results for the target object in each first-type video frame; acquiring motion features of the target object based on the first-type video frames preceding each second-type video frame; sequentially performing target prediction on each second-type video frame based on the motion features of the target object to obtain target prediction results for the target object in each second-type video frame; and generating target video frames for the target object based on the target detection results and target prediction results. Target detection is decoupled into an NPU detection pipeline and a CPU motion prediction pipeline that run in parallel, which maintains detection accuracy while achieving high frame rate real-time processing. For skipped video frames, target prediction is performed based on historical video frames, which solves the continuity problem of skipped-frame detection and ensures spatiotemporal consistency of the detection results.

Description of the Drawings

[0021] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0022] Figures 1 to 7 are flowcharts of the target detection method provided in the embodiments of this application; Figure 8 is a schematic diagram of the overall system architecture provided in the embodiments of this application; Figure 9 is a schematic diagram of the target detection device provided in the embodiments of this application; Figure 10 is a schematic diagram of the processor provided in an embodiment of this application.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0024] Currently, the computing power of NPUs is insufficient to meet the real-time requirements of frame-by-frame detection in high-definition videos, and the synchronous pipeline design leads to a decrease in overall processing power, making it difficult to achieve the requirements of high frame rate real-time processing. The use of fixed frame skipping detection results in the loss of key frame information, discontinuous detection, and loss of target tracking, which seriously affects the detection effect. In the synchronous architecture, video acquisition, preprocessing, NPU inference, postprocessing and other stages cannot be executed in parallel, resulting in low utilization of hardware resources (CPU, NPU, memory).

[0025] Based on this, this application decouples target detection into an NPU detection pipeline (processing key frames) and a motion prediction pipeline (processing non-key frames). The two pipelines work in parallel, achieving high frame rate real-time processing while ensuring detection accuracy, thus improving hardware resource utilization. For skipped non-key frames, target prediction is performed based on historical video frames, solving the continuity problem of skipped-frame detection and ensuring the spatiotemporal consistency of detection results.

[0026] Figure 1 is a flowchart of the target detection method provided in the embodiments of this application. In this embodiment, the execution entity can be the CPU in the processor; the processor also includes an NPU, and the CPU is communicatively connected to the NPU.

[0027] The processor can be a system-on-chip (SoC) that integrates a CPU and an NPU. The CPU is responsible for handling general computing tasks, such as running operating systems, office software, multitasking, and logical operations, while the NPU provides computing power for artificial intelligence (AI) applications, such as voice interaction, image processing, and intelligent understanding.

[0028] As shown in Figure 1, the method may include: S101. Based on the preset frame skipping ratio, determine multiple first-type video frames and multiple second-type video frames from the original video frames of the target object.

[0029] The original video frame of the target object refers to the video frame containing the target object captured by the camera. The frame rate of the original video frame can be, for example, 30fps.

[0030] The preset frame skipping ratio indicates the ratio between the number of first-type video frames and the number of second-type video frames in the original video frames. The first-type video frames are key frames, and the second-type video frames are non-key frames.

[0031] For example, the preset frame skipping ratio is 1:2, which means that for every 1 keyframe (first type video frame) processed, 2 frames (second type video frames) are skipped.

[0032] For example, the preset frame skipping ratio is 1:3, which means that for every 1 keyframe (first type video frame) processed, 3 frames (second type video frames) are skipped.

[0033] Based on the preset frame skipping ratio, multiple first-type video frames are selected from the original video frames of the target object, and the remaining original video frames are used as multiple second-type video frames.
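Step S101 can be sketched in a few lines of Python. This is an illustrative reading of the 1:2 and 1:3 examples above, not the applicant's implementation: the function name `split_frames` and the zero-based frame indices are assumptions, and the sketch assumes a fixed ratio with the first frame treated as a key frame.

```python
def split_frames(num_frames, skipped_per_key):
    """Partition frame indices 0..num_frames-1 into key and non-key frames.

    A ratio of 1:k (k = skipped_per_key) means one key frame followed by
    k non-key frames. Assumption: frame 0 is always a key frame.
    """
    if skipped_per_key < 0:
        raise ValueError("skipped_per_key must be non-negative")
    period = skipped_per_key + 1
    key_frames, non_key_frames = [], []
    for index in range(num_frames):
        if index % period == 0:
            key_frames.append(index)      # first-type frame, sent to the NPU
        else:
            non_key_frames.append(index)  # second-type frame, predicted on the CPU
    return key_frames, non_key_frames


key, non_key = split_frames(7, 2)  # ratio 1:2
print(key)      # [0, 3, 6]
print(non_key)  # [1, 2, 4, 5]
```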

[0034] S102. Perform target detection on the first type of video frame using NPU to obtain the target detection result of the target object in the first type of video frame.

[0035] The NPU loads and runs a pre-deployed target detection model, such as a YOLO detection network, and uses it to perform target detection on each first-type video frame. This identifies the position of the target object in each first-type video frame and generates the target detection results while preserving the temporal order of the first-type video frame sequence.

[0036] The target detection result is used to indicate the position of the target object in the first type of video frame. This position can be represented in the form of a detection box, with the center point of the detection box serving as the position of the target object in the first type of video frame.

[0037] In an optional implementation, an object detection network is used to perform object detection on each first type of video frame sequentially according to frame time sequence.

[0038] In an optional implementation, before performing target detection on the first type of video frame, the first type of video frame may be preprocessed, including but not limited to size scaling, normalization, etc., and converted into the input format required by the NPU.
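A minimal sketch of the scaling and normalization mentioned above, in plain Python. The actual input format expected by the NPU is not given in the text, so the nearest-neighbour resize, the single-channel nested-list image layout, and the [0, 1] value range are assumptions chosen only to keep the example self-contained.

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list of pixel values."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image, max_value=255.0):
    """Scale pixel values from [0, max_value] to [0.0, 1.0]."""
    return [[pixel / max_value for pixel in row] for row in image]


def preprocess(image, out_h, out_w):
    """Resize, then normalize (assumed order; the text lists both steps)."""
    return normalize(resize_nearest(image, out_h, out_w))


frame = [[0, 255], [255, 0]]
print(preprocess(frame, 4, 4)[0])  # [0.0, 0.0, 1.0, 1.0]
```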

[0039] S103. Obtain the motion features of the target object based on the first type of video frames preceding the second type of video frames.

[0040] According to the frame sequence, N first-type video frames preceding the second-type video frame are determined from the multiple first-type video frames, where N is a preset number. N can be, for example, 3 to 5, and this embodiment does not impose any special limitation on it.

[0041] It should be noted that the N first-type video frames can be the N video frames that precede the second-type video frames and are closest to the second-type video frames.

[0042] The target object's position is obtained within N first-type video frames. Based on this position, the target object's motion trajectory within the corresponding sliding window of the N first-type video frames is calculated. The motion characteristics of the target object are then obtained based on this trajectory. These characteristics may include, for example, average motion speed, acceleration standard deviation, and trajectory curvature. The average motion speed measures the speed of the target object's displacement within the sliding window, with units of pixels per frame. The acceleration standard deviation measures the smoothness of motion; a smaller value indicates more uniform motion, while a larger value indicates more drastic speed fluctuations. The trajectory curvature reflects the degree of directional change; a larger curvature indicates a more curved path, and a curvature close to 0 indicates approximately linear motion.
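The three features can be computed from key-frame center positions as in the following sketch. The text names the features but gives no formulas, so the definitions here are assumptions: velocity is displacement per frame between consecutive key frames, acceleration is the magnitude of the change in velocity between consecutive segments, and curvature is the mean absolute turning angle (in radians) between consecutive velocity vectors.

```python
import math
from statistics import mean, pstdev


def motion_features(track):
    """track: list of (frame_index, x, y) for key frames, oldest first."""
    if len(track) < 3:
        raise ValueError("need at least three positions")
    velocities = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        velocities.append(((x1 - x0) / dt, (y1 - y0) / dt))  # pixels per frame
    speeds = [math.hypot(vx, vy) for vx, vy in velocities]
    # Assumed definition: change in velocity between consecutive segments.
    accelerations = [
        math.hypot(bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(velocities, velocities[1:])
    ]
    # Assumed definition: absolute turning angle between velocity vectors.
    turns = [
        abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
        for (ax, ay), (bx, by) in zip(velocities, velocities[1:])
    ]
    return {
        "avg_speed": mean(speeds),
        "accel_std": pstdev(accelerations),
        "curvature": mean(turns),
    }


straight = [(0, 0, 0), (3, 6, 0), (6, 12, 0), (9, 18, 0)]
print(motion_features(straight))
# {'avg_speed': 2.0, 'accel_std': 0.0, 'curvature': 0.0}
```

With only three positions there is a single acceleration sample, so the standard deviation is 0.0 by construction; a window of four or five key frames gives a more meaningful smoothness estimate.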

[0043] S104. Based on the motion characteristics of the target object, perform target prediction on the second type of video frames to obtain the target prediction results of the target object in each second type of video frame.

[0044] Based on the target detection results obtained in the first type of video frames, and combined with the motion features of the target object, target prediction is performed on the second type of video frames located between adjacent first type video frames to obtain the position of the target object in the second type of video frames and generate the target prediction result.

[0045] The target prediction result is used to indicate the position of the target object in the second type of video frame. This position can be represented in the form of a prediction box, with the center point of the prediction box serving as the position of the target object in the second type of video frame.

[0046] S105. Based on the target detection results and target prediction results, generate the target video frame of the target object.

[0047] Based on the target detection results, the positions of the target objects in the first type of video frames are marked, and based on the target prediction results, the positions of the target objects in the second type of video frames are marked, thereby generating target video frames of the target objects. The target video frames include the first type of video frames and the second type of video frames marked with the positions of the target objects.
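Step S105 amounts to merging the two result streams back into frame order. The sketch below assumes results are kept in dictionaries keyed by frame index and that a box is an `(x, y, w, h)` tuple; both are illustrative choices, and drawing the box onto the image is left out.

```python
def build_output(num_frames, detections, predictions):
    """Merge per-frame boxes into one ordered list of annotated frames.

    detections / predictions: dict mapping frame index -> box (x, y, w, h).
    A detection takes priority if a frame appears in both dictionaries.
    """
    output = []
    for index in range(num_frames):
        if index in detections:
            entry = {"frame": index, "box": detections[index], "source": "detected"}
        elif index in predictions:
            entry = {"frame": index, "box": predictions[index], "source": "predicted"}
        else:
            entry = {"frame": index, "box": None, "source": "none"}
        output.append(entry)
    return output


frames = build_output(3, {0: (1, 1, 2, 2)}, {1: (2, 1, 2, 2)})
print([f["source"] for f in frames])  # ['detected', 'predicted', 'none']
```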

[0048] In an optional implementation, the target video frame is output at the frame rate of the original video frame.

[0049] In this embodiment, target detection is decoupled into an NPU detection pipeline (processing key frames) and a motion prediction pipeline (processing non-key frames). The two pipelines work in parallel, achieving high frame rate real-time processing while ensuring detection accuracy. Furthermore, for skipped non-key frames, target prediction is performed based on historical video frames, solving the continuity problem of traditional frame skipping detection and ensuring the spatiotemporal consistency of detection results.

[0050] Figure 2 is a flowchart of the target detection method provided in the embodiments of this application. As shown in Figure 2, in an optional implementation, step S103 above, which involves obtaining the motion features of the target object based on the first type of video frames preceding each second type of video frame, may include: S201. Determine multiple target video frames preceding the second type video frames from multiple first type video frames.

[0051] The multiple target video frames are the first-type video frames that precede the second-type video frame and are nearest to it. For example, with a preset frame skipping ratio of 1:2, the first frame is a key frame, the second and third frames are non-key frames, the fourth frame is a key frame, and the fifth and sixth frames are non-key frames. The nearest first-type video frames before the sixth frame are then the fourth frame and the first frame, and the nearest first-type video frame before the second frame is the first frame.

[0052] S202. If the number of multiple target video frames is greater than or equal to a preset number threshold, obtain the motion characteristics of the target object based on the motion trajectory of the target object in the multiple target video frames.

[0053] The preset quantity threshold can be, for example, 3 or 4, etc., and this embodiment does not impose any special limitation on it.

[0054] If the number of multiple target video frames is greater than or equal to a preset threshold, it means that there is a sufficient number to support the calculation of the motion features of the target object within the sliding window corresponding to the multiple target video frames. Then, the prediction process of the target object is entered. That is, based on the position of the target object in N first-type video frames, the motion trajectory of the target object in the multiple target video frames is calculated, and then the motion features of the target object are calculated based on the motion trajectory.

[0055] In an alternative implementation, the method may further include: If the number of multiple target video frames is less than a preset threshold, the second type of video frames will be adjusted to the first type of video frames.

[0056] If the number of multiple target video frames is less than the preset threshold, it means that there are not enough frames to support the calculation of the motion features of the target object within the sliding window corresponding to the multiple target video frames. In this case, each second type of video frame is adjusted to a first type of video frame, and the target is detected by the NPU.

[0057] In this embodiment, when the number of multiple target video frames is less than a preset threshold, motion prediction cannot be performed. In this case, the current second-type video frame is forcibly marked as a first-type video frame and sent to the NPU for target detection, thereby improving the target detection accuracy and achieving high frame rate and high continuity target detection under limited NPU computing power.
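Steps S201 and S202, together with the fallback for insufficient history, can be sketched as follows. The function names, the `window` parameter (the N of paragraph [0040]), and the string return values are hypothetical; the usage lines reproduce the 1:2 example from paragraph [0051] with one-based frame numbers.

```python
def nearest_key_frames(key_frames, frame_index, window):
    """Return up to `window` key frames before frame_index, nearest first."""
    earlier = [k for k in key_frames if k < frame_index]
    return sorted(earlier, reverse=True)[:window]


def plan_frame(key_frames, frame_index, window=3, min_history=3):
    """Decide whether a non-key frame can be predicted or must be detected."""
    history = nearest_key_frames(key_frames, frame_index, window)
    if len(history) >= min_history:
        return "predict", history
    # Not enough history: promote the frame to a key frame for the NPU.
    return "detect", history


print(nearest_key_frames([1, 4], 6, 3))   # [4, 1]
print(nearest_key_frames([1, 4], 2, 3))   # [1]
print(plan_frame([1, 4], 6))              # ('detect', [4, 1])
print(plan_frame([1, 4, 7, 10], 12))      # ('predict', [10, 7, 4])
```

A caller that receives "detect" would also add the frame to its key-frame list, so that the history grows until prediction becomes possible.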

[0058] Figure 3 is a flowchart of the target detection method provided in the embodiments of this application. As shown in Figure 3, in an optional implementation, step S104, which involves predicting the target in the second type of video frame based on the motion characteristics of the target object to obtain the target prediction result of the target object in the second type of video frame, may include: S301. If the motion characteristics of the target object satisfy the first parameter condition, then based on the motion characteristics of the target object, target prediction is performed on the second type of video frame through a linear prediction model to obtain the target prediction result.

[0059] The first parameter condition refers to the linear motion parameter condition. If the motion characteristics of the target object meet the first parameter condition, it means that the target object exhibits approximately uniform linear motion. Then, a linear prediction model is used to predict the target in the second type of video frame based on the motion characteristics of the target object, and the target prediction result is obtained. This can significantly suppress prediction drift caused by high-frequency noise or brief occlusion.

[0060] For example, if the average motion speed of the target object is <8 pixels / frame, the standard deviation of acceleration is <2.0, and the trajectory curvature is <0.3, these characteristics respectively indicate that the target object moves relatively slowly, accelerates and decelerates relatively gently, and the path is close to a straight line. In this case, the target object exhibits approximately uniform linear motion, and the target prediction result can significantly suppress prediction drift caused by high-frequency noise or brief occlusion.

[0061] S302. If the motion characteristics of the target object satisfy the second parameter condition, then based on the motion characteristics of the target object, target prediction is performed on the second type of video frame through a quadratic polynomial prediction model to obtain the target prediction result.

[0062] The second parameter condition refers to the uniform acceleration motion parameter condition. If the motion characteristics of the target object satisfy the second parameter condition, it means that the target object exhibits approximately uniform acceleration motion. Then, a quadratic polynomial prediction model is used to predict the target in the second type of video frame based on the motion characteristics of the target object, and the target prediction result is obtained.

[0063] For example, if the average motion speed is between 8 and 25 pixels per frame, the acceleration standard deviation is <5.0, and the trajectory curvature is <0.6, these respectively indicate that the target object is in medium-to-high speed motion, has a certain degree of speed variation, and the path has slight curvature. In this case, the target object exhibits approximately uniformly accelerated motion.

[0064] It should be noted that the specific process of target prediction using the linear prediction model and the quadratic polynomial prediction model can be found in the existing relevant descriptions. The core of this embodiment lies in the adaptive selection of the prediction algorithm for target prediction, which solves the continuity problem of traditional frame skipping detection, ensures the spatiotemporal consistency of the detection results, and realizes intelligent motion prediction compensation.
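Since the text defers the details of both models to existing descriptions, the following is only one common way to realize them, given as a sketch. It assumes the history holds key-frame center positions as `(frame_index, x, y)` tuples: the linear model extrapolates at constant velocity from the two most recent samples, and the quadratic model evaluates the parabola through the three most recent samples (Lagrange form). A least-squares fit over a longer window would be an equally valid choice.

```python
def predict_linear(history, t):
    """Constant-velocity extrapolation from the two most recent samples.

    history: list of (frame_index, x, y), oldest first.
    """
    if len(history) < 2:
        raise ValueError("linear prediction needs at least two samples")
    (t0, x0, y0), (t1, x1, y1) = history[-2:]
    scale = (t - t1) / (t1 - t0)
    return (x1 + (x1 - x0) * scale, y1 + (y1 - y0) * scale)


def predict_quadratic(history, t):
    """Evaluate the parabola through the three most recent samples at t."""
    if len(history) < 3:
        raise ValueError("quadratic prediction needs at least three samples")
    points = history[-3:]
    x = y = 0.0
    for i, (ti, xi, yi) in enumerate(points):
        weight = 1.0
        for j, (tj, _, _) in enumerate(points):
            if i != j:
                weight *= (t - tj) / (ti - tj)  # Lagrange basis polynomial
        x += xi * weight
        y += yi * weight
    return (x, y)


# Key frames 1 and 4, predicting non-key frame 6: roughly (20.0, 15.0).
print(predict_linear([(1, 10.0, 10.0), (4, 16.0, 13.0)], 6))
# x follows t**2 and y follows 2*t + 1, so frame 8 is roughly (64.0, 17.0).
print(predict_quadratic([(1, 1.0, 3.0), (4, 16.0, 9.0), (7, 49.0, 15.0)], 8))
```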

[0065] In an alternative implementation, the method may further include: If the motion characteristics of the target object do not meet the first parameter condition and the second parameter condition, then the second type of video frame will be adjusted to the first type of video frame.

[0066] If the motion characteristics of the target object do not meet the first parameter condition and the second parameter condition, it means that the target object does not exhibit approximately uniform linear motion or approximately uniformly accelerated motion. In this case, the second type of video frame can be adjusted to the first type of video frame, and the target can be detected by the NPU, thereby improving the accuracy of target detection.
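The adaptive selection described in paragraphs [0061] to [0066] can be sketched as a three-way decision. The second-condition thresholds below are the example values from paragraph [0063], with the speed bounds assumed to be inclusive. The names `Action` and `choose_action` are illustrative, and `placeholder_first_condition` uses made-up thresholds purely so the example runs; the actual first parameter condition is defined elsewhere in the description.

```python
from enum import Enum
from typing import Callable, Dict

Features = Dict[str, float]
Condition = Callable[[Features], bool]


class Action(Enum):
    LINEAR = "linear prediction model"
    QUADRATIC = "quadratic polynomial prediction model"
    PROMOTE = "adjust to first-type frame for NPU detection"


def second_condition(f: Features) -> bool:
    """Example thresholds from paragraph [0063] (approximately uniform acceleration)."""
    return (8.0 <= f["avg_speed"] <= 25.0
            and f["accel_std"] < 5.0
            and f["curvature"] < 0.6)


def placeholder_first_condition(f: Features) -> bool:
    """Hypothetical stand-in; these thresholds are NOT taken from the description."""
    return f["avg_speed"] < 8.0 and f["accel_std"] < 1.0


def choose_action(f: Features,
                  first: Condition,
                  second: Condition = second_condition) -> Action:
    if first(f):
        return Action.LINEAR        # approximately uniform linear motion
    if second(f):
        return Action.QUADRATIC     # approximately uniform acceleration
    return Action.PROMOTE           # neither model fits: let the NPU detect
```

Passing the conditions in as callables keeps the fallback to NPU detection independent of whichever thresholds a deployment actually uses.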

[0067] Figure 4 is a flowchart illustrating the target detection method provided in the embodiments of this application. As shown in Figure 4, in an optional implementation, the method may further include: S401. Place multiple first-type video frames into the target queue.

[0068] The target queue is an NPU detection queue used to cache multiple first-type video frames.

[0069] Step S102 above, which involves performing target detection on the first type of video frames using the NPU to obtain the target detection result of the target object in the first type of video frames, may include: S402. Using the NPU, retrieve the first type of video frame from the target queue, and perform target detection on the retrieved first type of video frame to obtain the target detection result.

[0070] When the NPU performs target detection, it sequentially retrieves first-type video frames from the target queue and performs target detection on each retrieved frame to obtain the target detection results.

[0071] Figure 5 is a flowchart illustrating the target detection method provided in the embodiments of this application. As shown in Figure 5, in an optional implementation, the method may further include: S501. Obtain the queue length of the target queue based on the number of first-type video frames in the target queue.

[0072] During the NPU's target detection process, the CPU can also obtain the number of first-type video frames in the target queue and, based on the number of first-type video frames in the target queue, obtain the queue length L of the target queue.

[0073] The queue length L can be calculated using a preset formula based on the number of first-type video frames. For example, the queue length L is equal to the number of first-type video frames in the target queue. It indicates the number of first-type video frames currently waiting for NPU processing, that is, the first-type video frames on which target detection has not yet been performed. First-type video frames that have completed target detection are removed from the target queue.
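The target queue and its length L can be illustrated with a small first-in, first-out structure. This is a single-threaded sketch under stated assumptions: the class name `TargetQueue` and the use of integer frame identifiers are hypothetical, and a real implementation shared between the CPU and the NPU driver would need a thread-safe queue.

```python
from collections import deque
from typing import Deque, Optional


class TargetQueue:
    """FIFO cache of first-type frames waiting for NPU detection (illustrative)."""

    def __init__(self) -> None:
        self._frames: Deque[int] = deque()

    def put(self, frame_id: int) -> None:
        """S401: enqueue a first-type video frame."""
        self._frames.append(frame_id)

    def get(self) -> Optional[int]:
        """S402: hand the oldest waiting frame to the NPU, removing it from the queue."""
        return self._frames.popleft() if self._frames else None

    def length(self) -> int:
        """S501: queue length L = number of frames still waiting for detection."""
        return len(self._frames)
```

Because a frame leaves the queue as soon as it is retrieved for detection, `length()` counts only the backlog, which is the quantity the later threshold comparison relies on.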

[0074] S502. If the queue length is greater than the first length threshold, the preset frame skipping ratio is increased to obtain the first frame skipping ratio.

[0075] If the queue length L is greater than the first length threshold, it indicates that too many first-type video frames have been accumulated and the NPU processing capacity is insufficient. In this case, the preset frame skipping ratio can be increased to obtain the first frame skipping ratio. For example, the preset frame skipping ratio of 1:2 can be adjusted to the first frame skipping ratio of 1:3.

[0076] In an optional implementation, the first length threshold is a high-water mark threshold H. Assuming the actual processing capacity of the NPU is V frames per second, the high-water mark threshold H = V. For example, V = 10, meaning 10 first-type video frames can be processed per second. If the queue length L is greater than H, the frame skipping ratio needs to be increased to reduce the enqueue rate of first-type video frames.

[0077] S503. Based on the first frame skipping ratio, determine a first video frame from the first type of video frames in the target queue and adjust it to the second type of video frame.

[0078] Based on the first frame skipping ratio, a first video frame is determined from the first type of video frames in the target queue and adjusted to a second type of video frame, on which the CPU then performs target prediction, so that the frame ratio of first-type to second-type video frames satisfies the first frame skipping ratio.

[0079] In an alternative implementation, the method may further include: S504. If the queue length is less than the second length threshold, reduce the preset frame skipping ratio to obtain the second frame skipping ratio.

[0080] If the queue length L is less than the second length threshold, it means that the NPU has spare computing power. In this case, the preset frame skipping ratio can be reduced to obtain the second frame skipping ratio. For example, the preset frame skipping ratio of 1:3 can be adjusted to the second frame skipping ratio of 1:2.

[0081] In one optional implementation, the second length threshold is a low-water mark threshold equal to V / 1.5. If the queue length L is less than the second length threshold, that is, less than V / 1.5, the frame skipping ratio needs to be reduced to allow more video frames to enter the NPU for detection.

[0082] S505. Based on the second frame skipping ratio, determine a second video frame from the second type of video frames to be predicted, and adjust it to the first type of video frame.

[0083] The second type of video frame to be predicted refers to a second type of video frame on which the CPU has not yet performed target prediction.

[0084] Based on the second frame skipping ratio, a second video frame is determined from the second type of video frames to be predicted, and the second video frame is adjusted to a first type of video frame. Target detection is then performed by the NPU so that the frame ratio of the first type of video frame to the second type of video frame satisfies the second frame skipping ratio.

[0085] In an optional implementation, when the queue length is between the second length threshold (V / 1.5) and the first length threshold (V), it indicates that the NPU load is relatively balanced, and the preset frame skipping ratio remains unchanged.
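The three load regions described in paragraphs [0074] to [0085] can be sketched as a single comparison against the two water marks. The function name `watermark_decision` and the string return values are illustrative assumptions; the thresholds H = V and V / 1.5 are the ones given in the description.

```python
def watermark_decision(queue_length: int, npu_fps: float) -> str:
    """Classify the NPU load from the queue length L and processing capacity V.

    Returns "increase" (skip more frames), "decrease" (skip fewer frames),
    or "hold" (keep the preset frame skipping ratio).
    """
    high = npu_fps          # first length threshold, H = V
    low = npu_fps / 1.5     # second length threshold, V / 1.5
    if queue_length > high:
        return "increase"   # backlog building up: NPU capacity insufficient
    if queue_length < low:
        return "decrease"   # NPU has spare computing power
    return "hold"           # load relatively balanced
```

For V = 10, a queue of 11 frames yields "increase", a queue of 6 yields "decrease" (6 is below 10 / 1.5), and a queue of 8 yields "hold".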

[0086] In an alternative implementation, the method may further include: Obtain the average detection speed of the NPU for the detected first-type video frames; Based on the average detection speed, obtain the first length threshold and the second length threshold.

[0087] While the NPU performs target detection on the first type of video frames, the average detection speed of the NPU on the detected first type of video frames can also be obtained. The average detection speed indicates the number of first-type video frames on which the NPU completes detection per unit time.

[0088] The average detection speed can be considered as the actual processing capacity of the NPU.

[0089] After dynamically acquiring the average detection speed of the NPU at regular intervals, the first length threshold and the second length threshold are dynamically updated. Alternatively, the average detection speed of the NPU can be acquired in real time to update the first length threshold and the second length threshold in real time. The first length threshold is equal to the average detection speed V, and the second length threshold is V / 1.5.
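One way to obtain the average detection speed V and refresh both thresholds is to keep a sliding window of detection completion times. This is a sketch under assumptions: the class `DetectionSpeedMeter`, the window size, and the use of caller-supplied timestamps in seconds are hypothetical; only the relations "first threshold = V" and "second threshold = V / 1.5" come from the description.

```python
from collections import deque
from typing import Deque, Tuple


class DetectionSpeedMeter:
    """Sliding-window estimate of frames detected per second (illustrative)."""

    def __init__(self, window: int = 30) -> None:
        self._stamps: Deque[float] = deque(maxlen=window)

    def record(self, finished_at: float) -> None:
        """Call once per completed NPU detection, with a timestamp in seconds."""
        self._stamps.append(finished_at)

    def average_speed(self) -> float:
        """Completed detections per second over the window; 0.0 if unknown."""
        if len(self._stamps) < 2:
            return 0.0
        span = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / span if span > 0 else 0.0


def length_thresholds(avg_speed: float) -> Tuple[float, float]:
    """Return (first length threshold, second length threshold) = (V, V / 1.5)."""
    return avg_speed, avg_speed / 1.5
```

Calling `length_thresholds(meter.average_speed())` at a fixed period corresponds to the periodic update, while calling it after every `record` corresponds to the real-time variant.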

[0090] It should be noted that the adjustment step size of the frame skipping ratio is 5%-10% of the current frame skipping ratio to avoid drastic fluctuations, and the frame skipping ratio can be limited to a reasonable range (such as a minimum of 1:1 and a maximum of 1:5).
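The bounded adjustment in paragraph [0090] can be sketched as follows. The sketch assumes a frame skipping ratio of 1:N is represented by the number N and that "increasing" the ratio means a larger N, consistent with the 1:2 to 1:3 example above; the function name and parameters are illustrative, and how a fractional N is mapped onto whole frames is left open here.

```python
def adjust_skip_ratio(n: float, direction: str,
                      step_fraction: float = 0.10,
                      n_min: float = 1.0, n_max: float = 5.0) -> float:
    """Move the 1:N frame skipping ratio by a fraction of its current value.

    step_fraction must lie in the 5%-10% range from paragraph [0090];
    the result is clamped to the example bounds 1:1 and 1:5.
    """
    if not 0.05 <= step_fraction <= 0.10:
        raise ValueError("step_fraction must be between 0.05 and 0.10")
    step = n * step_fraction
    if direction == "increase":
        n += step
    elif direction == "decrease":
        n -= step
    elif direction != "hold":
        raise ValueError("direction must be 'increase', 'decrease' or 'hold'")
    return min(n_max, max(n_min, n))
```

Scaling the step with the current ratio keeps successive adjustments small, and the clamp prevents the ratio from drifting outside the permitted range under sustained overload or idle periods.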

[0091] In this embodiment, the frame skipping ratio is dynamically adjusted according to the length of the NPU detection queue to achieve adaptive frame skipping ratio and keep the NPU load balanced.

[0092] Figure 6 is a flowchart illustrating the target detection method provided in the embodiments of this application. As shown in Figure 6, in an optional implementation, the method may further include: S601. At every preset time interval, the NPU performs target detection on the second type of video frame to be predicted, and obtains the detection result of the target object in the second type of video frame to be predicted.

[0093] At every preset time interval, a second type of video frame on which the CPU is to perform target prediction is determined as the second type of video frame to be predicted. The NPU then performs target detection on this frame to obtain the detection result of the target object in it. Finally, the CPU performs target prediction on the same frame to obtain the prediction result of the target object in it.

[0094] The CPU obtains the motion features of the target object based on the first type of video frame preceding the second type of video frame to be predicted, and performs target prediction on the second type of video frame to be predicted based on the motion features of the target object, thus obtaining the target prediction result of the target object in the second type of video frame to be predicted.

[0095] S603. Based on the detection results of the target object in the second type of video frame to be predicted and the target prediction results of the target object in the second type of video frame to be predicted, obtain the prediction confidence.

[0096] The detection result is used to indicate the position of the target object in the second type of video frame to be predicted, and the prediction result is used to indicate the position of the target object in the second type of video frame to be predicted.

[0097] The overlap is calculated based on the Euclidean distance between the detection results and the prediction results, and this overlap is used as the prediction confidence, whereby the prediction confidence is used to indicate the similarity between the CPU's prediction results and the NPU's detection results.

[0098] S604. If the prediction confidence is less than the preset confidence threshold, then the next second type video frame of the second type video frame to be predicted will be adjusted to a first type video frame.

[0099] If the prediction confidence is less than the preset confidence threshold, it means that the prediction accuracy of the CPU is insufficient. In order to improve the accuracy of target detection, the second-type video frame that follows the second-type video frame to be predicted, among the multiple second-type video frames, is adjusted to a first-type video frame.

[0100] In this embodiment, at regular intervals a non-key frame is processed by both the NPU detection pipeline and the CPU detection pipeline. The overlap between the two results is used as the confidence level. If the confidence level is low, the next non-key frame is forced to be detected as a key frame to ensure system robustness.
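The description states that the overlap is derived from the Euclidean distance between the detection and prediction results but gives no formula. The sketch below shows one possible normalization, assumed purely for illustration: the distance between box centers divided by the diagonal of the detected box, mapped to the range 0 to 1. The box layout `(x_min, y_min, x_max, y_max)`, the function names, and any threshold value passed in are likewise assumptions.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def box_center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def prediction_confidence(detected: Box, predicted: Box) -> float:
    """1.0 when the centers coincide, falling to 0.0 at one box diagonal apart.

    Assumed formula; the description only says the overlap is based on the
    Euclidean distance between the two results.
    """
    (dx, dy), (px, py) = box_center(detected), box_center(predicted)
    distance = math.hypot(px - dx, py - dy)
    diagonal = math.hypot(detected[2] - detected[0], detected[3] - detected[1])
    if diagonal == 0:
        return 0.0
    return max(0.0, 1.0 - distance / diagonal)


def should_promote_next_frame(confidence: float, threshold: float) -> bool:
    """S604: force the following second-type frame to become a first-type frame."""
    return confidence < threshold
```

Normalizing by the box diagonal makes the score independent of target size, so a single preset confidence threshold can be applied to both small and large targets.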

[0101] Figure 7 is a flowchart illustrating the target detection method provided in the embodiments of this application. As shown in Figure 7, in an optional implementation, step S105, which generates a target video frame of the target object based on the target detection result and the target prediction result, may include: S701. Based on the target detection results, the first type of video frames are fused to obtain the detection fused video frames.

[0102] The first type of video frames are fused based on the detection boxes in the target detection results to obtain the detection fused video frames, which are the first type of video frames marked with detection boxes.

[0103] S702. Based on the target prediction results, the second type of video frames are fused to obtain the predicted fused video frames.

[0104] The second type of video frames are fused based on the prediction boxes in the target prediction results to obtain the predicted fused video frames, which are the second type of video frames marked with prediction boxes.

[0105] S703. Use the detected fused video frame and the predicted fused video frame as the target video frame.

[0106] The target video frame includes a detected fused video frame and a predicted fused video frame. The target video frame can also be output at the frame rate of the original video frame.
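Steps S701 to S703 can be illustrated at the bookkeeping level: each original frame is paired with its detection boxes or its prediction boxes, and the output keeps the original frame order so that it can be emitted at the original frame rate. This sketch does not draw pixels; the function name, the integer frame identifiers, and the tuple layout are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]
TargetFrame = Tuple[int, str, List[Box]]  # (frame id, source of boxes, boxes)


def build_target_frames(frame_ids: List[int],
                        detections: Dict[int, List[Box]],
                        predictions: Dict[int, List[Box]]) -> List[TargetFrame]:
    """Merge NPU detection results and CPU prediction results in frame order."""
    output: List[TargetFrame] = []
    for frame_id in frame_ids:
        if frame_id in detections:        # first-type frame: detection boxes
            output.append((frame_id, "detection", detections[frame_id]))
        elif frame_id in predictions:     # second-type frame: prediction boxes
            output.append((frame_id, "prediction", predictions[frame_id]))
        else:
            raise KeyError(f"no result available for frame {frame_id}")
    return output
```

In a full implementation the boxes would be rendered onto the image data at this point; the ordering logic stays the same.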

[0107] In summary, this solution balances high efficiency and low latency in target detection through the parallel NPU and CPU detection pipelines. Keyframe detection accuracy is on par with the original algorithm, prediction frame accuracy loss is minimal, and detection box movement is smooth and natural, without jumps or flickering. The dual-pipeline design enables parallel operation of the CPU and NPU, improving hardware resource utilization. It automatically adapts to various scenarios, from static monitoring to high-speed motion, achieving both adaptability and robustness. It is compatible with domestic operating systems and software ecosystems, is optimized for domestic platforms, and offers a certain degree of scalability. Dynamically adjusting the frame skipping ratio, prediction model, and resource allocation improves system efficiency and energy efficiency under limited NPU computing power.

[0108] Figure 8 is a schematic diagram of the overall system architecture provided in the embodiments of this application. As shown in Figure 8, it includes: Video input module 801: Used to connect to a camera and capture raw video frames at a first frame rate (e.g., 30fps).

[0109] Frame buffer queue 802: Used to buffer captured video frames to solve the problem of mismatch between production and consumption rates.

[0110] Frame scheduler 803: Decision module, used to divide video frames into two categories: key frames and non-key frames.

[0111] NPU Detection Pipeline 804: Performs full object detection on keyframes using the NPU.

[0112] Motion prediction pipeline 805: also known as the CPU detection pipeline, which uses the CPU to perform motion prediction on non-key frames based on historical results.

[0113] Detection result cache 806: Used to store the latest detection and prediction results for the display module to read.

[0114] Display synthesis module 807: used to fuse detection / prediction boxes with the original video frames.

[0115] Video output module 808: Used to output video frames with detection results at the first frame rate.

[0116] System controller 809: Used to monitor queue length and motion characteristics of target objects, dynamically adjust frame type, and select motion prediction modules.

[0117] In this embodiment, the asynchronous processing flow includes: the video acquisition thread running independently, continuously acquiring video frames at 30fps and storing them in the frame buffer queue; the frame scheduler retrieving frames from the frame buffer queue and determining whether the frame type is a first type video frame or a second type video frame based on the system load and the motion regularity of the target; dual pipeline parallel processing to perform target detection and target prediction; and the result display thread reading the latest results from the result cache, synthesizing them into corresponding frames, and displaying them.
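The asynchronous flow can be sketched with threads and queues. This is a minimal sketch under stated assumptions, not the embodiment's implementation: the detection and prediction work is replaced by stubs that only record a label, the scheduler uses a fixed rule (one key frame followed by `skip_n` non-key frames) instead of system load and motion regularity, and the display thread is omitted. All names are hypothetical.

```python
import queue
import threading
from typing import Dict

_END = object()  # sentinel marking the end of the stream


def run_pipeline(num_frames: int, skip_n: int = 2) -> Dict[int, str]:
    """Route frames through two parallel workers and collect a label per frame."""
    frame_buffer = queue.Queue()   # frame buffer queue
    npu_queue = queue.Queue()      # first-type (key) frames
    cpu_queue = queue.Queue()      # second-type (non-key) frames
    results: Dict[int, str] = {}   # stands in for the detection result cache
    lock = threading.Lock()

    def capture() -> None:
        for frame_id in range(num_frames):
            frame_buffer.put(frame_id)
        frame_buffer.put(_END)

    def schedule() -> None:
        while True:
            frame_id = frame_buffer.get()
            if frame_id is _END:
                npu_queue.put(_END)
                cpu_queue.put(_END)
                return
            # Assumed rule: one key frame, then skip_n non-key frames.
            is_key = frame_id % (skip_n + 1) == 0
            (npu_queue if is_key else cpu_queue).put(frame_id)

    def work(source: queue.Queue, label: str) -> None:
        while True:
            frame_id = source.get()
            if frame_id is _END:
                return
            with lock:
                results[frame_id] = label  # stub for detection or prediction

    threads = [
        threading.Thread(target=capture),
        threading.Thread(target=schedule),
        threading.Thread(target=work, args=(npu_queue, "detected")),
        threading.Thread(target=work, args=(cpu_queue, "predicted")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the two workers read from separate queues, a slow key-frame worker does not block non-key frames, which is the decoupling the dual-pipeline design relies on. In a real system the prediction worker would also need access to earlier detection results, which this stub does not model.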

[0118] Figure 9 is a schematic diagram of the target detection device provided in the embodiments of this application. The device can be integrated into the CPU of a processor that also includes an NPU. The CPU and the NPU are communicatively connected.

[0119] As shown in Figure 9, the device may include: a determination module 901, used to determine multiple first-type video frames and multiple second-type video frames from the original video frames of the target object according to a preset frame skipping ratio; a processing module 902, used to perform target detection on the first type of video frame through the NPU to obtain the target detection result of the target object in the first type of video frame; an acquisition module 903, used to acquire the motion features of the target object based on the first type of video frames preceding the second type of video frames; the processing module 902 being further used to perform target prediction on the second type of video frame based on the motion characteristics of the target object, and obtain the target prediction result of the target object in the second type of video frame; and a generation module 904, used to generate target video frames of the target object based on the target detection results and target prediction results.

[0120] In an optional implementation, the determination module 901 is specifically used to: determine multiple target video frames prior to the second type of video frames from multiple first type video frames; and, if the number of multiple target video frames is greater than or equal to a preset number threshold, obtain the motion characteristics of the target object based on the motion trajectory of the target object in the multiple target video frames.

[0121] In an optional implementation, it further includes: The adjustment module 905 is used to adjust the second type of video frame to the first type of video frame if the number of multiple target video frames is less than a preset number threshold.

[0122] In an optional implementation, the processing module 902 is specifically used for: If the motion characteristics of the target object satisfy the first parameter condition, then based on the motion characteristics of the target object, the target prediction is performed on the second type of video frame through a linear prediction model to obtain the target prediction result; If the motion characteristics of the target object satisfy the second parameter condition, then based on the motion characteristics of the target object, the second type of video frame is predicted using a quadratic polynomial prediction model to obtain the target prediction result.

[0123] In an optional implementation, the adjustment module 905 is further configured to: If the motion characteristics of the target object do not meet the first parameter condition and the second parameter condition, then the second type of video frame will be adjusted to the first type of video frame.

[0124] In an optional implementation, the processing module 902 is further configured to place multiple first-type video frames into the target queue; the processing module 902 is specifically used to retrieve a first type of video frame from the target queue through the NPU, and to perform target detection on the retrieved first type of video frame to obtain the target detection result.

[0125] In an optional implementation, the acquisition module 903 is further used to obtain the queue length of the target queue based on the number of first-type video frames in the target queue; the processing module 902 is further used to increase the preset frame skipping ratio if the queue length is greater than the first length threshold, so as to obtain the first frame skipping ratio; and the adjustment module 905 is further used to determine the first video frame from the first type of video frames in the target queue according to the first frame skipping ratio, and adjust it to the second type of video frame.

[0126] In an optional implementation, the processing module 902 is further used to reduce the preset frame skipping ratio to obtain the second frame skipping ratio if the queue length is less than the second length threshold; and the adjustment module 905 is further used to determine the second video frame from the second type of video frames to be predicted according to the second frame skipping ratio, and adjust it to the first type of video frame.

[0127] In an optional implementation, the acquisition module 903 is further used to: obtain the average detection speed of the NPU for the detected first-type video frames; and, based on the average detection speed, obtain the first length threshold and the second length threshold.

[0128] In an optional implementation, the processing module 902 is further configured to: at every preset time interval, perform target detection through the NPU on the second type of video frame to be predicted, and obtain the detection result of the target object in the second type of video frame to be predicted; and, based on the detection result of the target object in the second type of video frame to be predicted and the target prediction result of the target object in the second type of video frame to be predicted, obtain the prediction confidence. The adjustment module 905 is further used to adjust the second-type video frame that follows the second-type video frame to be predicted to a first-type video frame if the prediction confidence is less than a preset confidence threshold.

[0129] In an optional implementation, the generation module 904 is specifically used for: Based on the target detection results, the first type of video frames are fused to obtain the detection fused video frames; Based on the target prediction results, the second type of video frames are fused to obtain the predicted fused video frames; The detected fused video frame and the predicted fused video frame are used as the target video frame.

[0130] It should be noted that the determination module 901 can be implemented through the frame scheduler mentioned above, the processing module 902 can be implemented through the NPU detection pipeline and motion prediction pipeline mentioned above, the acquisition module 903 and the adjustment module 905 can be implemented through the system controller mentioned above, and the generation module 904 can be implemented through the display compositing module and the video output module mentioned above.

[0131] The processing flow of each module in the device and the interaction flow between each module can be referred to the relevant descriptions in the above method embodiments, and will not be detailed here.

[0132] Figure 10 is a schematic diagram of the processor structure provided in the embodiments of this application. As shown in Figure 10, the processor may include: CPU 1001, NPU 1002 and storage unit 1003. CPU 1001 is communicatively connected to NPU 1002. CPU 1001 and NPU 1002 are also communicatively connected to storage unit 1003. Storage unit 1003 stores machine-readable instructions that can be executed by CPU 1001.

[0133] When the processor is running, CPU 1001 executes machine-readable instructions to perform the above method.

[0134] This application also provides a computer-readable storage medium storing a computer program, which is executed by a processor to perform the above-described method.

[0135] In this embodiment, the computer program, when run by the processor, can also execute other machine-readable instructions to perform other methods as described in the embodiments. For details on the specific execution steps and principles, please refer to the description of the embodiments, which will not be repeated here.

[0136] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some communication interfaces; indirect couplings or communication connections between devices or units may be electrical, mechanical, or other forms.

[0137] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0138] In addition, the functional units in the embodiments provided in this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0139] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0140] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. In addition, the terms "first", "second", "third", etc. are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0141] Finally, it should be noted that the above-described embodiments are merely specific implementations of this application, used to illustrate the technical solutions of this application, and not to limit them. The protection scope of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the scope of the technology disclosed in this application; and these modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application. All should be covered within the protection scope of this application. Therefore, the protection scope of this application should be determined by the protection scope of the claims.

Claims

1. A target detection method, characterized in that the method is applied to a central processing unit (CPU) in a processor, the processor further including a neural network dedicated processing unit (NPU), the CPU being communicatively connected to the NPU, the method comprising: determining, based on a preset frame skipping ratio, multiple first-type video frames and multiple second-type video frames from the original video frames of the target object; performing target detection on the first type of video frame using the NPU to obtain the target detection result of the target object in the first type of video frame; obtaining the motion features of the target object based on the first type of video frames preceding the second type of video frames; performing target prediction on the second type of video frame based on the motion characteristics of the target object to obtain the target prediction result of the target object in the second type of video frame; and generating a target video frame of the target object based on the target detection results and the target prediction results.

2. The method of claim 1, wherein, The step of obtaining the motion features of the target object based on the first type of video frames preceding the second type of video frames includes: Determine multiple target video frames preceding the second type of video frames from a plurality of first type of video frames; If the number of multiple target video frames is greater than or equal to a preset number threshold, the motion characteristics of the target object are obtained based on the motion trajectory of the target object in the multiple target video frames.

3. The method of claim 2, wherein, The method further includes: If the number of multiple target video frames is less than the preset number threshold, then the second type of video frame is adjusted to the first type of video frame.

4. The method of claim 1, wherein, The step of predicting the target object in the second type of video frame based on the motion characteristics of the target object, to obtain the target prediction result of the target object in the second type of video frame, includes: If the motion characteristics of the target object satisfy the first parameter condition, then based on the motion characteristics of the target object, a linear prediction model is used to predict the target in the second type of video frame to obtain the target prediction result. If the motion characteristics of the target object satisfy the second parameter condition, then based on the motion characteristics of the target object, a quadratic polynomial prediction model is used to predict the target in the second type of video frame, and the target prediction result is obtained.

5. The method of claim 4, wherein, The method further includes: If the motion characteristics of the target object do not meet the first parameter condition and the second parameter condition, then the second type of video frame is adjusted to the first type of video frame.

6. The method of claim 1, wherein, The method further includes: Place multiple video frames of the first type into the target queue; The step of performing target detection on the first type of video frames using the NPU to obtain the target detection result of the target object in the first type of video frames includes: The NPU retrieves the first type of video frame from the target queue and performs target detection on the retrieved first type of video frame to obtain the target detection result.

7. The method of claim 6, wherein, The method further includes: The queue length of the target queue is obtained based on the number of the first type of video frames in the target queue. If the queue length is greater than the first length threshold, the preset frame skipping ratio is increased to obtain the first frame skipping ratio; Based on the first frame skipping ratio, a first video frame is determined from the first type of video frames in the target queue and adjusted to the second type of video frame.

8. The method of claim 7, wherein, The method further includes: If the queue length is less than the second length threshold, then the preset frame skipping ratio is reduced to obtain the second frame skipping ratio; Based on the second frame skipping ratio, a second video frame is determined from the second type of video frames to be predicted and adjusted to the first type of video frame.

9. The method of claim 8, wherein, The method further includes: Obtain the average detection speed of the NPU for the detected first type of video frames; Based on the average detection speed, the first length threshold and the second length threshold are obtained.

10. The method of claim 1, wherein, The method further includes: At preset intervals, the NPU performs target detection on the second type of video frame to be predicted, and obtains the detection result of the target object in the second type of video frame to be predicted. Based on the detection result of the target object in the second type of video frame to be predicted and the target prediction result of the target object in the second type of video frame to be predicted, the prediction confidence is obtained; If the prediction confidence is less than a preset confidence threshold, then the next second type video frame of the second type video frame to be predicted will be adjusted to the first type video frame.

11. The method of claim 1, wherein the step of generating a target video frame for the target object based on the target detection result and the target prediction result includes: fusing the target detection result with the first type of video frame to obtain a detected fused video frame; fusing the target prediction result with the second type of video frame to obtain a predicted fused video frame; and using the detected fused video frame and the predicted fused video frame as the target video frames.

12. A target detection device, characterized in that the device is applied to a central processing unit (CPU) in a processor, the processor further including a neural network dedicated processing unit (NPU), the CPU being communicatively connected to the NPU, the device comprising: a determination module, used to determine multiple first type of video frames and multiple second type of video frames from the original video frames of the target object according to a preset frame skipping ratio; a processing module, used to perform target detection on the first type of video frame through the NPU to obtain the target detection result of the target object in the first type of video frame; an acquisition module, used to acquire the motion features of the target object based on the first type of video frames preceding the second type of video frame; the processing module being further configured to perform target prediction on the second type of video frame based on the motion features of the target object to obtain the target prediction result of the target object in the second type of video frame; and a generation module, used to generate target video frames of the target object based on the target detection result and the target prediction result.

13. A processor, characterized in that the processor includes a central processing unit (CPU), a neural network dedicated processing unit (NPU), and a storage unit; the CPU is communicatively connected to the NPU, and the CPU and the NPU are each communicatively connected to the storage unit; the storage unit stores machine-readable instructions executable by the CPU; and, when the processor is running, the CPU executes the machine-readable instructions to perform the method according to any one of claims 1 to 11.

14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a central processing unit (CPU), performs the method according to any one of claims 1 to 11.