Video stream processing method, video stream processing apparatus, electronic device, and program product
By acquiring keyframes in an embedded device and processing them using interpolation and transparency, the problem of mismatch between algorithm processing capabilities and frame rate is solved, enabling the generation of more video frames containing target boxes under limited resources, thus improving the real-time performance and stability of the system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN STREAMING VIDEO TECH
- Filing Date
- 2026-02-02
- Publication Date
- 2026-06-23
AI Technical Summary
In embedded devices, computing resources are limited, so the detection algorithm's processing frame rate falls short of the frame rate required for encoding. As a result, not enough video frames containing target bounding boxes can be generated, and forcing detection on more frames leads to system lag or crashes.
By acquiring keyframes from the original video stream, object detection is performed and target bounding boxes are drawn. The attribute information of keyframe pairs is used to infer the target bounding box attributes of non-keyframes, reducing repeated detection of non-keyframes. The complete video stream is generated by combining interpolation and transparency processing.
By generating more video frames containing target bounding boxes with limited resources, the method reduces computational resource consumption, improves the real-time performance and stability of the video stream, and avoids system overload.
Figure CN122265902A_ABST
Description
Technical Field
[0001] This application belongs to the field of computer vision technology, and in particular relates to a video stream processing method, a video stream processing device, an electronic device, and a program product.
Background Technology
[0002] With the rapid development of artificial intelligence technology, deep learning-based intelligent video surveillance systems have been widely used in automotive, security, and industrial fields. However, deploying intelligent video algorithms in embedded devices faces a contradiction between limited computing resources and real-time requirements. This technical challenge severely restricts the popularization and application effectiveness of intelligent surveillance systems.
[0003] In current embedded devices (such as in-vehicle smart terminals and smart electronic rearview mirrors), there is a common mismatch between the algorithm's processing capability and the frame rate required for encoding. For example, for complex object detection algorithms (such as pedestrian and vehicle detection), computational resource limitations typically allow a processing frame rate of only 15-18 fps, while high-quality video recording requires a frame rate of 25-30 fps. Forcing the object detection algorithm to run on more video frames would far exceed the device's load capacity, leading to system stuttering, overheating, or even crashes. Therefore, how to generate more video frames containing target bounding boxes under limited resources is a pressing technical problem.
Summary of the Invention
[0004] This application provides a video stream processing method, a video stream processing device, an electronic device, and a program product, which can generate more video frames containing target bounding boxes with limited resources.
[0005] In a first aspect, embodiments of this application provide a video stream processing method, including:
[0006] Acquiring multiple keyframes from the original video stream; inputting the multiple keyframes into the algorithm processing module for target detection, and drawing the first target bounding box of each detected target on the corresponding keyframe; for at least one keyframe pair among the multiple keyframes, if the same target exists in any keyframe pair, determining, based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, the attribute information of the second target bounding box of the same target in each non-keyframe, where the keyframe pair includes two keyframes, each non-keyframe is a non-keyframe located between the two keyframes of the pair in the original video stream, and the non-keyframes between different keyframe pairs are different; and, for any of the non-keyframes, drawing the second target bounding box corresponding to the same target on the non-keyframe based on the attribute information of the second target bounding box of the same target in that non-keyframe.
[0007] In this embodiment, by acquiring multiple keyframes from the original video stream, only the multiple keyframes are input into the algorithm processing module for target detection and drawing of the first target bounding box. For at least one keyframe pair among the multiple keyframes, if the same target exists in any keyframe pair, the attribute information of the second target bounding box of the same target in each non-keyframe can be determined based on the attribute information of the first target bounding box of the same target detected in this keyframe pair. Based on the attribute information, the second target bounding box is drawn on the corresponding non-keyframe, without having to input each non-keyframe into the algorithm processing module for target detection. This significantly reduces the consumption of computing resources and enables the generation of more video frames containing target bounding boxes with limited resources.
[0008] In some embodiments of the first aspect, before determining the attribute information of the second target bounding box of the same target in each non-keyframe based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, the method further includes: For any one of the non-key frames, the interpolation ratio corresponding to the non-key frame is calculated based on the timestamp of the non-key frame and the timestamp of the key frame pair. The step of determining the attribute information of the second target bounding box of the same target in each non-keyframe based on the attribute information of the first target bounding box of the same target detected in the keyframe pair includes: For any one of the non-key frames, based on the interpolation ratio corresponding to the non-key frame, the attribute information of the first target box of the same target detected in the key frame pair is interpolated to obtain the interpolated attribute information. Based on the interpolated attribute information, the attribute information of the second target bounding box of the same target in the non-keyframe is determined.
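The interpolation described in the paragraph above can be sketched as follows. This is a minimal sketch, assuming the interpolation ratio is the fraction of the keyframe-pair interval that has elapsed at the non-keyframe and that attributes are blended linearly; the function names and the (x, y, width, height) box layout are illustrative assumptions, not details stated in this application.

```python
from typing import Tuple

# Assumed box layout: (x, y, width, height). The application only says the
# attribute information includes position and size.
Box = Tuple[float, float, float, float]


def interpolation_ratio(t_frame: float, t_prev: float, t_next: float) -> float:
    """Fraction of the keyframe-pair interval elapsed at the non-keyframe."""
    if t_next <= t_prev:
        raise ValueError("next keyframe must be later than previous keyframe")
    return (t_frame - t_prev) / (t_next - t_prev)


def interpolate_box(prev_box: Box, next_box: Box, ratio: float) -> Box:
    """Linearly blend each attribute of the two keyframe boxes (assumed form)."""
    return tuple(p + (n - p) * ratio for p, n in zip(prev_box, next_box))


# Example: a non-keyframe halfway between two keyframes.
ratio = interpolation_ratio(t_frame=40.0, t_prev=0.0, t_next=80.0)
box = interpolate_box((0, 0, 10, 10), (20, 40, 30, 10), ratio)
```

A non-keyframe closer to the previous keyframe gets a ratio near 0 and therefore a box close to the one detected in the previous keyframe.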
[0009] In some embodiments of this application, the attribute information includes location information; before determining the attribute information of the second target bounding box of the same target in the non-keyframe based on the interpolated attribute information, the method further includes: Based on the position information of the first target box of the same target detected in the keyframe pair and the timestamp of the keyframe pair, the average motion velocity of the same target between the keyframe pairs is calculated. Based on the average motion velocity between the keyframe pairs and the average motion velocity of the same target between the previous keyframe and its preceding historical keyframes in the keyframe pair, the acceleration of the same target is calculated, where the previous keyframe is the keyframe with the earliest timestamp in the keyframe pair. Based on the position information of the first target bounding box of the same target detected in the previous keyframe, the average motion velocity of the same target between the keyframe pairs, the acceleration of the same target, and a first time interval, the predicted position information of the same target in the non-keyframe is calculated; the first time interval refers to the time interval between the timestamp of the non-keyframe and the timestamp of the previous keyframe. The step of determining the attribute information of the second target bounding box of the same target in the non-keyframe based on the interpolated attribute information includes: The interpolated position information and the predicted position information are fused to obtain the fused position information; The fused location information is determined as the location information of the second target box of the same target in the non-key frame.
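The motion-based prediction and fusion in the paragraph above can be illustrated with the following sketch. It assumes constant-acceleration kinematics for the prediction, a caller-supplied time base for the acceleration, and a fixed weighted average for the fusion; none of these specific forms, nor the function names, are given in this section of the application.

```python
from typing import Tuple

Point = Tuple[float, float]  # (x, y) position of a box, assumed representation


def average_velocity(p_start: Point, p_end: Point, dt: float) -> Point:
    """Average velocity between two keyframe detections dt apart."""
    return ((p_end[0] - p_start[0]) / dt, (p_end[1] - p_start[1]) / dt)


def acceleration(v_history: Point, v_pair: Point, dt: float) -> Point:
    """Change in average velocity over dt (the time base is an assumption)."""
    return ((v_pair[0] - v_history[0]) / dt, (v_pair[1] - v_history[1]) / dt)


def predict_position(p_prev: Point, v: Point, a: Point, dt: float) -> Point:
    """Constant-acceleration prediction from the previous keyframe (assumed)."""
    return (p_prev[0] + v[0] * dt + 0.5 * a[0] * dt * dt,
            p_prev[1] + v[1] * dt + 0.5 * a[1] * dt * dt)


def fuse(interpolated: Point, predicted: Point, weight: float = 0.5) -> Point:
    """Weighted average of the two estimates; the weight is hypothetical."""
    return (weight * interpolated[0] + (1 - weight) * predicted[0],
            weight * interpolated[1] + (1 - weight) * predicted[1])


v_hist = average_velocity((0.0, 0.0), (10.0, 0.0), 1.0)
v_pair = average_velocity((10.0, 0.0), (30.0, 0.0), 1.0)
acc = acceleration(v_hist, v_pair, 1.0)
predicted = predict_position((10.0, 0.0), v_pair, acc, 0.5)
fused = fuse((20.0, 0.0), predicted)
```

The first time interval from the paragraph above is the `dt` passed to `predict_position`.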
[0010] In some embodiments of this application, after the multiple keyframes are input into the algorithm processing module for target detection, the method further includes: for a first target detected in the next keyframe of the keyframe pair but not detected in the previous keyframe, determining the attribute information of the first target bounding box of the first target detected in the next keyframe as the attribute information of the second target bounding box of the first target in each non-keyframe, where the next keyframe is the keyframe with the latest timestamp in the keyframe pair and the previous keyframe is the keyframe with the earliest timestamp in the keyframe pair; for any one of the non-keyframes, calculating a first ratio of a first time interval to a preset fading time constant, where the first time interval is the time interval between the timestamp of the non-keyframe and the timestamp of the previous keyframe; determining the minimum of the first ratio and the maximum transparency as the transparency of the second target bounding box of the first target in the non-keyframe; and, based on the attribute information and transparency of the second target bounding box of the first target in the non-keyframe, drawing the second target bounding box corresponding to the first target on the non-keyframe.
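The fade-in rule for a newly appearing target can be written directly from the paragraph above. The sketch below assumes transparency is expressed as an opacity-like value in [0, 1] with a default maximum of 1.0, and that timestamps and the fading time constant share the same unit; the function and parameter names are illustrative.

```python
def fade_in_transparency(t_frame: float, t_prev_key: float,
                         fade_time_constant: float,
                         max_transparency: float = 1.0) -> float:
    """Transparency of a box for a target that appears only in the next keyframe."""
    if fade_time_constant <= 0:
        raise ValueError("fading time constant must be positive")
    # First ratio: elapsed time since the previous keyframe over the constant.
    first_ratio = (t_frame - t_prev_key) / fade_time_constant
    # The box becomes more visible over time, capped at the maximum.
    return min(first_ratio, max_transparency)


# Timestamps in milliseconds, fading time constant of 400 ms (assumed values).
early = fade_in_transparency(100, 0, 400)
late = fade_in_transparency(800, 0, 400)
```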
[0011] In some embodiments of this application, after the multiple keyframes are input into the algorithm processing module for target detection, the method further includes: for a second target detected in the previous keyframe but not detected in the next keyframe, determining the attribute information of the first target bounding box of the second target detected in the previous keyframe as the attribute information of the second target bounding box of the second target in each non-keyframe; for any one of the non-keyframes, calculating a second ratio of the first time interval to a second time interval, where the second time interval is the time interval between the timestamp of the next keyframe and the timestamp of the previous keyframe; subtracting the second ratio from the maximum transparency to obtain a target difference; determining the maximum of the minimum transparency and the target difference as the transparency of the second target bounding box of the second target in the non-keyframe; and, based on the attribute information and transparency of the second target bounding box of the second target in the non-keyframe, drawing the second target bounding box corresponding to the second target on the non-keyframe.
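The fade-out rule for a disappearing target follows the same pattern. The sketch below mirrors the steps in the paragraph above, assuming transparency values in [0, 1] with default bounds of 0.0 and 1.0; the names and default bounds are illustrative assumptions.

```python
def fade_out_transparency(t_frame: float, t_prev_key: float, t_next_key: float,
                          min_transparency: float = 0.0,
                          max_transparency: float = 1.0) -> float:
    """Transparency of a box for a target seen only in the previous keyframe."""
    if t_next_key <= t_prev_key:
        raise ValueError("next keyframe must be later than previous keyframe")
    # Second ratio: first time interval over the keyframe-pair interval.
    second_ratio = (t_frame - t_prev_key) / (t_next_key - t_prev_key)
    target_difference = max_transparency - second_ratio
    # The box fades as the non-keyframe approaches the next keyframe.
    return max(min_transparency, target_difference)


quarter_way = fade_out_transparency(25, 0, 100)
at_end = fade_out_transparency(100, 0, 100, min_transparency=0.2)
```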
[0012] In some embodiments of this application, the video stream processing method further includes: compositing all keyframes on which first target bounding boxes have been drawn and all non-keyframes on which second target bounding boxes have been drawn to obtain the target video stream; and outputting the target video stream to the video encoder.
[0013] In some embodiments of this application, obtaining multiple keyframes from the original video stream includes: The third time interval is obtained by dividing the frame rate of the original video stream by the current frame rate of the algorithm processing module and rounding the result down. The motion intensity of each video frame is calculated based on the pixel values of the pixels in each video frame of the original video stream. Obtain current system load usage metrics; Based on the current system load usage indicators, calculate the current system overall load value; For any of the video frames, a comprehensive evaluation score is calculated based on the third time interval, the motion intensity of the video frame, and the current system load value. If the overall evaluation score of the video frame is greater than the first score threshold, then the video frame is identified as the key frame.
[0014] In some embodiments of this application, calculating the comprehensive evaluation score of the video frame based on the third time interval, the motion intensity of the video frame, and the current system overall load value includes: Based on the sequence number of the video frame and the third time interval, calculate the time interval score of the video frame; Calculate the motion intensity score of the video frame based on the motion intensity of the video frame; Calculate the system load score based on the current overall system load value, upper load threshold, and lower load threshold. Based on the system load score, the time interval score of the video frame, and the motion intensity score, a comprehensive evaluation score for the video frame is calculated.
[0015] In some embodiments of this application, the video stream processing method further includes: Calculate the frame rate adjustment coefficient based on the current system overall load value, upper load threshold, and lower load threshold. Based on the frame rate adjustment coefficient, the current algorithm processing frame rate is adjusted to obtain the adjusted algorithm processing frame rate; If the adjusted algorithm processing frame rate is greater than the lowest algorithm processing frame rate but less than the highest algorithm processing frame rate, then the current algorithm processing frame rate is updated to the adjusted algorithm processing frame rate. If the adjusted algorithm processing frame rate is less than or equal to the minimum algorithm processing frame rate, then the current algorithm processing frame rate is updated to the minimum algorithm processing frame rate. If the adjusted algorithm processing frame rate is greater than or equal to the highest algorithm processing frame rate, then the current algorithm processing frame rate is updated to the highest algorithm processing frame rate.
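The clamping logic in the paragraph above can be sketched as follows. The application does not give the formula for the frame rate adjustment coefficient in this section, so `adjustment_coefficient` below is a purely hypothetical stand-in (a fixed step up or down), and applying the coefficient by multiplication is likewise an assumption; only the three-way update rule is taken from the text.

```python
def adjustment_coefficient(load: float, lower: float, upper: float,
                           step: float = 0.1) -> float:
    """Hypothetical coefficient: slow down above the upper load threshold,
    speed up below the lower one, otherwise leave the rate unchanged."""
    if load > upper:
        return 1.0 - step
    if load < lower:
        return 1.0 + step
    return 1.0


def update_frame_rate(current_fps: float, coefficient: float,
                      min_fps: float, max_fps: float) -> float:
    """Apply the coefficient, then clamp to [min_fps, max_fps] as described."""
    adjusted = current_fps * coefficient  # multiplicative form is assumed
    if adjusted <= min_fps:
        return min_fps
    if adjusted >= max_fps:
        return max_fps
    return adjusted


coeff = adjustment_coefficient(load=0.9, lower=0.3, upper=0.8)
new_fps = update_frame_rate(15.0, coeff, min_fps=10.0, max_fps=18.0)
```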
[0016] In some embodiments of this application, after the multiple keyframes are input into the algorithm processing module for target detection, the method further includes: calculating the feature similarity between two targets in the keyframe pair and the intersection-over-union (IoU) ratio between the first target bounding boxes that enclose the two targets; based on the intersection-over-union ratio and the feature similarity, calculating a comprehensive matching score of the two targets; and, if the comprehensive matching score is greater than a second score threshold, determining that the two targets are the same target.
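A minimal sketch of this matching step is given below. The intersection-over-union computation is standard; the use of cosine similarity for the feature similarity, the equal weighting of the two terms, and the threshold value are assumptions chosen for illustration, since this section does not specify them.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Feature similarity; cosine similarity is an assumed choice."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def matching_score(iou_value: float, similarity: float,
                   iou_weight: float = 0.5) -> float:
    """Weighted combination of the two cues; the weight is hypothetical."""
    return iou_weight * iou_value + (1 - iou_weight) * similarity


def is_same_target(box_a: Box, feat_a: Sequence[float],
                   box_b: Box, feat_b: Sequence[float],
                   threshold: float = 0.6) -> bool:
    score = matching_score(iou(box_a, box_b), cosine_similarity(feat_a, feat_b))
    return score > threshold
```

Combining position overlap with appearance features lets the match survive moderate motion between keyframes, where IoU alone would drop.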
[0017] In some embodiments of this application, the keyframe pair includes two keyframes that are adjacent in the plurality of keyframes.
[0018] In a second aspect, embodiments of this application provide a video stream processing apparatus, including: a keyframe acquisition module, used to acquire multiple keyframes from the original video stream; a target detection module, used to input the multiple keyframes into the algorithm processing module for target detection, and to draw the first target bounding box of each detected target on the corresponding keyframe; an information determination module, used to, for at least one keyframe pair among the multiple keyframes, if the same target exists in any keyframe pair, determine, based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, the attribute information of the second target bounding box of the same target in each non-keyframe, where the keyframe pair includes two keyframes, each non-keyframe is a non-keyframe located between the two keyframes of the pair in the original video stream, and the non-keyframes between different keyframe pairs are different; and a target drawing module, used to, for any one of the non-keyframes, draw the second target bounding box corresponding to the same target on the non-keyframe based on the attribute information of the second target bounding box of the same target in that non-keyframe.
[0019] In a third aspect, embodiments of this application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the electronic device implements the video stream processing method as described in any one of the embodiments of the first aspect above.
[0020] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a computer, implements the video stream processing method as described in any one of the embodiments of the first aspect above.
[0021] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program which, when run, causes the video stream processing method as described in any one of the embodiments of the first aspect above to be executed.
[0022] It can be understood that, for the beneficial effects of the second to fifth aspects above, reference may be made to the relevant description of the first aspect; they are not repeated here.
Brief Description of the Drawings
[0023] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a schematic flowchart of a video stream processing method provided in an embodiment of this application; Figure 2 is another schematic flowchart of the video stream processing method provided in the embodiments of this application; Figure 3 is a schematic diagram of the structure of the video stream processing device provided in the embodiments of this application; Figure 4 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0025] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.
[0026] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
[0027] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0028] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0029] References to "one embodiment" or "some embodiments" in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized.
[0030] The video stream processing method provided in this application can be applied to electronic devices. These electronic devices include, but are not limited to, embedded devices with limited computing resources, such as in-vehicle smart terminals, smart electronic rearview mirrors, wearable devices, in-vehicle devices, augmented reality (AR) / virtual reality (VR) devices, and mobile terminals (such as mobile phones and tablets); they can also include devices with stronger computing capabilities, such as desktop computers, servers, laptops, ultra-mobile personal computers (UMPCs), netbooks, and personal digital assistants (PDAs). This application does not impose any restrictions on the specific type of electronic device.
[0031] To illustrate the technical solution of this application, specific embodiments are described below.
[0032] Refer to Figure 1, which shows a schematic flowchart of a video stream processing method provided in an embodiment of this application. As an example and not a limitation, the method includes the following steps:
Step 101: Obtain multiple keyframes from the original video stream.
[0033] The aforementioned original video stream can refer to a sequence of video frames that have not been processed by this application and whose video frames themselves do not contain any target boxes. A target box can refer to a bounding box used to identify a target (such as a pedestrian, vehicle, etc.).
[0034] The aforementioned original video stream may originate from a video stream captured in real time by a camera, or it may be a pre-stored video stream read from a storage medium. This application does not limit the specific source of the original video stream.
[0035] The aforementioned keyframes can refer to video frames selected from the original video stream that need to be input into the algorithm processing module.
[0036] The original video stream typically contains a large number of video frames, from which some video frames can be selected as keyframes.
[0037] The aforementioned algorithm processing module can refer to a software module, hardware unit, or a combination of software and hardware that runs the target detection algorithm.
[0038] Step 102: Input multiple keyframes into the algorithm processing module for target detection, and draw the first target bounding box of the detected target on the corresponding keyframe.
[0039] In this embodiment, multiple (i.e. at least two) keyframes are input into the algorithm processing module. The algorithm processing module can perform target detection on each keyframe. If a target is detected in the keyframe, the algorithm processing module outputs attribute information of the target box (i.e., the first target box) to identify the target. Then, the electronic device can draw the first target box on the keyframe based on the attribute information of the first target box.
[0040] It should be noted that a keyframe may contain one or more targets. When there are multiple targets in a keyframe, the first target bounding box of each target is drawn on the corresponding keyframe, and target matching is performed between adjacent keyframes.
[0041] Step 103: For at least one keyframe pair among multiple keyframes, if the same target exists in any keyframe pair, then based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, determine the attribute information of the second target bounding box of the same target in each non-keyframe.
[0042] The aforementioned keyframe pair includes two keyframes. The aforementioned non-keyframes can refer to the video frames located between the two keyframes of a keyframe pair in the original video stream. The aforementioned "same target" can refer to a target that is present in both keyframes of a keyframe pair. For example, if pedestrian A is detected in both keyframes of a keyframe pair, then pedestrian A is the same target.
[0043] In some embodiments, the non-keyframes between different keyframe pairs are different (e.g., completely different) to avoid drawing the target box multiple times for the same non-keyframe.
[0044] In some embodiments, the two keyframes included in a keyframe pair may be adjacent in a plurality of keyframes (i.e., every two adjacent keyframes in a plurality of keyframes are determined as a keyframe pair), or they may not be adjacent. This application does not limit this.
[0045] As an example, and not a limitation, 15 keyframes are obtained from the original video stream. These 15 keyframes are named Keyframe 1, Keyframe 2, Keyframe 3, ..., Keyframe 15, in ascending order of their timestamps. Each pair of adjacent keyframes is defined as a keyframe pair. Therefore, Keyframe 1 and Keyframe 2 form one keyframe pair, Keyframe 2 and Keyframe 3 form another, ..., and Keyframe 14 and Keyframe 15 form another, resulting in 14 keyframe pairs. It should be noted that, in the original video stream, there may be several non-keyframes between the two keyframes of any pair. Steps 103 and 104 iterate through these 14 keyframe pairs to draw the second target bounding boxes on each non-keyframe between the keyframes of these 14 pairs.
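The pairing in this example can be sketched as follows, assuming keyframes are identified by their frame index in the original video stream; the function names and the choice of every second frame as a keyframe are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def adjacent_keyframe_pairs(keyframe_indices: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair each keyframe with the next one in timestamp (index) order."""
    ordered = sorted(keyframe_indices)
    return list(zip(ordered, ordered[1:]))


def non_keyframes_between(pair: Tuple[int, int]) -> List[int]:
    """Frame indices strictly between the two keyframes of a pair."""
    prev_idx, next_idx = pair
    return list(range(prev_idx + 1, next_idx))


# 15 keyframes, here assumed to be every second frame of the original stream.
keyframes = list(range(0, 30, 2))
pairs = adjacent_keyframe_pairs(keyframes)
in_between = [f for pair in pairs for f in non_keyframes_between(pair)]
```

Because adjacent pairs share only a keyframe and never a non-keyframe, each non-keyframe is visited once, which is consistent with the requirement that non-keyframes between different keyframe pairs are different.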
[0046] In this embodiment, when keyframe pairs are limited to adjacent keyframes, the target's motion state exhibits high continuity within a short period due to the close timestamps of these adjacent keyframes. This improves the accuracy of the attribute information of the second target bounding box in each non-keyframe. Allowing keyframe pairs to be formed from any two keyframes enhances the flexibility in keyframe pair selection.
[0047] In some embodiments, the attribute information of the first target bounding box includes, but is not limited to, the position information and size information of the first target bounding box. The position information of the first target bounding box is used to determine the position of the first target bounding box in the corresponding keyframe. The size information of the first target bounding box is used to determine the size of the first target bounding box in the corresponding keyframe. Optionally, the attribute information of the first target bounding box may also include the category, confidence level, etc. of the target enclosed or selected by the first target bounding box.
[0048] Step 104: For any non-key frame among all non-key frames, based on the attribute information of the second target box of the same target in the non-key frame, draw the second target box corresponding to the same target on the non-key frame.
[0049] In some embodiments, the attribute information of the second target bounding box includes, but is not limited to, the position information and size information of the second target bounding box. The position information of the second target bounding box is used to determine the position of the second target bounding box in the corresponding non-keyframe. The size information of the second target bounding box is used to determine the size of the second target bounding box in the corresponding non-keyframe. Optionally, the attribute information of the second target bounding box may also include the category, confidence level, etc. of the target enclosed or selected by the second target bounding box.
[0050] In this embodiment, based on the attribute information of the second target box of the same target in a non-key frame, a graphics drawing operation can be performed on the non-key frame, thereby realizing the drawing of the second target box on the non-key frame.
[0051] In this embodiment, by acquiring multiple keyframes from the original video stream, only the multiple keyframes are input into the algorithm processing module for target detection and drawing of the first target bounding box. For at least one keyframe pair among the multiple keyframes, if the same target exists in any keyframe pair, the attribute information of the second target bounding box of the same target in each non-keyframe can be determined based on the attribute information of the first target bounding box of the same target detected in this keyframe pair. Based on the attribute information, the second target bounding box is drawn on the corresponding non-keyframe, without having to input each non-keyframe into the algorithm processing module for target detection. This significantly reduces the consumption of computing resources and enables the generation of more video frames containing target bounding boxes with limited resources.
[0052] In some embodiments of this application, as shown in Figure 2, the multiple keyframes can be obtained from the original video stream through steps 201 to 206.
[0053] Step 201: Divide the frame rate of the original video stream by the current frame rate of the algorithm processing module and round down to obtain the third time interval.
[0054] The frame rate of the raw video stream can refer to the inherent frame rate of the raw video stream, which is also the frame rate that the system output video stream needs to reach. It is usually 25-30fps to ensure the smoothness of the video.
[0055] The current algorithm processing frame rate of the algorithm processing module refers to the frame rate at which the algorithm processing module can run the target detection algorithm stably in real time under the computing resource limitations of the electronic device. It is usually lower than the frame rate of the original video stream. For example, the current algorithm processing frame rate of the algorithm processing module is 15fps.
[0056] The aforementioned third time interval indicates the number of video frames after which the algorithm processing module performs target detection once; that is, detection runs once every that many video frames.
[0057] The formula for calculating the third time interval mentioned above is as follows:
$$T = \left\lfloor \frac{F_{\text{src}}}{F_{\text{alg}}} \right\rfloor$$
[0058] where $T$ denotes the third time interval, $F_{\text{src}}$ denotes the frame rate of the original video stream, $F_{\text{alg}}$ denotes the current algorithm processing frame rate of the algorithm processing module, and $\lfloor \cdot \rfloor$ denotes the floor function.
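As a small worked illustration of step 201, the third time interval can be computed as below. The function and parameter names are illustrative; the computation itself (divide, then round down) is the one described in step 201.

```python
import math


def third_time_interval(source_fps: float, algorithm_fps: float) -> int:
    """Floor of the original stream frame rate over the algorithm frame rate."""
    if algorithm_fps <= 0:
        raise ValueError("algorithm processing frame rate must be positive")
    return math.floor(source_fps / algorithm_fps)


# With a 30 fps stream and a 15 fps detector, detection runs every 2 frames.
interval = third_time_interval(30, 15)
```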
[0059] Step 202: Calculate the motion intensity of each video frame based on the pixel values of each video frame in the original video stream.
[0060] The motion intensity of a video frame represents how intense the motion is in the scene captured by that video frame.
[0061] For any video frame, the motion intensity of that video frame can be calculated based on the pixel values of the pixels in that video frame and the pixel values of those pixels in the previous video frame. For example, for a video frame at time t, the motion intensity of the video frame at time t can be calculated based on the pixel values of the pixels in that video frame at time t and the pixel values of those pixels in the video frame at time t-1.
[0062] The formula for calculating the motion intensity of any video frame is as follows:
[0063] where the terms of the formula denote, in order: the motion intensity of the video frame; the pixel value of the i-th pixel in the t-th video frame; the pixel value of the i-th pixel in the preceding, (t-1)-th, video frame; and the total number of pixels in a video frame.
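One plausible reading of step 202 is a mean absolute difference between corresponding pixels of consecutive frames. The sketch below implements that reading as an assumption; the exact form of the formula is not recoverable from the text above, and the flat-list frame representation and function name are illustrative.

```python
from typing import Sequence


def motion_intensity(current: Sequence[float], previous: Sequence[float]) -> float:
    """Mean absolute pixel difference between two frames (assumed formula).

    Each frame is given as a flat sequence of pixel values of equal length.
    """
    if len(current) != len(previous) or not current:
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(c - p) for c, p in zip(current, previous))
    return total / len(current)


# Four-pixel toy frames: differences are 0, 10, 20, 30.
intensity = motion_intensity([10, 20, 30, 40], [10, 10, 10, 10])
```

A static scene yields a value near zero, while large frame-to-frame changes push the value up.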
[0064] Step 203: Obtain the current system load usage metrics.
[0065] The aforementioned current system load usage metrics can refer to the current system load usage metrics of electronic devices, including but not limited to at least one of the following: Central Processing Unit (CPU) utilization, Graphics Processing Unit (GPU) utilization, Neural Processing Unit (NPU) utilization, memory utilization, and memory bandwidth utilization.
[0066] Step 204: Calculate the current comprehensive system load value based on the current system load usage metrics.
[0067] When there is only one current system load usage metric, that metric can be used directly as the current comprehensive system load value. When there are at least two current system load usage metrics, the current comprehensive system load value can be obtained as the weighted sum of those metrics using their weight coefficients. The weight coefficients of the at least two current system load usage metrics can sum to 1.
[0068] Optionally, the range of the above-mentioned current system comprehensive load value can be [0, 1].
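[0069] When the current system load usage metrics include CPU utilization, GPU utilization, NPU utilization, memory utilization, and memory bandwidth utilization, the formula for calculating the current comprehensive system load value is as follows:
$$L = w_{\text{cpu}} U_{\text{cpu}} + w_{\text{gpu}} U_{\text{gpu}} + w_{\text{npu}} U_{\text{npu}} + w_{\text{mem}} U_{\text{mem}} + w_{\text{bw}} U_{\text{bw}}$$
[0070] where $L$ denotes the current comprehensive system load value; $U_{\text{cpu}}$ denotes the CPU utilization; $U_{\text{gpu}}$ denotes the GPU utilization; $U_{\text{npu}}$ denotes the NPU utilization; $U_{\text{mem}}$ denotes the memory utilization; $U_{\text{bw}}$ denotes the memory bandwidth utilization; and $w_{\text{cpu}}$, $w_{\text{gpu}}$, $w_{\text{npu}}$, $w_{\text{mem}}$, $w_{\text{bw}}$ denote the weight coefficients for CPU utilization, GPU utilization, NPU utilization, memory utilization, and memory bandwidth utilization, respectively.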
[0071] The CPU utilization can be calculated by reading the system status file, using the following formula:
$$U_{\text{cpu}} = \frac{C_{\text{used}}}{C_{\text{total}}}$$
[0072] where $C_{\text{used}}$ denotes the CPU resources already used and $C_{\text{total}}$ denotes the total CPU resources.
[0073] The GPU utilization can be obtained through the GPU status query interface, using the following formula:
$$U_{\text{gpu}} = \frac{G_{\text{used}}}{G_{\text{total}}}$$
[0074] where $G_{\text{used}}$ denotes the GPU resources already in use and $G_{\text{total}}$ denotes the total GPU resources.
[0075] The formula for calculating the NPU utilization is as follows:
$$U_{\text{npu}} = \frac{N_{\text{used}}}{N_{\text{total}}}$$
[0076] where $N_{\text{used}}$ denotes the NPU resources already used and $N_{\text{total}}$ denotes the total NPU resources.
[0077] The memory utilization can be calculated from the system memory information, using the following formula:
$$U_{\text{mem}} = \frac{R_{\text{used}}}{R_{\text{total}}}$$
[0078] where $R_{\text{used}}$ denotes the amount of memory already used and $R_{\text{total}}$ denotes the total memory capacity.
[0079] The formula for calculating the memory bandwidth utilization is as follows:
$$U_{\text{bw}} = \frac{B_{\text{used}}}{B_{\text{total}}}$$
[0080] where $B_{\text{used}}$ denotes the memory bandwidth already used and $B_{\text{total}}$ denotes the total memory bandwidth.
[0081] In some embodiments, the CPU utilization, GPU utilization, NPU utilization, memory utilization, and memory bandwidth utilization of electronic devices can be queried periodically to update the current overall system load value regularly.
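The weighted summation of Step 204 can be sketched as follows. This is a minimal Python example, not the applicant's implementation: the function name `composite_load`, the dictionary keys, and the example weights are all hypothetical, and each utilization is assumed to be a used/total ratio in [0, 1].

```python
from typing import Mapping


def composite_load(usage: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of utilization metrics, each expected in [0, 1]."""
    if usage.keys() != weights.keys():
        raise ValueError("usage and weights must cover the same metrics")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weight coefficients must sum to 1")
    return sum(weights[name] * usage[name] for name in usage)


if __name__ == "__main__":
    # Hypothetical readings and weights, for illustration only.
    usage = {"cpu": 0.8, "gpu": 0.6, "npu": 0.4, "mem": 0.5, "bw": 0.2}
    weights = {"cpu": 0.3, "gpu": 0.2, "npu": 0.3, "mem": 0.1, "bw": 0.1}
    print(composite_load(usage, weights))
```

With a single metric and a weight of 1, the function returns that metric unchanged, which matches the single-metric case described in the text.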
[0082] Step 205: For each video frame, calculate the comprehensive evaluation score of the video frame based on the third time interval, the motion intensity of the video frame, and the current comprehensive system load value.
[0083] The sequence number of a video frame can be an integer index assigned to each video frame in the original video stream in chronological order. For example, the sequence numbers of video frames start from 1 and increment sequentially.
[0084] In this embodiment, the video frame is comprehensively evaluated based on multiple factors such as the third time interval, the motion intensity of the video frame, and the current overall system load value. This provides an intelligent dynamic decision-making basis for the selection of key frames, thereby achieving adaptive load balancing of computing resources. It avoids overload or idleness according to the real-time status of the system, achieves scene-adaptive guarantee of detection accuracy, automatically increases detection density when the motion is complex, prevents target loss, and lays a high-quality data foundation for downstream interpolation processing. This ensures that the target bounding boxes in the final generated target video stream are continuous, accurate, and smooth, thereby fundamentally improving the system's processing efficiency, output quality, and overall robustness in resource-constrained environments.
[0085] Step 206: If the comprehensive evaluation score of the video frame is greater than the first score threshold, the video frame is determined to be a key frame.
[0086] Optionally, a first score threshold can be set based on actual needs or experience. For example, the first score threshold could be 0.7.
[0087] In this embodiment, the computational load and processing accuracy can be flexibly balanced globally by using the first score threshold, which greatly enhances the system's adaptability to different application scenarios and hardware platforms. This ensures that only video frames with excellent comprehensive evaluation can consume expensive algorithm processing resources, thus optimizing computational efficiency from the source of decision-making.
[0088] In some embodiments, if the overall evaluation score of a video frame is less than or equal to a first score threshold, the video frame can be determined to be a non-key frame, and a target bounding box can be drawn on the non-key frame through a subsequent intelligent interpolation module.
[0089] In some embodiments of this application, calculating the comprehensive evaluation score of the video frame based on the third time interval, the motion intensity of the video frame, and the current comprehensive system load value includes: calculating the time interval score of the video frame based on the sequence number of the video frame and the third time interval; calculating the motion intensity score of the video frame based on the motion intensity of the video frame; calculating the system load score based on the current comprehensive system load value, the upper load threshold, and the lower load threshold; and calculating the comprehensive evaluation score of the video frame based on the system load score, the time interval score of the video frame, and the motion intensity score.
[0090] In this embodiment, it can be determined whether the sequence number of the video frame is an integer multiple of the third time interval. If it is, the time interval score of the video frame is 1, indicating that target detection should be performed by the algorithm processing module. If it is not, the remainder of dividing the sequence number of the video frame by the third time interval is calculated, the ratio of the remainder to the third time interval is calculated, and the value obtained by subtracting the ratio from 1 is determined as the time interval score of the video frame. The time interval score therefore decreases as the distance from the most recent frame whose sequence number is an integer multiple of the third time interval increases.
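[0091] The formula for calculating the time interval score of the video frame is as follows:
$$S_{\text{time}} = \begin{cases} 1, & n \bmod N_{\text{key}} = 0 \\ 1 - \dfrac{n \bmod N_{\text{key}}}{N_{\text{key}}}, & \text{otherwise} \end{cases}$$
[0092] where $S_{\text{time}}$ denotes the time interval score, $n$ denotes the sequence number of the video frame, $N_{\text{key}}$ denotes the third time interval, and $\bmod$ denotes the modulo operation.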
[0093] In this embodiment, the discrete positional relationship between video frames and the preset processing rhythm (i.e., the third time interval) can be mapped to a continuous [0, 1] interval score using the aforementioned time interval scoring formula. This avoids abrupt changes or periodic jitter in keyframe selection over time, ensuring that the reference frames relied upon by the subsequent linear interpolation algorithm are evenly distributed over time. This makes the final generated full-frame-rate video stream containing the target bounding box appear more stable and smooth, without jumps or flickering, greatly improving user experience and application reliability.
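The time interval scoring rule described above can be sketched in a few lines of Python. The function name `time_interval_score` is hypothetical; the rule itself follows the textual description (score 1 on integer multiples, otherwise 1 minus remainder over interval).

```python
def time_interval_score(frame_no: int, interval: int) -> float:
    """Score in (0, 1]: 1 on multiples of `interval`, then falling linearly."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    remainder = frame_no % interval
    # A remainder of 0 yields 1.0, so both branches of the rule collapse
    # into a single expression.
    return 1.0 - remainder / interval


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, time_interval_score(n, 3))
```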
[0094] The formula for calculating the motion intensity score of the video frame is as follows:
$$S_{\text{motion}} = \begin{cases} 1, & M > \theta_M \\ \dfrac{M}{\theta_M}, & M \le \theta_M \end{cases}$$
[0095] where $S_{\text{motion}}$ denotes the motion intensity score, with a value range of [0, 1]; $\theta_M$ denotes the motion intensity threshold, which is typically set to 0.1 (normalized pixel difference value); and $M$ denotes the motion intensity of the video frame.
[0096] When the motion intensity is greater than the motion intensity threshold, the motion intensity score is 1, indicating that the captured scene is changing rapidly and needs to be detected by the algorithm processing module. When the motion intensity is less than or equal to the motion intensity threshold, the motion intensity score is reduced proportionally. For almost static video frames, the motion intensity score is therefore close to 0, which significantly lowers the comprehensive evaluation score of the video frame, makes it unlikely to be selected as a key frame, and thereby reduces the consumption of computing resources.
[0097] In this embodiment, the motion intensity score formula described above can automatically identify video frames in which the image acquisition scene changes significantly (such as target appearance, rapid movement, or scene switching) and prioritize the allocation of computing resources to these video frames. This concentrates computing resources on high-value frames and improves resource utilization efficiency.
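A minimal Python sketch of the motion intensity score follows. The function name `motion_score` is hypothetical; the default threshold of 0.1 is the typical value mentioned in the text, and the input is assumed to be a normalized motion intensity.

```python
def motion_score(intensity: float, threshold: float = 0.1) -> float:
    """Return 1.0 above the threshold, otherwise intensity / threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if intensity > threshold:
        return 1.0
    return intensity / threshold


if __name__ == "__main__":
    for m in (0.0, 0.05, 0.1, 0.3):
        print(m, motion_score(m))
```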
[0098] In this embodiment, if the current system overall load value is greater than the upper load threshold, then 0.0 can be determined as the system load score; if the current system overall load value is less than the lower load threshold, then 1.0 can be determined as the system load score; if the current system overall load value is greater than or equal to the lower load threshold and less than or equal to the upper load threshold, then the first difference between the upper load threshold and the current system overall load value and the second difference between the upper load threshold and the lower load threshold can be calculated, and the ratio of the first difference to the second difference can be determined as the system load score.
[0099] The formula for calculating the system load score is as follows:
$$S_{\text{load}} = \begin{cases} 0, & L > L_{\text{high}} \\ 1, & L < L_{\text{low}} \\ \dfrac{L_{\text{high}} - L}{L_{\text{high}} - L_{\text{low}}}, & L_{\text{low}} \le L \le L_{\text{high}} \end{cases}$$
[0100] where $S_{\text{load}}$ denotes the system load score, with a value range of [0, 1]; $L$ denotes the current comprehensive system load value; $L_{\text{high}}$ denotes the upper load threshold; and $L_{\text{low}}$ denotes the lower load threshold. Optionally, the upper and lower load thresholds can be set based on actual needs or empirical values. For example, the upper load threshold could be 0.8, and the lower load threshold could be 0.6.
[0101] In this embodiment, when the current comprehensive system load value is too high ($L > L_{\text{high}}$), the system load score is 0, indicating that the algorithm processing frequency should not be increased and the number of key frames should be kept to a minimum. When the current comprehensive system load value is low ($L < L_{\text{low}}$), the system load score is 1, indicating that the algorithm processing frequency and the number of key frames can be increased. When the current comprehensive system load value is moderate, the system load score varies linearly within [0, 1], enabling dynamic and fine-grained scheduling of computing resources.
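The piecewise system load score can be sketched as below. The function name `load_score` is hypothetical; the default thresholds of 0.6 and 0.8 are the example values given in the text.

```python
def load_score(load: float, lower: float = 0.6, upper: float = 0.8) -> float:
    """Map a composite load in [0, 1] to a score in [0, 1]."""
    if not lower < upper:
        raise ValueError("lower threshold must be below upper threshold")
    if load > upper:
        return 0.0  # overloaded: discourage additional key frames
    if load < lower:
        return 1.0  # lightly loaded: key frames are cheap
    # Linear ramp from 1.0 at `lower` down to 0.0 at `upper`.
    return (upper - load) / (upper - lower)


if __name__ == "__main__":
    for value in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(value, load_score(value))
```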
[0102] In some embodiments, the system load score, video frame time interval score, and motion intensity score can be weighted and summed based on their respective weighting coefficients to obtain a comprehensive evaluation score for the video frame. The sum of the weighting coefficients for these three scores is 1.
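[0103] The formula for calculating the comprehensive evaluation score of the video frame is as follows:
$$S = \lambda_1 S_{\text{time}} + \lambda_2 S_{\text{motion}} + \lambda_3 S_{\text{load}}, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1$$
[0104] where $S$ denotes the comprehensive evaluation score; $S_{\text{time}}$ denotes the time interval score; $S_{\text{motion}}$ denotes the motion intensity score; $S_{\text{load}}$ denotes the system load score; and $\lambda_1$, $\lambda_2$, $\lambda_3$ denote the weight coefficients for the time interval score, the motion intensity score, and the system load score, respectively.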
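The weighted combination and the key frame decision of Step 206 can be illustrated as follows. This is a sketch under stated assumptions: the function names and the weights (0.4, 0.3, 0.3) are hypothetical, since the text only requires that the three weights sum to 1; the threshold of 0.7 is the example value given for the first score threshold.

```python
from typing import Tuple


def comprehensive_score(
    s_time: float,
    s_motion: float,
    s_load: float,
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3),  # hypothetical weights
) -> float:
    """Weighted sum of the three partial scores."""
    w_time, w_motion, w_load = weights
    if abs(w_time + w_motion + w_load - 1.0) > 1e-9:
        raise ValueError("weight coefficients must sum to 1")
    return w_time * s_time + w_motion * s_motion + w_load * s_load


def is_key_frame(score: float, threshold: float = 0.7) -> bool:
    """A frame is a key frame only if its score strictly exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    score = comprehensive_score(1.0, 0.5, 0.0)
    print(score, is_key_frame(score))
```

Because the comparison is strict, a frame whose score equals the threshold is treated as a non-key frame, consistent with the "less than or equal to" case in the text.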
[0105] In some embodiments of this application, the current algorithm processing frame rate can also be dynamically adjusted based on system load. The adjustment includes: calculating a frame rate adjustment coefficient based on the current comprehensive system load value, the upper load threshold, and the lower load threshold; adjusting the current algorithm processing frame rate based on the frame rate adjustment coefficient to obtain an adjusted algorithm processing frame rate; if the adjusted algorithm processing frame rate is greater than the minimum algorithm processing frame rate and less than the maximum algorithm processing frame rate, updating the current algorithm processing frame rate to the adjusted algorithm processing frame rate; if the adjusted algorithm processing frame rate is less than or equal to the minimum algorithm processing frame rate, updating the current algorithm processing frame rate to the minimum algorithm processing frame rate; and if the adjusted algorithm processing frame rate is greater than or equal to the maximum algorithm processing frame rate, updating the current algorithm processing frame rate to the maximum algorithm processing frame rate.
[0106] In this embodiment, if the current system overall load value is greater than the upper load threshold, then the third difference between the current system overall load value and the upper load threshold and the fourth difference between 1 and the upper load threshold are calculated; the ratio of the third difference to the fourth difference is calculated, and the ratio is multiplied by a preset adjustment coefficient to obtain a first value; the value obtained by subtracting the first value from 1 is determined as the frame rate adjustment coefficient; if the current system overall load value is less than the lower load threshold, then the fifth difference between the lower load threshold and the current system overall load value is calculated; the ratio of the fifth difference to the lower load threshold is calculated, and the ratio is multiplied by a preset adjustment coefficient to obtain a second value; the value obtained by adding the second value to 1 is determined as the frame rate adjustment coefficient; if the current system overall load value is greater than or equal to the lower load threshold and less than or equal to the upper load threshold, then 1 is determined as the frame rate adjustment coefficient.
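[0107] The formula for calculating the frame rate adjustment coefficient is as follows:
$$k = \begin{cases} 1 - \eta \cdot \dfrac{L - L_{\text{high}}}{1 - L_{\text{high}}}, & L > L_{\text{high}} \\ 1 + \eta \cdot \dfrac{L_{\text{low}} - L}{L_{\text{low}}}, & L < L_{\text{low}} \\ 1, & L_{\text{low}} \le L \le L_{\text{high}} \end{cases}$$
[0108] where $k$ denotes the frame rate adjustment coefficient; $L$ denotes the current comprehensive system load value; $L_{\text{high}}$ and $L_{\text{low}}$ denote the upper and lower load thresholds; and $\eta$ denotes the preset adjustment coefficient, which controls the adjustment range. Optionally, the preset adjustment coefficient can be set according to actual needs or empirical values.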
[0109] In this embodiment, the frame rate adjustment coefficient can be multiplied by the current algorithm processing frame rate to obtain the adjusted algorithm processing frame rate.
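[0110] The formula for calculating the adjusted algorithm processing frame rate is as follows:
$$F_{\text{adj}} = k \cdot F_{\text{alg}}$$
[0111] where $F_{\text{adj}}$ denotes the adjusted algorithm processing frame rate, $k$ denotes the frame rate adjustment coefficient, and $F_{\text{alg}}$ denotes the current algorithm processing frame rate.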
[0112] In this embodiment, the current algorithm processing frame rate can be updated under the following frame rate constraint:
$$F_{\text{final}} = \max\left(F_{\min}, \min\left(F_{\max}, F_{\text{adj}}\right)\right)$$
[0113] where $F_{\text{final}}$ denotes the algorithm processing frame rate that is finally applied, i.e., the updated current algorithm processing frame rate; $F_{\text{adj}}$ denotes the adjusted algorithm processing frame rate; $F_{\min}$ denotes the minimum algorithm processing frame rate; and $F_{\max}$ denotes the maximum algorithm processing frame rate. Optionally, the minimum and maximum algorithm processing frame rates can be set based on actual needs or empirical values. For example, the minimum could be 12 fps and the maximum 20 fps.
[0114] In this embodiment, the frame rate constraint described above prevents the current algorithm processing frame rate from becoming too low or too high, thereby improving the accuracy of the subsequent linear interpolation algorithm.
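[0115] After the current algorithm processing frame rate is updated, the third time interval can be recalculated from the updated frame rate, which in turn updates the selection of key frames.
[0116] The formula for calculating the updated third time interval is as follows:
$$N'_{\text{key}} = \left\lfloor \frac{F_{\text{src}}}{F_{\text{final}}} \right\rfloor$$
[0117] where $N'_{\text{key}}$ denotes the updated third time interval, $F_{\text{src}}$ denotes the frame rate of the original video stream, and $F_{\text{final}}$ denotes the updated current algorithm processing frame rate.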
[0118] In this embodiment, the above-mentioned dynamic adjustment strategy can adaptively adjust the algorithm processing frame rate and key frame interval (i.e., the third time interval) according to the real-time load, ensuring stable system operation.
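The dynamic adjustment strategy described above (adjustment coefficient, clamping to the permitted range, and recomputing the third time interval) can be sketched as follows. The function names and the default preset adjustment coefficient `gain=0.5` are hypothetical; the thresholds and the 12 fps / 20 fps limits are the example values from the text. The `max(1, ...)` guard in `key_frame_interval` is an addition for this sketch, to avoid an interval of 0 when the algorithm frame rate exceeds the source frame rate.

```python
import math


def adjustment_coefficient(
    load: float, lower: float = 0.6, upper: float = 0.8, gain: float = 0.5
) -> float:
    """Frame rate adjustment coefficient: below 1 when overloaded, above 1 when idle."""
    if load > upper:
        return 1.0 - gain * (load - upper) / (1.0 - upper)
    if load < lower:
        return 1.0 + gain * (lower - load) / lower
    return 1.0


def update_frame_rate(
    current_fps: float, coefficient: float, min_fps: float = 12.0, max_fps: float = 20.0
) -> float:
    """Scale the current rate, then clamp it to [min_fps, max_fps]."""
    adjusted = coefficient * current_fps
    return max(min_fps, min(max_fps, adjusted))


def key_frame_interval(source_fps: float, algorithm_fps: float) -> int:
    """Floor of source rate over algorithm rate, kept at 1 or more (added guard)."""
    return max(1, math.floor(source_fps / algorithm_fps))


if __name__ == "__main__":
    fps = update_frame_rate(16.0, adjustment_coefficient(0.9))
    print(fps, key_frame_interval(30.0, fps))
```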
[0119] In some embodiments of this application, for a keyframe pair among the multiple keyframes that contains the same target, the attribute information of the target bounding box of that target (i.e., the second target bounding box) in the non-keyframes located between the keyframe pair can be calculated using a linear interpolation algorithm. Specifically, before determining the attribute information of the second target bounding box of the same target in each non-keyframe based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, the method further includes: for any non-keyframe, calculating the interpolation ratio corresponding to the non-keyframe based on the timestamp of the non-keyframe and the timestamps of the keyframe pair. Determining the attribute information of the second target bounding box of the same target in each non-keyframe based on the attribute information of the first target bounding box of the same target detected in the keyframe pair includes: for any non-keyframe, interpolating the attribute information of the first target bounding box of the same target detected in the keyframe pair based on the interpolation ratio corresponding to the non-keyframe, to obtain interpolated attribute information; and determining the attribute information of the second target bounding box of the same target in the non-keyframe based on the interpolated attribute information.
[0120] The timestamp of a keyframe pair includes the timestamps of the two keyframes in the pair.
[0121] In this embodiment, the first ratio of the first time interval to the second time interval corresponding to the non-key frame can be determined as the interpolation ratio corresponding to the non-key frame. The first time interval refers to the time interval between the timestamp of the non-key frame and the timestamp of the previous key frame, and the second time interval refers to the time interval between the timestamp of the subsequent key frame and the timestamp of the previous key frame.
[0122] To facilitate differentiation, the two keyframes in a keyframe pair can be referred to as the preceding keyframe and the following keyframe, respectively, with the timestamp of the preceding keyframe being earlier than that of the following keyframe. That is, the preceding keyframe is the keyframe with the earliest timestamp in the keyframe pair, and the following keyframe is the keyframe with the latest timestamp in the keyframe pair.
[0123] The formula for calculating the interpolation ratio corresponding to a non-keyframe is as follows:
$$r = \frac{t - t_{\text{prev}}}{t_{\text{next}} - t_{\text{prev}}}$$
[0124] where $r$ denotes the interpolation ratio corresponding to the non-keyframe; $t$ denotes the timestamp of the non-keyframe; $t_{\text{prev}}$ denotes the timestamp of the previous keyframe; and $t_{\text{next}}$ denotes the timestamp of the next keyframe.
[0125] In some embodiments, the interpolated attribute information includes interpolated position information and interpolated size information.
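[0126] The formula for calculating the interpolated position information is as follows:
[0127] $$x_{\text{interp}} = x_{\text{prev}} + r \left(x_{\text{next}} - x_{\text{prev}}\right), \qquad y_{\text{interp}} = y_{\text{prev}} + r \left(y_{\text{next}} - y_{\text{prev}}\right)$$
[0128] where $(x_{\text{interp}}, y_{\text{interp}})$ denotes the interpolated position information (e.g., the coordinates of the target center point after interpolation); $r$ denotes the interpolation ratio; $(x_{\text{prev}}, y_{\text{prev}})$ denotes the position information of the first target bounding box of the same target detected in the previous keyframe (e.g., the coordinates of the target's center point in the previous keyframe); and $(x_{\text{next}}, y_{\text{next}})$ denotes the position information of the first target bounding box of the same target detected in the next keyframe (e.g., the coordinates of the target's center point in the next keyframe).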
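[0129] The formula for calculating the interpolated size information is as follows:
[0130] $$w_{\text{interp}} = w_{\text{prev}} + r \left(w_{\text{next}} - w_{\text{prev}}\right), \qquad h_{\text{interp}} = h_{\text{prev}} + r \left(h_{\text{next}} - h_{\text{prev}}\right)$$
[0131] where $w_{\text{interp}}$ and $h_{\text{interp}}$ denote the interpolated width and height, respectively; $r$ denotes the interpolation ratio; $w_{\text{prev}}$ and $h_{\text{prev}}$ denote the width and height of the first target bounding box of the same target detected in the previous keyframe; and $w_{\text{next}}$ and $h_{\text{next}}$ denote the width and height of the first target bounding box of the same target detected in the next keyframe.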
[0132] In this embodiment, the attribute information of the first target box of the same target detected in the key frame pair is interpolated by the interpolation ratio to generate the target box. This ensures that the generated target box achieves a smooth pixel-level transition between consecutive frames. Moreover, the use of deterministic interpolation calculation with extremely low overhead replaces the extremely high overhead target detection algorithm for processing non-key frames, reducing the consumption of computing resources and making it possible to process high frame rate video in real time on resource-constrained embedded devices.
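The linear interpolation of a target bounding box between a keyframe pair can be sketched as follows. The `Box` dataclass (center coordinates plus width and height) and the function names are hypothetical choices for this example; the interpolation ratio follows the definition given in the text.

```python
from dataclasses import dataclass


@dataclass
class Box:
    cx: float  # center x, in pixels
    cy: float  # center y, in pixels
    w: float   # width, in pixels
    h: float   # height, in pixels


def interpolation_ratio(t: float, t_prev: float, t_next: float) -> float:
    """Fraction of the keyframe-pair interval elapsed at timestamp t."""
    if t_next <= t_prev:
        raise ValueError("next keyframe must be later than previous keyframe")
    return (t - t_prev) / (t_next - t_prev)


def lerp(a: float, b: float, r: float) -> float:
    """Linear interpolation between a and b."""
    return a + r * (b - a)


def interpolate_box(prev: Box, nxt: Box, r: float) -> Box:
    """Interpolate position and size of a box for a non-keyframe."""
    return Box(
        cx=lerp(prev.cx, nxt.cx, r),
        cy=lerp(prev.cy, nxt.cy, r),
        w=lerp(prev.w, nxt.w, r),
        h=lerp(prev.h, nxt.h, r),
    )


if __name__ == "__main__":
    r = interpolation_ratio(0.5, 0.0, 1.0)
    print(interpolate_box(Box(0, 0, 10, 20), Box(10, 20, 20, 40), r))
```

Each non-keyframe costs only a handful of multiplications and additions per target, which is what makes this far cheaper than running target detection on the frame.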
[0133] To improve interpolation accuracy, a motion prediction model is introduced in some embodiments of this application. Specifically, the attribute information includes position information, and before determining the attribute information of the second target bounding box of the same target in a non-keyframe based on the interpolated attribute information, the method further includes: calculating the average motion velocity of the same target between the keyframe pair based on the position information of the first target bounding box of the same target detected in the keyframe pair and the timestamps of the keyframe pair; calculating the acceleration of the same target based on the average motion velocity between the keyframe pair and the average motion velocity of the same target between the previous keyframe of the keyframe pair and the historical keyframe preceding it; and calculating the predicted position information of the same target in the non-keyframe based on the position information of the first target bounding box of the same target detected in the previous keyframe, the average motion velocity of the same target between the keyframe pair, the acceleration of the same target, and the first time interval, where the first time interval is the time interval between the timestamp of the non-keyframe and the timestamp of the previous keyframe. Determining the attribute information of the second target bounding box of the same target in the non-keyframe based on the interpolated attribute information includes: fusing the interpolated position information and the predicted position information to obtain fused position information; and determining the fused position information as the position information of the second target bounding box of the same target in the non-keyframe.
[0134] Here, a historical keyframe refers to a keyframe whose timestamp is earlier than the timestamp of the previous keyframe.
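[0135] The formula for calculating the average motion velocity of the same target between the keyframe pair (including the velocity components in the $x$ and $y$ directions) is as follows:
[0136] $$v_x = \frac{x_{\text{next}} - x_{\text{prev}}}{t_{\text{next}} - t_{\text{prev}}}, \qquad v_y = \frac{y_{\text{next}} - y_{\text{prev}}}{t_{\text{next}} - t_{\text{prev}}}$$
[0137] where $v_x$ and $v_y$ denote the average motion velocity of the same target between the keyframe pair in the $x$ and $y$ directions, measured in pixels per second; $(x_{\text{prev}}, y_{\text{prev}})$ and $(x_{\text{next}}, y_{\text{next}})$ denote the position of the target in the previous and next keyframes; and $t_{\text{prev}}$ and $t_{\text{next}}$ denote the timestamps of the previous and next keyframes.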
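[0138] The formula for calculating the acceleration of the same target is as follows:
[0139] $$a_x = \frac{v_x - v_x^{\text{hist}}}{t_{\text{next}} - t_{\text{prev}}}, \qquad a_y = \frac{v_y - v_y^{\text{hist}}}{t_{\text{next}} - t_{\text{prev}}}$$
[0140] where $a_x$ and $a_y$ denote the acceleration of the same target in the $x$ and $y$ directions, measured in pixels per second²; $v_x$ and $v_y$ denote the average motion velocity of the same target between the keyframe pair; $v_x^{\text{hist}}$ and $v_y^{\text{hist}}$ denote the average motion velocity of the same target between the previous keyframe of the keyframe pair and the historical keyframe preceding it; and $t_{\text{prev}}$ and $t_{\text{next}}$ denote the timestamps of the previous and next keyframes.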
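[0141] The formula for calculating the predicted position information is as follows:
[0142] $$x_{\text{pred}} = x_{\text{prev}} + v_x \,\Delta t + \tfrac{1}{2} a_x \,\Delta t^2, \qquad y_{\text{pred}} = y_{\text{prev}} + v_y \,\Delta t + \tfrac{1}{2} a_y \,\Delta t^2$$
[0143] where $(x_{\text{pred}}, y_{\text{pred}})$ denotes the predicted position information; $(x_{\text{prev}}, y_{\text{prev}})$ denotes the position of the target in the previous keyframe; $v_x$, $v_y$ and $a_x$, $a_y$ denote the average motion velocity and acceleration of the target; and $\Delta t$ denotes the first time interval.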
[0144] In this embodiment, the interpolated location information and the predicted location information can be weighted and summed based on the fusion weight coefficient to obtain the fused location information.
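[0145] The formula for calculating the fused position information is as follows:
[0146] $$x_{\text{fused}} = \omega \, x_{\text{interp}} + (1 - \omega) \, x_{\text{pred}}, \qquad y_{\text{fused}} = \omega \, y_{\text{interp}} + (1 - \omega) \, y_{\text{pred}}$$
[0147] where $(x_{\text{fused}}, y_{\text{fused}})$ denotes the fused position information; $(x_{\text{interp}}, y_{\text{interp}})$ denotes the interpolated position information; $(x_{\text{pred}}, y_{\text{pred}})$ denotes the predicted position information; and $\omega$ denotes the fusion weight coefficient.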
[0148] Optionally, the fusion weight coefficient can be set according to actual needs or empirical values. For example, a fusion weight coefficient of 0.7 indicates greater trust in the interpolated position information. When the fusion weight coefficient is 1, only the linear interpolation algorithm is used; when the fusion weight coefficient is 0, only the motion prediction model is used. This fusion strategy combines the advantages of the linear interpolation algorithm and the motion prediction model, improving interpolation accuracy.
[0149] In some embodiments, the fusion weight coefficient can be dynamically adjusted according to the motion characteristics of the same target. For example, when the same target moves at a constant speed, the fusion weight coefficient is 0.7, which is biased towards the linear interpolation algorithm; when the same target accelerates, the fusion weight coefficient is 0.5, which balances the two methods; when the same target is detected to suddenly change its motion state, the fusion weight coefficient is 0.3, which is biased towards the motion prediction model.
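The motion prediction and fusion steps can be sketched as follows. This is an illustrative example only: the function names are hypothetical, the prediction uses the standard constant-acceleration form (position plus velocity times time plus half acceleration times time squared), and dividing the velocity change by the keyframe-pair interval to obtain the acceleration is an assumption of this sketch.

```python
from typing import Tuple

Point = Tuple[float, float]


def average_velocity(p_prev: Point, p_next: Point, t_prev: float, t_next: float) -> Point:
    """Average velocity (pixels per second) between the two keyframes."""
    dt = t_next - t_prev
    if dt <= 0:
        raise ValueError("next keyframe must be later than previous keyframe")
    return ((p_next[0] - p_prev[0]) / dt, (p_next[1] - p_prev[1]) / dt)


def acceleration(v_now: Point, v_hist: Point, dt_pair: float) -> Point:
    """Velocity change per second; the dt_pair denominator is an assumption."""
    return ((v_now[0] - v_hist[0]) / dt_pair, (v_now[1] - v_hist[1]) / dt_pair)


def predict_position(p_prev: Point, v: Point, a: Point, dt: float) -> Point:
    """Constant-acceleration prediction, dt seconds after the previous keyframe."""
    return (
        p_prev[0] + v[0] * dt + 0.5 * a[0] * dt * dt,
        p_prev[1] + v[1] * dt + 0.5 * a[1] * dt * dt,
    )


def fuse(p_interp: Point, p_pred: Point, weight: float = 0.7) -> Point:
    """weight = 1 keeps only interpolation; weight = 0 keeps only prediction."""
    return (
        weight * p_interp[0] + (1.0 - weight) * p_pred[0],
        weight * p_interp[1] + (1.0 - weight) * p_pred[1],
    )


if __name__ == "__main__":
    v = average_velocity((0.0, 0.0), (10.0, 20.0), 0.0, 1.0)
    a = acceleration(v, (8.0, 20.0), 1.0)
    predicted = predict_position((0.0, 0.0), v, a, 0.5)
    print(fuse((5.0, 10.0), predicted))
```

A dynamically chosen weight (for example 0.7, 0.5, or 0.3 depending on the motion state, as in the text) can be passed through the `weight` parameter.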
[0150] In some embodiments of this application, a comprehensive matching strategy based on the intersection-over-union (IoU) ratio and feature similarity can be used to determine whether targets in two adjacent keyframes are the same target. Specifically, after the multiple keyframes are input into the algorithm processing module for target detection, the method further includes: calculating the feature similarity between two targets in a keyframe pair and the IoU between the first target bounding boxes that enclose the two targets in the keyframe pair; calculating a comprehensive matching score of the two targets based on the IoU and the feature similarity; and, if the comprehensive matching score is greater than a second score threshold, determining that the two targets are the same target.
[0151] The intersection-over-union ratio (IoU) measures the degree of overlap between the two first target boxes in a keyframe pair, with a value ranging from [0, 1]. The feature similarity measures the appearance similarity between the two targets in a keyframe pair, with a value ranging from [0, 1].
[0152] The formula for calculating the IoU is as follows:
$$\text{IoU} = \frac{\left| B_A \cap B_B \right|}{\left| B_A \cup B_B \right|} = \frac{S_{\cap}}{S_A + S_B - S_{\cap}}$$
[0153] where $\text{IoU}$ denotes the intersection-over-union ratio; $B_A$ denotes the first target bounding box in the previous keyframe (e.g., the bounding box enclosing target A in the previous keyframe); $B_B$ denotes the first target bounding box in the next keyframe (e.g., the bounding box enclosing target B in the next keyframe); $B_A \cap B_B$ denotes the intersection region of the two bounding boxes; $B_A \cup B_B$ denotes the union region of the two bounding boxes; $S_{\cap}$ denotes the area of the intersection region; $S_A$ denotes the area of the first target bounding box in the previous keyframe; and $S_B$ denotes the area of the first target bounding box in the next keyframe.
[0154] The formula for calculating the area of the intersection region between the first target bounding box in the previous keyframe and the first target bounding box in the next keyframe is as follows:
$$S_{\cap} = \max\bigl(0, \min(x_A + w_A, x_B + w_B) - \max(x_A, x_B)\bigr) \cdot \max\bigl(0, \min(y_A + h_A, y_B + h_B) - \max(y_A, y_B)\bigr)$$
[0155] where $(x_A, y_A)$ denotes the coordinates of the top-left corner of the first target bounding box in the previous keyframe; $w_A$ and $h_A$ denote the width and height of that bounding box; $(x_B, y_B)$ denotes the coordinates of the top-left corner of the first target bounding box in the next keyframe; and $w_B$ and $h_B$ denote the width and height of that bounding box.
[0156] The feature similarity can be obtained by calculating the cosine similarity of the feature vectors, using the following formula:
$$\text{Sim} = \frac{\sum_{i=1}^{D} f_{A,i} \, f_{B,i}}{\lVert \mathbf{f}_A \rVert \, \lVert \mathbf{f}_B \rVert}$$
[0157] where $\text{Sim}$ denotes the feature similarity; $\mathbf{f}_A$ denotes the feature vector of target A, which is enclosed by the first target bounding box in the previous keyframe; $f_{A,i}$ denotes the $i$-th element of $\mathbf{f}_A$; $\lVert \mathbf{f}_A \rVert$ denotes the magnitude of $\mathbf{f}_A$; $\mathbf{f}_B$ denotes the feature vector of target B, which is enclosed by the first target bounding box in the next keyframe; $f_{B,i}$ denotes the $i$-th element of $\mathbf{f}_B$; $\lVert \mathbf{f}_B \rVert$ denotes the magnitude of $\mathbf{f}_B$; and $D$ denotes the dimension of the feature vectors. Optionally, the feature vectors can be extracted using a deep learning model.
[0158] In this embodiment, the IoU and the feature similarity can be weighted and summed based on their respective weight coefficients to obtain the comprehensive matching score of the two targets. Optionally, the weight coefficients of the IoU and the feature similarity can be set according to actual needs or empirical values, and their sum is 1.
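[0159] The formula for calculating the comprehensive matching score is as follows:
$$S_{\text{match}} = w_{\text{iou}} \cdot \text{IoU} + w_{\text{sim}} \cdot \text{Sim}, \quad w_{\text{iou}} + w_{\text{sim}} = 1$$
[0160] where $S_{\text{match}}$ denotes the comprehensive matching score, with a value range of [0, 1]; $\text{IoU}$ denotes the intersection-over-union ratio; $\text{Sim}$ denotes the feature similarity; and $w_{\text{iou}}$ and $w_{\text{sim}}$ denote the weight coefficients for the IoU and the feature similarity, respectively.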
[0161] Optionally, a second score threshold can be set based on actual needs or empirical values. For example, the second score threshold could be 0.5.
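The IoU, the cosine feature similarity, and their weighted combination can be sketched as follows. The function names and the equal weights of 0.5 are hypothetical; boxes are assumed to be `(x, y, w, h)` tuples with `(x, y)` the top-left corner, matching the box description in the text.

```python
import math
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner


def iou(a: Rect, b: Rect) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product divided by the product of the vector magnitudes."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


def matching_score(iou_value: float, similarity: float,
                   w_iou: float = 0.5, w_sim: float = 0.5) -> float:
    """Weighted sum of IoU and feature similarity (weights are hypothetical)."""
    return w_iou * iou_value + w_sim * similarity


if __name__ == "__main__":
    overlap = iou((0, 0, 2, 2), (1, 1, 2, 2))
    similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])
    print(matching_score(overlap, similarity))
```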
[0162] In some embodiments, if the overall matching score is less than or equal to a second score threshold, the two targets can be determined to be different targets. Any of the different targets may be a new target (e.g., a target not detected in the previous keyframe but detected in the next keyframe), a vanished target (e.g., a target detected in the previous keyframe but not detected in the next keyframe), or the same target as other targets. Based on this, for each target in the previous keyframe, a comprehensive matching strategy based on intersection-union ratio and feature similarity can be used to traverse all targets in the next keyframe to detect whether there is a target in the next keyframe that matches a target in the previous keyframe. If so, the two matching targets are the same target; if there is no matching target in the next keyframe, the target in the previous keyframe is determined to be a vanished target; if there is a target in the next keyframe that does not match any of the targets in the previous keyframe, that target is determined to be a new target.
[0163] The above comprehensive matching strategy includes the following steps: Step 1: Initialize the matching result list, which includes a list of successfully matched targets, a list of unmatched targets in the previous frame, and a list of unmatched targets in the next frame; Step 2: For each target in the previous keyframe, calculate its overall matching score with all targets in the next keyframe. Step 3: Select the target pair with the highest matching score; Step 4: Check if the highest matching score is greater than the second score threshold; Step 5: If the score is greater than the second score threshold, add the target pair to the successful matching list. Step 6: Mark the matched targets to avoid duplicate matching; Step 7: Repeat steps 2 through 6 above until all possible matches have been processed.
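One way to realize the matching steps above is a greedy assignment over a precomputed score matrix, sketched below. The function name `greedy_match` and the matrix layout (`scores[i][j]` is the comprehensive matching score between target `i` of the previous keyframe and target `j` of the next keyframe) are assumptions of this example. Sorting all pairs once by descending score is used here in place of repeatedly searching for the best remaining pair.

```python
from typing import List, Sequence, Tuple


def greedy_match(
    scores: Sequence[Sequence[float]],
    n_prev: int,
    n_next: int,
    threshold: float = 0.5,
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Return (matched pairs, unmatched previous indices, unmatched next indices)."""
    candidates = sorted(
        ((scores[i][j], i, j) for i in range(n_prev) for j in range(n_next)),
        key=lambda item: item[0],
        reverse=True,
    )
    matched: List[Tuple[int, int]] = []
    used_prev, used_next = set(), set()
    for score, i, j in candidates:
        if score <= threshold:
            break  # all remaining pairs score at or below the threshold
        if i in used_prev or j in used_next:
            continue  # each target can be matched at most once
        matched.append((i, j))
        used_prev.add(i)
        used_next.add(j)
    unmatched_prev = [i for i in range(n_prev) if i not in used_prev]  # vanished targets
    unmatched_next = [j for j in range(n_next) if j not in used_next]  # new targets
    return matched, unmatched_prev, unmatched_next


if __name__ == "__main__":
    table = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.1]]
    print(greedy_match(table, 2, 3))
```

The two unmatched lists correspond to the vanished targets and the new targets discussed in the text.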
[0164] In some embodiments of this application, a new target can be obtained from the list of unmatched targets in the next frame. For the new target (i.e., the first target), the second target bounding box can be drawn as follows: for a first target that is detected in the next keyframe of a keyframe pair but not detected in the previous keyframe, determining the attribute information of the first target bounding box of the first target detected in the next keyframe as the attribute information of the second target bounding box of the first target in each non-keyframe; for any non-keyframe, calculating a first ratio of the first time interval to a preset fade-in time constant, where the first time interval is the time interval between the timestamp of the non-keyframe and the timestamp of the previous keyframe; determining the smaller of the first ratio and the maximum transparency as the transparency of the second target bounding box of the first target in the non-keyframe; and drawing the second target bounding box corresponding to the first target on the non-keyframe based on the attribute information and transparency of the second target bounding box of the first target in the non-keyframe.
[0165] If a target is detected in the next keyframe but cannot be matched in the previous keyframe, it can be determined that the target is a newly appearing target. In this case, the target can be displayed in the non-keyframe between the previous keyframe (i.e. the keyframe before the first keyframe in which the first target is detected) and the next keyframe (i.e. the keyframe in which the first target is detected). This can ensure visual smoothness and improve video fluency.
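[0166] The formula for calculating the transparency of the second target bounding box of the first target in a non-keyframe is as follows:
$$\alpha = \min\left(\frac{t - t_{\text{prev}}}{\tau_{\text{fade}}},\ \alpha_{\max}\right)$$
[0167] where $\alpha$ represents the transparency of the second target bounding box in a non-keyframe, with a value range of [0, 1], and is used to control the display of the second target bounding box; $t - t_{\text{prev}}$ is the first time interval, i.e., the interval between the timestamp $t$ of the non-keyframe and the timestamp $t_{\text{prev}}$ of the previous keyframe; $\alpha_{\max}$ indicates the maximum transparency; and $\tau_{\text{fade}}$ indicates the preset fade-in time constant. Optionally, the preset fade-in time constant can be set according to actual needs or empirical values. For example, the preset fade-in time constant is 0.5 seconds.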
[0168] In this embodiment, for the first target, the position and size information of the second target bounding box in each non-key frame is consistent with the position and size information of the first target bounding box of the first target detected in the next key frame. Moreover, starting from the previous key frame, the transparency value of the second target bounding box gradually increases from one non-key frame to the next (a larger value corresponding to a more visible box), so that the bounding box fades in gradually across the non-key frames.
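The fade-in rule described above (the smaller of the first ratio and the maximum transparency) can be sketched as follows. The function and parameter names are hypothetical; the 0.5-second default follows the example value given in the description, and the default maximum of 1.0 is an assumption.

```python
def fade_in_alpha(
    t_frame: float,
    t_prev_key: float,
    fade_time: float = 0.5,   # preset fade-in time constant, in seconds
    alpha_max: float = 1.0,   # assumed maximum transparency
) -> float:
    """Transparency of a new target's box in a non-key frame."""
    if fade_time <= 0:
        raise ValueError("fade_time must be positive")
    # First time interval: non-key frame timestamp minus previous keyframe timestamp.
    first_interval = t_frame - t_prev_key
    first_ratio = first_interval / fade_time
    return min(first_ratio, alpha_max)
```

With these defaults, a non-key frame 0.1 s after the previous keyframe gets a value of about 0.2, and any frame 0.5 s or later is drawn at the maximum.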
[0169] In some embodiments, a unique ID can be assigned to each new target and added to the target tracking list. Based on this unique ID, the target lifecycle can be tracked across keyframes, and a target history can be maintained to determine whether the target is temporarily occluded or has truly disappeared.
[0170] In some embodiments of this application, the disappeared target can be obtained from the list of unmatched targets in the previous frame. For the disappeared target (i.e., the second target), the second target bounding box can be drawn in the following manner: For a second target that is detected in the previous keyframe but not detected in the next keyframe, the attribute information of the first target bounding box of the second target detected in the previous keyframe is determined as the attribute information of the second target bounding box of the second target in each non-keyframe. For any non-key frame among all non-key frames, calculate the second ratio of the first time interval to the second time interval; the second time interval refers to the time interval between the timestamp of the next key frame and the timestamp of the previous key frame. Subtract the second ratio from the maximum transparency to obtain the target difference; the maximum value between the minimum transparency and the target difference is determined as the transparency of the second target bounding box of the second target in the non-keyframe; based on the attribute information and transparency of the second target bounding box in the non-keyframe, the second target bounding box corresponding to the second target is drawn on the non-keyframe.
[0171] If a target is detected in the previous keyframe but cannot be matched in the next keyframe, it can be determined that the target may have disappeared. In this case, the target can be displayed in the non-keyframe between the previous keyframe (i.e. the keyframe in which the second target was last detected) and the next keyframe (i.e. the keyframe in which the second target could not be matched for the first time) using a fade-out strategy to ensure visual smoothness and improve video fluency.
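[0172] The formula for calculating the transparency of the second target bounding box in a non-keyframe is as follows:
$$\alpha = \max\left(\alpha_{\min},\ \alpha_{\max} - \frac{t - t_{\text{prev}}}{t_{\text{next}} - t_{\text{prev}}}\right)$$
[0173] where $\alpha$ indicates the transparency of the second target bounding box in non-keyframes, with a value range of [0, 1], and is used to control the display of the second target bounding box; $t - t_{\text{prev}}$ is the first time interval; $t_{\text{next}} - t_{\text{prev}}$ is the second time interval, i.e., the interval between the timestamp $t_{\text{next}}$ of the next keyframe and the timestamp $t_{\text{prev}}$ of the previous keyframe; $\alpha_{\max}$ indicates the maximum transparency; and $\alpha_{\min}$ indicates the minimum transparency.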
[0174] In this embodiment, for the second target, the position and size information of the second target bounding box in each non-key frame is consistent with the position and size information of the first target bounding box of the second target detected in the previous key frame, and the transparency value of the second target bounding box gradually decreases from one non-key frame to the next, starting from the previous key frame, so that the bounding box fades out gradually across the non-key frames.
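The fade-out rule can be sketched in the same way: the second ratio is subtracted from the maximum transparency, and the result is floored at the minimum transparency. The names and the default bounds of 1.0 and 0.0 are assumptions made for illustration.

```python
def fade_out_alpha(
    t_frame: float,
    t_prev_key: float,
    t_next_key: float,
    alpha_max: float = 1.0,  # assumed maximum transparency
    alpha_min: float = 0.0,  # assumed minimum transparency
) -> float:
    """Transparency of a disappeared target's box in a non-key frame."""
    # Second time interval: next keyframe timestamp minus previous keyframe timestamp.
    second_interval = t_next_key - t_prev_key
    if second_interval <= 0:
        raise ValueError("t_next_key must be later than t_prev_key")
    # Second ratio: first time interval divided by second time interval.
    second_ratio = (t_frame - t_prev_key) / second_interval
    target_difference = alpha_max - second_ratio
    return max(alpha_min, target_difference)
```

For keyframes at 0 s and 1 s, a non-key frame at 0.25 s gets 0.75, and the value reaches the minimum at the next keyframe.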
[0175] It should be noted that if no match can be found for the second target in P consecutive keyframes, it is determined that the second target has completely disappeared, and the second target bounding box will no longer be displayed in subsequent non-keyframes. P is an integer greater than 1, for example, P is 2.
[0176] In some embodiments, the number of times a second target is not matched can be detected. If it is the first time it is not matched, the second target is marked as suspected to have disappeared. If it is not matched in two consecutive keyframes, it is confirmed that the second target has disappeared and is removed from the target tracking list.
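One possible way to keep the "suspected disappeared" and "confirmed disappeared" states is a small tracking list keyed by unique ID. The class names, field names, and status strings below are hypothetical; `max_missed` plays the role of P and defaults to the example value of 2.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict


@dataclass
class Track:
    track_id: int
    missed: int = 0         # consecutive keyframes without a match
    status: str = "active"  # "active" or "suspected"


class TrackList:
    def __init__(self, max_missed: int = 2) -> None:
        self.max_missed = max_missed  # corresponds to P in the description
        self.tracks: Dict[int, Track] = {}
        self._ids = count(1)

    def add_new(self) -> int:
        """Assign a unique ID to a newly appearing target."""
        track_id = next(self._ids)
        self.tracks[track_id] = Track(track_id)
        return track_id

    def mark_matched(self, track_id: int) -> None:
        track = self.tracks[track_id]
        track.missed = 0
        track.status = "active"

    def mark_unmatched(self, track_id: int) -> bool:
        """Record a miss; return True if the target was removed as disappeared."""
        track = self.tracks[track_id]
        track.missed += 1
        if track.missed >= self.max_missed:
            del self.tracks[track_id]
            return True
        track.status = "suspected"
        return False
```

A target that is matched again after one miss returns to the active state, which is how this sketch distinguishes temporary occlusion from actual disappearance.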
[0177] In some embodiments, when drawing the first and second target boxes, different box colors can be selected based on the target category (e.g., red is used for target boxes surrounding pedestrians, and green is used for target boxes surrounding vehicles). Category labels and confidence levels can be added above the first and second target boxes.
[0178] In some embodiments of this application, after the second target bounding boxes of the same target, the new target, and the disappeared target between at least one keyframe pair have been drawn on each non-keyframe, the electronic device may perform the following steps: all keyframes on which first target bounding boxes have been drawn and all non-keyframes on which second target bounding boxes have been drawn are composited to obtain the target video stream; and the target video stream is output to the video encoder.
[0179] After the target video stream is output to the video encoder, the video encoder can encode and store the target video stream.
[0180] In this embodiment, all keyframes of the first target bounding box and all non-keyframes of the second target bounding box can be synthesized in order of timestamp from earliest to latest. This allows for the generation of a high frame rate video stream (i.e., target video stream) with continuous and accurate target bounding box information under limited computing resources. This enables the output video frame rate to meet the encoder's high frame rate requirements, such as 25-30fps, without changing the algorithm's processing power, ensuring that each frame of the output video contains accurate target bounding box information.
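Compositing in timestamp order amounts to merging two already sorted frame sequences. The sketch below assumes each frame is represented as a `(timestamp, payload)` tuple, which is an illustrative choice rather than a format specified in this application.

```python
import heapq
from typing import Any, Iterable, List, Tuple

Frame = Tuple[float, Any]  # (timestamp in seconds, annotated image or any payload)


def composite(keyframes: Iterable[Frame], non_keyframes: Iterable[Frame]) -> List[Frame]:
    """Merge two timestamp-sorted frame sequences into one ordered stream."""
    # heapq.merge requires each input to be sorted by the same key.
    return list(heapq.merge(keyframes, non_keyframes, key=lambda frame: frame[0]))
```

For example, keyframes at 0.000 s and 0.100 s merged with non-keyframes at 0.033 s and 0.066 s yield the order keyframe, non-keyframe, non-keyframe, keyframe.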
[0181] In this embodiment, since target bounding boxes are drawn for each video frame in the target video stream, outputting the target video stream to the video encoder eliminates the need for target detection and bounding box drawing at the display backend of video processing. This avoids repeated development and maintenance of the same processing at every display backend, reducing manpower and material consumption. The display backends mentioned here include, but are not limited to, web playback terminals, screen playback terminals, cloud platform playback terminals, and mobile playback terminals. Traditional target detection methods require the development of a set of annotation functions for each display backend based on its characteristics. However, this technical solution directly outputs the video file with completed target detection and annotation at the front end, i.e., on devices such as in-vehicle intelligent terminals and intelligent electronic rearview mirrors. Furthermore, it supports the display of the target video stream at a high frame rate (e.g., 30 frames per second) at the image display terminal (i.e., the display backend), solving the problem that existing algorithm recognition information (i.e., target bounding boxes generated by the algorithm processing module at a low frame rate, e.g., 15 frames per second) cannot be matched one-to-one with video frames. This decouples the algorithm processing module from video encoding, effectively solving the problem of real-time video processing in resource-constrained environments and ensuring the smoothness and continuity of video playback.
[0182] In some embodiments, a frame compositor can be used to composite all keyframes of the first target bounding box and all non-keyframes of the second target bounding box.
[0183] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0184] Corresponding to the video stream processing method described in the above embodiments, Figure 3 shows a schematic diagram of the structure of the video stream processing apparatus provided in the embodiments of this application. For ease of explanation, only the parts related to the embodiments of this application are shown.
[0185] Referring to Figure 3, the device includes: a keyframe acquisition module 301, used to acquire multiple keyframes in the original video stream; a target detection module 302, used to input the multiple keyframes into the algorithm processing module for target detection, and draw the first target bounding box of the detected target on the corresponding keyframe; an information determination module 303, used to, for at least one keyframe pair among the plurality of keyframes, if the same target exists in any keyframe pair, determine the attribute information of the second target bounding box of the same target in each non-key frame based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, where the keyframe pair includes two keyframes, each non-key frame refers to a non-key frame in the original video stream located between the two keyframes of the keyframe pair, and the non-key frames between different keyframe pairs are different; and a target drawing module 304, used to, for any one of the non-key frames, draw the second target bounding box corresponding to the same target on the non-key frame based on the attribute information of the second target bounding box of the same target in the non-key frame.
[0186] In some embodiments, the information determination module 303 is specifically used for: For any one of the non-key frames, the interpolation ratio corresponding to the non-key frame is calculated based on the timestamp of the non-key frame and the timestamp of the key frame pair. For any one of the non-key frames, based on the interpolation ratio corresponding to the non-key frame, the attribute information of the first target box of the same target detected in the key frame pair is interpolated to obtain the interpolated attribute information. Based on the interpolated attribute information, the attribute information of the second target bounding box of the same target in the non-keyframe is determined.
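The interpolation performed by the information determination module can be illustrated with a short sketch. It assumes linear interpolation and boxes stored as `(x, y, width, height)` tuples; both choices, and the function names, are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def interpolation_ratio(t_frame: float, t_prev_key: float, t_next_key: float) -> float:
    """Position of a non-key frame between the two keyframes, in [0, 1]."""
    if t_next_key <= t_prev_key:
        raise ValueError("t_next_key must be later than t_prev_key")
    return (t_frame - t_prev_key) / (t_next_key - t_prev_key)


def interpolate_box(prev_box: Box, next_box: Box, ratio: float) -> Box:
    """Linearly interpolate each box attribute between the keyframe pair."""
    x, y, w, h = (p + ratio * (n - p) for p, n in zip(prev_box, next_box))
    return (x, y, w, h)
```

A non-key frame halfway between the keyframes has a ratio of 0.5, so its box lies midway between the two detected boxes in both position and size.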
[0187] In some embodiments, the attribute information includes location information; the information determination module 303 is specifically used for: Based on the position information of the first target box of the same target detected in the keyframe pair and the timestamp of the keyframe pair, the average motion velocity of the same target between the keyframe pairs is calculated. Based on the average motion velocity between the keyframe pairs and the average motion velocity of the same target between the previous keyframe and its preceding historical keyframes in the keyframe pair, the acceleration of the same target is calculated, where the previous keyframe is the keyframe with the earliest timestamp in the keyframe pair. Based on the position information of the first target bounding box of the same target detected in the previous keyframe, the average motion velocity of the same target between the keyframe pairs, the acceleration of the same target, and a first time interval, the predicted position information of the same target in the non-keyframe is calculated; the first time interval refers to the time interval between the timestamp of the non-keyframe and the timestamp of the previous keyframe. The interpolated position information and the predicted position information are fused to obtain the fused position information; The fused location information is determined as the location information of the second target box of the same target in the non-key frame.
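The motion-based prediction and fusion can be sketched as below. Several details are assumptions, because they are not stated in this paragraph: the acceleration is taken as the velocity change divided by the keyframe-pair duration, the prediction uses the constant-acceleration form p + v·Δt + ½·a·Δt², and the fusion is a weighted average with a hypothetical weight `w_interp`.

```python
from typing import Tuple

Vec = Tuple[float, float]  # (x, y)


def average_velocity(prev_pos: Vec, next_pos: Vec, t_prev: float, t_next: float) -> Vec:
    """Average velocity of the target between the two keyframes of a pair."""
    dt = t_next - t_prev
    if dt <= 0:
        raise ValueError("t_next must be later than t_prev")
    return ((next_pos[0] - prev_pos[0]) / dt, (next_pos[1] - prev_pos[1]) / dt)


def acceleration(v_pair: Vec, v_history: Vec, dt: float) -> Vec:
    """Assumed form: change in average velocity divided by the pair duration."""
    return ((v_pair[0] - v_history[0]) / dt, (v_pair[1] - v_history[1]) / dt)


def predict_position(prev_pos: Vec, velocity: Vec, accel: Vec, first_interval: float) -> Vec:
    """Constant-acceleration prediction from the previous keyframe position."""
    dt = first_interval
    return (
        prev_pos[0] + velocity[0] * dt + 0.5 * accel[0] * dt * dt,
        prev_pos[1] + velocity[1] * dt + 0.5 * accel[1] * dt * dt,
    )


def fuse(interp_pos: Vec, pred_pos: Vec, w_interp: float = 0.5) -> Vec:
    """Weighted average of interpolated and predicted positions (assumed rule)."""
    w_pred = 1.0 - w_interp
    return (
        w_interp * interp_pos[0] + w_pred * pred_pos[0],
        w_interp * interp_pos[1] + w_pred * pred_pos[1],
    )
```

The fused position is what this sketch would use as the position of the second target bounding box in the non-key frame.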
[0188] In some embodiments, the above apparatus further includes a first processing module; the first processing module is configured to: For a first target detected in the next keyframe of the keyframe pair but not detected in the previous keyframe, the attribute information of the first target bounding box of the first target detected in the next keyframe is determined as the attribute information of the second target bounding box of the first target in each non-keyframe. The next keyframe is the keyframe with the latest timestamp in the keyframe pair, and the previous keyframe is the keyframe with the earliest timestamp in the keyframe pair. For any one of the non-key frames, calculate a first ratio of a first time interval to a preset fade-in time constant; the first time interval refers to the time interval between the timestamp of the non-key frame and the timestamp of the previous key frame. The minimum of the first ratio and the maximum transparency is determined as the transparency of the second target bounding box of the first target in the non-keyframe; Based on the attribute information and transparency of the second target bounding box of the first target in the non-keyframe, the second target bounding box corresponding to the first target is drawn on the non-keyframe.
[0189] In some embodiments, the above-described apparatus further includes a second processing module; the second processing module is used to: For a second target detected in the previous keyframe but not detected in the next keyframe, the attribute information of the first target bounding box of the second target detected in the previous keyframe is determined as the attribute information of the second target bounding box of the second target in each non-keyframe. For any one of the non-key frames, calculate the second ratio of the first time interval to the second time interval; the second time interval refers to the time interval between the timestamp of the next key frame and the timestamp of the previous key frame; Subtract the second ratio from the maximum transparency to obtain the target difference; The maximum value between the minimum transparency and the target difference is determined as the transparency of the second target bounding box in the non-keyframe; Based on the attribute information and transparency of the second target bounding box in the non-keyframe, the second target bounding box corresponding to the second target is drawn on the non-keyframe.
[0190] In some embodiments of this application, the above-described apparatus further includes a third processing module, the third processing module being used for: The keyframes of all drawn first target boxes and the non-keyframes of all drawn second target boxes are composited to obtain the target video stream; The target video stream is output to the video encoder.
[0191] In some embodiments, the keyframe acquisition module 301 is specifically used for: The third time interval is obtained by dividing the frame rate of the original video stream by the current frame rate of the algorithm processing module and rounding the result down. The motion intensity of each video frame is calculated based on the pixel values of the pixels in each video frame of the original video stream. Obtain current system load usage metrics; Based on the current system load usage indicators, calculate the current system overall load value; For any of the video frames, a comprehensive evaluation score is calculated based on the third time interval, the motion intensity of the video frame, and the current system load value. If the overall evaluation score of the video frame is greater than the first score threshold, then the video frame is identified as the key frame.
[0192] In some embodiments, the keyframe acquisition module 301 is specifically used for: Based on the sequence number of the video frame and the third time interval, calculate the time interval score of the video frame; Calculate the motion intensity score of the video frame based on the motion intensity of the video frame; Calculate the system load score based on the current overall system load value, upper load threshold, and lower load threshold. Based on the system load score, the time interval score of the video frame, and the motion intensity score, a comprehensive evaluation score for the video frame is calculated.
[0193] In some embodiments, the above apparatus further includes a fourth processing module, which is specifically used for: Calculate the frame rate adjustment coefficient based on the current system overall load value, upper load threshold, and lower load threshold. Based on the frame rate adjustment coefficient, the current algorithm processing frame rate is adjusted to obtain the adjusted algorithm processing frame rate; If the adjusted algorithm processing frame rate is greater than the lowest algorithm processing frame rate but less than the highest algorithm processing frame rate, then the current algorithm processing frame rate is updated to the adjusted algorithm processing frame rate. If the adjusted algorithm processing frame rate is less than or equal to the minimum algorithm processing frame rate, then the current algorithm processing frame rate is updated to the minimum algorithm processing frame rate. If the adjusted algorithm processing frame rate is greater than or equal to the highest algorithm processing frame rate, then the current algorithm processing frame rate is updated to the highest algorithm processing frame rate.
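The three update cases above amount to clamping the adjusted frame rate to the range between the lowest and highest algorithm processing frame rates. In the sketch below the adjustment is assumed to be a multiplication by the coefficient, which this paragraph does not specify; the function and parameter names are hypothetical.

```python
def update_algorithm_fps(
    current_fps: float,
    adjust_coeff: float,  # frame rate adjustment coefficient, computed elsewhere
    min_fps: float,       # lowest algorithm processing frame rate
    max_fps: float,       # highest algorithm processing frame rate
) -> float:
    """Return the new algorithm processing frame rate after clamping."""
    if min_fps > max_fps:
        raise ValueError("min_fps must not exceed max_fps")
    adjusted = current_fps * adjust_coeff  # assumed: multiplicative adjustment
    if adjusted <= min_fps:
        return min_fps
    if adjusted >= max_fps:
        return max_fps
    return adjusted
```

For example, with a current rate of 15 fps and limits of 10 and 25 fps, a coefficient of 0.5 yields 10 fps and a coefficient of 2.0 yields 25 fps.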
[0194] In some embodiments, the above apparatus further includes a target matching module, the target matching module being used to: Calculate the feature similarity between the two targets in the keyframe pair and the intersection-over-union (IoU) ratio between the first target bounding boxes that enclose the two targets in the keyframe pair; Based on the intersection-over-union ratio and the feature similarity, the comprehensive matching score of the two targets is calculated; If the comprehensive matching score is greater than the second score threshold, then the two targets are determined to be the same target.
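The ratio between two bounding boxes and a combined score can be sketched as follows. The box layout `(x1, y1, x2, y2)` and the weighted-sum combination with a hypothetical weight `iou_weight` are assumptions; this paragraph does not state how the two quantities are combined.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner layout


def iou(a: Box, b: Box) -> float:
    """Area of overlap divided by area of union for two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matching_score(iou_value: float, feature_similarity: float, iou_weight: float = 0.5) -> float:
    """Assumed combination: weighted sum of the overlap ratio and feature similarity."""
    return iou_weight * iou_value + (1.0 - iou_weight) * feature_similarity
```

Two targets would then be treated as the same target when `matching_score(...)` exceeds the second score threshold.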
[0195] In some embodiments, the keyframe pair includes two keyframes that are adjacent among the plurality of keyframes.
[0196] It should be noted that the information interaction and execution process between the above-mentioned devices / units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.
[0197] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 4, the electronic device 4 of this embodiment includes: at least one processor 40 (only one is shown in Figure 4), a memory 41, and a computer program 42 stored in the memory 41 and executable on the at least one processor 40; when the processor 40 executes the computer program 42, the steps in any of the above method embodiments are implemented.
[0198] The electronic device may include, but is not limited to, a processor 40 and a memory 41. Those skilled in the art will understand that Figure 4 is merely an example of electronic device 4 and does not constitute a limitation on electronic device 4. It may include more or fewer components than shown, or combine certain components, or use different components. For example, it may also include input / output devices, network access devices, etc.
[0199] The processor 40 may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
[0200] In some embodiments, the memory 41 may be an internal storage unit of the electronic device 4, such as a hard disk or memory of the electronic device 4. In other embodiments, the memory 41 may be an external storage device of the electronic device 4, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the electronic device 4. Furthermore, the memory 41 may include both internal and external storage units of the electronic device 4. The memory 41 is used to store the operating system, applications, bootloader, data, and other programs, such as the program code of the computer program. The memory 41 can also be used to temporarily store data that has been output or will be output.
[0201] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0202] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying computer program code to a device / electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks.
[0203] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0204] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0205] In the embodiments provided in this application, it should be understood that the disclosed devices / electronic devices and methods can be implemented in other ways. For example, the device / electronic device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual couplings or direct couplings or communication connections may be through some interfaces; indirect couplings or communication connections between devices or units may be electrical, mechanical, or other forms.
[0206] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0207] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A video stream processing method, characterized in that it comprises: Acquiring multiple keyframes from the original video stream; Inputting the multiple keyframes into the algorithm processing module for target detection, and drawing the first target bounding box of the detected target on the corresponding keyframe; For at least one keyframe pair among the plurality of keyframes, if the same target exists in any keyframe pair, then based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, determining the attribute information of the second target bounding box of the same target in each non-keyframe; the keyframe pair includes two keyframes, each non-keyframe refers to a non-keyframe located between the two keyframes of the keyframe pair in the original video stream, and the non-keyframes between different keyframe pairs are different; For any of the non-key frames, based on the attribute information of the second target bounding box of the same target in the non-key frame, drawing the second target bounding box corresponding to the same target on the non-key frame.
2. The video stream processing method according to claim 1, characterized in that, Before determining the attribute information of the second target bounding box of the same target in each non-keyframe based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, the method further includes: For any one of the non-key frames, the interpolation ratio corresponding to the non-key frame is calculated based on the timestamp of the non-key frame and the timestamp of the key frame pair. The step of determining the attribute information of the second target bounding box of the same target in each non-keyframe based on the attribute information of the first target bounding box of the same target detected in the keyframe pair includes: For any one of the non-key frames, based on the interpolation ratio corresponding to the non-key frame, the attribute information of the first target box of the same target detected in the key frame pair is interpolated to obtain the interpolated attribute information. Based on the interpolated attribute information, the attribute information of the second target bounding box of the same target in the non-keyframe is determined.
3. The video stream processing method according to claim 2, characterized in that, The attribute information includes location information; before determining the attribute information of the second target bounding box of the same target in the non-keyframe based on the interpolated attribute information, the method further includes: Based on the position information of the first target box of the same target detected in the keyframe pair and the timestamp of the keyframe pair, the average motion velocity of the same target between the keyframe pairs is calculated. Based on the average motion velocity between the keyframe pairs and the average motion velocity of the same target between the previous keyframe and its preceding historical keyframes in the keyframe pair, the acceleration of the same target is calculated, where the previous keyframe is the keyframe with the earliest timestamp in the keyframe pair. Based on the position information of the first target bounding box of the same target detected in the previous keyframe, the average motion velocity of the same target between the keyframe pairs, the acceleration of the same target, and a first time interval, the predicted position information of the same target in the non-keyframe is calculated; the first time interval refers to the time interval between the timestamp of the non-keyframe and the timestamp of the previous keyframe. The step of determining the attribute information of the second target bounding box of the same target in the non-keyframe based on the interpolated attribute information includes: The interpolated position information and the predicted position information are fused to obtain the fused position information; The fused location information is determined as the location information of the second target box of the same target in the non-key frame.
4. The video stream processing method according to claim 1, characterized in that, After the multiple keyframes are input into the algorithm processing module for target detection, the method further includes: For a first target detected in the next keyframe of the keyframe pair but not detected in the previous keyframe, the attribute information of the first target bounding box of the first target detected in the next keyframe is determined as the attribute information of the second target bounding box of the first target in each non-keyframe. The next keyframe is the keyframe with the latest timestamp in the keyframe pair, and the previous keyframe is the keyframe with the earliest timestamp in the keyframe pair. For any one of the non-key frames, calculate a first ratio of a first time interval to a preset fade-in time constant; the first time interval refers to the time interval between the timestamp of the non-key frame and the timestamp of the previous key frame. The minimum of the first ratio and the maximum transparency is determined as the transparency of the second target bounding box of the first target in the non-keyframe; Based on the attribute information and transparency of the second target bounding box of the first target in the non-keyframe, the second target bounding box corresponding to the first target is drawn on the non-keyframe.
5. The video stream processing method according to claim 4, characterized in that, After the multiple keyframes are input into the algorithm processing module for target detection, the method further includes: For a second target detected in the previous keyframe but not detected in the next keyframe, the attribute information of the first target bounding box of the second target detected in the previous keyframe is determined as the attribute information of the second target bounding box of the second target in each non-keyframe. For any one of the non-key frames, calculate the second ratio of the first time interval to the second time interval; the second time interval refers to the time interval between the timestamp of the next key frame and the timestamp of the previous key frame; Subtract the second ratio from the maximum transparency to obtain the target difference; The maximum value between the minimum transparency and the target difference is determined as the transparency of the second target bounding box in the non-keyframe; Based on the attribute information and transparency of the second target bounding box in the non-keyframe, the second target bounding box corresponding to the second target is drawn on the non-keyframe.
6. The video stream processing method according to claim 5, characterized in that the video stream processing method further comprises: compositing all keyframes on which first target bounding boxes have been drawn and all non-keyframes on which second target bounding boxes have been drawn to obtain a target video stream; and outputting the target video stream to a video encoder.
7. The video stream processing method according to any one of claims 1 to 6, characterized in that obtaining the plurality of keyframes from the original video stream comprises: dividing the frame rate of the original video stream by the current algorithm processing frame rate of the algorithm processing module and rounding the result down to obtain a third time interval; calculating the motion intensity of each video frame of the original video stream based on the pixel values of the pixels in that video frame; obtaining current system load usage metrics; calculating a current overall system load value based on the current system load usage metrics; for any one of the video frames, calculating a comprehensive evaluation score based on the third time interval, the motion intensity of the video frame, and the current overall system load value; and if the comprehensive evaluation score of the video frame is greater than a first score threshold, identifying the video frame as a keyframe.
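By way of non-limiting illustration, the third time interval and the final threshold test of claim 7 can be sketched as follows. The function and parameter names are assumptions made for this example; the way the comprehensive evaluation score itself is computed is not specified in this claim and is therefore taken as an input.

```python
import math


def third_time_interval(source_fps, algo_fps):
    """Source frame rate divided by the algorithm frame rate, rounded down."""
    if source_fps <= 0 or algo_fps <= 0:
        raise ValueError("frame rates must be positive")
    return math.floor(source_fps / algo_fps)


def is_keyframe(evaluation_score, first_score_threshold):
    """A frame is a keyframe when its score is strictly above the threshold."""
    return evaluation_score > first_score_threshold
```

For example, a 30 fps source with an algorithm that currently processes 8 frames per second gives an interval of 3. If the algorithm frame rate exceeds the source frame rate, the rounded-down quotient is 0; the claim does not state how that case is handled, so the sketch returns it unchanged.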
8. The video stream processing method according to claim 7, characterized in that calculating the comprehensive evaluation score of the video frame based on the third time interval, the motion intensity of the video frame, and the current overall system load value comprises: calculating a time interval score of the video frame based on the sequence number of the video frame and the third time interval; calculating a motion intensity score of the video frame based on the motion intensity of the video frame; calculating a system load score based on the current overall system load value, an upper load threshold, and a lower load threshold; and calculating the comprehensive evaluation score of the video frame based on the system load score, the time interval score of the video frame, and the motion intensity score.
9. The video stream processing method according to claim 7, characterized in that the video stream processing method further comprises: calculating a frame rate adjustment coefficient based on the current overall system load value, an upper load threshold, and a lower load threshold; adjusting the current algorithm processing frame rate based on the frame rate adjustment coefficient to obtain an adjusted algorithm processing frame rate; if the adjusted algorithm processing frame rate is greater than the minimum algorithm processing frame rate and less than the maximum algorithm processing frame rate, updating the current algorithm processing frame rate to the adjusted algorithm processing frame rate; if the adjusted algorithm processing frame rate is less than or equal to the minimum algorithm processing frame rate, updating the current algorithm processing frame rate to the minimum algorithm processing frame rate; and if the adjusted algorithm processing frame rate is greater than or equal to the maximum algorithm processing frame rate, updating the current algorithm processing frame rate to the maximum algorithm processing frame rate.
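By way of non-limiting illustration, the three update branches of claim 9 amount to clamping the adjusted frame rate to a fixed range. In the sketch below the function and parameter names are assumptions; how the adjusted frame rate is derived from the frame rate adjustment coefficient is not specified in this claim, so the adjusted value is taken as an input.

```python
def update_algo_fps(adjusted_fps, min_fps, max_fps):
    """Return the new current algorithm processing frame rate.

    Names are hypothetical; the three branches mirror claim 9.
    """
    if min_fps > max_fps:
        raise ValueError("min_fps must not exceed max_fps")
    if adjusted_fps <= min_fps:
        # At or below the minimum: fall back to the minimum frame rate.
        return min_fps
    if adjusted_fps >= max_fps:
        # At or above the maximum: cap at the maximum frame rate.
        return max_fps
    # Strictly between the bounds: accept the adjusted frame rate.
    return adjusted_fps
```

The result is equivalent to `max(min_fps, min(adjusted_fps, max_fps))`; the explicit branches are kept here only so that each line can be matched to a clause of the claim.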
10. The video stream processing method according to any one of claims 1 to 6, characterized in that, after the plurality of keyframes are input into the algorithm processing module for target detection, the method further comprises: calculating the feature similarity between two targets in the keyframe pair and the intersection-over-union (IoU) between the first target bounding boxes that enclose the two targets in the keyframe pair; calculating a comprehensive matching score of the two targets based on the IoU and the feature similarity; and if the comprehensive matching score is greater than a second score threshold, determining that the two targets are the same target.
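By way of non-limiting illustration, the matching test of claim 10 can be sketched as follows. The IoU computation is the standard one for axis-aligned boxes given as `(x1, y1, x2, y2)`. The weighted sum used to combine IoU and feature similarity, the `iou_weight` parameter, and all function names are assumptions made for this example, since the claim does not state how the two quantities are combined.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, zero if the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matching_score(iou_value, feature_similarity, iou_weight=0.5):
    """Hypothetical combination: a weighted sum of IoU and feature similarity."""
    return iou_weight * iou_value + (1.0 - iou_weight) * feature_similarity


def is_same_target(box_a, box_b, feature_similarity, second_score_threshold,
                   iou_weight=0.5):
    """Two targets match when the combined score exceeds the threshold."""
    score = matching_score(iou(box_a, box_b), feature_similarity, iou_weight)
    return score > second_score_threshold
```

In this sketch `feature_similarity` is assumed to be already computed and to lie in the range 0 to 1, so that it is on the same scale as the IoU.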
11. The video stream processing method according to any one of claims 1 to 6, characterized in that the keyframe pair includes two keyframes that are adjacent to each other among the plurality of keyframes.
12. A video stream processing apparatus, characterized by comprising: a keyframe acquisition module, configured to acquire a plurality of keyframes from an original video stream; a target detection module, configured to input the plurality of keyframes into an algorithm processing module for target detection and to draw a first target bounding box of each detected target on the corresponding keyframe; an information determination module, configured to, for at least one keyframe pair among the plurality of keyframes, if a same target exists in the keyframe pair, determine the attribute information of a second target bounding box of the same target in each non-keyframe based on the attribute information of the first target bounding box of the same target detected in the keyframe pair, wherein the keyframe pair includes two keyframes, each non-keyframe is a non-keyframe of the original video stream located between the two keyframes of the keyframe pair, and the non-keyframes between different keyframe pairs are different; and a target drawing module, configured to, for any one of the non-keyframes, draw the second target bounding box corresponding to the same target on the non-keyframe based on the attribute information of the second target bounding box of the same target in the non-keyframe.
13. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the electronic device is caused to implement the video stream processing method according to any one of claims 1 to 11.
14. A computer program product, characterized by comprising a computer program which, when run, causes the video stream processing method according to any one of claims 1 to 11 to be performed.