Video processing method and apparatus, and electronic device and storage medium
By analyzing image changes between video frames together with object detection results, and by using video encoding data to evaluate those changes, the number of calls to the object detection algorithm interface is reduced. This addresses the low efficiency of high-resolution video processing and enables real-time, accurate video processing.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ZHEJIANG UNIVIEW TECH CO LTD
- Filing Date
- 2025-07-07
- Publication Date
- 2026-07-02
AI Technical Summary
With increasing demands for high resolution and real-time performance, existing video processing technologies suffer from low efficiency due to excessive calls to algorithm interfaces, failing to meet the real-time requirements of scenarios such as live streaming and real-time monitoring.
By analyzing the changes in the video frame and the target detection results, the target detection results of existing frames are used to reduce the repeated detection of high-resolution video frames, reduce the number of calls to the target detection algorithm interface, and use motion vectors and residual data from the video encoding process to evaluate the changes in the frame and make reasonable reuse of the target detection results.
It improves the efficiency of video processing, reduces the consumption of computing resources, ensures video processing performance under high resolution and real-time processing requirements, and avoids stuttering problems caused by slow processing speed.
Description
Video processing methods, apparatus, electronic devices and storage media
[0001] This application claims priority to Chinese Patent Application No. 202411931058.8, filed with the Chinese Patent Office on December 26, 2024, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of video processing technology, for example to a video processing method, apparatus, electronic device, and storage medium.
Background Technology
[0003] In today's era of rapid internet development, while information dissemination and sharing are convenient, the problem of personal privacy leaks is becoming increasingly serious. Video blurring can effectively reduce the risk of privacy leaks. The relevant solutions generally involve calling algorithm interfaces and using image recognition and other technologies to detect target areas in the video, and then applying a mosaic effect to blur the content in those areas. However, with the development of video technology, increased resolution, and growing demands for real-time processing, even though the execution speed of algorithm interfaces has been optimized to its limit, the overall process is still too time-consuming, seriously affecting video processing efficiency. This makes it unsuitable for scenarios with high real-time requirements, such as live streaming and real-time monitoring, becoming a significant problem facing current video processing.
Summary of the Invention
[0004] This application provides a video processing method, apparatus, electronic device, and storage medium to achieve both accuracy and real-time performance in mosaic processing of video frames while reducing the number of algorithm interface calls.
[0005] In a first aspect, embodiments of this application provide a video processing method, the method comprising:
[0006] Determining the amount of video image change of the video image corresponding to a first video frame relative to the video image corresponding to a second video frame. The first video frame and the second video frame are different forward predictive coded frames belonging to the same image group, and that image group includes only independently coded keyframes and forward predictive coded frames.
[0007] Determining the target detection result corresponding to the first video frame based on the amount of video image change and the target detection result corresponding to the second video frame. A target detection result indicates the outcome of detecting the target object in the video image corresponding to a video frame.
[0008] Based on the target detection result corresponding to the first video frame, the video image corresponding to the first video frame is blurred.
[0009] In a second aspect, embodiments of this application also provide a video processing apparatus, the apparatus comprising:
[0010] A first determining module, configured to determine the amount of video image change of the video image corresponding to a first video frame relative to the video image corresponding to a second video frame. The first video frame and the second video frame are different forward predictive coded frames belonging to the same image group, and that image group includes only independently coded keyframes and forward predictive coded frames.
[0011] A second determining module, configured to determine the target detection result corresponding to the first video frame based on the amount of video image change and the target detection result corresponding to the second video frame. A target detection result indicates the outcome of detecting the target object in the video image corresponding to a video frame.
[0012] A blurring module, configured to blur the video image corresponding to the first video frame based on the target detection result corresponding to the first video frame.
[0013] In a third aspect, this application also provides an electronic device, which includes:
[0014] At least one processor; and
[0015] A memory communicatively connected to the at least one processor; wherein,
[0016] The memory stores a computer program that can be executed by the at least one processor, which enables the at least one processor to perform the video processing method described in any of the above embodiments.
[0017] In a fourth aspect, this application also provides a computer-readable medium storing computer instructions that cause a processor to execute the video processing method described in any one of the above embodiments.
Attached Figure Description
[0018] Figure 1 is a flowchart illustrating a video processing method provided in an embodiment of this application;
[0019] Figure 2 is a flowchart illustrating another video processing method provided in an embodiment of this application;
[0020] Figure 3 is a detailed schematic diagram of a video processing process provided in an embodiment of this application;
[0021] Figure 4 is a block diagram of a video frame corresponding to a video scene provided in an embodiment of this application;
[0022] Figure 5 is a schematic diagram of the structure of a video processing device provided in an embodiment of this application;
[0023] Figure 6 is a schematic diagram of the structure of an electronic device for implementing a video processing method according to an embodiment of this application.
Detailed Implementation
[0024] The steps described in the method embodiments of this application can be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown.
[0025] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.
[0026] The concepts of "first" and "second" mentioned in this application are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.
[0027] The terms “a” and “a plurality” used in this application are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as “one or more”.
[0028] Figure 1 is a flowchart illustrating a video processing method provided in an embodiment of this application. The technical solution of this application is applicable to the situation of blurring a portion of a video frame. This method can be executed by a video processing device, which can be implemented in the form of software and / or hardware, and is generally integrated into any electronic device with network communication function. The electronic device can be a mobile terminal, a personal computer (PC) or a server, etc.
[0029] As shown in Figure 1, the video processing method of this application embodiment may include the following process:
[0030] S110. Determine the amount of video image change of the video image corresponding to the first video frame relative to the video image corresponding to the second video frame. The first video frame and the second video frame are different forward predictive coded frames belonging to the same image group, and that image group includes only independently coded keyframes and forward predictive coded frames.
[0031] For videos encoded using inter-frame compression techniques like H.264 and H.265, the video sequence consists of multiple Groups of Pictures (GOPs). Due to the high real-time requirements in the security field, bidirectional predictive coded frames are typically not used. Therefore, in this case, the first video frame in each GOP is a keyframe, and all video frames within the same GOP following the keyframe in playback order are forward predictive coded frames. Based on this, the GOPs to which the first and second video frames belong only contain independently coded keyframes and forward predictive coded frames. The first and second video frames belong to the same GOP but are different forward predictive coded frames, with the first video frame playing after the second within the GOP. The amount of video frame variation is an indicator of the degree of difference between the first and second video frames.
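The GOP constraint described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame-type labels "I", "P" and "B" and the helper names `is_ip_only_gop` and `candidate_pairs` are hypothetical and do not come from the application. The sketch checks that a GOP starts with a keyframe followed only by forward predictive coded frames, and lists the (second frame, first frame) index pairs in which the first video frame plays after the second.

```python
from typing import List, Sequence, Tuple


def is_ip_only_gop(frame_types: Sequence[str]) -> bool:
    """Return True if the GOP is one I-frame followed only by P-frames."""
    if not frame_types or frame_types[0] != "I":
        return False
    return all(t == "P" for t in frame_types[1:])


def candidate_pairs(frame_types: Sequence[str]) -> List[Tuple[int, int]]:
    """List (second_idx, first_idx) pairs of distinct P-frames in playback order.

    The second video frame always precedes the first video frame.
    """
    p_indices = [i for i, t in enumerate(frame_types) if t == "P"]
    return [(s, f) for k, s in enumerate(p_indices) for f in p_indices[k + 1:]]


gop = ["I", "P", "P", "P"]
print(is_ip_only_gop(gop))
print(candidate_pairs(gop))
```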
[0032] The two different forward predictive coded frames belonging to the same image group are parsed to obtain the video images corresponding to the first video frame and the second video frame. The amount of change of the video image corresponding to the first video frame relative to the video image corresponding to the second video frame is then calculated, which gives the degree to which the former differs from the latter.
[0033] S120. Based on the amount of video image change and the target detection result corresponding to the second video frame, determine the target detection result corresponding to the first video frame. A target detection result indicates the outcome of detecting the target object in the video image corresponding to a video frame.
[0034] To apply a mosaic effect to target objects in a video, it's typically necessary to perform target object detection and mosaic application on each frame of the video. In other words, the process involves calling an object detection algorithm interface to detect the target object and then applying the mosaic effect. However, with increasing video resolution and real-time processing requirements, the time consumed by calling the object detection algorithm interface has become a bottleneck for system performance. Even though the execution speed of the object detection algorithm interface has been optimized to its limit, the overall process still takes a considerable amount of time, leading to low processing efficiency, especially when processing high-resolution or high-frame-rate videos.
[0035] Based on the above, suppose the object detection algorithm interface has already been called on the video image corresponding to the second video frame, yielding the object detection result for the second video frame. The amount of video image change reflects the degree of difference between the first and second video frames. If that amount is very small, the objects in the video image corresponding to the first video frame should be essentially the same as those in the video image corresponding to the second video frame. The inherent correlation between the two frames can therefore be used to conditionally pass the object detection result of the second video frame to the first video frame for direct use. Determining the object detection result of the first video frame from that of the second video frame in this way reduces how often the object detection algorithm interface must be called on the first video frame. The object detection result represents the detection status of objects in the video image corresponding to a video frame, including information such as the position and category of each object.
[0036] By adopting the above scheme, we can minimize the need to repeatedly invoke the object detection algorithm interface to perform complex object detection operations on the first video frame. This reduces the significant computational resources and time consumed by repeatedly invoking the object detection algorithm interface for detection. By utilizing existing object detection results, unnecessary computation is reduced. Especially when processing a large number of video frames, the process of independently detecting each video frame is reduced in the overall video processing flow, accelerating processing speed and significantly improving efficiency. Furthermore, by reasonably utilizing the object detection results and image changes in the second video frame, the object detection results of the first video frame can be determined more accurately. As long as the image change assessment is accurate and the association rules are reasonable, the reliability of object detection can be guaranteed while reducing computation.
[0037] In some embodiments, determining the amount of video image change of the video image corresponding to the first video frame relative to the video image corresponding to the second video frame may include the following steps A1-A2:
[0038] Step A1: Determine the first data quantity associated with the first video frame. The first data quantity is obtained by accumulating the data quantities associated with at least one first reference frame. The data quantity associated with a first reference frame is the quantity of motion vector data and residual value data of that first reference frame relative to the previous video frame in the image group to which it belongs. The first reference frames are either the first video frame alone, or the video frames located between the second video frame and the first video frame, including the first video frame. The first video frame is a video frame that relies on a second reference frame for predictive coding during video encoding in order to recover the complete video image. The second reference frame is either the second video frame, or a video frame located between the second video frame and the first video frame, including the second video frame.
[0039] Step A2: Based on the first data quantity associated with the first video frame, determine the amount of video image change of the video image corresponding to the first video frame relative to the video image corresponding to the second video frame.
[0040] Forward predictive coded frames (P-frames) are a type of frame in video coding. In video coding using inter-frame compression techniques (such as H.264 and H.265), P-frames are encoded primarily by referencing preceding I-frames (keyframes) or other P-frames. Unlike I-frames, which record complete image information, P-frames only record motion vector data and residual values. The data size of a P-frame is typically much smaller than that of a keyframe (I-frame), and the smaller the change in the video image, the smaller the data size of the P-frame. This is because P-frames are encoded based on their differences from preceding video frames; a smaller change in the video image means a smaller difference from the previous frames, so less motion vector and residual information needs to be recorded, resulting in a smaller data size.
[0041] Based on the above, it's easy to see that for encoding methods like H.264 and H.265 that use inter-frame compression, only keyframes (I-frames) record complete image information, and their data volume is usually large. In contrast, forward predictive coded frames (P-frames) only record motion vector data and residual values (in the security field, where real-time performance is crucial, bidirectional predictive coded frames are typically not used and are therefore not considered). Their data volume is usually much smaller than that of keyframes, and the smaller the video frame changes, the smaller the data volume of the forward predictive coded frame. Utilizing this characteristic, we can determine whether the target detection results from previous video frames can be reused by checking if the data volume of the forward predictive coded frame is small enough, thereby reducing the number of calls to the target detection algorithm interface.
[0042] In video processing, to accurately assess changes in the video image, the concept of a "first data quantity" is introduced. The first data quantity quantitatively reflects the correlation and difference between the video image corresponding to the second video frame and the video image corresponding to the first video frame. The first reference frame is a core element of the calculation. It can be the first video frame alone, or it can be the multiple video frames located between the second and first video frames, including the first video frame. The first data quantity associated with the first video frame is the sum of the data quantities associated with these first reference frames.
[0043] For each first reference frame, the associated data volume needs to be calculated. This data volume consists of motion vector data and residual value data of the first reference frame relative to the previous video frame in its image group. During video encoding, motion vector data and residual value data contain dynamic change information between video frames. For example, motion vector data can represent the direction and speed of movement of objects in the frame, while residual value data reflects the difference in pixel values between the current frame and the previous frame. By acquiring this data, we can understand the changes of each first reference frame relative to its predecessor. By accumulating the data volume associated with all first reference frames, we obtain the first data volume. This accumulation process summarizes the image change information carried by each first reference frame, forming a comprehensive index used to measure the overall degree of change related to the first video frame.
[0044] After the first data quantity associated with the first video frame is obtained, the amount of video image change of the first video frame relative to the second video frame can be determined from it. The size of the first data quantity reflects, to a certain extent, the degree of image change between the two video frames, a relationship that follows from how inter-frame encoding records image changes. For example, a large first data quantity may indicate a significant image change between the first and second video frames, possibly due to rapid object movement, scene switching, or other factors. Conversely, a small first data quantity indicates a relatively small change between the two video frames. This characteristic can be used to decide whether the target detection result corresponding to the previous video frame can be reused, by checking whether the data volume of the forward predictive coded frames is small enough. Doing so reduces the number of calls to the target detection algorithm interface and the overall system processing time.
[0045] The above implementation scheme fully utilizes the motion vector data and residual data generated during video encoding, which are inherently present in the video encoding structure. By leveraging the value of this data, there is no need to introduce complex image analysis algorithms to determine image changes, thereby improving computational efficiency and reducing unnecessary computational resource consumption. The amount of video image change is determined by accumulating the data volume of the first reference frame, achieving quantification of image changes. This quantification method is more accurate and detailed than traditional methods based on image similarity, not only determining whether an image has changed but also understanding the degree of change from the amount of data. Simultaneously, it reduces the number of calls to the object detection algorithm interface, thereby optimizing the overall system performance and reducing overall system processing time.
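Steps A1-A2 can be illustrated with a minimal Python sketch. It rests on an assumption that is not stated as such in the application: the encoded size in bytes of each forward predictive coded frame is taken as a stand-in for the quantity of its motion vector and residual data. The function name `first_data_quantity` and the sample sizes are hypothetical.

```python
from typing import Sequence


def first_data_quantity(frame_sizes: Sequence[int], second_idx: int, first_idx: int) -> int:
    """Accumulate the data quantities of the first reference frames.

    frame_sizes[i] is the encoded size of frame i in the GOP (index 0 is the
    keyframe). The first reference frames are assumed to be the frames after
    the second video frame up to and including the first video frame.
    """
    if not 1 <= second_idx < first_idx < len(frame_sizes):
        raise ValueError("expected 1 <= second_idx < first_idx < len(frame_sizes)")
    return sum(frame_sizes[second_idx + 1 : first_idx + 1])


# Hypothetical GOP: one large keyframe followed by four small P-frames.
sizes = [52000, 900, 1100, 700, 1500]
print(first_data_quantity(sizes, second_idx=1, first_idx=3))
```

In this sketch the result is used directly as the amount of video image change; a real system might normalize it by resolution or bitrate, which the source does not specify.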
[0046] S130. Based on the target detection result corresponding to the first video frame, blur the video image corresponding to the first video frame.
[0047] Once the target detection result corresponding to the first video frame is determined, the target object location and type indicated in the target detection result can be used to blur the corresponding target object area in the video frame corresponding to the first video frame. For example, specific algorithms or software tools can be used to add mosaic or other forms of occlusion to the area where the target object is located, thereby processing specific target object areas in the video frame to protect privacy or sensitive information. By using the above method, accurate blurring of target objects in the first video frame can effectively prevent the leakage of sensitive information such as privacy information and confidential content in the video.
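One possible form of the blurring in S130 is a block-averaging mosaic. The sketch below is an assumption-laden illustration, not the method of the application: the image is a plain list of rows of grayscale values, the box format `(x0, y0, x1, y1)` with exclusive lower-right corner is hypothetical, and `mosaic_region` is a made-up name.

```python
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1), x1/y1 exclusive


def mosaic_region(image: Image, box: Box, block: int = 2) -> Image:
    """Return a copy of image in which the box is replaced by block averages."""
    if block < 1:
        raise ValueError("block must be at least 1")
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]  # leave the input untouched
    for by in range(y0, y1, block):
        for bx in range(x0, x1, block):
            ys = range(by, min(by + block, y1))
            xs = range(bx, min(bx + block, x1))
            pixels = [image[y][x] for y in ys for x in xs]
            mean = sum(pixels) // len(pixels)
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


frame = [
    [0, 10, 20, 30],
    [40, 50, 60, 70],
    [80, 90, 100, 110],
    [120, 130, 140, 150],
]
for row in mosaic_region(frame, (0, 0, 2, 2), block=2):
    print(row)
```

Pixels outside the detected box are left unchanged, so only the target object area is obscured.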
[0048] In this embodiment, during video blurring, the change in the video image is analyzed for a first video frame and a second video frame that are different forward predictive coded frames of the same image group. By combining this change with the target detection result of the second video frame (which has already undergone target detection), the target detection result of the first video frame can be determined without running the target detection algorithm on the first video frame. This avoids running a complex target detection algorithm on every video frame and reduces unnecessary computation. Because repeated target object detection is greatly reduced, the frequency of algorithm interface calls drops as well. This shortens processing time and improves the efficiency of the whole video blurring process, particularly in scenarios that involve a large number of video frames. It alleviates the performance bottleneck caused by time-consuming algorithm interface calls, maintains good performance for high-resolution video and real-time processing, and avoids stuttering caused by slow processing.
[0049] Figure 2 is a flowchart of another video processing method provided in an embodiment of this application. This embodiment refines the step, described in the above embodiment, of determining the target detection result corresponding to the first video frame based on the amount of video image change and the target detection result corresponding to the second video frame. This embodiment can be combined with the optional solutions in one or more of the above embodiments.
[0050] As shown in Figure 2, the video processing method of this application embodiment may include the following process:
[0051] S210. Determine the amount of video frame change caused by the video frame corresponding to the first video frame relative to the video frame corresponding to the second video frame. The first video frame and the second video frame are different forward predictive coding frames belonging to the same image group.
[0052] S220. Determine the reference video frame change threshold to be used for the first video frame relative to the second video frame. The reference video frame change threshold is the upper limit of the video frame change required to make the similarity between the target detection result corresponding to the first video frame and the target detection result corresponding to the second video frame greater than the preset similarity.
[0053] In video processing, to efficiently and accurately determine whether the target detection results of each video frame can be reused, this solution introduces a key indicator: a reference video frame change threshold. This threshold serves as the basis for deciding whether to reuse the target detection results corresponding to existing video frames. The reference video frame change threshold can be set for the video frames corresponding to the first and second video frames. It represents an upper limit value, which is closely related to the similarity of the target detection results.
[0054] When the amount of video frame change between the first and second video frames falls within this threshold range, the similarity between the target detection results of the first and second video frames is guaranteed to be greater than a preset similarity. For example, if the preset similarity is set to 80%, then this threshold represents the upper limit of the amount of frame change that allows the target detection results of the two video frames to achieve this level of similarity. Calculating this threshold requires comprehensive consideration of various features of the video frame, the nature of the target object, and the characteristics of the target detection algorithm.
[0055] S230. In response to the amount of video frame change being not greater than the reference video frame change threshold, the target detection result corresponding to the second video frame is directly reused as the target detection result corresponding to the first video frame.
[0056] The target detection result is used to indicate the result of the detection of the target object in the video frame corresponding to the video frame.
[0057] When the calculated amount of video frame change is no greater than the reference video frame change threshold, the difference between the first and second video frames is small. In this case, for the sake of processing efficiency, the target detection result corresponding to the second video frame can be directly reused as the target detection result corresponding to the first video frame. A small change indicates that the position, features, and other information of the target object have changed little between the two video frames, so the target detection result of the second video frame still has high reference value. In other words, when the change between the video images corresponding to the first and second video frames is small enough, the target detection results corresponding to the two video frames should also be similar enough. The first video frame can then skip the target object detection operation and directly use the target detection result corresponding to the second video frame to apply blurring (such as a mosaic effect) to the target object area. This avoids performing a complex target object detection operation on the first video frame again, saving computing resources and time.
[0058] By setting a threshold for the amount of change in the reference video frame to determine whether to reuse object detection results, the repeated calls to the object detection algorithm are significantly reduced when the frame changes are small. This is particularly effective when processing a large number of video frames, significantly shortening the time spent on object detection in the entire video processing process and improving processing efficiency. Furthermore, this approach avoids unnecessary object detection operations, reducing the consumption of computing resources such as the Central Processing Unit (CPU) and memory. This allows the video processing system to allocate resources more rationally when processing video, improving the overall performance and stability of the system while ensuring processing quality.
[0059] S240. In response to the amount of video frame change being greater than the reference video frame change threshold, a target object detection operation is performed on the video image corresponding to the first video frame to obtain the target detection result corresponding to the first video frame.
[0060] If calculations show that the change in the video frame exceeds the threshold for the change in the reference video frame, it indicates a significant change in the image between the first and second video frames. This significant change may lead to substantial alterations in the position, shape, or quantity of target objects in the corresponding video frames of the first and second video frames. In this case, to ensure the accuracy of the target detection results, the target detection algorithm interface needs to be invoked again to perform target object detection on the corresponding video frame of the first video frame. This involves analyzing each pixel and region in the first video frame to obtain the target detection result for that frame.
[0061] When there are significant changes in the scene, the first video frame is promptly re-detected to ensure that the target detection results accurately reflect the actual situation of the target object in the video frame. This flexible adjustment of the detection strategy based on scene changes avoids target detection errors that may be caused by reusing old results, thus guaranteeing the accuracy of target detection. The above solution fully considers that video content is often dynamically changing and can adapt well to this dynamism. Whether it is slight scene shaking, slow object movement, or large scene changes or rapid object movement, the appropriate strategy can ensure the quality of target detection results, making the entire video processing workflow more robust and reliable.
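The branch between S230 and S240 can be sketched as follows. This is a minimal illustration under assumptions: `resolve_detection` is a hypothetical name, the detector is passed in as a callable so that the sketch does not depend on any particular detection library, and boxes use a made-up `(x0, y0, x1, y1)` format.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1) box format


def resolve_detection(
    change_amount: int,
    threshold: int,
    second_result: List[Box],
    detect: Callable[[], List[Box]],
) -> Tuple[List[Box], bool]:
    """Return (result for the first video frame, whether it was reused).

    S230: change not greater than the threshold -> reuse the second frame's result.
    S240: change greater than the threshold -> call the detector on the first frame.
    """
    if change_amount <= threshold:
        return list(second_result), True
    return detect(), False


detector_calls = 0


def fake_detector() -> List[Box]:
    """Stand-in for the object detection algorithm interface."""
    global detector_calls
    detector_calls += 1
    return [(5, 5, 9, 9)]


previous = [(0, 0, 4, 4)]
print(resolve_detection(800, 1000, previous, fake_detector))
print(resolve_detection(1500, 1000, previous, fake_detector))
print(detector_calls)
```

Passing the detector as a callable makes the saving visible: it is invoked only on the S240 branch.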
[0062] In some embodiments, determining a reference video frame change threshold to be used for the first video frame relative to the second video frame includes the following steps B1-B2:
[0063] Step B1: Determine the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame. The third video frame and the second video frame belong to the same image group, and the third video frame is a forward predictive coding frame or keyframe located before and adjacent to the second video frame.
[0064] Step B2: In response to the fact that the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is not less than a preset similarity threshold, the second data quantity associated with the second video frame is determined as the reference video frame change threshold to be used by the first video frame relative to the second video frame.
[0065] The second data quantity associated with the second video frame is the amount of data generated by the motion vector data and residual value data of the second video frame compared to the third video frame.
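Steps B1-B2 can be sketched as a small function. This is a minimal illustration rather than the claimed implementation: the function name, the parameter names, and the use of `None` to signal that no threshold is derived are assumptions made here, and the second data quantity is assumed to be supplied as a single number (for example, a byte count of motion vector and residual data).

```python
from typing import Optional


def reference_change_threshold(similarity: float,
                               similarity_threshold: float,
                               second_data_quantity: int) -> Optional[int]:
    """Hypothetical helper for steps B1-B2.

    similarity: similarity between the detection results of the second
        and third video frames (step B1).
    second_data_quantity: data generated by the motion vector and residual
        data of the second frame relative to the third frame.
    Returns the reference video frame change threshold, or None when the
    similarity is below the preset threshold (assumed signalling choice).
    """
    if similarity >= similarity_threshold:  # "not less than" in step B2
        return second_data_quantity
    return None
```

With a preset similarity threshold of 0.8, a similarity of 0.9 yields the second data quantity as the threshold, while a similarity of 0.5 yields no threshold.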
[0066] An important phenomenon has been observed in video processing: when the target detection results of two consecutive video frames show a sufficiently high similarity, it often means that the degree of change in the video image between the two frames is relatively small. In other words, in this case, the amount of change in the video image of one video frame relative to the previous video frame is within a specific range, and this range just meets the condition for the detection results of the two video frames to be highly similar.
[0067] Especially when encoding two video frames using inter-frame compression techniques like H.264 and H.265, there is a correlation between the degree of change in the video frame and the amount of data in the forward predictive coded frame. Specifically, the smaller the change in the video frame, the smaller the amount of data contained in the forward predictive coded frame. This is because, under this encoding method, the forward predictive coded frame mainly records the change information in the video frame. When the change in the video frame is small, less content needs to be recorded, and the amount of data is reduced accordingly.
[0068] For example, suppose that in these two consecutive video frames, the latter video frame happens to be a forward predictive coding frame. This provides an important clue: the amount of video scene change recorded in this forward predictive coding frame is small enough to make the detection results of the two video frames highly similar. From the perspective of data volume, this means that the amount of data carried in the forward predictive coding frame is small enough to just meet the condition of such high similarity.
[0069] If we set the data size of this specific forward predictive coding frame to a threshold A, and label this forward predictive coding frame as reference frame A, and define the target detection result of reference frame A as reference result A, then multiple consecutive forward predictive coding frames starting from reference frame A form a new video sequence. As long as the total data size generated by these consecutive forward predictive coding frames does not exceed the threshold A, it indicates that the accumulated video frame changes during this continuous process are sufficiently small. This small amount of video frame change ensures that the target detection result of the currently processed video frame maintains a high degree of similarity to reference result A.
[0070] In video processing, to accurately assess the impact of changes in video frame composition on target detection results, it's necessary to first determine the second and third video frames. The third video frame has a specific relationship with the second frame and belongs to the same image group. Furthermore, the third video frame is a forward predictive coding frame or keyframe that precedes and is adjacent to the second video frame. This adjacent and specific frame relationship ensures that the second and third video frames are closely linked in time and coding logic. The similarity of their target detection results can reflect the impact of changes in the scene over a short period on target information. For example, in surveillance video, two adjacent frames might only show minor object movement or slight changes in lighting. By calculating the similarity between the target detection results of the two video frames, the impact of changes in the scene over a short period on target information can be assessed.
[0071] When the similarity between the target detection results of the second video frame and the target detection results of the third video frame is not less than a preset similarity threshold, it indicates that there is essentially no significant change in the target detection results between the two video frames. This means that the position and state of the target object in the frame have not changed significantly, or no new target object has appeared. In this case, the second data quantity associated with the second video frame can be directly determined as the reference video frame change threshold to be used for the first video frame relative to the second video frame. Subsequently, as long as the change in the frame between the two video frames does not exceed this reference video frame change threshold, the target detection results of the corresponding video frames can be considered to be essentially consistent. Here, the second data quantity is the amount of data generated by the motion vector data and residual value data of the second video frame compared to the third video frame.
[0072] Motion vector data reflects information such as the direction and speed of movement of objects in the image, while residual data reflects the difference in pixel values between two frames. The threshold determined in this way comprehensively considers the dynamic changes in the image, providing a quantitative standard for subsequently judging the relationship between the first and second video frames. For example, if a car speeds past in a traffic surveillance video, causing a decrease in the similarity of target detection results between the second and third video frames, then the threshold for the change in the reference video image can be determined based on the motion vectors and residual values generated between the second and third video frames.
[0073] The above implementation can automatically determine the reference video frame change threshold based on the similarity of target detection results between video frames. This adaptive approach makes the threshold setting more closely match the actual changes in the video content, rather than using a fixed threshold, thereby improving the adaptability of the entire video processing system to different types of videos. By using the motion vector data and residual value data of the second video frame compared to the third video frame to determine the threshold, the information already existing in the video encoding process is fully utilized, avoiding additional complex calculations to determine the threshold and improving computational efficiency. Based on this reference video frame change threshold, when processing the relationship between the first and second video frames, it is possible to more accurately determine whether target detection needs to be repeated. If the frame change between the first and second video frames is within the threshold range, the previous detection results can be reasonably reused, reducing unnecessary target detection operations and thus improving the efficiency of the entire video processing workflow.
[0074] For example, referring to Figure 3, the video frames belonging to the same GOP are denoted F0, F1, F2, ..., Fn-1, Fn in playback order. Fi is the i-th video frame in the GOP, Ri is the target detection result obtained by calling the algorithm interface on Fi, and Di is the data size of frame Fi (which can be an I-frame or a P-frame in the image group). Assuming a similarity detection scheme exists, the similarity between the target detection results of two video frames can be quantified as a value L, and a similarity threshold Lt can be defined: when the similarity between the target detection results of two consecutive video frames is not lower than Lt, the two results are considered highly similar. The sum of the data sizes of multiple consecutive forward reference frames is defined as the cumulative difference. If the similarity between Ri and Ri+1 is lower than Lt, target detection is performed on Fi+2 to obtain Ri+2, and then Ri+1 is taken as the new Ri and Ri+2 as the new Ri+1. If the similarity between Ri and Ri+1 is not less than Lt, meaning Ri and Ri+1 are highly similar, the data size Di+1 of frame Fi+1 (Fi+1 being a P-frame in the image group) is set as the cumulative difference threshold Dt (i.e., the reference video frame change threshold to be calculated). The cumulative difference of the forward reference frames Fi+2 to Fj that follow video frame Fi+1 is defined as Dsum (the first data quantity associated with the video frames after Fi+1):
$$D_{sum} = \sum_{k=i+2}^{j} D_k$$
[0075] Based on this cumulative difference, if Dsum is not greater than Dt, video frames Fi+2 to Fj do not need to call the object detection algorithm interface; mosaic processing can be applied to them directly according to the target detection boxes in Ri+1. If adding the data size of the next frame makes the sum exceed Dt, that is, Dsum + Dj+1 is greater than Dt, where Dj+1 is the data size of frame Fj+1, object detection is performed on Fj+1 and Fj+2, and then Rj+1 is taken as the new Ri and Rj+2 as the new Ri+1.
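The scheduling described for Figure 3 can be sketched as a loop over one GOP. This is a hedged illustration under stated assumptions: `schedule_detection`, `detect`, and `similarity` are hypothetical names, `detect(k)` stands in for a call to the object detection algorithm interface on frame Fk, and `sizes[k]` plays the role of Dk.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

R = TypeVar("R")


def schedule_detection(sizes: Sequence[int],
                       detect: Callable[[int], R],
                       similarity: Callable[[R, R], float],
                       lt: float) -> Tuple[List[R], List[int]]:
    """Return (result used for each frame, indices actually detected)."""
    n = len(sizes)
    results: List = [None] * n
    detected: List[int] = []

    def run(k: int) -> None:
        results[k] = detect(k)      # one call to the detection interface
        detected.append(k)

    for k in range(min(n, 2)):      # the first pair Ri, Ri+1 is always detected
        run(k)
    k = 2
    while k < n:
        # Invariant: frames k-2 and k-1 were both detected (they are Ri, Ri+1).
        if similarity(results[k - 2], results[k - 1]) < lt:
            run(k)                  # not similar: detect Fi+2 and shift the pair
            k += 1
            continue
        ref = k - 1                 # index of Ri+1, whose result will be reused
        dt = sizes[ref]             # Dt is the data size of Fi+1
        dsum = 0
        while k < n and dsum + sizes[k] <= dt:
            dsum += sizes[k]        # accumulate Dsum
            results[k] = results[ref]
            k += 1
        for _ in range(2):          # threshold exceeded: detect Fj+1 and Fj+2
            if k < n:
                run(k)
                k += 1
    return results, detected
```

For example, with `sizes = [100, 10, 3, 3, 3, 9, 9, 9]` and a similarity function that always reports highly similar results, only frames 0, 1, 5 and 6 are detected; frames 2 to 4 reuse the result of frame 1, and frame 7 reuses the result of frame 6.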
[0076] In some embodiments, determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame includes the following steps C1-C4:
[0077] Step C1: Determine the number of target objects in the target detection results corresponding to the second video frame and the number of target objects in the target detection results corresponding to the third video frame.
[0078] Step C2: In response to the two numbers of target objects being different, the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is determined to be a preset value. The preset value indicates that the similarity between the two target detection results is 0.
[0079] Step C3: In response to the two numbers of target objects being the same, determine the first size information and the second size information. The first size information indicates the size of the overlapping area of the target detection boxes of the same target object in the target detection results of the second video frame and the third video frame. The second size information indicates the size of the union of the areas covered by the two target detection boxes of the same target object in the target detection results of the second video frame and the third video frame.
[0080] Step C4: Based on the size ratio of the first size information to the second size information, determine the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame.
[0081] In the target detection and analysis stage of video processing, it is necessary to first determine the number of target objects in the target detection results corresponding to the second and third video frames. The number of target objects can intuitively reflect the basic situation of targets in the picture. For example, in surveillance video, target objects may be pedestrians, vehicles, etc. By determining the changes in their number, we can gain a preliminary understanding of the changes in the picture content. In some embodiments, by extracting data from the target detection results already completed in the second and third video frames, the number of target objects in each frame is counted, which involves the analysis of the output results of the target detection algorithm. Target detection algorithms typically identify targets in the picture and provide corresponding information, including the key data of the number of target objects.
[0082] When the number of target objects in the target detection results of the second and third video frames is found to be different, it means that there has been a significant change in the target level between the two frames. This could be due to a new target entering the frame or an existing target leaving the frame. In this case, the similarity between the target detection results of the second and third video frames can be set as a preset value. This preset value is set to indicate that the similarity between the two target detection results is 0, meaning they are considered different. This setting is based on the significant difference in the number of target objects, indicating a fundamental change in the target composition of the two frames.
[0083] When the number of target objects is the same in the target detection results of the second and third video frames, the pictures corresponding to the two video frames are similar to some degree in terms of the number of target objects. However, this alone is not enough to establish the similarity between the two detection results; the position and size of the target objects in the picture must also be analyzed. It is therefore necessary to determine the first and second size information and use them to determine the similarity between the target detection results of the second and third video frames.
[0084] For example, the first size information is used to indicate the size of the overlapping region of the target detection boxes for the same target object in the target detection results of the second video frame and the third video frame. The target detection box is a rectangular box used in the target detection algorithm to identify the position and extent of a target object in the frame. By analyzing the overlapping portion of the target detection boxes for the same target object in two frames, the relative changes in the target's position and size between the two frames can be determined. Calculating the size of this overlapping region requires using geometric calculation methods to analyze the coordinates and sizes of the two target detection boxes.
[0085] For example, the second size information indicates the size of the union of the areas covered by the two target detection boxes of the same target object in the target detection results of the second and third video frames. Specifically, it is obtained by merging the areas covered by the target detection boxes of the same target object in the two video frames. Equivalently, it is the sum of the areas of the two target detection boxes minus the size of the overlapping area indicated by the first size information, and it reflects the overall extent occupied by the target object across the two frames. As with the first size information, calculating the second size information requires the coordinates and sizes of the target detection boxes.
[0086] After the first and second size information are acquired, the similarity between the target detection results of the second and third video frames is determined from the ratio of the first size information to the second size information, i.e., the intersection over union (IoU). This ratio measures the degree of change of the target object between the two frames. For example, if the ratio is close to 1, the position and size of the target object change very little between the two frames, and the similarity is high; conversely, if the ratio is small, the position or size of the target object changes considerably between the two frames, and the similarity is low. Because this ratio-based calculation accounts for both the spatial position and the size of the target object in the picture, it is more detailed and accurate than judging solely by the number of target objects.
[0087] For example, the following is a feasible similarity detection scheme for the target detection results of two consecutive video frames. The similarity is calculated only if the two results indicate the same number of detected target objects; if the numbers differ, the similarity of the two results is taken to be 0. For a given target object, the larger the overlapping part of its target detection boxes in the two frames relative to the union of the areas covered by the two boxes, the more similar the two boxes are, and this proportion can be used as the quantified similarity value L. When the number of target objects is the same, the target detection box of a target object in the video frame that appears earlier in display order is denoted FA, the target detection box of the same target object in the video frame that appears later is denoted FB, the area of FA is SA, the area of FB is SB, and the area of the overlapping part of FA and FB is SAB. Then the similarity of the target detection boxes corresponding to the two target objects is:
$$\frac{S_{AB}}{S_A + S_B - S_{AB}}$$
[0088] Assuming that the number of target objects in the target detection results of two consecutive video frames is n, then the similarity between the target detection results of two consecutive video frames is:
[0089] By comprehensively considering factors such as the number of target objects, the overlapping area of the target detection boxes, and the size of the union region indicated by the second size information, the similarity of the target detection results can be determined, which can more accurately reflect the degree of similarity between two video frames at the target level. This multi-dimensional evaluation method avoids the limitations of single-factor judgment, making the similarity evaluation results more consistent with reality. In complex video scenes, the number, position, and size of target objects may change frequently. The above method can effectively cope with this complexity. Whether it is the rapid movement of the target, the brief occlusion, or the appearance of a new target and the disappearance of an old target, it can accurately evaluate the similarity between two frames through reasonable steps, thereby ensuring the stability and accuracy of the entire video processing system in various complex environments.
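The similarity scheme of steps C1-C4 can be sketched as follows. The per-box similarity is the overlap-to-union ratio described above. Three points are assumptions made here for illustration: boxes are `(x, y, width, height)` tuples, the "same target object" is matched by list position in the two results, and the per-box values are combined over the n target objects by taking their mean.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height (assumed layout)


def box_similarity(a: Box, b: Box) -> float:
    """Overlap area divided by union area of two detection boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    s_ab = max(inter_w, 0) * max(inter_h, 0)   # first size information
    union = aw * ah + bw * bh - s_ab           # second size information
    return s_ab / union if union > 0 else 0.0


def result_similarity(earlier: List[Box], later: List[Box]) -> float:
    """Similarity of two detection results (steps C1-C4)."""
    if len(earlier) != len(later):             # step C2: counts differ
        return 0.0
    if not earlier:                            # assumption: two empty results match
        return 1.0
    # Assumption: boxes are paired by index and the ratios are averaged.
    ratios = [box_similarity(a, b) for a, b in zip(earlier, later)]
    return sum(ratios) / len(ratios)
```

For two 10 x 10 boxes offset horizontally by 5, the overlap is 50 and the union is 150, so the box similarity is 1/3.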
[0090] In some embodiments, before determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame, steps D1-D2 are further included:
[0091] Step D1: Take the second video frame and the third video frame as video frames to be processed, and divide the picture of each video frame to be processed into multiple frame blocks to be processed.
[0092] Step D2: Perform target object detection operations in parallel on the multiple frame blocks to be processed to obtain the target detection result corresponding to each frame block to be processed, and merge the target detection results corresponding to the multiple frame blocks to be processed to obtain the target detection result corresponding to the video frame to be processed.
[0093] In video processing, to make object detection more efficient and accurate, the picture of a video frame can be segmented into blocks. Object detection plays a crucial role in the overall video analysis process, and its results feed subsequent operations such as reuse of target detection results and evaluation of frame changes. Here, the second and third video frames are taken in turn as the video frame to be processed. Segmenting the picture of the video frame to be processed breaks a large, complex picture into multiple smaller and simpler regions, i.e., multiple frame blocks to be processed. Performing detection on these frame blocks in parallel effectively reduces the detection time for a single video frame and thereby the overall processing time.
[0094] The actual segmentation operation can be determined based on different algorithms and application scenarios. For example, the image can be divided evenly according to a fixed size, such as dividing the image into multiple rectangular blocks of the same size; or adaptive segmentation can be performed based on the content features in the image, such as dividing based on features like the outline and color of objects, so that the content within each block has a certain degree of similarity or independence. Regardless of the segmentation method used, the goal is to better handle the problem of target object detection in video images.
[0095] Object detection is performed in parallel on multiple frame blocks to be processed. Parallel processing means that multiple blocks are detected simultaneously, with each frame block being detected independently using an object detection algorithm. These algorithms can be based on machine learning models (such as convolutional neural networks) or traditional image processing methods (such as feature-based object detection algorithms). Parallel processing allows for the detection of a large number of blocks in a short time. After completing object detection for multiple frame blocks, the detection results for each block are merged to obtain the final object detection result for the corresponding video frame to be processed.
[0096] When integrating the local detection results of multiple blocks into the global detection result of the entire video frame, the relationships and positional information between the blocks need to be considered during the merging process. For example, if uniform block division is used, the target detection results of multiple blocks can be stitched together according to the correct positional relationship based on the coordinate order of the blocks; if adaptive block division is used, the detection results need to be accurately merged based on the correlation between the blocks to form the complete target detection result of the video frame to be processed.
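The parallel detection step can be sketched with a thread pool. This is only an illustration under assumptions: the frame is a list of pixel rows, each block is a `(left, top, right, bottom)` rectangle, and `detect_block` is a hypothetical stand-in for the detection algorithm applied to one cropped block. Threads help only if the real detector releases the interpreter lock or runs on an accelerator; a process pool would be an alternative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # left, top, right, bottom (assumed layout)


def detect_blocks_in_parallel(frame: Sequence[Sequence[int]],
                              blocks: Sequence[Rect],
                              detect_block: Callable[[list], list]) -> List[list]:
    """Run detect_block on every cropped block; results keep block order."""

    def crop(rect: Rect) -> list:
        left, top, right, bottom = rect
        return [list(row[left:right]) for row in frame[top:bottom]]

    # One worker per block; max(1, ...) avoids an invalid pool size of zero.
    with ThreadPoolExecutor(max_workers=max(1, len(blocks))) as pool:
        # pool.map yields results in the same order as the input blocks.
        return list(pool.map(lambda rect: detect_block(crop(rect)), blocks))
```

The per-block results are still expressed in block-local coordinates, so they must be converted and merged before they describe the whole frame.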
[0097] By dividing the video frame into blocks and performing object detection in parallel, computational resources are fully utilized, significantly reducing object detection time. Compared to the traditional method of detecting the entire video frame sequentially, parallel processing can process multiple blocks simultaneously, especially for high-resolution video frames, where this efficiency improvement is even more significant. Block processing allows the object detection algorithm to focus more on the target features within each small region, reducing interference from other complex factors in the frame. Each block contains relatively fewer target objects with more obvious features, which helps improve detection accuracy. Therefore, dividing the video frame into blocks according to a specific block-segmentation scheme and performing parallel detection reduces overall processing time without losing targets.
[0098] In some embodiments, adjacent frame blocks among the plurality of frame blocks to be processed share a rectangular overlapping region, wherein the vertical height of the rectangular overlapping region is not less than a reference vertical height, and the horizontal width of the rectangular overlapping region is not less than a reference horizontal width. The reference vertical height is determined based on the maximum vertical height of the target detection boxes of target objects in the target detection results of historical video frames acquired before the video frame to be processed, and the reference horizontal width is determined based on the maximum horizontal width of the target detection boxes of target objects in the target detection results of those historical video frames.
[0099] Referring to Figure 4, to prevent target objects that lie on the dividing lines from being misidentified when the picture to be processed is segmented, adjacent frame blocks must overlap sufficiently. At the same time, to maximize performance, the overlap area between frame blocks should be kept as small as possible. To ensure that no target object is missed because it lies at the edge of a detection region, the width and height of the overlapping region must not be less than the width and height of any detected target object. Accordingly, the maximum width of the target detection boxes of all target objects in the target detection results of historical video frames is recorded as the minimum overlapping region width, and the maximum height of the target detection boxes in the target detection results of historical video frames is recorded as the minimum overlapping region height. Subsequently, when the video frame to be processed is segmented, the width of every overlapping region must be not less than the minimum overlapping width, and its height not less than the minimum overlapping height. Finally, the frame blocks to be processed are detected in parallel, and their detection results are merged to obtain the detection result of the video frame.
[0100] For example, referring to Figure 4, define the top-left corner of the video frame to be processed as point A0, the top-right corner as point B1, the bottom-left corner as point C2, and the bottom-right corner as point D3. Draw the perpendicular bisector of A0B1, intersecting A0B1 at point AB and C2D3 at point CD; draw the perpendicular bisector of A0C2, intersecting A0C2 at point AC and B1D3 at point BD. Record the maximum width of all target detection boxes in the historical detection results as the minimum width W of the overlapping region, and the maximum height of all target detection boxes in the historical detection results as the minimum height H of the overlapping region. Point A1 is obtained by moving point AB a distance W / 2 towards point A0; point B0 is obtained by moving point AB a distance W / 2 towards point B1; point A2 is obtained by moving point AC a distance H / 2 towards point A0; point C0 is obtained by moving point AC a distance H / 2 towards point C2; point B3 is obtained by moving point BD a distance H / 2 towards point B1; point D1 is obtained by moving point BD a distance H / 2 towards point D3; point C3 is obtained by moving point CD a distance W / 2 towards point C2; and point D2 is obtained by moving point CD a distance W / 2 towards point D3. At this point, the video frame to be processed is divided into four frame blocks, namely rectangles A0B0D0C0, A1B1D1C1, A2B2D2C2, and A3B3D3C3. The overlapping area is the cross-shaped area formed by rectangles A1B0D2C3 and A2B3D1C0.
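The Figure 4 construction can be sketched numerically. In this illustration the origin is the top-left corner A0, each rectangle is returned as `(left, top, right, bottom)`, and the function names are hypothetical. The interior corners that the text does not define explicitly (D0, C1, B2, A3) are assumed to follow from the rectangles being axis-aligned.

```python
from typing import Iterable, List, Tuple

Box = Tuple[float, float, float, float]   # x, y, width, height (assumed layout)
Rect = Tuple[float, float, float, float]  # left, top, right, bottom


def min_overlap(history_boxes: Iterable[Box]) -> Tuple[float, float]:
    """W and H: largest box width and height seen in historical results."""
    boxes = list(history_boxes)
    w_min = max((bw for _, _, bw, _ in boxes), default=0)
    h_min = max((bh for _, _, _, bh in boxes), default=0)
    return w_min, h_min


def split_with_overlap(w: float, h: float, W: float, H: float) -> List[Rect]:
    """Four frame blocks of a w x h frame with a W-wide, H-high overlap cross."""
    cx, cy = w / 2, h / 2                  # points AB/CD and AC/BD lie on these
    return [
        (0, 0, cx + W / 2, cy + H / 2),    # A0B0D0C0 (top-left block)
        (cx - W / 2, 0, w, cy + H / 2),    # A1B1D1C1 (top-right block)
        (0, cy - H / 2, cx + W / 2, h),    # A2B2D2C2 (bottom-left block)
        (cx - W / 2, cy - H / 2, w, h),    # A3B3D3C3 (bottom-right block)
    ]
```

For a 1920 x 1080 frame with W = 100 and H = 60, the top-left block spans x from 0 to 1010 and y from 0 to 570, and the top-right block starts at x = 910, so the vertical strip of the cross is exactly 100 wide.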
[0101] Referring to Figure 4, after the frame blocks to be processed have been detected in parallel, their detection results need to be merged. For example, the detection results of each frame block obtained from the parallel detection are collected and organized. These detection results usually include the location information of the target object (such as its coordinate range within the block, which may be represented by the coordinates of the upper left and lower right corners) and the target object category (such as people, vehicles, or objects). During parallel detection, different blocks may detect the same target object more than once when it lies near a block boundary or in an overlapping area. Therefore, the target detection results of the frame blocks must be deduplicated. After deduplication, the detection results of the frame blocks to be processed are classified and integrated according to target object category. The integrated target detection results for each category and position are then summarized to form a complete final detection result for the entire frame.
[0102] In some embodiments, before the detection results of the frame blocks to be processed are merged, the target detection box coordinates in each frame block must be converted into coordinates on the original video frame. The width of the original video frame is defined as w, its height as h, and the origin of its coordinate axes as the upper left corner of the video frame. The upper left corner of the target detection box of a single detected target object, in the coordinates of its frame block, is defined as (x, y). Then, for rectangle A0B0D0C0, the converted coordinates on the original video frame are (x, y); for rectangle A1B1D1C1, the converted coordinates are (x + w / 2 - W / 2, y); for rectangle A2B2D2C2, the converted coordinates are (x, y + h / 2 - H / 2); and for rectangle A3B3D3C3, the converted coordinates are (x + w / 2 - W / 2, y + h / 2 - H / 2). When detection results are merged, target detection boxes whose converted top-left corner coordinates are the same are merged into one. The height of the merged detection box is the maximum of the heights of these detection boxes, and its width is the maximum of their widths.
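The coordinate conversion and the merge rule above can be sketched as two small functions. The function names, the `(x, y, width, height)` box layout, and the block numbering 0 to 3 for rectangles A0B0D0C0 to A3B3D3C3 are assumptions made here; the exact-equality match on the top-left corner follows the rule as stated, and a real system might instead compare with a tolerance.

```python
from typing import Dict, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height (assumed layout)


def to_frame_coords(block_index: int, box: Box,
                    w: float, h: float, W: float, H: float) -> Box:
    """Convert a box from block-local to original-frame coordinates."""
    offsets = [
        (0, 0),                          # A0B0D0C0
        (w / 2 - W / 2, 0),              # A1B1D1C1
        (0, h / 2 - H / 2),              # A2B2D2C2
        (w / 2 - W / 2, h / 2 - H / 2),  # A3B3D3C3
    ]
    dx, dy = offsets[block_index]
    x, y, bw, bh = box
    return (x + dx, y + dy, bw, bh)


def merge_boxes(boxes: Iterable[Box]) -> List[Box]:
    """Merge boxes sharing a top-left corner, keeping max width and height."""
    merged: Dict[Tuple[float, float], Tuple[float, float]] = {}
    for x, y, bw, bh in boxes:
        prev_w, prev_h = merged.get((x, y), (0, 0))
        merged[(x, y)] = (max(prev_w, bw), max(prev_h, bh))
    return [(x, y, bw, bh) for (x, y), (bw, bh) in merged.items()]
```

For a 1920 x 1080 frame with W = 100 and H = 60, a box found at (40, 520) in the top-right block maps to (950, 520) on the original frame, where it coincides with the same target found at (950, 520) in the top-left block and is merged with it.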
[0103] S250. Based on the target detection result corresponding to the first video frame, blur the video image corresponding to the first video frame.
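One possible form of the blurring in S250 is mosaic processing of the region inside a target detection box. The sketch below is an assumption-laden illustration, not the claimed method: the picture is a grayscale image stored as a list of rows of integers, the box is `(x, y, width, height)` in pixels, and `cell` is a hypothetical tile size. Each tile inside the box is replaced by its integer mean.

```python
from typing import List, Tuple


def mosaic(image: List[List[int]], box: Tuple[int, int, int, int],
           cell: int = 2) -> List[List[int]]:
    """Pixelate the area of `image` covered by `box`, modifying it in place."""
    x, y, bw, bh = box
    height = len(image)
    width = len(image[0]) if image else 0
    x0, y0 = max(x, 0), max(y, 0)                      # clip box to the image
    x1, y1 = min(x + bw, width), min(y + bh, height)
    for ty in range(y0, y1, cell):
        for tx in range(x0, x1, cell):
            rows = range(ty, min(ty + cell, y1))
            cols = range(tx, min(tx + cell, x1))
            total = sum(image[r][c] for r in rows for c in cols)
            mean = total // (len(rows) * len(cols))    # integer tile average
            for r in rows:
                for c in cols:
                    image[r][c] = mean
    return image
```

Pixels outside the detection box are left unchanged, so only the detected target object is obscured.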
[0104] In this embodiment of the application, during video blurring, the amount of video frame change is analyzed between the first and second video frames, which are different forward predictive coding frames belonging to the same image group. By combining this amount of change with the target detection result of the second video frame, on which target detection has already been completed, the target detection result of the first video frame can be determined without running the target detection algorithm on the first video frame. This avoids re-running a complex target detection algorithm on every video frame and reduces unnecessary computation. Because repeated target object detection operations are greatly reduced, the call frequency of the algorithm interface also drops significantly. Thus, in complex scenarios that require processing a large number of video frames, the processing time can be significantly shortened and the efficiency of the entire video blurring process improved. This effectively alleviates the performance bottleneck caused by time-consuming algorithm interface calls and maintains good performance even for high-resolution video or real-time processing, avoiding stuttering and other problems caused by slow processing.
[0105] Figure 5 is a schematic diagram of a video processing apparatus provided in an embodiment of this application. This embodiment is applicable to blurring part of a video picture. The video processing apparatus can be implemented in software and / or hardware, and is generally integrated in any electronic device with a network communication capability, such as a mobile terminal, PC, or server.
[0106] As shown in Figure 5, the video processing apparatus of this application embodiment may include the following:
[0107] The first determining module 510 is configured to determine the amount of video frame change of the video picture corresponding to the first video frame relative to the video picture corresponding to the second video frame. The first video frame and the second video frame are different forward predictive coding frames belonging to the same image group, and that image group includes only independently coded keyframes and forward predictive coding frames.
[0108] The second determining module 520 is configured to determine the target detection result corresponding to the first video frame based on the amount of video frame change and the target detection result corresponding to the second video frame. A target detection result indicates the result of detecting the target object in the video picture corresponding to a video frame.
[0109] The blurring module 530 is configured to blur the video picture corresponding to the first video frame based on the target detection result corresponding to the first video frame.
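The blurring step can be illustrated with a minimal sketch. The source does not specify a blur filter, so the example below uses a crude mean fill of each detection box as a stand-in; the grayscale list-of-lists picture format, the `(x1, y1, x2, y2)` box layout, and the function name `blur_regions` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), end-exclusive; layout is an assumption


def blur_regions(picture: List[List[int]], boxes: Sequence[Box]) -> List[List[int]]:
    """Return a copy of a grayscale picture with every box replaced by its mean value.

    Mean fill is only a stand-in for a real blur filter (e.g. Gaussian),
    which the source does not specify.
    """
    out = [row[:] for row in picture]
    for x1, y1, x2, y2 in boxes:
        pixels = [picture[y][x] for y in range(y1, y2) for x in range(x1, x2)]
        if not pixels:
            continue  # empty box: nothing to blur
        mean = sum(pixels) // len(pixels)
        for y in range(y1, y2):
            for x in range(x1, x2):
                out[y][x] = mean
    return out
```

Because the boxes are the only input besides the picture, the same routine works whether the boxes were freshly detected or reused from an earlier frame.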
[0110] In some embodiments, determining the amount of video frame change of the video picture corresponding to the first video frame relative to the video picture corresponding to the second video frame includes:
[0111] Determining a first data quantity associated with the first video frame. The first data quantity is obtained by accumulating the data quantities associated with at least one first reference frame. The data quantity associated with a first reference frame is the quantity of motion vector data and residual value data of that first reference frame relative to the previous video frame in the image group to which it belongs. A first reference frame is the first video frame itself or a video frame located between the second video frame and the first video frame, the range including the first video frame. The first video frame is a video frame that, in the video encoding process, relies on a second reference frame for predictive coding in order to recover the complete video picture. A second reference frame is the second video frame itself or a video frame located between the second video frame and the first video frame, the range including the second video frame.
[0112] Determining, based on the first data quantity associated with the first video frame, the amount of video frame change of the video picture corresponding to the first video frame relative to the video picture corresponding to the second video frame.
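The accumulation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `FrameStats` container, the byte-count fields, and the assumption that the first video frame comes after the second video frame in the image group are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FrameStats:
    """Per-frame encoder statistics (hypothetical container, not from the source)."""
    index: int           # position of the frame inside its image group
    mv_bytes: int        # motion vector data relative to the previous frame
    residual_bytes: int  # residual value data relative to the previous frame


def accumulated_change(frames: Sequence[FrameStats], second_idx: int, first_idx: int) -> int:
    """Accumulate data quantities of every frame after second_idx, up to and including first_idx.

    Assumes the first video frame follows the second video frame in decoding order.
    """
    if first_idx <= second_idx:
        raise ValueError("the first video frame must come after the second video frame")
    return sum(
        f.mv_bytes + f.residual_bytes
        for f in frames
        if second_idx < f.index <= first_idx
    )
```

The second video frame itself is excluded from the sum because its own data quantity describes its change relative to an earlier frame, not the change that has occurred since it.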
[0113] In some embodiments, the video processing apparatus further includes a threshold determination module 540, which is configured to perform the following before the amount of video frame change of the video picture corresponding to the first video frame relative to the video picture corresponding to the second video frame is determined:
[0114] Determine the reference video frame change threshold to be used for the first video frame relative to the second video frame. The reference video frame change threshold is the upper limit of video frame change for which the similarity between the target detection result corresponding to the first video frame and the target detection result corresponding to the second video frame remains greater than the preset similarity.
[0115] In some embodiments, determining the target detection result corresponding to the first video frame based on the amount of video frame change and the target detection result corresponding to the second video frame includes:
[0116] In response to the amount of video frame change being not greater than the reference video frame change threshold, directly reusing the target detection result corresponding to the second video frame as the target detection result corresponding to the first video frame.
[0117] In response to the amount of video frame change being greater than the reference video frame change threshold, performing a target object detection operation on the video picture corresponding to the first video frame.
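The two branches above amount to a single comparison. The sketch below shows one way to express it; the box layout, the function name `resolve_detection`, and the callback-style detector are assumptions made for illustration, and the detector stands in for whatever target detection algorithm interface is actually used.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) detection box; layout is an assumption


def resolve_detection(
    change_amount: int,
    reference_threshold: int,
    second_frame_boxes: List[Box],
    run_detector: Callable[[], List[Box]],
) -> Tuple[List[Box], bool]:
    """Return (boxes for the first video frame, whether the detector was called)."""
    if change_amount <= reference_threshold:
        # Change not greater than the threshold: reuse the earlier result as-is.
        return list(second_frame_boxes), False
    # Change greater than the threshold: run target detection on the first frame.
    return run_detector(), True
```

Passing the detector as a callable keeps the expensive call lazy, so it is only made on the branch that needs it.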
[0118] In some embodiments, the threshold determination module 540 is configured as follows:
[0119] Determine the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame, wherein the third video frame and the second video frame belong to the same image group, and the third video frame is a forward predictive coding frame or key frame located before and adjacent to the second video frame.
[0120] In response to the fact that the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is not less than a preset similarity threshold, the second data quantity associated with the second video frame is determined as the reference video frame change threshold to be used by the first video frame relative to the second video frame.
[0121] The second data quantity associated with the second video frame is the amount of data generated by the motion vector data and residual value data of the second video frame compared to the third video frame.
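A minimal sketch of this threshold selection follows. The default preset similarity of 0.9 is an illustrative value only, and returning `None` when the similarity is too low is an assumption: the passage above does not state what happens in that case.

```python
from typing import Optional


def reference_change_threshold(
    similarity_second_vs_third: float,
    second_data_quantity: int,
    preset_similarity: float = 0.9,  # illustrative value, not from the source
) -> Optional[int]:
    """Pick the reference video frame change threshold for the first video frame.

    If the detection results of the second and third video frames are similar
    enough, the data quantity that separated them is a change that demonstrably
    left the detection result (nearly) unchanged, so it is used as the threshold.
    """
    if similarity_second_vs_third >= preset_similarity:
        return second_data_quantity
    # Assumption: no usable threshold; the caller would fall back to detection.
    return None
```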
[0122] In some embodiments, determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame includes:
[0123] Determine the number of target objects in the target detection result corresponding to the second video frame and the number of target objects in the target detection result corresponding to the third video frame;
[0124] In response to the two numbers of target objects being different, the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is determined to be a preset value, which indicates that the similarity between the two target detection results is 0;
[0125] In response to the two numbers of target objects being the same, first size information and second size information are determined. The first size information indicates the size of the overlapping area of the two target detection boxes of the same target object in the target detection result of the second video frame and the target detection result of the third video frame. The second size information indicates the size of the union of the areas covered by those two target detection boxes.
[0126] Based on the size ratio of the first size information to the second size information, the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is determined.
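The similarity described above is an intersection-over-union ratio gated by an object-count check. The sketch below is one possible reading: it assumes boxes are given as `(x1, y1, x2, y2)`, that the two lists are already ordered so that boxes at the same index belong to the same target object, and that overlap and union areas are summed over all objects before the ratio is taken. The source does not specify how several objects are aggregated or how two empty results compare, so both choices are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); layout is an assumption


def box_area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_area(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def detection_similarity(second: Sequence[Box], third: Sequence[Box]) -> float:
    """Similarity of two detection results; boxes are paired by index."""
    if len(second) != len(third):
        return 0.0  # different object counts: preset value meaning similarity 0
    first_size = 0.0   # total overlapping area of paired boxes
    second_size = 0.0  # total union area of paired boxes
    for a, b in zip(second, third):
        inter = overlap_area(a, b)
        first_size += inter
        second_size += box_area(a) + box_area(b) - inter
    if second_size == 0.0:
        return 1.0  # assumption: two empty results are treated as identical
    return first_size / second_size
```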
[0127] In some embodiments, the threshold determination module 540 is further configured to:
[0128] Perform the following operations before determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame.
[0129] Take the second video frame and the third video frame each as a video frame to be processed, and divide the video picture corresponding to the video frame to be processed into blocks to obtain multiple picture blocks to be processed.
[0130] Perform the target object detection operation in parallel on the multiple picture blocks to be processed to obtain the target detection result corresponding to each picture block, and then merge these per-block results to obtain the target detection result corresponding to the video frame to be processed.
[0131] In some embodiments, rectangular overlapping regions exist between adjacent picture blocks among the multiple picture blocks to be processed. The vertical height of the rectangular overlapping region is not less than a reference vertical height, and the horizontal width of the rectangular overlapping region is not less than a reference horizontal width. The reference vertical height is determined based on the maximum vertical height of the target detection boxes of target objects in the target detection results corresponding to historical video frames collected before the video frame to be processed. The reference horizontal width is determined based on the maximum horizontal width of the target detection boxes of target objects in those same target detection results.
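The block-wise detection described above can be sketched as follows. This is an illustration under stated assumptions rather than the applicant's implementation: the grid layout, the function names, the use of a thread pool, and the exact-match de-duplication during merging are all hypothetical. A real merge step would more likely use an overlap-based rule, because a real detector rarely returns identical coordinates for the same object in two blocks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2); layout is an assumption
Tile = Tuple[int, int, int, int]  # block rectangle in picture coordinates


def reference_overlap(history: Sequence[Box]) -> Tuple[int, int]:
    """Largest box width and height seen in historical detection results."""
    if not history:
        return 0, 0
    return max(b[2] - b[0] for b in history), max(b[3] - b[1] for b in history)


def split_with_overlap(width: int, height: int, cols: int, rows: int,
                       overlap_w: int, overlap_h: int) -> List[Tile]:
    """Cut the picture into a grid; each block extends into its right/bottom neighbour."""
    step_w = -(-width // cols)   # ceiling division
    step_h = -(-height // rows)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x1, y1 = c * step_w, r * step_h
            x2 = min(width, x1 + step_w + overlap_w)
            y2 = min(height, y1 + step_h + overlap_h)
            tiles.append((x1, y1, x2, y2))
    return tiles


def detect_in_tiles(tiles: Sequence[Tile],
                    detect_tile: Callable[[Tile], List[Box]],
                    max_workers: int = 4) -> List[Box]:
    """Run the detector on every block in parallel and merge the results.

    detect_tile is assumed to return boxes in block-local coordinates.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_tile = list(pool.map(detect_tile, tiles))
    merged = set()
    for (tx, ty, _, _), boxes in zip(tiles, per_tile):
        for x1, y1, x2, y2 in boxes:
            # Shift back to picture coordinates; the set drops exact duplicates
            # produced by objects lying inside an overlapping region.
            merged.add((x1 + tx, y1 + ty, x2 + tx, y2 + ty))
    return sorted(merged)
```

Sizing the overlap from the largest historical box means an object no larger than that box lies entirely inside at least one block, so it is not cut in half at a block boundary.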
[0132] In this embodiment of the application, during video blurring, the amount of video frame change is analyzed between a first video frame and a second video frame that are different forward predictive coding frames (P-frames) belonging to the same image group. By combining this change amount with the target detection result of the second video frame, for which target detection has already been completed, the target detection result of the first video frame can be determined without necessarily running the target detection algorithm on the first video frame. This avoids repeating a complex target detection algorithm on every video frame and reduces unnecessary computation. Because repeated target object detection operations are reduced, the algorithm interface is called less often. When a large number of video frames must be processed, the processing time is therefore shortened and the efficiency of the whole video blurring process is improved. This alleviates the performance bottleneck caused by time-consuming algorithm interface calls and helps maintain performance for high-resolution video or real-time processing, avoiding stuttering and similar problems caused by slow processing.
[0133] The video processing apparatus provided in this application embodiment can execute the video processing method provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects for executing the video processing method.
[0134] The various units and modules included in the above-mentioned device are divided according to functional logic. The actual division only needs to be able to realize the corresponding functions. In addition, the specific names of each functional unit are only for easy differentiation between them.
[0135] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. Figure 6 shows an electronic device 500 (e.g., a terminal device or server) suitable for implementing embodiments of this application. The terminal device in the embodiments of this application may include mobile terminals such as mobile phones, laptops, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 6 is only an example.
[0136] As shown in Figure 6, the electronic device 500 may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) 501, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 502 or a program loaded from storage device 508 into random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing unit 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input / output (I / O) interface 505 is also connected to the bus 504.
[0137] Typically, the following devices can be connected to I / O interface 505: input devices 506 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 507 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 508 including, for example, magnetic tapes, hard disks, etc.; and communication devices 509. Communication device 509 allows electronic device 500 to communicate wirelessly or wiredly with other devices to exchange data. Although Figure 6 shows an electronic device 500 with various devices, it is not required to implement or possess all of the devices shown. More or fewer devices may be implemented or possessed alternatively.
[0138] According to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the video processing method shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication device 509, installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing unit 501, it performs the functions defined in the video processing method of the embodiments of this application.
[0139] The names of messages or information exchanged between multiple devices in the embodiments of this application are for illustrative purposes only.
[0140] The electronic device provided in this embodiment and the video processing method provided in the above embodiments belong to the same application concept. Technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
[0141] This application provides a computer storage medium storing a computer program that, when executed by a processor, implements the video processing method provided in the above embodiments.
[0142] It should be noted that the computer-readable medium described above in this application can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. A computer-readable storage medium can include: an electrical connection having one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM), flash memory, optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including: wires, optical fibers, radio frequency (RF), etc., or any suitable combination thereof.
[0143] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as Hypertext Transfer Protocol (HTTP), and can interconnect with digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0144] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device.
[0145] The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: determine the amount of video frame change of the video picture corresponding to a first video frame relative to the video picture corresponding to a second video frame, wherein the first video frame and the second video frame are different forward predictive coding frames belonging to the same image group, and the image group to which the first video frame and the second video frame belong includes only independently coded keyframes and forward predictive coding frames; determine the target detection result corresponding to the first video frame based on the amount of video frame change and the target detection result corresponding to the second video frame, wherein a target detection result indicates the result of detecting a target object in the video picture corresponding to a video frame; and blur the video picture corresponding to the first video frame based on the target detection result corresponding to the first video frame.
[0146] Computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof. These programming languages include object-oriented programming languages—such as Java, Smalltalk, and C++—and conventional procedural programming languages—such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including LANs or WANs—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0147] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0148] The units described in the embodiments of this application can be implemented in software or in hardware. The name of a unit does not necessarily limit the unit itself; for example, the first acquisition unit can also be described as "a unit that acquires at least two Internet Protocol addresses".
[0149] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include: Field-Programmable Gate Array (FPGA), Application-Specific Integrated Circuit (ASIC), Application-Specific Standard Product (ASSP), System on a Chip (SOC), Complex Programmable Logic Device (CPLD), and so on.
[0150] In the context of this application, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. A machine-readable storage medium can include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM, flash memory, optical fiber, CD-ROM, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
Claims
1. A video processing method, the method comprising: determining the amount of video frame change of the video picture corresponding to a first video frame relative to the video picture corresponding to a second video frame, wherein the first video frame and the second video frame are different forward predictive coding frames belonging to the same image group, and the image group to which the first video frame and the second video frame belong includes only independently coded keyframes and forward predictive coding frames; determining, based on the amount of video frame change and the target detection result corresponding to the second video frame, the target detection result corresponding to the first video frame, wherein a target detection result indicates the result of detecting a target object in the video picture corresponding to a video frame; and blurring the video picture corresponding to the first video frame based on the target detection result corresponding to the first video frame.
2. The method according to claim 1, wherein determining the amount of video frame change of the video picture corresponding to the first video frame relative to the video picture corresponding to the second video frame comprises: determining a first data quantity associated with the first video frame, wherein the first data quantity is obtained by accumulating the data quantities associated with at least one first reference frame; the data quantity associated with a first reference frame is the quantity of motion vector data and residual value data of that first reference frame relative to the previous video frame in the image group to which it belongs; a first reference frame is the first video frame itself or a video frame located between the second video frame and the first video frame, the range including the first video frame; the first video frame is a video frame that, in the video encoding process, relies on a second reference frame for predictive coding in order to recover the complete video picture; and a second reference frame is the second video frame itself or a video frame located between the second video frame and the first video frame, the range including the second video frame; and determining, based on the first data quantity associated with the first video frame, the amount of video frame change of the video picture corresponding to the first video frame relative to the video picture corresponding to the second video frame.
3. The method according to claim 1 or 2, wherein before determining the amount of video frame change of the video picture corresponding to the first video frame relative to the video picture corresponding to the second video frame, the method further comprises: determining a reference video frame change threshold to be used for the first video frame relative to the second video frame, wherein the reference video frame change threshold is the upper limit of video frame change for which the similarity between the target detection result corresponding to the first video frame and the target detection result corresponding to the second video frame remains greater than a preset similarity.
4. The method according to claim 2, wherein, The step of determining the target detection result corresponding to the first video frame based on the change in the video frame and the target detection result corresponding to the second video frame includes: In response to the fact that the change in the video frame is not greater than the threshold of the change in the reference video frame, the target detection result corresponding to the second video frame is directly reused as the target detection result corresponding to the first video frame.
5. The method according to claim 2, wherein, The step of determining the target detection result corresponding to the first video frame based on the change in the video frame and the target detection result corresponding to the second video frame includes: In response to the video frame change being greater than the reference video frame change threshold, a target object detection operation is performed on the video frame corresponding to the first video frame.
6. The method according to claim 3, wherein, Determining the threshold for the change in reference video frame to be used relative to the second video frame includes: Determine the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame, wherein the third video frame and the second video frame belong to the same image group, and the third video frame is a forward predictive coding frame or key frame located before and adjacent to the second video frame. In response to the fact that the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is not less than a preset similarity threshold, the second data quantity associated with the second video frame is determined as the reference video frame change threshold to be used by the first video frame relative to the second video frame. The second data quantity associated with the second video frame is the amount of data generated by the motion vector data and residual value data of the second video frame compared to the third video frame.
7. The method according to claim 6, wherein, Determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame includes: Determine the number of target objects in the target detection result corresponding to the second video frame and the number of target objects in the target detection result corresponding to the third video frame; In response to the fact that the number of two target objects is different, the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is determined to be a preset value, which is used to indicate that the similarity between the two target detection results is 0; In response to the fact that the number of two target objects is the same, first size information and second size information are determined. The first size information is used to indicate the size of the overlapping area of the target detection boxes of the same target object in the target detection results of the second video frame and the target detection results of the third video frame. The second size information is used to indicate the size of the union area of the area covered by the two target detection boxes of the same target object in the target detection results of the second video frame and the target detection results of the third video frame. Based on the size ratio of the first size information to the second size information, the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame is determined.
8. The method according to claim 6, wherein before determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame, the method further comprises: taking the second video frame and the third video frame each as a video frame to be processed, and dividing the video picture corresponding to the video frame to be processed into blocks to obtain multiple picture blocks to be processed; performing the target object detection operation in parallel on the multiple picture blocks to be processed to obtain the target detection result corresponding to each picture block; and merging the target detection results corresponding to the multiple picture blocks to obtain the target detection result corresponding to the video frame to be processed.
9. The method according to claim 8, wherein rectangular overlapping areas exist between adjacent picture blocks among the multiple picture blocks to be processed; the vertical height of the rectangular overlapping area is not less than a reference vertical height, and the horizontal width of the rectangular overlapping area is not less than a reference horizontal width; the reference vertical height is determined based on the maximum vertical height of the target detection boxes of target objects in the target detection results corresponding to historical video frames collected before the video frame to be processed; and the reference horizontal width is determined based on the maximum horizontal width of the target detection boxes of target objects in the target detection results corresponding to those historical video frames.
10. A video processing apparatus, the apparatus comprising: The first determining module is configured to determine the amount of video frame change that occurs between the video frame corresponding to the first video frame and the video frame corresponding to the second video frame. The first video frame and the second video frame are different forward predictive coding frames belonging to the same image group. The image group to which the first video frame and the second video frame belong only includes independently coded keyframes and forward predictive coding frames. The second determining module is configured to determine the target detection result corresponding to the first video frame based on the change amount of the video frame and the target detection result corresponding to the second video frame. The target detection result is used to indicate the result of the detection of the target object in the video frame corresponding to the video frame. The blurring module is configured to blur the video frame corresponding to the first video frame based on the target detection result corresponding to the first video frame.
11. The apparatus according to claim 10, wherein, The first determining module is configured as follows: A first data quantity associated with a first video frame is determined. The first data quantity is the data quantity generated by accumulating the data quantities associated with at least one first reference frame. The data quantity associated with the first reference frame is the motion vector data and residual value data of the first reference frame relative to the previous video frame in the image group to which the first reference frame belongs. The first reference frame includes the first video frame or a video frame located between the first video frame and the second video frame and containing the first video frame. The first video frame is a video frame that needs to rely on the second reference frame for predictive coding in the video encoding process to recover the complete video picture. The second reference frame is the second video frame or a video frame located between the first video frame and the second video frame and containing the second video frame. Based on the first data volume associated with the first video frame, determine the amount of video frame change that occurs between the video frame corresponding to the first video frame and the video frame corresponding to the second video frame.
12. The apparatus according to claim 10 or 11, further comprising: a threshold determination module configured to determine, before the amount of video frame change of the video frame corresponding to the first video frame relative to the video frame corresponding to the second video frame is determined, a reference video frame change threshold to be used for the first video frame relative to the second video frame, wherein the reference video frame change threshold is a video frame change threshold at which, for the video frame corresponding to the first video frame and the video frame corresponding to the second video frame, the similarity between the target detection result corresponding to the first video frame and the target detection result corresponding to the second video frame is greater than a preset similarity upper limit value.
13. The apparatus according to claim 11, wherein the second determining module is configured to: in response to the amount of video frame change being not greater than the reference video frame change threshold, directly reuse the target detection result corresponding to the second video frame as the target detection result corresponding to the first video frame.
14. The apparatus according to claim 11, wherein the second determining module is configured to: in response to the amount of video frame change being greater than the reference video frame change threshold, perform a target object detection operation on the video frame corresponding to the first video frame.
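The two branches of claims 13 and 14 can be sketched together as a single decision. This is only an illustration under stated assumptions: `resolve_detections`, the `Box` tuple layout and the `detect` callback are hypothetical names, and the detector itself is left abstract.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical box layout: (x1, y1, x2, y2)


def resolve_detections(
    change_amount: float,
    reference_threshold: float,
    previous_boxes: List[Box],
    detect: Callable[[], List[Box]],
) -> List[Box]:
    """Reuse the second frame's result when the change is small, otherwise detect again."""
    if change_amount <= reference_threshold:
        # Claim 13: change not greater than the threshold, so reuse directly.
        return list(previous_boxes)
    # Claim 14: change greater than the threshold, so run the detector on this frame.
    return detect()
```

Passing the detector as a callback means the detection interface is invoked only on the second branch, which is the saving in interface calls that the application is aiming for.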
15. The apparatus according to claim 12, wherein the threshold determination module is configured to: determine the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to a third video frame, wherein the third video frame and the second video frame belong to the same image group, and the third video frame is a forward predictive coding frame or key frame located before and adjacent to the second video frame; and in response to the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame being not less than a preset similarity threshold, determine a second data quantity associated with the second video frame as the reference video frame change threshold to be used for the first video frame relative to the second video frame, wherein the second data quantity associated with the second video frame is the quantity of motion vector data and residual value data of the second video frame relative to the third video frame.
16. The apparatus according to claim 15, wherein determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame includes: determining the number of target objects in the target detection result corresponding to the second video frame and the number of target objects in the target detection result corresponding to the third video frame; in response to the two numbers of target objects being different, determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame to be a preset value, the preset value indicating that the similarity between the two target detection results is 0; and in response to the two numbers of target objects being the same, determining first size information and second size information, wherein the first size information indicates the size of the overlapping area of the two target detection boxes of the same target object in the target detection result of the second video frame and in the target detection result of the third video frame, and the second size information indicates the size of the union area covered by those two target detection boxes, and determining, based on the ratio of the first size information to the second size information, the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame.
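The similarity of claim 16 is an intersection-over-union ratio, and claim 15 uses it to decide whether the second data quantity becomes the reference threshold. The sketch below illustrates both under several assumptions that the claims leave open: boxes are `(x1, y1, x2, y2)` tuples, boxes at the same list position belong to the same target object, per-object ratios are averaged when there are several objects, and two empty results are treated as identical. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # hypothetical layout: (x1, y1, x2, y2)


def box_area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Overlap area (first size information) over union area (second size information)."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def result_similarity(second: List[Box], third: List[Box]) -> float:
    """Similarity between the detection results of the second and third video frames."""
    if len(second) != len(third):
        return 0.0  # preset value of claim 16: different object counts mean similarity 0
    if not second:
        return 1.0  # assumption: two empty results count as identical
    # Assumption: same list position means same target object; ratios are averaged.
    return sum(iou(a, b) for a, b in zip(second, third)) / len(second)


def reference_threshold(similarity: float, preset: float, second_data_quantity: int) -> Optional[int]:
    """Claim 15: adopt the second data quantity as the threshold when the results
    are similar enough. The claim does not say what happens otherwise, so this
    sketch returns None in that case."""
    return second_data_quantity if similarity >= preset else None
```

A real system would need an explicit association step (for example by tracking identifier) before comparing boxes; matching by list position is used here only to keep the sketch short.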
17. The apparatus according to claim 16, wherein the threshold determination module is further configured to: before determining the similarity between the target detection result corresponding to the second video frame and the target detection result corresponding to the third video frame, take the second video frame and the third video frame each as a video frame to be processed, and divide the video frame corresponding to the video frame to be processed into blocks to obtain multiple video blocks to be processed; perform the target object detection operation in parallel on the multiple video blocks to be processed to obtain the target detection result corresponding to each video block to be processed; and merge the target detection results corresponding to the multiple video blocks to be processed to obtain the target detection result corresponding to the video frame to be processed.
18. The apparatus according to claim 17, wherein, among the multiple video blocks to be processed, adjacent video blocks to be processed share a rectangular overlapping area; the vertical height of the rectangular overlapping area is not less than a reference vertical height, and the horizontal width of the rectangular overlapping area is not less than a reference horizontal width; the reference vertical height is determined based on the maximum vertical height of the target detection boxes of target objects in the target detection results of historical video frames collected before the video frame to be processed; and the reference horizontal width is determined based on the maximum horizontal width of the target detection boxes of target objects in the target detection results of the historical video frames collected before the video frame to be processed.
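The block division, parallel detection and merging of claims 17 and 18 can be sketched as follows. This is an illustration only: the function names, the tuple layouts and the `detect_tile` callback are hypothetical, the detector is assumed to return boxes in block-local pixel coordinates, and merging is simplified to removing exact duplicates after translation into frame coordinates (a practical merge would more likely use a tolerance-based step such as non-maximum suppression, which the claims do not specify).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2)
Tile = Tuple[int, int, int, int]  # (x1, y1, x2, y2) region of the full frame


def reference_overlap(history: Sequence[Box]) -> Tuple[int, int]:
    """Reference width and height: the largest box width and height seen so far."""
    if not history:
        return 0, 0
    return max(b[2] - b[0] for b in history), max(b[3] - b[1] for b in history)


def tile_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets such that consecutive tiles overlap by at least `overlap`."""
    if tile <= overlap:
        raise ValueError("tile size must exceed the overlap")
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, tile - overlap))
    starts.append(length - tile)  # final tile is flush with the frame edge
    return starts


def split_frame(width: int, height: int, tile_w: int, tile_h: int,
                ref_w: int, ref_h: int) -> List[Tile]:
    """Divide a frame into tiles whose overlaps are at least ref_w by ref_h."""
    return [
        (x, y, min(x + tile_w, width), min(y + tile_h, height))
        for y in tile_starts(height, tile_h, ref_h)
        for x in tile_starts(width, tile_w, ref_w)
    ]


def detect_tiled(tiles: Sequence[Tile],
                 detect_tile: Callable[[Tile], List[Box]]) -> List[Box]:
    """Run the detector on all tiles in parallel and merge into frame coordinates."""
    with ThreadPoolExecutor() as pool:
        per_tile = list(pool.map(detect_tile, tiles))
    merged = set()
    for (tx, ty, _, _), boxes in zip(tiles, per_tile):
        for x1, y1, x2, y2 in boxes:
            merged.add((x1 + tx, y1 + ty, x2 + tx, y2 + ty))  # tile-local to frame
    return sorted(merged)
```

Sizing the overlap from the largest previously observed box means that an object no larger than that box which straddles a block boundary still fits entirely inside at least one block, which is presumably why claim 18 ties the overlap to historical detection results.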
19. An electronic device, the electronic device comprising: one or more processors; and a storage device configured to store one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method according to any one of claims 1-9.
20. A storage medium comprising computer-executable instructions, which, when executed by a computer processor, are used to perform the video processing method according to any one of claims 1-9.