A video processing method and apparatus

By encoding and compressing videos, fusing information, and processing spatiotemporally adaptive feature maps, the problems of slow video processing speed and low accuracy are solved, and efficient video processing results are achieved.

CN115661706B · Active · Publication Date: 2026-06-16 · SHENZHEN XUMI YUNTU SPACE TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHENZHEN XUMI YUNTU SPACE TECH CO LTD
Filing Date: 2022-10-20
Publication Date: 2026-06-16


Abstract

The present disclosure provides a video processing method and device. After obtaining a to-be-processed video, the method can first perform encoding compression processing on the to-be-processed video to obtain information coding features of the to-be-processed video; then, information fusion processing can be performed on the information coding features to obtain a multi-space field feature map and a multi-time field feature map; then, according to the multi-space field feature map and the multi-time field feature map, a space-time mutual adaptation feature map can be determined; finally, preset type video processing can be performed on the space-time mutual adaptation feature map to obtain a preset type processing result corresponding to the to-be-processed video. In this way, in the video processing process, multi-field, multi-span and space-time information mutual adaptation fusion are realized to fully mine information, thereby improving the accuracy of the video processing result. Furthermore, the method provided by the present disclosure not only improves the processing speed and efficiency of video processing, but also improves the accuracy of the video processing result.


Description

Technical Field

[0001] This disclosure relates to the field of image processing technology, and in particular to a video processing method and apparatus.

Background Technology

[0002] Current image processing technologies are beginning to employ visual neural networks. However, while such networks are highly efficient at processing two-dimensional images, they are slow and inefficient at processing video. Furthermore, they lack multi-view or multi-span processing in the spatial and temporal dimensions, which results in low accuracy of video processing results. There is therefore an urgent need for a new video processing method that balances efficiency and accuracy.

Summary of the Invention

[0003] In view of this, the present disclosure provides a video processing method, apparatus, computer device, and computer-readable storage medium to solve the problems of slow video processing speed, low efficiency, and low accuracy of video processing results in the prior art.

[0004] A first aspect of this disclosure provides a video processing method, the method comprising:

[0005] A video to be processed is acquired, and the video is encoded and compressed to obtain its information encoding features;

[0006] The information encoding features are subjected to information fusion processing to obtain multi-spatial field-of-view feature maps and multi-temporal field-of-view feature maps;

[0007] Based on the multi-spatial view feature map and the multi-temporal view feature map, a spatiotemporal adaptation feature map is determined;

[0008] The spatiotemporal adaptive feature map is subjected to preset type video processing to obtain the preset type processing result corresponding to the video to be processed.

[0009] A second aspect of this disclosure provides a video processing apparatus, the apparatus comprising:

[0010] The feature extraction unit is used to acquire the video to be processed and to encode and compress the video to be processed to obtain the information encoding features of the video to be processed.

[0011] The information fusion unit is used to perform information fusion processing on the information encoding features to obtain a multi-spatial field-view feature map and a multi-temporal field-view feature map.

[0012] The feature determination unit is used to determine a spatiotemporal adaptation feature map based on the multi-spatial view feature map and the multi-temporal view feature map;

[0013] The result determination unit is used to perform preset type video processing on the spatiotemporal adaptive feature map to obtain the preset type processing result corresponding to the video to be processed.

[0014] A third aspect of this disclosure provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described above.

[0015] A fourth aspect of this disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.

[0016] Compared with the prior art, the embodiments of this disclosure have the following beneficial effects. The embodiments provide a video processing method that, after acquiring a video to be processed, first encodes and compresses the video to obtain its information encoding features. It then performs information fusion processing on the information encoding features to obtain a multi-spatial field-of-view feature map and a multi-temporal field-of-view feature map, determines a spatiotemporal adaptation feature map based on these two feature maps, and finally performs preset type video processing on the spatiotemporal adaptation feature map to obtain the preset type processing result corresponding to the video to be processed. In this application, the extraction of video information encoding features and the processing of video information are therefore handled separately, which speeds up feature extraction. During video information processing, the multi-spatial and multi-temporal view feature maps of the video to be processed are extracted and used to determine the spatiotemporal adaptation feature map, on which the preset type video processing is performed. The video processing thus mines information fully through multi-view, multi-span, spatiotemporally adaptive fusion, which improves the accuracy of the video processing results. The method provided in this application therefore improves both the speed and efficiency of video processing and the accuracy of its results.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure;

[0019] Figure 2 is a flowchart of the video processing method provided in the embodiments of this disclosure;

[0020] Figure 3 is a block diagram of the video processing apparatus provided in the embodiments of this disclosure;

[0021] Figure 4 is a schematic diagram of a computer device provided in an embodiment of this disclosure.

Detailed Implementation

[0022] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, so as to provide a thorough understanding of the embodiments of this disclosure. However, those skilled in the art will understand that this disclosure may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this disclosure with unnecessary detail.

[0023] A video processing method and apparatus according to embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings.

[0024] Current image processing technologies are slow and inefficient when processing video. Furthermore, they lack multi-view or multi-span processing methods in both spatial and temporal dimensions, resulting in low accuracy. Therefore, a new video processing method that balances efficiency and accuracy is urgently needed. In other words, the efficiency and accuracy of existing methods still fall short of practical application requirements and require further research and improvement.

[0025] To address the aforementioned problems, this invention provides a video processing method. After acquiring the video to be processed, the method first encodes and compresses the video to obtain its information encoding features. Then, it performs information fusion processing on these information encoding features to obtain a multi-spatial field-of-view feature map and a multi-temporal field-of-view feature map. Next, it determines a spatiotemporal adaptation feature map based on these two maps. Finally, it performs preset-type video processing on the spatiotemporal adaptation feature map to obtain a preset-type processing result corresponding to the video to be processed. As can be seen, in this application, the extraction of video information coding features and the processing of video information are handled separately. This improves the speed of feature extraction during the extraction process. During video information processing, multi-spatial and multi-temporal view feature maps corresponding to the video to be processed are extracted to determine a spatiotemporal adaptive feature map. This spatiotemporal adaptive feature map is then used to perform preset type video processing to obtain the preset type processing result for the video to be processed. Thus, the video processing process achieves full information mining across multiple views, spans, and spatiotemporal information adaptive fusion, thereby improving the accuracy of the video processing results. Therefore, the method provided in this application not only improves the processing speed and efficiency of video processing but also enhances the accuracy of the video processing results.

[0026] For example, embodiments of the present disclosure can be applied to the application scenario shown in Figure 1. The application scenario can include terminal device 1 and server 2.

[0027] Terminal device 1 can be hardware or software. When terminal device 1 is hardware, it can be various electronic devices with image acquisition capabilities and supporting communication with server 2, including but not limited to smartphones, tablets, laptops, and desktop computers; when terminal device 1 is software, it can be installed in the aforementioned electronic devices. Terminal device 1 can be implemented as multiple software programs or software modules, or as a single software program or software module; this embodiment of the disclosure does not impose any limitations on this. Server 2 can be a server that provides various services, such as a backend server that receives requests sent by terminal devices with which it has established communication connections. This backend server can receive and analyze the requests sent by the terminal devices and generate processing results. Server 2 can be a single server, a server cluster consisting of several servers, or a cloud computing service center; this embodiment of the disclosure does not impose any limitations on this.

[0028] It should be noted that server 2 can be either hardware or software. When server 2 is hardware, it can be various electronic devices that provide various services to terminal device 1. When server 2 is software, it can be multiple software programs or software modules that provide various services to terminal device 1, or it can be a single software program or software module that provides various services to terminal device 1. This disclosure does not impose any limitations on this aspect.

[0029] Terminal device 1 and server 2 can communicate via a network. The network can be a wired network using coaxial cable, twisted pair, or optical fiber, or a wireless network that interconnects communication devices without wiring, such as Bluetooth, Near Field Communication (NFC), or infrared. This embodiment of the disclosure does not impose any limitation on this.

[0030] Specifically, a user can input the video to be processed via terminal device 1, which then sends the video to be processed to server 2. Server 2 first encodes and compresses the video to obtain its information encoding features; then, it performs information fusion processing on these features to obtain multi-spatial and multi-temporal view feature maps; next, it determines a spatiotemporal adaptation feature map based on these maps; finally, it performs preset type video processing on the spatiotemporal adaptation feature map to obtain the preset type processing result corresponding to the video. Server 2 returns the preset type processing result to terminal device 1 so that terminal device 1 can display the result to the user. This improves both the speed and efficiency of video processing and the accuracy of the processing results.

[0031] It should be noted that the specific types, quantities, and combinations of terminal device 1, server 2, and network can be adjusted according to the actual needs of the application scenario, and this disclosure embodiment does not impose any restrictions on this.

[0032] It should be noted that the above application scenarios are shown only for the purpose of understanding this disclosure, and the implementation of this disclosure is not limited in any way. On the contrary, the implementation of this disclosure can be applied to any applicable scenario.

[0033] Figure 2 is a flowchart of a video processing method provided in an embodiment of this disclosure. The video processing method of Figure 2 can be executed by the terminal device or the server of Figure 1. As shown in Figure 2, the video processing method includes:

[0034] S201: Obtain the video to be processed, and encode and compress the video to obtain the information encoding features of the video to be processed.

[0035] In this embodiment, the video to be processed is the video on which processing is to be performed. As an example, it can be captured by a surveillance camera installed at a fixed location, captured by a mobile terminal device, or read from a storage device in which it has been stored in advance. In one implementation, the video to be processed can be a segment extracted from a longer video. To further reduce the amount of data processed and improve the efficiency of video processing, the video to be processed can be a video containing a preset number of frames per second (e.g., 8 frames). If the number of frames per second of the video to be processed does not match the preset number, the number of frames per second can be reduced or increased until it does. For example, if the video to be processed contains 64 frames per second and lasts 8 seconds, frames can be dropped so that the video contains 8 frames per second.
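The frame-count adjustment described above can be sketched as an index selection. This is only an illustration: the source does not say how frames are chosen, so the even spacing below is an assumption, and `resample_frame_indices` is a hypothetical helper name.

```python
from typing import List


def resample_frame_indices(source_fps: int, target_fps: int, duration_s: int) -> List[int]:
    """Return the source-frame indices kept so each second holds target_fps frames.

    Assumption: frames are picked at evenly spaced positions within each second.
    When target_fps exceeds source_fps, some indices repeat (frames are duplicated).
    """
    if source_fps <= 0 or target_fps <= 0 or duration_s <= 0:
        raise ValueError("all arguments must be positive")
    indices = []
    for second in range(duration_s):
        for k in range(target_fps):
            # Evenly spaced offset inside the current second, rounded down.
            offset = (k * source_fps) // target_fps
            indices.append(second * source_fps + offset)
    return indices


# The example from the text: 64 frames per second, 8 seconds, reduced to 8 frames per second.
kept = resample_frame_indices(source_fps=64, target_fps=8, duration_s=8)
```

With these inputs the 512 source frames are reduced to 64 kept frames, one out of every eight.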

[0036] After the video to be processed is acquired, it can be encoded and compressed to obtain its information encoding features, so that feature extraction and compression are completed quickly. For example, image information can be extracted first, image features can then be extracted from it, and the extracted image features can be encoded and compressed to obtain the information encoding features of the video. The information encoding features can be understood as compressed, encoded feature vectors that reflect information about the video's time, space, channels, and content.

[0037] S202: Perform information fusion processing on the information encoding features to obtain multi-spatial view feature maps and multi-temporal view feature maps.

[0038] After obtaining the information encoding features of the video to be processed, these features can be fused from both the temporal and spatial dimensions. The following section will introduce how to fuse these features from both the temporal and spatial dimensions to obtain multi-spatial view feature maps and multi-temporal view feature maps.

[0039] Specifically, multi-view calculation can first be performed on the information encoding features in the temporal dimension to obtain multiple pieces of temporal dimension information with different view sizes; that is, each piece of temporal dimension information corresponds to a different view size. The multiple pieces of temporal dimension information can then be fused according to the spatial dimension to obtain the multi-temporal view feature map; that is, temporal information is fused according to spatial information. In this way, the information encoding features are further encoded: multi-view calculation in the temporal dimension yields temporal dimension information with multiple views, and the information in the spatial dimension is used to fuse the multi-scale information in the temporal dimension (i.e., the multiple pieces of temporal dimension information with different view sizes).

[0040] Furthermore, multi-view calculation can first be performed on the information encoding features in the spatial dimension to obtain multiple pieces of spatial dimension information with different view sizes; that is, each piece of spatial dimension information corresponds to a different view size. The multiple pieces of spatial dimension information can then be fused according to the temporal dimension to obtain the multi-spatial view feature map; that is, spatial information is fused according to temporal information. In this way, the information encoding features are further encoded: multi-view calculation in the spatial dimension yields spatial dimension information with multiple views, and the information in the temporal dimension is used to fuse the multi-scale information in the spatial dimension (i.e., the multiple pieces of spatial dimension information with different view sizes).

[0041] S203: Determine the spatiotemporal adaptation feature map based on the multi-spatial view feature map and the multi-temporal view feature map.

[0042] In this embodiment, after determining the multi-spatial view feature map and the multi-temporal view feature map corresponding to the video to be processed, the spatiotemporal adaptation feature map can be determined using the multi-spatial view feature map and the multi-temporal view feature map. The spatiotemporal adaptation feature map can be understood as a feature map that reflects the relationship between the temporal and spatial dimensions of the video to be processed; that is, the spatiotemporal adaptation feature map reflects the relationship between the various features corresponding to the video to be processed.

[0043] As an example, after obtaining the multi-spatial and multi-temporal view feature maps of the video to be processed, the multi-spatial and multi-temporal view feature maps can be stacked to obtain a multi-view fusion feature map, which is a fusion feature map of multiple view sizes. Then, convolution is performed on this multi-view fusion feature map to obtain a spatiotemporal adaptive feature map.
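The stack-then-convolve step can be sketched on nested lists indexed as [channel][time][height][width]. The source does not state the kernel size of this convolution, so the 1x1x1 (channel-mixing) kernel used here is an assumption, and the weights and helper names are hypothetical.

```python
def stack_channels(*maps):
    """Concatenate feature maps shaped [c][t][h][w] along the channel axis."""
    stacked = []
    for feature_map in maps:
        stacked.extend(feature_map)
    return stacked


def pointwise_conv(fmap, weights):
    """Apply a 1x1x1 convolution (assumed kernel size, no bias).

    weights[o][i] is the contribution of input channel i to output channel o.
    """
    c_in, t = len(fmap), len(fmap[0])
    h, w = len(fmap[0][0]), len(fmap[0][0][0])
    return [
        [[[sum(weights[o][i] * fmap[i][ti][hi][wi] for i in range(c_in))
           for wi in range(w)]
          for hi in range(h)]
         for ti in range(t)]
        for o in range(len(weights))
    ]


# Tiny stand-ins for the multi-spatial and multi-temporal view feature maps (c=1, t=1, h=1, w=2).
multi_spatial = [[[[1.0, 2.0]]]]
multi_temporal = [[[[3.0, 4.0]]]]

stacked = stack_channels(multi_spatial, multi_temporal)  # 2 channels
fused = pointwise_conv(stacked, weights=[[0.5, 0.5]])     # 1 output channel
```

Here the single output channel is the average of the two stacked channels at every position; a trained layer would learn its own weights.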

[0044] S204: Perform preset type video processing on the spatiotemporal adaptive feature map to obtain the preset type processing result corresponding to the video to be processed.

[0045] Since the spatiotemporal adaptation feature map reflects the features of the video to be processed, it can be used to perform various types of video processing and obtain the corresponding results. For example, the preset type of video processing includes at least one of the following: video content detection, video content recognition, and action recognition. After the spatiotemporal adaptation feature map of the video to be processed is determined, the preset type of video processing can be performed on it to obtain the preset type processing result corresponding to the video to be processed. For example, if the preset type is video content detection and the content to be detected is a door, the door in the video can be detected using the spatiotemporal adaptation feature map to obtain a door detection result. If the preset type is video content recognition, the content of the video can be recognized using the spatiotemporal adaptation feature map, yielding, for example, "a golden retriever is eating". If the preset type is action recognition, the action in the video can be recognized using the spatiotemporal adaptation feature map, yielding, for example, the result that a door-closing action occurs.

[0046] Thus, the method provided in this embodiment, after acquiring the video to be processed, can first encode and compress the video to obtain the information encoding features of the video to be processed; then, it can perform information fusion processing on the information encoding features to obtain a multi-spatial field-of-view feature map and a multi-temporal field-of-view feature map; next, it can determine a spatiotemporal adaptation feature map based on the multi-spatial field-of-view feature map and the multi-temporal field-of-view feature map; finally, it can perform preset type video processing on the spatiotemporal adaptation feature map to obtain the preset type processing result corresponding to the video to be processed. As can be seen, in this application, the extraction of video information coding features and the processing of video information are handled separately. This improves the speed of feature extraction during the extraction process. During video information processing, multi-spatial and multi-temporal view feature maps corresponding to the video to be processed are extracted to determine a spatiotemporal adaptive feature map. The spatiotemporal adaptive feature map is then used to perform preset type video processing to obtain the preset type processing result for the video to be processed. This achieves full information mining (i.e., extraction of deep spatiotemporal information) across multiple views and spans through spatiotemporal adaptive fusion during video processing, thereby improving the accuracy of the video processing results. Therefore, the method provided in this application not only improves the processing speed and efficiency of video processing but also enhances the accuracy of the video processing results.

[0047] Next, one implementation of "encoding and compressing the video to be processed to obtain the information encoding features of the video to be processed" in S201 is introduced. In one implementation, the method corresponding to Figure 2 can be applied to a two-order spatiotemporal transformation model, which includes an initial-order spatiotemporal transformation sub-model. In this embodiment, the step of encoding and compressing the video to be processed to obtain the information encoding features of the video to be processed may include the following step:

[0048] The video to be processed is input into the initial-order spatiotemporal transformation sub-model to obtain the information encoding features of the video to be processed.

[0049] The initial-order spatiotemporal transformation sub-model comprises several initial-order spatiotemporal transformation modules, which are connected in series. Each initial-order spatiotemporal transformation module includes a spatial convolutional layer, a first residual convolutional layer, a second residual convolutional layer, and a third residual convolutional layer. For example, in one implementation, the initial-order spatiotemporal transformation sub-model comprises eight initial-order spatiotemporal transformation modules.

[0050] In one implementation, the spatial convolutional layer can be a single 1x3x3 convolutional layer with 32 channels. The first residual convolutional layer comprises two modules. Each module consists of two 1x3x3 convolutional layers with 32 groups and 32 channels, a batch normalization (BN) layer with 32 channels, a 1x1x1 convolutional layer with 64 channels, a PReLU activation function layer, a 3x1x1 convolutional layer with 32 channels, and a batch normalization (BN) layer with 32 channels. The first module downsamples by a factor of 2, and each module follows a residual design. The feature map f1 output by the first residual convolutional layer has dimensions (32, 64, 224, 224). The second residual convolutional layer comprises two modules. Each module consists of two 1x3x3 convolutional layers with 64 groups and 64 channels, a batch normalization (BN) layer with 64 channels, a 1x1x1 convolutional layer with 128 channels, a PReLU activation function layer, a 3x1x1 convolutional layer with 64 channels, and a batch normalization (BN) layer with 64 channels. The first module downsamples by a factor of 2, and each module follows a residual design. The feature map f2 output by the second residual convolutional layer has dimensions (64, 64, 112, 112). The third residual convolutional layer comprises three modules. Each module consists of three 1x3x3 convolutional layers with 128 groups and 128 channels, a batch normalization (BN) layer with 128 channels, a 1x1x1 convolutional layer with 256 channels, a PReLU activation function layer, a 3x1x1 convolutional layer with 128 channels, and a batch normalization (BN) layer with 128 channels. The first module downsamples by a factor of 2, and each module follows a residual design. The feature map f3 output by the third residual convolutional layer has dimensions (128, 64, 56, 56). The initial-order spatiotemporal transformation sub-model thus completes one pass of spatiotemporal and channel calculation for the video to be processed. Because the temporal dimension of a video changes little, the sub-model performs more calculation in the spatial dimension and less in the temporal dimension.
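The stated output dimensions can be cross-checked with a small shape calculation. The sketch below rests on assumptions that are not in the source: a 3-channel input clip of 64 frames at 448x448 (chosen so that the result matches the stated f1), downsampling that halves height and width only, and a spatial convolutional layer that preserves resolution. `stage_output_shape` is a hypothetical helper.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]  # (channels, frames, height, width)


def stage_output_shape(in_shape: Shape, out_channels: int, spatial_stride: int) -> Shape:
    """Shape after one residual convolutional layer.

    Assumption: the temporal length is unchanged and only height and width
    are divided by the stride of the layer's first module.
    """
    _, frames, height, width = in_shape
    return (out_channels, frames, height // spatial_stride, width // spatial_stride)


input_shape: Shape = (3, 64, 448, 448)  # assumed input, not stated in the source

f1 = stage_output_shape(input_shape, out_channels=32, spatial_stride=2)
f2 = stage_output_shape(f1, out_channels=64, spatial_stride=2)
f3 = stage_output_shape(f2, out_channels=128, spatial_stride=2)
```

Under these assumptions f1, f2, and f3 come out as (32, 64, 224, 224), (64, 64, 112, 112), and (128, 64, 56, 56). The temporal length stays at 64 throughout, consistent with the remark that the sub-model does little calculation in the temporal dimension.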

[0051] It should be noted that in this embodiment, the 1x3x3 convolutional layers are all spatial convolutional layers, the 1x1x1 convolutional layers are all channel information transformation convolutional layers, and the 3x1x1 convolutional layers are all temporal convolutional layers.
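The naming convention for kernels (time x height x width) can be captured in a small lookup. This is an illustrative helper only; the name `kernel_role` and the fallback label for kernels that span both time and space are assumptions, since the source describes only the three cases above.

```python
from typing import Tuple


def kernel_role(kernel: Tuple[int, int, int]) -> str:
    """Classify a (time, height, width) kernel by the dimensions it spans."""
    k_t, k_h, k_w = kernel
    if (k_t, k_h, k_w) == (1, 1, 1):
        return "channel"   # mixes channels only
    if k_t == 1:
        return "spatial"   # spans height/width within a single frame
    if k_h == 1 and k_w == 1:
        return "temporal"  # spans frames at a single pixel
    return "spatiotemporal"  # not used in the source; fallback for completeness


roles = {k: kernel_role(k) for k in [(1, 3, 3), (1, 1, 1), (3, 1, 1)]}
```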

[0052] Next, one implementation of "performing information fusion processing on the information encoding features to obtain multi-spatial view feature maps and multi-temporal view feature maps" in S202 and of "determining spatiotemporal adaptation feature maps based on the multi-spatial view feature maps and multi-temporal view feature maps" in S203 is introduced. In one implementation, the method corresponding to Figure 2 can be applied to a two-order spatiotemporal transformation model that includes an advanced spatiotemporal transformation model, which in turn includes a deep spatiotemporal adaptation sub-model. Each deep spatiotemporal adaptation sub-model includes a spatial dimension fusion module, a temporal dimension fusion module, and a spatiotemporal fusion module.

[0053] In this embodiment, the step of performing information fusion processing on the information encoding features to obtain multi-spatial view feature maps and multi-temporal view feature maps may include the following steps:

[0054] S202a: Input the information encoding feature into the spatial dimension fusion module, and use the spatial dimension fusion module to perform spatial dimension information fusion processing on the information encoding feature to obtain a multi-spatial view feature map.

[0055] In this embodiment, the spatial dimension fusion module may include: multiple spatial view branches, a temporal convolutional layer, and an activation function layer.

[0056] Specifically, the information encoding features can be input into the multiple spatial view branches to obtain multiple spatial view feature maps. In one implementation, the spatial dimension fusion module can include four spatial view branches, each representing a different spatial computation view, as follows: the first spatial view branch includes a 1x3x3 depthwise separable convolutional layer, a batch normalization (BN) layer, a ReLU activation function layer, and a 1x1x1 convolutional layer; the second spatial view branch includes a 1x5x5 depthwise separable convolutional layer, a group normalization (GN) layer, a PReLU activation function layer, and a 1x1x1 convolutional layer; the third spatial view branch includes a 1x7x7 depthwise separable convolutional layer, an instance normalization (IN) layer, a GELU activation function layer, and a 1x1x1 convolutional layer; the fourth spatial view branch includes three 1x3x3 depthwise separable convolutional layers, a group normalization (GN) layer, a Mish activation function layer, and a 1x1x1 convolutional layer. Assume that the input feature map of each of the four spatial view branches is p0, with dimensions (c,t,h,w). After passing through the four branches, the spatial view feature maps p1, p2, p3, and p4 are obtained respectively.
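One way to see why the four branches represent different views is to compare the spatial extent each one covers. The sketch below uses the standard formula for stacked convolutions and assumes stride 1 and no dilation, neither of which is stated in the source; the branch labels and `effective_kernel` are hypothetical names.

```python
def effective_kernel(kernel_size: int, num_layers: int) -> int:
    """Extent covered by num_layers stacked stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel_size - 1)


# (spatial kernel size, number of stacked depthwise separable layers) per branch.
SPATIAL_BRANCHES = {
    "branch_1": (3, 1),  # one 1x3x3 layer
    "branch_2": (5, 1),  # one 1x5x5 layer
    "branch_3": (7, 1),  # one 1x7x7 layer
    "branch_4": (3, 3),  # three 1x3x3 layers
}

fields = {name: effective_kernel(k, n) for name, (k, n) in SPATIAL_BRANCHES.items()}
```

Under these assumptions the branches cover 3, 5, 7, and 7 pixels per side. The fourth branch reaches the same extent as the third but does so through three smaller layers, so the two still compute different functions of the same neighbourhood.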

[0057] Then, these multiple spatial view feature maps can be stacked along the number of channels to obtain a multi-view fusion feature map. Continuing the above example, these four spatial view feature maps p1, p2, p3, and p4 can be stacked together along the number of channels to obtain a multi-view fusion feature map p5 with dimensions (4c,t,h,w).
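The channel-wise stacking can be checked at the level of shapes. A minimal sketch, in which `stacked_shape` is a hypothetical helper and the concrete sizes are illustrative only:

```python
from typing import List, Tuple

Shape = Tuple[int, int, int, int]  # (c, t, h, w)


def stacked_shape(shapes: List[Shape]) -> Shape:
    """Shape obtained by concatenating feature maps along the channel axis."""
    first = shapes[0]
    for shape in shapes[1:]:
        # Concatenation along channels requires identical (t, h, w).
        if shape[1:] != first[1:]:
            raise ValueError("all maps must share (t, h, w)")
    total_channels = sum(shape[0] for shape in shapes)
    return (total_channels, first[1], first[2], first[3])


c, t, h, w = 16, 8, 14, 14  # illustrative sizes, not taken from the source
p5_shape = stacked_shape([(c, t, h, w)] * 4)  # p1..p4 stacked into p5
```

The result has 4c channels and the same (t, h, w) as each branch output.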

[0058] Next, the multi-view fusion feature map is input into the temporal convolutional layer to obtain the temporal dimension feature map. The temporal convolutional layer can be a 3x1x1 convolutional layer with 4 channels. In one implementation, when the multi-view fusion feature map p5 is input into the temporal convolutional layer, the layer outputs a temporal dimension feature map p6 with dimensions (4,t,h,w), consistent with its 4 output channels.

[0059] Next, the temporal dimension feature map can be input into the activation function layer, which performs an activation function calculation on it along the channel dimension to obtain multiple feature map weight values. The feature map weight values correspond one-to-one with the spatial view feature maps. In other words, each spatial view feature map output by a spatial view branch receives an adaptively calculated weight value, and that weight value is calculated based on the temporal dimension. In one implementation, the activation function layer can be a softmax function, which computes feature map p7 by applying softmax to the temporal dimension feature map along the first axis (i.e., the channel dimension). Feature map p7 can then be split into four feature map weight values with dimensions (1,t,h,w), denoted p71, p72, p73, and p74, which correspond one-to-one with the four spatial view feature maps p1, p2, p3, and p4.
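The softmax step can be sketched on nested lists. The sketch assumes that the temporal dimension feature map holds one channel per branch, indexed as [branch][t][h][w], which is what the four (1,t,h,w) weight maps imply; `branch_weights` is a hypothetical helper name.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def branch_weights(p6):
    """Apply softmax across the branch (channel) axis at every (t, h, w) position.

    p6 is indexed [branch][t][h][w]; the result has the same layout, and
    result[b] is the weight map for branch b.
    """
    n, t = len(p6), len(p6[0])
    h, w = len(p6[0][0]), len(p6[0][0][0])
    weights = [[[[0.0] * w for _ in range(h)] for _ in range(t)] for _ in range(n)]
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                probs = softmax([p6[b][ti][hi][wi] for b in range(n)])
                for b in range(n):
                    weights[b][ti][hi][wi] = probs[b]
    return weights


# Four branches, t=1, h=1, w=2. Position 0 has equal scores; position 1 favours branch 4.
p6 = [[[[0.0, 0.0]]], [[[0.0, 0.0]]], [[[0.0, 0.0]]], [[[0.0, 5.0]]]]
p7 = branch_weights(p6)
p71, p72, p73, p74 = p7
```

Because the softmax runs across branches, the four weights at any position are positive and sum to 1, so each position distributes its attention among the four spatial views.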

[0060] Finally, a multi-spatial view feature map can be obtained based on the multiple spatial view feature maps and their corresponding weight values. In one implementation, the sum of the products of each spatial view feature map and its corresponding weight value can be used as the multi-spatial view feature map; for example, p8 = p1*p71 + p2*p72 + p3*p73 + p4*p74, where p8 is the multi-spatial view feature map, p1, p2, p3, and p4 are spatial view feature maps, and p71, p72, p73, and p74 are the weight values of the feature maps.
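The weighted sum p8 = p1*p71 + p2*p72 + p3*p73 + p4*p74 relies on broadcasting each (1, t, h, w) weight over the c channels of its feature map. The following sketch uses random stand-ins and normalized random weights in place of the softmax output; these inputs are assumptions for illustration.

```python
import numpy as np

c, t, h, w = 2, 4, 5, 5
rng = np.random.default_rng(0)

# Stand-ins for the spatial view feature maps p1..p4, each (c, t, h, w).
p1, p2, p3, p4 = (rng.standard_normal((c, t, h, w)) for _ in range(4))

# Stand-in for the softmax output: positive weights that sum to 1 over axis 0.
raw = rng.random((4, t, h, w))
weights = raw / raw.sum(axis=0, keepdims=True)
p71, p72, p73, p74 = np.split(weights, 4, axis=0)  # each (1, t, h, w)

# Each weight map broadcasts across the c channels of its feature map.
p8 = p1 * p71 + p2 * p72 + p3 * p73 + p4 * p74
print(p8.shape)  # (c, t, h, w)
```

The result has the same dimensions as the input p0, so the module can be chained with further layers that expect (c, t, h, w).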

[0061] S202b: Input the information encoding feature into the temporal dimension fusion module, and use the temporal dimension fusion module to perform temporal dimension information fusion processing on the information encoding feature to obtain a multi-temporal view feature map.

[0062] The temporal dimension fusion module includes: multiple temporal view branches, a spatial convolutional layer, and an activation function layer.

[0063] Specifically, the information encoding features can be input into multiple temporal view branches in the temporal dimension fusion module to obtain multiple temporal view feature maps. In one implementation, the temporal dimension fusion module can include four temporal view branches, each representing a different temporal computation view, as follows: the first temporal view branch includes a 3x1x1 depthwise separable convolutional layer, a batch normalization (BN) layer, a ReLU activation function layer, and a 1x1x1 convolutional layer; the second temporal view branch includes a 5x1x1 depthwise separable convolutional layer, a group normalization (GN) layer, a PReLU activation function layer, and a 1x1x1 convolutional layer; the third temporal view branch includes a 7x1x1 depthwise separable convolutional layer, an instance normalization (IN) layer, a GELU activation function layer, and a 1x1x1 convolutional layer; the fourth temporal view branch includes three 3x1x1 depthwise separable convolutional layers, a group normalization (GN) layer, a Mish activation function layer, and a 1x1x1 convolutional layer. Assuming that the input feature maps of the four temporal view branches are all p0 with dimensions (c,t,h,w), the temporal view feature maps s1, s2, s3, and s4 are obtained after passing through the four branches.
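To give a sense of why depthwise separable layers are inexpensive, the sketch below compares weight counts for the three temporal kernel sizes. It assumes the usual meaning of "depthwise separable" (a per-channel depthwise filter followed by a 1x1x1 pointwise convolution), ignores biases, and uses c = 256 purely as an example channel count; the function names are hypothetical.

```python
def full_conv_params(c_in, c_out, kernel):
    """Weight count of a standard 3D convolution (bias ignored)."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


def depthwise_separable_params(c_in, c_out, kernel):
    """Weight count of a depthwise filter plus a 1x1x1 pointwise convolution (bias ignored)."""
    kt, kh, kw = kernel
    depthwise = c_in * kt * kh * kw  # one filter per input channel
    pointwise = c_in * c_out         # 1x1x1 channel mixing
    return depthwise + pointwise


c = 256  # example channel count
counts = {}
for kernel in [(3, 1, 1), (5, 1, 1), (7, 1, 1)]:
    counts[kernel] = (full_conv_params(c, c, kernel), depthwise_separable_params(c, c, kernel))
    print(kernel, counts[kernel])
```

Under these assumptions, widening the temporal kernel from 3 to 7 adds only a few thousand depthwise weights, whereas a standard convolution would grow in proportion to the kernel length.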

[0064] Then, these multiple temporal view feature maps can be stacked along the number of channels to obtain a multi-view fusion feature map. Continuing the above example, these four temporal view feature maps s1, s2, s3, and s4 can be stacked together along the number of channels to obtain a multi-view fusion feature map s5 with dimensions (4c,t,h,w).

[0065] Next, the multi-view fusion feature map is input into the spatial convolutional layer to obtain a spatial dimension feature map. The spatial convolutional layer can be a 1x3x3 convolutional layer with 4 channels. In one implementation, assuming the multi-view fusion feature map s5 is input into the spatial convolutional layer, the spatial convolutional layer can output a spatial dimension feature map s6 with dimensions (4c, t, h, w).

[0066] Next, the spatial dimension feature map can be input into the activation function layer. This layer performs activation function calculations along the channel number dimension of the spatial dimension feature map, obtaining multiple feature map weight values. The feature map weight values correspond one-to-one with the temporal view feature maps. It can be understood that each temporal view feature map output by a temporal view branch receives an adaptively calculated feature map weight value, which is calculated based on the spatial dimension. In one implementation, the activation function layer can be a softmax function. The softmax function can calculate the feature map s7 by performing softmax calculation along the first axis (i.e., the channel number dimension) on the spatial dimension feature map. Then, feature map s7 can be decomposed into four feature map weight values with dimensions (1, t, h, w), represented as s71, s72, s73, and s74, respectively. These four feature map weight values correspond one-to-one with the four temporal view feature maps s1, s2, s3, and s4.

[0067] Finally, a multi-temporal view feature map can be obtained based on the multiple temporal view feature maps and their corresponding weight values. In one implementation, the sum of the products of each temporal view feature map and its corresponding weight value can be used as the multi-temporal view feature map; for example, s8 = s1*s71 + s2*s72 + s3*s73 + s4*s74, where s8 is the multi-temporal view feature map, s1, s2, s3, and s4 are temporal view feature maps, and s71, s72, s73, and s74 are feature map weight values.
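Because the weights come from a softmax, s8 is a convex combination of s1 to s4 at every position. The following sketch checks two consequences of that property. The helper name `fuse_views`, the tensor sizes, and the random inputs are assumptions for illustration, and the spatial dimension feature map s6 is assumed to have 4 channels.

```python
import numpy as np


def fuse_views(views, logits):
    """views: list of k arrays (c, t, h, w); logits: (k, t, h, w). Returns (c, t, h, w)."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over the channel axis
    stacked = np.stack(views)                       # (k, c, t, h, w)
    return (stacked * weights[:, None]).sum(axis=0)


c, t, h, w = 2, 4, 5, 5
rng = np.random.default_rng(0)
s = [rng.standard_normal((c, t, h, w)) for _ in range(4)]  # stand-ins for s1..s4
s6 = rng.standard_normal((4, t, h, w))                     # assumed 4-channel map

s8 = fuse_views(s, s6)

# A convex combination never leaves the element-wise range spanned by the views.
lower = np.min(np.stack(s), axis=0)
upper = np.max(np.stack(s), axis=0)
within_bounds = bool(np.all((s8 >= lower - 1e-12) & (s8 <= upper + 1e-12)))
print(s8.shape, within_bounds)
```

A second consequence is that if all four branches produced the same feature map, the fused output would equal that map regardless of the weights.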

[0068] In this embodiment, the step of determining the spatiotemporal adaptation feature map based on the multi-spatial view feature map and the multi-temporal view feature map may include the following steps:

[0069] The multi-spatial view feature map and the multi-temporal view feature map are input into the spatiotemporal fusion module to obtain the spatiotemporal adaptive feature map.

[0070] In this embodiment, the spatiotemporal fusion module may include: a spatial convolutional layer and a temporal convolutional layer.

[0071] Specifically, in this embodiment, the multi-spatial view feature map and the multi-temporal view feature map can first be stacked along the number of channels to obtain a multi-view fusion feature map. Then, the multi-view fusion feature map can be input into the spatial convolutional layer to obtain a spatial feature map; wherein, the spatial convolutional layer can be a 1x3x3 convolutional layer with c channels. Next, the spatial feature map can be input into the temporal convolutional layer to obtain the spatiotemporal adaptation feature map; wherein, the temporal convolutional layer can be a 3x1x1 convolutional layer with c channels.
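The spatiotemporal fusion module described above can be sketched as a channel stack followed by a 1x3x3 convolution and a 3x1x1 convolution. The sketch assumes zero padding that preserves t, h, and w, omits biases and any normalization, and uses hypothetical function names and all-ones inputs chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def spatial_conv_1x3x3(x, weight):
    """x: (c_in, t, h, w); weight: (c_out, c_in, 3, 3). Zero padding keeps h and w."""
    _, t, h, w = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], t, h, w))
    for i in range(3):
        for j in range(3):
            out += np.einsum("oc,cthw->othw", weight[:, :, i, j],
                             padded[:, :, i:i + h, j:j + w])
    return out


def temporal_conv_3x1x1(x, weight):
    """x: (c_in, t, h, w); weight: (c_out, c_in, 3). Zero padding keeps t."""
    _, t, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0)))
    out = np.zeros((weight.shape[0], t, h, w))
    for k in range(3):
        out += np.einsum("oc,cthw->othw", weight[:, :, k], padded[:, k:k + t])
    return out


def spatiotemporal_fusion(p8, s8, w_spatial, w_temporal):
    fused = np.concatenate([p8, s8], axis=0)         # (2c, t, h, w)
    spatial = spatial_conv_1x3x3(fused, w_spatial)   # (c, t, h, w)
    return temporal_conv_3x1x1(spatial, w_temporal)  # (c, t, h, w)


c, t, h, w = 2, 4, 5, 5
p8 = np.ones((c, t, h, w))  # stand-in multi-spatial view feature map
s8 = np.ones((c, t, h, w))  # stand-in multi-temporal view feature map
out = spatiotemporal_fusion(p8, s8, np.ones((c, 2 * c, 3, 3)), np.ones((c, c, 3)))
print(out.shape)
```

With all-ones inputs and weights, an interior position evaluates to 2c * 9 after the spatial layer and c * 3 times that after the temporal layer, which gives a simple check of the two-stage structure.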

[0072] It should be noted that, in one implementation, the two-order spatiotemporal transformation model may include multiple deep spatiotemporal adaptive sub-models. Furthermore, the multiple deep spatiotemporal adaptive sub-models are connected in series, and the network architecture of each deep spatiotemporal adaptive sub-model is the same as that described in the above embodiments, including: a spatial dimension fusion module, a temporal dimension fusion module, and a spatiotemporal fusion module, which will not be elaborated further here. It should also be noted that the input to each deep spatiotemporal adaptive sub-model is the output of its adjacent preceding deep spatiotemporal adaptive sub-model.

[0073] For example, this two-order spatiotemporal transformation model can include a group of eight cascaded deep spatiotemporal adaptive sub-models and a group of twelve cascaded deep spatiotemporal adaptive sub-models. The two groups are connected by a max pooling layer with a downsampling rate of 2, and the group of twelve is followed by another max pooling layer with a downsampling rate of 2. Specifically, each of the eight cascaded sub-models has 256 channels, and the dimension of the output spatiotemporal adaptive feature map is (256, 64, 28, 28); each of the twelve cascaded sub-models has 512 channels, and the dimension of the output spatiotemporal adaptive feature map is (512, 64, 14, 14). It should be noted that the max pooling layer after the twelve cascaded sub-models can be followed by a global average pooling layer, then two fully connected layers, an activation layer, and a preset type layer. In this way, preset type video processing is performed on the spatiotemporal adaptive feature map to obtain the preset type processing result corresponding to the video to be processed. For example, when the preset type layer is a classification layer, the preset type video processing can be action classification and recognition.
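The dimensions in this example can be traced with a short shape-only sketch. It assumes that the max pooling layers downsample only h and w (inferred from t remaining 64 in both stated shapes), that each sub-model preserves (t, h, w), and that the first group receives a (256, 64, 28, 28) input; the function names are hypothetical.

```python
def run_stage(shape, channels, depth):
    """Apply `depth` cascaded sub-models, each assumed to keep (t, h, w) and emit `channels` channels."""
    for _ in range(depth):
        shape = (channels,) + shape[1:]
    return shape


def max_pool_spatial(shape, rate=2):
    """Max pooling assumed to downsample only the spatial axes h and w."""
    c, t, h, w = shape
    return (c, t, h // rate, w // rate)


x = (256, 64, 28, 28)            # assumed input to the first group
stage1 = run_stage(x, 256, 8)    # eight sub-models, 256 channels
pooled1 = max_pool_spatial(stage1)
stage2 = run_stage(pooled1, 512, 12)  # twelve sub-models, 512 channels
pooled2 = max_pool_spatial(stage2)
gap = (pooled2[0],)              # global average pooling over (t, h, w)

for name, shape in [("stage 1", stage1), ("pool 1", pooled1),
                    ("stage 2", stage2), ("pool 2", pooled2), ("global avg pool", gap)]:
    print(f"{name}: {shape}")
```

Under these assumptions the fully connected layers would receive a 512-dimensional vector per video.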

[0074] All of the above-mentioned optional technical solutions can be combined in any way to form optional embodiments of this disclosure, and will not be described in detail here.

[0075] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein. For details not disclosed in the apparatus embodiments of this disclosure, please refer to the embodiments of the method disclosed herein.

[0076] Figure 3 is a schematic diagram of the video processing apparatus provided in an embodiment of this disclosure. As shown in Figure 3, the video processing apparatus includes:

[0077] The feature extraction unit 301 is used to acquire the video to be processed and to encode and compress the video to be processed to obtain the information encoding features of the video to be processed.

[0078] The information fusion unit 302 is used to perform information fusion processing on the information encoding features to obtain a multi-spatial view feature map and a multi-temporal view feature map.

[0079] The feature determination unit 303 is used to determine the spatiotemporal adaptation feature map based on the multi-spatial view feature map and the multi-temporal view feature map;

[0080] The result determination unit 304 is used to perform preset type video processing on the spatiotemporal adaptive feature map to obtain the preset type processing result corresponding to the video to be processed.
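The division into units 301 to 304 can be pictured as a simple pipeline of four callables. The sketch below is only a structural illustration: the class name `VideoProcessingApparatus`, its field names, and the toy lambdas are hypothetical and do not implement the models described in this disclosure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class VideoProcessingApparatus:
    feature_extraction: Callable[[Any], Any]              # unit 301
    information_fusion: Callable[[Any], Tuple[Any, Any]]  # unit 302
    feature_determination: Callable[[Any, Any], Any]      # unit 303
    result_determination: Callable[[Any], Any]            # unit 304

    def process(self, video):
        encoded = self.feature_extraction(video)
        spatial_map, temporal_map = self.information_fusion(encoded)
        adapted = self.feature_determination(spatial_map, temporal_map)
        return self.result_determination(adapted)


# Toy stand-ins that only demonstrate the data flow between units.
apparatus = VideoProcessingApparatus(
    feature_extraction=lambda video: [sum(frame) for frame in video],
    information_fusion=lambda feats: (max(feats), len(feats)),
    feature_determination=lambda s, t: s + t,
    result_determination=lambda value: "label_a" if value > 10 else "label_b",
)

result = apparatus.process([[1, 2], [3, 4], [5, 6]])
print(result)
```

Keeping each unit behind its own callable mirrors the statement that feature extraction and information processing are handled separately, so either part could be replaced without changing the other.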

[0081] Optionally, the device is applied to a two-order spatiotemporal transformation model, which includes an initial spatiotemporal transformation sub-model; the feature extraction unit 301 is used for:

[0082] The video to be processed is input into the initial spatiotemporal transformation sub-model to obtain the information encoding features of the video to be processed;

[0083] The initial spatiotemporal transformation sub-model includes several initial spatiotemporal transformation modules, each of which includes a spatial convolutional layer, a first residual convolutional layer, a second residual convolutional layer, and a third residual convolutional layer.

[0084] Optionally, the device is applied to a two-order spatiotemporal transformation model, which includes an advanced spatiotemporal transformation model, which includes a deep spatiotemporal adaptation sub-model, and each of the deep spatiotemporal adaptation sub-models includes: a spatial dimension fusion module, a temporal dimension fusion module, and a spatiotemporal fusion module.

[0085] The information fusion unit 302 is used for:

[0086] The information encoding feature is input into the spatial dimension fusion module, and the spatial dimension fusion module is used to perform spatial dimension information fusion processing on the information encoding feature to obtain a multi-spatial view feature map.

[0087] The information encoding feature is input into the temporal dimension fusion module, and the temporal dimension fusion module is used to perform temporal dimension information fusion processing on the information encoding feature to obtain a multi-temporal view feature map.

[0088] The feature determining unit 303 is used for:

[0089] The multi-spatial view feature map and the multi-temporal view feature map are input into the spatiotemporal fusion module to obtain the spatiotemporal adaptive feature map.

[0090] Optionally, the spatial dimension fusion module includes: multiple spatial view branches, a temporal convolutional layer, and an activation function layer; the information fusion unit 302 is used for:

[0091] The information encoding features are input into the multiple spatial view branches respectively to obtain multiple spatial view feature maps;

[0092] The multiple spatial view feature maps are stacked along the number of channels to obtain a multi-view fusion feature map;

[0093] The multi-view fusion feature map is input into the temporal convolutional layer to obtain the temporal dimension feature map;

[0094] The temporal dimension feature map is input into the activation function layer, which performs activation function calculations on the temporal dimension feature map along the channel number dimension to obtain multiple feature map weight values.

[0095] Based on the multiple spatial view feature maps and their weight values, a multi-spatial view feature map is obtained.

[0096] Optionally, the temporal dimension fusion module includes: multiple temporal view branches, a spatial convolutional layer, and an activation function layer; the information fusion unit 302 is used for:

[0097] The information encoding features are input into the multiple temporal view branches respectively to obtain multiple temporal view feature maps;

[0098] The multiple temporal view feature maps are stacked along the number of channels to obtain a multi-view fusion feature map;

[0099] The multi-view fusion feature map is input into the spatial convolutional layer to obtain the spatial dimension feature map;

[0100] The spatial dimension feature map is input into the activation function layer, which performs activation function calculations on the spatial dimension feature map along the channel number dimension to obtain multiple feature map weight values.

[0101] Based on the multiple temporal view feature maps and their weight values, a multi-temporal view feature map is obtained.

[0102] Optionally, the spatiotemporal fusion module includes: a spatial convolutional layer and a temporal convolutional layer; the feature determination unit 303 is used for:

[0103] The multi-spatial view feature map and the multi-temporal view feature map are stacked along the number of channels to obtain a multi-view fusion feature map;

[0104] The multi-view fusion feature map is input into the spatial convolutional layer to obtain the spatial feature map;

[0105] The spatial feature map is input into the temporal convolutional layer to obtain the spatiotemporal adaptive feature map.

[0106] Optionally, the preset video processing types include at least one of the following: video content detection, video content recognition, and motion recognition.

[0107] The technical solution provided in this embodiment is a video processing apparatus, which includes: a feature extraction unit, used to acquire a video to be processed, and to encode and compress the video to be processed to obtain information encoding features of the video to be processed;

[0108] The information fusion unit is used to perform information fusion processing on the information encoding features to obtain multi-spatial view feature maps and multi-temporal view feature maps.

[0109] The feature determination unit is used to determine the spatiotemporal adaptation feature map based on the multi-spatial view feature map and the multi-temporal view feature map;

[0110] The result determination unit is used to perform preset type video processing on the spatiotemporal adaptive feature map to obtain the preset type processing result corresponding to the video to be processed. In this embodiment, the extraction of video information encoding features and the processing of video information are handled separately. This improves the speed of feature extraction during the extraction process. During video information processing, the spatiotemporal adaptive feature map is determined by extracting the multi-spatial view feature map and multi-temporal view feature map corresponding to the video to be processed. The preset type video processing is then performed using the spatiotemporal adaptive feature map to obtain the preset type processing result corresponding to the video to be processed. This achieves full information mining during video processing, encompassing multiple views, multiple spans, and spatiotemporal information adaptive fusion, thereby improving the accuracy of the video processing results. Therefore, the method provided in this application not only improves the processing speed and efficiency of video processing but also enhances the accuracy of the video processing results.

[0111] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this disclosure.

[0112] Figure 4 is a schematic diagram of the computer device 4 provided in an embodiment of this disclosure. As shown in Figure 4, the computer device 4 in this embodiment includes a processor 401, a memory 402, and a computer program 403 stored in the memory 402 and executable on the processor 401. When the processor 401 executes the computer program 403, it implements the steps in the various method embodiments described above. Alternatively, when the processor 401 executes the computer program 403, it implements the functions of each module / unit in the various device embodiments described above.

[0113] Exemplarily, computer program 403 may be divided into one or more modules / units, which are stored in memory 402 and executed by processor 401 to perform the present disclosure. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of computer program 403 in computer device 4.

[0114] Computer device 4 can be a desktop computer, laptop, handheld computer, cloud server, or other similar computer device. Computer device 4 may include, but is not limited to, processor 401 and memory 402. Those skilled in the art will understand that Figure 4 is merely an example of computer device 4 and does not constitute a limitation on computer device 4. It may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device may also include input / output devices, network access devices, buses, etc.

[0115] Processor 401 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0116] The memory 402 can be an internal storage unit of the computer device 4, such as a hard disk or RAM of the computer device 4. The memory 402 can also be an external storage device of the computer device 4, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the computer device 4. Furthermore, the memory 402 can include both internal and external storage units of the computer device 4. The memory 402 is used to store computer programs and other programs and data required by the computer device. The memory 402 can also be used to temporarily store data that has been output or will be output.

[0117] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this disclosure. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0118] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0119] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this disclosure.

[0120] In the embodiments provided in this disclosure, it should be understood that the disclosed apparatus / computer devices and methods can be implemented in other ways. For example, the apparatus / computer device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. Multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0121] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0122] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0123] If an integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program may include computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in a computer-readable medium may be appropriately added to or subtracted according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0124] The above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit it. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be included within the protection scope of this disclosure.

Claims

1. A video processing method, characterized in that the method includes:
acquiring a video to be processed, and encoding and compressing the video to be processed to obtain information encoding features of the video to be processed;
performing information fusion processing on the information encoding features to obtain a multi-spatial view feature map and a multi-temporal view feature map;
determining a spatiotemporal adaptive feature map based on the multi-spatial view feature map and the multi-temporal view feature map;
performing preset type video processing on the spatiotemporal adaptive feature map to obtain a preset type processing result corresponding to the video to be processed;
wherein the method is applied to a two-order spatiotemporal transformation model, which includes an advanced spatiotemporal transformation model; the advanced spatiotemporal transformation model includes a deep spatiotemporal adaptation sub-model, and each deep spatiotemporal adaptation sub-model includes: a spatial dimension fusion module, a temporal dimension fusion module, and a spatiotemporal fusion module;
the information fusion processing of the information encoding features to obtain the multi-spatial view feature map and the multi-temporal view feature map includes: inputting the information encoding features into the spatial dimension fusion module, and using the spatial dimension fusion module to perform spatial dimension information fusion processing on the information encoding features to obtain the multi-spatial view feature map; inputting the information encoding features into the temporal dimension fusion module, and using the temporal dimension fusion module to perform temporal dimension information fusion processing on the information encoding features to obtain the multi-temporal view feature map;
the step of determining the spatiotemporal adaptive feature map based on the multi-spatial view feature map and the multi-temporal view feature map includes: inputting the multi-spatial view feature map and the multi-temporal view feature map into the spatiotemporal fusion module to obtain the spatiotemporal adaptive feature map.

2. The method according to claim 1, characterized in that the two-order spatiotemporal transformation model includes an initial spatiotemporal transformation sub-model; the encoding and compression processing of the video to be processed to obtain the information encoding features of the video to be processed includes: inputting the video to be processed into the initial spatiotemporal transformation sub-model to obtain the information encoding features of the video to be processed; wherein the initial spatiotemporal transformation sub-model includes several initial spatiotemporal transformation modules, each of which includes a spatial convolutional layer, a first residual convolutional layer, a second residual convolutional layer, and a third residual convolutional layer.

3. The method according to claim 1, characterized in that the spatial dimension fusion module includes: multiple spatial view branches, a temporal convolutional layer, and an activation function layer; the step of inputting the information encoding features into the spatial dimension fusion module, and using the spatial dimension fusion module to perform spatial dimension information fusion processing on the information encoding features to obtain a multi-spatial view feature map includes:
inputting the information encoding features into the multiple spatial view branches respectively to obtain multiple spatial view feature maps;
stacking the multiple spatial view feature maps along the number of channels to obtain a multi-view fusion feature map;
inputting the multi-view fusion feature map into the temporal convolutional layer to obtain a temporal dimension feature map;
inputting the temporal dimension feature map into the activation function layer, the activation function layer performing activation function calculation on the temporal dimension feature map along the channel number dimension to obtain multiple feature map weight values;
obtaining the multi-spatial view feature map based on the multiple spatial view feature maps and the multiple feature map weight values.

4. The method according to claim 1, characterized in that the temporal dimension fusion module includes: multiple temporal view branches, a spatial convolutional layer, and an activation function layer; the step of inputting the information encoding features into the temporal dimension fusion module, and using the temporal dimension fusion module to perform temporal dimension information fusion processing on the information encoding features to obtain a multi-temporal view feature map includes:
inputting the information encoding features into the multiple temporal view branches respectively to obtain multiple temporal view feature maps;
stacking the multiple temporal view feature maps along the number of channels to obtain a multi-view fusion feature map;
inputting the multi-view fusion feature map into the spatial convolutional layer to obtain a spatial dimension feature map;
inputting the spatial dimension feature map into the activation function layer, the activation function layer performing activation function calculation on the spatial dimension feature map along the channel number dimension to obtain multiple feature map weight values;
obtaining the multi-temporal view feature map based on the multiple temporal view feature maps and the multiple feature map weight values.

5. The method according to claim 1, characterized in that the spatiotemporal fusion module includes: a spatial convolutional layer and a temporal convolutional layer; the step of inputting the multi-spatial view feature map and the multi-temporal view feature map into the spatiotemporal fusion module to obtain the spatiotemporal adaptive feature map includes:
stacking the multi-spatial view feature map and the multi-temporal view feature map along the number of channels to obtain a multi-view fusion feature map;
inputting the multi-view fusion feature map into the spatial convolutional layer to obtain a spatial feature map;
inputting the spatial feature map into the temporal convolutional layer to obtain the spatiotemporal adaptive feature map.

6. The method according to any one of claims 1-5, characterized in that the preset video processing types include at least one of the following: video content detection, video content recognition, and motion recognition.

7. A video processing apparatus, characterized in that the apparatus includes:
a feature extraction unit, used to acquire a video to be processed and to encode and compress the video to be processed to obtain information encoding features of the video to be processed;
an information fusion unit, used to perform information fusion processing on the information encoding features to obtain a multi-spatial view feature map and a multi-temporal view feature map;
a feature determination unit, used to determine a spatiotemporal adaptive feature map based on the multi-spatial view feature map and the multi-temporal view feature map;
a result determination unit, used to perform preset type video processing on the spatiotemporal adaptive feature map to obtain a preset type processing result corresponding to the video to be processed;
wherein the apparatus is applied to a two-order spatiotemporal transformation model, which includes an advanced spatiotemporal transformation model; the advanced spatiotemporal transformation model includes a deep spatiotemporal adaptation sub-model, and each deep spatiotemporal adaptation sub-model includes: a spatial dimension fusion module, a temporal dimension fusion module, and a spatiotemporal fusion module;
the information fusion unit is specifically used for: inputting the information encoding features into the spatial dimension fusion module, and using the spatial dimension fusion module to perform spatial dimension information fusion processing on the information encoding features to obtain the multi-spatial view feature map; inputting the information encoding features into the temporal dimension fusion module, and using the temporal dimension fusion module to perform temporal dimension information fusion processing on the information encoding features to obtain the multi-temporal view feature map;
the feature determination unit is specifically used for: inputting the multi-spatial view feature map and the multi-temporal view feature map into the spatiotemporal fusion module to obtain the spatiotemporal adaptive feature map.

8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 6.