Video scene recognition method and device, storage medium and electronic device

By fusing the temporal features of the target video frame and the reference video frame through a scene parsing network, the problem of low accuracy in video scene recognition is solved, and fine segmentation and accurate recognition of different types of object regions in the video scene are achieved, while computational efficiency is also improved.

CN115937728B · Active · Publication Date: 2026-06-26 · ALIPAY (HANGZHOU) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
Filing Date: 2022-01-13
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, the accuracy of video scene recognition is low, especially when there is insufficient semantic information between video frames and insufficient mining of pixel block correlation information.

Method used

A scene parsing network is used to fuse temporal features of the target video frame and the reference video frame. Spatial information is extracted through a spatial branch network, semantic information is extracted through a semantic branch network, and information is fused through a feature fusion network to output a region category map of the target video frame.

Benefits of technology

It improves the accuracy of video scene recognition, enables fine segmentation and accurate recognition of different types of object regions in video scenes, and enhances computational efficiency and recognition speed.

✦ Generated by Eureka AI based on patent content.


Abstract

The specification discloses a video scene recognition method and device, a storage medium and an electronic device, wherein the method comprises: acquiring a target video frame of a target video, inputting the target video frame into a scene analysis network, determining a first region image feature corresponding to the target video frame and a second region image feature corresponding to a reference video frame through the scene analysis network, and performing time sequence feature fusion on the first region image feature and the second region image feature through the scene analysis network to output a region category graph corresponding to the target video frame. The specification can improve the accuracy of video scene recognition.

Description

Technical Field

[0001] This specification relates to the field of computer technology, and in particular to a video scene recognition method, apparatus, storage medium, and electronic device.

Background Technology

[0002] Video scene recognition is an important research direction in computer vision based on image segmentation technology, and it serves as a crucial component of image semantic understanding. Video scene recognition based on image segmentation technology refers to the process of dividing an image into several regions with similar properties; simply put, it involves segmenting the regions containing different categories of objects from at least one frame of a video.

Summary of the Invention

[0003] This specification provides a video scene recognition method, apparatus, storage medium, and electronic device, the technical solution of which is as follows:

[0004] Firstly, this specification provides a video scene recognition method, the method comprising:

[0005] Obtain the target video frame from the target video;

[0006] The target video frame is input into the scene parsing network, and the scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame. The reference video frame is a video frame preceding the target video frame.

[0007] The scene parsing network performs temporal feature fusion on the image features of the first region and the image features of the second region, and outputs the region category map corresponding to the target video frame.

[0008] Secondly, this specification provides a video scene recognition device, the device comprising:

[0009] The frame acquisition module is used to acquire the target video frames of the target video.

[0010] The feature determination module is used to input the target video frame into the scene parsing network, and determine the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame through the scene parsing network. The reference video frame is a video frame preceding the target video frame.

[0011] The feature fusion module is used to perform temporal feature fusion on the image features of the first region and the image features of the second region through the scene parsing network, and output the region category map corresponding to the target video frame.

[0012] Thirdly, this specification provides a computer storage medium storing a plurality of instructions adapted for loading by a processor and executing the above-described method steps.

[0013] Fourthly, this specification provides an electronic device that may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to execute the above-described method steps.

[0014] In the embodiments of this specification, a target video frame is acquired from a target video and input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to a reference video frame, where the reference video frame is a video frame preceding the target video frame. The scene parsing network then performs temporal feature fusion on the first region image features and the second region image features. The second region image features based on the reference video frame can assist in the video scene recognition of the target video frame, thereby obtaining rich image information in the video scene. Finally, a region category map corresponding to the target video frame is output. The region category map contains accurate region category results, which can significantly improve the accuracy of video scene recognition.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions in this specification or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of one or more embodiments of the specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart illustrating a video scene recognition system provided in the embodiments of this specification;

[0017] Figure 2 is a flowchart illustrating a video scene recognition method provided in the embodiments of this specification;

[0018] Figure 3 is a scene diagram of a spatial branch network involved in the video scene recognition method provided in the embodiments of this specification;

[0019] Figure 4 is a schematic diagram of the network architecture of a semantic branch network involved in the video scene recognition method provided in the embodiments of this specification;

[0020] Figure 5 is a schematic diagram of the network structure of a visual transformer involved in the video scene recognition method provided in the embodiments of this specification;

[0021] Figure 6 is a schematic diagram of the network structure of an attention optimization module involved in the video scene recognition method provided in the embodiments of this specification;

[0022] Figure 7 is a schematic diagram of the network architecture of a feature fusion network involved in the video scene recognition method provided in the embodiments of this specification;

[0023] Figure 8 is a schematic diagram of a temporal feature fusion module involved in the video scene recognition method provided in the embodiments of this specification;

[0024] Figure 9 is a flowchart illustrating a video scene recognition method provided in the embodiments of this specification;

[0025] Figure 10 is a scene diagram of a scene parsing network provided in the embodiments of this specification;

[0026] Figure 11 is a scene diagram illustrating a measured video frame example provided in the embodiments of this specification;

[0027] Figure 12 is a schematic diagram of a video scene analysis application provided in the embodiments of this specification;

[0028] Figure 13 is a schematic diagram of the structure of a video scene recognition device provided in the embodiments of this specification;

[0029] Figure 14 is a schematic diagram of the structure of a feature determination module provided in an embodiment of this specification;

[0030] Figure 15 is a schematic diagram of the structure of a feature fusion module provided in an embodiment of this specification;

[0031] Figure 16 is a schematic diagram of the structure of an electronic device provided in the embodiments of this specification.

Detailed Implementation

[0032] To make the features and advantages of the embodiments of this specification more apparent and understandable, the technical solutions of the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the embodiments of this specification, and not all embodiments. Based on one or more embodiments of this specification, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the embodiments of this specification.

[0033] In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those in this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this specification as detailed in the appended claims. The flowcharts shown in the drawings are merely illustrative and are not necessarily to be performed in accordance with the steps shown. For example, some steps are parallel and do not have a strict logical order, so the actual execution order is variable. Furthermore, the terms "first," "second," "third," "fourth," "fifth," "sixth," "seventh," and "eighth" are for distinguishing purposes only and should not be construed as limiting the scope of this disclosure.

[0034] Video scene recognition, simply put, involves segmenting different categories of objects from video frames. For example, in a street image, it's about segmenting the areas containing roads, people, cars, trees, and buildings. Common techniques include semantic segmentation and scene parsing, which often directly perform scene recognition on the current video frame. However, real-world scenes are very diverse, and directly performing scene recognition on the video frame results in low accuracy.

[0035] Referring to Figure 1, a flowchart of a video scene recognition method according to an embodiment of this specification is provided. The method can be implemented as a computer program and can run on a video scene recognition device based on the von Neumann architecture. The computer program can be integrated into an application or run as a standalone utility application. The video scene recognition device can be an electronic device, including but not limited to a server, service platform, personal computer, tablet, in-vehicle device, smartphone, computing device, or other processing device connected to a wireless modem. In different networks, electronic devices can have different names, such as: user equipment, access terminal, user unit, user station, mobile station, remote station, remote terminal, mobile device, user terminal, terminal, wireless communication device, user agent or user device, cellular phone, cordless phone, personal digital assistant (PDA), electronic devices in 5G networks or future evolved networks, etc.

[0036] Specifically, the video scene recognition method may include the following steps:

[0037] S102, Obtain the target video frame of the target video;

[0038] Understandably, video scene recognition for a target video can involve identifying the regions containing different categories of objects within at least one video frame. For example, in a recorded video of a street, scene recognition of the target video frame can identify areas such as roads, people, cars, trees, and houses. The target video frame can be understood as the single video image to be identified within the target video.

[0039] In one or more embodiments of the specification, the target video can be a video recorded under conditions requiring real-time video scene recognition. The target video can be a video in an online video scenario, such as an online face-to-face interview scenario or an online identity verification scenario. Taking the online face-to-face interview scenario as an example, a user and service personnel conduct a video interview so that the user can obtain relevant service information. To protect the information security of the user or service personnel and improve the video experience for both parties, scene recognition can be performed on at least one frame of the target video recorded in real time. This determines the object category region in the video scene (such as the user and / or service personnel category region), so that the object can be finely segmented from the target video and the background can then be replaced with a suitable one or blurred. Alternatively, the object category region (such as the user and / or service personnel) can be determined so that the object is replaced with a virtual object (such as replacing the user image with a cartoon object). Both processes involve video scene recognition of at least one frame of the target video to identify the regions where different categories of objects are located in the video scene, and thereby the region where the object of interest is located.

[0040] In the embodiments of this specification, video scene recognition of a target video usually requires multiple rounds of recognition, one for each target video frame to be recognized, depending on actual application needs.

[0041] S104, the target video frame is input into the scene parsing network, and the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame are determined by the scene parsing network. The reference video frame is a video frame preceding the target video frame.

[0042] Understandably, the scene parsing network can be a neural network model, and the input data of the scene parsing network can be at least the target video frame to be identified in the target video; the output data of the scene parsing network is the region category map corresponding to the target video frame; the scene parsing network identifies the regions where different categories of objects are located in the video scene corresponding to the target video frame, and represents them with the region category map.

[0043] The first region image feature is obtained by extracting region image features from the target video frame through a scene parsing network. It can be understood that the first region image feature can characterize the region features of different types of objects in the video scene (corresponding to the target video frame), such as roads, walls, cars, and people, within the target video frame.

[0044] In one or more embodiments of the specification, the scene parsing network can be implemented based on fitting one or more of the following models: Convolutional Neural Network (CNN), Deep Neural Network (DNN), Recurrent Neural Networks (RNN), Embedding model, Gradient Boosting Decision Tree (GBDT) model, Logistic Regression (LR) model, etc.

[0045] Understandably, the target video typically consists of at least one video frame, and the reference video frame is a video frame preceding the target video frame. In this specification, when a scene parsing network performs video scene recognition on the target video frame, the second region image features corresponding to the reference video frame are also introduced. The scene parsing network can use the second region image features to capture and extract feature information between video frames, thereby achieving fine and accurate semantic segmentation of the video scene corresponding to the target video frame and improving the accuracy of video scene recognition.

[0046] The second region image feature is obtained by extracting region image features from the reference video frame using a scene parsing network. It can be understood that the second region image feature can characterize the region features of different types of objects (such as roads, walls, cars, and people) in the video scene (corresponding to the reference video frame).

[0047] In one or more embodiments of the specification, an initial scene parsing network is trained by using sample video frames contained in the sample video data. After training is completed, a trained scene parsing network can be obtained.

[0048] In one feasible implementation, a target video frame and at least one reference video frame can be input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to each of the reference video frames. This can be understood as follows: when performing video scene recognition on the target video frame, the target video frame and at least one preceding reference video frame are used as input data, both as input to the scene parsing network. The scene parsing network can synchronously or asynchronously perform video scene parsing processing on the target video frame and each reference video frame. During video scene parsing processing, the scene parsing network can determine the first region image features corresponding to the target video frame and the second region image features corresponding to each reference video frame. It is understood that the second region image features based on the reference video frames can assist the scene parsing network in recognizing the video scene of the target video frame, avoiding inaccurate recognition due to insufficient semantic information, inadequate processing and mining of non-local information and correlation information between pixel blocks when only recognizing a single video frame.

[0049] In one feasible implementation, video scene recognition of a target video typically involves multiple rounds of recognition, one per frame, such as recognizing video frame 1 at time t1, recognizing video frame 2 at time t2, ... recognizing video frame i at time ti. During the training phase, the scene parsing network can acquire the capability of "saving and acquiring the region image features of reference video frames that have already been recognized". In this way, when performing scene parsing and recognition on the current target video frame, it is not necessary to recognize its preceding reference video frames a second time, thereby saving computational resources and improving the efficiency of scene recognition. Understandably, the target video frame can be input into the scene parsing network, which determines the first region image features corresponding to the target video frame and obtains the saved second region image features corresponding to at least one reference video frame. That is, on the one hand the scene parsing network extracts the first region image features corresponding to the target video frame; on the other hand, since the second region image features are usually region image features that the scene parsing network has already determined for the reference video frame, the scene parsing network can store them. When the target video frame is processed, the second region image features can then be obtained directly from the scene parsing network.
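
The save-and-reuse behaviour described in this paragraph can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the specification: the class name ReferenceFeatureCache, the max_refs parameter, and the stand-in extract callable are all hypothetical, and a plain list of floats stands in for a region image feature.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Feature = List[float]  # stand-in for a region image feature tensor


class ReferenceFeatureCache:
    """Keeps region image features of the most recent frames (hypothetical helper)."""

    def __init__(self, extract: Callable[[object], Feature], max_refs: int = 1) -> None:
        self._extract = extract                      # assumed feature extractor
        self._refs: Deque[Feature] = deque(maxlen=max_refs)
        self.extract_calls = 0                       # how often extraction really ran

    def step(self, frame: object) -> Tuple[Feature, List[Feature]]:
        """Return (first region features, cached second region features)."""
        self.extract_calls += 1
        first = self._extract(frame)                 # only the target frame is processed
        second = list(self._refs)                    # reference features come from the cache
        self._refs.append(first)                     # the target becomes a future reference
        return first, second


# Toy extractor: the "feature" of frame t is just [t].
cache = ReferenceFeatureCache(extract=lambda frame: [float(frame)], max_refs=2)
for t in (1, 2, 3):
    first, second = cache.step(t)
```

With this arrangement each frame passes through the extractor once, no matter how many later frames use it as a reference; the bounded deque is one possible way to limit how many reference frames are retained.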

[0050] S106, the scene parsing network performs temporal feature fusion on the image features of the first region and the image features of the second region, and outputs the region category map corresponding to the target video frame.

[0051] The region category map contains region information of multiple categories of objects in the video scene. The region category map can be a region category mask, which can clearly reflect the boundaries or contours of the image regions where different categories of objects are located.

[0052] Understandably, the scene parsing network outputs the region category mask corresponding to the target video frame, and the region category mask can accurately reflect the position of different categories of objects in the image. This facilitates subsequent applications such as extracting points of interest (POIs), masking certain specific regions, and extracting object structural features.
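
As an illustration of how a region category mask might be consumed downstream, the sketch below keeps the pixels of one category and paints everything else with a fill colour, in the spirit of the background replacement use case. The function names, the PERSON class index, and the nested-list image layout are assumptions made for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def class_mask(category_map: List[List[int]], class_id: int) -> List[List[bool]]:
    """Binary mask that is True where the category map equals class_id."""
    return [[label == class_id for label in row] for row in category_map]


def replace_background(
    frame: List[List[Pixel]],
    category_map: List[List[int]],
    keep_class: int,
    fill: Pixel,
) -> List[List[Pixel]]:
    """Keep pixels of keep_class and paint every other pixel with fill."""
    mask = class_mask(category_map, keep_class)
    return [
        [px if keep else fill for px, keep in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


PERSON = 1  # hypothetical class index
frame = [[(10, 20, 30), (40, 50, 60)]]   # 1 x 2 RGB frame with made-up values
category_map = [[0, PERSON]]             # left pixel is background, right is person
result = replace_background(frame, category_map, PERSON, fill=(0, 0, 0))
```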

[0053] Understandably, if the reference video frame is a video frame preceding the target video frame, then the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame have different feature time sequences. The scene parsing network is used to obtain the image category information in the second region image features corresponding to the previous reference video frame, and then integrates it into the first region image features of the current target video frame. This fuses the common feature information of the current target video frame and the previous reference video frame belonging to the same category of objects, thereby obtaining the region category map corresponding to the target video frame, such as the region category mask map (mask) corresponding to the target video frame.
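
This passage does not specify the fusion operator itself. Purely as an assumption for illustration, the sketch below blends per-pixel class scores of the target frame with those of the reference frame using a fixed weight alpha, then takes the highest-scoring class per pixel to form a region category map. The names fuse_temporal, to_category_map, and alpha are hypothetical.

```python
from typing import List

ScoreMap = List[List[List[float]]]  # [row][col][class] per-pixel class scores


def fuse_temporal(first: ScoreMap, second: ScoreMap, alpha: float = 0.7) -> ScoreMap:
    """Blend target-frame scores with reference-frame scores (assumed operator)."""
    return [
        [
            [alpha * a + (1.0 - alpha) * b for a, b in zip(px_first, px_second)]
            for px_first, px_second in zip(row_first, row_second)
        ]
        for row_first, row_second in zip(first, second)
    ]


def to_category_map(scores: ScoreMap) -> List[List[int]]:
    """Pick the highest-scoring class index for every pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in scores]


# 1 x 2 frame with two classes (0 = background, 1 = person); values are made up.
target_scores = [[[0.45, 0.55], [0.9, 0.1]]]
reference_scores = [[[0.9, 0.1], [0.8, 0.2]]]
fused = fuse_temporal(target_scores, reference_scores, alpha=0.5)
category_map = to_category_map(fused)
```

In this toy input the left pixel is ambiguous in the target frame alone but is assigned to the background once the reference frame's scores are blended in, which is the kind of assistance from preceding frames that the paragraph describes.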

[0054] In one or more embodiments of the specification, a target video frame is acquired from a target video and input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to a reference video frame, wherein the reference video frame is a video frame preceding the target video frame. The scene parsing network then performs temporal feature fusion on the first region image features and the second region image features. The second region image features based on the reference video frame can assist in the video scene recognition of the target video frame, thereby obtaining rich image information in the video scene. Finally, a region category map corresponding to the target video frame is output. The region category map contains accurate region category results, which can significantly improve the accuracy of video scene recognition.

[0055] Referring to Figure 2, a flowchart of a video scene recognition method provided in an embodiment of this specification is shown. Specifically, the video scene recognition method may include the following steps:

[0056] S202, Obtain the target video frame of the target video and input the target video frame into the scene parsing network.

[0057] It is understood that the scene parsing network may include at least a spatial branch network, a semantic branch network, and a feature fusion network.

[0058] In one or more embodiments of this specification, the scene parsing network may further include a temporal feature fusion module. It is understood that the scene parsing network involved in this specification typically uses a shallow convolutional neural network (the spatial branch network) to extract spatial information from video frames (such as target video frames and reference video frames) and generate spatial information features that preserve spatial information. The semantic branch network can extract and parse (image) semantic information features from video frames (such as target video frames and reference video frames) based on a visual transformer. A feature fusion network then combines the two to generate a first region image feature containing rich spatial and semantic information. To effectively utilize information from past frames, the temporal feature fusion module of the scene parsing network performs temporal fusion of information from past and current frames, thereby improving the accuracy and robustness of semantic segmentation of the target video frame and achieving accurate video scene recognition.
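
The way these components fit together can be sketched structurally as below. The sketch only shows the wiring; every callable is a placeholder supplied by the caller, and the class name, method names, and toy lambdas are assumptions rather than anything defined in the specification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Feature = List[float]  # stand-in for a feature map


@dataclass
class SceneParsingNetwork:
    """Wires the named sub-networks together; every callable is a placeholder."""

    spatial_branch: Callable[[float], Feature]
    semantic_branch: Callable[[float], Feature]
    feature_fusion: Callable[[Feature, Feature], Feature]
    temporal_fusion: Callable[[Feature, Feature], Feature]

    def region_features(self, frame: float) -> Feature:
        """Fuse spatial and semantic information features of one frame."""
        return self.feature_fusion(self.spatial_branch(frame), self.semantic_branch(frame))

    def parse(self, target: float, second: Optional[Feature] = None) -> Feature:
        """Fuse target-frame features with reference-frame features when given."""
        first = self.region_features(target)
        return first if second is None else self.temporal_fusion(first, second)


# Toy stand-ins: a "frame" is a single number and features are short lists.
net = SceneParsingNetwork(
    spatial_branch=lambda frame: [frame],
    semantic_branch=lambda frame: [frame * 10.0],
    feature_fusion=lambda spatial, semantic: spatial + semantic,
    temporal_fusion=lambda first, second: [(a + b) / 2.0 for a, b in zip(first, second)],
)
reference_features = net.region_features(1.0)
output = net.parse(3.0, reference_features)
```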

[0059] S204, Spatial information segmentation processing is performed on the target video frame through a spatial branching network to obtain spatial information features;

[0060] Understandably, the spatial branch network can also be called a spatial path network. It processes video frames (such as target video frames) to preserve the spatial scale of the original input video frame image while encoding rich spatial information. In one or more embodiments, the spatial branch network may contain i layers (i is a natural number), each containing a convolutional layer with a stride of n (n is a natural number), followed by a batch normalization part (e.g., a batch normalization unit) and a non-linear activation part (e.g., a ReLU unit). Each layer downsamples its input to 1 / n of its size, so the image size output by the spatial branch network is 1 / n^i of the size of the original image. In other words, after processing by the spatial branch network, the video frame (such as the target video frame) yields an output feature map equivalent to 1 / n^i of the original image size, and this output feature map is the spatial information feature corresponding to the video frame. Since the spatial branch network utilizes a large-scale feature map, it can obtain rich spatial information. Understandably, video scenes often involve multiple different categories of objects. Based on the spatial branch network, each pixel feature in the image can be finely classified, providing crucial spatial information for video scene recognition. The richer the spatial information, the more refined the segmented region and the better the edge effect during video scene recognition.
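
The downsampling arithmetic above (a factor of 1 / n per layer, applied over i layers) can be checked with a small helper. The function name is hypothetical, and the rounding rule is an assumption: each strided convolution is taken to be padded so that a side of length s becomes ceil(s / n).

```python
import math
from typing import Tuple


def spatial_branch_output_size(height: int, width: int, stride: int, layers: int) -> Tuple[int, int]:
    """Size of the feature map after `layers` strided convolutions.

    Assumes each convolution is padded so that one layer maps a side of
    length s to ceil(s / stride); the text itself only states the ratio.
    """
    for _ in range(layers):
        height = math.ceil(height / stride)
        width = math.ceil(width / stride)
    return height, width


# Three stride-2 layers reduce each side to 1 / 8 of the input.
size = spatial_branch_output_size(224, 224, stride=2, layers=3)
hd_size = spatial_branch_output_size(1080, 1920, stride=2, layers=3)
```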

[0061] Optionally, the batch normalization (bn) processing section can use multi-card synchronous bn (Syncbn) or single-card bn.

[0062] Understandably, multi-card synchronous bn (Syncbn) performs better and faster than single-card bn when processing multiple video frames.

[0063] In some implementations, the scene parsing network can automatically switch between "multi-card synchronous bn" and "single-card bn" based on task type and / or device environment. For example, the electronic device can determine the current device environment (such as graphics card parameters) through the scene parsing network and switch accordingly: to multi-card synchronous bn when the device environment supports it, or to single-card bn when it does not. Also illustratively, the electronic device can automatically switch between "multi-card synchronous bn" and "single-card bn" based on the amount of input data (such as the number of video frames, data volume, etc.). For example, when the amount of input data exceeds a threshold, it can automatically switch to "multi-card synchronous bn" in the spatial branch network.

[0064] Optionally, the scene parsing network may include a decision unit that automatically switches between "multi-card synchronous bn" and "single-card bn" in the spatial branch network. Understandably, the decision unit can detect the current device environment and select between "multi-card synchronous bn" and "single-card bn", thereby improving the intelligence of the scene parsing network.
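
A decision unit of this kind might look like the sketch below. The text describes switching on the device environment and, separately, on the amount of input data; combining both conditions in one rule is an assumption made here, as are the DeviceEnvironment fields, the function name, and the default threshold of 8 frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceEnvironment:
    """Hypothetical summary of the device environment seen by the decision unit."""

    gpu_count: int
    supports_sync_bn: bool


def choose_batch_norm(env: DeviceEnvironment, num_frames: int, frame_threshold: int = 8) -> str:
    """Return "sync_bn" or "single_bn" for the spatial branch network.

    Multi-card synchronous bn is chosen only when the environment supports it
    and the input exceeds the (assumed) frame threshold.
    """
    multi_card_ok = env.gpu_count > 1 and env.supports_sync_bn
    if multi_card_ok and num_frames > frame_threshold:
        return "sync_bn"
    return "single_bn"


large_job = choose_batch_norm(DeviceEnvironment(gpu_count=4, supports_sync_bn=True), num_frames=16)
small_job = choose_batch_norm(DeviceEnvironment(gpu_count=4, supports_sync_bn=True), num_frames=2)
one_card = choose_batch_norm(DeviceEnvironment(gpu_count=1, supports_sync_bn=False), num_frames=16)
```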

[0065] Illustratively, as shown in Figure 3, which is a schematic diagram of a spatial branch network, the spatial branch network can consist of three layers, each containing a convolutional layer with a stride of 2, multi-card synchronous bn, and ReLU. Each layer downsamples its input by a factor of 2, so the three layers together downsample the input video frame (e.g., the target video frame) to 1 / 8 of its original size. The spatial branch network thus extracts an output feature map equivalent to 1 / 8 of the video frame (e.g., the target video frame). The spatial information features corresponding to the video frame are represented in the form of this output feature map. Optionally, the decision unit of the scene parsing network pre-detects the current device environment and automatically switches to multi-card synchronous bn based on the device environment. This improves the parsing efficiency of the scene parsing network while also taking computational resource requirements into account, thus enhancing the intelligence of the scene parsing network.

[0066] S206, Semantic information segmentation processing is performed on the target video frame through a semantic branching network to obtain semantic information features;

[0067] Understandably, the semantic branch network can also be called a context path network. The semantic branch network can extract and parse semantic information features from video frames (such as target video frames and reference video frames), providing a sufficient receptive field for the subsequent synthesis of first region image features. In some implementations, to combine the features of the semantic branch network with the spatial information features output by the spatial branch network, an attention optimization module (ARM) is used in the semantic branch network to optimize the features of each visual stage. The global average pooling of the attention optimization module captures the global context of the video frame and calculates an attention vector to guide feature learning. This optimizes the output features of each (visual) stage in the semantic branch network and integrates global context information easily, without requiring any upsampling operation, so that accurate semantic information features are obtained.
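
The idea of pooling a global context vector and using it to reweight features can be sketched as follows. The layers inside the attention optimization module are not given in this passage, so the sketch assumes one common arrangement: a per-channel global average, a learned channel-mixing matrix (standing in for a 1 x 1 convolution), and a sigmoid. The function names and the weights argument are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # [channel][row][col]


def global_average_pool(features: FeatureMap) -> List[float]:
    """One mean value per channel: the global context vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]


def attention_refine(features: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """Scale each channel by a sigmoid attention value derived from global context.

    `weights` is a hypothetical C x C matrix standing in for a learned 1 x 1
    convolution; it is an assumption, not taken from the text.
    """
    context = global_average_pool(features)
    attention = [
        1.0 / (1.0 + math.exp(-sum(w * c for w, c in zip(w_row, context))))
        for w_row in weights
    ]
    return [
        [[value * a for value in row] for row in channel]
        for channel, a in zip(features, attention)
    ]


features = [[[1.0, 3.0]], [[2.0, 2.0]]]   # 2 channels, each 1 x 2 pixels
identity = [[1.0, 0.0], [0.0, 1.0]]       # toy weights: no channel mixing
refined = attention_refine(features, identity)
```

Because the attention vector has one value per channel and is broadcast over all positions, the output keeps the spatial size of the input, which is consistent with the statement that no upsampling is needed.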

[0068] In one or more embodiments of this specification, the semantic branch network may not be constructed based on a convolutional neural network to avoid the problems of limited effectiveness, insufficient semantic information, and inadequate mining of non-local information and correlation between pixel blocks when using a deep convolutional neural network model to extract image semantic information. Instead, the main branches are constructed based on a visual transformer to replace the semantic branches usually based on convolutional neural networks, thereby better extracting the semantic information contained in the image and also extracting the interrelationships between different pixel blocks to obtain more accurate semantic information features.

[0069] Understandably, the semantic information features can be obtained by having the visual-transformer-based semantic branch network perform semantic information segmentation processing on the target video frame.

[0070] In one or more embodiments of this specification, the semantic branching network based on the visual transformer consists of a block splitting layer, four visual stages, and an attention optimization module; the first visual stage consists of a linear embedding layer and a visual transformer; the second, third, and fourth visual stages consist of a block merging layer and the visual transformer; the third visual stage is connected to the first attention optimization module; and the fourth visual stage is connected to the second attention optimization module.

[0071] For example, Figure 4 is a schematic diagram of a semantic branch network architecture. As shown in Figure 4, the semantic branch network is based on a visual transformer and mainly consists of a block splitting layer, four visual stages (stages 1-4 in the figure), and attention optimization modules (ARM1 and ARM2 in the figure). Each visual stage is used to downsample the input video frames. The first visual stage (stage 1) includes a linear embedding layer and a visual transformer. The second visual stage (stage 2), the third visual stage (stage 3), and the fourth visual stage (stage 4) each consist of a block merging (patch merging) layer and a visual transformer.

[0072] Understandably, the input video frame (such as the target video frame, which can be an H×W×3 RGB image) is first split by the block splitting layer into non-overlapping image patches of equal size. Each image patch is treated as a patch token (which can be understood as one block in the set of image patches), and the N patch tokens in total (N being the effective input sequence length of the visual transformer) form an image patch sequence.

[0073] For example, the block splitting layer can split an input such as a 224×224 target image into a set of non-overlapping patches, each of size 4×4. Since the 224×224 target image has 3 color channels, each patch has a feature dimension of 4×4×3 = 48, and the number of patches is H/4 × W/4.
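
Illustratively, the block splitting described above can be sketched with array reshaping. This is a minimal sketch assuming NumPy and an H×W×3 array layout; the function name `split_into_patches` is hypothetical and not part of this specification.

```python
import numpy as np


def split_into_patches(image: np.ndarray, patch: int = 4) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch tokens of length patch*patch*C."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by the patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (H/4, W/4, 4, 4, C)
    return grid.reshape(-1, patch * patch * c)    # one row per patch token


frame = np.zeros((224, 224, 3), dtype=np.float32)
tokens = split_into_patches(frame)
print(tokens.shape)  # (3136, 48): 56 x 56 patches, each with 4*4*3 = 48 features
```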

[0074] The first visual stage (stage 1): The feature dimensions of the divided patch are first projected to an arbitrary dimension C through a linear embedding layer, and then fed into the visual transformer of stage 1 to identify the feature vector of the image in each block. The visual transformer of stage 1 keeps the number of input and output image blocks unchanged.

[0075] The second visual stage (stage 2), the third visual stage (stage 3), and the fourth visual stage (stage 4) operate similarly. The image patch sequence is taken as input and first passed through a block merging layer, which merges the input according to the size of adjacent blocks (e.g., merging each group of 2×2 adjacent patches). This reduces the number of patch tokens in the image patch sequence by a corresponding factor (e.g., merging 2×2 adjacent patches reduces the number of patch tokens to 1/4 of the original) and increases the feature dimension of each patch token by a corresponding factor (e.g., to 4C in the example above). A linear layer is then applied to the concatenated features of each patch, halving the output dimension (e.g., from 4C to 2C). Finally, a visual transformer performs feature transformation while keeping the resolution unchanged. The first block merging layer together with the visual transformer used for feature transformation constitutes the second visual stage. Repeating the same process two more times constitutes the third visual stage (stage 3) and the fourth visual stage (stage 4). Each visual stage thus changes the tensor dimensions, forming a hierarchical representation. By processing the feature vectors of video frames through the four visual stages, the semantic information features of the video frames, represented in the form of feature vectors, are obtained. It can be understood that in the visual transformer network, the size of each image patch is pre-configured, and the number of patches can be determined from the configured image patch size.
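
Illustratively, the block merging step (2×2 neighbors merged, feature dimension 4C, then a linear layer down to 2C) can be sketched as follows. This is a sketch under assumptions: NumPy is used, C = 96 and the random projection weights are placeholder values, the ordering of the four concatenated neighbors is arbitrary, and `merge_patches` is a hypothetical name.

```python
import numpy as np


def merge_patches(tokens: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """tokens: (H, W, C) grid of patch tokens; weight: (4C, 2C) linear projection."""
    h, w, c = tokens.shape
    grid = tokens.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    merged = grid.reshape(h // 2, w // 2, 4 * c)  # 1/4 as many tokens, 4C features each
    return merged @ weight                         # linear layer: 4C -> 2C


rng = np.random.default_rng(0)
C = 96                                             # assumed embedding dimension
stage1_out = rng.standard_normal((56, 56, C))
projection = rng.standard_normal((4 * C, 2 * C)) * 0.02
stage2_in = merge_patches(stage1_out, projection)
print(stage2_in.shape)  # (28, 28, 192)
```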

[0076] The block splitting layer is used to segment video frames into multiple image patches and obtain the feature vector of each image patch. The four vision stages are used to perform image semantic recognition on video frames based on the feature vectors, and obtain semantic information such as category semantics, position semantics, and pixel semantics of different categories of objects in the video frames (the semantic information is represented in the form of feature vectors).

[0077] After convolutional data is input to the patch partition layer, the patch partition layer divides the input convolutional data into a set of non-overlapping patches, which serve as the input features for the Swin Transformer network. The semantic branch network, as one of the backbones, is constructed by stacking visual transformers from various visual stages. The input features undergo feature dimension transformation through linear embedding layers. The network composed of visual transformers achieves the reuse of image semantic features by merging adjacent patches in the input.

[0078] Optionally, each visual stage may include multiple visual transformers. Figure 5 is a schematic diagram of the network structure of a visual transformer; as shown in Figure 5, it contains two visual transformers. Each visual stage typically consists of alternating visual transformers, one built on window-based MSA (multi-head self-attention) and one built on shifted-window-based MSA, each followed by a two-layer MLP (multilayer perceptron). The window-based MSA is referred to as W-MSA, and the shifted-window-based MSA as SW-MSA. The number of visual transformer layers in each visual stage is usually a multiple of two, with one layer using W-MSA and the next using SW-MSA. A LayerNorm (LN) layer is used before each W-MSA (or SW-MSA) module and each MLP, and a residual connection is used after each MSA and MLP. The (shifted-)window-based MSA module (W-MSA or SW-MSA) divides the input image into non-overlapping windows and performs self-attention within each window, so its computational complexity is linear in the image size.
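
Illustratively, the linear complexity of window-based MSA can be seen by comparing the commonly cited cost estimates for global MSA and W-MSA from the Swin Transformer literature. The sketch below assumes a window size of M = 7 and a channel count of 96, neither of which is specified here; the function names are hypothetical.

```python
def global_msa_cost(h: int, w: int, c: int) -> int:
    """Estimated cost of global self-attention: quadratic in the number of tokens h*w."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_msa_cost(h: int, w: int, c: int, m: int = 7) -> int:
    """Estimated cost of window-based self-attention: linear in h*w for a fixed window size m."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


base = window_msa_cost(56, 56, 96)
doubled = window_msa_cost(112, 56, 96)
print(doubled == 2 * base)  # True: doubling the token count doubles the W-MSA cost
```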

[0079] It should be noted that attention optimization modules (ARMs) are connected after the third and fourth visual stages. As shown in Figure 4, the third visual stage is connected to the first attention optimization module, and the fourth visual stage is connected to the second attention optimization module.

[0080] Understandably, the attention optimization module (ARM) optimizes the features of each visual stage. Figure 6 is a schematic diagram of the network structure of an attention optimization module. The ARM consists of a global pooling layer, a convolution layer (such as conv(1×1)), a batch normalization layer for accelerating neural network training, and a sigmoid activation function layer. The ARM uses global average pooling to capture the global context and computes attention vectors to guide feature learning, as shown in Figure 6. The third visual stage is connected to the first attention optimization module (ARM1), which optimizes the output of the third visual stage; the fourth visual stage is connected to the second attention optimization module (ARM2), which optimizes the output of the fourth visual stage.
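
Illustratively, the flow through an attention optimization module (global average pooling, 1×1 convolution, sigmoid, then reweighting of the input features) can be sketched as below. This is a simplified sketch assuming NumPy: batch normalization is omitted (treated as identity), the weights are random placeholders, and `attention_refine` is a hypothetical name.

```python
import numpy as np


def attention_refine(feat: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """feat: (C, H, W) stage output; weight: (C, C) and bias: (C,) of the 1x1 convolution."""
    pooled = feat.mean(axis=(1, 2))               # global average pooling -> (C,)
    logits = weight @ pooled + bias               # a 1x1 conv on a 1x1 map is a matrix product
    attention = 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> attention vector in (0, 1)
    return feat * attention[:, None, None]        # reweight every channel of the input


rng = np.random.default_rng(0)
stage_out = rng.standard_normal((8, 7, 7))
refined = attention_refine(stage_out, rng.standard_normal((8, 8)), np.zeros(8))
print(refined.shape)  # (8, 7, 7): same size as the input, so no upsampling is involved
```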

[0081] Understandably, in the semantic branch network, the attention optimization module (ARM) optimizes the features of the visual stages. Using its global average pooling, the attention optimization module captures the global context of video frames and calculates attention vectors to guide feature learning. Introducing the attention optimization module at the outputs of the third and fourth visual stages therefore optimizes the output features of those visual stages and integrates global image context information easily, without any upsampling operation, so that accurate semantic information features are obtained.

[0082] S208, the spatial information features and semantic information features are fused by a feature fusion network to obtain the first region image features corresponding to the target video frame;

[0083] Understandably, the Feature Fusion Network (FFM) is used to integrate spatial information features from the spatial branch network and semantic information features from the semantic branch network.

[0084] Understandably, Figure 7 is a schematic diagram of the network architecture of a feature fusion network. As shown in Figure 7, the feature fusion network consists of a concatenation layer (concatenate), a "convolution layer (Conv) + sync-bn layer + ReLU layer" block, a global pooling layer, a convolution layer (conv(1×1)), a synchronized batch normalization layer for accelerating neural network training (sync-bn layer), and an activation function layer (sigmoid).

[0085] Understandably, the FFM first concatenates the two output features of the Spatial Path and the Context Path (the spatial information features and the semantic information features), and then balances the scales of the two output features through batch normalization. Next, the concatenated features are pooled into a single feature vector, from which a weight vector is calculated. This weight vector reweights the features, playing a role in feature selection and combination. The feature fusion network thus fuses the output features of the two paths for the final prediction, thereby obtaining the fused first region image features corresponding to the target video frame.
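
Illustratively, the fusion steps above (concatenate, mix, pool into a vector, compute a weight vector, reweight) can be sketched as follows. This is a simplified sketch assuming NumPy: the convolution is modeled as a 1×1 projection, batch normalization is omitted, both inputs are assumed to already share the same H×W size, and all weights and names are hypothetical placeholders.

```python
import numpy as np


def fuse_features(spatial: np.ndarray, semantic: np.ndarray,
                  w_mix: np.ndarray, w_att: np.ndarray) -> np.ndarray:
    """spatial: (Cs, H, W); semantic: (Cc, H, W); w_mix: (Co, Cs+Cc); w_att: (Co, Co)."""
    joined = np.concatenate([spatial, semantic], axis=0)        # concatenate the two paths
    mixed = np.einsum("oc,chw->ohw", w_mix, joined)             # 1x1 convolution (BN omitted)
    mixed = np.maximum(mixed, 0.0)                               # ReLU
    pooled = mixed.mean(axis=(1, 2))                             # global pooling -> feature vector
    weights = 1.0 / (1.0 + np.exp(-(w_att @ pooled)))            # conv(1x1) + sigmoid -> weight vector
    return mixed * weights[:, None, None]                        # reweight the fused features


rng = np.random.default_rng(0)
fused_map = fuse_features(rng.standard_normal((4, 6, 6)), rng.standard_normal((8, 6, 6)),
                          rng.standard_normal((16, 12)), rng.standard_normal((16, 16)))
print(fused_map.shape)  # (16, 6, 6)
```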

[0086] S210, the second region image features corresponding to the reference video frame are determined by the scene parsing network, wherein the reference video frame is a video frame preceding the target video frame;

[0087] Understandably, for the current target video frame, the scene parsing network decomposes the acquisition of image spatial information and the enlargement of the receptive field of the image into two paths. Then, the feature fusion module fuses the image features extracted from the two paths. The simultaneous calculation of the two paths accelerates the inference speed of the scene parsing network for the target video frame, greatly improving the computational efficiency and the detection rate of the model network, and pushing the image semantic segmentation task for the target video frame to a real-time processing level. On the other hand, the scene parsing network determines the second region image features corresponding to the reference video frame preceding the target video frame. The second region image features based on the reference video frame can assist the scene parsing network in recognizing the video scene of the target video frame, avoiding inaccurate recognition due to factors such as insufficient semantic information, insufficient processing and mining of non-local information and correlation information between pixel blocks when recognizing only one video frame.

[0088] In one or more embodiments of this specification, since the reference video frame is a video frame preceding the target video frame, video scene recognition of the reference video frame has typically already been completed by the scene parsing network before the current target video frame is recognized. That is, the second region image features corresponding to the reference video frame were generated at a historical moment preceding the current moment. In some embodiments, the scene parsing network can acquire, during the training phase, the capability of "saving and acquiring the region image features of reference video frames that have already been scene-recognized". In this way, when scene parsing and recognition are performed on the current target video frame, the preceding reference video frames do not need to be recognized a second time: the saved second region image features corresponding to at least one reference video frame can be acquired directly, saving computational resources. At the same time, using the preceding reference video frames as an aid improves the scene recognition of the target video frame.

[0089] S212, the temporal feature fusion module based on the scene parsing network performs feature splicing processing on the image features of the first region and the image features of the second region to obtain image fusion features.

[0090] The image features of the first region and the image features of the second region are usually represented in the form of feature maps, such as feature mask maps.

[0091] In one feasible implementation, the image features of the first region and the image features of the second region can be spliced together along the channel dimension to obtain image fusion features.

[0092] Illustratively, assume that the size of the image features of the first region and of the image features of the second region can each be represented as B×C×H×W, where W is the width, H is the height, C is the number of channels, and B is the batch size. The image features of the first and second regions are concatenated along the channel dimension, so the size of the concatenated feature becomes B×2C×H×W; this B×2C×H×W feature is the image fusion feature.

[0093] In a specific implementation scenario, depending on the actual situation, if multiple reference video frames are used, there are multiple second region image features. These second region image features can first be fused by the temporal feature fusion module of the scene parsing network to obtain the target second region image features. For example, if n reference video frames are used, there are n second region image features, which are fused by the temporal feature fusion module of the scene parsing network to obtain the target second region image features. The temporal feature fusion module then performs feature stitching on the first region image features and the target second region image features. This stitching is similar to the process described above and is not repeated here.

[0094] Optionally, the target second region image features can be obtained by having the temporal feature fusion module of the scene parsing network average the channel features of the second region image features. For example, assuming that the size of each second region image feature is B×C×H×W, the n second region image features together are n×[B×C×H×W]. The arithmetic mean of the n frames' features is then taken channel by channel, so that the n features (e.g., 3 features) are merged into one feature of size B×C×H×W, which is used as the target second region image features.
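
Illustratively, the averaging of n second region image features followed by channel-wise stitching with the first region image features can be sketched with array operations. This is a minimal sketch assuming NumPy and a B×C×H×W layout; `fuse_temporal` is a hypothetical name and the constant-valued inputs are placeholders chosen only to make the result easy to inspect.

```python
import numpy as np


def fuse_temporal(first: np.ndarray, seconds: list) -> np.ndarray:
    """first: (B, C, H, W); seconds: list of n arrays of the same shape."""
    stacked = np.stack(seconds, axis=0)                        # n x [B, C, H, W]
    target_second = stacked.mean(axis=0)                       # arithmetic mean -> [B, C, H, W]
    return np.concatenate([first, target_second], axis=1)      # stitch channels -> [B, 2C, H, W]


B, C, H, W = 2, 8, 4, 4
first_features = np.ones((B, C, H, W))
second_features = [np.full((B, C, H, W), v) for v in (1.0, 2.0, 3.0)]
fusion = fuse_temporal(first_features, second_features)
print(fusion.shape)  # (2, 16, 4, 4)
```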

[0095] S214, Based on the temporal feature fusion module, perform feature fusion classification on the image fusion features and output the region category map corresponding to the target video frame.

[0096] The temporal feature fusion module may include at least one visual transformer; the at least one visual transformer included in the temporal feature fusion module performs feature selection and fusion classification on the fused features of the stitched image.

[0097] In one feasible implementation, the image fusion features can be input into at least one visual transformer of the temporal feature fusion module for feature fusion classification, and then the region category map corresponding to the target video frame can be output.

[0098] Illustratively, the stitched image fusion features, assumed to be of size 2C×H×W, are reduced in dimensionality by at least one visual transformer with a convolutional layer, and their size becomes N×H×W, where N is the number of segmentation categories (which can be understood as the number of different object categories). Softmax processing is then performed along the dimension N, the "number of segmentation categories", for feature selection and fusion classification, followed by argmax processing to obtain an H×W region category map, such as a region category mask. The value of each pixel in the mask indicates the category to which the pixel belongs, and regions with the same pixel value are objects of the same category. In other words, objects of the same category have the same pixel value in the region category mask, so the mask intuitively reflects the regions where the different categories of objects are located in the image.
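
Illustratively, the final softmax and argmax steps that turn an N×H×W score tensor into an H×W region category map can be sketched as follows. This is a minimal sketch assuming NumPy; the tiny hand-filled score tensor and the name `region_category_map` are hypothetical.

```python
import numpy as np


def region_category_map(scores: np.ndarray) -> np.ndarray:
    """scores: (N, H, W) per-category scores -> (H, W) map of category indices."""
    shifted = scores - scores.max(axis=0, keepdims=True)        # numerically stable softmax
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    return probs.argmax(axis=0)                                  # most probable category per pixel


scores = np.zeros((3, 2, 2))      # N = 3 categories on a 2 x 2 frame
scores[1, 0, 0] = 5.0             # top-left pixel favors category 1
scores[2, 1, 1] = 5.0             # bottom-right pixel favors category 2
print(region_category_map(scores).tolist())  # [[1, 0], [0, 2]]
```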

[0099] Illustratively, Figure 8 is a schematic diagram of a temporal feature fusion module. As shown in Figure 8, the temporal feature fusion module may include at least a feature stitching part (such as a feature stitching unit) and a visual transformer. The module takes at least the first region image features and the second region image features as inputs. The feature stitching part then performs feature stitching on the first region image features and the second region image features, and its output is the stitched image fusion features. The image fusion features are input into at least one visual transformer for feature fusion classification, and finally the region category map corresponding to the target video frame is output.

[0100] Understandably, for example, if n reference video frames are used, there are n second region image features. The n second region image features are fused by the feature stitching part (such as the feature stitching unit) of the temporal feature fusion module to obtain the target second region image features. The target second region image features and the first region image features are then stitched together to obtain the image fusion features.

[0101] Optionally, in some implementations, the number of visual transformers in the temporal feature fusion module can be two. As shown in Figure 5, the temporal feature fusion module consists of a visual transformer built on window-based MSA connected to a visual transformer built on shifted-window-based MSA. The window-based MSA can be called W-MSA, and the shifted-window-based MSA can be called SW-MSA. A LayerNorm (LN) layer is used before each W-MSA (or SW-MSA) module and each MLP, and a residual connection is used after each MSA and MLP.

[0102] In one or more embodiments of the specification, a target video frame is acquired from a target video and input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to a reference video frame, wherein the reference video frame is a video frame preceding the target video frame. The scene parsing network then performs temporal feature fusion on the first region image features and the second region image features. The second region image features based on the reference video frame can assist in the video scene recognition of the target video frame, thereby obtaining rich image information in the video scene. The region category map corresponding to the target video frame is then output. The region category map contains accurate region category results, which can significantly improve the accuracy of video scene recognition. Furthermore, by introducing a temporal feature fusion module to capture and utilize information between video frames, fine and accurate semantic segmentation of video frames can be performed, which also improves the robustness of video scene recognition.

[0103] Referring to Figure 9, a flowchart of a video scene recognition method according to an embodiment of this specification is provided. Specifically, the video scene recognition method may include the following steps:

[0104] S302, acquire the target video frame and at least one reference video frame of the target video;

[0105] It is understood that the reference video frame is a video frame preceding the target video frame. In this application, a reference video frame is also introduced during the process of using a scene parsing network to perform video scene recognition on the target video frame. The scene parsing network can combine the second region image features corresponding to the reference video frame to capture and extract feature information between video frames of the target video, so as to achieve fine and accurate semantic segmentation of the video scene corresponding to the target video frame and improve the accuracy of video scene recognition.

[0106] In the embodiments of this specification, at least one reference video frame may be determined for the target video frame.

[0107] In one feasible implementation, a target frame interval and a target number can be set, and the target number of reference video frames preceding the target video frame can be obtained based on the target frame interval.

[0108] The target frame interval is the frame interval set for selecting reference video frames, and the target number is the number of reference video frames set, i.e., the number of reference video frames selected. For example, the target frame interval can be set to 3, and the target number can be 3. Assuming the target video frame is the i-th frame, the selected reference video frames are the (i-3)-th frame, the (i-6)-th frame, and the (i-9)-th frame.
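
Illustratively, selecting reference video frames from a target frame interval and a target number can be sketched as below. The function name is hypothetical, and dropping indices that fall before the start of the video is an assumption not stated in this specification.

```python
def select_reference_frames(target_index: int, interval: int, count: int) -> list[int]:
    """Return the indices of up to `count` earlier frames, spaced `interval` frames apart."""
    candidates = (target_index - interval * k for k in range(1, count + 1))
    return [idx for idx in candidates if idx >= 0]   # assumed: skip frames before the video start


print(select_reference_frames(10, interval=3, count=3))  # [7, 4, 1]
```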

[0109] Optionally, before determining or acquiring at least one reference video frame, the video scene type corresponding to the target video frame can be determined to obtain the target frame interval and / or target quantity corresponding to the video scene type.

[0110] The video scene type is determined based on the actual application scenario, and can be an outdoor scene type, an indoor scene type, a night scene type, etc. The video scene type can be determined by scene recognition of the target video frame or the target video; or it can be determined by obtaining the video function options triggered by the user before shooting the target video. For example, if the function corresponding to the video chat scene is triggered to shoot the video, then the video scene type corresponding to the target video frame is determined to be the video chat scene type.

[0111] Understandably, at least one parameter mapping relationship between a reference scene type and the "reference frame interval and / or reference quantity" is pre-defined. This parameter mapping relationship can be represented in the form of a parameter mapping table, parameter mapping set, parameter mapping combination, etc., without specific limitations here. In practical applications, the video scene type corresponding to the target video frame is determined, and then, based on the aforementioned parameter mapping relationship, the "target frame interval and / or target quantity" corresponding to the current video scene type is determined.
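
Illustratively, such a parameter mapping relationship can be held in a simple lookup table. In the sketch below, every numeric value, the scene type keys, the default entry, and the function name are hypothetical placeholders; this specification does not prescribe them.

```python
# Hypothetical mapping: video scene type -> (reference frame interval, reference quantity).
SCENE_PARAMETERS = {
    "outdoor": (3, 3),
    "indoor": (5, 2),
    "night": (2, 4),
    "video_chat": (3, 3),
}
DEFAULT_PARAMETERS = (3, 3)  # assumed fallback when the scene type is not in the table


def reference_parameters(scene_type: str) -> tuple[int, int]:
    """Look up the target frame interval and target quantity for a video scene type."""
    return SCENE_PARAMETERS.get(scene_type, DEFAULT_PARAMETERS)


interval, quantity = reference_parameters("night")
print(interval, quantity)  # 2 4
```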

[0112] In one feasible implementation, at least one historical video frame preceding the target video frame can be acquired, and the target quantity of reference video frames can be determined from the at least one historical video frame. It is understood that the historical video frames within a specific number or interval preceding the target video frame can be acquired (video frames beyond that number or interval are less relevant and are not used as references), and the target quantity of reference video frames can be determined by comparing the differences between each historical video frame and the target video frame.

[0113] Understandably, when the differences between historical video frames and target video frames are small, and considering that these two video frames are quite similar to each other, the reference value of the feature fusion of historical video frames is limited. Therefore, video frames with suitable differences can be selected for fusion to obtain more accurate image fusion features.

[0114] Optionally, the image difference between the at least one historical video frame and the target video frame can be determined, and the target quantity of reference video frames can be determined from the at least one historical video frame based on the image difference. Illustratively, all historical video frames within a specific number or interval before the target video frame can be acquired, and the image similarity between each historical video frame and the target video frame can be calculated, yielding an image similarity sequence over the historical video frames; the target quantity of reference video frames can then be selected based on the image similarity sequence. Alternatively, the image similarity sequence can be sorted, and the target quantity of reference video frames can be selected in order from largest to smallest.

[0115] Optionally, the image similarity can be calculated as the feature similarity between the regional image features corresponding to the historical video frame and the regional image features corresponding to the target video frame. The feature similarity can be calculated as the feature vector distance, such as Euclidean distance, cosine similarity, etc.
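
Illustratively, ranking historical video frames by feature similarity and selecting the target quantity from largest to smallest can be sketched as follows. The sketch uses cosine similarity on flattened feature vectors; the function names and the small example vectors are hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equally long feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_reference_frames(target: list, history: dict, count: int) -> list:
    """history maps frame index -> feature vector; return `count` indices, most similar first."""
    ranked = sorted(history, key=lambda idx: cosine_similarity(target, history[idx]), reverse=True)
    return ranked[:count]


history_features = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
print(pick_reference_frames([1.0, 0.0], history_features, count=2))  # [1, 3]
```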

[0116] S304, the target video frame and at least one reference video frame are input into the scene parsing network, and the scene parsing network is used to determine the first region image features corresponding to the target video frame and the second region image features corresponding to each of the reference video frames.

[0117] Understandably, a target video frame and at least one reference video frame can be input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to each of the reference video frames. In other words, when performing video scene recognition on a target video frame: the target video frame and at least one preceding reference video frame are used as input data to the scene parsing network. The scene parsing network can synchronously or asynchronously perform video scene parsing processing on the target video frame and each reference video frame. During video scene parsing processing, the scene parsing network can determine the first region image features corresponding to the target video frame and the second region image features corresponding to each reference video frame. Understandably, the second region image features based on the reference video frames can assist the scene parsing network in recognizing the video scene of the target video frame, avoiding inaccurate recognition due to insufficient semantic information, inadequate processing and mining of non-local information and correlation information between pixel blocks when only recognizing a single video frame.

[0118] Illustratively, Figure 10 is a scene diagram of a scene parsing network involved in this specification. As shown in Figure 10, the target video frame is "FRAME_i", and the reference video frames are the third, sixth, and ninth video frames preceding the target video frame, denoted "FRAME_i-3", "FRAME_i-6", and "FRAME_i-9" respectively. These four video frames are input into the scene parsing network, which can perform video scene parsing on the target video frame and each reference video frame synchronously or asynchronously. After the spatial branch network, the semantic branch network, and the feature fusion network, the first region image features corresponding to "FRAME_i" are obtained ("F-Map_i" in Figure 10), together with the second region image features corresponding to "FRAME_i-3" ("F-Map_i-3"), "FRAME_i-6" ("F-Map_i-6"), and "FRAME_i-9" ("F-Map_i-9"). The temporal feature fusion module then further processes the four region image features. Assuming that the size of each feature is B×C×H×W, the resulting features are 4×[B×C×H×W]. The features of the three reference video frames are averaged along the channel dimension and thus fused into one feature, "F-Map_mean", of size B×C×H×W. This feature is then stitched along the channel dimension with the first region image features "F-Map_i" of the current target video frame to obtain the image fusion features, whose size becomes B×2C×H×W. The image fusion features are input into a two-layer visual transformer for feature fusion and classification, and finally the region category map corresponding to the target video frame is output.
Illustratively, Figure 11 is a scene diagram of a measured video frame example involved in this application. The example in Figure 11 shows the four video frames used: the target video frame "FRAME_i", the third video frame before it "FRAME_i-3", the sixth video frame before it "FRAME_i-6", and the ninth video frame before it "FRAME_i-9". After the aforementioned process, the region category map output for the target video frame is as shown in Figure 11; it accurately identifies the regions of objects such as people, musical instruments, and boxes in the target video frame.

[0119] S306, Obtain the target video frame of the target video;

[0120] For details, please refer to the method steps of one or more embodiments described in this specification, which will not be repeated here.

[0121] S308, the target video frame is input into the scene parsing network, and the first region image features corresponding to the target video frame are determined by the scene parsing network and the second region image features corresponding to at least one saved reference video frame are obtained. The second region image features are the region image features for the reference video frame determined by the scene parsing network.

[0122] In one or more embodiments of this specification, during video scene recognition, the frames of the target video are typically recognized sequentially over multiple rounds. For example, video frame 1 is recognized at time t1, video frame 2 is recognized at time t2, and so on, until video frame i is recognized at time ti. The scene parsing network can acquire, during the training phase, the capability of "saving and acquiring the region image features of reference video frames that have already been recognized". In this way, when scene parsing and recognition are performed on the current target video frame, its preceding reference video frames do not need to be recognized a second time, which saves computational resources and improves the efficiency of scene recognition. Understandably, the target video frame can be input into the scene parsing network, which determines the first region image features corresponding to the target video frame and obtains the saved second region image features corresponding to at least one reference video frame. That is, on the one hand, the scene parsing network extracts the first region image features corresponding to the target video frame; on the other hand, since the second region image features are the region image features previously determined by the scene parsing network for the reference video frame, the scene parsing network can store them. When the target video frame is processed, the second region image features can therefore be obtained directly through the scene parsing network.
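
Illustratively, "saving and acquiring" the region image features of already-recognized frames can be sketched as a small bounded cache keyed by frame index. This is only a sketch of one possible bookkeeping scheme: the class name, the capacity limit, and the oldest-first eviction are assumptions, not the mechanism defined by this specification.

```python
from collections import OrderedDict


class RegionFeatureCache:
    """Keeps the region image features of the most recently recognized frames."""

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self._store = OrderedDict()

    def save(self, frame_index: int, features) -> None:
        """Store features for a frame, evicting the oldest entries beyond the capacity."""
        self._store[frame_index] = features
        self._store.move_to_end(frame_index)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)

    def fetch(self, frame_indices) -> list:
        """Return saved features for the requested frames, skipping any that are missing."""
        return [self._store[i] for i in frame_indices if i in self._store]


cache = RegionFeatureCache(capacity=2)
for index, feature in [(1, "F-Map_1"), (2, "F-Map_2"), (3, "F-Map_3")]:
    cache.save(index, feature)
print(cache.fetch([1, 2, 3]))  # ['F-Map_2', 'F-Map_3']: frame 1 was evicted
```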

[0123] S310, the scene parsing network performs temporal feature fusion on the image features of the first region and the image features of the second region, and outputs the region category map corresponding to the target video frame.

[0124] In one or more embodiments of the specification, the target video can be a video recorded under conditions requiring real-time video scene recognition. The target video can be a video in an online video scenario, such as an online face-to-face interview scenario or an online identity verification scenario. Taking an online face-to-face interview scenario as an example, when a user and a server conduct a video interview to obtain relevant service information, in order to effectively protect the information security of the user or service personnel and improve the video experience for both parties, scene recognition can be performed on at least one frame of the real-time recorded target video to determine the object category region in the video scene (such as the user and / or service personnel category region). This allows for the fine segmentation of the object from the target video frame, followed by replacement with a suitable background or blurring of the background. Alternatively, the object category region in the video scene (such as the user and / or service personnel) can be determined to replace the object with a virtual object (such as replacing the user image with a cartoon object). The aforementioned process involves video scene recognition of at least one frame of the target video to identify the regions where different categories of objects are located in the video scene, in order to determine the region where the object is located.

[0125] Illustratively, as shown in Figure 12, which is a schematic diagram of a video scene analysis application, the target video frame of the target video in an online face-to-face interview scenario and three reference video frames are input into the scene parsing network. With the assistance of the reference video frames, the network performs scene recognition on the target video frame of the real-time recorded target video and determines the object category region (e.g., ...) in the video scene. Specifically, as shown in Figure 12, the network outputs a region category map containing the regional features of identified objects such as service personnel, desks, and ceilings in the target video frame. The region category map is then processed to finely segment the objects from the target video frame and replace the background, resulting in a background-processed video image frame.
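The background replacement step can be illustrated with a short sketch. It assumes, purely for illustration, that the region category map is a per-pixel grid of category labels with the same size as the frame; the function name `replace_background` and the label strings are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue)


def replace_background(
    frame: Sequence[Sequence[Pixel]],
    category_map: Sequence[Sequence[str]],
    keep_categories: Set[str],
    background: Sequence[Sequence[Pixel]],
) -> List[List[Pixel]]:
    """Keep pixels whose category is in `keep_categories`; take the rest from `background`."""
    output: List[List[Pixel]] = []
    for y, row in enumerate(frame):
        out_row: List[Pixel] = []
        for x, pixel in enumerate(row):
            if category_map[y][x] in keep_categories:
                out_row.append(pixel)             # object pixel stays as recorded
            else:
                out_row.append(background[y][x])  # everything else is replaced
        output.append(out_row)
    return output


# Toy 2x2 frame; labels are illustrative only.
frame = [[(200, 180, 160), (90, 90, 90)],
         [(210, 185, 165), (40, 40, 40)]]
category_map = [["service_personnel", "ceiling"],
                ["service_personnel", "desk"]]
virtual_background = [[(0, 0, 255)] * 2 for _ in range(2)]

processed = replace_background(frame, category_map, {"service_personnel"}, virtual_background)
```

Blurring the background instead of replacing it would follow the same pattern, with the `background` argument being a blurred copy of the frame.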

[0126] In one or more embodiments of the specification, a target video frame is acquired from a target video and input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to a reference video frame, wherein the reference video frame is a video frame preceding the target video frame. The scene parsing network then performs temporal feature fusion on the first and second region image features. The second region image features based on the reference video frame can assist in the video scene recognition of the target video frame, thereby obtaining rich image information in the video scene. A region category map corresponding to the target video frame is then output. The region category map contains accurate region category results, which can significantly improve the accuracy of video scene recognition. Furthermore, by introducing a temporal feature fusion module to capture and utilize information between video frames, fine and accurate semantic segmentation of the video frames can be performed, improving the robustness of video scene recognition. Also, considering that the reference video frame has already been recognized, there is no need for secondary recognition of the reference video frame, saving computational resources.

[0127] The video scene recognition device provided in the embodiments of this specification will now be described in detail. It should be noted that the video scene recognition device is used to execute the methods of the embodiments shown in Figures 1-12 of this specification. For ease of explanation, only the parts related to this specification are shown; for specific technical details not disclosed, please refer to the embodiments shown in Figures 1-12 of this specification.

[0128] Please refer to Figure 13, which illustrates the structure of a video scene recognition device according to an embodiment of this specification. The video scene recognition device 1 can be implemented as all or part of a user terminal through software, hardware, or a combination of both. According to some embodiments, the video scene recognition device 1 includes a frame acquisition module 11, a feature determination module 12, and a feature fusion module 13, specifically used for:

[0129] Frame acquisition module 11 is used to acquire target video frames of the target video;

[0130] The feature determination module 12 is used to input the target video frame into the scene parsing network, and determine the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame through the scene parsing network. The reference video frame is a video frame preceding the target video frame.

[0131] The feature fusion module 13 is used to perform temporal feature fusion on the image features of the first region and the image features of the second region through the scene parsing network, and output the region category map corresponding to the target video frame.

[0132] Optionally, the scene parsing network includes a spatial branch network, a semantic branch network, and a feature fusion network. As shown in Figure 14, the feature determination module 12 includes:

[0133] Spatial processing unit 121 is used to perform spatial information segmentation processing on the target video frame through the spatial branch network to obtain spatial information features;

[0134] The semantic processing unit 122 is used to perform semantic information segmentation processing on the target video frame through the semantic branch network to obtain semantic information features;

[0135] The fusion processing unit 123 is used to perform feature fusion on the spatial information features and semantic information features through the feature fusion network to obtain the first region image features corresponding to the target video frame.
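The data flow through the three units can be sketched as follows. The three branch functions below are toy placeholders, not the networks of the specification: they exist only to show that the spatial and semantic branches both read the same target video frame and that the fusion step combines their outputs. All names and the `[channel][row][column]` layout are assumptions.

```python
from typing import Callable, List

FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][column]
Frame = List[List[float]]             # toy single-channel frame


def first_region_features(
    frame: Frame,
    spatial_branch: Callable[[Frame], FeatureMap],
    semantic_branch: Callable[[Frame], FeatureMap],
    feature_fusion: Callable[[FeatureMap, FeatureMap], FeatureMap],
) -> FeatureMap:
    """Run both branches on the same frame, then fuse their outputs."""
    spatial = spatial_branch(frame)    # spatial information features
    semantic = semantic_branch(frame)  # semantic information features
    return feature_fusion(spatial, semantic)


def toy_spatial_branch(frame: Frame) -> FeatureMap:
    # Placeholder: keeps full-resolution detail as a single channel.
    return [[row[:] for row in frame]]


def toy_semantic_branch(frame: Frame) -> FeatureMap:
    # Placeholder: one channel filled with the global mean as coarse context.
    values = [v for row in frame for v in row]
    mean = sum(values) / len(values)
    return [[[mean for _ in row] for row in frame]]


def toy_feature_fusion(spatial: FeatureMap, semantic: FeatureMap) -> FeatureMap:
    # Placeholder fusion: stack the channels of both branches.
    return spatial + semantic


features = first_region_features(
    [[0.0, 2.0], [4.0, 6.0]], toy_spatial_branch, toy_semantic_branch, toy_feature_fusion
)
```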

[0136] Optionally, the semantic processing unit 122 is specifically used for:

[0137] Semantic information features are obtained by segmenting the target video frame using a semantic branching network based on a visual transformer.

[0138] The semantic branching network based on the visual transformer consists of a block splitting layer, four visual stages, and an attention optimization module; the first visual stage consists of a linear embedding layer and a visual transformer; the second, third, and fourth visual stages consist of a block merging layer and the visual transformer; the third visual stage is connected to the first attention optimization module; and the fourth visual stage is connected to the second attention optimization module.
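The layer arrangement described above can be written down as a plain data structure. This sketch only restates the structure given in the paragraph; the function name and dictionary keys are hypothetical, and no layer sizes or hyperparameters are implied.

```python
from typing import Dict, List, Optional


def build_semantic_branch_layout() -> Dict[str, object]:
    """Describe the semantic branch: a block splitting layer followed by four visual stages."""
    # Stage 3 feeds the first attention optimization module, stage 4 the second.
    attention_module_for_stage: Dict[int, int] = {3: 1, 4: 2}
    stages: List[Dict[str, object]] = []
    for index in range(1, 5):
        # Stage 1 starts with a linear embedding layer, stages 2-4 with a block merging layer.
        entry_layer = "linear embedding layer" if index == 1 else "block merging layer"
        module: Optional[int] = attention_module_for_stage.get(index)
        stages.append({
            "stage": index,
            "layers": [entry_layer, "visual transformer"],
            "attention_optimization_module": module,
        })
    return {"input_layer": "block splitting layer", "stages": stages}


layout = build_semantic_branch_layout()
```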

[0139] Optionally, as shown in Figure 15, the feature fusion module 13 includes:

[0140] Image fusion unit 131 is used to perform feature splicing processing on the image features of the first region and the image features of the second region based on the temporal feature fusion module of the scene parsing network to obtain image fusion features;

[0141] The fusion classification unit 132 is used to perform feature fusion classification on the image fusion features based on the temporal feature fusion module, and output the region category map corresponding to the target video frame.
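One plausible way to turn classification scores into a region category map is a per-pixel argmax. The specification does not state how the final map is derived, so this is an assumption made for illustration; `to_region_category_map`, the score layout, and the class names are hypothetical.

```python
from typing import List, Sequence


def to_region_category_map(
    class_scores: Sequence[Sequence[Sequence[float]]],
    class_names: Sequence[str],
) -> List[List[str]]:
    """Pick, for every pixel, the class with the highest score.

    `class_scores[k][y][x]` is the score of class `k` at pixel `(y, x)`.
    """
    height = len(class_scores[0])
    width = len(class_scores[0][0])
    category_map: List[List[str]] = []
    for y in range(height):
        row: List[str] = []
        for x in range(width):
            best = max(range(len(class_names)), key=lambda k: class_scores[k][y][x])
            row.append(class_names[best])
        category_map.append(row)
    return category_map


# Toy scores for a 1x2 image and two classes.
scores = [
    [[0.9, 0.2]],  # class 0: "person"
    [[0.1, 0.8]],  # class 1: "background"
]
region_map = to_region_category_map(scores, ["person", "background"])
```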

[0142] Optionally, the image fusion unit 131 is specifically used for:

[0143] The image features of the first region and the image features of the second region are concatenated using channel features to obtain image fusion features.
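Channel concatenation can be sketched with nested lists, assuming a `[channel][row][column]` layout for region image features (the layout and the function name `concat_channels` are assumptions). The spatial size must match; only the channel count grows.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][column]


def concat_channels(first: FeatureMap, second: FeatureMap) -> FeatureMap:
    """Concatenate two feature maps along the channel axis."""
    same_height = len(first[0]) == len(second[0])
    same_width = len(first[0][0]) == len(second[0][0])
    if not (same_height and same_width):
        raise ValueError("feature maps must share the same height and width")
    return first + second  # channels of `first`, followed by channels of `second`


first_region = [[[1.0, 2.0]], [[3.0, 4.0]]]  # 2 channels, 1x2
second_region = [[[5.0, 6.0]]]               # 1 channel, 1x2
image_fusion = concat_channels(first_region, second_region)
```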

[0144] Optionally, the temporal feature fusion module includes at least one visual transformer.

[0145] The fusion classification unit 132 is specifically used for:

[0146] The image fusion features are input into at least one visual transformer of the temporal feature fusion module for feature fusion classification, and the region category map corresponding to the target video frame is output.

[0147] Optionally, there may be multiple second region image features.

[0148] The image fusion unit 131 is specifically used for:

[0149] The temporal feature fusion module based on the scene parsing network performs feature fusion on the image features of each second region to obtain the target second region image features after feature fusion processing.

[0150] The temporal feature fusion module based on the scene parsing network performs feature stitching processing on the image features of the first region and the image features of the target second region.

[0151] Optionally, the image fusion unit 131 is specifically used for:

[0152] The temporal feature fusion module based on the scene parsing network performs channel feature averaging on the image features of each second region to obtain the image features of the target second region.
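Channel feature averaging over several reference frames can be sketched as an element-wise mean, again assuming a `[channel][row][column]` layout and identically shaped feature maps. The name `average_features` is hypothetical, and interpreting "channel feature averaging" as an element-wise mean across reference frames is an assumption.

```python
from typing import List, Sequence

FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][column]


def average_features(features: Sequence[FeatureMap]) -> FeatureMap:
    """Average several same-shaped feature maps element by element."""
    if not features:
        raise ValueError("at least one feature map is required")
    count = len(features)
    channels = len(features[0])
    height = len(features[0][0])
    width = len(features[0][0][0])
    return [
        [
            [sum(f[c][y][x] for f in features) / count for x in range(width)]
            for y in range(height)
        ]
        for c in range(channels)
    ]


ref_a = [[[1.0, 3.0]]]  # 1 channel, 1x2
ref_b = [[[3.0, 5.0]]]
target_second = average_features([ref_a, ref_b])
```

The averaged result has the same shape as a single second region feature map, so it can then be concatenated with the first region image features along the channel axis.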

[0153] Optionally, the feature determination module 12 is specifically used for:

[0154] The target video frame and at least one reference video frame are input into a scene parsing network, which determines the first region image features corresponding to the target video frame and the second region image features corresponding to each of the reference video frames; or,

[0155] The target video frame is input into the scene parsing network, and the scene parsing network determines the first region image features corresponding to the target video frame and obtains the second region image features corresponding to at least one saved reference video frame. The second region image features are the region image features for the reference video frame determined by the scene parsing network.

[0156] Optionally, the device 1 is specifically used for:

[0157] Based on a target frame interval, obtain the target number of reference video frames preceding the target video frame; or,

[0158] Obtain at least one historical video frame preceding the target video frame, and determine the target number of reference video frames from the at least one historical video frame.
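Selection by frame interval can be sketched as stepping backwards from the target frame. The function name `select_by_interval` and the use of integer frame indices are assumptions made for illustration.

```python
from typing import List


def select_by_interval(target_index: int, frame_interval: int, target_count: int) -> List[int]:
    """Return up to `target_count` indices of earlier frames, spaced by `frame_interval`."""
    if frame_interval <= 0:
        raise ValueError("frame_interval must be positive")
    indices: List[int] = []
    index = target_index - frame_interval
    while index >= 0 and len(indices) < target_count:
        indices.append(index)
        index -= frame_interval
    return indices  # nearest reference frame first


reference_indices = select_by_interval(target_index=10, frame_interval=2, target_count=3)
```

Near the start of a video fewer frames are available, so the sketch returns a shorter list rather than failing.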

[0159] Optionally, the device 1 is specifically used for:

[0160] Determine the video scene type corresponding to the target video frame, and obtain the target frame interval and / or target quantity corresponding to the video scene type.
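A lookup of this kind could be as simple as a table keyed by scene type. The scene type names and the numeric values below are placeholders invented for illustration; the specification does not give concrete settings.

```python
from typing import Dict, NamedTuple


class ReferenceSettings(NamedTuple):
    frame_interval: int  # spacing between selected reference frames
    target_count: int    # how many reference frames to use


# Placeholder values only; real settings would be chosen per application.
SCENE_SETTINGS: Dict[str, ReferenceSettings] = {
    "online_interview": ReferenceSettings(frame_interval=2, target_count=3),
    "identity_verification": ReferenceSettings(frame_interval=1, target_count=2),
}
DEFAULT_SETTINGS = ReferenceSettings(frame_interval=1, target_count=1)


def settings_for_scene(scene_type: str) -> ReferenceSettings:
    """Return the reference-frame settings for a scene type, with a fallback."""
    return SCENE_SETTINGS.get(scene_type, DEFAULT_SETTINGS)


interview_settings = settings_for_scene("online_interview")
```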

[0161] Optionally, the device 1 is specifically used for:

[0162] Determine the image difference degree between each of the at least one historical video frame and the target video frame, and select the target number of reference video frames from the at least one historical video frame based on the image difference degrees.
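Difference-based selection can be sketched as follows. Two choices here are assumptions, since this passage does not fix them: the difference metric (mean absolute pixel difference on toy grayscale frames) and the selection rule (keep the historical frames that differ least from the target). The function names are hypothetical.

```python
from typing import List, Sequence

GrayFrame = Sequence[Sequence[float]]  # toy grayscale frame: [row][column]


def mean_abs_difference(a: GrayFrame, b: GrayFrame) -> float:
    """Mean absolute pixel difference between two same-sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def select_by_difference(
    history: Sequence[GrayFrame], target: GrayFrame, target_count: int
) -> List[int]:
    """Return indices of the `target_count` historical frames closest to the target."""
    ranked = sorted(
        range(len(history)), key=lambda i: mean_abs_difference(history[i], target)
    )
    return sorted(ranked[:target_count])  # restore chronological order


target_frame = [[10.0, 10.0]]
history_frames = [[[10.0, 10.0]], [[50.0, 50.0]], [[12.0, 12.0]]]
chosen = select_by_difference(history_frames, target_frame, target_count=2)
```

A different rule, such as preferring frames above a minimum difference to avoid near-duplicates, would only change the ranking key.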

[0163] It should be noted that the video scene recognition device provided in the above embodiments is only illustrated by the division of the above functional modules when executing the video scene recognition method. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the video scene recognition device provided in the above embodiments and the video scene recognition method embodiments in this specification belong to the same concept, and the implementation process is detailed in the method embodiments, which will not be repeated here.

[0164] In one or more embodiments of the specification, a target video frame is acquired from a target video and input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to a reference video frame, wherein the reference video frame is a video frame preceding the target video frame. The scene parsing network then performs temporal feature fusion on the first and second region image features. The second region image features based on the reference video frame can assist in the video scene recognition of the target video frame, thereby obtaining rich image information in the video scene. A region category map corresponding to the target video frame is then output. The region category map contains accurate region category results, which can significantly improve the accuracy of video scene recognition. Furthermore, by introducing a temporal feature fusion module to capture and utilize information between video frames, fine and accurate semantic segmentation of the video frames can be performed, improving the robustness of video scene recognition. Also, considering that the reference video frame has already been recognized, there is no need for secondary recognition of the reference video frame, saving computational resources.

[0165] This specification also provides a computer storage medium that can store multiple instructions, the instructions being adapted to be loaded by a processor to execute the video scene recognition method of the embodiments shown in Figures 1-12 above. For the specific execution process, refer to the description of the embodiments shown in Figures 1-12, which will not be elaborated here.

[0166] This specification also provides a computer program product that stores at least one instruction, the at least one instruction being loaded by a processor to execute the video scene recognition method of the embodiments shown in Figures 1-12 above. For the specific execution process, refer to the description of the embodiments shown in Figures 1-12, which will not be elaborated here.

[0167] Please refer to Figure 16, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 16, the electronic device 1000 may include: at least one processor 1001, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002.

[0168] The communication bus 1002 is used to realize the connection and communication between these components.

[0169] The user interface 1003 may include a display screen and a camera. Optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.

[0170] The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface).

[0171] The processor 1001 may include one or more processing cores. The processor 1001 connects to various parts within the electronic device 1000 using various interfaces and lines, and performs various functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1005, and by calling data stored in the memory 1005. Optionally, the processor 1001 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 1001 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content required for display; and the modem handles wireless communication. It is understood that the modem may also be implemented as a separate chip without being integrated into the processor 1001.

[0172] The memory 1005 may include random access memory (RAM) or read-only memory. Optionally, the memory 1005 may include a non-transitory computer-readable storage medium. The memory 1005 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 1005 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch function, sound playback function, image playback function, etc.), instructions for implementing the above-described method embodiments, etc.; the data storage area may store data involved in the above-described method embodiments, etc. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Figure 16, the memory 1005, which serves as a computer storage medium, may include an operating system, a network communication module, a user interface module, and application programs.

[0173] In the electronic device 1000 shown in Figure 16, the user interface 1003 is mainly used to provide an input interface for the user and to obtain user input data, while the processor 1001 can be used to call the application program stored in the memory 1005 and specifically perform the following operations:

[0174] Obtain the target video frame from the target video;

[0175] The target video frame is input into the scene parsing network, and the scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame. The reference video frame is a video frame preceding the target video frame.

[0176] The scene parsing network performs temporal feature fusion on the image features of the first region and the image features of the second region, and outputs the region category map corresponding to the target video frame.

[0177] In one embodiment, the scene parsing network includes a spatial branch network, a semantic branch network, and a feature fusion network. When the processor 1001 executes the step of determining the first region image features corresponding to the target video frame through the scene parsing network, it specifically performs the following steps:

[0178] The spatial information features are obtained by performing spatial information segmentation on the target video frame through the spatial branching network.

[0179] The semantic branch network is used to perform semantic information segmentation on the target video frame to obtain semantic information features.

[0180] The spatial information features and semantic information features are fused by the feature fusion network to obtain the first region image features corresponding to the target video frame.

[0181] In one embodiment, when the processor 1001 performs semantic information segmentation processing on the target video frame through the semantic branch network to obtain semantic information features, it specifically performs the following steps:

[0182] Semantic information features are obtained by segmenting the target video frame using a semantic branching network based on a visual transformer.

[0183] The semantic branching network based on the visual transformer consists of a block splitting layer, four visual stages, and an attention optimization module; the first visual stage consists of a linear embedding layer and a visual transformer; the second, third, and fourth visual stages consist of a block merging layer and the visual transformer; the third visual stage is connected to the first attention optimization module; and the fourth visual stage is connected to the second attention optimization module.

[0184] In one embodiment, when the processor 1001 performs temporal feature fusion of the first region image features and the second region image features through the scene parsing network to output the region category map corresponding to the target video frame, it specifically performs the following steps:

[0185] The temporal feature fusion module based on the scene parsing network performs feature splicing processing on the image features of the first region and the image features of the second region to obtain image fusion features;

[0186] Based on the temporal feature fusion module, the image fusion features are fused and classified to output the region category map corresponding to the target video frame.

[0187] In one embodiment, when the processor 1001 performs the feature concatenation process on the image features of the first region and the image features of the second region to obtain image fusion features, it specifically performs the following steps:

[0188] The image features of the first region and the image features of the second region are concatenated using channel features to obtain image fusion features.

[0189] In one embodiment, the temporal feature fusion module includes at least one visual transformer.

[0190] When performing feature fusion classification on the image fusion features based on the temporal feature fusion module and outputting the region category map corresponding to the target video frame, the following steps are specifically performed:

[0191] The image fusion features are input into at least one visual transformer of the temporal feature fusion module for feature fusion classification, and the region category map corresponding to the target video frame is output.

[0192] In one embodiment, there are multiple second region image features. When the processor 1001 executes the step in which the temporal feature fusion module based on the scene parsing network performs feature concatenation processing on the first region image features and the second region image features, the following steps are specifically performed:

[0193] The temporal feature fusion module based on the scene parsing network performs feature fusion on the image features of each second region to obtain the target second region image features after feature fusion processing.

[0194] The temporal feature fusion module based on the scene parsing network performs feature stitching processing on the image features of the first region and the image features of the target second region.

[0195] In one embodiment, when the processor 1001 executes the temporal feature fusion module based on the scene parsing network to perform feature fusion on the image features of each second region to obtain the target second region image features after feature fusion processing, the following steps are specifically performed:

[0196] The temporal feature fusion module based on the scene parsing network performs channel feature averaging on the image features of each second region to obtain the image features of the target second region.

[0197] In one embodiment, when the processor 1001 executes the step of inputting the target video frame into the scene parsing network and determining the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame through the scene parsing network, the following steps are specifically performed:

[0198] The target video frame and at least one reference video frame are input into a scene parsing network, which determines the first region image features corresponding to the target video frame and the second region image features corresponding to each of the reference video frames; or,

[0199] The target video frame is input into the scene parsing network, and the scene parsing network determines the first region image features corresponding to the target video frame and obtains the second region image features corresponding to at least one saved reference video frame. The second region image features are the region image features for the reference video frame determined by the scene parsing network.

[0200] In one embodiment, when the processor 1001 executes the video scene recognition method, it specifically performs the following steps:

[0201] Based on a target frame interval, obtain the target number of reference video frames preceding the target video frame; or,

[0202] Obtain at least one historical video frame preceding the target video frame, and determine the target number of reference video frames from the at least one historical video frame.

[0203] In one embodiment, before the processor 1001 executes the step of obtaining, based on the target frame interval, the target number of reference video frames preceding the target video frame, it further performs the following steps:

[0204] Determine the video scene type corresponding to the target video frame, and obtain the target frame interval and / or target quantity corresponding to the video scene type.

[0205] In one embodiment, when the processor 1001 executes the step of determining the target number of reference video frames based on the at least one historical video frame, it specifically performs the following steps:

[0206] Determine the image difference degree between each of the at least one historical video frame and the target video frame, and select the target number of reference video frames from the at least one historical video frame based on the image difference degrees.

[0207] In one or more embodiments of the specification, a target video frame is acquired from a target video and input into a scene parsing network. The scene parsing network determines the first region image features corresponding to the target video frame and the second region image features corresponding to a reference video frame, wherein the reference video frame is a video frame preceding the target video frame. The scene parsing network then performs temporal feature fusion on the first and second region image features. The second region image features based on the reference video frame can assist in the video scene recognition of the target video frame, thereby obtaining rich image information in the video scene. A region category map corresponding to the target video frame is then output. The region category map contains accurate region category results, which can significantly improve the accuracy of video scene recognition. Furthermore, by introducing a temporal feature fusion module to capture and utilize information between video frames, fine and accurate semantic segmentation of the video frames can be performed, improving the robustness of video scene recognition. Also, considering that the reference video frame has already been recognized, there is no need for secondary recognition of the reference video frame, saving computational resources.

[0208] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory, or random access memory, etc.

[0209] The above-disclosed embodiments are merely preferred embodiments of this specification and should not be construed as limiting the scope of this specification. Therefore, any equivalent variations made in accordance with the claims of this specification shall still fall within the scope of this specification.

[0210] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

Claims

1. A video scene recognition method, the method comprising: obtaining a target video frame from a target video; inputting the target video frame into a scene parsing network, and determining, through the scene parsing network, first region image features corresponding to the target video frame and second region image features corresponding to a reference video frame, wherein the reference video frame is a video frame preceding the target video frame; and performing temporal feature fusion on the first region image features and the second region image features through the scene parsing network to output a region category map corresponding to the target video frame; wherein performing temporal feature fusion on the first region image features and the second region image features through the scene parsing network to output the region category map corresponding to the target video frame includes: performing, based on a temporal feature fusion module of the scene parsing network, feature splicing processing on the first region image features and the second region image features to obtain image fusion features; and performing, based on the temporal feature fusion module, feature fusion classification on the image fusion features to output the region category map corresponding to the target video frame.

2. The method according to claim 1, wherein the scene parsing network comprises a spatial branching network, a semantic branching network, and a feature fusion network. The step of determining the first region image features corresponding to the target video frame through the scene parsing network includes: The spatial information features are obtained by performing spatial information segmentation on the target video frame through the spatial branching network. The semantic branch network is used to perform semantic information segmentation on the target video frame to obtain semantic information features. The spatial information features and semantic information features are fused by the feature fusion network to obtain the first region image features corresponding to the target video frame.

3. The method according to claim 2, wherein performing semantic information segmentation processing on the target video frame through the semantic branch network to obtain semantic information features includes: Semantic information features are obtained by segmenting the target video frame using a semantic branching network based on a visual transformer. The semantic branching network based on the visual transformer consists of a block splitting layer, four visual stages, and an attention optimization module; the first visual stage consists of a linear embedding layer and a visual transformer; the second, third, and fourth visual stages consist of a block merging layer and the visual transformer; the third visual stage is connected to the first attention optimization module; and the fourth visual stage is connected to the second attention optimization module.

4. The method according to claim 1, wherein performing feature concatenation processing on the image features of the first region and the image features of the second region to obtain image fusion features includes: The image features of the first region and the image features of the second region are concatenated using channel features to obtain image fusion features.

5. The method according to claim 4, wherein the temporal feature fusion module comprises at least one visual transformer. The step of performing feature fusion classification on the image fusion features based on the temporal feature fusion module, and outputting the region category map corresponding to the target video frame, includes: The image fusion features are input into at least one visual transformer of the temporal feature fusion module for feature fusion classification, and the region category map corresponding to the target video frame is output.

6. The method according to claim 4, wherein there are a plurality of second region image features, and performing, based on the temporal feature fusion module of the scene parsing network, feature concatenation processing on the first region image features and the second region image features comprises: performing, based on the temporal feature fusion module of the scene parsing network, feature fusion on the plurality of second region image features to obtain target second region image features; and performing, based on the temporal feature fusion module of the scene parsing network, feature concatenation processing on the first region image features and the target second region image features.

7. The method according to claim 6, wherein performing, based on the temporal feature fusion module of the scene parsing network, feature fusion on the plurality of second region image features to obtain the target second region image features comprises: performing, based on the temporal feature fusion module of the scene parsing network, channel feature averaging on the plurality of second region image features to obtain the target second region image features.
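
The averaging step of claims 6 and 7 can be sketched as follows. This example reads "channel feature averaging" as an element-wise mean taken across the reference feature maps, channel by channel; that reading, the channel-first (C, H, W) nested-list layout, and the name `average_reference_features` are assumptions, not details confirmed by the claims.

```python
def average_reference_features(reference_features):
    """Element-wise mean of several feature maps that share one (C, H, W) shape."""
    if not reference_features:
        raise ValueError("need at least one reference feature map")
    count = len(reference_features)
    c = len(reference_features[0])
    h = len(reference_features[0][0])
    w = len(reference_features[0][0][0])
    return [
        [
            [sum(f[ci][yi][xi] for f in reference_features) / count
             for xi in range(w)]
            for yi in range(h)
        ]
        for ci in range(c)
    ]


refs = [
    [[[1.0, 2.0]], [[3.0, 4.0]]],   # reference frame A: 2 channels, 1x2
    [[[3.0, 6.0]], [[5.0, 8.0]]],   # reference frame B: same shape
]
target_second = average_reference_features(refs)
print(target_second)  # [[[2.0, 4.0]], [[4.0, 6.0]]]
```

The result has the same shape as a single reference feature map, so it can be concatenated with the first region image features exactly as if there were only one reference frame.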

8. The method according to claim 1, wherein inputting the target video frame into the scene parsing network and determining the first region image features corresponding to the target video frame and the second region image features corresponding to the reference video frame through the scene parsing network comprises: inputting the target video frame and at least one reference video frame into the scene parsing network, and determining, through the scene parsing network, the first region image features corresponding to the target video frame and the second region image features corresponding to each reference video frame; or inputting the target video frame into the scene parsing network, determining the first region image features corresponding to the target video frame through the scene parsing network, and obtaining saved second region image features corresponding to at least one reference video frame, the second region image features being the region image features determined for the reference video frame by the scene parsing network.
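
The second alternative in claim 8, reusing saved features rather than re-running the network on reference frames, might look like the sketch below. The class `ReferenceFeatureCache`, its `max_frames` parameter, and the `toy_extractor` stand-in are all hypothetical names introduced for this example; a real system would call the scene parsing network where `extract_features` is invoked.

```python
from collections import deque


class ReferenceFeatureCache:
    """Keeps the region image features of the most recent frames."""

    def __init__(self, max_frames):
        # A bounded deque drops the oldest saved features automatically.
        self._features = deque(maxlen=max_frames)

    def process(self, frame, extract_features):
        """Return (first region features, saved second region features)."""
        first = extract_features(frame)   # computed once per frame
        saved = list(self._features)      # reuse results from earlier frames
        self._features.append(first)      # this frame becomes a reference later
        return first, saved


calls = []


def toy_extractor(frame):
    # Hypothetical stand-in for the feature path of the scene parsing network.
    calls.append(frame)
    return frame * 10


cache = ReferenceFeatureCache(max_frames=2)
results = [cache.process(f, toy_extractor) for f in (1, 2, 3, 4)]
print(results[-1])   # (40, [20, 30])
print(len(calls))    # 4
```

Each frame passes through the extractor exactly once, whereas the first alternative in the claim would process every reference frame again alongside each new target frame.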

9. The method according to claim 1 or 8, further comprising: obtaining, based on a target frame interval, reference video frames preceding the target video frame in a number indicated by a target quantity; or obtaining at least one historical video frame preceding the target video frame, and determining, from the at least one historical video frame, reference video frames in the number indicated by the target quantity.

10. The method according to claim 9, further comprising, before obtaining, based on the target frame interval, the reference video frames preceding the target video frame in the number indicated by the target quantity: determining a video scene type corresponding to the target video frame, and obtaining the target frame interval and/or the target quantity corresponding to the video scene type.
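
A possible reading of the interval-based selection in claims 9 and 10 is sketched here. The scene type names and the values in `SCENE_PRESETS` are invented for illustration, as is the function name `reference_indices`; the claims only state that the interval and quantity may depend on the video scene type.

```python
# Hypothetical presets: scene type -> (frame interval, number of reference frames).
SCENE_PRESETS = {
    "static": (8, 1),
    "dynamic": (2, 3),
}


def reference_indices(target_index, frame_interval, target_quantity):
    """Indices of up to `target_quantity` earlier frames, spaced by `frame_interval`."""
    if frame_interval < 1 or target_quantity < 1:
        raise ValueError("interval and quantity must be positive")
    indices = []
    for k in range(1, target_quantity + 1):
        index = target_index - k * frame_interval
        if index < 0:   # ran off the start of the video
            break
        indices.append(index)
    return indices


interval, quantity = SCENE_PRESETS["dynamic"]
print(reference_indices(10, interval, quantity))  # [8, 6, 4]
print(reference_indices(3, interval, quantity))   # [1]
```

Near the start of a video fewer reference frames are available than requested; this sketch simply returns the ones that exist, which is a design choice of the example rather than something the claims specify.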

11. The method according to claim 9, wherein determining, from the at least one historical video frame, the reference video frames in the number indicated by the target quantity comprises: determining an image difference degree between each of the at least one historical video frame and the target video frame, and selecting, based on the image difference degrees, the reference video frames in the number indicated by the target quantity from the at least one historical video frame.
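
Claim 11 does not say how the image difference degree is computed or whether small or large differences are preferred. The sketch below assumes a mean absolute pixel difference on grayscale frames and keeps the historical frames most similar to the target; both choices, and the names `mean_abs_difference` and `select_reference_frames`, are assumptions made purely for illustration.

```python
def mean_abs_difference(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def select_reference_frames(history, target, target_quantity):
    """Pick the `target_quantity` historical frames closest to the target."""
    scored = [(mean_abs_difference(frame, target), i)
              for i, frame in enumerate(history)]
    scored.sort()                                    # smallest difference first
    chosen = sorted(i for _, i in scored[:target_quantity])
    return [history[i] for i in chosen]              # keep temporal order


target = [[10, 10], [10, 10]]
history = [
    [[0, 0], [0, 0]],        # difference 10.0
    [[9, 10], [10, 11]],     # difference 0.5
    [[12, 12], [12, 12]],    # difference 2.0
]
picked = select_reference_frames(history, target, 2)
print(picked)  # [[[9, 10], [10, 11]], [[12, 12], [12, 12]]]
```

A selection rule that favours larger differences, or a difference threshold, would fit the claim language equally well; only the sort order or the filter would change.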

12. A video scene recognition device, comprising: a frame acquisition module, configured to acquire a target video frame of a target video; a feature determination module, configured to input the target video frame into a scene parsing network and determine, through the scene parsing network, first region image features corresponding to the target video frame and second region image features corresponding to a reference video frame, the reference video frame being a video frame preceding the target video frame; and a feature fusion module, configured to perform temporal feature fusion on the first region image features and the second region image features through the scene parsing network and output a region category map corresponding to the target video frame; wherein performing temporal feature fusion on the first region image features and the second region image features through the scene parsing network to output the region category map corresponding to the target video frame comprises: performing, based on a temporal feature fusion module of the scene parsing network, feature concatenation processing on the first region image features and the second region image features to obtain image fusion features; and performing, based on the temporal feature fusion module, feature fusion classification on the image fusion features to output the region category map corresponding to the target video frame.

13. A computer storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to execute the method steps of any one of claims 1 to 11.

14. An electronic device, comprising a processor and a memory, wherein the memory stores a computer program adapted to be loaded by the processor to execute the method steps of any one of claims 1 to 11.

15. A computer program product storing at least one instruction, the at least one instruction being adapted to be loaded by a processor to execute the method steps of any one of claims 1 to 11.