A video scene segmentation method with multimodal semantic interaction

By combining multimodal semantic interaction with two-stage training (shot representation learning followed by scene segmentation learning), the method exploits the complementarity between modalities that existing video scene segmentation approaches leave unused, with the aim of more accurate and efficient video scene segmentation.

CN116152715B · Active · Publication Date: 2026-06-30 · SHANGHAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI UNIV
Filing Date: 2023-02-24
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing video scene segmentation technologies do not fully consider the complementarity and mutual assistance between different modal information, resulting in insufficient segmentation accuracy.

Method used

A multimodal semantic interaction approach is adopted, which involves two stages: shot representation learning and scene segmentation learning. It utilizes an interactive attention module, a visual context perception module, and an auditory context perception module, combined with self-supervised and supervised learning, to design auxiliary tasks, extract visual and auditory features, and perform video scene segmentation.

Benefits of technology

It achieves more accurate video scene segmentation, improves the model's generalization ability and computational efficiency, fully explores the connections within and between visual and auditory modalities, and obtains richer semantic information.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to a video scene segmentation method with multimodal semantic interaction, comprising: segmenting an input video segment into multiple consecutive shot sequences, extracting visual and auditory features from each shot sequence; constructing a shot representation learning network, utilizing video structural characteristics and semantic interaction characteristics between multiple modal information, designing auxiliary tasks, and learning visual and auditory shot representations suitable for scene segmentation; constructing a scene segmentation learning network, transferring the parameters of the trained shot representation learning network to the scene segmentation learning network, fusing the visual and auditory shot representations to obtain the final shot representation, and achieving scene segmentation through softmax layer classification. Compared with existing technologies, this invention fully explores the connections within and between visual and auditory modalities, and through two-stage training, combines the advantages of self-supervised and supervised learning to achieve more accurate and robust video scene segmentation.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and video understanding, and in particular to a video scene segmentation method with multimodal semantic interaction.

Background Technology

[0002] With the development of the internet and multimedia technologies, the number of videos has exploded, highlighting the growing demand for efficient and intelligent analysis and processing of video data. Videos are organized hierarchically according to "frames," "shots," and "scenes." A key initial step in video analysis is to decompose them into smaller, more easily processed segments, such as shots and scenes. Compared to shots, scenes are the most basic logical story units in video structure, composed of a series of temporally and semantically coherent shots, and can more effectively convey semantic information. Therefore, how to accurately segment video scenes has become a research hotspot in recent years.

[0003] Video scene segmentation takes shots as its research object, aiming to group similar consecutive shots based on their content to divide a video into several semantically related scene segments. Traditional video scene segmentation mainly achieves shot grouping by analyzing low-level features of shots such as color, texture, and optical flow, as well as utilizing fixed rules in film and television editing. However, these methods generally rely heavily on manual features, prior knowledge, and manual parameter settings, resulting in problems such as high computational complexity, low accuracy, and poor generalization performance.

[0004] With the development of deep learning, numerous deep learning-based video scene segmentation methods have been proposed. Early methods mostly employed unsupervised learning, primarily using pre-trained feature extractors to extract one or more deep video features, then calculating the similarity between shots based on similarity metrics to cluster them into scenes. With the release of video scene segmentation datasets, research based on supervised learning has gained increasing attention. These methods learn segmentation from real-world scene labels, offering lower computational cost and higher accuracy compared to unsupervised methods, but they are limited by the need for large amounts of labeled, high-quality training data. In recent years, self-supervised learning has been proposed for video scene segmentation. These methods design auxiliary tasks to replace labels in supervised learning, allowing the use of large amounts of unlabeled video data to learn effective shot representations, which are then grouped using clustering or supervised methods. They leverage the inherent characteristics of video data without requiring manually annotated scene labels, overcoming dataset limitations and achieving higher-quality segmentation.

[0005] For example, Haq et al. proposed a three-fold framework based on object detection and set theory, which utilizes both low-level and high-level features of the video to achieve meaningful scene segmentation. This method first uses color and texture features to calculate the similarity between adjacent frames, segmenting the video into shots. Then, a CNN model is used for object detection in each shot. Finally, based on the detection results, a sliding window-based set theory method is applied to assign similar shots to the same scene.

[0006] Mohamed et al. proposed an unsupervised method for segmenting videos into scenes based on genre prediction, combining multiple modal information from the video to achieve better results. This method first uses deep visual and auditory features to predict the genre of each shot, then calculates the similarity between adjacent shots using the obtained genre prediction vectors, and performs clustering to achieve scene segmentation.

[0007] Rao et al. proposed a method for scene segmentation that employs a "local-to-global" strategy. This method integrates visual, auditory, character, and action information from three hierarchical structures: shot, segment, and film. It can extract rich semantics from the layered temporal structure of long films, providing top-down guidance for scene segmentation and achieving higher accuracy.

[0008] Pei et al. proposed a novel temporal clustering method for video scene segmentation. This method predicts the association scores of adjacent shot nodes in a time series using graph convolution, and then merges shot nodes belonging to the same scene into a single cluster. This method leverages the transitive connectivity between shot nodes, fully considers the temporal order of shots, and avoids complex steps and pre-set parameter configurations, resulting in high computational efficiency.

[0009] Chen et al. proposed a self-supervised shot representation learning method for scene segmentation. This method treats adjacent shots as enhanced versions of each other and achieves shot encoding through contrastive learning, which greatly improves the similarity between shots in the same scene. It can more effectively encode video scene structure and segment complex temporal scene structures in long videos.

[0010] However, current video scene segmentation technologies do not fully consider the complementarity and mutual assistance between different modal information in the video, resulting in insufficient segmentation accuracy.

Summary of the Invention

[0011] The purpose of this invention is to overcome the shortcomings of the existing technology and provide a video scene segmentation method with multimodal semantic interaction.

[0012] The objective of this invention can be achieved through the following technical solutions:

[0013] A video scene segmentation method for multimodal semantic interaction includes the following steps:

[0014] S1. Input a video clip;

[0015] S2. Multimodal feature extraction steps: Segment the input video clip into multiple consecutive shot sequences, and extract the visual and auditory features of each shot sequence;

[0016] S3. Shot Representation Learning Steps: Based on the visual and auditory features, construct a shot representation learning network, and utilize the video structure characteristics and semantic interaction characteristics between multiple modal information to design auxiliary tasks to learn visual and auditory shot representations suitable for scene segmentation.

[0017] S4. Scene segmentation learning steps: Construct a scene segmentation learning network, transfer the parameters of the trained shot representation learning network to the scene segmentation learning network, fuse visual shot representation and auditory shot representation to obtain the final shot representation, and achieve scene segmentation through softmax layer classification.

[0018] Furthermore, in step S3, the shot representation learning network is trained in a self-supervised manner on the MovieNet dataset without ground-truth labels;

[0019] In step S4, the scene segmentation learning stage uses classification loss to perform supervised training on the MovieScenes dataset with real scene labels.

[0020] Furthermore, step S2 specifically includes the following steps:

[0021] For the input video V, use a shot segmentation tool to segment it into n consecutive shot sequences [s_1, s_2, ..., s_n], n > 0;

[0022] For each shot sequence, visual features are extracted using a Bassl model pre-trained on the MovieNet dataset.

[0023] For each shot sequence, a NaverNet model pre-trained on the AVA-ActiveSpeaker dataset is used to separate speech and background sound. Speech features and background sound features are obtained separately through short-time Fourier transform, and the speech features and background sound features are concatenated to obtain auditory features.

[0024] Furthermore, the shot representation learning network includes an interactive attention module, a visual context perception module, and an auditory context perception module;

[0025] The interactive attention module is used to enable one modality to aggregate useful information from all relevant fragments of another modality, including visual and auditory modalities;

[0026] Both the visual context perception module and the auditory context perception module adopt the transformer structure to model the context information of the shot sequence.

[0027] Furthermore, step S3 specifically includes the following steps:

[0028] The visual and auditory features are input into the interactive attention module, and a linear transformation maps the auditory and visual features to the same dimension T × d_m;

[0029] In the visual aspect, the linearly transformed auditory features are input into self-attention; in the auditory aspect, the linearly transformed visual features are input into self-attention, thereby modeling the temporal relationship between shots within the modality.

[0030] The self-attention output is then combined with the linearly transformed auditory features through a residual connection to obtain the auditory features, i.e.:

[0031]

[0032] Where ω denotes the parameters to be learned by the shot representation learning network, i denotes the i-th shot, and the operator applied to the transformed features is self-attention;
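The expression for this step is not legible in this text. One form consistent with the surrounding description is sketched below; the symbol names are assumptions chosen for illustration and are not the patent's own notation: A is the T × d_a matrix of auditory features, ω_a is the learnable linear map to dimension d_m, SA is self-attention over the T shots, and row i of the result corresponds to the i-th shot.

```latex
\tilde{A} \;=\; \mathrm{SA}\!\left(A\,\omega_a\right) \;+\; A\,\omega_a,
\qquad A \in \mathbb{R}^{T \times d_a},\quad
\omega_a \in \mathbb{R}^{d_a \times d_m},\quad
\tilde{A} \in \mathbb{R}^{T \times d_m}
```

Under the same assumptions, the visual counterpart of paragraph [0034] would replace A and ω_a with the visual feature matrix and its own linear map.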

[0033] The obtained auditory features and the linearly transformed visual features are input into a cross-modal attention block to model the interaction between visual and auditory features, resulting in enhanced visual features;

[0034] Similarly, the corresponding visual features are obtained through a residual connection between the self-attention output and the linearly transformed visual features, i.e.:

[0035]

[0036] The obtained visual features and the linearly transformed auditory features are input into a cross-modal attention block to model the interaction between auditory and visual features, resulting in enhanced auditory features;

[0037] Based on the enhanced visual features, the input shot sequence is divided into two semantically disjoint subsequences using a dynamic time warping algorithm, and the scene pseudo-labels corresponding to the shot sequence are calculated;

[0038] The enhanced visual features and the enhanced auditory features are input into the visual context perception module and the auditory context perception module, respectively, and the optimal visual and auditory shot representations are learned through auxiliary-task training based on the scene pseudo-labels.

[0039] Furthermore, the cross-modal attention block performs the following steps:

[0040] On the visual side, the visual features are linearly transformed to obtain a query feature vector Q; the input visual features and auditory features are concatenated, and the concatenated features pass through two linear transformations to obtain the key feature vector K and the value feature vector V. The query feature vector Q is multiplied by the key feature vector K to obtain a feature map representing the correlation between Q and K. Finally, this feature map is multiplied by the value feature vector V to obtain the cross-modal attention output, i.e.:

[0041]

[0042] Where ω_q, ω_k and ω_v are parameters to be learned by the shot representation learning network;

[0043] Finally, the cross-modal attention output is combined with the linearly transformed visual features through a residual connection to obtain the enhanced visual features;

[0044] On the auditory side, the same operations are performed on the input auditory and visual features, modeling intra-modal temporal information and inter-modal interaction information to obtain the enhanced auditory features.
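The cross-modal attention block described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the concatenation is taken along the time axis (giving 2T keys and values), and the scaling by the square root of the dimension and the row-wise softmax are standard attention practice that the text does not state. The helper names and the random weights are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def cross_modal_attention(vis, aud, w_q, w_k, w_v):
    """vis, aud: T x d_m features (already linearly transformed).

    Returns the enhanced visual features, T x d_m.
    """
    q = matmul(vis, w_q)              # queries come from the visual stream
    memory = vis + aud                # assumption: concatenate along time -> 2T x d_m
    k = matmul(memory, w_k)
    v = matmul(memory, w_v)
    d = len(w_q[0])
    scores = matmul(q, transpose(k))  # T x 2T correlation map between Q and K
    scores = [[s / math.sqrt(d) for s in row] for row in scores]  # assumed scaling
    attn = softmax_rows(scores)       # assumed normalisation
    out = matmul(attn, v)             # T x d_m cross-modal attention output
    # Residual connection with the (transformed) visual features.
    return [[o + x for o, x in zip(o_row, x_row)] for o_row, x_row in zip(out, vis)]


rng = random.Random(0)
T, d_m = 4, 6
vis = rand_matrix(T, d_m, rng)
aud = rand_matrix(T, d_m, rng)
w_q, w_k, w_v = (rand_matrix(d_m, d_m, rng) for _ in range(3))
enhanced_vis = cross_modal_attention(vis, aud, w_q, w_k, w_v)
```

Swapping the roles of `vis` and `aud` in the call gives the auditory-side counterpart with its own weights.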

[0045] Furthermore, the scene pseudo-labels corresponding to the shot sequence are calculated as follows:

[0046] The first shot s_1 in the input sequence is considered to belong to the first subsequence S_left, and the last shot s_n to the second subsequence S_right. By computing the cosine similarity between each remaining shot and these two shots, the shots that are semantically most similar to each of them are found, finally yielding two optimal subsequences S_left = {s_1, ..., s_{1+i}} and S_right = {s_{i+2}, ..., s_n}, where s_{1+i} is the computed boundary shot; the pseudo-label of this shot is set to 1, and the pseudo-labels of the other shots are set to 0.
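The pseudo-label computation can be illustrated with a short sketch. The exact dynamic time warping formulation is not given in the text, so this example assumes the split is chosen to maximize the total cosine similarity of left-hand shots to the first shot plus right-hand shots to the last shot; with only two segments, a scan over all split points finds that optimum directly. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def pseudo_boundary(features):
    """Return (boundary_index, labels) for a list of shot feature vectors.

    boundary_index is the 0-based index of the last shot of S_left;
    labels holds 1 for that shot and 0 for all others.
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two shots")
    first, last = features[0], features[-1]
    sim_first = [cosine(f, first) for f in features]
    sim_last = [cosine(f, last) for f in features]
    best_b, best_score = 0, -math.inf
    for b in range(n - 1):  # S_left = shots[0..b], S_right = shots[b+1..n-1]
        score = sum(sim_first[: b + 1]) + sum(sim_last[b + 1:])
        if score > best_score:
            best_b, best_score = b, score
    labels = [1 if j == best_b else 0 for j in range(n)]
    return best_b, labels


# Toy features: three shots near one direction, two near another.
shots = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
boundary, labels = pseudo_boundary(shots)
print(boundary, labels)  # 2 [0, 0, 1, 0, 0]
```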

[0047] Furthermore, the auxiliary tasks include scene matching, context grouping matching, and pseudo-boundary prediction;

[0048] Scene matching encourages the model to maximize the similarity of shots within a scene while minimizing the similarity between different scenes; S_left and S_right are treated as pseudo-scenes, and the model is trained with the InfoNCE loss computed between the pseudo-boundary feature f_{1+i} and the two subsequences, namely:

[0049]

[0050] Where h_SSM denotes a linear transformation; r(S) is a scene-level representation obtained by averaging the shot features in the subsequence;
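A minimal sketch of an InfoNCE-style scene matching loss is given below. Several details are assumptions, since the formula is not legible here: the boundary shot is the last shot of S_left, so r(S_left) is treated as the positive and r(S_right) as a negative; h_SSM is omitted (taken as the identity); similarity is a plain dot product; and the temperature value is arbitrary. All names are hypothetical.

```python
import math


def mean_pool(shots):
    """r(S): average the shot features of a subsequence."""
    n = len(shots)
    return [sum(col) / n for col in zip(*shots)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax of the positive logit among all logits."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]


s_left = [[1.0, 0.0], [0.9, 0.1]]
s_right = [[0.0, 1.0], [0.1, 0.9]]
f_boundary = [1.0, 0.0]  # stand-in for the pseudo-boundary feature f_{1+i}

loss_matched = info_nce(f_boundary, mean_pool(s_left), [mean_pool(s_right)])
loss_swapped = info_nce(f_boundary, mean_pool(s_right), [mean_pool(s_left)])
```

The loss is small when the boundary feature is closer to its own pseudo-scene than to the other one, and large when the roles are swapped.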

[0051] Context grouping matching constructs triples (s_c, s_pos, s_neg) around the center shot s_c of the input sequence: s_pos is a shot randomly sampled from the same subsequence as s_c and serves as the positive sample, while s_neg is a shot sampled from the other subsequence and serves as the negative sample. A binary cross-entropy loss is used to learn whether two given shots belong to the same scene, i.e.:

[0052]

[0053] Where h_CGM takes two shots as input and predicts a matching score; R_c, R_pos and R_neg are the contextual shot features of s_c, s_pos and s_neg, respectively, computed by the context perception module;
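The triple construction and the binary cross-entropy objective for context grouping matching can be sketched as follows. This is an illustration only: the matching head h_CGM is not implemented, so the loss takes its two output probabilities as plain numbers, and the function names are hypothetical.

```python
import math
import random


def sample_triple(n, boundary, rng):
    """Pick (center, positive, negative) shot indices.

    S_left = [0..boundary], S_right = [boundary+1..n-1]; the center shot is
    the middle of the sequence.
    """
    center = n // 2
    left = list(range(0, boundary + 1))
    right = list(range(boundary + 1, n))
    same, other = (left, right) if center <= boundary else (right, left)
    candidates = [j for j in same if j != center]
    if not candidates or not other:
        raise ValueError("subsequences too short to build a triple")
    return center, rng.choice(candidates), rng.choice(other)


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def cgm_loss(score_pos, score_neg):
    """score_pos / score_neg: assumed outputs of h_CGM for (R_c, R_pos) and (R_c, R_neg)."""
    return bce(score_pos, 1) + bce(score_neg, 0)


rng = random.Random(0)
center, pos, neg = sample_triple(n=8, boundary=2, rng=rng)
good = cgm_loss(0.9, 0.1)  # head agrees with the pseudo-scenes
bad = cgm_loss(0.1, 0.9)   # head disagrees
```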

[0054] Pseudo-boundary prediction is used to identify moments of semantic change. It is implemented by computing the binary cross-entropy loss between the pseudo-boundary shot and a randomly sampled non-boundary shot, i.e.:

[0055]

[0056] Where R_{1+i} and R_m denote the contextual representations of the pseudo-boundary shot s_{1+i} and the randomly sampled shot s_m, respectively; h_PP maps a shot representation to a binary probability distribution.
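The formula for this loss is not legible in this text. A binary cross-entropy of the kind described would take the following form, under the assumption that h_PP(R) is read as the predicted probability that a shot is a boundary; the loss name is hypothetical, and this is a sketch consistent with the definitions above rather than the patent's exact expression.

```latex
\mathcal{L}_{\mathrm{PP}} \;=\; -\log h_{\mathrm{PP}}\!\left(R_{1+i}\right) \;-\; \log\!\left(1 - h_{\mathrm{PP}}\!\left(R_{m}\right)\right)
```

The first term rewards a high boundary probability for the pseudo-boundary shot; the second rewards a low boundary probability for the randomly sampled non-boundary shot.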

[0057] Furthermore, step S4 specifically includes the following steps:

[0058] The interactive attention module, visual context perception module and auditory context perception module trained in the aforementioned shot representation learning network are transferred to the scene segmentation learning network;

[0059] A video segment to be segmented is input, and feature extraction is performed on it to obtain visual and auditory features. These features are processed by the interactive attention module, the visual context perception module and the auditory context perception module to output the visual shot representation and the auditory shot representation;

[0060] The visual and auditory shot representations are fused by the feature fusion module, which outputs the fused features [O_1, O_2, ..., O_n];

[0061] The final result is obtained through softmax classification.

[0062] Furthermore, the scene segmentation learning network is trained with a classification loss and a visual discrimination loss, namely:

[0063]

[0064] Where the two coefficients are hyperparameters; the two loss terms are the classification loss and the visual discrimination loss, respectively; y_i denotes the ground-truth scene label corresponding to shot s_i;

[0065] The classification loss minimizes the binary cross-entropy loss between the prediction obtained from the fused features and the ground-truth scene label, while the visual discrimination loss minimizes the binary cross-entropy loss between the prediction obtained from the visual shot representation and the ground-truth scene label, i.e.:

[0066]

[0067]

[0068] Where h_c maps the fused features to a binary probability distribution; h_v maps the visual shot representation to a binary probability distribution.
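The second-stage objective can be sketched as a weighted sum of two binary cross-entropy terms. This is an assumption about the form of the missing formula: the coefficient names `alpha` and `beta`, their default values, and the averaging over shots are all hypothetical. The probabilities stand in for the outputs of h_c and h_v.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_bce(probs, labels):
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)


def total_loss(fused_probs, visual_probs, labels, alpha=1.0, beta=0.5):
    """alpha * classification loss + beta * visual discrimination loss."""
    l_c = mean_bce(fused_probs, labels)   # from fused features (h_c)
    l_v = mean_bce(visual_probs, labels)  # from visual shot representation (h_v)
    return alpha * l_c + beta * l_v


labels = [0, 0, 1, 0]                # ground-truth boundary labels y_i
fused_probs = [0.1, 0.2, 0.9, 0.1]   # illustrative boundary probabilities
visual_probs = [0.3, 0.3, 0.6, 0.2]
loss = total_loss(fused_probs, visual_probs, labels)
```

Setting `beta` to zero recovers plain supervised classification on the fused features, which makes the contribution of the visual term easy to ablate.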

[0069] Compared with the prior art, the present invention has the following beneficial effects:

[0070] This invention proposes a two-stage scene segmentation method based on multimodal semantic interaction. By introducing an interactive attention module, it can fully explore the connections within and between visual and auditory modalities, obtaining richer semantic information. The video scene segmentation task is divided into two stages: shot representation learning and scene segmentation learning. In the first stage, multiple auxiliary tasks are designed to perform self-supervised learning using the scene structure characteristics of the video itself. This can utilize a large amount of unlabeled video data to achieve more reliable shot representation learning and improve the model's generalization ability. In the second stage, the audiovisual representations of the shots are fused to perform supervised scene segmentation learning. The visual discrimination loss can effectively adjust the influence of visual and auditory information, thereby achieving more accurate video scene segmentation with high computational efficiency.

Attached Figure Description

[0071] Figure 1 is a flowchart of the two-stage scene segmentation method based on multimodal semantic interaction;

[0072] Figure 2 is a structural diagram of the interactive attention module of the present invention;

[0073] Figure 3 is a schematic diagram of the cross-modal attention block of the present invention.

Detailed Implementation

[0074] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.

[0075] To address the problem that existing video scene segmentation techniques do not fully consider the complementarity and mutual assistance between different modal information in a video, resulting in insufficient segmentation accuracy, this invention proposes a multimodal semantic interaction video scene segmentation method.

[0076] The overall process of this invention is shown in Figure 1 and includes the following steps:

[0077] S1: Multimodal deep feature extraction

[0078] For the input video V, first use an existing shot segmentation tool to segment it into n consecutive shot sequences [s_1, s_2, ..., s_n]. Then, for each shot, visual features are extracted using a Bassl model pre-trained on the MovieNet dataset.

[0079] For auditory information, a NaverNet model pre-trained on the AVA-ActiveSpeaker dataset was used to separate speech and background sound. Then, speech features and background sound features were obtained separately through short-time Fourier transform at a sampling rate of 16 kHz and a windowed signal length of 512. These features were then concatenated to obtain the auditory features.
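The short-time Fourier transform step can be illustrated with a dependency-free sketch. The 16 kHz sampling rate and the 512-sample window come from the paragraph above; the Hann window, the hop of 256 samples, the use of magnitudes, and concatenation along the frequency axis are assumptions. The speech/background separation itself is not shown, so two synthetic tones stand in for the separated signals.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz, as stated in the text


def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude STFT: one list of n_fft // 2 + 1 bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    # Precomputed complex exponentials exp(-2*pi*j*m / n_fft).
    twiddle = [cmath.exp(-2j * math.pi * m / n_fft) for m in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(seg[n] * twiddle[(k * n) % n_fft] for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def tone(freq_hz, n_samples):
    """Synthetic sine wave used here in place of real separated audio."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n in range(n_samples)]


speech = tone(1000.0, 1024)      # stand-in for the separated speech track
background = tone(250.0, 1024)   # stand-in for the separated background track

speech_spec = stft_magnitude(speech)
background_spec = stft_magnitude(background)

# Concatenate speech and background features frame by frame.
auditory = [s + b for s, b in zip(speech_spec, background_spec)]
```

With a 512-point transform at 16 kHz, each bin spans 31.25 Hz, so the 1 kHz stand-in tone peaks at bin 32.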

[0080] S2: Shot Representation Learning Stage: Construct a shot representation learning network and utilize the structural characteristics of the video and the semantic interaction characteristics between multiple modal information to design auxiliary tasks. Train the network in a self-supervised manner on the MovieNet dataset without real labels to learn visual and auditory shot representations suitable for scene segmentation.

[0081] The shot representation learning network includes an interactive attention module, a visual context perception module, and an auditory context perception module;

[0082] The extracted visual and auditory features are input into the interactive attention module, which allows one modality (visual or auditory) to aggregate useful information from all relevant fragments of the other modality. Its structure is shown in Figure 2.

[0083] On the visual side, the auditory and visual features are first mapped to the same dimension T × d_m through a linear transformation. Self-attention is applied to the transformed auditory features to model the temporal relationships between shots within the modality. The self-attention output is then combined with the linearly transformed auditory features through a residual connection to obtain the auditory features, i.e.:

[0084]

[0085] Where ω denotes the parameters that the network needs to learn, i denotes the i-th shot, and the operator applied to the transformed features is self-attention.

[0086] The obtained auditory features and the linearly transformed visual features are then input into a cross-modal attention block to model the interaction between visual and auditory features. The structure of the cross-modal attention block is shown in Figure 3.

[0087] Specifically, the visual features undergo a linear transformation to obtain a query feature vector Q; the input visual and auditory features are concatenated, and the concatenated features pass through two linear transformations to obtain the key feature vector K and the value feature vector V. The query feature vector Q is multiplied by the key feature vector K to obtain a feature map representing the correlation between Q and K. Finally, the feature map is multiplied by the value feature vector V to obtain the cross-modal attention output, i.e.:

[0088]

[0089] Where ω_q, ω_k and ω_v are parameters that need to be learned.

[0090] Finally, the cross-modal attention output is combined with the linearly transformed visual features through a residual connection to obtain the enhanced visual features. The same operations are performed on the auditory side, modeling intra-modal temporal information and inter-modal interaction information to obtain the enhanced auditory features.

[0091] Based on the output enhanced visual features, the input shot sequence is divided into two semantically disjoint subsequences using a dynamic time warping algorithm, and the scene pseudo-labels corresponding to the shot sequence are calculated. Specifically, the first shot s_1 in the input sequence is considered to belong to the first subsequence S_left, and the last shot s_n to the second subsequence S_right. Then, by computing the cosine similarity between each remaining shot and these two shots, the shots that are semantically most similar to each of them are found, finally yielding two optimal subsequences S_left = {s_1, ..., s_{1+i}} and S_right = {s_{i+2}, ..., s_n}, where s_{1+i} is the computed boundary shot; the pseudo-label of this shot is set to 1, and the rest are set to 0.

[0092] The enhanced visual features and the enhanced auditory features are input into the visual context perception module and the auditory context perception module, respectively. Both adopt the classic transformer structure to model the contextual information of the shot sequence. Then, based on the pseudo-labels, they are trained through auxiliary tasks to learn the optimal visual and auditory shot representations.

[0093] Three auxiliary tasks are designed: shot scene matching, context grouping matching, and pseudo-boundary prediction. Shot scene matching encourages the model to maximize the similarity of shots within a scene while minimizing the similarity between different scenes. S_left and S_right are treated as pseudo-scenes, and the model is trained with the InfoNCE loss computed between the pseudo-boundary feature f_{1+i} and the two subsequences, namely:

[0094]

[0095] Where h_SSM denotes a linear transformation; r(S) is a scene-level representation obtained by averaging the shot features in the subsequence.

[0096] Context grouping matching constructs triples (s_c, s_pos, s_neg) around the center shot s_c of the input sequence: s_pos is a shot randomly sampled from the same subsequence as s_c and serves as the positive sample, while s_neg is a shot sampled from the other subsequence and serves as the negative sample. A binary cross-entropy loss is used to learn whether two given shots belong to the same scene, i.e.:

[0097]

[0098] Where h_CGM takes two shots as input and predicts a matching score; R_c, R_pos and R_neg are the contextual shot features of s_c, s_pos and s_neg, respectively, computed by the context perception module.

[0099] Pseudo-boundary prediction enables the model to identify moments of semantic change. It is implemented by computing the binary cross-entropy loss between the pseudo-boundary shot and a randomly sampled non-boundary shot, i.e.:

[0100]

[0101] Where R_{1+i} and R_m denote the contextual representations of the pseudo-boundary shot s_{1+i} and the randomly sampled shot s_m, respectively; h_PP maps a shot representation to a binary probability distribution.

[0102] S3: Scene Segmentation Learning Stage

[0103] The interactive attention module, visual context perception module and auditory context perception module trained in the first stage are transferred to the scene segmentation learning network in the second stage. As shown in Figure 1, the input visual and auditory features are processed by the interactive attention module, the visual context perception module and the auditory context perception module, and the visual shot representation and the auditory shot representation are output. The visual and auditory shot representations are then fused by a feature fusion module, which uses the same computational structure as the cross-modal attention block and outputs the fused features [O_1, O_2, ..., O_n]. Finally, the final result is obtained through softmax classification.
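How a per-shot softmax classification turns into scene segments can be sketched as follows. The conventions here are assumptions made for illustration: each shot has two logits ordered as (non-boundary, boundary), a shot is a boundary when its boundary probability exceeds 0.5, and a boundary shot is the last shot of its scene (matching the pseudo-label convention, where the boundary shot closes S_left). The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_scenes(shot_logits, threshold=0.5):
    """Group shot indices into scenes from per-shot two-class logits.

    shot_logits: one [non_boundary_logit, boundary_logit] pair per shot.
    Returns a list of scenes, each a list of consecutive shot indices.
    """
    scenes, current = [], []
    last = len(shot_logits) - 1
    for idx, logits in enumerate(shot_logits):
        current.append(idx)
        p_boundary = softmax(logits)[1]
        if p_boundary > threshold and idx < last:
            scenes.append(current)  # boundary shot closes the current scene
            current = []
    if current:
        scenes.append(current)
    return scenes


# Illustrative logits for six shots; shots 2 and 4 look like boundaries.
logits = [[2.0, -2.0], [2.0, -2.0], [-3.0, 3.0],
          [1.0, -1.0], [-2.0, 2.0], [0.5, -0.5]]
print(segment_scenes(logits))  # [[0, 1, 2], [3, 4], [5]]
```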

[0104] Furthermore, during the scene segmentation learning stage, a classification loss is used for supervised training on the MovieScenes dataset with ground-truth scene labels. In addition, since video scene segmentation is highly correlated with vision, this invention also designs a visual discrimination loss to enhance the representation of visual features and improve the accuracy of scene segmentation, expressed as:

[0105]

[0106] Where the two coefficients are hyperparameters; the two loss terms are the classification loss and the visual discrimination loss, respectively; y_i denotes the ground-truth scene label corresponding to shot s_i.

[0107] The classification loss minimizes the binary cross-entropy loss between the prediction obtained from the fused features and the ground-truth scene label, while the visual discrimination loss minimizes the binary cross-entropy loss between the prediction obtained from the visual shot representation and the ground-truth scene label, i.e.:

[0108]

[0109]

[0110] Where h_c maps the fused features to a binary probability distribution; h_v maps the visual shot representation to a binary probability distribution.

[0111] This invention proposes a two-stage scene segmentation method based on multimodal semantic interaction. By introducing an interactive attention module, it can fully explore the connections within and between visual and auditory modalities, obtaining richer semantic information. The video scene segmentation task is divided into two stages: shot representation learning and scene segmentation learning. In the first stage, multiple auxiliary tasks are designed to perform self-supervised learning using the scene structure characteristics of the video itself. This can utilize a large amount of unlabeled video data to achieve more reliable shot representation learning and improve the model's generalization ability. In the second stage, the audiovisual representations of the shots are fused to perform supervised scene segmentation learning. The visual discrimination loss can effectively adjust the influence of visual and auditory information, thereby achieving more accurate video scene segmentation with high computational efficiency.

[0112] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.

Claims

1. A video scene segmentation method for multimodal semantic interaction, characterized in that it comprises the following steps:
S1. inputting a video clip;
S2. multimodal feature extraction: segmenting the input video clip into a sequence of consecutive shots, and extracting the visual features and auditory features of each shot;
S3. shot representation learning: based on the visual and auditory features, constructing a shot representation learning network, and using the structural characteristics of video and the semantic interaction between the modalities to design auxiliary tasks for learning visual and auditory shot representations suitable for scene segmentation;
S4. scene segmentation learning: constructing a scene segmentation learning network, transferring the parameters of the trained shot representation learning network to the scene segmentation learning network, fusing the visual shot representation and the auditory shot representation to obtain the final shot representation, and achieving scene segmentation through softmax-layer classification;
wherein the shot representation learning network comprises an interactive attention module, a visual context perception module, and an auditory context perception module; the interactive attention module enables one modality to aggregate useful information from all relevant segments of the other modality, the modalities being the visual modality and the auditory modality; the visual context perception module and the auditory context perception module both adopt a Transformer structure to model the context information of the shot sequence;
step S3 specifically comprises the following steps:
the visual and auditory features are input into the interactive attention module, and a linear transformation maps the auditory features and the visual features to the same dimension;
on the visual side, the linearly transformed auditory features are input into self-attention, and on the auditory side, the linearly transformed visual features are input into self-attention, thereby modeling the temporal relationship between shots within a modality;
auditory features are obtained by a residual connection between the self-attention output for the auditory features and the linearly transformed auditory features, where, in the corresponding formula, the transformation weights are parameters that the shot representation learning network needs to learn, the index denotes the i-th shot, and the attention operator denotes a self-attention operation;
the auditory features so obtained and the linearly transformed visual features are input into a cross-modal attention block to model the interaction between the visual and auditory features, yielding enhanced visual features;
similarly, visual features are obtained by a residual connection between the self-attention output for the visual features and the linearly transformed visual features, and the visual features so obtained and the linearly transformed auditory features are input into a cross-modal attention block to model the interaction between the auditory and visual features, yielding enhanced auditory features;
based on the enhanced visual features of the shots in the sequence, a dynamic time warping algorithm divides the input shot sequence into two semantically disjoint subsequences, and the scene pseudo-labels corresponding to the shot sequence are calculated;
the enhanced visual features and the auditory features of the shots in the sequence are input into the visual context perception module and the auditory context perception module, respectively, and the best visual and auditory shot representations are learned through auxiliary-task training based on the scene pseudo-labels.
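The projection, self-attention, and residual step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the toy dimensions, the random weights, and the use of identity query/key/value maps with a scaled softmax inside the self-attention are all assumptions made for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention over shots (identity Q/K/V maps, an assumption)."""
    d = len(x[0])
    scores = matmul(x, [list(col) for col in zip(*x)])  # shot-to-shot similarities
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, x)

def project_and_attend(features, weight):
    """Map features to the shared dimension, then add the self-attention output as a residual."""
    projected = matmul(features, weight)
    attended = self_attention(projected)
    return [[p + a for p, a in zip(pr, ar)] for pr, ar in zip(projected, attended)]

# Hypothetical toy sizes: 4 shots, 6-d auditory and 8-d visual inputs, 5-d shared space.
random.seed(0)
n_shots, d_audio, d_visual, d_shared = 4, 6, 8, 5
audio = [[random.gauss(0, 1) for _ in range(d_audio)] for _ in range(n_shots)]
visual = [[random.gauss(0, 1) for _ in range(d_visual)] for _ in range(n_shots)]
w_audio = [[random.gauss(0, 0.1) for _ in range(d_shared)] for _ in range(d_audio)]
w_visual = [[random.gauss(0, 0.1) for _ in range(d_shared)] for _ in range(d_visual)]

audio_out = project_and_attend(audio, w_audio)
visual_out = project_and_attend(visual, w_visual)
```

After this step both modalities share one dimension per shot, which is what allows them to be passed to a cross-modal attention block.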

2. The video scene segmentation method for multimodal semantic interaction according to claim 1, characterized in that, in step S3, the shot representation learning network is trained in a self-supervised manner on the MovieNet dataset without ground-truth labels; and in step S4, the scene segmentation learning stage performs supervised training with a classification loss on the MovieScenes dataset with ground-truth scene labels.

3. The video scene segmentation method for multimodal semantic interaction according to claim 2, characterized in that step S2 specifically comprises the following steps: for the input video V, a shot segmentation tool is used to divide it into a sequence of n consecutive shots, n > 0; for each shot, visual features are extracted using a BaSSL model pre-trained on the MovieNet dataset, giving the visual features of the n shots; for each shot, a NaverNet model pre-trained on the AVA-ActiveSpeaker dataset is used to separate speech from background sound, speech features and background sound features are obtained separately through a short-time Fourier transform, and the speech features and background sound features are then concatenated to obtain the auditory features of the n shots.
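The short-time Fourier transform and concatenation step of claim 3 can be sketched as follows. This is only an illustration under stated assumptions: the speech and background waveforms are taken as already separated (the NaverNet separation itself is not shown), the frame length, hop size, and Hann window are arbitrary choices, and averaging the spectra over time to get one vector per shot is an assumption not stated in the claim.

```python
import cmath
import math

def stft_magnitude(signal, frame_len=16, hop=8):
    """Return per-frame magnitude spectra (bins 0..frame_len//2) using a Hann window."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames

def mean_spectrum(frames):
    """Average the magnitude spectra over time (an assumed pooling choice)."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def shot_audio_feature(speech, background):
    """Concatenate the speech part and the background part into one auditory feature."""
    return mean_spectrum(stft_magnitude(speech)) + mean_spectrum(stft_magnitude(background))

# Hypothetical toy waveforms: pure tones centred on frequency bins 2 and 5.
speech = [math.sin(2 * math.pi * 2 * n / 16) for n in range(64)]
background = [math.sin(2 * math.pi * 5 * n / 16) for n in range(64)]
feature = shot_audio_feature(speech, background)
```

The first half of `feature` describes the speech track and the second half the background track, so the two sound sources stay distinguishable after concatenation.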

4. The video scene segmentation method for multimodal semantic interaction according to claim 1, characterized in that the cross-modal attention block performs the following steps: on the visual side, the visual features are linearly transformed to obtain a query feature vector Q; the input visual features and auditory features are concatenated, and the key feature vector K and the value feature vector V are obtained from the concatenated features through two linear transformations; the query feature vector Q is multiplied by the key feature vector K to obtain a feature map representing the correlation between Q and K; finally, the feature map is multiplied by the value feature vector V to obtain the cross-modal attention output, where, in the corresponding formula, the transformation weights are parameters that the shot representation learning network needs to learn; a residual connection between the cross-modal attention output and the linearly transformed visual features then gives the enhanced visual features; on the auditory side, the same operations are performed on the input auditory and visual features, modeling intra-modal temporal information and inter-modal interaction information, to obtain the enhanced auditory features.
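The cross-modal attention block of claim 4 can be sketched for the visual side as below. This is a hedged illustration, not the patented implementation: the function and variable names are hypothetical, the concatenation is assumed to be along the feature dimension of each shot, and the scaled softmax applied to the Q-K feature map follows standard attention practice rather than anything stated in the claim.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(visual, audio, w_q, w_k, w_v):
    """Enhance visual features: Q from visual, K and V from the visual-audio concatenation."""
    q = matmul(visual, w_q)
    joint = [v + a for v, a in zip(visual, audio)]  # per-shot feature concatenation (assumed)
    k = matmul(joint, w_k)
    v = matmul(joint, w_v)
    d = len(q[0])
    scores = matmul(q, transpose(k))                # Q-K correlation feature map
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)                   # cross-modal attention output
    # Residual connection with the (already linearly transformed) visual features.
    return [[x + y for x, y in zip(vr, ar)] for vr, ar in zip(visual, attended)]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy inputs: 3 shots, both modalities already mapped to d = 4.
random.seed(0)
n_shots, d = 3, 4
visual = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_shots)]
audio = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_shots)]
w_q, w_k, w_v = rand_matrix(d, d), rand_matrix(2 * d, d), rand_matrix(2 * d, d)
enhanced_visual = cross_modal_attention(visual, audio, w_q, w_k, w_v)
```

Calling the same function with the two arguments swapped, `cross_modal_attention(audio, visual, ...)` with its own weights, would give the auditory-side counterpart.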

5. The video scene segmentation method for multimodal semantic interaction according to claim 1, characterized in that the scene pseudo-labels corresponding to the shot sequence are calculated as follows: the first shot in the input sequence is taken to belong to the first subsequence, and the last shot is taken to belong to the second subsequence; by calculating the cosine similarity between each remaining shot and the two subsequences, the shots that are semantically most similar to each subsequence are found, finally yielding two optimal subsequences separated at the calculated boundary shot; the pseudo-label of the boundary shot is set to 1, and the pseudo-labels of the other shots are set to 0.
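The pseudo-label computation of claim 5 can be illustrated with a simplified sketch. It swaps the dynamic time warping search for an exhaustive scan over candidate split points, which is practical only for short sequences; the scoring rule (summed cosine similarity of each shot to the mean of its own subsequence) and the convention that the boundary shot is the last shot of the first subsequence are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def pseudo_labels(shots):
    """Return (boundary index, 0/1 labels) for the most coherent two-way split."""
    n = len(shots)
    best_score, best_boundary = -math.inf, 0
    # b ranges so that the first shot is always left and the last shot always right.
    for b in range(n - 1):
        left, right = shots[: b + 1], shots[b + 1:]
        left_mean, right_mean = mean_vector(left), mean_vector(right)
        score = (sum(cosine(s, left_mean) for s in left)
                 + sum(cosine(s, right_mean) for s in right))
        if score > best_score:
            best_score, best_boundary = score, b
    labels = [0] * n
    labels[best_boundary] = 1  # only the boundary shot gets pseudo-label 1
    return best_boundary, labels

# Hypothetical toy features: three shots of one scene followed by three of another.
shots = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0], [0.0, 0.9]]
boundary, labels = pseudo_labels(shots)
```

On the toy input the split falls between the third and fourth shots, so the third shot (index 2) is marked as the pseudo-boundary.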

6. The video scene segmentation method for multimodal semantic interaction according to claim 5, characterized in that the auxiliary tasks include scene matching, context grouping matching, and pseudo-boundary prediction; scene matching encourages the model to maximize the similarity of shots within a scene while minimizing the similarity between different scenes: the two subsequences are treated as pseudo-scenes, and the model is trained by calculating the InfoNCE loss between the pseudo-boundary feature and the two subsequences, where, in the corresponding formula, the projection term represents a linear transformation and the scene-level representation is obtained by averaging the shot features in a subsequence; context grouping matching constructs triples based on the center shot of the input sequence: a shot randomly sampled from the same subsequence as the center shot is used as the positive sample, a shot sampled from the other subsequence is used as the negative sample, and a binary cross-entropy loss is used to learn whether two given shots belong to the same scene, where, in the corresponding formula, the matching function takes two shots as input and predicts a matching score, and the shot terms are the contextual shot features of the corresponding shots calculated by the context perception module; pseudo-boundary prediction is used to identify semantic changes at specific moments and is achieved by calculating the binary cross-entropy loss between the pseudo-boundary shot and a randomly sampled non-boundary shot, where, in the corresponding formula, the shot terms are the contextual shot representations of the pseudo-boundary shot and of the randomly sampled shot, and the mapping function maps a shot representation to a binary probability distribution.
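The loss types named in claim 6 can be sketched generically. Because the claim's formulas are not reproduced here, this is an illustration of InfoNCE and binary cross-entropy in their standard forms, not of the exact patented losses: the choice of positive and negative pairs, the temperature, and all numeric values are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE: negative log of the positive pair's share of all exponentiated similarities."""
    pos = math.exp(dot(query, positive) / temperature)
    neg = sum(math.exp(dot(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

# Scene matching (assumed pairing): the pseudo-boundary shot should match the
# scene-level mean of its own pseudo-scene rather than that of the other one.
boundary_shot = [1.0, 0.0]
left_scene = [0.9, 0.1]    # hypothetical mean of the first subsequence
right_scene = [0.1, 0.9]   # hypothetical mean of the second subsequence
scene_matching_loss = info_nce(boundary_shot, left_scene, [right_scene])

# Context grouping matching: hypothetical matching scores for the
# (center, positive) pair and the (center, negative) pair.
grouping_loss = bce(0.8, 1) + bce(0.3, 0)

# Pseudo-boundary prediction: hypothetical boundary probabilities for the
# pseudo-boundary shot (label 1) and a randomly sampled non-boundary shot (label 0).
boundary_loss = bce(0.7, 1) + bce(0.2, 0)

auxiliary_loss = scene_matching_loss + grouping_loss + boundary_loss
```

Summing the three terms with equal weight is itself an assumption; the claim does not state how the auxiliary losses are combined.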

7. The video scene segmentation method for multimodal semantic interaction according to claim 1, characterized in that step S4 specifically comprises the following steps: the interactive attention module, the visual context perception module, and the auditory context perception module trained in the shot representation learning network are transferred to the scene segmentation learning network; a video segment to be segmented is input, visual and auditory features are extracted from the video segment, and the visual and auditory features are processed by the interactive attention module, the visual context perception module, and the auditory context perception module, respectively, to output the visual shot representations and the auditory shot representations of the shots in the sequence; the visual and auditory shot representations are fused by a feature fusion module, which outputs the fused features of the shots in the sequence; and the final result is obtained through softmax classification.
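The fusion and softmax classification at the end of claim 7 can be sketched as follows. The claim does not specify the feature fusion module, so concatenation followed by a single linear layer is used here purely as a stand-in; the weights and the toy representations are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def classify_shots(visual_repr, audio_repr, weight, bias):
    """Fuse per-shot representations and classify each shot as boundary (1) or not (0)."""
    results = []
    for v, a in zip(visual_repr, audio_repr):
        fused = v + a  # concatenation stands in for the unspecified fusion module
        logits = [sum(w * x for w, x in zip(w_row, fused)) + b
                  for w_row, b in zip(weight, bias)]
        probs = softmax(logits)
        results.append((probs, int(probs[1] > probs[0])))
    return results

# Hypothetical 1-d representations for two shots and a hand-picked 2-class linear layer.
visual_repr = [[1.0], [-1.0]]
audio_repr = [[0.5], [-0.5]]
weight = [[-1.0, -1.0], [1.0, 1.0]]  # row 0: non-boundary logit, row 1: boundary logit
bias = [0.0, 0.0]
predictions = classify_shots(visual_repr, audio_repr, weight, bias)
```

Each entry of `predictions` holds the two class probabilities and the predicted label, with label 1 marking a shot that ends a scene.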

8. The video scene segmentation method for multimodal semantic interaction according to claim 1, characterized in that the scene segmentation learning network is trained using a classification loss and a visual discrimination loss, where, in the corresponding formula, the two coefficients are hyperparameters, one term denotes the classification loss, the other term denotes the visual discrimination loss, and the label term denotes the ground-truth scene label of the corresponding shot; the classification loss minimizes the binary cross-entropy between the prediction obtained from the fused features and the ground-truth scene label, while the visual discrimination loss minimizes the binary cross-entropy between the prediction obtained from the visual shot representation and the ground-truth scene label, where, in the corresponding formulas, one mapping function maps the fused features to a binary probability distribution and the other maps the visual shot representation to a binary probability distribution.
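The training objective of claim 8 can be sketched as a combination of two binary cross-entropy terms. The names `alpha` and `beta`, their default values, and their use as weights on the two terms are assumptions, since the claim only states that two hyperparameters appear in the formula; averaging over shots is likewise an assumed convention.

```python
import math

def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

def total_loss(fused_probs, visual_probs, labels, alpha=1.0, beta=0.5):
    """Weighted sum of the classification loss and the visual discrimination loss."""
    n = len(labels)
    # Classification loss: predictions come from the fused features.
    cls = sum(bce(p, y) for p, y in zip(fused_probs, labels)) / n
    # Visual discrimination loss: predictions come from the visual shot representation alone.
    vis = sum(bce(p, y) for p, y in zip(visual_probs, labels)) / n
    return alpha * cls + beta * vis, cls, vis

# Hypothetical boundary probabilities for four shots and their ground-truth labels.
labels = [0, 0, 1, 0]
fused_probs = [0.1, 0.2, 0.9, 0.1]
visual_probs = [0.2, 0.3, 0.7, 0.2]
loss, cls_loss, vis_loss = total_loss(fused_probs, visual_probs, labels)
```

Keeping a separate loss on the visual branch gives that branch its own supervision signal in addition to the signal it receives through the fused prediction.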