A method and system for segmenting referential video targets based on wavelet correction learning
The method corrects the word features against the video content, perceives the target entity in wavelet space using a Transformer encoder and wavelet transform, and matches the predicted trajectories to the real target sequence with the Hungarian algorithm. This approach addresses the cross-modal ambiguity and spatiotemporal fragmentation that arise in the segmentation of referential video targets, achieving higher entity recognition accuracy and segmentation completeness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TONGJI UNIV
- Filing Date
- 2025-07-31
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies suffer from cross-modal ambiguity and spatiotemporal fragmentation in the segmentation of referential video targets, resulting in inaccurate target entity recognition and incomplete segmentation.
A wavelet-based correction learning method is adopted to extract video and text features, use Transformer encoder and wavelet transform to perceive target entities in wavelet space, and combine Hungarian algorithm for target segmentation to achieve unified understanding of cross-modal information and global spatiotemporal modeling.
It significantly improves the accuracy of entity recognition and the completeness of segmentation details, and solves the problems of cross-modal understanding ambiguity and spatiotemporal perception fragmentation in existing technologies.
Figure CN120931929B_ABST
Description
Technical Field
[0001] This invention relates to the field of video target segmentation technology, and in particular to a video target segmentation method and system based on wavelet correction learning.
Background Technology
[0002] Referential video target segmentation, a task that aims to segment specific entities in videos based on natural language descriptions, is widely used in cross-media information interaction tasks. Significant progress has been made in this area because the Transformer architecture can model the segmentation process end-to-end. However, the following challenges remain:
[0003] (1) The language description is usually inconsistent with the visual content of the video. Existing technologies often rely on language features that are not aligned with the actual video for referential segmentation, which leads to cross-modal ambiguity and inaccurate target entity recognition.
[0004] (2) Existing technologies typically perform cross-modal perception directly in pixel space. This strategy suffers from spatiotemporal perception fragmentation, leading to incomplete segmentation of target entities across frames. For example, patent application CN117788814A discloses a reference image segmentation method, apparatus, and storage medium based on progressive visual features. Although this method employs a progressive fusion strategy of multi-scale features to achieve target segmentation, it still operates in pixel space, so its spatiotemporal context modeling is insufficient and the segmentation of referential entities remains incomplete.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of the existing technology by providing a method and system for segmenting referential video targets based on wavelet correction learning, which significantly improves the accuracy of entity recognition and solves the problem of incomplete segmentation details.
[0006] The objective of this invention can be achieved through the following technical solutions:
[0007] A method for segmenting referential video targets based on wavelet correction learning includes the following steps:
[0008] Acquire video data and extract high-dimensional and low-dimensional visual features from the video data;
[0009] Acquire text data and extract initial word features and sentence features from the text data;
[0010] The initial word features are corrected to obtain corrected word features, and the semantic entity information of the given visual features is perceived by the Transformer encoder based on the corrected word features.
[0011] Extract the high-dimensional wavelet features and low-dimensional wavelet features of the high-dimensional visual features, refine the high-dimensional wavelet features and low-dimensional wavelet features, and obtain the pixel space mapping based on the refined high-dimensional wavelet features and low-dimensional wavelet features.
[0012] The semantic entity information and pixel space mapping are fused and concatenated into enhanced high-dimensional visual features. Based on the enhanced high-dimensional visual features and instance queries, the predicted instance is obtained through the Transformer decoder.
[0013] A prediction head is generated based on the prediction instance, and the target entity is segmented using the prediction head to obtain the prediction trajectory;
[0014] The predicted trajectory and the real target sequence are matched using the Hungarian algorithm to obtain the target segmentation result.
[0015] Furthermore, high-dimensional and low-dimensional visual features of the video data are extracted using a pre-trained Video-Swin-T model, and initial word and sentence features are output using a pre-trained RoBERTa-base model. The Video-Swin-T model consists of 5 modules, where the outputs of the last 3 modules are considered as high-dimensional visual features, and the outputs of the first 2 modules are considered as low-dimensional visual features.
[0016] Further, the specific steps of correcting the initial word features to obtain corrected word features, and then using a Transformer encoder to perceive the semantic entity information of a given visual feature based on the corrected word features, include:
[0017] Visual context information is obtained based on the high-dimensional visual features;
[0018] Obtain text context features based on the initial word features;
[0019] The correlation matrix between the visual context information and the text context features is obtained through analysis;
[0020] The initial filter kernel is updated by performing a Hadamard operation on the correlation matrix after enhanced cross-modal interaction, resulting in the updated filter kernel.
[0021] The corrected word features are obtained based on the initial word features and the updated filter kernel;
[0022] The Transformer encoder perceives semantic entity information of a given visual feature based on the corrected word features.
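[0023] Furthermore, the semantic entity information is as follows:
[0024] $E = \mathrm{Softmax}\left(\frac{Q_v \otimes K_w^{\top}}{\sqrt{C_{out}}}\right) \otimes V_w$
[0025] $Q_v = q_v(F_v), \quad K_w = k_w(\hat{F}_w), \quad V_w = v_w(\hat{F}_w)$
[0026] In the formula, $E$ is the semantic entity information, Softmax is a network layer, $Q_v$ is the result of linear mapping of the high-dimensional visual features, $\otimes$ is matrix multiplication, $K_w$ and $V_w$ are the results of linear mapping of the word features, $C_{out}$ is the output dimension, $F_v$ is the high-dimensional visual features, $q_v$, $k_w$ and $v_w$ are linear mapping layers, and $\hat{F}_w$ is the corrected word features.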
[0027] Furthermore, the corrected word features are as follows:
[0028] $\hat{F}_w = \Omega_u \ast F_w$
[0029] $\Omega_u = \Omega_i \odot R_e$
[0030] $R_e = \mathrm{Conv}_{k \times k}(\mathrm{ReLU}(\mathrm{BN}(R)))$
[0031] $R = V_c \otimes T_c$
[0032] In the formula, $\hat{F}_w$ is the corrected word features, $\Omega_u$ is the updated filter kernel, $\ast$ is the filtering operation, $F_w$ is the initial word features, $\Omega_i$ is the initial filter kernel, $\odot$ is the Hadamard operation, $R_e$ is the relation enhancement matrix, $\mathrm{Conv}_{k \times k}$ is a k×k convolution, ReLU is the activation layer, BN is the batch normalization layer, $R$ is the correlation matrix, $V_c$ is the visual context information, $T_c$ is the text context features, and $\otimes$ is matrix multiplication.
[0033] Furthermore, the specific steps for extracting the high-dimensional wavelet features and low-dimensional wavelet features of the high-dimensional visual features, refining the high-dimensional wavelet features and low-dimensional wavelet features, and obtaining the pixel space mapping based on the refined high-dimensional wavelet features and low-dimensional wavelet features include:
[0034] High-dimensional and low-dimensional wavelet features of the high-dimensional visual features are extracted by continuous discrete wavelet transform.
[0035] The target entity is located by aggregating low-dimensional wavelet features and sentence features, resulting in refined low-dimensional wavelet features.
[0036] The refined low-dimensional wavelet features and the spliced high-dimensional wavelet features are subjected to a cross-modal attention operation to obtain the refined high-dimensional wavelet features.
[0037] The refined high-dimensional wavelet features and low-dimensional wavelet features are mapped back to the pixel space by inverse feature transformation, resulting in pixel space mapping.
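[0038] Furthermore, the continuous discrete wavelet transform, that is, s discrete wavelet transforms applied in succession, is:
[0039] $\left[F_{ll}, F_h^{(1)}, \dots, F_h^{(m)}\right] = \mathrm{DWT}_s(F_v)$
[0040] In the formula, $\mathrm{DWT}_s$ is a series of s discrete wavelet transform operations, $F_v$ is the high-dimensional visual features, $F_{ll}$ is the low-dimensional wavelet feature, and $F_h^{(1)}, \dots, F_h^{(m)}$ are the high-dimensional wavelet features.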
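[0041] Furthermore, the pixel space mapping is as follows:
[0042] $M = \mathrm{Sigmoid}\left(\mathrm{IDWT}_s\left(\mathrm{Concat}(\hat{F}_{ll}, \hat{F}_h)\right)\right)$, with $\hat{F}_{ll} = \mathrm{Softmax}\left(\frac{Q_{ll} \otimes K_s^{\top}}{\sqrt{C_{out}}}\right) \otimes V_s$ and $\hat{F}_h = \mathrm{Softmax}\left(\frac{Q_h \otimes K_{ll}^{\top}}{\sqrt{C_{out}}}\right) \otimes V_{ll}$
[0043] In the formula, $M$ is the pixel space mapping, Sigmoid is a non-linear activation layer, $\mathrm{IDWT}_s$ is the inverse wavelet transform operation, Concat is the concatenation operation, $\hat{F}_{ll}$ is the refined low-dimensional wavelet features, Softmax is a network layer, $\hat{F}_h$ is the refined result of the spliced high-dimensional wavelet features, $Q_{ll}$ is the result of mapping the low-dimensional wavelet features, $\otimes$ is matrix multiplication, $K_s$ and $V_s$ are the results of mapping the sentence features, $C_{out}$ is the output dimension, $Q_h$ is the result of mapping the high-dimensional wavelet features, and $K_{ll}$ and $V_{ll}$ are the results of mapping the refined low-dimensional wavelet features.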
[0044] Furthermore, the prediction head includes a location bounding box head, a masking head, and a category head. The location bounding box head is implemented using a 3-layer feedforward network to predict the location of the target entity. The masking head generates parameters for the instance kernel to predict instance-level features. The category head outputs a binary probability to indicate whether the instance is visible.
[0045] According to another aspect of the present invention, a video target segmentation system based on wavelet correction learning is provided, comprising:
[0046] The video data acquisition module is used to acquire video data and extract the high-dimensional and low-dimensional visual features of the video data.
[0047] The text data acquisition module is used to acquire text data and extract initial word features and sentence features from the text data;
[0048] The semantic entity information acquisition module is used to correct the initial word features to obtain corrected word features, and to perceive the semantic entity information of the given visual features through the Transformer encoder based on the corrected word features.
[0049] The pixel space mapping acquisition module is used to extract the high-dimensional wavelet features and low-dimensional wavelet features of the high-dimensional visual features, refine the high-dimensional wavelet features and low-dimensional wavelet features, and obtain the pixel space mapping based on the refined high-dimensional wavelet features and low-dimensional wavelet features.
[0050] The prediction instance acquisition module is used to fuse the semantic entity information and pixel space mapping, and concatenate them into enhanced high-dimensional visual features. Based on the enhanced high-dimensional visual features and instance query, the prediction instance is obtained through the Transformer decoder.
[0051] The predicted trajectory acquisition module is used to generate a prediction head based on the prediction instance, and to segment the target entity using the prediction head to obtain the predicted trajectory;
[0052] The target segmentation result acquisition module is used to match the predicted trajectory with the real target sequence using the Hungarian algorithm to obtain the target segmentation result.
[0053] Compared with the prior art, the present invention has the following beneficial effects:
[0054] 1. This invention obtains visual context information based on high-dimensional visual features, obtains text context features based on initial word features, analyzes the correlation matrix between visual context information and text context features, updates the initial filter kernel by performing a Hadamard operation on the correlation matrix after enhanced cross-modal interaction, and obtains the updated filter kernel. Based on the initial word features and the updated filter kernel, the corrected word features are obtained. The Transformer encoder perceives the semantic entity information of the given visual features based on the corrected word features, making it accurately match the video content, unifying cross-modal understanding, and significantly improving the accuracy of entity recognition.
[0055] 2. This invention utilizes wavelet transform to perceive target entities in wavelet space. By extracting high-dimensional and low-dimensional wavelet features from high-dimensional visual features, the high-dimensional and low-dimensional wavelet features are refined. Pixel space mapping is obtained based on the refined high-dimensional and low-dimensional wavelet features, effectively modeling global spatiotemporal integrity and solving the problem of incomplete segmentation details.
Attached Figure Description
[0056] Figure 1 is a flowchart of the method for segmenting referential video targets based on wavelet correction learning proposed in this invention;
[0057] Figure 2 is a schematic diagram of the structure of the video target segmentation system based on wavelet correction learning proposed in this invention;
[0058] Figure 3 is a comparison diagram of the segmentation effect in the embodiments.
Detailed Implementation
[0059] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0060] Example 1
[0061] This embodiment provides a method for segmenting referential video targets based on wavelet correction learning. As shown in Figure 1, it includes the following steps:
[0062] S1. Acquire video data and extract high-dimensional and low-dimensional visual features from the video data.
[0063] Given a video, the pre-trained Video-Swin-T model extracts high-dimensional and low-dimensional visual features from the video data. The model consists of five modules: the outputs of the last three modules are regarded as the high-dimensional visual features $F_v \in \mathbb{R}^{N \times C_{out}}$, and the outputs of the first two modules are regarded as the low-dimensional visual features. Here, N is the number of spatiotemporal pixels, and $C_{out}$ is the visual feature channel dimension.
[0064] S2. Obtain text data and extract initial word features and sentence features from the text data.
[0065] The pre-trained RoBERTa-base model outputs the initial word features and the sentence features. The initial word features are represented as $F_w \in \mathbb{R}^{L \times C_{in}}$ and the sentence features are represented as $F_s$, where L is the number of words and $C_{in}$ is the initial word feature dimension.
[0066] S3. Correct the initial word features to obtain corrected word features, and use the Transformer encoder to perceive the semantic entity information of the given visual features based on the corrected word features.
[0067] The specific steps of correcting the initial word features to obtain corrected word features, and then using the Transformer encoder to perceive the semantic entity information of a given visual feature based on the corrected word features, include:
[0068] Visual context information is obtained based on the high-dimensional visual features. The visual context information is as follows:
[0069] $V_c = \mathrm{GAP}(F_v) \oplus \mathrm{GMP}(F_v)$
[0070] In the formula, $V_c$ is the visual context information, $F_v$ is the high-dimensional visual features, GAP is global average pooling, GMP is global max pooling, and $\oplus$ is the summation operation.
[0071] The text context features are obtained based on the initial word features. The text context features are as follows:
[0072] $T_c = \mathrm{GAP}(F_w) \oplus \mathrm{GMP}(F_w)$
[0073] In the formula, $T_c$ is the text context features, $F_w$ is the initial word features, GAP is global average pooling, GMP is global max pooling, and $\oplus$ is the summation operation.
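The two context formulas above share one pattern: average-pool and max-pool the features, then add the results. The following NumPy lines are a minimal sketch of that pattern. The function name `context`, the choice to pool over the token axis, and the toy inputs are assumptions made for illustration; the text does not state the pooling axis or the feature shapes.

```python
import numpy as np

def context(features: np.ndarray) -> np.ndarray:
    """Return GAP(features) + GMP(features), pooling over the token axis (axis 0)."""
    gap = features.mean(axis=0)  # global average pooling
    gmp = features.max(axis=0)   # global max pooling
    return gap + gmp             # element-wise summation

# Toy inputs (assumed shapes): 4 spatiotemporal tokens and 3 words, 2 channels each.
visual = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]])
words = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 4.0]])

visual_ctx = context(visual)  # mean [4, 3] + max [7, 6]
text_ctx = context(words)     # mean [2, 2] + max [4, 4]
```

Summing the two pooled vectors keeps both the average response and the strongest response of each channel in a single context vector.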
[0074] The visual context information and the text context features are made to interact in order to uncover the correlation between them. The resulting correlation matrix between the visual context information and the text context features is as follows:
[0075] $R = V_c \otimes T_c$
[0076] In the formula, $R$ is the correlation matrix, $V_c$ is the visual context information, $\otimes$ is matrix multiplication, and $T_c$ is the text context features.
[0077] To enhance cross-modal interaction, the correlation matrix $R$ is passed through a series of operation layers, namely batch normalization (BN), a ReLU activation layer, and a k×k convolution, as shown in the following formula:
[0078] $R_e = \mathrm{Conv}_{k \times k}(\mathrm{ReLU}(\mathrm{BN}(R)))$
[0079] In the formula, $R_e$ is the relation enhancement matrix, $\mathrm{Conv}_{k \times k}$ is a k×k convolution, ReLU is the activation layer, and BN is the batch normalization layer.
[0080] The initial filter kernel is updated by performing a Hadamard operation with the correlation matrix after enhanced cross-modal interaction, resulting in the updated filter kernel:
[0081] $\Omega_u = \Omega_i \odot R_e$
[0082] In the formula, $\Omega_u$ is the updated filter kernel, $\Omega_i$ is the initial filter kernel, $R_e$ is the relation enhancement matrix, and $\odot$ represents the Hadamard operation.
[0083] The corrected word features are obtained based on the initial word features and the updated filter kernel. The corrected word features are as follows:
[0084] $\hat{F}_w = \Omega_u \ast F_w$
[0085] In the formula, $\hat{F}_w$ is the corrected word features, $\Omega_u$ is the updated filter kernel, $\ast$ is the filtering operation, and $F_w$ is the initial word features.
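The correction chain (correlation matrix, BN, ReLU, k×k convolution, Hadamard kernel update, filtering) can be traced on small arrays. The sketch below is illustrative only and rests on several assumptions that the text does not fix: both context vectors have the same channel width, the correlation is formed as an outer product, batch normalization is replaced by a whole-matrix standardization, the convolution weights are supplied by the caller, and the filtering operation is modeled as a matrix product. All function names are hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Stand-in for BN: standardize the whole matrix (no learned scale or shift).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def conv_same(x, kernel):
    # k x k cross-correlation with zero padding (odd k); output keeps x's shape.
    k = kernel.shape[0]
    padded = np.pad(x, k // 2)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def correct_words(words, visual_ctx, text_ctx, init_kernel, conv_kernel):
    corr = np.outer(visual_ctx, text_ctx)  # correlation matrix (assumed outer product)
    enhanced = conv_same(np.maximum(batch_norm(corr), 0.0), conv_kernel)  # BN, ReLU, conv
    updated = init_kernel * enhanced       # Hadamard update of the filter kernel
    return words @ updated, updated        # filtering modeled as a matrix product

words = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 4.0]])
visual_ctx = np.array([11.0, 9.0])
text_ctx = np.array([6.0, 6.0])
init_kernel = np.ones((2, 2))
identity_conv = np.zeros((3, 3))
identity_conv[1, 1] = 1.0                  # convolution that leaves its input unchanged

corrected, updated = correct_words(words, visual_ctx, text_ctx, init_kernel, identity_conv)
```

With these toy values the ReLU zeroes the second row of the kernel, so the second channel of every word no longer contributes to the corrected features. This shows how the video context can suppress parts of a description that do not match the visual content.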
[0086] The Transformer encoder perceives semantic entity information of a given visual feature based on the corrected word features. The semantic entity information is as follows:
[0087] $E = \mathrm{Softmax}\left(\frac{Q_v \otimes K_w^{\top}}{\sqrt{C_{out}}}\right) \otimes V_w, \quad Q_v = q_v(F_v), \quad K_w = k_w(\hat{F}_w), \quad V_w = v_w(\hat{F}_w)$
[0088] In the formula, $E$ is the semantic entity information, Softmax is a network layer, $Q_v$ is the result of linear mapping of the high-dimensional visual features, $\otimes$ is matrix multiplication, $K_w$ and $V_w$ are the results of linear mapping of the word features, $C_{out}$ is the output dimension, $F_v$ is the high-dimensional visual features, $q_v$, $k_w$ and $v_w$ are linear mapping layers, and $\hat{F}_w$ is the corrected word features.
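The semantic entity information has the form of scaled dot-product cross-attention, with queries taken from the visual features and keys and values taken from the corrected word features. The following NumPy sketch shows that computation for a single head. The random weights, the dimensions, and the function names are assumptions for illustration; bias terms, multiple heads, and the rest of the encoder layer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, words, w_q, w_k, w_v):
    q = visual @ w_q                          # queries from visual features, (N, C_out)
    k = words @ w_k                           # keys from word features,      (L, C_out)
    v = words @ w_v                           # values from word features,    (L, C_out)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled similarity, (N, L)
    return softmax(scores, axis=-1) @ v       # one text-aware vector per visual token

rng = np.random.default_rng(0)
N, L, C_vis, C_txt, C_out = 6, 4, 8, 5, 8     # assumed toy dimensions
visual = rng.normal(size=(N, C_vis))
words = rng.normal(size=(L, C_txt))
w_q = rng.normal(size=(C_vis, C_out))
w_k = rng.normal(size=(C_txt, C_out))
w_v = rng.normal(size=(C_txt, C_out))

entity = cross_attention(visual, words, w_q, w_k, w_v)  # shape (N, C_out)
```

Each output row is a convex combination of the value vectors, so every spatiotemporal position receives a mixture of word information weighted by how strongly it matches each word.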
[0089] S4. Extract the high-dimensional and low-dimensional wavelet features of the high-dimensional visual features, refine the high-dimensional and low-dimensional wavelet features, and obtain the pixel space mapping based on the refined high-dimensional and low-dimensional wavelet features.
[0090] The specific steps for extracting high-dimensional and low-dimensional wavelet features from high-dimensional visual features, refining these features, and obtaining pixel space mappings based on the refined high-dimensional and low-dimensional wavelet features include:
[0091] High-dimensional and low-dimensional wavelet features of the high-dimensional visual features are extracted using the continuous discrete wavelet transform, that is, s discrete wavelet transforms applied in succession. The continuous discrete wavelet transform is as follows:
[0092] $\left[F_{ll}, F_h^{(1)}, \dots, F_h^{(m)}\right] = \mathrm{DWT}_s(F_v)$
[0093] In the formula, $\mathrm{DWT}_s$ is a series of s discrete wavelet transform operations, $F_v$ is the high-dimensional visual features, $F_{ll}$ is the low-dimensional wavelet feature, and $F_h^{(1)}, \dots, F_h^{(m)}$ are the high-dimensional wavelet features.
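To make the wavelet decomposition concrete, the sketch below applies s levels of a 2D Haar transform to a single feature map and then inverts it. The Haar wavelet, the restriction to one 2D map with side lengths divisible by 2^s, and the sub-band names are assumptions for illustration; the text does not name the wavelet family or the layout of the features.

```python
import numpy as np

def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform: (low band, three detail bands)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

def haar_idwt2(ll, highs):
    """Exact inverse of haar_dwt2."""
    lh, hl, hh = highs
    x = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def dwt_s(x, s):
    """Apply s transforms in succession, re-decomposing the low band each time."""
    highs, ll = [], x
    for _ in range(s):
        ll, h = haar_dwt2(ll)
        highs.append(h)
    return ll, highs

def idwt_s(ll, highs):
    """Undo dwt_s, starting from the coarsest level."""
    for h in reversed(highs):
        ll = haar_idwt2(ll, h)
    return ll

feature_map = np.arange(64, dtype=float).reshape(8, 8)
low, details = dwt_s(feature_map, s=2)   # low has shape (2, 2)
restored = idwt_s(low, details)          # equals feature_map up to rounding
```

Each coefficient of the coarsest low band summarizes a 4×4 region of the input here, which is one way to see why operating in wavelet space gives every position a wider spatial context than a single pixel.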
[0094] The target entity is located by aggregating the low-dimensional wavelet features and the sentence features, resulting in refined low-dimensional wavelet features. The refined low-dimensional wavelet features are as follows:
[0095] $\hat{F}_{ll} = \mathrm{Softmax}\left(\frac{Q_{ll} \otimes K_s^{\top}}{\sqrt{C_{out}}}\right) \otimes V_s, \quad Q_{ll} = q_{ll}(F_{ll}), \quad K_s = k_s(F_s), \quad V_s = v_s(F_s)$
[0096] In the formula, $\hat{F}_{ll}$ is the refined low-dimensional wavelet features, Softmax is a network layer, $Q_{ll}$ is the result of mapping the low-dimensional wavelet features, $\otimes$ is matrix multiplication, $K_s$ and $V_s$ are the results of mapping the sentence features, $C_{out}$ is the output dimension, $q_{ll}$, $k_s$ and $v_s$ are linear mapping layers, $F_{ll}$ is the low-dimensional wavelet feature, and $F_s$ is the sentence features.
[0097] A cross-modal attention operation is performed on the refined low-dimensional wavelet features and the concatenated high-dimensional wavelet features to obtain the refined result of the concatenated high-dimensional wavelet features. The concatenated high-dimensional wavelet features are as follows:
[0098] $F_h = \mathrm{Concat}\left(F_h^{(1)}, \dots, F_h^{(m)}\right)$
[0099] In the formula, $F_h$ is the concatenated high-dimensional wavelet features, Concat is the concatenation operation, and $F_h^{(1)}, \dots, F_h^{(m)}$ are the high-dimensional wavelet features.
[0100] The refined result of the concatenated high-dimensional wavelet features is as follows:
[0101] $\hat{F}_h = \mathrm{Softmax}\left(\frac{Q_h \otimes K_{ll}^{\top}}{\sqrt{C_{out}}}\right) \otimes V_{ll}, \quad Q_h = q_h(F_h), \quad K_{ll} = k_{ll}(\hat{F}_{ll}), \quad V_{ll} = v_{ll}(\hat{F}_{ll})$
[0102] In the formula, $\hat{F}_h$ is the refined result of the concatenated high-dimensional wavelet features, $Q_h$ is the result of linear mapping of the high-dimensional wavelet features, $K_{ll}$ and $V_{ll}$ are the results of linear mapping of the low-dimensional wavelet features, $q_h$, $k_{ll}$ and $v_{ll}$ are linear mapping layers, $F_h$ is the concatenated high-dimensional wavelet features, and $\hat{F}_{ll}$ is the refined low-dimensional wavelet features.
[0103] By using the inverse feature transformation, the refined high-dimensional wavelet features and low-dimensional wavelet features are mapped back to the pixel space, resulting in the pixel space mapping:
[0104] $M = \mathrm{Sigmoid}\left(\mathrm{IDWT}_s\left(\mathrm{Concat}(\hat{F}_{ll}, \hat{F}_h)\right)\right)$
[0105] In the formula, $M$ is the pixel space mapping, Sigmoid is a non-linear activation layer, $\mathrm{IDWT}_s$ is the inverse wavelet transform operation, Concat is the concatenation operation, $\hat{F}_{ll}$ is the refined low-dimensional wavelet features, and $\hat{F}_h$ is the refined result of the concatenated high-dimensional wavelet features.
[0106] S5. The semantic entity information and pixel space mapping are fused and concatenated into enhanced high-dimensional visual features. Based on the enhanced high-dimensional visual features and instance queries, the predicted instance is obtained through the Transformer decoder.
[0107] The semantic entity information and the pixel space mapping are fused so that all clues about the referent entity are considered together. The specific formula is as follows:
[0108] $\tilde{F}_v^{i} = E^{i} \odot M^{i}$
[0109] In the formula, $\tilde{F}_v^{i}$ is the i-th high-dimensional visual feature after fusion, $E^{i}$ is the corresponding semantic entity information, $\odot$ represents the Hadamard operation, and $M^{i}$ is the corresponding pixel space mapping.
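The fusion step is an element-wise gate: the pixel space mapping, a sigmoid output between 0 and 1, scales the semantic entity information position by position. A minimal NumPy sketch follows. The three scales with 8, 4 and 2 tokens, the channel width of 3, the concatenation along the token axis, and the names `fuse_scales` and `wavelet_outputs` are assumptions for illustration; `wavelet_outputs` stands in for the inverse wavelet transform result before the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_scales(entity_infos, wavelet_outputs):
    """Gate each scale's entity information with its sigmoid mask, then concatenate."""
    fused = [e * sigmoid(w) for e, w in zip(entity_infos, wavelet_outputs)]  # Hadamard product
    return np.concatenate(fused, axis=0)  # stack the scales along the token axis

rng = np.random.default_rng(0)
token_counts = [8, 4, 2]   # assumed token counts of the three high-dimensional scales
channels = 3
entity_infos = [rng.normal(size=(n, channels)) for n in token_counts]
wavelet_outputs = [np.zeros((n, channels)) for n in token_counts]  # sigmoid(0) = 0.5

enhanced = fuse_scales(entity_infos, wavelet_outputs)  # shape (14, 3)
```

Because the mask lies strictly between 0 and 1, positions that the wavelet branch considers unlikely to belong to the target are attenuated rather than removed outright.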
[0110] The three fused visual features are concatenated to form the enhanced high-dimensional visual features. The enhanced high-dimensional visual features are fed into the Transformer decoder, which combines them with the instance queries to generate $N_q$ prediction instances, where the instance queries are created from the sentence features.
[0111] S6. Generate a prediction head based on the prediction instance, and segment the target entity using the prediction head to obtain the prediction trajectory.
[0112] Three lightweight prediction heads are applied to the $N_q$ prediction instances for target entity segmentation: a bounding box head, a mask head, and a category head. The bounding box head is implemented as a three-layer feedforward network that predicts the location of the target entity. The mask head generates the parameters of instance kernels, which are used to predict instance-level features, where H and W denote the height and width of the video, respectively. The category head outputs binary probabilities indicating whether an instance is visible, where K denotes the number of categories. Together, these prediction heads output $N_q$ predicted trajectories.
[0113] S7. The Hungarian algorithm is used to match the predicted trajectory with the real target sequence to obtain the target segmentation result.
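The matching in step S7 is a linear assignment problem: build a cost between every predicted trajectory and every real target, then choose the one-to-one pairing with the lowest total cost. The sketch below uses `scipy.optimize.linear_sum_assignment`, which solves this problem. The cost of 1 minus mask IoU and the flattened toy masks are assumptions for illustration; the text does not specify the matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match(pred_masks, gt_masks):
    """Return (prediction index, target index) pairs with minimal total cost."""
    cost = np.array([[1.0 - mask_iou(p, g) for g in gt_masks] for p in pred_masks])
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist())), cost

# Toy flattened masks: three predictions, two real targets.
gt_masks = [np.array([1, 1, 0, 0], dtype=bool), np.array([0, 0, 1, 1], dtype=bool)]
pred_masks = [
    np.array([0, 0, 1, 1], dtype=bool),  # identical to target 1
    np.array([1, 0, 0, 0], dtype=bool),  # partial overlap with target 0
    np.array([1, 1, 1, 0], dtype=bool),  # best available overlap with target 0
]

pairs, cost = match(pred_masks, gt_masks)  # [(0, 1), (2, 0)]
```

Prediction 1 is left unmatched because pairing target 0 with prediction 2 gives a lower total cost. A rectangular cost matrix is accepted, so the number of predictions may exceed the number of targets.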
[0114] The Hungarian algorithm matches the predicted trajectories against the real target sequence to find the best prediction. This method was experimentally validated on four publicly available datasets (Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences). Ref-Youtube-VOS is the largest benchmark dataset, containing 3978 videos and nearly 15,000 language descriptions. Ref-DAVIS17 contains 90 videos, each providing a language description for a specific object. A2D-Sentences contains 3782 videos, each with 3-5 frames of pixel-level segmentation mask annotations. JHMDB-Sentences contains 928 videos, each associated with a description, covering 21 different action categories.
[0115] To ensure fairness in experimental comparisons, the evaluation metrics used were the same as those used in previous studies. Specifically, the Ref-Youtube-VOS and Ref-DAVIS17 benchmarks employed three standard evaluation metrics: region similarity (J), contour accuracy (F), and their mean (J&F). On the A2D-Sentences and JHMDB-Sentences datasets, the proposed method was evaluated by calculating the overall intersection over union (OIoU), the mean intersection over union (MIoU), and the mean average precision (mAP) over IoU thresholds 0.50:0.05:0.95. OIoU is the ratio of the total intersection area to the total union area over all test samples, and MIoU is the mean of the per-sample IoU over all test samples.
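The difference between the two IoU metrics is easiest to see on numbers. The sketch below computes both for two toy samples, one large object segmented well and one small object segmented poorly. The masks are made up for illustration and the function name is hypothetical; every sample is assumed to have a non-empty union.

```python
import numpy as np

def overall_and_mean_iou(preds, gts):
    """Return (OIoU, MIoU) for paired boolean masks with non-empty unions."""
    inters = np.array([np.logical_and(p, g).sum() for p, g in zip(preds, gts)], dtype=float)
    unions = np.array([np.logical_or(p, g).sum() for p, g in zip(preds, gts)], dtype=float)
    overall = inters.sum() / unions.sum()  # total intersection over total union
    mean = np.mean(inters / unions)        # average of the per-sample IoU values
    return overall, mean

idx = np.arange(100)
# Sample 1: large object, intersection 90 and union 100 (IoU 0.9).
pred_large, gt_large = idx < 95, idx >= 5
# Sample 2: small object, intersection 1 and union 10 (IoU 0.1).
pred_small, gt_small = idx < 6, (idx >= 5) & (idx < 10)

overall, mean = overall_and_mean_iou([pred_large, pred_small], [gt_large, gt_small])
# overall = 91 / 110, mean = (0.9 + 0.1) / 2
```

OIoU is dominated by large objects because it pools pixel counts, while MIoU weights every sample equally, so the two can diverge sharply when small objects are segmented poorly.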
[0116] To evaluate the effectiveness of the present invention on different datasets, the experimental results are shown in Tables 1 and 2.
[0117] Table 1 Comparison of experimental results on Ref-Youtube-VOS and Ref-DAVIS17
[0118]
[0119] Table 2 Comparison of experimental results on A2D-Sentences and JHMDB-Sentences
[0120]
[0121]
[0122] As can be seen from Tables 1 and 2, the video target segmentation method based on wavelet correction learning of the present invention outperforms the state-of-the-art methods in almost all metrics, demonstrating the effectiveness and generalization of the present invention.
[0123] To further verify the effectiveness of the proposed method, segmentation results on samples selected from the Ref-Youtube-VOS dataset are shown in Figure 3. The top row shows the input frames together with the corresponding English text descriptions. SgMg uses Fourier transform to achieve global spatiotemporal context modeling, and ReferFormer directly uses language features as the target query. The rows labeled "this method" show the results of referential segmentation by this method under the given text description. It can be seen that this method predicts the target object more accurately.
[0124] Example 2
[0125] This embodiment provides a video target segmentation system based on wavelet correction learning. As shown in Figure 2, it includes:
[0126] The video data acquisition module is used to acquire video data and extract high-dimensional and low-dimensional visual features from the video data.
[0127] The text data acquisition module is used to acquire text data and extract initial word features and sentence features from the text data;
[0128] The semantic entity information acquisition module is used to correct the initial word features to obtain corrected word features, and then use the Transformer encoder to perceive the semantic entity information of the given visual features based on the corrected word features.
[0129] The pixel space mapping acquisition module is used to extract high-dimensional wavelet features and low-dimensional wavelet features of high-dimensional visual features, refine the high-dimensional wavelet features and low-dimensional wavelet features, and obtain the pixel space mapping based on the refined high-dimensional wavelet features and low-dimensional wavelet features.
[0130] The prediction instance acquisition module is used to fuse semantic entity information and pixel space mapping, and concatenate them into enhanced high-dimensional visual features. Based on the enhanced high-dimensional visual features and instance query, the prediction instance is obtained through the Transformer decoder.
[0131] The predicted trajectory acquisition module is used to generate a prediction head based on the prediction instance, and to segment the target entity using the prediction head to obtain the predicted trajectory;
[0132] The target segmentation result acquisition module is used to match the predicted trajectory with the real target sequence using the Hungarian algorithm to obtain the target segmentation result.
[0133] The rest is the same as in Example 1.
[0134] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.
Claims
1. A method for segmenting referential video targets based on wavelet correction learning, characterized in that, Includes the following steps: Acquire video data and extract high-dimensional and low-dimensional visual features from the video data; Acquire text data and extract initial word features and sentence features from the text data; The initial word features are corrected to obtain corrected word features, and the semantic entity information of the given visual features is perceived by the Transformer encoder based on the corrected word features. Extract the high-dimensional wavelet features and low-dimensional wavelet features of the high-dimensional visual features, refine the high-dimensional wavelet features and low-dimensional wavelet features, and obtain the pixel space mapping based on the refined high-dimensional wavelet features and low-dimensional wavelet features. The semantic entity information and pixel space mapping are fused and concatenated into enhanced high-dimensional visual features. Based on the enhanced high-dimensional visual features and instance queries, the predicted instance is obtained through the Transformer decoder. A prediction head is generated based on the prediction instance, and the target entity is segmented using the prediction head to obtain the prediction trajectory; The predicted trajectory and the real target sequence are matched using the Hungarian algorithm to obtain the target segmentation result.
2. The method for segmenting referential video targets based on wavelet correction learning according to claim 1, characterized in that, The high-dimensional and low-dimensional visual features of the video data are extracted by a pre-trained Video-Swin-T model, and the initial word features and sentence features are output by a pre-trained RoBERTa-base model. The Video-Swin-T model consists of 5 modules, of which the outputs of the last 3 modules are regarded as high-dimensional visual features and the outputs of the first 2 modules are regarded as low-dimensional visual features.
3. The method for segmenting referential video targets based on wavelet correction learning according to claim 1, characterized in that, The specific steps of correcting the initial word features to obtain corrected word features, and then using a Transformer encoder to perceive the semantic entity information of a given visual feature based on the corrected word features, include: Visual context information is obtained based on the high-dimensional visual features; Obtain text context features based on the initial word features; The correlation matrix between the visual context information and the text context features was obtained through analysis; The initial filter kernel is updated by performing a Hadamard operation on the correlation matrix after enhanced cross-modal interaction, resulting in the updated filter kernel. The corrected word features are obtained based on the initial word features and the updated filter kernel; The Transformer encoder perceives semantic entity information of a given visual feature based on the corrected word features.
4. The method for segmenting referential video targets based on wavelet correction learning according to claim 1, characterized in that the semantic entity information is given by a formula (not reproduced in this text) in which the defined terms are: the semantic entity information; Softmax, a network layer; the result of linear mapping of the semantic entity information; the matrix multiplication operator; the two results of linear mapping of the word features; C_out, the output dimension; the high-dimensional visual features; q_v, k_w and v_w, linear mapping layers; and the corrected word features.
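The terms listed in claim 4 (Softmax, matrix multiplication, the output dimension C_out, and the linear mapping layers q_v, k_w and v_w) are consistent with a scaled dot-product cross-attention between visual features and corrected word features. The sketch below assumes that standard form, with the query taken from the high-dimensional visual features and the keys and values taken from the corrected word features. The exact formula, the function name `semantic_entity_attention` and the weight shapes are assumptions, since the formula itself is not available in the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_entity_attention(visual, words, w_q, w_k, w_v):
    """Cross-attention from visual positions to corrected word features.

    visual: (P, C) high-dimensional visual features flattened to P positions.
    words:  (L, C) corrected word features for L words.
    w_q, w_k, w_v: (C, C_out) weights standing in for q_v, k_w and v_w (assumed shapes).
    """
    q = visual @ w_q                                   # (P, C_out) queries
    k = words @ w_k                                    # (L, C_out) keys
    v = words @ w_v                                    # (L, C_out) values
    c_out = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(c_out), axis=-1)  # (P, L) attention weights
    return attn @ v, attn                              # (P, C_out) output, weights
```

Each row of the returned attention matrix sums to one, so every visual position receives a convex combination of the word value vectors.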
5. The method for segmenting referential video targets based on wavelet correction learning according to claim 1, characterized in that the corrected word features are given by a formula (not reproduced in this text) in which the defined terms are: the corrected word features; the updated filter kernel; the filtering operation; the initial word features; Ω_i, the initial filter kernel; ⊙, the Hadamard operation; the relation enhancement matrix; Conv_k×k, a k×k convolution; ReLU, the activation layer; BN, the batch normalization layer; the visual context information; the text context features; and the matrix multiplication operator.
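One possible reading of the correction in claim 5 is sketched below. Every structural choice is an assumption: the correlation matrix is taken as the product of text context features and transposed visual context information, the relation enhancement applies a k×k convolution followed by a normalization (standing in for BN on a single sample) and ReLU, and the filtering operation is a per-word 1-D convolution along the channel axis. The names `conv2d_same`, `enhance_relation` and `correct_word_features` are hypothetical.

```python
import numpy as np


def conv2d_same(x, kernel):
    """k x k sliding-window filter with zero padding (k odd), output same size as x."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out


def enhance_relation(corr, kernel, eps=1e-5):
    """Assumed order: Conv k x k, then normalization (BN stand-in), then ReLU."""
    y = conv2d_same(corr, kernel)
    y = (y - y.mean()) / np.sqrt(y.var() + eps)
    return np.maximum(y, 0.0)


def correct_word_features(words, omega_init, visual_ctx, text_ctx, kernel):
    """words: (L, C); omega_init: (L, K); visual_ctx: (K, C); text_ctx: (L, C)."""
    corr = text_ctx @ visual_ctx.T                 # (L, K) correlation matrix
    relation = enhance_relation(corr, kernel)      # (L, K) relation enhancement matrix
    omega = omega_init * relation                  # Hadamard update of the filter kernel
    # Filtering (assumed form): each word is filtered by its own length-K kernel.
    corrected = np.stack(
        [np.convolve(words[i], omega[i], mode="same") for i in range(words.shape[0])]
    )
    return corrected, omega
```

The Hadamard update means the relation enhancement matrix only rescales the entries of the initial kernel; an all-zero initial kernel therefore yields all-zero corrected features in this sketch.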
6. The method for segmenting referential video targets based on wavelet correction learning according to claim 1, characterized in that the step of extracting the high-dimensional wavelet features and low-dimensional wavelet features of the high-dimensional visual features, refining these features, and obtaining the pixel space mapping based on the refined features specifically comprises: extracting the high-dimensional wavelet features and low-dimensional wavelet features of the high-dimensional visual features by consecutive discrete wavelet transforms; locating the target entity by aggregating the low-dimensional wavelet features and the sentence features, to obtain refined low-dimensional wavelet features; performing a cross-modal attention operation on the refined low-dimensional wavelet features and the concatenated high-dimensional wavelet features, to obtain refined high-dimensional wavelet features; and mapping the refined high-dimensional wavelet features and the refined low-dimensional wavelet features back to pixel space by an inverse feature transformation, to obtain the pixel space mapping.
7. The method for segmenting referential video targets based on wavelet correction learning according to claim 6, characterized in that the consecutive discrete wavelet transform is given by a formula (not reproduced in this text) in which: DWT^s denotes s consecutive discrete wavelet transform operations; and the remaining terms are the high-dimensional visual features, the low-dimensional wavelet feature, and the high-dimensional wavelet features.
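The idea of s consecutive discrete wavelet transforms, and of mapping the result back by the inverse transform, can be sketched with a 2-D Haar wavelet. The choice of the Haar wavelet, the reading of the approximation sub-band as the "low-dimensional wavelet feature" and of the three detail sub-bands as the "high-dimensional wavelet features", and the function names are all assumptions; the claims do not name a wavelet family.

```python
import numpy as np


def haar_dwt2(x):
    """One-level orthonormal 2-D Haar DWT of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0          # approximation sub-band
    lh = (a + b - c - d) / 2.0          # detail sub-bands
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, (lh, hl, hh)


def haar_idwt2(ll, details):
    """Inverse of haar_dwt2: rebuild the array from its four sub-bands."""
    lh, hl, hh = details
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def dwt_s(x, s):
    """Apply the DWT s times, each time to the previous approximation sub-band."""
    low, details = x, []
    for _ in range(s):
        low, d = haar_dwt2(low)
        details.append(d)
    return low, details


def idwt_s(low, details):
    """Undo dwt_s by applying the inverse transform from the coarsest level up."""
    for d in reversed(details):
        low = haar_idwt2(low, d)
    return low
```

Each level halves the spatial resolution of the approximation sub-band, and because the Haar transform is orthonormal, `idwt_s(*dwt_s(x, s))` reproduces `x` up to floating-point error when no refinement is applied in between.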
8. The method for segmenting referential video targets based on wavelet correction learning according to claim 6, characterized in that the pixel space mapping is given by a formula (not reproduced in this text) in which the defined terms are: the pixel space mapping; Sigmoid, a non-linear activation layer; IDWT^s, the inverse wavelet transform operation; Concat, the concatenation operation; the low-dimensional wavelet features; Softmax, a network layer; the refined result of the concatenated high-dimensional wavelet features; the result of transforming the low-dimensional wavelet features; the matrix multiplication operator; the two results of mapping the sentence features; C_out, the output dimension; the result of mapping the high-dimensional wavelet features; and the two results of mapping the low-dimensional wavelet features.
9. The method for segmenting referential video targets based on wavelet correction learning according to claim 1, characterized in that the prediction head comprises a bounding-box head, a mask head, and a category head; the bounding-box head is implemented as a 3-layer feedforward network and is used to predict the location of the target entity; the mask head generates the parameters of an instance kernel and is used to predict instance-level features; and the category head outputs a binary probability indicating whether the instance is visible.
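The three heads of claim 9 can be sketched as follows. The 3-layer feedforward box head, the kernel-generating mask head and the binary category head follow the claim; the normalized (cx, cy, w, h) box format, the use of the generated kernel as a 1×1 dynamic convolution over a pixel feature map, the random weight initialization, and the class name `PredictionHeads` are assumptions for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_linear(rng, d_in, d_out):
    """Random weights and zero bias for one linear layer (illustrative only)."""
    return rng.normal(0.0, 0.1, size=(d_in, d_out)), np.zeros(d_out)


class PredictionHeads:
    def __init__(self, dim, mask_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Bounding-box head: 3-layer feedforward network ending in 4 box values.
        self.box_layers = [
            init_linear(rng, dim, dim),
            init_linear(rng, dim, dim),
            init_linear(rng, dim, 4),
        ]
        self.kernel_layer = init_linear(rng, dim, mask_dim)  # mask head
        self.class_layer = init_linear(rng, dim, 1)          # category head

    def __call__(self, instances, pixel_feats):
        """instances: (N, dim) predicted instances; pixel_feats: (mask_dim, H, W)."""
        h = instances
        for i, (w, b) in enumerate(self.box_layers):
            h = h @ w + b
            if i < len(self.box_layers) - 1:
                h = relu(h)
        boxes = sigmoid(h)                                   # (N, 4), assumed normalized
        w, b = self.kernel_layer
        kernels = instances @ w + b                          # (N, mask_dim) instance kernels
        masks = sigmoid(np.einsum("nc,chw->nhw", kernels, pixel_feats))
        w, b = self.class_layer
        visible = sigmoid(instances @ w + b)[:, 0]           # (N,) visibility probability
        return boxes, masks, visible
```

Generating the kernel from each predicted instance lets a single shared pixel feature map produce a different mask per instance, which is one common way to realize an instance-kernel mask head.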
10. A video target segmentation system based on wavelet correction learning, characterized in that it comprises: a video data acquisition module, used to acquire video data and extract high-dimensional visual features and low-dimensional visual features from the video data; a text data acquisition module, used to acquire text data and extract initial word features and sentence features from the text data; a semantic entity information acquisition module, used to correct the initial word features to obtain corrected word features, and to perceive, through a Transformer encoder and based on the corrected word features, semantic entity information of the given visual features; a pixel space mapping acquisition module, used to extract high-dimensional wavelet features and low-dimensional wavelet features of the high-dimensional visual features, refine the high-dimensional wavelet features and the low-dimensional wavelet features, and obtain a pixel space mapping based on the refined high-dimensional wavelet features and the refined low-dimensional wavelet features; a predicted instance acquisition module, used to fuse the semantic entity information and the pixel space mapping, concatenate them into enhanced high-dimensional visual features, and obtain a predicted instance through a Transformer decoder based on the enhanced high-dimensional visual features and instance queries; a predicted trajectory acquisition module, used to generate a prediction head based on the predicted instance and to segment the target entity using the prediction head to obtain a predicted trajectory; and a target segmentation result acquisition module, used to match the predicted trajectory with the ground-truth target sequence using the Hungarian algorithm to obtain the target segmentation result.