Video label type prediction and recommendation method based on multi-modal feature extraction

By fusing textual and visual information through multimodal feature extraction and cross-modal attention mechanisms, a deep neural network classifier is constructed, which solves the problems of low efficiency, insufficient accuracy, and fragmented recommendation in video annotation, and achieves high-precision multi-label type prediction and personalized recommendation.

CN122244548A · Pending · Publication Date: 2026-06-19 · EAST CHINA NORMAL UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: EAST CHINA NORMAL UNIV
Filing Date: 2026-04-13
Publication Date: 2026-06-19

Application Information

Patent Timeline

13 Apr 2026: Application
19 Jun 2026: Publication (CN122244548A)

IPC: G06V10/764; G06V10/80; G06V10/82; G06V10/20; G06V20/40; G06F16/735; G06N3/047


Similar Technology Patents

Video processing method, machine learning model training method and related device and equipment (CN114419515A)
Video data processing methods and devices (CN113688951B)
Video cover extraction method and device, equipment and computer readable storage medium (CN113762052A)
Multi-modal label recommendation method based on unidirectional supervision attention (CN113704547A)
A multimodal sentiment analysis method and system based on attention mechanism (CN116563751B)


AI Technical Summary

Technical Problem

Existing video annotation methods rely on manual labor, which is inefficient and costly. The annotation results are highly subjective, and the labeling system is rigid and cannot adapt to emerging video styles and user needs. Traditional models extract features with insufficient accuracy, and label prediction and recommendation are disconnected.

Method used

We employ a deep learning-based multimodal feature extraction method, utilize cross-modal attention mechanism to fuse textual and visual information, construct a multilayer perceptron deep neural network classifier, and combine user behavior data for personalized recommendations.

Benefits of technology

It achieves high-precision multi-label type prediction and recommendation, adapts to emerging video styles, reduces manual costs, and improves the accuracy of labeling and the precision of recommendations.

✦ Generated by Eureka AI based on patent content.


Figure CN122244548A_ABST


Abstract

This invention provides a video tag type prediction and recommendation method based on multimodal feature extraction, comprising: collecting basic data of video works and preprocessing the basic data; extracting and fusing multimodal features from the preprocessed data to obtain multimodal feature vectors; constructing a deep neural network classifier based on a multilayer perceptron and inputting the multimodal feature vectors into the deep neural network classifier to obtain the independent probability of each preset tag type; acquiring users' historical behavior data, constructing user profiles, and making personalized recommendations. Compared with existing technologies, the technical solution of this invention utilizes advanced deep learning models to automatically extract deep features from multimodal data, and through a cross-modal attention mechanism, achieves deep fusion of textual and visual information, resulting in more accurate tag labeling conclusions and possessing strong application value.


Description

Technical Field

[0001] This invention relates to the fields of computer information technology and artificial intelligence, and in particular to a method for predicting and recommending video tag types based on multimodal feature extraction.

Background Technology

[0002] With the development of multimedia information technology, short videos, movies, and other artistic works are emerging in large numbers. Tags have become a key cue that users rely on to filter the content they want from a massive amount of video. However, traditional movie tagging methods have the following problems that urgently need to be addressed: 1) Labeling relies primarily on manual labor. Manual labeling is inefficient, cannot achieve automated batch labeling, and incurs high labor costs.

[0003] 2) The annotation results are highly subjective. Different annotators may define the same video work differently, which may leave film and television works incompletely labeled, so that users can no longer search accurately for the content they want.

[0004] 3) The tagging system is relatively rigid. With the development of social entertainment, tagging systems are updated and iterated rapidly, and a rigid system cannot adapt to emerging video styles and users' personalized needs.

[0005] Existing technologies also include non-manual prediction and recommendation methods that build models to automatically analyze elements such as movie synopses, trailers, and cast information, and classify them using machine learning models such as Support Vector Machines (SVM) and decision trees. However, this approach also has shortcomings in application: 1) The tag library needs to be organized manually. Before the model is applied, the tag library must be fully organized by hand and the model must be trained before works can be matched to tags. This method depends heavily on the completeness and soundness of the tag library. In addition, the output is usually a single tag, whereas in reality most works mix multiple types and cannot be summarized by a single tag.

[0006] 2) Traditional model feature extraction accuracy is insufficient. When applying the model for prediction, it fails to fully extract the inherent meaning of the film, resulting in deviations in the accuracy of labeling.

[0007] 3) Tag prediction and recommendation are somewhat disconnected. The extraction of user needs still relies primarily on the domain of core tags for recommendations, failing to uncover deeper user needs, and the accuracy of recommendations needs improvement.

[0008] In summary, how to integrate multimodal information to achieve high-precision multi-label type prediction, and further build a recommendation engine that meets user needs, remains a problem to be solved in the current technology field.

Summary of the Invention

[0009] In view of this, the video tag type prediction and recommendation method based on multimodal feature extraction proposed in this invention uses an advanced deep learning model to automatically extract deep features from multimodal data. Through a cross-modal attention mechanism, it achieves deep fusion of textual and visual information, and obtains more accurate tag annotation conclusions, which has strong application value.

[0010] Embodiments of the present invention provide a video tag type prediction and recommendation method based on multimodal feature extraction, including: collecting basic data of video works and preprocessing the basic data; performing multimodal feature extraction and fusion on the preprocessed data to obtain multimodal feature vectors; constructing a deep neural network classifier based on a multilayer perceptron, and inputting the multimodal feature vectors into the deep neural network classifier to obtain the independent probability of each preset label type; and obtaining users' historical behavior data, building user profiles, and making personalized recommendations.

[0011] For example, the basic data includes first basic data, second basic data, and third basic data. The first basic data is text data collected using web crawlers, including title, synopsis, cast and crew list, and user reviews. The second basic data is video image data, including movie posters and keyframe screenshots. The third basic data is non-plot data, including release year, country, and production company.

[0012] For example, the preprocessing of the basic data includes: The first basic data is cleaned and featurized: regular expressions are used to remove redundant characters and text case is standardized to obtain text sequence data, a word segmenter is then applied to the text sequence data, and non-semantic words are removed. The second basic data is cleaned: the video file is converted into an image frame sequence using a content-based adaptive sampling method, invalid and blurry frames are removed by calculating the average pixel intensity and standard deviation of each image, the TransNet V2 model is used to detect scene boundaries, key frames are extracted within each scene boundary, and normalization and data augmentation are performed. The third basic data is cleaned: an embedding method is used to convert the director and lead actor into low-dimensional vectors, and the year information is normalized and scaled.

[0013] For example, the step of extracting and fusing multimodal features from the preprocessed data to obtain a multimodal feature vector includes: The RoBERTa model based on the Transformer architecture is used to extract text feature vectors by inputting the preprocessed first basic data into the RoBERTa model. A deep convolutional neural network is used to input the preprocessed second basic data into the deep convolutional neural network, and the output of the fully connected layer is used as a visual feature vector. A cross-modal attention mechanism is used to fuse the text feature vector and the visual feature vector, and then the dimensionality is reduced and integrated to obtain a multimodal feature vector.

[0014] For example, the step of using the RoBERTa model based on the Transformer architecture to extract text feature vectors by inputting the preprocessed first basic data into the RoBERTa model includes: The RoBERTa model's tokenizer transforms the preprocessed first basic data into a token sequence and adds a sequence start symbol and an end symbol to define the sequence range. The token sequence is converted into embedding vectors, combined with positional encoding information, and input into multiple cascaded Transformer encoding layers. Each encoding layer models the global context through a multi-head self-attention module so that every position in the sequence is fused with information from the other positions. The text feature vector is obtained through the nonlinear transformation of a feedforward neural network and normalization.

[0015] For example, the deep convolutional neural network is a pre-trained ResNet-50 neural network. The step of using a deep convolutional neural network, inputting the pre-processed second basic data into the deep convolutional network, and using the output of the fully connected layer as a visual feature vector includes: The preprocessed second basic data is input into the convolutional layer for visual feature extraction; The extracted visual features are aggregated in spatial dimensions by the global average pooling layer at the end of the network, and then the output is passed to the fully connected layer. The activation values of the fully connected layer are output as the visual feature vector.

[0016] For example, the step of using a cross-modal attention mechanism to fuse the text feature vector and the visual feature vector, and then reducing their dimensions to obtain a multimodal feature vector includes: Using the text feature vector as the query vector and the visual feature vector as the key vector and value vector, the attention distribution from text to vision is calculated to generate a visual context vector associated with the text semantics. The visual context vector and the text feature vector are concatenated, and the concatenated high-dimensional vector is nonlinearly transformed and dimensionality reduced by a fully connected layer to obtain the multimodal feature vector.

[0017] For example, the deep neural network classifier is a multilayer perceptron with gating and residual connections. The multilayer perceptron includes several hidden layers, each of which uses the GELU activation function, and skip connections and gated linear units are introduced between the hidden layers. The step of inputting the multimodal feature vector into the deep neural network classifier to obtain the independent probability of each preset label type includes: The multimodal feature vector is input into a linear transformation layer, and the GELU activation function is applied to obtain an intermediate feature. The intermediate feature is input into the gated linear unit, which divides it into two equal sub-vectors along the feature dimension, namely a first sub-vector and a second sub-vector. The first sub-vector is passed through the Sigmoid activation function, and the output is used as a control gate. The control gate is multiplied element-wise with the second sub-vector to generate a regulated feature vector. The regulated feature vector is then added to the multimodal feature vector to obtain the final output vector. The higher-order final output vector is fed into the fully connected output layer, each neuron in the output layer performs a linear transformation, and the scalar value output by each neuron is passed through the activation function to obtain the probability value.

[0018] For example, it also includes: The new label type is obtained by learning and training the deep neural network classifier, and the new label is incorporated into the preset label.

[0019] For example, the step of acquiring users' historical behavior data, constructing user profiles, and making personalized recommendations includes: collecting users' first subjective feedback data and second subjective feedback data, wherein the first subjective feedback data is ratings and comments and the second subjective feedback data is clicks, browsing time, and search history, to form a dynamic preference vector for each user; and calculating the cosine similarity between the dynamic preference vector and the candidate recommended movie vector as the content matching score, then fusing it by weighting with the score based on neural collaborative filtering to obtain the final recommendation score and form a recommendation list.

[0020] This invention provides a video tag type prediction and recommendation method based on multimodal feature extraction, comprising: collecting basic data of video works and preprocessing the basic data; extracting and fusing multimodal features from the preprocessed data to obtain multimodal feature vectors; constructing a deep neural network classifier based on a multilayer perceptron and inputting the multimodal feature vectors into the deep neural network classifier to obtain the independent probability of each preset tag type; acquiring users' historical behavior data, constructing user profiles, and making personalized recommendations. Compared with existing technologies, the technical solution of this invention utilizes advanced deep learning models to automatically extract deep features from multimodal data, and through a cross-modal attention mechanism, achieves deep fusion of textual and visual information, resulting in more accurate tag labeling conclusions and possessing strong application value.

Attached Figure Description

[0021] To more clearly illustrate the technical solution of the present invention, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope of protection of the present invention. In the various drawings, similar components are numbered similarly.

[0022] Figure 1 is a flowchart of the video tag type prediction and recommendation method based on multimodal feature extraction provided in an embodiment of the present invention; Figure 2 is a flowchart of step S101 provided in an embodiment of the present invention; Figure 3 is a flowchart of step S102 provided in an embodiment of the present invention; Figure 4 is a flowchart of step S103 provided in an embodiment of the present invention; Figure 5 is a flowchart of step S104 provided in an embodiment of the present invention.

Detailed Implementation

[0023] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.

[0024] Current technologies include non-manual prediction and recommendation methods that automatically analyze elements such as movie synopses, trailers, and actors using models, and classify them using machine learning models such as Support Vector Machines (SVM) and decision trees. However, this approach also has certain shortcomings, such as the need for manual compilation of the tag library, insufficient accuracy of traditional model feature extraction, and a disconnect between tag prediction and recommendation. This invention utilizes advanced deep learning models to automatically extract deep features from multimodal data. Through a cross-modal attention mechanism, it achieves deep fusion of textual and visual information, resulting in more accurate tag annotation conclusions and demonstrating significant application value.

[0025] Example: Referring to Figure 1, this embodiment proposes a video tag type prediction and recommendation method based on multimodal feature extraction, including: Step S101: Collect basic data of the video works and preprocess the basic data. Here, the basic data includes first basic data, second basic data, and third basic data. The first basic data is text data collected using web crawlers, including titles, synopses, cast and crew lists, and user reviews. The second basic data is video image data, including movie posters and keyframe screenshots. The third basic data is non-plot data, including the year of release, country, and production company.

[0026] Step S102: Perform multimodal feature extraction and fusion on the preprocessed data to obtain a multimodal feature vector; Step S103: Construct a deep neural network classifier based on a multilayer perceptron, and input the multimodal feature vectors into the deep neural network classifier to obtain the independent probability of each preset label type; Step S104: Obtain the user's historical behavior data, build a user profile, and make personalized recommendations.

[0027] First, the first, second, and third basic data are obtained from the movie database in a structured manner using a distributed web crawler framework. The text data needs to be denoised and standardized, the video image data is decoded from the video stream and representative frames are extracted, and the non-plot data is filled with missing values and encoded.

[0028] Referring to Figure 2, the preprocessing of the basic data in step S101 includes: Step S201: Clean and featurize the first basic data, including using regular expressions to remove redundant characters and standardizing text case to obtain text sequence data, applying a word segmenter to the text sequence data, and removing non-semantic words. Step S202: Clean the second basic data, including converting the video file into an image frame sequence using a content-based adaptive sampling method, removing invalid and blurry frames by calculating the average pixel intensity and standard deviation of the images, using the TransNet V2 model for scene boundary detection, extracting keyframes within each scene boundary, and performing normalization and data augmentation. Step S203: Clean the third basic data, including using an embedding method to convert the director and lead actors into low-dimensional vectors, and normalizing and scaling the year information.

[0029] Specifically, this embodiment of the invention transforms the collected raw, heterogeneous basic data into a standardized format suitable for subsequent deep learning model processing. The solution designs a corresponding preprocessing procedure for each of the three types of basic data. First, for the first basic data, regular expressions are used to remove irrelevant symbols, garbled characters, and other redundant characters, and all English characters are uniformly converted to lowercase to standardize the text. Subsequently, a word segmenter divides the text sequence into independent lexical units and stop words with low semantic contribution are filtered out, yielding a lexical sequence that effectively represents the core semantics of the text.
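The text cleaning steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes English-only text, uses whitespace splitting in place of the unspecified word segmenter, and the `STOP_WORDS` set and the `preprocess_text` name are hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would load a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}

def preprocess_text(raw):
    """Clean raw text, lowercase it, tokenize it, and drop stop words."""
    # Replace anything that is not an ASCII letter, digit, or whitespace
    # with a space (assumption: English-only input).
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", raw)
    # Standardize case.
    cleaned = cleaned.lower()
    # Whitespace tokenization stands in for the word segmenter.
    tokens = cleaned.split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess_text("The Rise of the ROBOTS!!  A sci-fi epic ###")
print(tokens)
```

For non-English text such as Chinese synopses, the character filter and the tokenizer would both need to be replaced with language-appropriate ones.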

[0030] Secondly, for the second basic data, a content-based adaptive sampling method is employed to decode the video into a sequence of image frames. By calculating statistics such as the average pixel intensity and standard deviation of each image, invalid or low-quality frames such as black screens and blurry frames are automatically identified and removed. Furthermore, a scene boundary detection model (TransNet V2) is used to identify scene transition points in the video, and the most informative keyframes are selected within each continuous scene to ensure coverage of the main content of the video. Finally, these keyframes undergo size normalization and data augmentation (such as random rotation and flipping) to enhance data diversity and the model's generalization ability.
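The intensity-based frame filter described above might look like the sketch below. The thresholds are hypothetical values on a 0 to 255 grayscale range, since the source does not state any, and frames are represented as plain nested lists so the example stays self-contained. Scene boundary detection with TransNet V2 is not shown.

```python
from statistics import mean, pstdev

# Hypothetical thresholds; they would need tuning on real footage.
MIN_MEAN_INTENSITY = 10.0   # below this the frame is treated as a black screen
MIN_STD_DEV = 5.0           # below this the frame is treated as flat or blurry

def is_valid_frame(frame):
    """Return True if a grayscale frame (list of pixel rows) passes both checks."""
    pixels = [p for row in frame for p in row]
    return mean(pixels) >= MIN_MEAN_INTENSITY and pstdev(pixels) >= MIN_STD_DEV

def filter_frames(frames):
    """Keep only frames that pass the intensity and contrast checks."""
    return [f for f in frames if is_valid_frame(f)]

black = [[0, 1], [0, 2]]          # very low mean intensity
flat = [[120, 121], [120, 122]]   # almost no pixel variation
sharp = [[10, 200], [240, 30]]    # mid-range mean with a large spread
kept = filter_frames([black, flat, sharp])
```

In this toy input only `sharp` survives. A low standard deviation is only a rough proxy for blur, so a production filter would likely combine it with other measures.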

[0031] Third, for the third basic data, an embedding method is used to map categorical entities such as the director and lead actors into low-dimensional dense real-valued vectors, so that entities with similar attributes lie closer together in the vector space. Continuous values such as "release year" are normalized to a fixed interval, which is [0,1] in this embodiment of the invention, in order to eliminate the influence of differing scales and accelerate model convergence.
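Scaling the release year to [0,1] can be done with min-max normalization, sketched below. The source only states the target interval, so min-max is an assumed choice, and the `min_max_scale` helper is a hypothetical name.

```python
def min_max_scale(values):
    """Scale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

years = [1990, 2000, 2010, 2020]
scaled = min_max_scale(years)
print(scaled)
```

In practice the minimum and maximum would be fixed from the training set and reused at inference time, so that a new release year is scaled consistently.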

[0032] Referring to Figure 3, step S102 includes: Step S301: Using the RoBERTa model based on the Transformer architecture, the preprocessed first basic data is input into the RoBERTa model for feature extraction to obtain text feature vectors. Specifically, the model's built-in tokenizer transforms the text into a token sequence, adding special symbols to clearly define the sequence boundaries. Subsequently, the token sequence is converted into embedding vectors, combined with positional encoding information, and fed into a multi-layer Transformer encoder. Each encoder layer contains a multi-head self-attention mechanism, which dynamically weighs the relationships between all tokens in the sequence to capture long-range semantic dependencies. Finally, after a feedforward neural network and layer normalization, a text feature vector containing global contextual information is output.

[0033] Step S302: Using a deep convolutional neural network, the preprocessed second basic data is input into the deep convolutional neural network, and the output of the fully connected layer is used as a visual feature vector. Here, a pre-trained ResNet-50 deep convolutional neural network is used to process the pre-processed image data. The image is input into the network's convolutional layers to extract hierarchical visual features from shallow to deep. Subsequently, the spatial dimension features are aggregated using a global average pooling layer at the end of the network to form a compact feature representation. This feature is then passed to a fully connected layer, and its output activation value is used as a visual feature vector representing the image content.
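The global average pooling step can be illustrated with a few lines of plain Python. This sketch only shows the pooling arithmetic on a toy feature map with two channels. In the standard ResNet-50 with 224x224 input, the final convolutional stage produces 2048 channels of size 7x7, so pooling yields a 2048-dimensional vector. The `global_average_pool` name is hypothetical.

```python
def global_average_pool(feature_maps):
    """Average each channel's H x W map to one scalar, giving a C-dimensional vector."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two channels, each a 2 x 2 spatial map.
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
vec = global_average_pool(maps)
print(vec)  # [4.0, 1.0]
```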

[0034] Step S303: Use a cross-modal attention mechanism to fuse the text feature vector and the visual feature vector, and then reduce the dimensionality to obtain a multimodal feature vector. Specifically, to achieve deep interaction between text and visual information, the text feature vector is used as the query vector, and the visual feature vector serves as both the key and value vectors. By calculating the similarity between the query and the keys, an attention distribution is generated, which determines which parts of the visual features should be emphasized when generating the final joint representation. Based on this distribution, the value vectors (visual features) are weighted and summed to generate a visual context vector highly correlated with the text semantics. Finally, this visual context vector is concatenated with the original text feature vector, and a fully connected layer is used for non-linear transformation and dimensionality reduction, ultimately outputting a unified feature vector that integrates bimodal information, i.e., a multimodal feature vector.
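A minimal sketch of the text-to-vision attention described above follows. It makes several assumptions the source does not state: the text is summarized by a single vector, there is one visual vector per keyframe, both modalities share the same dimension, no learned query, key, or value projections are applied, and similarity is the scaled dot product. The final fully connected reduction layer is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(text_vec, visual_vecs):
    """Use the text vector as query over visual vectors acting as keys and values."""
    d = len(text_vec)
    # Scaled dot-product similarity between the query and each key.
    scores = [sum(q * k for q, k in zip(text_vec, v)) / math.sqrt(d)
              for v in visual_vecs]
    weights = softmax(scores)
    # Weighted sum of the value vectors gives the visual context vector.
    context = [sum(w * v[i] for w, v in zip(weights, visual_vecs))
               for i in range(d)]
    # Concatenate the context with the original text vector.
    return weights, context + list(text_vec)

text = [1.0, 0.0]
frames = [[4.0, 0.0], [0.0, 4.0]]   # two keyframe vectors
weights, fused = cross_modal_attention(text, frames)
```

Here the first keyframe aligns with the text vector and therefore receives most of the attention weight, and `fused` has twice the input dimension before any reduction layer is applied.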

[0035] Referring to Figure 4, step S103 includes: Step S401: Input the multimodal feature vector into the linear transformation layer and apply the GELU activation function to obtain intermediate features; Step S402: Input the intermediate features into the gated linear unit, which divides the intermediate features into two equal sub-vectors along the feature dimension, namely a first sub-vector and a second sub-vector; Step S403: Pass the first sub-vector through the Sigmoid activation function and use the output as the control gate, multiply the control gate element-wise with the second sub-vector to generate the regulated feature vector, and add the regulated feature vector to the multimodal feature vector to obtain the final output vector; Step S404: Feed the higher-order final output vector into the fully connected output layer, where each neuron performs a linear transformation, and pass the scalar value output by each neuron through the activation function to obtain the probability value.

[0036] Specifically, in the feature transformation and gating mechanism of this embodiment of the invention, the input multimodal feature vector is passed through a linear transformation layer and nonlinearly mapped using the GELU activation function to obtain intermediate features. These intermediate features are then input to a gated linear unit, which splits the feature vector into two sub-vectors of equal dimension. The Sigmoid activation function is applied to the first sub-vector, compressing its values to between 0 and 1 and forming a gating signal. This gating signal regulates the information flow: it is multiplied element-wise with the second sub-vector, thereby passing important features and suppressing secondary features, and generating a regulated feature vector.

[0037] Meanwhile, to address the vanishing gradient problem that may occur in deep networks and to promote information flow, this embodiment of the invention adds the regulated feature vector to the initial input multimodal feature vector, i.e., a residual connection. This operation ensures that the original information is preserved and allows the network to learn only the necessary incremental changes. This process can be repeated multiple times to form a deep network structure. Finally, the higher-order output vector is fed into a fully connected output layer whose number of neurons equals the number of preset label types. A sigmoid activation function is applied to the output of each neuron, mapping it to an independent probability value that represents the likelihood of the video belonging to the corresponding label type.
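The gated residual block and the multi-label output layer described in the two paragraphs above can be sketched in plain Python. One assumption is needed to make the shapes work: the linear layer maps the d-dimensional input to 2d dimensions, so that each half produced by the gated linear unit has dimension d and can be added back to the input. All weights here are toy values and all function names are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(x, weight, bias):
    """y = W x + b, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def gated_residual_block(x, weight, bias):
    """Linear (d -> 2d, assumed), GELU, GLU split, then residual add back to x."""
    h = [gelu(v) for v in linear(x, weight, bias)]
    half = len(h) // 2
    first, second = h[:half], h[half:]
    gate = [sigmoid(v) for v in first]              # control gate in (0, 1)
    gated = [g * s for g, s in zip(gate, second)]   # element-wise product
    return [xi + gi for xi, gi in zip(x, gated)]    # residual connection

def label_probabilities(x, out_weight, out_bias):
    """One independent sigmoid probability per preset label type."""
    return [sigmoid(z) for z in linear(x, out_weight, out_bias)]

x = [1.0, -1.0]                                     # toy multimodal vector, d = 2
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0, 0.0, 0.0]
hidden = gated_residual_block(x, w, b)
probs = label_probabilities(hidden,
                            [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                            [0.0, 0.0, 0.0])        # three hypothetical labels
```

Because each label gets its own sigmoid rather than sharing a softmax, the probabilities are not forced to sum to 1, which is what allows a single video to carry several tags at once.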

[0038] It should be noted that the method in this embodiment of the invention has continuous learning capabilities. When a new video style or type emerges that is not covered by the preset tag system, relevant sample data can be collected, and a deep neural network classifier can be used for incremental training. The model can learn from the new data and identify new feature patterns, thereby automatically discovering and defining new tag types, which, after verification, are dynamically incorporated into the original preset tag set.

[0039] Referring to Figure 5, step S104 includes: Step S501: Collect the user's first subjective feedback data and second subjective feedback data, wherein the first subjective feedback data is ratings and comments, and the second subjective feedback data is clicks, browsing time, and search history, to form the user's dynamic preference vector; Step S502: Calculate the cosine similarity between the dynamic preference vector and the candidate recommended movie vector as the content matching score, and fuse it by weighting with the score based on neural collaborative filtering to obtain the final recommendation score, forming a recommendation list.
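Step S502 can be sketched as a cosine similarity followed by a weighted sum. The source does not give the fusion weights, so the `alpha` parameter below is a hypothetical choice, and the neural collaborative filtering scores are assumed to be precomputed by a separate model and passed in as a dictionary.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(user_vec, candidates, ncf_scores, alpha=0.5):
    """Rank candidates by alpha * content score + (1 - alpha) * NCF score."""
    scored = []
    for title, vec in candidates.items():
        content = cosine_similarity(user_vec, vec)
        final = alpha * content + (1.0 - alpha) * ncf_scores[title]
        scored.append((title, final))
    return sorted(scored, key=lambda item: item[1], reverse=True)

user = [1.0, 0.0]                                   # toy dynamic preference vector
candidates = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
ncf = {"A": 0.4, "B": 0.9, "C": 0.5}                # assumed precomputed NCF scores
ranked = recommend(user, candidates, ncf)
```

With these toy inputs the ranking is A, C, B: candidate B has the highest collaborative filtering score but no content overlap with the user's preference vector, so the fused score places it last.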

[0040] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A video tag type prediction and recommendation method based on multimodal feature extraction, characterized in that it comprises: collecting basic data of video works and preprocessing the basic data; performing multimodal feature extraction and fusion on the preprocessed data to obtain multimodal feature vectors; constructing a deep neural network classifier based on a multilayer perceptron, and inputting the multimodal feature vectors into the deep neural network classifier to obtain the independent probability of each preset label type; and obtaining users' historical behavior data, building user profiles, and making personalized recommendations.

2. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 1, characterized in that, The basic data includes a first basic data, a second basic data, and a third basic data. The first basic data is text data collected using web crawlers, including the title, synopsis, cast and crew list, and user reviews. The second basic data is video image data, including movie posters and keyframe screenshots. The third basic data is non-plot data, including the year of release, country, and production company.

3. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 2, characterized in that the preprocessing of the basic data includes: The first basic data is cleaned and featurized: regular expressions are used to remove redundant characters and text case is standardized to obtain text sequence data, a word segmenter is then applied to the text sequence data, and non-semantic words are removed. The second basic data is cleaned: the video file is converted into an image frame sequence using a content-based adaptive sampling method, invalid and blurry frames are removed by calculating the average pixel intensity and standard deviation of each image, the TransNet V2 model is used to detect scene boundaries, key frames are extracted within each scene boundary, and normalization and data augmentation are performed. The third basic data is cleaned: an embedding method is used to convert the director and lead actor into low-dimensional vectors, and the year information is normalized and scaled.

4. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 3, characterized in that the step of performing multimodal feature extraction and fusion on the preprocessed data to obtain a multimodal feature vector includes: inputting the preprocessed first basic data into a RoBERTa model based on the Transformer architecture to extract a text feature vector; inputting the preprocessed second basic data into a deep convolutional neural network and using the output of its fully connected layer as a visual feature vector; and using a cross-modal attention mechanism to fuse the text feature vector and the visual feature vector, followed by dimensionality reduction and integration, to obtain the multimodal feature vector.

5. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 4, characterized in that the step of inputting the preprocessed first basic data into the RoBERTa model based on the Transformer architecture to extract the text feature vector includes: the tokenizer of the RoBERTa model converts the preprocessed first basic data into a token sequence and adds a sequence start symbol and a sequence end symbol to delimit the sequence; the token sequence is converted into embedding vectors, superimposed with positional encoding information, and input into a stack of cascaded Transformer encoding layers; each encoding layer models the global context through a multi-head self-attention module, so that the representation at any position in the sequence is fused with information from the other positions; and the text feature vector is obtained through a nonlinear transformation by a feedforward neural network, followed by normalization.
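For reference, one encoding layer of the kind described in claim 5 is commonly written as below. The claim itself gives no formula; this is the standard Transformer encoder formulation, and every symbol is an assumption introduced here: H is the matrix of token embeddings with positional information added, Q, K and V are linear projections of H, and d_k is the per-head key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

H' = \mathrm{LayerNorm}\bigl(H + \mathrm{MultiHead}(H)\bigr), \qquad
H_{\mathrm{out}} = \mathrm{LayerNorm}\bigl(H' + \mathrm{FFN}(H')\bigr)
```

The softmax row for a given position assigns a weight to every position in the sequence, which is how each position is fused with information from the others.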

6. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 4, characterized in that the deep convolutional neural network is a pre-trained ResNet-50 neural network, and the step of inputting the preprocessed second basic data into the deep convolutional neural network and using the output of the fully connected layer as the visual feature vector includes: inputting the preprocessed second basic data into the convolutional layers for visual feature extraction; aggregating the extracted visual features over the spatial dimensions by the global average pooling layer at the end of the network and passing the result to the fully connected layer; and outputting the activation values of the fully connected layer as the visual feature vector.
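The pooling and fully connected steps at the end of the network in claim 6 can be sketched in plain Python. This is a toy illustration, not ResNet-50 itself: the 2-channel 2x2 feature map, the weights, and the bias are hypothetical values chosen so the arithmetic is easy to follow.

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W feature map into a length-C vector by averaging each channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def fully_connected(x, weights, bias):
    """Compute y = W x + b, with `weights` given as one row per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


# Hypothetical output of the last convolutional stage: 2 channels, 2 x 2 spatial grid.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pool(feature_map)

# Hypothetical fully connected layer mapping 2 pooled features to 3 outputs.
fc_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fc_bias = [0.0, 0.0, 1.0]
visual_feature_vector = fully_connected(pooled, fc_weights, fc_bias)
```

Pooling removes the spatial dimensions, so the fully connected layer sees one number per channel regardless of the input image size.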

7. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 4, characterized in that the step of using a cross-modal attention mechanism to fuse the text feature vector and the visual feature vector, followed by dimensionality reduction, to obtain the multimodal feature vector includes: using the text feature vector as the query vector and the visual feature vector as both the key vector and the value vector, calculating the attention distribution from text to vision to generate a visual context vector associated with the text semantics; and concatenating the visual context vector with the text feature vector, then applying a nonlinear transformation and dimensionality reduction to the concatenated high-dimensional vector through a fully connected layer to obtain the multimodal feature vector.
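The text-to-vision attention and the concatenate-then-project step of claim 7 can be sketched as follows. Several things here are assumptions made for illustration: there is one visual vector per key frame (with a single visual vector the attention weights would trivially equal 1), the learned query, key, and value projections are omitted, the scores use the usual scaled dot product, and tanh stands in for the unspecified nonlinearity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_vec, visual_vecs):
    """Text vector is the query; each visual vector serves as both key and value."""
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, key) / scale for key in visual_vecs])
    dim = len(visual_vecs[0])
    context = [sum(w * v[i] for w, v in zip(weights, visual_vecs)) for i in range(dim)]
    return context, weights


def fuse(text_vec, visual_vecs, proj_weights, proj_bias):
    """Concatenate the visual context with the text vector, then project to a lower dimension."""
    context, _ = cross_modal_attention(text_vec, visual_vecs)
    concatenated = context + text_vec
    return [math.tanh(dot(row, concatenated) + b) for row, b in zip(proj_weights, proj_bias)]


# Hypothetical 2-dimensional features: one text vector, two key-frame vectors.
text_vec = [1.0, 0.0]
visual_vecs = [[1.0, 0.0], [0.0, 1.0]]
context, weights = cross_modal_attention(text_vec, visual_vecs)

# Hypothetical projection from the 4-dimensional concatenation down to 2 dimensions.
proj_weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
multimodal_vec = fuse(text_vec, visual_vecs, proj_weights, [0.0, 0.0])
```

In this toy input the first key frame is aligned with the text vector, so it receives the larger attention weight and dominates the visual context vector.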

8. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 4, characterized in that the deep neural network classifier incorporates a multilayer perceptron with gated and residual connections; the multilayer perceptron includes several hidden layers, each employing the GELU activation function, with skip connections and gated linear units introduced between the hidden layers; and the step of inputting the multimodal feature vector into the deep neural network classifier to obtain the independent probability of each preset label type includes: inputting the multimodal feature vector into a linear transformation layer and applying the GELU activation function to obtain an intermediate feature; inputting the intermediate feature into the gated linear unit, which splits the intermediate feature along the feature dimension into two equal sub-vectors, namely a first sub-vector and a second sub-vector; activating the first sub-vector with the Sigmoid activation function and using the output as a control gate; multiplying the control gate element-wise with the second sub-vector to generate a control feature vector; adding the control feature vector to the multimodal feature vector to obtain a final output vector; and feeding the final output vector, which carries higher-order features, into the fully connected output layer, where each neuron of the output layer performs a linear transformation and the scalar value output by each neuron is passed through an activation function that maps it to a probability value.
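The gated residual block and the per-label output of claim 8 can be sketched in plain Python. Two points are assumptions rather than statements from the claim: the linear layer is taken to output twice the input dimension, so that each half of the split matches the multimodal feature vector for the residual addition, and the output activation is taken to be the sigmoid, which yields one independent probability per label. All weights below are hypothetical.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(x, weights, bias):
    """Compute W x + b, with `weights` given as one row per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gated_residual_block(x, weights, bias):
    """Linear -> GELU -> split in two -> sigmoid gate on first half -> gate * second half -> add input."""
    hidden = [gelu(v) for v in linear(x, weights, bias)]
    d = len(x)
    if len(hidden) != 2 * d:
        raise ValueError("linear layer must output twice the input dimension (assumption of this sketch)")
    first, second = hidden[:d], hidden[d:]
    gate = [sigmoid(v) for v in first]                  # control gate
    controlled = [g * s for g, s in zip(gate, second)]  # control feature vector
    return [c + xi for c, xi in zip(controlled, x)]     # residual (skip) connection


def predict_label_probs(x, block_w, block_b, out_w, out_b):
    """Return one independent probability per preset label type."""
    z = gated_residual_block(x, block_w, block_b)
    return [sigmoid(v) for v in linear(z, out_w, out_b)]


# Hypothetical 2-dimensional multimodal feature vector and 3 label types.
x = [1.0, -1.0]
block_w = [[0.0, 0.0]] * 4          # zero weights make the block an identity map in this example
block_b = [0.0] * 4
out_w = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
out_b = [0.0, 0.0, 0.0]
probs = predict_label_probs(x, block_w, block_b, out_w, out_b)
```

Because each output neuron has its own sigmoid, the probabilities are not forced to sum to 1, which is what allows several label types to be predicted for the same video.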

9. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 8, characterized in that the method further includes: obtaining a new label type through learning and training of the deep neural network classifier, and incorporating the new label type into the preset label types.

10. The video tag type prediction and recommendation method based on multimodal feature extraction according to claim 1, characterized in that the step of obtaining users' historical behavior data, building user profiles, and making personalized recommendations includes: collecting first subjective feedback data and second subjective feedback data of users, wherein the first subjective feedback data is ratings and comments and the second subjective feedback data is clicks, browsing time, and search history, to form a dynamic preference vector for each user; and calculating the cosine similarity between the dynamic preference vector and each candidate recommended movie vector as a content matching score, then fusing the content matching score by weighting with a score based on neural collaborative filtering to obtain a final recommendation score and form a recommendation list.
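The scoring step of claim 10 can be sketched as follows. This is a minimal illustration under stated assumptions: the fusion weight `alpha`, the candidate vectors, and the collaborative filtering scores are hypothetical, and the neural collaborative filtering model itself is not implemented; its scores are simply passed in as numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(preference_vec, candidates, cf_scores, alpha=0.5, top_k=3):
    """Rank candidates by alpha * content score + (1 - alpha) * collaborative filtering score.

    `candidates` maps a title to its movie vector; `cf_scores` maps a title to a
    precomputed collaborative filtering score (assumed to lie in [0, 1]).
    """
    scored = []
    for title, movie_vec in candidates.items():
        content_score = cosine_similarity(preference_vec, movie_vec)
        final_score = alpha * content_score + (1.0 - alpha) * cf_scores.get(title, 0.0)
        scored.append((title, final_score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical dynamic preference vector, candidate movie vectors, and CF scores.
preference_vec = [1.0, 0.0]
candidates = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
cf_scores = {"A": 0.4, "B": 0.9, "C": 0.5}
recommendation_list = recommend(preference_vec, candidates, cf_scores)
```

With these values, movie B has the highest collaborative filtering score but no content overlap with the user's preferences, so the weighted fusion ranks it last.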