Label processing model training method, label determination method and device

By extracting and interacting features from resource content models and resource tag models, the problem of inaccurate prediction of multimedia resource tags is solved, and the accuracy of tag processing and the ability to express related information are improved.

CN117009847B · Active · Publication Date: 2026-06-26 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2022-10-19
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

The inaccurate prediction of multimedia resource tags leads to poor tag recommendation performance.

Method used

By acquiring training samples and models to be trained, feature extraction and interactive processing are performed using resource content models and resource label models. The model is then trained by combining labeled association information to generate a target label processing model.

Benefits of technology

It improves the accuracy of multimedia resource tag prediction and the ability to express related information, thereby enhancing the accuracy of tag processing.



Abstract

The application relates to the technical field of machine learning, in particular to a label processing model training method and device and a label determination method. The label processing model training method comprises the following steps: obtaining a training sample and a to-be-trained model; performing content feature extraction on the sample multimedia resource based on a resource content model to obtain resource content features; performing label feature extraction on the candidate label based on a label feature extraction layer to obtain label attribute features of the candidate label under multiple feature attributes; performing feature interaction processing on the label attribute features under the multiple feature attributes based on a label feature interaction layer to obtain label interaction features; and training the to-be-trained model based on the labeling association information and the prediction association information to obtain a target label processing model. The application can improve the accuracy of multimedia resource label prediction.

Description

Technical Field

[0001] This application relates to the field of machine learning technology, and in particular to a method for training a label processing model, a method for determining labels, and an apparatus.

Background Technology

[0002] Tags, as an effective carrier of multimedia resources, can effectively reflect the main information of multimedia resource content and play an important role in multimedia resource recommendation scenarios. However, due to the large number of tags for multimedia resources, directly using classification models to predict tags may lead to inaccurate predictions.

Summary of the Invention

[0003] The technical problem to be solved by this application is to provide a tag processing model training method, a tag determination method and apparatus, which can improve the accuracy of multimedia resource tag prediction.

[0004] To address the aforementioned technical problems, this application provides a method for training a label processing model, comprising:

[0005] Acquire training samples and a model to be trained; the training samples include sample multimedia resources, candidate tags corresponding to the sample multimedia resources, and annotation association information between the sample multimedia resources and the candidate tags; the model to be trained includes a resource content model and a resource tag model; the resource tag model includes a tag feature extraction layer and a tag feature interaction layer;

[0006] Based on the resource content model, content features are extracted from the sample multimedia resources to obtain resource content features;

[0007] Based on the label feature extraction layer, label features are extracted from the candidate labels to obtain the label attribute features of the candidate labels under multiple feature attributes;

[0008] Based on the tag feature interaction layer, feature interaction processing is performed on the tag attribute features under the multiple feature attributes to obtain tag interaction features;

[0009] Based on the resource content features and the tag interaction features, feature fusion is performed to obtain the predicted association information between the sample multimedia resources and the candidate tags;

[0010] Based on the labeled association information and the predicted association information, the model to be trained is trained to obtain the target label processing model.

[0011] On the other hand, this application provides a label determination method, including:

[0012] Obtain multiple predicted tags corresponding to the target multimedia resource;

[0013] The target multimedia resource and the multiple predicted labels are input into the target label processing model for label processing to obtain target association information corresponding to the target multimedia resource and the multiple predicted labels respectively; the target label processing model is obtained based on the above-described label processing model training method.

[0014] Based on the target association information corresponding to the target multimedia resource and the multiple predicted tags respectively, the target tag is determined from the multiple predicted tags.
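As a minimal sketch of the last step, assume the target association information is a score per predicted tag and that a tag is kept when its score reaches a threshold. The selection rule, the function name `select_target_tags`, and the `threshold` and `top_k` parameters are assumptions made for illustration; the text above does not state how the target tag is chosen from the scores.

```python
from typing import Dict, List


def select_target_tags(
    scores: Dict[str, float], threshold: float = 0.5, top_k: int = 3
) -> List[str]:
    """Return at most top_k tags whose score is >= threshold, best first.

    Both the threshold rule and the top_k cap are illustrative assumptions.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, score in ranked if score >= threshold][:top_k]


# Hypothetical association scores for four predicted tags.
scores = {"cooking": 0.92, "travel": 0.31, "recipe": 0.77, "music": 0.08}
print(select_target_tags(scores))
```

Lowering `threshold` or raising `top_k` trades precision for coverage; either value would need to be tuned on real data.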

[0015] On the other hand, this application provides a label processing model training apparatus, comprising:

[0016] The first acquisition module is used to acquire training samples and a model to be trained; the training samples include sample multimedia resources, candidate tags corresponding to the sample multimedia resources, and annotation association information between the sample multimedia resources and the candidate tags; the model to be trained includes a resource content model and a resource tag model; the resource tag model includes a tag feature extraction layer and a tag feature interaction layer.

[0017] The resource content feature extraction module is used to extract content features from the sample multimedia resources based on the resource content model to obtain resource content features.

[0018] The label feature extraction module is used to extract label features from the candidate labels based on the label feature extraction layer, and obtain the label attribute features of the candidate labels under multiple feature attributes.

[0019] The feature interaction module is used to perform feature interaction processing on the tag attribute features under the multiple feature attributes based on the tag feature interaction layer to obtain tag interaction features;

[0020] The association information prediction module is used to perform feature fusion based on the resource content features and the tag interaction features to obtain the predicted association information between the multimedia resource and the candidate tag;

[0021] The training module is used to train the model to be trained based on the labeled association information and the predicted association information to obtain the target label processing model.

[0022] On the other hand, this application provides a label determining device, comprising:

[0023] The second acquisition module is used to acquire multiple predicted tags corresponding to the target multimedia resource;

[0024] The tag processing module is used to input the target multimedia resource and the multiple predicted tags into the target tag processing model for tag processing, so as to obtain the target association information corresponding to the target multimedia resource and the multiple predicted tags respectively; the target tag processing model is obtained based on the above-mentioned tag processing model training device;

[0025] The target label determination module is used to determine the target label from the multiple predicted labels based on the target association information corresponding to the target multimedia resource and the multiple predicted labels respectively.

[0026] On the other hand, this application provides an electronic device, the device including a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement the label processing model training method or the label determination method as described above.

[0027] On the other hand, this application provides a computer storage medium storing at least one instruction or at least one program, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the label processing model training method or the label determination method as described above.

[0028] Implementing the embodiments of this application has the following beneficial effects:

[0029] This application trains the model to be trained using candidate labels, which are obtained by predicting labels for the sample multimedia resources with multiple preset label prediction models, together with the annotation association information between the candidate labels and the sample multimedia resources. Specifically, the label feature interaction layer of the model to be trained performs feature interaction processing on the label attribute features of the candidate labels under multiple feature attributes, so that the label attribute features under the different feature attributes influence one another, which improves the feature representation ability of the candidate label features. Furthermore, feature fusion is performed on the label interaction features obtained from feature interaction and the resource content features of the sample multimedia resources to obtain the predicted association information, which can improve the accuracy of the predicted association information. The model to be trained is then trained based on the predicted association information and the annotation association information, which can improve its ability to express the association information between sample multimedia resources and candidate labels, thereby improving the accuracy of the target label processing model in label processing.

Attached Figure Description

[0030] To more clearly illustrate the technical solutions and advantages in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0031] Figure 1 is a schematic diagram of the implementation environment provided in the embodiments of this application;

[0032] Figure 2 is a flowchart of a label processing model training method provided in an embodiment of this application;

[0033] Figure 3 is a flowchart of a method for generating tag attribute features based on tag source information provided in an embodiment of this application;

[0034] Figure 4 is a flowchart of a method for generating tag attribute features based on tag confidence information provided in an embodiment of this application;

[0035] Figure 5 is a flowchart of a method for generating tag attribute features based on tag content statistical features provided in an embodiment of this application;

[0036] Figure 6 is a flowchart of a resource content feature extraction method provided in an embodiment of this application;

[0037] Figure 7 is a flowchart of a feature fusion method provided in an embodiment of this application;

[0038] Figure 8 is a flowchart of a tag attribute feature transformation method provided in an embodiment of this application;

[0039] Figure 9 is a flowchart of a label determination method provided in an embodiment of this application;

[0040] Figure 10 is a schematic diagram of the target label processing model provided in the embodiments of this application;

[0041] Figure 11 is a schematic diagram of a label processing model training device provided in an embodiment of this application;

[0042] Figure 12 is a schematic diagram of a label determining device provided in an embodiment of this application;

[0043] Figure 13 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0044] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0045] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or server that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.

[0046] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and instructional learning.

[0047] Referring to Figure 1, which shows a schematic diagram of an implementation environment provided in the embodiments of this application, the implementation environment may include a user terminal 110 and a tag processing server 120; the user terminal 110 and the tag processing server 120 can exchange data over a network.

[0048] Specifically, user terminal 110 can upload multimedia resources, and tag processing server 120 can use a tag processing model to determine the tags of the multimedia resources uploaded by user terminal 110, thereby obtaining tags corresponding to the multimedia resources.

[0049] Furthermore, the tag processing server 120 can also receive multimedia resource requests sent by the user terminal 110, and the tag processing server 120 can recommend and distribute multimedia resources based on the tags of various multimedia resources.

[0050] User terminal 110 can communicate with tag processing server 120 based on browser / server (B / S) or client / server (C / S) mode. User terminal 110 may include physical devices such as smartphones, tablets, laptops, digital assistants, smart wearable devices, in-vehicle terminals, and servers, and may also include software running on the physical device, such as applications. The operating system running on user terminal 110 in this embodiment may include, but is not limited to, Android, iOS, Linux, and Windows.

[0051] The tag processing server 120 and the user terminal 110 can establish a communication connection via wired or wireless means. The tag processing server 120 may include a stand-alone server, a distributed server, or a server cluster consisting of multiple servers, wherein the server may be a cloud server.

[0052] To address the problem of inaccurate multimedia resource tag prediction in existing technologies, this application provides a tag processing model training method, the execution entity of which can be the aforementioned tag processing server. Specifically, referring to Figure 2, which illustrates a method for training a label processing model, the method may include:

[0053] S210. Obtain training samples and the model to be trained.

[0054] The training samples include sample multimedia resources, candidate labels corresponding to the sample multimedia resources, and annotation association information between the sample multimedia resources and the candidate labels. The annotation association information characterizes the degree of association between the sample multimedia resources and the candidate labels; the higher the degree of association between a candidate label and the sample multimedia resources, the greater the probability that it will be selected as the actual label for the sample multimedia resources. The annotation association information can be determined by calculating the degree of association between the sample multimedia resources and the candidate labels. In this embodiment, the multimedia resources may include resources in the form of video, audio, text, images, etc.

[0055] The candidate labels are obtained by predicting labels for the sample multimedia resources based on multiple preset label prediction models; these multiple preset label prediction models are trained using different label prediction algorithms. Furthermore, for the same sample multimedia resource, different preset label prediction models can be used to predict its labels, resulting in multimedia resource labels corresponding to each preset label prediction model. The multimedia resource labels corresponding to each preset label prediction model may differ. In this embodiment, the multimedia resource labels corresponding to each preset label prediction model can be deduplicated to obtain candidate labels. In this embodiment, the number of candidate labels can be one or more. When training the preset label prediction models, algorithms such as decision trees, random forests, gradient boosting decision trees (GBDT), and extreme gradient boosting (XGB) can be used.

[0056] The model to be trained includes a resource content model and a resource tag model, which can be two parallel models. The resource content model can be used to process information from the multimedia resource side, and the resource tag model can be used to process information from the resource tag side. The resource tag model includes a tag feature extraction layer and a tag feature interaction layer. The tag feature extraction layer can be used to extract features from the candidate tags input to the model to obtain tag attribute features of the candidate tags under multiple feature attributes; the tag feature interaction layer can be used to perform feature interaction processing on the tag attribute features output by the tag feature extraction layer to obtain tag interaction features.

[0057] S220. Based on the resource content model, extract the content features of the sample multimedia resources to obtain resource content features.

[0058] In this embodiment, content feature extraction is mainly based on the text content information corresponding to the multimedia resources to obtain the resource content features of the multimedia resources. Specifically, when the multimedia resource is a video, text recognition can be performed on the video title and the text appearing in the video, and speech recognition can be performed on the audio information contained in the video to obtain the text content information corresponding to the video. When the multimedia resource is an audio, speech recognition can be performed on the audio to obtain the text content information corresponding to the audio. When the multimedia resource is an image and the image contains text, text recognition can be performed on the text in the image to obtain the text content information corresponding to the image.

[0059] S230. Based on the label feature extraction layer, perform label feature extraction on the candidate label to obtain the label attribute features of the candidate label under multiple feature attributes.

[0060] Candidate labels have different characteristics under different feature attributes. Therefore, the features of candidate labels under multiple feature attributes can be extracted through the label feature extraction layer. Then, by representing the candidate labels with features under multiple feature attributes, the feature representation ability of candidate labels can be improved.

[0061] S240. Based on the tag feature interaction layer, perform feature interaction processing on the tag attribute features under the multiple feature attributes to obtain tag interaction features.

[0062] The label attribute feature under each feature attribute represents only one aspect of the candidate label, and the label attribute features under the multiple feature attributes are independent of one another, so it may not be possible to judge directly from any single label attribute feature whether the candidate label is trustworthy. Feature interaction is therefore performed on the label attribute features under the multiple feature attributes so that they are connected with one another. The resulting label interaction features can comprehensively reflect both the label features of the candidate label and its trustworthiness.

[0063] When there is only one candidate label, the label attribute features of the candidate label under multiple feature attributes can be directly processed to obtain the label interaction features.

[0064] When there are multiple candidate labels, feature interaction processing can be performed on the label attribute features of each candidate label under multiple feature attributes to obtain the interaction features corresponding to each candidate label; then, based on the interaction features corresponding to each of the multiple candidate labels, the label interaction features are obtained.
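The text does not say which operation the tag feature interaction layer uses. One possible reading, shown here purely as an assumption, is a pairwise interaction in which the attribute features of one candidate label are multiplied element by element for every pair of feature attributes, and the products are concatenated. The function name `interact` and the attribute names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def interact(attribute_features: Dict[str, List[float]]) -> List[float]:
    """Concatenate element-wise products over every pair of attribute features.

    This pairwise product is an assumed stand-in for the interaction layer,
    not the operation specified by the source text.
    """
    names = sorted(attribute_features)  # fixed order keeps the output stable
    interaction: List[float] = []
    for left, right in combinations(names, 2):
        a, b = attribute_features[left], attribute_features[right]
        if len(a) != len(b):
            raise ValueError(f"{left} and {right} have different lengths")
        interaction.extend(x * y for x, y in zip(a, b))
    return interaction


# Two hypothetical attribute features of equal length for one candidate label.
features = {"attr_a": [1.0, 0.0, 1.0], "attr_b": [0.5, 0.25, 2.0]}
print(interact(features))
```

With more than one candidate label, the per-label results could be stacked or pooled; that choice is likewise left open by the text.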

[0065] S250. Based on the resource content features and the tag interaction features, feature fusion is performed to obtain the predicted association information between the multimedia resource and the candidate tag.

[0066] By obtaining the corresponding resource content features and tag interaction features through parallel resource content models and resource tag models, feature fusion can be performed on the resource content features and tag interaction features to obtain the predicted association information between multimedia resources and candidate tags.

[0067] When performing feature fusion, the similarity between the resource content features and the tag interaction features can be calculated, and the predicted association information can then be obtained from that similarity.
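The similarity measure is not named in the text. As an illustrative sketch, assuming both features are vectors of the same length and that cosine similarity is an acceptable measure, the computation could look like this (the function name and the sample vectors are hypothetical):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical resource content feature and tag interaction feature.
content_feature = [0.2, 0.7, 0.1]
tag_feature = [0.1, 0.8, 0.0]
print(cosine_similarity(content_feature, tag_feature))
```

Cosine similarity lies in [-1, 1], so it would need rescaling (for example through a sigmoid) before being read as a probability-like degree of association.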

[0068] The model to be trained may also include a feature fusion layer, so that the feature fusion of resource content features and tag interaction features can be achieved through the feature fusion layer.

[0069] If the model to be trained does not include a feature fusion layer, an output layer can be connected after the model to be trained, and feature fusion of resource content features and tag interaction features can be achieved through the output layer.

[0070] S260. The model to be trained is trained based on the labeled association information and the predicted association information to obtain the target label processing model.

[0071] Based on the labeled association information and the predicted association information, the training loss information can be determined; based on the training loss information, the model to be trained can be trained to obtain the target label processing model.
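The form of the training loss is not given. A common choice for comparing a labeled degree of association in [0, 1] with a predicted one is binary cross-entropy; the sketch below uses it as an assumption, with a hypothetical function name and a small `eps` to keep the logarithm finite.

```python
import math
from typing import Sequence


def bce_loss(
    labeled: Sequence[float], predicted: Sequence[float], eps: float = 1e-7
) -> float:
    """Mean binary cross-entropy between labeled and predicted association values.

    Assumes both sequences hold values in [0, 1]; this loss is an illustrative
    choice, not the one specified by the source text.
    """
    if len(labeled) != len(predicted) or not labeled:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labeled, predicted):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(labeled)


# A prediction close to the labels gives a smaller loss than one far from them.
print(bce_loss([1.0, 0.0], [0.9, 0.1]))
print(bce_loss([1.0, 0.0], [0.1, 0.9]))
```

In a real training loop this value would be minimized by gradient descent over the parameters of the resource content model and the resource tag model.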

[0072] This application trains the model to be trained using candidate labels, which are obtained by predicting labels for the sample multimedia resources with multiple preset label prediction models, together with the annotation association information between the candidate labels and the sample multimedia resources. Specifically, the label feature interaction layer of the model to be trained performs feature interaction processing on the label attribute features of the candidate labels under multiple feature attributes, so that the label attribute features under the different feature attributes influence one another, which improves the feature representation ability of the candidate label features. Furthermore, feature fusion is performed on the label interaction features obtained from feature interaction and the resource content features of the sample multimedia resources to obtain the predicted association information, which can improve the accuracy of the predicted association information. The model to be trained is then trained based on the predicted association information and the annotation association information, which can improve its ability to express the association information between sample multimedia resources and candidate labels, thereby improving the accuracy of the target label processing model in label processing.

[0073] In this embodiment, candidate labels may also carry related additional information; the additional information may be information identifying the source of the candidate label, information characterizing the confidence level of the candidate label, etc. Specifically, candidate labels may carry model identifiers, which are used to characterize the source information of the candidate label; for example, candidate label 1 carries the model identifier of preset label prediction model 1 and the model identifier of preset label prediction model 3, indicating that the predicted labels of the sample multimedia resources by preset label prediction model 1 and preset label prediction model 3 both include candidate label 1; candidate label 2 carries the model identifier of preset label prediction model 1 and preset label prediction model 2, indicating that the predicted labels of the sample multimedia resources by preset label prediction model 1 and preset label prediction model 2 both include candidate label 2; and so on. It can be seen that the larger the number of model identifiers carried by the candidate label, the more sources the candidate label has, and thus the higher the confidence level of the candidate label.
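The deduplication of per-model predictions and the model identifiers carried by each candidate label can be sketched together. The data layout (a dictionary from model identifier to predicted labels) and the function name `merge_candidates` are assumptions made for illustration.

```python
from typing import Dict, List


def merge_candidates(predictions: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Deduplicate labels across models, keeping the identifiers of their sources.

    Returns a mapping from each distinct candidate label to the list of
    model identifiers whose predictions included that label.
    """
    merged: Dict[str, List[str]] = {}
    for model_id, labels in predictions.items():
        for label in labels:
            sources = merged.setdefault(label, [])
            if model_id not in sources:
                sources.append(model_id)
    return merged


# Mirrors the example above: label 1 comes from models 1 and 3,
# label 2 comes from models 1 and 2.
predictions = {
    "model_1": ["label_1", "label_2"],
    "model_2": ["label_2"],
    "model_3": ["label_1"],
}
candidates = merge_candidates(predictions)
print(candidates)
```

The length of each source list is the number of models that agree on the label, which is the quantity the paragraph above relates to confidence.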

[0074] Accordingly, the feature attribute corresponding to the source information of the candidate labels can be the label source attribute. Referring to Figure 3, which illustrates a method for generating tag attribute features based on tag source information, the method may include:

[0075] S310. Based on the label feature extraction layer, perform feature extraction on the model identifier carried by the candidate label to obtain the label source feature corresponding to the label source attribute.

[0076] S320. Based on the tag source features, obtain the tag attribute features of the candidate tag under the multiple feature attributes.

[0077] The label feature extraction layer can extract the label source features of candidate labels from the model identifiers carried by the candidate labels. The label source features of candidate labels can be a label source feature sequence or a label source feature vector. The number of elements contained in the label source feature sequence or label source feature vector is consistent with the number of preset label prediction models. That is, the label source features include the feature representation corresponding to each preset label prediction model.

[0078] Taking the label source feature vector as an example, the vector dimension of the label source feature vector is consistent with the number of preset label prediction models. Each dimension corresponds to one preset label prediction model, and the number of preset label prediction models is N. Then, the feature element of the first dimension of the label source feature vector corresponds to preset label prediction model 1, the feature element of the second dimension corresponds to preset label prediction model 2, and so on, until the feature element of the Nth dimension corresponds to preset label prediction model N. Specifically, if the number of preset label prediction models is 5, candidate label 1 carries the model identifier of preset label prediction model 1, the model identifier of preset label prediction model 3, and the model identifier of preset label prediction model 5, thus indicating that candidate label 1 originates from preset label prediction model 1, preset label prediction model 3, and preset label prediction model 5. The corresponding label source feature can be represented as a label source feature vector (1,0,1,0,1), where 1 indicates that the candidate label originates from the preset label prediction model corresponding to this vector dimension, and 0 indicates that the candidate label does not originate from the preset label prediction model corresponding to this vector dimension.
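The label source feature vector in this example is a multi-hot encoding over the preset label prediction models. A minimal sketch, assuming models are numbered from 1 to N (the function name is hypothetical):

```python
from typing import Iterable, List


def source_feature_vector(model_numbers: Iterable[int], num_models: int) -> List[int]:
    """Build a multi-hot vector with a 1 at each model that produced the label."""
    vector = [0] * num_models
    for number in model_numbers:
        if not 1 <= number <= num_models:
            raise ValueError(f"model number {number} is outside 1..{num_models}")
        vector[number - 1] = 1  # dimension k corresponds to preset model k
    return vector


# Candidate label 1 originates from preset models 1, 3 and 5 out of 5.
print(source_feature_vector([1, 3, 5], 5))  # [1, 0, 1, 0, 1]
```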

[0079] Therefore, once the tag source features corresponding to the tag source attribute are available, the tag attribute features under the multiple feature attributes can be generated from the tag source features together with the features under the other feature attributes.

[0080] The label feature extraction layer extracts features from candidate labels to obtain label source features under the label source attribute. The label source features can characterize the source of candidate labels, specifically the number of sources and the model category of the source. Based on the number of sources and the model category of the source, the credibility of the candidate label can be determined. The more sources a candidate label has, the higher its credibility. This expands the feature expression types of candidate labels and improves the feature expression ability of candidate labels.

[0081] Furthermore, candidate labels may also carry confidence information, and the corresponding feature attributes may include a label confidence attribute corresponding to that confidence information. A predicted candidate label may come with confidence information about its prediction, which characterizes the credibility of the predicted candidate label.

[0082] Accordingly, referring to Figure 4, which illustrates a method for generating label attribute features based on label confidence information, the method may include:

[0083] S410. Based on the label feature extraction layer, feature extraction is performed on the confidence information carried by the candidate label to obtain the label confidence feature corresponding to the label confidence attribute.

[0084] S420. Based on the label confidence feature, obtain the label attribute features of the candidate label under the multiple feature attributes.

[0085] The confidence information carried by the candidate label can build on the model identifiers carried by the candidate label. That is, in addition to each model identifier, the candidate label can also carry the confidence with which the preset label prediction model corresponding to that identifier predicted the candidate label.

[0086] The label feature extraction layer can extract the label confidence features of candidate labels from the confidence information carried by the candidate labels. The label confidence features of candidate labels can be a label confidence feature sequence or a label confidence feature vector. The number of elements contained in the label confidence feature sequence or label confidence feature vector is consistent with the number of preset label prediction models. That is, the label confidence features include the feature representation corresponding to each preset label prediction model.

[0087] Taking the label confidence feature vector as an example, the vector dimension of the label confidence feature vector is the same as the number of preset label prediction models. Each dimension corresponds to one preset label prediction model, and the number of preset label prediction models is N. Therefore, the feature element of the first dimension in the label confidence feature vector corresponds to preset label prediction model 1, the feature element of the second dimension corresponds to preset label prediction model 2, and so on, up to the feature element of the Nth dimension, which corresponds to preset label prediction model N. Specifically, if the number of preset label prediction models is 5 and the label source feature is represented as the label source feature vector (1,0,1,0,1), the corresponding label confidence feature vector may be (0.85,0,0.6,0,0.9), indicating that the confidence of preset label prediction model 1 in predicting the candidate label is 0.85, the confidence of preset label prediction model 3 is 0.6, and the confidence of preset label prediction model 5 is 0.9.
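The correspondence between the label source feature vector and the label confidence feature vector can be sketched as follows. This is only an illustration: the function name `source_and_confidence_vectors` and the dictionary input format are assumptions made for this example and do not appear in the embodiment.

```python
from typing import Dict, List, Tuple


def source_and_confidence_vectors(
    predictions: Dict[int, float], num_models: int
) -> Tuple[List[int], List[float]]:
    """Build both vectors for one candidate label.

    `predictions` maps a 1-based model index to that model's confidence
    for the label (hypothetical input format); models that did not
    predict the label are simply absent from the mapping.
    """
    source = [0] * num_models
    confidence = [0.0] * num_models
    for model_idx, conf in predictions.items():
        source[model_idx - 1] = 1          # dimension i corresponds to model i
        confidence[model_idx - 1] = conf   # same position, confidence value
    return source, confidence


# Models 1, 3 and 5 out of N = 5 predicted the candidate label.
src_fea, conf_fea = source_and_confidence_vectors(
    {1: 0.85, 3: 0.6, 5: 0.9}, num_models=5
)
```

With this input, the two vectors reproduce the example values given in paragraph [0087].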

[0088] Therefore, given the label confidence feature corresponding to the label confidence attribute, label attribute features under various feature attributes can be generated based on the label confidence feature corresponding to the label confidence attribute, the label source feature corresponding to the label source attribute, and so on.

[0089] The label feature extraction layer extracts features from candidate labels to obtain label confidence features under the label confidence attribute. The label confidence features can represent the predicted confidence information of candidate labels. Based on the predicted confidence information, the credibility of candidate labels can be determined, thus expanding the feature expression types of candidate labels and improving the feature expression ability of candidate labels.

[0090] Feature attributes may also include tag content statistical attributes; please refer to Figure 5, which illustrates a method for generating tag attribute features based on tag content statistical features, which may include:

[0091] S510. Based on the tag feature extraction layer, perform content statistical feature extraction on the candidate tags to obtain the tag content statistical features corresponding to the tag content statistical attributes.

[0092] S520. Based on the statistical features of the tag content, obtain the tag attribute features of the candidate tag under the multiple feature attributes.

[0093] Tag content statistical features can characterize the features formed by statistically analyzing content information based on candidate tags across multiple multimedia information sources; specifically, tag content statistical features may include frequency features, entity features, and classification features, etc.

[0094] For frequency features, the frequency of candidate tags appearing in the title of the text content information corresponding to the sample multimedia resources can be statistically analyzed, as can the frequency of their appearance in the body text, the frequency of their appearance in paragraphs within the body text, and the ratio of the total frequency of each candidate tag to the sum of the total frequencies of all candidate tags. Frequency features can be specifically represented by frequency feature sequences or frequency feature vectors. Taking frequency feature vectors as an example, the dimension of the frequency feature vector is the same as the number of items in the statistical features. For example, the frequency feature vector of candidate tag 1 is (1, 10, 4, 0.5), which means that candidate tag 1 appears 1 time in the title, 10 times in the body text, and 4 times in paragraphs within the body text. The ratio of the total frequency of candidate tag 1 to the sum of the total frequencies of all candidate tags is 0.5.
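A frequency feature vector of this four-item form can be sketched as below. The sketch rests on assumptions that the embodiment does not fix: plain substring counting stands in for tag matching, the "frequency in paragraphs" is read as the number of paragraphs containing the tag, and the total frequency of a tag is taken as its title count plus its body count. The function name `frequency_features` is hypothetical.

```python
from typing import Dict, List


def frequency_features(
    title: str, paragraphs: List[str], tags: List[str]
) -> Dict[str, List[float]]:
    """Return (title count, body count, paragraph count, share) per tag."""
    partial = {}
    totals = {}
    for tag in tags:
        title_count = title.count(tag)
        body_count = sum(p.count(tag) for p in paragraphs)
        # Assumption: number of paragraphs in which the tag occurs.
        para_count = sum(1 for p in paragraphs if tag in p)
        partial[tag] = (title_count, body_count, para_count)
        # Assumption: total frequency = title count + body count.
        totals[tag] = title_count + body_count
    grand_total = sum(totals.values())
    return {
        tag: [*partial[tag], totals[tag] / grand_total if grand_total else 0.0]
        for tag in tags
    }


feats = frequency_features(
    title="NBA finals tonight",
    paragraphs=["The NBA season ends.", "Fans of the NBA and of swimming gather."],
    tags=["NBA", "swimming"],
)
```

Here "NBA" occurs once in the title and twice in the body, so it accounts for three of the four total occurrences across both tags.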

[0095] For entity features, it can be determined whether the candidate label is a person's name, a place name, an organization name, or a sports event. Entity features can be represented by entity feature sequences or entity feature vectors. Taking entity feature vectors as an example, the dimension of the entity feature vector is consistent with the preset number of entity items. For example, the entity feature vector of candidate label 1 is (0,0,0,1), which indicates that candidate label 1 is a sports event, thus having a high degree of credibility.

[0096] For classification features, a preset number of multimedia resources can be randomly selected. Each of these resources corresponds to a primary category, which may include: news, entertainment, health, sports, nature, finance, film and television, etc. Based on the text content information of each multimedia resource under each primary category, the number of multimedia resources containing candidate tags is determined. The distribution of candidate tags is determined based on the ratio of the number of multimedia resources containing candidate tags to the preset number under each primary category. The classification features of the candidate tags can then be determined based on this distribution. Classification features can be represented by classification feature sequences and classification feature vectors. Taking a classification feature vector as an example, the dimension of the classification feature vector is consistent with the number of primary categories. For instance, the classification feature vector for candidate tag 1 is (0.4,0,0.1,0.5,0,0,0), indicating that candidate tag 1 has a probability of 0.4 under the news category, 0 under the entertainment category, 0.1 under the health category, 0.5 under the sports category, and 0 under categories such as nature, finance, and film and television.
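The classification feature can be sketched as follows. The per-category ratio follows the description above; the final normalization so that the vector sums to 1 is an assumption made here to match the probability-style example (0.4,0,0.1,0.5,0,0,0). The names `CATEGORIES` and `classification_features` and the substring matching are illustrative only.

```python
from typing import Dict, List

CATEGORIES = [
    "news", "entertainment", "health", "sports",
    "nature", "finance", "film and television",
]


def classification_features(tag: str, samples: Dict[str, List[str]]) -> List[float]:
    """Distribution of a tag over primary categories.

    `samples` maps a primary category to the text content of the
    resources sampled under it (hypothetical input format).
    """
    ratios = []
    for category in CATEGORIES:
        texts = samples.get(category, [])
        hits = sum(1 for text in texts if tag in text)
        ratios.append(hits / len(texts) if texts else 0.0)
    total = sum(ratios)
    # Assumption: normalize so the entries can be read as probabilities.
    return [r / total if total else 0.0 for r in ratios]


samples = {
    "news": ["marathon results", "election coverage"],
    "sports": ["marathon day", "marathon training tips"],
}
vec = classification_features("marathon", samples)
```

The resulting vector has one entry per primary category, in the order listed in `CATEGORIES`.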

[0097] The label feature extraction layer extracts features from candidate labels to obtain label content statistical features under the label content statistical attributes. The label content statistical features can characterize the features formed by statistically analyzing the content information of candidate labels in multiple multimedia information, and have good generalization ability. This expands the feature expression types of candidate labels and improves the feature expression ability of candidate labels.

[0098] Furthermore, the feature attributes can also include label text semantic attributes, thus enabling text semantic extraction of candidate labels based on the label feature extraction layer, obtaining the label text semantic features of the candidate labels. Label text semantic features can represent candidate labels semantically, achieving a feature expression of the original semantics of the labels.

[0099] When extracting content features from sample multimedia resources based on the resource content model, feature extraction can be performed based on the text content information corresponding to the sample multimedia resources; please refer to Figure 6, which illustrates a method for extracting resource content features, which may include:

[0100] S610. Based on the text content layout information corresponding to the sample multimedia resources, the sample multimedia resources are divided into text regions to obtain multiple text regions.

[0101] S620. Based on the resource content model, extract content features from the text content corresponding to the multiple text regions respectively to obtain resource content features corresponding to the multiple text regions respectively.

[0102] Text content layout information can include the layout between the text title and the text body, as well as the layout between paragraphs within the text body. Based on this layout information, text regions can be divided, resulting in multiple text regions such as a title text region, paragraph 1 text region, paragraph 2 text region, etc. During content feature extraction, content features can be extracted separately for each text region, yielding resource content features corresponding to each region.
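A minimal sketch of the region division in S610 is given below. It assumes that the layout information is already reflected in the input, with the title supplied separately and body paragraphs separated by blank lines; real layout information may take other forms. The function name `split_text_regions` and the region labels are hypothetical.

```python
from typing import List, Tuple


def split_text_regions(title: str, body: str) -> List[Tuple[str, str]]:
    """Divide a resource's text into a title region and paragraph regions."""
    regions = [("title", title.strip())]
    # Assumption: paragraphs in the body are separated by blank lines.
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs, start=1):
        regions.append((f"paragraph_{i}", paragraph))
    return regions


regions = split_text_regions(
    "City hosts sports weekend",
    "The basketball final drew a full house.\n\nSwimming heats begin on Sunday.\n\n",
)
```

Each region can then be passed to the resource content model separately, as described in S620.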

[0103] Taking sports news as an example of multimedia resources, the text content of such sports news may involve various sports events such as basketball, swimming, and badminton. If the title and body text are not divided into regions, and the paragraphs in the body text are not divided into regions, and feature extraction is performed directly based on all the text content information, it will lead to semantic confusion and semantic deviation. Therefore, by dividing the title and body text into regions, and dividing the paragraphs in the body text into regions, and then extracting features from each text region, it is possible to realize the resource content feature expression of the text region of multimedia information, avoid semantic confusion and semantic deviation, and improve the accuracy of the resource content feature expression of multimedia resource content.

[0104] Given resource content features obtained from a resource content model and tag interaction features obtained from a resource tag model, feature fusion processing can be performed on the resource content features and tag interaction features. Based on the feature fusion result, the predicted association information between the sample multimedia resources and candidate tags can be determined. For details, please refer to Figure 7, which illustrates a feature fusion method, which may include:

[0105] S710. Calculate the similarity between the tag interaction features and the resource content features corresponding to the multiple text regions to obtain multiple similarity information.

[0106] S720. Perform information fusion processing on the multiple similarity information to obtain the predicted association information.

[0107] By calculating the similarity between the tag interaction features and the resource content features corresponding to multiple text regions, we obtain multiple similarity information. These multiple similarity information represent the similarity relationship between the tag interaction features and different text regions of the sample multimedia resources. In order to comprehensively represent the global similarity between the tag interaction features and the sample multimedia resources, we can perform information fusion processing on the multiple similarity information to obtain the predicted association information between the sample multimedia resources and the tag interaction features.

[0108] Similarity information can be specifically presented as a similarity score or a similarity probability. When fusing multiple similarity metrics, the average of the scores can be taken; alternatively, a weighted average can be applied based on the weights corresponding to the different text regions within the main body. For instance, since the title effectively summarizes the semantics of the main text, the weight assigned to the title can be greater than the weights assigned to each text region within the main body. This allows the retained information in the fused similarity data to be biased towards the title.
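Steps S710 and S720 can be sketched as follows. The embodiment does not name a similarity measure, so cosine similarity is an assumption here, as are the function names `cosine` and `fuse_similarities` and the toy two-dimensional vectors.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse_similarities(
    tag_vec: Sequence[float],
    region_vecs: List[Sequence[float]],
    weights: Optional[List[float]] = None,
) -> float:
    """Score the tag against each text region, then fuse the scores."""
    if not region_vecs:
        raise ValueError("at least one text region is required")
    scores = [cosine(tag_vec, region) for region in region_vecs]
    if weights is None:
        weights = [1.0] * len(scores)  # plain average
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


tag_vec = [1.0, 0.0]
region_vecs = [[1.0, 0.0], [0.0, 1.0]]  # first region plays the role of the title
mean_score = fuse_similarities(tag_vec, region_vecs)
title_weighted = fuse_similarities(tag_vec, region_vecs, weights=[3.0, 1.0])
```

Giving the title region a larger weight pulls the fused score toward the title's similarity, which is the bias described in paragraph [0108].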

[0109] After obtaining the resource content features corresponding to multiple text regions of the sample multimedia resources, similarity calculations are performed between these features and the tag interaction features. This makes the similarity calculation finer-grained and more comprehensive, and ultimately enhances the accuracy of the predicted association information.

[0110] In this embodiment, the tag attribute features under various feature attributes may include tag source features, tag confidence features, tag content statistical features, and tag text semantic features. The specific numerical values of these various features may differ in type; some features have integer values, while others have floating-point values. Therefore, the numerical values of various tag attribute features can be uniformly processed. Please refer to Figure 8, which illustrates a label attribute feature transformation method, including:

[0111] S810. Based on the label feature extraction layer, perform label feature extraction on the candidate label to obtain the label extraction features of the candidate label under multiple feature attributes.

[0112] S820. Perform integer transformation processing on the extracted label features to obtain the label transformation features.

[0113] S830. Map the label transformation feature to a label feature of a target dimension; the target dimension is greater than the dimension of the label transformation feature.

[0114] S840. The label features of the target dimension are determined as the label attribute features.

[0115] By extracting label features from candidate labels through a label feature extraction layer, label extraction features of candidate labels under various feature attributes can be obtained. The numerical types of the label extraction features under different feature attributes may be different. In this embodiment, the label extraction features can be processed by integer transformation to obtain label transformation features. For example, for float type feature values (range 0-1), the feature value can be multiplied by a hyperparameter (e.g., the hyperparameter is 100) and then converted into an int type feature value, thereby converting label extraction features of different numerical types into a common integer representation, which is convenient for feature measurement under different feature attributes.

[0116] After obtaining the label transformation features, the label transformation features can be dimensionally mapped. Specifically, the label transformation features are mapped to label features of the target dimension. Since the target dimension is larger than the dimension of the label transformation features, the label transformation features are mapped to a high-dimensional dense vector. For example, the mapping size can be set to s, so that one int-type feature is converted into a dense feature vector containing s float-type values.

[0117] By performing integer transformation on the label extraction features under multiple feature attributes, label transformation features are obtained, which allows the label extraction features under different feature attributes to be converted into the same dimension, facilitating the comparison of label transformation features under different feature attributes. In addition, by mapping the label transformation features to dense vectors and using more data to represent the label features, the feature representation ability of the label features can be improved.
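Steps S820 and S830 can be sketched as an integer bucketing followed by a lookup in a table of dense vectors. This is an illustration under assumptions: the table is randomly initialized here, whereas in a trained model its values would be learned; rounding is used instead of truncation to avoid floating-point edge cases; and the names `to_int_feature` and `EmbeddingTable` are hypothetical.

```python
import random
from typing import List


def to_int_feature(value: float, scale: int = 100) -> int:
    """Integer transformation: multiply by the hyperparameter, convert to int."""
    return int(round(value * scale))


class EmbeddingTable:
    """Maps each integer bucket to a dense vector of `dim` floats."""

    def __init__(self, num_buckets: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Assumption: random initialization stands in for learned parameters.
        self.rows = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_buckets)
        ]

    def lookup(self, index: int) -> List[float]:
        return self.rows[index]


s = 8                                   # mapping size (target dimension)
table = EmbeddingTable(num_buckets=101, dim=s)  # buckets 0..100 for values in [0, 1]
bucket = to_int_feature(0.85)           # float confidence -> int feature
dense = table.lookup(bucket)            # int feature -> s float values
```

Because every feature attribute ends up as a vector of the same length s, the vectors can be fed together into a feature interaction layer.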

[0118] The target label processing model trained in this embodiment can be specifically applied to label filtering scenarios; please refer to Figure 9, which illustrates a label determination method, which may include:

[0119] S910. Obtain multiple predicted tags corresponding to the target multimedia resource.

[0120] S920. Input the target multimedia resource and the multiple predicted labels into the target label processing model for label processing to obtain the target association information corresponding to the target multimedia resource and the multiple predicted labels respectively.

[0121] S930. Based on the target association information corresponding to the target multimedia resource and the plurality of predicted tags respectively, determine the target tag from the plurality of predicted tags.

[0122] Multiple predicted labels can be obtained by predicting labels for the target multimedia resource based on multiple preset label prediction models; these multiple preset label prediction models are trained based on different label prediction algorithms. The multiple preset label prediction models used when determining labels can be the same as or different from the multiple preset label prediction models used when training the models.

[0123] Please see Figure 10, which illustrates the structure of the target tag processing model. The model includes a resource content model on the multimedia resource side and a resource tag model on the tag side. The resource content model can specifically include a resource feature extraction layer, and the resource tag model can further include a tag feature extraction layer and a tag feature interaction layer. Since titles, body text, and tags can all be input in text format, both the resource feature extraction layer and the tag feature extraction layer can be text feature extraction layers, allowing the resource content model and the resource tag model to share the same text feature extraction layer. The text feature extraction layer can employ models such as Word2Vec, BERT (Bidirectional Encoder Representations from Transformers), ALBERT, and RoBERTa.

[0124] The title and body paragraphs of the target multimedia resource are divided into text regions, resulting in n text regions (content_1, ..., content_n). (content_1, ..., content_n) are input into the resource content model to obtain n content feature vectors, namely Emb_con1, ..., Emb_conn.

[0125] The tags predicted for the target multimedia resource by the m preset tag prediction models are deduplicated to obtain p predicted tags (tag_1, ..., tag_p), and (tag_1, ..., tag_p) are input into the resource tag model. The tag feature extraction layer in the resource tag model can extract the tag attribute features of the p predicted tags under multiple feature attributes, which may include the tag source feature src_fea, the tag confidence feature sce_fea, the tag content statistical feature meta_fea, and the tag text semantic feature Emb_tag. Therefore, the tag attribute features corresponding to each predicted tag can be represented as (Emb_tag, src_fea, sce_fea, meta_fea).
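The deduplication of tags across the m preset tag prediction models can be sketched as a merge that keeps, for each unique tag, which models produced it and with what confidence. The input format (one tag-to-confidence dictionary per model) and the function name `merge_predictions` are assumptions made for this example.

```python
from typing import Dict, List


def merge_predictions(model_outputs: List[Dict[str, float]]) -> Dict[str, Dict[int, float]]:
    """Merge per-model tag predictions into unique tags.

    Returns a mapping: tag -> {1-based model index: confidence}.
    """
    merged: Dict[str, Dict[int, float]] = {}
    for model_idx, output in enumerate(model_outputs, start=1):
        for tag, confidence in output.items():
            merged.setdefault(tag, {})[model_idx] = confidence
    return merged


# Outputs of m = 3 hypothetical preset tag prediction models.
merged = merge_predictions([
    {"NBA": 0.85, "finals": 0.4},
    {"NBA": 0.6},
    {"swimming": 0.7},
])
p = len(merged)  # number of predicted tags after deduplication
```

The per-tag mapping retains exactly the information needed to derive the tag source feature and the tag confidence feature for each of the p tags.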

[0126] For each predicted label, feature interaction processing is performed on the label attribute features under q feature attributes, which can be achieved through the following formula:

[0127] Emb_F = DeepFM(emb_1, emb_2, ..., emb_q)    (1)

[0128] DeepFM (Deep Factorization-Machine) can correspond to the label feature interaction layer in the resource label model. In this embodiment, MoE (Mixture of experts) can also be used to interact and fuse label attribute features under multiple feature attributes. By interacting the label attribute features of the predicted label under multiple feature attributes, the credibility of the predicted label can be better reflected.
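To give a feel for what feature interaction over q attribute embeddings means, the sketch below computes only a factorization-machine style second-order term, per embedding dimension. It is not DeepFM itself: the deep network component and any learned weights are omitted, and the function name `fm_pairwise_interaction` is hypothetical. It assumes all q embeddings have already been mapped to the same dimension.

```python
from typing import List, Sequence


def fm_pairwise_interaction(field_embeddings: List[Sequence[float]]) -> List[float]:
    """Sum of element-wise products over all pairs of field embeddings.

    Uses the identity: sum_{i<j} e_i * e_j = 0.5 * ((sum e_i)^2 - sum e_i^2),
    applied independently to each embedding dimension.
    """
    dim = len(field_embeddings[0])
    if any(len(e) != dim for e in field_embeddings):
        raise ValueError("all field embeddings must share one dimension")
    interaction = []
    for d in range(dim):
        total = sum(e[d] for e in field_embeddings)
        total_of_squares = sum(e[d] * e[d] for e in field_embeddings)
        interaction.append(0.5 * (total * total - total_of_squares))
    return interaction


# q = 3 attribute embeddings of dimension 2 (toy values).
emb_f = fm_pairwise_interaction([[1, 2], [3, 4], [5, 6]])
```

For the first dimension the pairwise products are 1·3 + 1·5 + 3·5 = 23, which the identity reproduces without enumerating the pairs.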

[0129] The output of the resource content model (Emb_tit, Emb_con1, ..., Emb_conn) and the output of the resource tag model (Emb_F1, ..., Emb_Fp) are then used in a similarity calculation to obtain the similarity scores (score_tt, score_ct1, ..., score_ctn). The similarity scores are fused to obtain the predicted association information (Mean). Each predicted tag thus has its own corresponding predicted association information (Mean), which characterizes the degree of association between the target multimedia resource and each of the multiple predicted tags. Therefore, based on the predicted association information corresponding to each predicted tag, the target tag can be determined from the multiple predicted tags. Specifically, a preset number of predicted tags with the largest predicted association information can be selected as the target tags; alternatively, predicted tags whose predicted association information is greater than a preset association information threshold can be determined as the target tags.
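The two selection rules described above, a preset number of highest-scoring tags or a preset threshold, can be sketched as follows. The function name `select_target_tags`, its parameters, and the example scores are illustrative assumptions.

```python
from typing import Dict, List, Optional


def select_target_tags(
    scores: Dict[str, float],
    top_k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Pick target tags from predicted tags by fused association score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Keep only tags whose score exceeds the preset value.
        ranked = [(tag, score) for tag, score in ranked if score > threshold]
    if top_k is not None:
        # Keep the preset number of highest-scoring tags.
        ranked = ranked[:top_k]
    return [tag for tag, _ in ranked]


scores = {"NBA": 0.9, "finals": 0.7, "weather": 0.2}
by_count = select_target_tags(scores, top_k=2)
by_threshold = select_target_tags(scores, threshold=0.8)
```

Either rule can be used on its own; passing both applies the threshold first and then truncates to the preset number.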

[0130] Therefore, based on the above model training method, which can improve the accuracy of the target label processing model in label processing, the label filtering process based on the target label processing model can improve the accuracy of label filtering, that is, to filter out target labels that are highly relevant to multimedia resources; furthermore, the recommendation of multimedia resources based on target labels can improve the accuracy of multimedia resource recommendation.

[0131] In this embodiment, tag attribute features under multiple feature attributes are constructed. Since the multi-path preset tag prediction model outputs multiple tag sets (and corresponding scores), the more sources a tag has from preset tag prediction models, the more credible the tag is (the higher the score of the preset tag prediction model, the more credible the tag). In addition, the higher the frequency and proportion of a tag's occurrence, the more credible the tag will be. Furthermore, if the tag is an entity tag (and the tag has also appeared in the text content), the credibility of the tag will further increase. Finally, the higher the consistency of the tag's classification distribution with other tags, that is, other tags have "voted" for this tag, the more credible the tag will be. The tag attribute features under multiple feature attributes are interacted with to improve the representational ability of the tag features. For example, for the two types of meta-information features, "entity features" and "frequency features," simply looking at whether a tag is an entity tag cannot indicate the credibility of the tag, because an entity tag that has appeared in both the title and the body text (and has a very high frequency) is significantly more credible than an entity tag that has not appeared in either the title or the body text.

[0132] It should be noted that the methods described above in this embodiment can be combined in actual implementation and have corresponding beneficial effects, which will not be elaborated here.

[0133] Please see Figure 11, which illustrates a label processing model training device, comprising:

[0134] The first acquisition module 1110 is used to acquire training samples and a model to be trained; the training samples include sample multimedia resources, candidate labels corresponding to the sample multimedia resources, and annotation association information between the sample multimedia resources and the candidate labels; the model to be trained includes a resource content model and a resource label model; the resource label model includes a label feature extraction layer and a label feature interaction layer.

[0135] The resource content feature extraction module 1120 is used to extract content features from the sample multimedia resources based on the resource content model to obtain resource content features.

[0136] The label feature extraction module 1130 is used to extract label features from the candidate label based on the label feature extraction layer, and obtain the label attribute features of the candidate label under multiple feature attributes.

[0137] Feature interaction module 1140 is used to perform feature interaction processing on the tag attribute features under the multiple feature attributes based on the tag feature interaction layer to obtain tag interaction features;

[0138] The association information prediction module 1150 is used to perform feature fusion based on the resource content features and the tag interaction features to obtain the predicted association information between the multimedia resource and the candidate tag;

[0139] The training module 1160 is used to train the model to be trained based on the labeled association information and the predicted association information to obtain the target label processing model.

[0140] Furthermore, the candidate labels are obtained by predicting labels for the sample multimedia resources based on multiple preset label prediction models; the multiple preset label prediction models are trained based on different label prediction algorithms; the candidate labels carry model identifiers, which are used to characterize the source information of the candidate labels; the feature attributes include label source attributes;

[0141] The label feature extraction module 1130 includes:

[0142] The first extraction module is used to extract features from the model identifier carried by the candidate label based on the label feature extraction layer, so as to obtain the label source feature corresponding to the label source attribute;

[0143] The first determining module is used to obtain the tag attribute features of the candidate tag under the multiple feature attributes based on the tag source features.

[0144] Furthermore, the candidate labels carry confidence information; the feature attributes include label confidence attributes;

[0145] The label feature extraction module 1130 includes:

[0146] The second extraction module is used to extract features from the confidence information carried by the candidate labels based on the label feature extraction layer, so as to obtain the label confidence feature corresponding to the label confidence attribute.

[0147] The second determining module is used to obtain the label attribute features of the candidate label under the multiple feature attributes based on the label confidence features.

[0148] Furthermore, the feature attributes include tag content statistical attributes;

[0149] The label feature extraction module 1130 includes:

[0150] The third extraction module is used to extract the statistical features of the candidate tags based on the tag feature extraction layer, so as to obtain the tag content statistical features corresponding to the tag content statistical attributes.

[0151] The third determining module is used to obtain the tag attribute features of the candidate tag under the multiple feature attributes based on the statistical features of the tag content.

[0152] Furthermore, the device also includes:

[0153] The region segmentation module is used to segment the sample multimedia resources into multiple text regions based on the text content layout information corresponding to the sample multimedia resources.

[0154] The resource content feature extraction module 1120 includes:

[0155] The fourth extraction module is used to extract content features from the text content corresponding to the multiple text regions based on the resource content model, so as to obtain resource content features corresponding to the multiple text regions respectively.

[0156] Furthermore, the association information prediction module 1150 includes:

[0157] The similarity calculation module is used to calculate the similarity between the tag interaction features and the resource content features corresponding to the multiple text regions respectively, and obtain multiple similarity information.

[0158] The similarity information fusion module is used to perform information fusion processing on the multiple similarity information to obtain the predicted association information.

[0159] Furthermore, the label feature extraction module 1130 includes:

[0160] The fifth extraction module is used to extract label features from the candidate labels based on the label feature extraction layer, and obtain the label extraction features of the candidate labels under multiple feature attributes;

[0161] The transformation module is used to perform integer transformation processing on the extracted label features to obtain the transformed label features;

[0162] A mapping module is used to map the label transformation features to label features of a target dimension; the target dimension is greater than the dimension of the label transformation features.

[0163] The fourth determining module is used to determine the label features of the target dimension as the label attribute features.

[0164] Please see Figure 12, which illustrates a label determination device, comprising:

[0165] The second acquisition module 1210 is used to acquire multiple predicted tags corresponding to the target multimedia resource;

[0166] The tag processing module 1220 is used to input the target multimedia resource and the multiple predicted tags into the target tag processing model for tag processing, so as to obtain the target association information corresponding to the target multimedia resource and the multiple predicted tags respectively; the target tag processing model is obtained based on the above-mentioned tag processing model training device;

[0167] The target label determination module 1230 is used to determine the target label from the multiple predicted labels based on the target association information corresponding to the target multimedia resource and the multiple predicted labels respectively.

[0168] The apparatus provided in the above embodiments can execute the methods provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects for executing the method. Technical details not described in detail in the above embodiments can be found in the methods provided in any embodiment of this application.

[0169] This embodiment also provides a computer-readable storage medium storing at least one instruction or at least one program, which is loaded by a processor and executed as any of the methods described above in this embodiment.

[0170] According to one aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform any of the methods described above.

[0171] This embodiment also provides an electronic device, the structural diagram of which is shown in Figure 13. The device 1300 can vary significantly in configuration or performance, and may include one or more central processing units (CPUs) 1322 (e.g., one or more processors) and memory 1332, and one or more storage media 1330 (e.g., one or more mass storage devices) for storing applications 1342 or data 1344. The memory 1332 and storage media 1330 may be temporary or persistent storage. Programs stored in the storage media 1330 may include one or more modules (not shown), each module including a series of instruction operations on the device. Furthermore, the CPU 1322 may be configured to communicate with the storage media 1330 and execute the series of instruction operations in the storage media 1330 on the device 1300. The device 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc. Any of the methods described above in this embodiment can be implemented based on the device shown in Figure 13.

[0172] This specification provides the operational steps of the methods described in the embodiments or flowcharts, but more or fewer operational steps may be included based on conventional or non-inventive labor. The steps and order listed in the embodiments are merely one possible execution order among many and do not represent the only execution order. In actual system or terminal product execution, the methods shown in the embodiments or drawings can be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment).

[0173] The structure shown in this embodiment is only a partial structure related to the solution of this application and does not constitute a limitation on the device to which the solution of this application is applied. Specific devices may include more or fewer components than shown, or combinations of certain components, or arrangements of different components. It should be understood that the methods, apparatuses, etc., disclosed in this embodiment can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or unit modules through some interfaces.

[0174] Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0175] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed in this specification can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0176] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A method for training a label processing model, characterized in that it comprises: obtaining training samples and a model to be trained; the training samples include sample multimedia resources, candidate labels corresponding to the sample multimedia resources, and annotation association information between the sample multimedia resources and the candidate labels; the model to be trained includes a resource content model and a resource label model; the resource label model includes a label feature extraction layer and a label feature interaction layer; the candidate labels are obtained by predicting the labels of the sample multimedia resources based on multiple preset label prediction models; the multiple preset label prediction models are trained based on different label prediction algorithms; each candidate label carries a model identifier and confidence information; the model identifier characterizes the source information of the candidate label; the more model identifiers a candidate label carries, the higher its confidence;
based on the resource content model, extracting content features from the sample multimedia resources to obtain resource content features; based on the label feature extraction layer, extracting label features from the candidate labels to obtain label attribute features of the candidate labels under multiple feature attributes; the feature attributes include a label source attribute and a label confidence attribute; the step of extracting label features from the candidate labels based on the label feature extraction layer to obtain label attribute features of the candidate labels under multiple feature attributes includes: based on the label feature extraction layer, performing feature extraction on the model identifier and the confidence information carried by the candidate label to obtain the label source feature corresponding to the label source attribute and the label confidence feature corresponding to the label confidence attribute; and obtaining, based on the label source feature and the label confidence feature, the label attribute features of the candidate labels under the multiple feature attributes; based on the label feature interaction layer, performing feature interaction processing on the label attribute features under the multiple feature attributes to obtain label interaction features; performing feature fusion based on the resource content features and the label interaction features to obtain predicted association information between the sample multimedia resources and the candidate labels; and training the model to be trained based on the annotation association information and the predicted association information to obtain a target label processing model.
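
By way of non-limiting illustration, the data flow recited in claim 1 can be sketched as follows. This is a minimal sketch only: the feature dimension, the random embedding tables, the pairwise-product interaction, the dot-product fusion with a sigmoid, and the binary cross-entropy loss are all assumptions introduced here for illustration and are not specified by the claim. All function and variable names are hypothetical.

```python
import math
import random

random.seed(0)
DIM = 4  # hypothetical feature dimension

def random_vector(dim=DIM):
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]

# Hypothetical embedding table: one vector per preset label prediction model.
SOURCE_EMBEDDINGS = {"model_a": random_vector(), "model_b": random_vector()}
CONFIDENCE_WEIGHTS = random_vector()

def extract_label_attribute_features(model_ids, confidence):
    """Label feature extraction layer: one feature vector per feature attribute."""
    # Label source feature: average embedding of the models that produced the label.
    source = [sum(SOURCE_EMBEDDINGS[m][i] for m in model_ids) / len(model_ids)
              for i in range(DIM)]
    # Label confidence feature: the scalar confidence scaled into a vector.
    conf = [confidence * w for w in CONFIDENCE_WEIGHTS]
    return [source, conf]

def interact(attribute_features):
    """Label feature interaction layer (assumed form): sum plus pairwise products."""
    out = [sum(f[i] for f in attribute_features) for i in range(DIM)]
    for a in range(len(attribute_features)):
        for b in range(a + 1, len(attribute_features)):
            for i in range(DIM):
                out[i] += attribute_features[a][i] * attribute_features[b][i]
    return out

def predict_association(content_feature, label_interaction_feature):
    """Feature fusion (assumed form): dot product squashed into (0, 1)."""
    score = sum(c * t for c, t in zip(content_feature, label_interaction_feature))
    return 1.0 / (1.0 + math.exp(-score))

def bce_loss(predicted, annotated):
    """Loss between predicted and annotation association information."""
    eps = 1e-12
    return -(annotated * math.log(predicted + eps)
             + (1 - annotated) * math.log(1 - predicted + eps))

content_feature = random_vector()  # stand-in for the resource content model output
label_features = extract_label_attribute_features(["model_a", "model_b"], 0.9)
predicted = predict_association(content_feature, interact(label_features))
loss = bce_loss(predicted, annotated=1.0)
```

In an actual implementation the loss would be back-propagated to update the parameters of both the resource content model and the resource label model; that step is omitted here.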

2. The method according to claim 1, characterized in that the feature attributes include a label content statistical attribute; the step of extracting label features from the candidate labels based on the label feature extraction layer to obtain label attribute features of the candidate labels under multiple feature attributes includes: based on the label feature extraction layer, extracting content statistical features from the candidate labels to obtain the label content statistical feature corresponding to the label content statistical attribute; and obtaining, based on the label content statistical feature, the label attribute features of the candidate labels under the multiple feature attributes.

3. The method according to any one of claims 1-2, characterized in that, Before extracting content features from the sample multimedia resources based on the resource content model to obtain resource content features, the method further includes: Based on the text content layout information corresponding to the sample multimedia resources, the sample multimedia resources are divided into text regions to obtain multiple text regions. The step of extracting content features from the sample multimedia resources based on the resource content model to obtain resource content features includes: Based on the resource content model, content features are extracted from the text content corresponding to the multiple text regions to obtain resource content features corresponding to the multiple text regions.
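
By way of non-limiting illustration, the region division and per-region feature extraction of claim 3 can be sketched as follows. The claim does not define the form of the text content layout information; this sketch assumes it is a list of hypothetical `(region_name, start, end)` character offsets, and `extract_content_feature` is a toy stand-in for the resource content model.

```python
def split_into_text_regions(text, layout):
    """Divide the resource text into regions using layout offsets.

    `layout` is a hypothetical list of (region_name, start, end) tuples.
    """
    return {name: text[start:end] for name, start, end in layout}

def extract_content_feature(region_text):
    """Stand-in for the resource content model: word count and total word length."""
    words = region_text.split()
    return [float(len(words)), float(sum(len(w) for w in words))]

text = "Cute cat video. A kitten plays with yarn."
layout = [("title", 0, 15), ("body", 16, len(text))]  # assumed layout information

regions = split_into_text_regions(text, layout)
# One resource content feature per text region.
region_features = {name: extract_content_feature(t) for name, t in regions.items()}
```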

4. The method according to claim 3, characterized in that the step of performing feature fusion based on the resource content features and the label interaction features to obtain the predicted association information between the sample multimedia resources and the candidate labels includes: calculating the similarity between the label interaction features and the resource content features corresponding to each of the multiple text regions to obtain multiple pieces of similarity information; and performing information fusion processing on the multiple pieces of similarity information to obtain the predicted association information.
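
By way of non-limiting illustration, the per-region similarity calculation and fusion of claim 4 can be sketched as follows. Cosine similarity and a weighted average are assumptions chosen for this sketch; the claim names neither the similarity measure nor the fusion operation, and the region weights are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_association(label_interaction_feature, region_features, region_weights):
    """Compute one similarity per text region, then fuse by weighted average."""
    similarities = {name: cosine_similarity(label_interaction_feature, feat)
                    for name, feat in region_features.items()}
    total = sum(region_weights[name] for name in similarities)
    fused = sum(region_weights[name] * s for name, s in similarities.items()) / total
    return similarities, fused

region_features = {"title": [1.0, 0.0], "body": [0.0, 1.0]}
region_weights = {"title": 2.0, "body": 1.0}  # hypothetical fusion weights
similarities, fused = predict_association([1.0, 0.0], region_features, region_weights)
```

Here the label feature matches the title region exactly and is orthogonal to the body region, so the fused value reflects only the title's share of the total weight.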

5. The method according to claim 1, characterized in that, The step of extracting label features from the candidate labels based on the label feature extraction layer to obtain label attribute features of the candidate labels under multiple feature attributes includes: Based on the label feature extraction layer, label features are extracted from the candidate labels to obtain the label extraction features of the candidate labels under multiple feature attributes; The extracted label features are subjected to integer transformation to obtain the label transformation features; The label transformation features are mapped to label features of a target dimension; the target dimension is greater than the dimension of the label transformation features. The label features of the target dimension are determined as the label attribute features.
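
By way of non-limiting illustration, the integer transformation and dimension mapping of claim 5 can be sketched as follows. The sketch assumes that the integer transformation is equal-width bucketing of a continuous value and that the mapping is an embedding-table lookup; the bucket count, the target dimension, and the random table are hypothetical.

```python
import random

random.seed(0)
NUM_BUCKETS = 10   # hypothetical number of integer buckets
TARGET_DIM = 8     # hypothetical target dimension, larger than a single integer ID

# Hypothetical embedding table: one TARGET_DIM vector per integer bucket.
EMBEDDING_TABLE = [[random.uniform(-1, 1) for _ in range(TARGET_DIM)]
                   for _ in range(NUM_BUCKETS)]

def integer_transform(value, low=0.0, high=1.0):
    """Map a continuous extracted feature in [low, high] to an integer bucket."""
    ratio = (value - low) / (high - low)
    return min(NUM_BUCKETS - 1, max(0, int(ratio * NUM_BUCKETS)))

def map_to_target_dimension(bucket_id):
    """Look up the higher-dimensional label feature for an integer bucket."""
    return EMBEDDING_TABLE[bucket_id]

bucket = integer_transform(0.87)  # e.g. a confidence value of 0.87
label_attribute_feature = map_to_target_dimension(bucket)
```

In a trained model the table entries would be learned parameters rather than random values.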

6. A label determination method, characterized in that it comprises: obtaining multiple predicted labels corresponding to a target multimedia resource; the multiple predicted labels are obtained by predicting the labels of the target multimedia resource based on multiple preset label prediction models; the multiple preset label prediction models are trained based on different label prediction algorithms; each of the multiple predicted labels carries a model identifier and confidence information; the model identifier characterizes the source information of the predicted label; the more model identifiers a predicted label carries, the higher its confidence; inputting the target multimedia resource and the multiple predicted labels into a target label processing model for label processing to obtain target association information between the target multimedia resource and each of the multiple predicted labels; the target label processing model is obtained based on the label processing model training method as described in any one of claims 1-5; and determining a target label from the multiple predicted labels based on the target association information between the target multimedia resource and each of the multiple predicted labels.
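
By way of non-limiting illustration, the label determination flow of claim 6 can be sketched as follows. The sketch assumes that confidence is the share of preset models that predicted a label (so that more model identifiers yield a higher confidence) and that target labels are selected by a fixed threshold; `toy_score` is a hypothetical stand-in for the trained target label processing model.

```python
from collections import defaultdict

def merge_predictions(predictions_by_model):
    """Collect predicted labels from several models, keeping model identifiers."""
    merged = defaultdict(lambda: {"model_ids": [], "confidence": 0.0})
    for model_id, labels in predictions_by_model.items():
        for label in labels:
            merged[label]["model_ids"].append(model_id)
    for info in merged.values():
        # Assumption: confidence is the share of models that predicted the label,
        # so a label carrying more model identifiers gets a higher confidence.
        info["confidence"] = len(info["model_ids"]) / len(predictions_by_model)
    return dict(merged)

def determine_target_labels(resource, predicted_labels, score_fn, threshold=0.5):
    """Score each predicted label against the resource and keep the high scorers."""
    scores = {label: score_fn(resource, label, info)
              for label, info in predicted_labels.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [label for label, s in ranked if s >= threshold]

def toy_score(resource, label, info):
    """Stand-in for the target label processing model (hypothetical)."""
    return info["confidence"] if label in resource else 0.0

predictions = {
    "model_a": ["cat", "pet"],
    "model_b": ["cat", "dog"],
    "model_c": ["cat", "pet"],
}
predicted_labels = merge_predictions(predictions)
target_labels = determine_target_labels("a cat and its pet toys", predicted_labels, toy_score)
```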

7. A label processing model training device, characterized in that it comprises: a first acquisition module, used to acquire training samples and a model to be trained; the training samples include sample multimedia resources, candidate labels corresponding to the sample multimedia resources, and annotation association information between the sample multimedia resources and the candidate labels; the model to be trained includes a resource content model and a resource label model; the resource label model includes a label feature extraction layer and a label feature interaction layer; the candidate labels are obtained by predicting the labels of the sample multimedia resources based on multiple preset label prediction models; the multiple preset label prediction models are trained based on different label prediction algorithms; each candidate label carries a model identifier and confidence information; the model identifier characterizes the source information of the candidate label; the more model identifiers a candidate label carries, the higher its confidence; a resource content feature extraction module, used to extract content features from the sample multimedia resources based on the resource content model to obtain resource content features;
a label feature extraction module, used to extract label features from the candidate labels based on the label feature extraction layer to obtain label attribute features of the candidate labels under multiple feature attributes; the feature attributes include a label source attribute and a label confidence attribute; the extracting of label features from the candidate labels based on the label feature extraction layer to obtain label attribute features of the candidate labels under multiple feature attributes includes: based on the label feature extraction layer, performing feature extraction on the model identifier and the confidence information carried by the candidate label to obtain the label source feature corresponding to the label source attribute and the label confidence feature corresponding to the label confidence attribute; and obtaining, based on the label source feature and the label confidence feature, the label attribute features of the candidate labels under the multiple feature attributes; a feature interaction module, used to perform feature interaction processing on the label attribute features under the multiple feature attributes based on the label feature interaction layer to obtain label interaction features; an association information prediction module, used to perform feature fusion based on the resource content features and the label interaction features to obtain predicted association information between the sample multimedia resources and the candidate labels; and a training module, used to train the model to be trained based on the annotation association information and the predicted association information to obtain a target label processing model.

8. The apparatus according to claim 7, characterized in that the feature attributes include a label content statistical attribute; the label feature extraction module includes: a third extraction module, used to extract content statistical features from the candidate labels based on the label feature extraction layer to obtain the label content statistical feature corresponding to the label content statistical attribute; and a third determining module, used to obtain the label attribute features of the candidate labels under the multiple feature attributes based on the label content statistical feature.

9. The apparatus according to any one of claims 7-8, characterized in that, The device further includes: The region segmentation module is used to segment the sample multimedia resources into multiple text regions based on the text content layout information corresponding to the sample multimedia resources. The resource content feature extraction module includes: The fourth extraction module is used to extract content features from the text content corresponding to the multiple text regions based on the resource content model, so as to obtain resource content features corresponding to the multiple text regions respectively.

10. The apparatus according to claim 9, characterized in that the association information prediction module includes: a similarity calculation module, used to calculate the similarity between the label interaction features and the resource content features corresponding to each of the multiple text regions to obtain multiple pieces of similarity information; and a similarity information fusion module, used to perform information fusion processing on the multiple pieces of similarity information to obtain the predicted association information.

11. The apparatus according to claim 7, characterized in that, The label feature extraction module includes: The fifth extraction module is used to extract label features from the candidate labels based on the label feature extraction layer, and obtain the label extraction features of the candidate labels under multiple feature attributes; The transformation module is used to perform integer transformation processing on the extracted label features to obtain the transformed label features; A mapping module is used to map the label transformation features to label features of a target dimension; the target dimension is greater than the dimension of the label transformation features. The fourth determining module is used to determine the label features of the target dimension as the label attribute features.

12. A label determining device, characterized in that it comprises: a second acquisition module, used to acquire multiple predicted labels corresponding to a target multimedia resource; the multiple predicted labels are obtained by predicting the labels of the target multimedia resource based on multiple preset label prediction models; the multiple preset label prediction models are trained based on different label prediction algorithms; each of the multiple predicted labels carries a model identifier and confidence information; the model identifier characterizes the source information of the predicted label; the more model identifiers a predicted label carries, the higher its confidence; a label processing module, used to input the target multimedia resource and the multiple predicted labels into a target label processing model for label processing to obtain target association information between the target multimedia resource and each of the multiple predicted labels; the target label processing model is obtained based on the label processing model training device as described in claim 7; and a target label determination module, used to determine a target label from the multiple predicted labels based on the target association information between the target multimedia resource and each of the multiple predicted labels.

13. An electronic device, characterized in that, The device includes a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement the label processing model training method as described in any one of claims 1 to 5, or the label determination method as described in claim 6.

14. A computer storage medium, characterized in that the storage medium stores at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the label processing model training method as described in any one of claims 1 to 5, or the label determination method as described in claim 6.

15. A computer program product, characterized in that, The computer program product includes computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the label processing model training method as described in any one of claims 1 to 5, or the label determination method as described in claim 6.