Article classification model training method and article classification method, device and medium

By integrating feature vectors and semantic information from descriptive text, audio text, and image text in short videos, and updating model parameters, the problem of inaccurate classification in short videos is solved, thus improving the accuracy of product recommendations.

CN115482490B · Active · Publication Date: 2026-06-12 · BEIJING DAJIA INTERNET INFORMATION TECH CO LTD

Cites: 3 · Cited by: 0

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING DAJIA INTERNET INFORMATION TECH CO LTD
Filing Date: 2022-09-22
Publication Date: 2026-06-12

Application Information

Patent Timeline

22 Sep 2022: Application
12 Jun 2026: Publication (CN115482490B)

IPC: G06V20/40; G06V10/764; G06V10/80; G06V10/82; G06F40/30; G06Q30/0601

CPC: G06V20/41; G06V10/765; G06V10/806; G06V10/82; G06F40/30; G06Q30/0643

AI Tagging

Application Domain

Semantic analysis; Character and pattern recognition


AI Technical Summary

⚠Technical Problem

The uneven distribution of image and text feature vectors in short videos leads to inaccurate product classification results from the model, which in turn affects the accuracy of product recommendations.

⚗Method used

By acquiring the feature vectors and semantic information of descriptive text, audio text, and image text from video sample data, fusing the semantic information of different text feature vectors, and updating the neural network parameters of the model to be trained, an item classification model is obtained.

🎯Benefits of technology

This improves the accuracy of the model's classification results, ensuring the accuracy of product recommendations.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure relates to the field of computer technology, and in particular to an article classification model training method, an article classification method, an article classification model training device, an article classification device, a computer-readable storage medium, and an electronic device. The training method comprises: obtaining video sample data; obtaining the feature vector and the semantic information of each text; determining the text feature vectors corresponding to the video sample data; fusing the image feature vectors and the text feature vectors to obtain first fusion feature vectors corresponding to the video sample data; obtaining a predicted article category according to the first fusion feature vectors corresponding to the video sample data; and updating the neural network parameters of a to-be-trained model according to the article category label and the predicted article category. The technical scheme of the embodiments of the present disclosure can solve the problem of inaccurate classification in the prior art.


Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and more specifically, to a method for training an item classification model, an item classification method, an item classification model training device, an item classification device, a computer-readable storage medium, and an electronic device.

Background Technology

[0002] With the rapid development of the internet, short video e-commerce is becoming increasingly widespread. Merchants can showcase, introduce, and sell products featured in short videos. In some scenarios, products in short videos can be identified, and relevant links can be provided for potential buyers to view them.

[0003] In related technologies, image feature vectors can be extracted using an image encoder, and text feature vectors can be extracted using a text encoder. The image feature vectors and text feature vectors are then fused to obtain product feature vectors. Based on these product feature vectors, the categories of products in the short video can be determined.

[0004] However, due to the uneven distribution of image or text feature vectors in short videos, the classification results output by the model may be inaccurate, leading to errors in product recommendations.

[0005] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this disclosure, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0006] The purpose of this disclosure is to provide an item classification model training method, an item classification model training device, a computer-readable storage medium, and an electronic device, which can solve the problem of inaccurate classification in the prior art.

[0007] Other features and advantages of this disclosure will become apparent from the following detailed description, or may be learned in part from practice of this disclosure.

[0008] According to a first aspect of this disclosure, a method for training an item classification model is provided, comprising: acquiring video sample data; wherein the video sample data includes text data, image data, and item category labels, and the text data includes descriptive text, speech text, and image text; inputting the video sample data into the model to be trained, acquiring feature vectors and first semantic information of the descriptive text, acquiring feature vectors and second semantic information of the speech text, and acquiring feature vectors and third semantic information of the image text; determining text feature vectors corresponding to the video sample data based on the feature vectors and first semantic information of the descriptive text, the feature vectors and second semantic information of the speech text, and the feature vectors and third semantic information of the image text; acquiring image feature vectors corresponding to the video sample data based on the image data in the video sample data, and fusing the image feature vectors and text feature vectors to obtain a first fused feature vector corresponding to the video sample data; obtaining a predicted item category based on the first fused feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained based on the item category labels and the predicted item categories to obtain the item classification model.
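The fuse, predict, and update steps of the first aspect can be sketched as follows. This is a minimal illustration only, not the disclosed model: fusion by concatenation, a linear softmax classifier, a cross-entropy loss, and the names `fuse`, `predict`, and `train_step` are all assumptions introduced here, and the encoders that would produce the image and text feature vectors are omitted.

```python
import math

def fuse(image_vec, text_vec):
    # Assumption: fusion by concatenation; the disclosure does not fix the operator.
    return list(image_vec) + list(text_vec)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, fused):
    # weights holds one row of coefficients per item category (hypothetical classifier head).
    logits = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return softmax(logits)

def train_step(weights, fused, label, lr=0.1):
    """One gradient step of cross-entropy (an assumed stand-in loss); updates weights in place."""
    probs = predict(weights, fused)
    loss = -math.log(probs[label])
    for k, row in enumerate(weights):
        grad_logit = probs[k] - (1.0 if k == label else 0.0)
        for j, x in enumerate(fused):
            row[j] -= lr * grad_logit * x
    return loss

image_vec = [0.2, 0.7]          # stand-in image feature vector
text_vec = [0.9, 0.1, 0.4]      # stand-in text feature vector
fused = fuse(image_vec, text_vec)
weights = [[0.0] * len(fused) for _ in range(3)]  # three hypothetical item categories
loss_before = train_step(weights, fused, label=2)
loss_after = -math.log(predict(weights, fused)[2])
```

After one update on this sample, the loss for the labelled category decreases, which is the behaviour the parameter update step is meant to achieve.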

[0009] Optionally, based on the aforementioned scheme, determining the text feature vector corresponding to the video sample data based on the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text includes: merging the first semantic information, the second semantic information, and the third semantic information into overall semantic information based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information; and determining the text feature vector corresponding to the video sample data based on the overall semantic information and the feature vectors of the descriptive text, the speech text, and the image text.
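The weighted merging of the three pieces of semantic information can be illustrated with a short sketch. The weight normalisation, the use of concatenation to combine the overall semantic information with the three feature vectors, and the names `merge_semantics` and `text_feature_vector` are assumptions made for illustration; the disclosure only states that weights are used and that the text feature vector depends on both inputs.

```python
def merge_semantics(semantics, weights):
    """Weighted sum of per-source semantic vectors (weights normalised to sum to 1)."""
    total = sum(weights)
    dim = len(semantics[0])
    return [
        sum((w / total) * s[i] for w, s in zip(weights, semantics))
        for i in range(dim)
    ]

def text_feature_vector(overall, feature_vectors):
    # Assumption: concatenate the overall semantic information with each text feature vector.
    merged = list(overall)
    for vec in feature_vectors:
        merged.extend(vec)
    return merged

# Stand-in semantic vectors for descriptive text, speech text, and image text.
desc_sem, speech_sem, image_sem = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
overall = merge_semantics([desc_sem, speech_sem, image_sem], [0.5, 0.25, 0.25])
text_vec = text_feature_vector(overall, [[0.1], [0.2], [0.3]])
```

With these stand-in values, the descriptive text contributes half of the overall semantic information and the other two sources a quarter each.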

[0010] Optionally, based on the aforementioned scheme, the neural network parameters of the model to be trained are updated according to the item category label and the predicted item category, including: determining the first loss function of the model to be trained according to the item category label and the predicted item category; and updating the neural network parameters of the model to be trained according to the first loss function.

[0011] Optionally, based on the aforementioned scheme, the first loss function is an asymmetric loss function; wherein, the video sample data includes positive samples and negative samples, the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples, and negative samples are removed when the predicted probability of the predicted item category corresponding to the negative sample is less than a preset threshold.
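One common way to write an asymmetric loss with these properties is sketched below for a single category probability. The exact formula is not given in this passage, so the focusing form, the probability shift by the threshold, the default values, and the name `asymmetric_loss` are assumptions; the sketch only reproduces the two stated properties (a larger exponent for negatives and removal of negatives below a preset threshold).

```python
import math

def asymmetric_loss(p, is_positive, gamma_pos=1.0, gamma_neg=4.0, threshold=0.05):
    """Per-category asymmetric loss for a predicted probability p in [0, 1]."""
    eps = 1e-8  # guards against log(0)
    if is_positive:
        # Positive sample: smaller exponent, so hard positives keep more weight.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    if p < threshold:
        # Easy negative below the preset threshold is removed from the loss.
        return 0.0
    # Assumption: shift the probability by the threshold before applying the larger exponent.
    p_shifted = p - threshold
    return -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))

easy_negative = asymmetric_loss(0.01, is_positive=False)
hard_negative = asymmetric_loss(0.90, is_positive=False)
mild_negative = asymmetric_loss(0.20, is_positive=False)
```

The larger exponent on negatives shrinks the contribution of mildly wrong negatives much faster than that of badly wrong ones, which is one way to counter an imbalance between positive and negative samples.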

[0012] Optionally, based on the aforementioned scheme, updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained includes: inputting the video sample data into a momentum model, wherein the neural network parameters of the momentum model are updated in a sliding manner according to the changes in the neural network parameters of the model to be trained during training; obtaining the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, obtaining the momentum feature vector of the speech text and the fifth semantic information of the speech text, and obtaining the momentum feature vector of the image text and the sixth semantic information of the image text; determining the momentum text feature vector corresponding to the video sample data based on the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text; obtaining the momentum image feature vector corresponding to the video sample data, and fusing the momentum image feature vector with the momentum text feature vector to obtain the second fused feature vector corresponding to the video sample data; determining the second loss function of the model to be trained based on the second fused feature vector of the video sample data; determining the overall loss function based on the first loss function and the second loss function of the model to be trained; and updating the neural network parameters of the model to be trained through the overall loss function to obtain the item classification model.
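The momentum model update and the combination of the two losses can be sketched as follows. Reading the sliding update as an exponential moving average of the trained model's parameters, and combining the two losses as a weighted sum, are both assumptions; the coefficient names `m` and `alpha` and the function names are hypothetical.

```python
def momentum_update(momentum_params, model_params, m=0.99):
    """Assumed sliding update: exponential moving average of the trained model's parameters."""
    return [m * mp + (1.0 - m) * p for mp, p in zip(momentum_params, model_params)]

def overall_loss(first_loss, second_loss, alpha=0.4):
    # Assumption: weighted sum of the first (label-based) and second (momentum-based) losses.
    return (1.0 - alpha) * first_loss + alpha * second_loss

# The momentum parameters drift toward the trained model's parameters over several steps.
momentum = [0.0, 0.0]
for step_params in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    momentum = momentum_update(momentum, step_params, m=0.5)

combined = overall_loss(2.0, 4.0, alpha=0.5)
```

Because the momentum parameters change more slowly than those of the model being trained, predictions from the momentum model tend to be more stable from step to step.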

[0013] Optionally, based on the aforementioned scheme, the second loss function of the model to be trained is determined according to the second fusion feature vector of the video sample data, including: obtaining item category pseudo-labels according to the second fusion feature vector of the video sample data; and determining the second loss function of the model to be trained based on the item category pseudo-labels and the predicted item categories.

[0014] Optionally, based on the aforementioned scheme, the second loss function is an asymmetric loss function; wherein, the video sample data includes positive samples and negative samples, and the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples. When the predicted probability of the predicted item category corresponding to a negative sample is less than the predicted probability of the pseudo-label of the item category corresponding to the negative sample, the negative sample is removed; when the predicted probability of the predicted item category corresponding to a positive sample is greater than the predicted probability of the pseudo-label of the item category corresponding to the positive sample, the positive sample is removed.
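The two removal rules of the second loss function can be sketched per category as follows. The loss form, the default exponents, the caller-supplied `is_positive` flag, and the name `second_loss_term` are assumptions for illustration; only the comparison between the model's predicted probability and the pseudo-label probability comes from the paragraph above.

```python
import math

def second_loss_term(p_pred, p_pseudo, is_positive, gamma_pos=1.0, gamma_neg=4.0):
    """Asymmetric loss term where the pseudo-label probability decides sample removal."""
    eps = 1e-8  # guards against log(0)
    if is_positive:
        if p_pred > p_pseudo:
            # Positive sample already scored above the pseudo-label: removed.
            return 0.0
        return -((1.0 - p_pred) ** gamma_pos) * math.log(max(p_pred, eps))
    if p_pred < p_pseudo:
        # Negative sample already scored below the pseudo-label: removed.
        return 0.0
    return -(p_pred ** gamma_neg) * math.log(max(1.0 - p_pred, eps))

removed_negative = second_loss_term(0.10, 0.30, is_positive=False)
kept_negative = second_loss_term(0.60, 0.30, is_positive=False)
removed_positive = second_loss_term(0.80, 0.50, is_positive=True)
kept_positive = second_loss_term(0.40, 0.50, is_positive=True)
```

Only samples on which the model being trained does worse than the pseudo-label contribute to this term.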

[0015] According to a second aspect of this disclosure, an item classification method is provided, the method comprising: acquiring video data; wherein the video data includes text data and image data, the text data including descriptive text, speech text, and image text; inputting the video data into an item classification model to obtain item categories; wherein the item classification model is trained using the item classification model training method described in any of the above embodiments.

[0016] According to a third aspect of this disclosure, an item classification model training apparatus is provided. The apparatus includes: a sample data acquisition unit configured to acquire video sample data, wherein the video sample data includes text data, image data, and item category labels, and the text data includes descriptive text, speech text, and image text; a semantic information acquisition unit configured to input the video sample data into a model to be trained, acquire the feature vector of the descriptive text and the first semantic information of the descriptive text, acquire the feature vector of the speech text and the second semantic information of the speech text, and acquire the feature vector of the image text and the third semantic information of the image text; a text feature acquisition unit configured to determine the text feature vector corresponding to the video sample data based on the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text; a feature fusion unit configured to obtain the image feature vector corresponding to the video sample data based on the image data in the video sample data, and fuse the image feature vector with the text feature vector to obtain the first fused feature vector corresponding to the video sample data; and a parameter update unit configured to obtain the predicted item category based on the first fused feature vector corresponding to the video sample data, and update the neural network parameters of the model to be trained based on the item category label and the predicted item category to obtain the item classification model.

[0017] Optionally, based on the aforementioned scheme, the device further includes: a semantic merging unit configured to merge the first semantic information, the second semantic information, and the third semantic information into overall semantic information based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information; and a text feature vector determination unit configured to determine the text feature vector corresponding to the video sample data based on the overall semantic information and the feature vectors of the descriptive text, the speech text, and the image text.

[0018] Optionally, based on the aforementioned scheme, the device further includes: a first loss function determination unit configured to determine a first loss function of the model to be trained based on the item category label and the predicted item category; and a first loss function training unit configured to update the neural network parameters of the model to be trained based on the first loss function of the model to be trained.

[0019] Optionally, based on the aforementioned scheme, the first loss function is an asymmetric loss function; wherein the video sample data includes positive samples and negative samples, and the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples. When the predicted probability of the predicted item category corresponding to a negative sample is less than a preset threshold, the negative sample is removed.

[0020] Optionally, based on the aforementioned scheme, the device further includes: a momentum model input unit configured to input video sample data into a momentum model, wherein the neural network parameters of the momentum model are updated in a sliding manner according to the changes in the neural network parameters of the model to be trained during training; a second semantic information acquisition unit configured to acquire the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, acquire the momentum feature vector of the speech text and the fifth semantic information of the speech text, and acquire the momentum feature vector of the image text and the sixth semantic information of the image text; a second text feature vector determination unit configured to determine the momentum text feature vector corresponding to the video sample data based on the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text; a second fusion feature vector acquisition unit configured to acquire the momentum image feature vector corresponding to the video sample data and fuse the momentum image feature vector with the momentum text feature vector to obtain the second fusion feature vector corresponding to the video sample data; an overall loss function acquisition unit configured to determine the second loss function of the model to be trained based on the second fusion feature vector of the video sample data; and a training unit configured to determine the overall loss function based on the first and second loss functions of the model to be trained, and update the neural network parameters of the model to be trained using the overall loss function to obtain the item classification model.

[0021] Optionally, based on the aforementioned scheme, the device further includes: an item category pseudo-label acquisition unit configured to obtain item category pseudo-labels based on the second fusion feature vector of the video sample data; and a second loss function determination unit configured to determine the second loss function of the model to be trained based on the item category pseudo-labels and the predicted item categories.

[0022] Optionally, based on the aforementioned scheme, the second loss function is an asymmetric loss function; wherein the video sample data includes positive samples and negative samples, and the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples. When the predicted probability of the predicted item category corresponding to a negative sample is less than the predicted probability of the pseudo-label of the item category corresponding to the negative sample, the negative sample is removed; when the predicted probability of the predicted item category corresponding to a positive sample is greater than the predicted probability of the pseudo-label of the item category corresponding to the positive sample, the positive sample is removed.

[0023] According to a fourth aspect of this disclosure, an item classification apparatus is provided, the apparatus comprising: a video acquisition unit configured to acquire video data; wherein the video data includes text data and image data, the text data including descriptive text, speech text, and image text; and an item category acquisition unit configured to input the video data into an item classification model to obtain an item category; wherein the item classification model is trained using the item classification model training method described above.

[0024] According to a fifth aspect of this disclosure, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the item classification model training method of the first aspect and the item classification method of the second aspect as described in the above embodiments.

[0025] According to a sixth aspect of this disclosure, an electronic device is provided, comprising:

[0026] a processor; and

[0027] a memory for storing one or more programs, which, when executed by one or more processors, enable the one or more processors to implement the item classification model training method of the first aspect and the item classification method of the second aspect as described in the above embodiments.

[0028] According to a seventh aspect of the present disclosure, a computer program product is provided, comprising a computer program/instruction, characterized in that, when the computer program/instruction is executed by a processor, it implements the item classification model training method and the item classification method described above.

[0029] The technical solutions provided in this disclosure may have the following beneficial effects:

[0030] In the item classification model training method provided in one embodiment of this disclosure, video sample data can be acquired and input into the model to be trained. The feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text can be acquired. Based on the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text, the text feature vector corresponding to the video sample data is determined. The image feature vector corresponding to the video sample data is acquired based on the image data in the video sample data. The image feature vector and the text feature vector are fused to obtain a first fused feature vector corresponding to the video sample data. The predicted item category is obtained based on the first fused feature vector corresponding to the video sample data. The neural network parameters of the model to be trained are updated based on the item category label and the predicted item category to obtain the item classification model.

[0031] The embodiments disclosed herein can integrate semantic information corresponding to different text feature vectors, taking into account the impact of the semantic information corresponding to each text on the classification results, thereby improving the accuracy of the classification results output by the model and ensuring more accurate product recommendations.

[0032] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Attached Figure Description

[0033] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure. It is obvious that the drawings described below are merely some embodiments of this disclosure, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort. In the drawings:

[0034] Figure 1 schematically illustrates an exemplary system architecture for item classification model training in an exemplary embodiment of the present disclosure;

[0035] Figure 2 schematically illustrates a flowchart of an item classification model training method in an exemplary embodiment of the present disclosure;

[0036] Figure 3 schematically illustrates a flowchart of determining the text feature vector corresponding to video sample data based on overall semantic information and the feature vectors of descriptive text, speech text, and image text in an exemplary embodiment of this disclosure;

[0037] Figure 4 schematically illustrates a flowchart of updating the neural network parameters of the model to be trained according to a first loss function of the model to be trained in an exemplary embodiment of the present disclosure;

[0038] Figure 5 schematically illustrates a flowchart of updating the neural network parameters of the model to be trained using an overall loss function to obtain an item classification model in an exemplary embodiment of this disclosure;

[0039] Figure 6 schematically illustrates a flowchart of determining a second loss function of a model to be trained based on an item category pseudo-label and a predicted item category in an exemplary embodiment of this disclosure;

[0040] Figure 7 schematically illustrates a method for obtaining the first fusion feature corresponding to video sample data in an exemplary embodiment of this disclosure;

[0041] Figure 8 schematically illustrates a flowchart in an exemplary embodiment of the present disclosure;

[0042] Figure 9 schematically illustrates the composition of an item classification model training device according to an exemplary embodiment of the present disclosure;

[0043] Figure 10 schematically illustrates the composition of an item classification device according to an exemplary embodiment of the present disclosure;

[0044] Figure 11 schematically illustrates the structure of a computer system suitable for implementing an electronic device according to exemplary embodiments of the present disclosure.

Detailed Implementation

[0045] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this disclosure more comprehensive and complete, and to fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of this disclosure. However, those skilled in the art will recognize that the technical solutions of this disclosure can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known structures, methods, apparatuses, implementations, materials, or operations are not shown or described in detail to avoid obscuring various aspects of this disclosure.

[0046] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, or in one or more software-hardened modules, or in different network and/or processor devices and/or microcontroller devices.

[0047] Figure 1 shows a schematic diagram of an exemplary system architecture to which the item classification model training method or the item classification method of embodiments of this disclosure can be applied.

[0048] As shown in Figure 1, system architecture 1000 may include one or more of terminal devices 1001, 1002, and 1003, network 1004, and server 1005. Network 1004 serves as the medium that provides a communication link between terminal devices 1001, 1002, and 1003 and server 1005. Network 1004 may include various connection types, such as wired or wireless communication links or fiber optic cables.

[0049] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included. For example, server 1005 could be a server cluster composed of multiple servers.

[0050] Users can use terminal devices 1001, 1002, and 1003 to interact with server 1005 via network 1004 to receive or send messages, etc. Terminal devices 1001, 1002, and 1003 can be various electronic devices with displays, including but not limited to smartphones, tablets, laptops, and desktop computers. Additionally, server 1005 can be a server providing various services.

[0051] In one embodiment, the execution entity of the item classification model training method of this disclosure can be a server 1005. The server 1005 can acquire video sample data sent by terminal devices 1001, 1002, and 1003, and input the video sample data into the model to be trained. It can acquire the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, and the feature vector of the image text and the third semantic information of the image text. Based on the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, and the feature vector of the image text and the third semantic information of the image text, it can determine the text feature vector corresponding to the video sample data. Based on the image data in the video sample data, it can acquire the image feature vector corresponding to the video sample data. It can fuse the image feature vector and the text feature vector to obtain the first fused feature vector corresponding to the video sample data. Based on the first fused feature vector corresponding to the video sample data, it can obtain the predicted item category. Based on the item category label and the predicted item category, it can update the neural network parameters of the model to be trained to obtain the item classification model. 
In addition, the item classification model training method disclosed herein can be executed through terminal devices 1001, 1002, 1003, etc., to achieve the process of fusing image feature vectors and text feature vectors to obtain a first fused feature vector corresponding to video sample data, obtaining a predicted item category based on the first fused feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained based on the item category label and the predicted item category to obtain the item classification model.

[0052] Furthermore, the training process of the item classification model disclosed herein can also be jointly implemented by terminal devices 1001, 1002, and 1003 and server 1005. For example, terminal devices 1001, 1002, and 1003 can acquire video sample data and send it to server 1005. Server 1005 can then input the video sample data into the model to be trained; acquire the feature vector and the first semantic information of the descriptive text, the feature vector and the second semantic information of the speech text, and the feature vector and the third semantic information of the image text; determine the text feature vector corresponding to the video sample data based on these feature vectors and semantic information; acquire the image feature vector corresponding to the video sample data based on the image data in the video sample data; fuse the image feature vector and the text feature vector to obtain the first fused feature vector corresponding to the video sample data; obtain the predicted item category based on the first fused feature vector; and update the neural network parameters of the model to be trained based on the item category label and the predicted item category, so as to obtain the item classification model.

[0053] With the rapid development of the internet, short video e-commerce is becoming increasingly widespread. Merchants can showcase, introduce, and sell products featured in short videos. In some scenarios, products in short videos can be identified, and relevant links can be provided for potential buyers to view them.

[0054] In related technologies, image feature vectors can be extracted using an image encoder, and text feature vectors can be extracted using a text encoder. These image and text feature vectors are then fused to obtain product feature vectors. These product feature vectors are then used for classification to determine the category of the products in the short video. For example, the category of products in a short video can be identified, and similar products can be recommended to users based on their preferences.

[0055] However, due to the uneven distribution of image or text feature vectors in short videos, the classification results output by the model may be inaccurate, leading to errors in product recommendations.

[0056] According to the training method of the item classification model provided in this exemplary embodiment, video sample data can be acquired and input into the model to be trained. The feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text can be acquired, and the text feature vector corresponding to the video sample data can be determined based on them. An image feature vector corresponding to the video sample data can be acquired based on the image data in the video sample data, and the image feature vector and the text feature vector can be fused to obtain a first fused feature vector corresponding to the video sample data. The predicted item category can be obtained based on the first fused feature vector, and the neural network parameters of the model to be trained can be updated based on the item category label and the predicted item category to obtain the item classification model. As shown in Figure 2, the training method for this item classification model may include the following steps S210 to S250:

[0057] Step S210: Obtain video sample data, wherein the video sample data includes text data, image data, and item category labels, and the text data includes descriptive text, speech text, and image text.

[0058] Step S220: Input video sample data into the model to be trained, obtain the feature vector of the descriptive text and the first semantic information of the descriptive text, obtain the feature vector of the speech text and the second semantic information of the speech text, and obtain the feature vector of the image text and the third semantic information of the image text.

[0059] Step S230: Determine the text feature vector corresponding to the video sample data based on the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, the feature vector of the image text and the third semantic information of the image text.

[0060] Step S240: Obtain the image feature vector corresponding to the video sample data based on the image data in the video sample data, and fuse the image feature vector with the text feature vector to obtain the first fused feature vector corresponding to the video sample data.

[0061] Step S250: Obtain the predicted item category based on the first fusion feature vector corresponding to the video sample data, and update the neural network parameters of the model to be trained based on the item category label and the predicted item category to obtain the item classification model.

[0062] The embodiments disclosed herein can integrate semantic information corresponding to different text feature vectors, taking into account the impact of the semantic information corresponding to each text on the classification results, thereby improving the accuracy of the classification results output by the model and ensuring more accurate product recommendations.

[0063] The steps S210 to S250 of the item classification model training method in this exemplary embodiment will now be described in more detail with reference to the accompanying drawings and embodiments.

[0064] Step S210: Obtain video sample data, wherein the video sample data includes text data, image data, and item category labels, and the text data includes descriptive text, speech text, and image text.

[0065] In one example embodiment of this disclosure, video sample data can be acquired. This video sample data includes text data, image data, and item category labels. The text data includes descriptive text, speech text, and image text. Specifically, the video sample data can be video data captured by a camera module, or it can be manually synthesized video data. It should be noted that this disclosure does not impose any special limitations on the source of the video sample data.

[0066] Specifically, the text data can include descriptive text, speech text, and image text.

[0067] In one example embodiment of this disclosure, the descriptive text may include text used by the video creator or video administrator to explain, interpret, or guide the video. For example, the descriptive text may include a video title, video summary, video introduction, etc. For instance, when a video creator uploads the video, they can add a video introduction, such as "This video mainly reviews xx brand mobile phones," which serves as the descriptive text for the video.

[0068] Furthermore, video sample data can include multiple sets of descriptive text. For example, the video title and video description corresponding to the video data can both serve as descriptive text for that video sample data.

[0069] It should be noted that this disclosure does not impose any special restrictions on the specific type of descriptive text.

[0070] In one exemplary embodiment of this disclosure, speech text refers to text obtained by speech recognition of audio information in video data. For example, speech text may include text obtained by speech recognition of audio information (e.g., human voice) in a video.

[0071] In one exemplary embodiment of this disclosure, image text refers to text obtained by performing text recognition on the text information that appears in the video data. For example, video data can be segmented into multiple frames, and text recognition can be performed on these frames to obtain image text.

[0072] In one example embodiment of this disclosure, the video sample data also includes item category labels. Specifically, the item category labels indicate the categories of the items in the video data corresponding to the video sample data. For example, the item category labels may be hat, handbag, and knitwear.

[0073] Furthermore, video data can be used to indicate multiple items, so multiple item category labels can be configured for the video data.

[0074] Furthermore, a single item indicated in the video data can correspond to multiple item category labels, so multiple item category labels can be configured for the video data. For example, if the item indicated in the video data is a crossbody bag, multiple item category labels can be configured for the video data, such as bags, shoulder bags, women's bags, etc.

[0075] It should be noted that this disclosure does not impose any special restrictions on the specific number or form of item category labels.
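As an illustration only, the video sample data described in step S210 might be represented by a simple record such as the one below. The class name, field names, and example values are hypothetical and are not specified by this disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoSample:
    """One training sample as described in step S210 (all names are hypothetical)."""
    descriptive_texts: List[str]   # e.g. video title and introduction; several entries allowed
    speech_text: str               # text obtained by speech recognition of the audio
    image_text: str                # text obtained by text recognition on the frames
    frames: List[List[float]]      # placeholder for per-frame image data
    category_labels: List[str] = field(default_factory=list)  # one or more item category labels


# Made-up example: one item (a crossbody bag) carrying several category labels.
sample = VideoSample(
    descriptive_texts=["Crossbody bag unboxing", "A short look at a new crossbody bag"],
    speech_text="today we open the new bag",
    image_text="new arrival",
    frames=[[0.1, 0.2], [0.3, 0.4]],
    category_labels=["bags", "shoulder bags", "women's bags"],
)
```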

[0076] Step S220: Input video sample data into the model to be trained, obtain the feature vector of the descriptive text and the first semantic information of the descriptive text, obtain the feature vector of the speech text and the second semantic information of the speech text, and obtain the feature vector of the image text and the third semantic information of the image text.

[0077] In one example embodiment of this disclosure, after obtaining video sample data through the above steps, the video sample data can be input into the model to be trained. Specifically, the model to be trained refers to a model established to complete the item classification task. An item classification model can be obtained by training the model to be trained, thereby completing the item classification task. It should be noted that this disclosure does not impose any special limitations on the specific structure of the model to be trained.

[0078] Furthermore, after obtaining the video sample data, the descriptive text, speech text, and image text in the video sample data can be obtained through the model to be trained.

[0079] For example, the descriptive text, speech text, and image text corresponding to video sample data can be obtained through the text recognition sub-model in the model to be trained. For instance, the speech text corresponding to video sample data can be obtained through the speech recognition sub-model; or, the image text corresponding to video sample data can be obtained through the text recognition sub-model.

[0080] It should be noted that this disclosure does not impose any special limitations on the specific methods for obtaining descriptive text, speech text, and image text in video sample data.

[0081] In one exemplary embodiment of this disclosure, after obtaining the descriptive text, speech text, and image text from the video sample data, the feature vector and first semantic information of the descriptive text can be obtained through the model to be trained; the feature vector and second semantic information of the speech text can be obtained; and the feature vector and third semantic information of the image text can be obtained. Specifically, the first semantic information of the descriptive text can be used to indicate the semantics of the descriptive text of the video data corresponding to the video sample data, and this semantics integrates the meaning of each word / phrase in the descriptive text; the second semantic information of the speech text can be used to indicate the semantics of the speech text of the video data corresponding to the video sample data, and this semantics integrates the meaning of each word / phrase in the speech text; and the third semantic information of the image text can be used to indicate the semantics of the image text of the video data corresponding to the video sample data, and this semantics integrates the meaning of each word / phrase in the image text.

[0082] Specifically, the first semantic information of the descriptive text, the second semantic information of the speech text, and the third semantic information of the image text are in the form of feature vectors. For example, the first semantic information of the descriptive text is a [CLS] embedding (a vector containing semantic information).

[0083] For example, the first text encoder in the model to be trained can convert the descriptive text in the video sample data into a feature vector of the descriptive text and extract the first semantic information of the descriptive text. The second text encoder in the model to be trained can convert the speech text in the video sample data into a feature vector of the speech text and extract the second semantic information of the speech text. The third text encoder in the model to be trained can convert the image text in the video sample data into a feature vector of the image text and extract the third semantic information of the image text. For example, the feature vector of the descriptive text can be in the form of token embeddings.
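The relationship between token embeddings (the feature vector) and a sentence-level semantic vector can be sketched with a toy encoder. This is only a stand-in under stated assumptions: `embed_token`, `encode`, and `DIM` are hypothetical, the embeddings are pseudo-random rather than learned, the mean of the token embeddings is used in place of a real [CLS] embedding, and a single function is reused where the disclosure describes three separate text encoders.

```python
import random
from typing import List, Tuple

DIM = 4  # hypothetical embedding size


def embed_token(token: str) -> List[float]:
    """Toy embedding: a pseudo-random vector that is fixed for a given token."""
    rng = random.Random(token)  # seeding with the token string makes the vector repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def encode(text: str) -> Tuple[List[List[float]], List[float]]:
    """Return (token embeddings, sentence-level semantic vector) for one text."""
    tokens = text.lower().split()
    if not tokens:
        return [], [0.0] * DIM
    token_embeddings = [embed_token(t) for t in tokens]
    # Stand-in for the [CLS] embedding: the mean of the token embeddings.
    semantic = [sum(col) / len(tokens) for col in zip(*token_embeddings)]
    return token_embeddings, semantic


desc_tokens, first_semantic = encode("This video mainly reviews xx brand mobile phones")
speech_tokens, second_semantic = encode("today we look at the phone")
image_tokens, third_semantic = encode("xx phone")
```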

[0084] It should be noted that this disclosure does not impose any special limitations on the specific methods for obtaining the feature vector and first semantic information of the descriptive text, obtaining the feature vector and second semantic information of the speech text, and obtaining the feature vector and third semantic information of the image text.

[0085] Step S230: Determine the text feature vector corresponding to the video sample data based on the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, the feature vector of the image text and the third semantic information of the image text.

[0086] In one exemplary embodiment of this disclosure, after obtaining the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text through the above steps, the text feature vector corresponding to the video sample data can be determined based on the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text. Specifically, the text feature vector corresponding to the video sample data can be used to indicate the text information indicated by the video data corresponding to the video sample data, which includes the descriptive text, speech text, and image text information of the video data.

[0087] Specifically, the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, and the feature vector of the image text and the third semantic information of the image text can be directly concatenated to obtain the text feature vector corresponding to the video sample data.

[0088] Alternatively, different fusion weights can be assigned to the feature vectors of different texts. Based on the fusion weights corresponding to the descriptive text, the speech text, and the image text, the feature vectors of the descriptive text and the first semantic information of the descriptive text, the feature vectors of the speech text and the second semantic information of the speech text, and the feature vectors of the image text and the third semantic information of the image text can be directly fused to obtain the text feature vectors corresponding to the video sample data.

[0089] Alternatively, different fusion weights can be assigned to the semantic information of different texts. Based on the fusion weights corresponding to the first semantic information, the second semantic information, and the third semantic information, the feature vectors of the descriptive text and the first semantic information of the descriptive text, the feature vectors of the speech text and the second semantic information of the speech text, and the feature vectors of the image text and the third semantic information of the image text can be directly fused to obtain the text feature vectors corresponding to the video sample data.
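The concatenation and weighted-fusion options described for step S230 can be sketched as follows. This is a minimal sketch under assumptions: each text is reduced to one fixed-length feature vector and one semantic vector, the weighted variant is implemented as a weighted sum (the disclosure does not fix the fusion operation), and all function names, keys, and values are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
TEXT_TYPES = ("descriptive", "speech", "image")


def concat_fusion(features: Dict[str, Vector], semantics: Dict[str, Vector]) -> Vector:
    """Directly concatenate each text's feature vector and semantic information."""
    fused: Vector = []
    for name in TEXT_TYPES:
        fused += features[name] + semantics[name]
    return fused


def weighted_fusion(features: Dict[str, Vector], semantics: Dict[str, Vector],
                    weights: Dict[str, float]) -> Vector:
    """Weighted sum of the per-text vectors (assumes all texts share one dimension)."""
    dim = len(features[TEXT_TYPES[0]]) + len(semantics[TEXT_TYPES[0]])
    fused = [0.0] * dim
    for name in TEXT_TYPES:
        combined = features[name] + semantics[name]
        fused = [f + weights[name] * c for f, c in zip(fused, combined)]
    return fused


features = {"descriptive": [1.0, 0.0], "speech": [0.0, 1.0], "image": [1.0, 1.0]}
semantics = {"descriptive": [2.0], "speech": [4.0], "image": [6.0]}
weights = {"descriptive": 0.5, "speech": 0.25, "image": 0.25}

concatenated = concat_fusion(features, semantics)
weighted = weighted_fusion(features, semantics, weights)
```

Concatenation keeps every component but grows with the number of texts, whereas the weighted sum keeps the dimension fixed and lets the weights control each text's contribution.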

[0090] It should be noted that this disclosure does not impose any special limitations on the specific method of determining the text feature vector corresponding to the video sample data based on the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, the feature vector of the image text and the third semantic information of the image text.

[0091] Step S240: Obtain the image feature vector corresponding to the video sample data based on the image data in the video sample data, and fuse the image feature vector with the text feature vector to obtain the first fused feature vector corresponding to the video sample data.

[0092] In one example embodiment of this disclosure, an image feature vector corresponding to the video sample data can be obtained based on the image data in the video sample data. Specifically, the image feature vector corresponding to the video sample data refers to the image feature vector of the video data corresponding to the video sample data. The image feature vector corresponding to the video sample data can be used to indicate the image information of the video data corresponding to the video sample data.

[0093] For example, the video data corresponding to the video sample data can be converted into the image feature vector corresponding to the video sample data through the image encoder in the model to be trained.

[0094] Furthermore, after obtaining the video sample data, the video data corresponding to the video sample data can be first divided into multiple frames of images, and then the image feature vectors corresponding to the video sample data can be obtained based on the multiple frames of images.
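One possible way to go from multiple frames to a single image feature vector is sketched below. The disclosure leaves the method open, so the fixed sampling stride, the identity `encode_frame` placeholder, and the mean pooling over frames are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def sample_frames(frames: List[Vector], stride: int) -> List[Vector]:
    """Keep every `stride`-th frame of the video."""
    return frames[::stride]


def encode_frame(frame: Vector) -> Vector:
    """Hypothetical placeholder for the image encoder (identity mapping here)."""
    return list(frame)


def image_feature_vector(frames: List[Vector], stride: int = 2) -> Vector:
    """Encode the sampled frames and average them into one vector."""
    encoded = [encode_frame(f) for f in sample_frames(frames, stride)]
    return [sum(col) / len(encoded) for col in zip(*encoded)]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
image_vector = image_feature_vector(frames, stride=2)
```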

[0095] It should be noted that this disclosure does not impose any special limitations on the specific method of obtaining the image feature vectors corresponding to the video sample data or the specific form of the image feature vectors corresponding to the video sample data.

[0096] In one exemplary embodiment of this disclosure, after obtaining the text feature vector and the image feature vector corresponding to the video sample data through the above steps, the image feature vector and the text feature vector can be fused to obtain the first fused feature vector corresponding to the video sample data. Specifically, the text feature vector and the image feature vector corresponding to the video sample data can be directly fused to obtain the first fused feature vector corresponding to the video sample data. For example, the text feature vector and the image feature vector corresponding to the video sample data can be fused by dot product.

[0097] For example, the decoder structure in the transformer model can be used to fuse image feature vectors and text feature vectors to obtain the first fused feature vector corresponding to the video sample data. For instance, the text feature vector can be used as the query, and the image feature vector can be used as the key and value to obtain the first fused feature vector corresponding to the video sample data.
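The query/key/value arrangement described above corresponds to cross-attention. The following is a minimal single-head sketch of scaled dot-product attention with the text vectors as queries and the image vectors as keys and values. It omits the learned projection matrices, residual connections, and normalization that a full transformer decoder layer would include, and the function names and inputs are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """For each query row, return an attention-weighted sum of the value rows."""
    d = len(key[0])
    fused: Matrix = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        attn = softmax(scores)
        fused.append([sum(a * v[j] for a, v in zip(attn, value))
                      for j in range(len(value[0]))])
    return fused


text_vectors = [[1.0, 0.0], [0.0, 1.0]]                # queries (text feature vectors)
image_vectors = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]   # keys and values (image feature vectors)
first_fused = cross_attention(text_vectors, image_vectors, image_vectors)
```

The output has one row per text vector, each row being a mixture of image vectors, so the fused result stays aligned with the text sequence.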

[0098] Alternatively, the text feature vectors and image feature vectors corresponding to the video sample data can be fused according to the fusion weights corresponding to the text feature vectors and the fusion weights corresponding to the image feature vectors to obtain the first fused feature vector corresponding to the video sample data.

[0099] It should be noted that this disclosure does not impose any special limitations on the specific method of fusing image feature vectors and text feature vectors to obtain the first fused feature vector corresponding to the video sample data.

[0100] Step S250: Obtain the predicted item category based on the first fusion feature vector corresponding to the video sample data, and update the neural network parameters of the model to be trained based on the item category label and the predicted item category to obtain the item classification model.

[0101] In one example embodiment of this disclosure, after obtaining the first fusion feature vector corresponding to the video sample data through the above steps, the predicted item category can be obtained based on the first fusion feature vector corresponding to the video sample data. Specifically, the model to be trained can be used to obtain the predicted item category based on the first fusion feature vector corresponding to the video sample data, and the predicted item category can be used to indicate the predicted category of the item in the video data corresponding to the video sample data.

[0102] Specifically, multiple predicted item categories can be obtained based on the first fusion feature vector corresponding to the video sample data. For example, the video data corresponding to the video sample data may include multiple items, each corresponding to a different item category. In this case, multiple predicted item categories corresponding to the multiple items can be obtained based on the first fusion feature vector corresponding to the video sample data. Alternatively, the items indicated in the video data may correspond to multiple item category labels. Therefore, multiple item category labels can be configured for the video data. In this case, multiple predicted item categories corresponding to one item can be obtained based on the first fusion feature vector corresponding to the video sample data.
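Obtaining several predicted item categories from one fused vector can be sketched as a multi-label decision. The disclosure does not specify how multiple categories are produced, so the per-category sigmoid score, the 0.5 threshold, and the example logits below are assumptions.

```python
import math
from typing import Dict, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_categories(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every category whose independent sigmoid score reaches the threshold."""
    return [name for name, z in logits.items() if sigmoid(z) >= threshold]


# Hypothetical per-category scores computed from the first fused feature vector.
logits = {"bags": 2.0, "shoulder bags": 0.3, "hats": -1.5}
predicted = predict_categories(logits)
```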

[0103] In one example embodiment of this disclosure, the model to be trained may include multiple hidden layers, which may include convolutional layers, normalization layers, activation layers, etc. The first fusion feature vector corresponding to the video sample data can be sequentially input into the multiple hidden layers of the model to be trained to obtain the hidden layer calculation results, and the predicted item category can be obtained through the hidden layer calculation results.

[0104] It should be noted that this disclosure does not impose any special limitations on the specific method of predicting the item category based on the first fusion feature vector corresponding to the video sample data.

[0105] In one exemplary embodiment of this disclosure, after obtaining the predicted item category through the above steps, the neural network parameters of the model to be trained can be updated based on the item category label and the predicted item category to obtain an item classification model. Specifically, the predicted item category can be used to indicate the predicted category of items in the video data corresponding to the video sample data. The predicted item category is a predicted value. At this time, the true value in the video sample data, i.e., the item category label, can be obtained. This item category label can be used to indicate the true category of items in the video data corresponding to the video sample data. Then, the predicted item category (predicted value) can be compared with the item category label (true value) to obtain the prediction difference between the predicted item category (predicted value) and the item category label (true value). The neural network parameters of the model to be trained can be updated based on this prediction difference to obtain an item classification model.

[0106] Specifically, the neural network parameters of the model to be trained may include the number of model layers, the number of feature vector channels, and the learning rate. When updating the neural network parameters of the model to be trained based on the prediction difference, the number of model layers, the number of feature vector channels, and the learning rate of the model to be trained can be updated to train the item classification model.

[0107] In one example embodiment of this disclosure, the neural network parameters of the model to be trained can be updated using the backpropagation algorithm, and an item classification model can be obtained after training.
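The comparison of predicted value and true value, followed by a parameter update, can be illustrated with a single linear layer. This is a sketch under assumptions: cross-entropy is used as the prediction difference, only the weight values are updated, and the learning rate and inputs are made up. For one linear layer, backpropagation reduces to the gradient written in the comment.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights: List[List[float]], x: List[float], label: int, lr: float) -> float:
    """Run one forward and backward pass; update `weights` in place and return the loss."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[label])  # cross-entropy between prediction and label
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to logit k: p_k - 1[k == label].
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad * x[j]
    return loss


weights = [[0.0, 0.0] for _ in range(3)]   # 3 categories, 2 input features
fused_vector = [1.0, 2.0]                  # stands in for the first fused feature vector
losses = [train_step(weights, fused_vector, label=1, lr=0.1) for _ in range(5)]
```

With these inputs the loss starts at ln 3 (a uniform prediction over three categories) and decreases over the five steps.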

[0108] It should be noted that this disclosure does not impose any special limitations on the specific method of updating the neural network parameters of the model to be trained based on the item category label and the predicted item category.

[0109] In one example embodiment of this disclosure, the neural network parameters of the model to be trained can be updated based on the item category label and the predicted item category. When the model to be trained meets the convergence condition, it is determined to be the item classification model. Specifically, meeting the convergence condition means that the model to be trained has high prediction accuracy and can be put into use. For example, the convergence condition may include the number of training iterations, such as ending training after N iterations; or the convergence condition may include the training duration, such as ending training after a training duration of T.
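The two example convergence conditions, an iteration budget N and a duration budget T, can be sketched as a simple training loop guard. The function and parameter names are hypothetical, and `step_fn` stands in for one parameter update.

```python
import time
from typing import Callable, Optional


def train_until_converged(step_fn: Callable[[], None],
                          max_iterations: Optional[int] = None,
                          max_seconds: Optional[float] = None) -> int:
    """Call `step_fn` until N iterations or T seconds are reached; return the step count."""
    if max_iterations is None and max_seconds is None:
        raise ValueError("at least one stopping condition is required")
    start = time.monotonic()
    iterations = 0
    while True:
        if max_iterations is not None and iterations >= max_iterations:
            break
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        step_fn()
        iterations += 1
    return iterations


calls = []
steps_run = train_until_converged(lambda: calls.append(1), max_iterations=3)
```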

[0110] It should be noted that this disclosure does not impose any special limitations on the specific content of the convergence conditions. By applying convergence conditions to the model, the training process of the model to be trained can be better controlled, the problem of overtraining of neural networks can be avoided, and the training efficiency of the model to be trained can be improved.

[0111] In one example embodiment of this disclosure, the first semantic information, the second semantic information, and the third semantic information can be merged into overall semantic information based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information. Then, the text feature vector corresponding to the video sample data can be determined based on the overall semantic information and the feature vectors of the descriptive text, the speech text, and the image text. As shown in Figure 3, determining the text feature vector corresponding to the video sample data based on the overall semantic information and the feature vectors of the descriptive text, the speech text, and the image text may include the following steps S310 to S320:

[0112] Step S310: Based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information, merge the first semantic information, the second semantic information, and the third semantic information into the overall semantic information.

[0113] In one exemplary embodiment of this disclosure, after obtaining the first semantic information of the descriptive text, the second semantic information of the speech text, and the third semantic information of the image text through the above steps, a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information, and a third weight corresponding to the third semantic information can be obtained. The first semantic information, the second semantic information, and the third semantic information are then merged into overall semantic information based on these weights. Specifically, since different types of text have different focuses, different weights can be assigned to the semantic information corresponding to different types of text to better improve the accuracy of the text feature vectors corresponding to the video sample data.

[0114] For example, among the descriptive text, speech text, and image text in video sample data, the descriptive text is usually the most representative, and its content is most correlated with the video data. Therefore, the descriptive text is of high importance to the text feature vector corresponding to the video sample data, meaning that the first semantic information of the descriptive text can be given a higher weight. On the other hand, the speech text may contain useless text (such as speech text converted from background music in the video data), meaning that the speech text is less correlated with the video data. Therefore, the speech text is of low importance to the text feature vector corresponding to the video sample data, meaning that the second semantic information of the speech text can be given a lower weight.

[0115] By assigning different weights to the first semantic information, the second semantic information, and the third semantic information, different degrees of emphasis can be placed on descriptive text, speech text, and image text, so as to control the degree of contribution of different texts to the text feature vectors corresponding to the video sample data.

[0116] For example, the first semantic information, the second semantic information, and the third semantic information can be merged into the overall semantic information by adding attention weights.
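Merging the three pieces of semantic information with attention weights can be sketched as a softmax over per-text scores followed by a weighted sum. In a trained model the scores would be learned, so the fixed scores and vectors below are hypothetical values chosen so that the descriptive text receives the largest weight and the speech text the smallest.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge_semantics(semantics: List[Vector], scores: List[float]) -> Vector:
    """Weighted sum of the semantic vectors; the weights are softmax(scores)."""
    weights = softmax(scores)
    return [sum(w * s[j] for w, s in zip(weights, semantics))
            for j in range(len(semantics[0]))]


first_semantic = [3.0, 0.0]    # descriptive text
second_semantic = [0.0, 3.0]   # speech text
third_semantic = [0.0, 0.0]    # image text
scores = [2.0, 0.5, 1.0]       # hypothetical attention scores

attention_weights = softmax(scores)
overall_semantic = merge_semantics([first_semantic, second_semantic, third_semantic], scores)
```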

[0117] It should be noted that this disclosure does not impose any special limitations on the specific values of the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information, or on the specific method of merging the first semantic information, the second semantic information, and the third semantic information into the overall semantic information based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information.

[0118] Step S320: Determine the text feature vector corresponding to the video sample data based on the overall semantic information and the feature vectors of the descriptive text, the speech text, and the image text.

[0119] In one exemplary embodiment of this disclosure, after obtaining the overall semantic information through the above steps, the text feature vector corresponding to the video sample data can be determined based on the overall semantic information and the feature vectors of the descriptive text, the speech text, and the image text. Specifically, the overall semantic information can be used as the semantic information of the video text (descriptive text, speech text, and image text), and then the feature vectors of the descriptive text, the speech text, and the image text can be concatenated with the overall semantic information to obtain the text feature vector corresponding to the video sample data.

[0120] Furthermore, different weights can be assigned to the feature vectors of descriptive text, speech text, and image text respectively. Based on the weights corresponding to the feature vectors of descriptive text, speech text, and image text, the feature vectors of descriptive text, speech text, and image text are merged to obtain the overall feature vector. Then, the overall semantic information is merged with the overall feature vector to obtain the text feature vector corresponding to the video sample data.

[0121] It should be noted that this disclosure does not impose any special limitations on the specific method of determining the text feature vector corresponding to the video sample data based on the overall semantic information and the feature vectors of the descriptive text, the feature vectors of the speech text, and the feature vectors of the image text.
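As an illustration of the concatenation variant of step S320, the following sketch places the overall semantic information at the front of the sequence and appends the token-level feature vectors of the three texts. The ordering, the helper name `build_text_features`, and the sample values are assumptions made for this example only.

```python
def build_text_features(overall_semantic, token_groups):
    """Prepend the overall semantic vector to the concatenated token vectors."""
    sequence = [overall_semantic]
    for tokens in token_groups:
        sequence.extend(tokens)
    return sequence

# Hypothetical 2-dimensional token feature vectors for each text.
desc_tokens = [[0.1, 0.2], [0.3, 0.4]]
speech_tokens = [[0.5, 0.6]]
image_tokens = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]
overall_semantic = [0.0, 1.0]

text_features = build_text_features(
    overall_semantic, [desc_tokens, speech_tokens, image_tokens]
)
```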

[0122] Through the aforementioned steps S310-S320, the first semantic information, the second semantic information, and the third semantic information can be merged into overall semantic information based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information. Then, based on the overall semantic information and the feature vectors of the descriptive text, the speech text, and the image text, the text feature vector corresponding to the video sample data is determined. Through the embodiments of this disclosure, different weights can be assigned to different types of text, with different emphases for different types of text, so that the trained item classification model has higher accuracy.

[0123] In one example embodiment of this disclosure, a first loss function for the model to be trained can be determined based on the item category label and the predicted item category, and the neural network parameters of the model to be trained can be updated according to the first loss function. As shown in Figure 4, updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained may include the following steps S410 to S420:

[0124] Step S410: Determine the first loss function of the model to be trained based on the item category label and the predicted item category;

[0125] Step S420: Update the neural network parameters of the model to be trained according to the first loss function of the model to be trained.

[0126] In one example embodiment of this disclosure, after obtaining the predicted item category through the above steps, a first loss function for the model to be trained can be determined based on the item category label and the predicted item category. Specifically, the predicted item category obtained by the model to be trained can be compared with the item category label in the video sample data, the prediction difference between the predicted item category obtained by the model to be trained and the item category label can be calculated, and the first loss function for the model to be trained can be determined based on the prediction difference.

[0127] In one example embodiment of this disclosure, after obtaining the first loss function of the model to be trained through the above steps, the model to be trained can be trained using the first loss function. For example, the training gradient can be calculated using the first loss function, and the neural network parameters of the model to be trained can be updated using the training gradient to obtain the item classification model.

[0128] It should be noted that this disclosure does not impose any special restrictions on the specific form of the first loss function or the specific method of determining the first loss function of the model to be trained based on the item category label and the predicted item category.

[0129] Through the above steps S410 to S420, the first loss function of the model to be trained can be determined based on the item category label and the predicted item category, and the neural network parameters of the model to be trained can be updated based on the first loss function.

[0130] In one example embodiment of this disclosure, the first loss function is an asymmetric loss function. The video sample data includes positive and negative samples. In the asymmetric loss function, the exponential coefficient for negative samples is greater than that for positive samples, and a negative sample is removed when the predicted probability of the predicted item category corresponding to that negative sample is less than a preset threshold. Specifically, the first loss function may be a loss function based on focal loss in which positive and negative samples are assigned different exponential coefficients, i.e., the exponential coefficient for negative samples is greater than that for positive samples. Furthermore, negative samples that are easy to discriminate are removed, so that when the neural network parameters of the model to be trained are updated according to the first loss function, more attention is paid to difficult negative samples, thereby improving the classification accuracy of the item classification model. The expression of the first loss function in this embodiment is as follows, where L_classify is the first loss function, γ+ is the exponential coefficient of positive samples, γ- is the exponential coefficient of negative samples, y is the true value, p is the predicted value, and m is the preset threshold.

[0131] $$L_{classify} = -y\,(1-p)^{\gamma_+}\log(p) - (1-y)\,\max(p-m,\,0)^{\gamma_-}\log(1-p)$$
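The first loss function can be sketched for a single sample and a single item category as follows. The sketch reads the negative term as max(p - m, 0) raised to the power γ-, as the symbol definitions suggest. The default values of `gamma_pos`, `gamma_neg`, and `m`, and the small constant `eps` used to keep the logarithm finite, are illustrative assumptions; this disclosure does not fix them.

```python
import math

def asymmetric_loss(y, p, gamma_pos=1.0, gamma_neg=4.0, m=0.05, eps=1e-8):
    """-y (1-p)^gamma_pos log(p) - (1-y) max(p-m, 0)^gamma_neg log(1-p)."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    positive_term = -y * (1.0 - p) ** gamma_pos * math.log(p)
    # Negatives with p below the threshold m contribute exactly zero.
    negative_term = -(1.0 - y) * max(p - m, 0.0) ** gamma_neg * math.log(1.0 - p)
    return positive_term + negative_term

easy_negative = asymmetric_loss(y=0, p=0.03)  # below m, so it is removed
hard_negative = asymmetric_loss(y=0, p=0.90)  # confidently wrong, large loss
positive = asymmetric_loss(y=1, p=0.90)
```

Because the exponent for negative samples is larger, negatives with a low predicted probability are suppressed much more strongly than positives, so the update is dominated by hard negatives and by positives.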

[0132] In one exemplary embodiment of this disclosure, video sample data can be input into a momentum model to obtain momentum feature vectors and fourth semantic information of descriptive text, momentum feature vectors and fifth semantic information of speech text, and momentum feature vectors and sixth semantic information of image text. Based on the momentum feature vectors and fourth semantic information of descriptive text, momentum feature vectors and fifth semantic information of speech text, and momentum feature vectors and sixth semantic information of image text, the momentum text feature vector corresponding to the video sample data is determined, and the momentum image feature vector corresponding to the video sample data is obtained. The momentum image feature vector and the momentum text feature vector are fused to obtain a second fused feature vector corresponding to the video sample data. The second fused feature vector of the video sample data is used to determine the second loss function of the model to be trained. The first loss function and the second loss function of the model to be trained are used to determine the overall loss function. The neural network parameters of the model to be trained are updated using the overall loss function to obtain an item classification model. As shown in Figure 5, updating the neural network parameters of the model to be trained using the overall loss function to obtain the item classification model may include the following steps S510 to S560:

[0133] Step S510: Input the video sample data into the momentum model;

[0134] In one example embodiment of this disclosure, video sample data can be input into a momentum model. Specifically, the momentum model is built with reference to the model to be trained, and its neural network parameters are updated iteratively to follow the changes in the neural network parameters of the model to be trained during the training process.

[0135] For example, suppose the momentum model and the model to be trained both start with neural network parameters A, B, and C. After the first training iteration, the neural network parameters of the model to be trained are updated to A1, B1, and C1; after the second, to A2, B2, and C2; and after the third, to A3, B3, and C3. At this point, the neural network parameters of the momentum model can be set to the average of the neural network parameters of the model to be trained over these iterations: A = (A1 + A2 + A3) / 3, B = (B1 + B2 + B3) / 3, and C = (C1 + C2 + C3) / 3. The neural network parameters of the momentum model continue to be updated in this way as the model to be trained iterates.

[0136] It should be noted that this disclosure does not impose any special limitations on the specific method by which the neural network parameters of the momentum model are updated in a sliding manner based on the neural network parameters of the model to be trained.
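The averaging example above can be written as a running mean, so that earlier parameter snapshots do not need to be stored. This is a minimal sketch: the parameters are represented as a dictionary of scalars, and the function name `update_momentum_params` and the sample values are hypothetical.

```python
def update_momentum_params(momentum, trained, step):
    """Running mean: after `step` updates, each momentum parameter equals
    the mean of all trained-model snapshots seen so far."""
    return {
        name: momentum[name] + (trained[name] - momentum[name]) / step
        for name in momentum
    }

# Hypothetical parameter snapshots of the model to be trained
# after the first, second, and third training iterations.
snapshots = [
    {"A": 1.0, "B": 10.0, "C": -3.0},
    {"A": 2.0, "B": 20.0, "C": -6.0},
    {"A": 3.0, "B": 30.0, "C": -9.0},
]

momentum = {"A": 0.0, "B": 0.0, "C": 0.0}
for step, trained in enumerate(snapshots, start=1):
    momentum = update_momentum_params(momentum, trained, step)
```

An exponential moving average, in which each update keeps a fixed fraction of the previous momentum parameters, is another common way to realize this kind of sliding update; the disclosure leaves the choice open.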

[0137] Step S520: Obtain the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text; obtain the momentum feature vector of the speech text and the fifth semantic information of the speech text; obtain the momentum feature vector of the image text and the sixth semantic information of the image text.

[0138] In one exemplary embodiment of this disclosure, momentum feature vectors and fourth semantic information of descriptive text can be obtained through a momentum model; momentum feature vectors and fifth semantic information of speech text can be obtained; and momentum feature vectors and sixth semantic information of image text can be obtained. Specifically, the text encoder in the momentum model can convert the text data (descriptive text, speech text, and image text) corresponding to the video sample data into momentum text feature vectors.

[0139] Specifically, the fourth semantic information of the descriptive text can be used to indicate the semantics of the descriptive text of the video data corresponding to the video sample data, and this semantics integrates the meaning of each word / phrase in the descriptive text; the fifth semantic information of the speech text can be used to indicate the semantics of the speech text of the video data corresponding to the video sample data, and this semantics integrates the meaning of each word / phrase in the speech text; the sixth semantic information of the image text can be used to indicate the semantics of the image text of the video data corresponding to the video sample data, and this semantics integrates the meaning of each word / phrase in the image text.

[0140] For example, the fourth text encoder in the momentum model can convert the descriptive text in the video sample data into a momentum text feature vector of the descriptive text and extract the fourth semantic information of the descriptive text. The fifth text encoder in the momentum model can convert the speech text in the video sample data into a momentum text feature vector of the speech text and extract the fifth semantic information of the speech text. Finally, the sixth text encoder in the momentum model can convert the image text in the video sample data into a momentum text feature vector of the image text and extract the sixth semantic information of the image text. For example, the momentum feature vector of the descriptive text can be in the form of a token embedding.

[0141] It should be noted that this disclosure does not impose any special limitations on the specific methods for obtaining the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, obtaining the momentum feature vector of the speech text and the fifth semantic information of the speech text, and obtaining the momentum feature vector of the image text and the sixth semantic information of the image text.

[0142] Step S530: Determine the momentum text feature vector corresponding to the video sample data based on the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, the momentum feature vector of the speech text and the fifth semantic information of the speech text, and the momentum feature vector of the image text and the sixth semantic information of the image text.

[0143] In one exemplary embodiment of this disclosure, after obtaining the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text through the above steps, the momentum text feature vector corresponding to the video sample data can be determined based on them. Specifically, the momentum text feature vector corresponding to the video sample data can be used to indicate the text information of the video data corresponding to the video sample data, which includes information about the descriptive text, speech text, and image text of the video data.

[0144] Specifically, the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, the momentum feature vector of the speech text and the fifth semantic information of the speech text, and the momentum feature vector of the image text and the sixth semantic information of the image text can be directly concatenated to obtain the momentum text feature vector corresponding to the video sample data.

[0145] Alternatively, different fusion weights can be assigned to the feature vectors of different texts. Based on the fusion weights corresponding to the descriptive text, the speech text, and the image text, the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text are fused to obtain the momentum text feature vector corresponding to the video sample data.

[0146] Alternatively, different fusion weights can be assigned to the semantic information of different texts. Based on the fusion weights corresponding to the fourth semantic information, the fifth semantic information, and the sixth semantic information, the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text are fused to obtain the momentum text feature vector corresponding to the video sample data.

[0147] It should be noted that this disclosure does not impose any special limitations on the specific method of determining the momentum text feature vector corresponding to the video sample data based on the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text.

[0148] Step S540: Obtain the momentum image feature vector corresponding to the video sample data, and fuse the momentum image feature vector with the momentum text feature vector to obtain the second fused feature vector corresponding to the video sample data.

[0149] In one example embodiment of this disclosure, the momentum image feature vector corresponding to the video sample data can be obtained. Specifically, the momentum image feature vector corresponding to the video sample data refers to the image feature vector of the video data corresponding to the video sample data. The momentum image feature vector corresponding to the video sample data can be used to indicate the image information of the video data corresponding to the video sample data.

[0150] For example, the image encoder in the model to be trained can be used to convert the video data corresponding to the video sample data into the momentum image feature vector corresponding to the video sample data.

[0151] Furthermore, after obtaining the video sample data, the video data corresponding to the video sample data can be first divided into multiple frames, and then the momentum image feature vector corresponding to the video sample data can be obtained based on the multiple frames.

[0152] It should be noted that this disclosure does not impose any special limitations on the specific method for obtaining the momentum image feature vector corresponding to the video sample data or the specific form of the momentum image feature vector corresponding to the video sample data.

[0153] In one exemplary embodiment of this disclosure, after obtaining the momentum text feature vector and the momentum image feature vector corresponding to the video sample data through the above steps, the momentum image feature vector and the momentum text feature vector can be fused to obtain a second fused feature vector corresponding to the video sample data. Specifically, the momentum text feature vector and the momentum image feature vector corresponding to the video sample data can be directly fused to obtain the second fused feature vector corresponding to the video sample data. For example, the momentum text feature vector and the momentum image feature vector corresponding to the video sample data can be fused by dot product.

[0154] For example, the momentum image feature vector and the momentum text feature vector can be fused using the decoder structure in the transformer model to obtain the second fused feature vector corresponding to the video sample data. For instance, the momentum text feature vector can be used as the query, and the momentum image feature vector can be used as the key and value to obtain the second fused feature vector corresponding to the video sample data.
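The query/key/value arrangement can be illustrated with a minimal single-head attention sketch in which the text vectors act as queries and the image vectors act as both keys and values. The learned projection matrices, multiple heads, and feed-forward layers of a real transformer decoder are omitted, and the function names and sample vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return fused

# Hypothetical momentum text vectors (queries) and image vectors (keys, values).
text_vectors = [[1.0, 0.0], [0.0, 1.0]]
image_vectors = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

second_fused = cross_attention(text_vectors, image_vectors, image_vectors)
```

Each output row corresponds to one text vector and is a mixture of image vectors, weighted by how strongly that text vector matches each image vector.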

[0155] Alternatively, the momentum text feature vector and the momentum image feature vector corresponding to the video sample data can be fused according to the fusion weights corresponding to the momentum text feature vector and the fusion weights corresponding to the momentum image feature vector to obtain the second fused feature vector corresponding to the video sample data.

[0156] It should be noted that this disclosure does not impose any special limitations on the specific method of fusing momentum image feature vectors and momentum text feature vectors to obtain the second fused feature vector corresponding to the video sample data.

[0157] Step S550: Determine the second loss function of the model to be trained based on the second fused feature vector of the video sample data;

[0158] In one exemplary embodiment of this disclosure, after obtaining predicted item categories through the model to be trained and obtaining pseudo-labels for item categories through the momentum model, a second loss function for the model to be trained can be determined based on the pseudo-labels and predicted item categories. Specifically, the second loss function is the loss function of the model to be trained that is derived from the output of the momentum model.

[0159] In one example embodiment of this disclosure, after obtaining the second loss function of the model to be trained through the above steps, the model to be trained can be trained using the second loss function of the model to be trained.

[0160] It should be noted that this disclosure does not impose any special limitations on the specific form of the second loss function or the specific method of determining the second loss function of the model to be trained based on the second fused feature vector of the video sample data.

[0161] Step S560: Determine the overall loss function based on the first loss function and the second loss function of the model to be trained, and update the neural network parameters of the model to be trained using the overall loss function to obtain the item classification model.

[0162] In one exemplary embodiment of this disclosure, after obtaining the first loss function and the second loss function through the above steps, the overall loss function can be determined using the first loss function and the second loss function of the model to be trained. The neural network parameters of the model to be trained are then updated using the overall loss function to obtain the item classification model. Specifically, the overall loss function can be obtained by summing the first loss function and the second loss function of the model to be trained.

[0163] It should be noted that this disclosure does not impose any special restrictions on the specific form of the overall loss function or the specific method of determining the overall loss function based on the first loss function and the second loss function of the model to be trained.

[0164] In one example embodiment of this disclosure, after obtaining the overall loss function of the model to be trained through the above steps, the model to be trained can be trained using the overall loss function. For example, the training gradient can be calculated using the overall loss function, and the neural network parameters of the model to be trained can be updated using the training gradient to obtain the item classification model.

[0165] It should be noted that this disclosure does not impose any special restrictions on the specific method of updating the neural network parameters of the model to be trained through the overall loss function.
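Steps S550 to S560 can be sketched on a toy model as follows. The sketch assumes a one-weight logistic model, uses plain cross-entropy as a stand-in for both the first and the second loss function, sums the two losses, and estimates the training gradient by a central finite difference. All of these choices, together with the names and values, are illustrative assumptions rather than the method of this disclosure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(target, p):
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def overall_loss(w, x, label, pseudo_label):
    """Sum of the loss against the label and the loss against the pseudo-label."""
    p = sigmoid(w * x)
    first = cross_entropy(label, p)          # stand-in for the first loss
    second = cross_entropy(pseudo_label, p)  # stand-in for the second loss
    return first + second

def gradient_step(w, x, label, pseudo_label, lr=0.1, h=1e-6):
    """One gradient-descent update using a central finite difference."""
    grad = (
        overall_loss(w + h, x, label, pseudo_label)
        - overall_loss(w - h, x, label, pseudo_label)
    ) / (2.0 * h)
    return w - lr * grad

w0 = 0.0
w1 = gradient_step(w0, x=1.0, label=1.0, pseudo_label=0.8)
```

In practice the gradient would be obtained by backpropagation through the full network; the finite difference is used here only to keep the example self-contained.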

[0166] Through the above steps S510-S560, video sample data can be input into the momentum model to obtain the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, the momentum feature vector of the speech text and the fifth semantic information of the speech text, and the momentum feature vector of the image text and the sixth semantic information of the image text. Based on the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, the momentum feature vector of the speech text and the fifth semantic information of the speech text, and the momentum feature vector of the image text and the sixth semantic information of the image text, the momentum text feature vector corresponding to the video sample data is determined, and the momentum image feature vector corresponding to the video sample data is obtained. The momentum image feature vector and the momentum text feature vector are fused to obtain the second fused feature vector corresponding to the video sample data. The second loss function of the model to be trained is determined based on the second fused feature vector of the video sample data. The overall loss function is determined based on the first loss function and the second loss function of the model to be trained. The neural network parameters of the model to be trained are updated through the overall loss function to obtain the item classification model.

[0167] In one example embodiment of this disclosure, item category pseudo-labels can be obtained based on the second fused feature vector of video sample data, and a second loss function for the model to be trained can be determined based on the item category pseudo-labels and the predicted item categories. As shown in Figure 6, determining the second loss function of the model to be trained based on the pseudo-label of the item category and the predicted item category may include the following steps S610 to S620:

[0168] Step S610: Obtain the pseudo-label of the item category based on the second fused feature vector of the video sample data;

[0169] In one exemplary embodiment of this disclosure, after obtaining the second fused feature vector of the video sample data through the above steps, a pseudo-label for the item category can be obtained based on the second fused feature vector of the video sample data. Specifically, the momentum model can be used to obtain a predicted item category based on the second fused feature vector corresponding to the video sample data. This predicted item category can be used to indicate the category of the item in the video data corresponding to the video sample data. Since the neural network parameters of the momentum model are updated in a sliding manner as the model to be trained is trained, the momentum model carries the knowledge learned by the model to be trained. Therefore, the predicted item category obtained by the momentum model can be used as the pseudo-label for the item category, and the model to be trained can be trained using this pseudo-label.

[0170] Specifically, multiple item category pseudo-labels can be obtained based on the second fused feature vector corresponding to the video sample data. For example, the video data corresponding to the video sample data may include multiple items, each corresponding to a different item category. Alternatively, a single item indicated in the video data may correspond to multiple item categories. In either case, multiple item category pseudo-labels can be obtained based on the second fused feature vector corresponding to the video sample data.

[0171] In one example embodiment of this disclosure, the momentum model may include multiple hidden layers, which may include convolutional layers, normalization layers, activation layers, etc. The second fused feature vector corresponding to the video sample data can be sequentially input into the multiple hidden layers of the momentum model to obtain the hidden layer calculation results. The item category pseudo-labels are then obtained from these hidden layer calculation results.

[0172] It should be noted that this disclosure does not impose any special limitations on the specific method of obtaining the pseudo-label of the item category based on the second fused feature vector of the video sample data.
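One possible shape of step S610 is sketched below. It assumes a single fully connected hidden layer with a ReLU activation followed by one sigmoid output per item category, so that several categories can receive a high pseudo-label probability at once. The layer types, the weights, and the function names are hypothetical; the disclosure also mentions convolutional and normalization layers, which are not modeled here.

```python
import math

def dense(vector, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]

def relu(vector):
    return [max(0.0, x) for x in vector]

def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]

def pseudo_labels(fused, hidden_w, hidden_b, out_w, out_b):
    """Pass the fused vector through a hidden layer, then score each category."""
    hidden = relu(dense(fused, hidden_w, hidden_b))
    return sigmoid(dense(hidden, out_w, out_b))

# Hypothetical second fused feature vector and layer parameters.
fused = [1.0, -1.0]
hidden_w, hidden_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
out_w, out_b = [[2.0, 0.0], [-2.0, 0.0], [0.0, 3.0]], [0.0, 0.0, 0.0]

probabilities = pseudo_labels(fused, hidden_w, hidden_b, out_w, out_b)
```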

[0173] Step S620: Determine the second loss function of the model to be trained based on the pseudo-label of the item category and the predicted item category.

[0174] In one example embodiment of this disclosure, after obtaining the pseudo-labels for item categories through the above steps, a second loss function for the model to be trained can be determined based on the pseudo-labels and the predicted item categories. Specifically, the predicted item categories and pseudo-labels obtained through the model to be trained can be compared, the prediction difference between the predicted item categories and pseudo-labels can be calculated, and the second loss function for the model to be trained can be determined based on this prediction difference.

[0175] It should be noted that this disclosure does not impose any special restrictions on the specific form of the second loss function or the specific method of determining the second loss function of the model to be trained based on the pseudo-label of the item category and the predicted item category.

[0176] Through the above steps S610 to S620, the pseudo-label of the item category can be obtained based on the second fused feature vector of the video sample data. Based on the pseudo-label of the item category and the predicted item category, the second loss function of the model to be trained can be determined.

[0177] In one example embodiment of this disclosure, the second loss function is an asymmetric loss function. The video sample data includes positive samples and negative samples, and the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples. When the predicted probability of the predicted item category corresponding to a negative sample is less than the predicted probability of the item category pseudo-label corresponding to that negative sample, the negative sample is removed; when the predicted probability of the predicted item category corresponding to a positive sample is greater than the predicted probability of the item category pseudo-label corresponding to that positive sample, the positive sample is removed. Specifically, the second loss function may be a loss function based on focal loss in which positive and negative samples are assigned different exponential coefficients, i.e., the exponential coefficient of negative samples is greater than that of positive samples. For a negative sample, the predicted probability of the predicted item category is compared with the predicted probability of the item category pseudo-label. If the former is less than the latter, the predicted item category obtained by the model to be trained is already highly accurate, and the negative sample can therefore be removed from the second loss function. Likewise, if the predicted probability of the predicted item category corresponding to a positive sample is greater than the predicted probability of the item category pseudo-label corresponding to that positive sample, the predicted item category obtained by the model to be trained is already highly accurate, and the positive sample can therefore be removed from the second loss function, thereby improving the classification accuracy of the item classification model. The expression of the second loss function in this embodiment is as follows, where L_classify_distill is the second loss function, γ+ is the exponential coefficient for positive samples, γ- is the exponential coefficient for negative samples, y is the true value, p is the predicted value of the model to be trained, and p′ is the predicted value of the momentum model. clamp() restricts a value to a given interval [min, max].

[0178] $$L_{classify\_distill} = -y\,\bigl(p' - \mathrm{clamp}(p,\ \max=p')\bigr)^{\gamma_+}\log(p) - (1-y)\,\bigl(\mathrm{clamp}(p,\ \min=p') - p'\bigr)^{\gamma_-}\log(1-p)$$
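The second loss function can be sketched for a single sample and a single item category as follows. The sketch assumes that the negative term carries the exponent γ-, as the symbol definitions suggest, and it writes clamp(p, max=p′) as min(p, p′) and clamp(p, min=p′) as max(p, p′). The default exponents and the constant `eps` are illustrative assumptions.

```python
import math

def distill_loss(y, p, p_momentum, gamma_pos=1.0, gamma_neg=4.0, eps=1e-8):
    """Asymmetric loss in which the momentum prediction p' sets the margin."""
    p = min(max(p, eps), 1.0 - eps)            # keep log() finite
    pos_gap = p_momentum - min(p, p_momentum)  # p' - clamp(p, max=p')
    neg_gap = max(p, p_momentum) - p_momentum  # clamp(p, min=p') - p'
    positive_term = -y * pos_gap ** gamma_pos * math.log(p)
    negative_term = -(1.0 - y) * neg_gap ** gamma_neg * math.log(1.0 - p)
    return positive_term + negative_term

removed_positive = distill_loss(y=1, p=0.9, p_momentum=0.7)  # p > p'
removed_negative = distill_loss(y=0, p=0.2, p_momentum=0.5)  # p < p'
kept_positive = distill_loss(y=1, p=0.4, p_momentum=0.7)
kept_negative = distill_loss(y=0, p=0.8, p_momentum=0.5)
```

A sample contributes to the loss only when the model to be trained does worse than the momentum model on it, which matches the removal conditions stated in the preceding paragraph.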

[0179] The embodiments of this disclosure can improve the robustness and generalization ability of the item classification model.

[0180] In one exemplary embodiment of this disclosure, the model to be trained can be trained using the first loss function and the second loss function obtained through the above steps. Specifically, an overall loss function can be obtained based on the first loss function and the second loss function, and the neural network parameters of the model to be trained can be updated using the overall loss function to obtain an item classification model.

[0181] In one exemplary embodiment of this disclosure, as shown in Figure 7, the descriptive text, speech text, and image text corresponding to the video sample data are obtained and input into their respective text encoders to obtain the feature vector token embedding1 and the first semantic information [cls1] of the descriptive text, the feature vector token embedding2 and the second semantic information [cls2] of the speech text, and the feature vector token embedding3 and the third semantic information [cls3] of the image text. The obtained feature vectors and semantic information are then fused using attention: based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information, the first, second, and third semantic information are merged into the overall semantic information [cls4], and the descriptive text feature vector token embedding1, the speech text feature vector token embedding2, and the image text feature vector token embedding3 are concatenated using [seq] to obtain the text feature vector token embedding4 corresponding to the video sample data. The multi-frame image data corresponding to the video sample data is converted by an image encoder into the image feature vector [cls] patch embedding corresponding to the video sample data. Finally, the text feature vector and the image feature vector corresponding to the video sample data are fused by a multimodal fusion encoder into the first fused feature vector [cls5] token embedding5 corresponding to the video sample data.

[0182] In one example embodiment of this disclosure, video data can be acquired and input into an item classification model to obtain item categories. As shown in Figure 8, inputting video data into an item classification model to obtain item categories may include the following steps S810 to S820:

[0183] Step S810: Acquire video data;

[0184] Step S820: Input the video data into the item classification model to obtain the item category.

[0185] In one example embodiment of this disclosure, video data can be acquired. Specifically, this video data can be used to indicate one or more items. The video data can be input into an item classification model trained through the above steps, which can output the item types corresponding to one or more items.

[0186] For example, the item classification model can output the item type for each of multiple items separately, or it can output multiple item types for the same item.

[0187] It should be noted that this disclosure does not impose any special restrictions on the quantity of item types.
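
Because the model may output more than one item type, the final step of S820 can be pictured as a multi-label decision over per-type scores. The disclosure does not describe this decoding step; the sketch below assumes the model yields one probability per item type and keeps every type whose probability reaches a cutoff. The function name `select_item_types` and the `threshold` parameter are hypothetical.

```python
from typing import Dict, List


def select_item_types(probabilities: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every item type whose score reaches the threshold, highest first.

    Assumption: scores are independent per-type probabilities in [0, 1].
    """
    chosen = [(name, p) for name, p in probabilities.items() if p >= threshold]
    chosen.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in chosen]
```

For instance, calling it on `{"shoes": 0.9, "bags": 0.2, "hats": 0.6}` with the default threshold keeps `shoes` and `hats` and drops `bags`.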

[0188] In one example embodiment of this disclosure, after obtaining the item type through the above steps, business functions can be implemented based on the item type. For example, multiple item links associated with the item type can be obtained and pushed to the user's client so that the user can click on the item links to learn about the relevant items.

[0189] It should be noted that this disclosure does not impose any special restrictions on the specific types of business functions.

[0190] Through the above steps S810 to S820, video data can be obtained, and the video data can be input into the item classification model to obtain the item category.

[0191] According to the training method of the item classification model provided in this exemplary embodiment, video sample data can be acquired and input into the model to be trained; the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text can be acquired; the text feature vector corresponding to the video sample data can be determined based on these feature vectors and semantic information; the image feature vector corresponding to the video sample data can be acquired based on the image data in the video sample data; the image feature vector and the text feature vector can be fused to obtain the first fused feature vector corresponding to the video sample data; the predicted item category can be obtained based on the first fused feature vector; and the neural network parameters of the model to be trained can be updated based on the item category label and the predicted item category to obtain the item classification model.

[0192] The embodiments disclosed herein can integrate semantic information corresponding to different text feature vectors, taking into account the impact of the semantic information corresponding to each text on the classification results, thereby improving the accuracy of the classification results output by the model and ensuring more accurate product recommendations.

[0193] It should be noted that the above figures are merely illustrative of the processes included in the method according to exemplary embodiments of this disclosure, and are not intended to be limiting. It is readily understood that the processes shown in the above figures do not indicate or limit the temporal order of these processes. Furthermore, it is readily understood that these processes may be executed synchronously or asynchronously, for example, in multiple modules.

[0194] Furthermore, in an exemplary embodiment of this disclosure, an item classification model training apparatus is also provided. As shown in Figure 9, an item classification model training device 900 includes: a sample data acquisition unit 910, a semantic information acquisition unit 920, a text feature acquisition unit 930, a feature fusion unit 940, and a parameter update unit 950.

[0195] The system includes the following components: a sample data acquisition unit, configured to acquire video sample data, including text data, image data, and item category labels; a semantic information acquisition unit, configured to input the video sample data into the model to be trained, acquire the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text; a text feature acquisition unit, configured to determine the text feature vector corresponding to the video sample data based on the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text; a feature fusion unit, configured to acquire the image feature vector corresponding to the video sample data based on the image data in the video sample data, and fuse the image feature vector with the text feature vector to obtain the first fused feature vector corresponding to the video sample data; and a parameter update unit, configured to obtain the predicted item category based on the first fused feature vector corresponding to the video sample data, and update the neural network parameters of the model to be trained based on the item category labels and the predicted item category to obtain the item classification model.

[0196] In an exemplary embodiment of this disclosure, based on the aforementioned scheme, the device further includes: a semantic merging unit configured to merge the first semantic information, the second semantic information, and the third semantic information into overall semantic information based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information; and a text feature vector determination unit configured to determine the text feature vector corresponding to the video sample data based on the overall semantic information, the feature vector of the descriptive text, the feature vector of the speech text, and the feature vector of the image text.

[0197] In an exemplary embodiment of this disclosure, based on the foregoing scheme, the apparatus further includes: a first loss function determination unit configured to determine a first loss function of the model to be trained based on the item category label and the predicted item category; and a first loss function training unit configured to update the neural network parameters of the model to be trained based on the first loss function of the model to be trained.

[0198] In an exemplary embodiment of this disclosure, based on the aforementioned scheme, the first loss function is an asymmetric loss function; wherein, the video sample data includes positive samples and negative samples, the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples, and negative samples are removed when the predicted probability of the predicted item category corresponding to the negative sample is less than a preset threshold.
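
One possible reading of this asymmetric loss is a focal-style binary loss in which the "exponential coefficient" is the exponent applied to the modulating factor, with easy negatives discarded outright. The sketch below is written under that assumption; the function name, the default exponents, and the default threshold are illustrative choices, not values from the disclosure.

```python
import math
from typing import Sequence


def asymmetric_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    gamma_pos: float = 1.0,
    gamma_neg: float = 4.0,
    threshold: float = 0.05,
    eps: float = 1e-8,
) -> float:
    """Sum of per-sample asymmetric losses.

    Assumptions: labels are 1 (positive) or 0 (negative); negatives whose
    predicted probability is below `threshold` are removed from the loss.
    """
    if gamma_neg <= gamma_pos:
        raise ValueError("the negative exponent must exceed the positive one")
    total = 0.0
    for p, y in zip(probs, labels):
        clipped = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -((1.0 - clipped) ** gamma_pos) * math.log(clipped)
        elif p >= threshold:
            total += -(clipped ** gamma_neg) * math.log(1.0 - clipped)
        # negatives with p < threshold contribute nothing (removed)
    return total
```

The larger exponent on negatives shrinks the contribution of negatives that the model already scores low, so the abundant easy negatives do not dominate the gradient.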

[0199] In an exemplary embodiment of this disclosure, based on the foregoing scheme, the apparatus further includes: a momentum model input unit configured to input the video sample data into a momentum model, wherein the neural network parameters of the momentum model are updated according to changes in the neural network parameters during the training process of the model to be trained; a second semantic information acquisition unit configured to acquire the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text; a second text feature vector determination unit configured to determine the momentum text feature vector corresponding to the video sample data based on the momentum feature vector and fourth semantic information of the descriptive text, the momentum feature vector and fifth semantic information of the speech text, and the momentum feature vector and sixth semantic information of the image text; a second fusion feature vector acquisition unit configured to acquire the momentum image feature vector corresponding to the video sample data and fuse it with the momentum text feature vector to obtain the second fusion feature vector corresponding to the video sample data; an overall loss function acquisition unit configured to determine the second loss function of the model to be trained based on the second fusion feature vector of the video sample data; and a training unit configured to determine the overall loss function based on the first and second loss functions of the model to be trained, and update the neural network parameters of the model to be trained using the overall loss function to obtain the item classification model.
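
The disclosure says only that the momentum model's parameters follow the changes in the parameters of the model being trained. A common way to realize such a rule is an exponential moving average; the sketch below assumes that form. The function name `momentum_update` and the `decay` value are hypothetical, and parameters are shown as a flat name-to-value mapping for brevity.

```python
from typing import Dict


def momentum_update(
    momentum_params: Dict[str, float],
    online_params: Dict[str, float],
    decay: float = 0.995,
) -> Dict[str, float]:
    """Return momentum parameters moved slightly toward the online parameters.

    Assumption: exponential moving average,
    new = decay * momentum + (1 - decay) * online.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: decay * value + (1.0 - decay) * online_params[name]
        for name, value in momentum_params.items()
    }
```

A decay close to 1 makes the momentum model change slowly, which is what lets its outputs serve as comparatively stable pseudo-labels for the second loss function.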

[0200] In one exemplary embodiment of this disclosure, based on the aforementioned scheme, for determining the second loss function of the model to be trained according to the second fusion feature vector of the video sample data, the apparatus further includes: an item category pseudo-label acquisition unit, configured to obtain item category pseudo-labels according to the second fusion feature vector of the video sample data; and a second loss function determination unit, configured to determine the second loss function of the model to be trained based on the item category pseudo-labels and the predicted item categories.

[0201] In an exemplary embodiment of this disclosure, based on the aforementioned scheme, the second loss function is an asymmetric loss function; wherein, the video sample data includes positive samples and negative samples, and the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples. When the predicted probability of the predicted item category corresponding to the negative sample is less than the predicted probability of the item category pseudo-label corresponding to the negative sample, the negative sample is removed; when the predicted probability of the predicted item category corresponding to the positive sample is greater than the predicted probability of the item category pseudo-label corresponding to the positive sample, the positive sample is removed.
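
The two removal rules can be illustrated with a small sketch in which the pseudo-label probability from the momentum model acts as a per-sample threshold. Several points are assumptions made for illustration: samples are marked positive or negative by a 0/1 label, the retained samples use the same focal-style terms as an ordinary asymmetric loss, and the names and default exponents are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pseudo_label_asymmetric_loss(
    probs: Sequence[float],
    pseudo_probs: Sequence[float],
    labels: Sequence[int],
    gamma_pos: float = 1.0,
    gamma_neg: float = 4.0,
    eps: float = 1e-8,
) -> Tuple[float, int]:
    """Return (summed loss, number of samples kept).

    A negative is removed when its predicted probability is below the
    pseudo-label probability; a positive is removed when its predicted
    probability is above the pseudo-label probability.
    """
    total, kept = 0.0, 0
    for p, q, y in zip(probs, pseudo_probs, labels):
        if (y == 1 and p > q) or (y == 0 and p < q):
            continue  # sample removed by the rules above
        clipped = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -((1.0 - clipped) ** gamma_pos) * math.log(clipped)
        else:
            total += -(clipped ** gamma_neg) * math.log(1.0 - clipped)
        kept += 1
    return total, kept
```

Under this reading, the second loss only penalizes samples on which the model being trained does worse than the momentum model: positives it scores lower and negatives it scores higher.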

[0202] Since the functional modules of the item classification model training device in the example embodiments of this disclosure correspond to the steps of the item classification model training method in the example embodiments described above, for details not disclosed in the device embodiments of this disclosure, please refer to the item classification model training method embodiments described above.

[0203] Furthermore, in an exemplary embodiment of this disclosure, an item classification apparatus is also provided. As shown in Figure 10, an item classification device 1000 includes: a video acquisition unit 1010 and an item category acquisition unit 1020.

[0204] The video acquisition unit is configured to acquire video data, including text data and image data, with the text data including descriptive text, audio text, and image text. The item category acquisition unit is configured to input the video data into an item classification model to obtain item categories, wherein the item classification model is obtained by the item classification model training method described in any of the above embodiments.

[0205] Since the functional modules of the item classification device in the example embodiments of this disclosure correspond to the steps of the example embodiments of the item classification method described above, for details not disclosed in the device embodiments of this disclosure, please refer to the embodiments of the item classification method described above.

[0206] It should be noted that although several modules or units of the device for performing actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of this disclosure, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0207] Furthermore, in an exemplary embodiment of this disclosure, an electronic device capable of implementing the above-described item classification model training method is also provided.

[0208] Those skilled in the art will understand that various aspects of this disclosure can be implemented as a system, method, or program product. Therefore, various aspects of this disclosure can be embodied in the following forms: a completely hardware embodiment, a completely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, collectively referred to herein as a "circuit," "module," or "system."

[0209] An electronic device 1100 according to such an embodiment of the present disclosure is described below with reference to Figure 11. The electronic device 1100 shown in Figure 11 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments disclosed herein.

[0210] As shown in Figure 11, the electronic device 1100 takes the form of a general-purpose computing device. The components of the electronic device 1100 may include, but are not limited to: at least one processing unit 1110, at least one storage unit 1120, a bus 1130 connecting different system components (including storage unit 1120 and processing unit 1110), and a display unit 1140.

[0211] The storage unit stores program code, which can be executed by the processing unit 1110 so that the processing unit 1110 performs the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of this disclosure. For example, the processing unit 1110 can perform the steps shown in Figure 2: Step S210, acquiring video sample data, which includes text data, image data, and item category labels, the text data including descriptive text, speech text, and image text; Step S220, inputting the video sample data into the model to be trained, and acquiring the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text; Step S230, determining the text feature vector corresponding to the video sample data based on the feature vector and first semantic information of the descriptive text, the feature vector and second semantic information of the speech text, and the feature vector and third semantic information of the image text; Step S240, acquiring the image feature vector corresponding to the video sample data based on the image data in the video sample data, and fusing the image feature vector with the text feature vector to obtain the first fused feature vector corresponding to the video sample data; and Step S250, obtaining the predicted item category based on the first fused feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained based on the item category labels and the predicted item category to obtain the item classification model.

[0212] Alternatively, the processing unit 1110 can perform the steps shown in Figure 8: Step S810, acquiring video data; and Step S820, inputting the video data into the item classification model to obtain the item category.

[0213] For example, the electronic device can implement the steps shown in Figure 2 and Figure 8.

[0214] Storage unit 1120 may include readable media in the form of volatile storage units, such as random access memory (RAM) 1121 and / or cache memory 1122, and may further include read-only memory (ROM) 1123.

[0215] Storage unit 1120 may also include a program / utility 1124 having a set of (at least one) program modules 1125. Such program modules 1125 include but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.

[0216] Bus 1130 can represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of the various bus structures.

[0217] Electronic device 1100 can also communicate with one or more external devices 1170 (e.g., keyboard, pointing device, Bluetooth device, etc.), and with one or more devices that enable a user to interact with electronic device 1100, and / or with any device that enables electronic device 1100 to communicate with one or more other computing devices (e.g., router, modem, etc.). This communication can be performed via input / output (I / O) interface 1150. Furthermore, electronic device 1100 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public networks, such as the Internet) via network adapter 1160. As shown, network adapter 1160 communicates with other modules of electronic device 1100 via bus 1130. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0218] From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, terminal device, or network device, etc.) to execute the methods according to the embodiments of this disclosure.

[0219] In an exemplary embodiment, a computer-readable storage medium including instructions is also provided, such as a memory including instructions that can be executed by a processor of the device to perform the described method. Optionally, the computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.

[0220] In an exemplary embodiment, a computer program product is also provided, including a computer program / instructions, which, when executed by a processor, implement the item classification model training or item classification method in the above embodiments.

[0221] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.

[0222] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

Claims

1. A method for training an item classification model, characterized in that the method includes: acquiring video sample data, wherein the video sample data includes text data, image data, and item category labels, and the text data includes descriptive text, audio text, and image text; inputting the video sample data into the model to be trained to obtain the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, and the feature vector of the image text and the third semantic information of the image text; determining the text feature vector corresponding to the video sample data based on the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, and the feature vector of the image text and the third semantic information of the image text; obtaining, based on the image data in the video sample data, the image feature vector corresponding to the video sample data, and fusing the image feature vector with the text feature vector to obtain the first fused feature vector corresponding to the video sample data; obtaining the predicted item category based on the first fused feature vector corresponding to the video sample data; determining a first loss function of the model to be trained based on the item category label and the predicted item category; and updating the neural network parameters of the model to be trained based on the first loss function to obtain an item classification model, including: converting text data into a momentum text feature vector using a momentum model; updating the neural network parameters of the momentum model based on changes in the neural network parameters during the training process of the model to be trained; fusing the momentum image feature vector and the momentum text feature vector to obtain a second fused feature vector; determining a second loss function based on the second fused feature vector; and determining an overall loss function based on the first loss function and the second loss function, and updating the neural network parameters of the model to be trained using the overall loss function; wherein the second loss function is an asymmetric loss function; the video sample data includes positive samples and negative samples; in the asymmetric loss function, the exponential coefficient for negative samples is greater than the exponential coefficient for positive samples; when the predicted probability of the predicted item category corresponding to a negative sample is less than the predicted probability of the item category pseudo-label corresponding to the negative sample, the negative sample is removed; when the predicted probability of the predicted item category corresponding to a positive sample is greater than the predicted probability of the item category pseudo-label corresponding to the positive sample, the positive sample is removed; and the item category pseudo-label is obtained based on the second fused feature vector of the video sample data.

2. The method according to claim 1, characterized in that determining the text feature vector corresponding to the video sample data based on the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, and the feature vector of the image text and the third semantic information of the image text includes: merging the first semantic information, the second semantic information, and the third semantic information into overall semantic information based on the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information; and determining the text feature vector corresponding to the video sample data based on the overall semantic information, the feature vector of the descriptive text, the feature vector of the speech text, and the feature vector of the image text.

3. The method according to claim 1, characterized in that, The first loss function is an asymmetric loss function; wherein, the video sample data includes positive samples and negative samples, and the exponential coefficient of the negative samples in the asymmetric loss function is greater than the exponential coefficient of the positive samples. When the predicted probability of the predicted item category corresponding to the negative sample is less than a preset threshold, the negative sample is removed.

4. The method according to claim 1, characterized in that, The step of updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained includes: The video sample data is input into the momentum model; wherein, the neural network parameters of the momentum model are updated by sliding according to the changes in the neural network parameters during the training process of the model to be trained. Obtain the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text; obtain the momentum feature vector of the speech text and the fifth semantic information of the speech text; obtain the momentum feature vector of the image text and the sixth semantic information of the image text. Based on the momentum feature vector of the descriptive text and the fourth semantic information of the descriptive text, the momentum feature vector of the speech text and the fifth semantic information of the speech text, the momentum feature vector of the image text and the sixth semantic information of the image text, the momentum text feature vector corresponding to the video sample data is determined. Obtain the momentum image feature vector corresponding to the video sample data, and fuse the momentum image feature vector with the momentum text feature vector to obtain the second fused feature vector corresponding to the video sample data; The second loss function of the model to be trained is determined based on the second fused feature vector of the video sample data. The overall loss function is determined based on the first loss function and the second loss function of the model to be trained. The neural network parameters of the model to be trained are then updated using the overall loss function to obtain the item classification model.

5. The method according to claim 4, characterized in that, Determining the second loss function of the model to be trained based on the second fused feature vector of the video sample data includes: The pseudo-label of the item category is obtained based on the second fusion feature vector of the video sample data; The second loss function of the model to be trained is determined based on the pseudo-label of the item category and the predicted item category.

6. A method for classifying items, characterized in that, The method includes: Acquire video data; wherein the video data includes text data and image data, and the text data includes descriptive text, audio text, and image text; The video data is input into an item classification model to obtain item categories; wherein the item classification model is obtained by the item classification model training method as described in any one of claims 1-5.

7. A training device for an item classification model, characterized in that it includes: a sample data acquisition unit configured to acquire video sample data, wherein the video sample data includes text data, image data, and item category labels, and the text data includes descriptive text, audio text, and image text; a semantic information acquisition unit configured to input the video sample data into the model to be trained, acquire the feature vector of the descriptive text and the first semantic information of the descriptive text, acquire the feature vector of the speech text and the second semantic information of the speech text, and acquire the feature vector of the image text and the third semantic information of the image text; a text feature acquisition unit configured to determine the text feature vector corresponding to the video sample data based on the feature vector of the descriptive text and the first semantic information of the descriptive text, the feature vector of the speech text and the second semantic information of the speech text, and the feature vector of the image text and the third semantic information of the image text; a feature fusion unit configured to obtain the image feature vector corresponding to the video sample data based on the image data in the video sample data, and fuse the image feature vector with the text feature vector to obtain the first fused feature vector corresponding to the video sample data; and a parameter update unit configured to: obtain a predicted item category based on the first fused feature vector corresponding to the video sample data; determine a first loss function of the model to be trained based on the item category label and the predicted item category; and update the neural network parameters of the model to be trained based on the first loss function to obtain an item classification model, including: converting text data into a momentum text feature vector using a momentum model; updating, by sliding, the neural network parameters of the momentum model based on changes in the neural network parameters during the training process of the model to be trained; fusing the momentum image feature vector and the momentum text feature vector to obtain a second fused feature vector; determining a second loss function based on the second fused feature vector; and determining an overall loss function based on the first loss function and the second loss function, and updating the neural network parameters of the model to be trained using the overall loss function; wherein the second loss function is an asymmetric loss function; the video sample data includes positive samples and negative samples; the exponential coefficient for negative samples in the asymmetric loss function is greater than the exponential coefficient for positive samples; when the predicted probability of the predicted item category corresponding to a negative sample is less than the predicted probability of the item category pseudo-label corresponding to the negative sample, the negative sample is removed; when the predicted probability of the predicted item category corresponding to a positive sample is greater than the predicted probability of the item category pseudo-label corresponding to the positive sample, the positive sample is removed; and the item category pseudo-label is obtained based on the second fused feature vector of the video sample data.

8. An item classification device, characterized in that it includes: a video acquisition unit configured to acquire video data, wherein the video data includes text data and image data, and the text data includes descriptive text, audio text, and image text; and an item category acquisition unit configured to input the video data into an item classification model to obtain the item category, wherein the item classification model is obtained by the item classification model training method as described in any one of claims 1-5.

9. An electronic device, characterized in that it includes: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the executable instructions to implement the item classification model training method as described in any one of claims 1 to 5 or the item classification method as described in claim 6.

10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the item classification model training method as claimed in any one of claims 1 to 5 or the item classification method as claimed in claim 6.