Model pre-training method, model training method, data processing method and device thereof

By extracting slot text from the training data and generating question and answer texts from it, the problem of low training efficiency in multimodal pre-trained models is addressed, improving the training efficiency and quality of the model and adapting it to the needs of downstream tasks.

CN115982330B · Active · Publication Date: 2026-06-23 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ALIBABA (CHINA) CO LTD
Filing Date: 2022-12-28
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing multimodal pre-trained models train inefficiently with text mask prediction, image mask restoration, and image-text matching; random masking in particular wastes training effort and yields limited improvement on downstream tasks.

Method used

By extracting slot text from the training data, question text and answer text are generated. The image and question text are used as input to the multimodal pre-trained model, and the answer text is used as the label to train the multimodal pre-trained model, thereby improving training efficiency.

Benefits of technology

This enables multimodal pre-trained models to focus on learning slot information in training data, improving the training efficiency and quality of multimodal pre-trained models and adapting to the needs of downstream tasks.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a model pre-training method, a model training method, a data processing method and a device thereof. The model pre-training method comprises: obtaining training data corresponding to a to-be-trained multi-modal pre-training model, the training data comprising an image and text used for describing the image; extracting slot value text meeting a training requirement from the text; generating question text meeting the training requirement and answer text corresponding to the question text according to the slot value text; determining the image and the question text as inputs of the multi-modal pre-training model, and determining the answer text as a label of the multi-modal pre-training model; training the multi-modal pre-training model to obtain a trained multi-modal pre-training model, which can determine more accurate training samples for the multi-modal pre-training model, thereby improving the training efficiency of the multi-modal pre-training model.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to model pre-training methods, model training methods, data processing methods, and apparatus thereof.

Background Technology

[0002] Model pre-training is a pre-training process performed before the formal training of the model. Model pre-training can improve the performance of the model during formal training. Among them, multimodal pre-trained models have achieved good performance in many classic multimodal tasks, such as visual question answering, image caption generation, and image-text retrieval.

[0003] Currently, multimodal pre-trained models are typically trained by masking and predicting the text portion of the training samples, masking and restoring the image portion of the training samples, or performing image-text matching on the explicit multimodal interactions of the training samples. This approach suffers from low training efficiency.

Summary of the Invention

[0004] This application provides various aspects of model pre-training methods, model training methods, data processing methods, and apparatus thereof to improve the training efficiency of multimodal pre-trained models.

[0005] The first aspect of this application provides a model pre-training method, comprising: acquiring training data corresponding to a multimodal pre-trained model to be trained, the training data including images and text used to describe the images; extracting slot text that meets training requirements from the text; generating question text that meets training requirements and answer text corresponding to the question text based on the slot text; determining that the images and question text are the inputs of the multimodal pre-trained model, and the answer text is the label of the multimodal pre-trained model; training the multimodal pre-trained model to obtain a trained multimodal pre-trained model.

[0006] The second aspect of this application provides a model training method, including: acquiring training samples; using the training samples to train a multimodal pre-trained model to obtain a downstream task model, wherein the downstream task model includes at least one of a visual question answering model, an image title generation model, or an image and text retrieval model, and the multimodal pre-trained model is trained by the model pre-training method of the first aspect.

[0007] The third aspect of this application provides a data processing method, which involves acquiring data to be processed, including images and / or text; inputting the data to be processed into a downstream task model for processing, and obtaining an output result. If the data to be processed includes images and text, the output result is a response text for the images and text; if the data to be processed includes images, the output result is the title information of the images; if the data to be processed includes text, the output result is the image described by the text. The downstream task model is trained using the model training method of the second aspect.

[0008] A fourth aspect of this application provides a model pre-training apparatus, comprising:

[0009] The acquisition module is used to acquire the training data corresponding to the multimodal pre-trained model to be trained. The training data includes images and text used to describe the images.

[0010] The extraction module is used to extract slot text that meets the training requirements from the text.

[0011] The generation module is used to generate question text and corresponding answer text that meet the training requirements based on the slot value text;

[0012] The training module is used to determine that the image and question text are the inputs to the multimodal pre-trained model, and the answer text is the label of the multimodal pre-trained model. The multimodal pre-trained model is then trained to obtain the trained multimodal pre-trained model.

[0013] A fifth aspect of this application provides a model pre-training system, including: a cloud server and a terminal device, wherein a multimodal pre-trained model is deployed on the cloud server;

[0014] The terminal device is used to acquire training data of the multimodal pre-trained model to be trained and send the training data to the cloud server. The training data includes images and text used to describe the images.

[0015] A cloud server is used to extract slot text that meets the training requirements from the text; based on the slot text, question text that meets the training requirements and the corresponding answer text are generated; the image and question text are determined as the input of the multimodal pre-trained model, and the answer text is the label of the multimodal pre-trained model; the multimodal pre-trained model is trained to obtain the trained multimodal pre-trained model.

[0016] A sixth aspect of this application provides an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the model pre-training method of the first aspect, and / or the model training method of the second aspect, and / or the data processing method of the third aspect.

[0017] A seventh aspect of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the model pre-training method of the first aspect, and / or the model training method of the second aspect, and / or the data processing method of the third aspect.

[0018] The embodiments of this application are applied to the pre-training scenario of multimodal models. Training data corresponding to the multimodal pre-trained model to be trained is acquired, the training data including images and text used to describe the images; slot text that meets the training requirements is extracted from the text; based on the slot text, question text that meets the training requirements and answer text corresponding to the question text are generated; the images and question text are determined as the input of the multimodal pre-trained model, and the answer text as the label of the multimodal pre-trained model; the multimodal pre-trained model is trained to obtain the trained multimodal pre-trained model. This determines more accurate training samples for the multimodal pre-trained model, thereby improving its training efficiency.

Attached Figure Description

[0019] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0020] Figure 1 is an application scenario diagram provided by an exemplary embodiment of this application;

[0021] Figure 2 is a flowchart of the steps of a model pre-training method provided by an exemplary embodiment of this application;

[0022] Figure 3 is a schematic diagram of training data provided by an exemplary embodiment of this application;

[0023] Figure 4 is a flowchart of the steps of another model pre-training method provided by an exemplary embodiment of this application;

[0024] Figure 5 is a schematic diagram of filling content into a template to be filled, provided by an exemplary embodiment of this application;

[0025] Figure 6 is a flowchart of the steps of a model training method provided by an exemplary embodiment of this application;

[0026] Figure 7 is a flowchart of the steps of a data processing method provided by an exemplary embodiment of this application;

[0027] Figure 8 is a structural block diagram of a model pre-training device provided by an exemplary embodiment of this application;

[0028] Figure 9 is a schematic diagram of the structure of an electronic device provided by an exemplary embodiment of this application.

Detailed Implementation

[0029] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0030] Multimodal pre-training refers to improving the understanding and inference capabilities of multimodal models for multimodal information by designing pre-training tasks. Multimodal pre-trained models have achieved good performance in many classic multimodal tasks, such as visual question answering, image caption generation, and image-text retrieval. Widely used multimodal pre-training methods include mask prediction for text, mask restoration for images, and image-text matching for explicit multimodal interactions.

[0031] Specifically, the text partial masking prediction task involves using highly randomized text partial masks, randomly masking 15% of the words in the text input, and requiring the multimodal pre-trained model to predict the masked words. In reality, only a small portion of the words in the text are crucial for generating the answer. With the 15% random masking strategy, the multimodal pre-trained model spends most of its training time predicting unimportant information, significantly hindering its training efficiency. The image partial masking reconstruction task involves randomly masking a portion of the input image, requiring the multimodal pre-trained model to reconstruct the masked image region based on the surrounding image and the input text. This task suffers from the same low training efficiency problem as the text task due to random masking, and the input / output format differs significantly from downstream multimodal tasks, resulting in limited actual improvement for downstream tasks. The specific method of multimodal interactive image-text matching involves collecting a large number of image-text pairs and using a simple binary classification method to determine whether there is a correspondence between the image and the text. This allows the multimodal pre-trained model to learn the connection between the image and the text. However, this method has two problems. First, it only considers the correspondence between the text level and the image level, neglecting to train the model to learn the connection between some text in the text and objects in the image. Second, collecting a large number of image-text pairs is costly and difficult to implement.

[0032] To address the aforementioned issues, this application employs a slot text extraction method. Based on the training requirements of downstream tasks, it selectively extracts slot text describing images from the training data, generating a series of question and answer texts. By using images, question texts, and answer texts to train a multimodal pre-training model, it can guide the multimodal pre-training model to focus on learning slot information in the training data, thereby improving the training efficiency of the multimodal pre-training model.

[0033] In this embodiment, the execution device of the model pre-training method is not limited. Optionally, the model pre-training method can be implemented using a cloud computing system. For example, the model pre-training method can be applied to a cloud server to leverage the advantages of cloud resources to run various neural network models; alternatively, the model pre-training method can also be applied to server-side devices such as conventional servers, cloud servers, or server arrays.

[0034] In addition, Figure 1 illustrates one application scenario of this application. First, a multimodal pre-trained model is pre-trained on a server to obtain a pre-trained multimodal pre-trained model. Then, this pre-trained model is formally trained to obtain a downstream task model. Finally, this downstream task model is formally deployed on the server. During deployment, data to be processed is input into the downstream task model, and output results are obtained. The server can be a cloud server. The input data to be processed for the downstream task model includes images and / or text, and the output results are images and / or text.

[0035] Specifically, a model pre-training system is provided, comprising: a cloud server and a terminal device; the cloud server deploying a multimodal pre-trained model; the terminal device for acquiring training data of the multimodal pre-trained model to be trained and sending the training data to the cloud server, the training data including images and text describing the images; the cloud server for extracting slot text that meets the training requirements from the text; generating question text that meets the training requirements and corresponding answer text based on the slot text; determining the images and question text as inputs to the multimodal pre-trained model, the answer text as labels for the multimodal pre-trained model, training the multimodal pre-trained model, and obtaining the trained multimodal pre-trained model.

[0036] Figure 1 is merely one example of an application scenario of this application. This application can also be applied to other model pre-training scenarios, without limitation.

[0037] The technical solutions provided by the various embodiments of this application are described in detail below with reference to the accompanying drawings.

[0038] Figure 2 is a flowchart of the steps of a model pre-training method provided by an exemplary embodiment of this application. The model pre-training method shown in Figure 2 includes the following steps:

[0039] S201, Obtain the training data corresponding to the multimodal pre-trained model to be trained.

[0040] The training data includes images and text used to describe the images.

[0041] In this embodiment, the training data may include multiple sets of samples, each set of samples including an image and text describing the image. The text content may include: the names of objects included in the image, object IDs, spatial information between objects, and object attribute information, such as the object's color, quantity, and shape. Referring to Figure 3, the training data consists of a set of samples, including an image and text describing that image. The text includes the names of objects in the image: "piano, piano chair, dining table, dining chair, green plants, TV cabinet, and television." It also includes location information between objects, such as "the piano is in the upper left corner of the image, east of the dining table; the dining chairs surround the dining table; the green plants are above the dining table; the piano chair is south of the piano; the sofa is south of the piano chair; the TV cabinet is in the lower right corner of the image; and the television is above the TV cabinet." Furthermore, it includes attribute information such as color information (e.g., "the piano is black, the piano chair is white, the dining table is black, the dining chair is pink, the green plants have red flowers, the sofa is green, the TV cabinet is white, and the television is gray"), as well as the number, shape, or type of objects. Additionally, the training data includes the regions of objects within the image, represented using coordinates. For example, let R(x1, y1, x2, y2) represent the region of an object in the image, with the bottom left corner of the image as the origin, the horizontal axis as the X-axis, and the vertical axis as the Y-axis; x1, y1 are the coordinates of the top left corner of the rectangle containing the object, and x2, y2 are the coordinates of the bottom right corner of the rectangle containing the object. For example, the region of the piano in the image is (100, 500, 400, 900). Furthermore, the region can also be represented by the general area of the object in the image, such as the piano in the top left corner of the image, or the television in the bottom right corner of the image.
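One sample of the training data described in paragraph [0041] might be held in memory roughly as follows. This is only a minimal sketch: the class names `ObjectAnnotation` and `TrainingSample`, the field names, and the file name `living_room.jpg` are illustrative assumptions and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Region R(x1, y1, x2, y2) as described in the text; the tuple layout is an assumption.
Region = Tuple[int, int, int, int]


@dataclass
class ObjectAnnotation:
    """One object mentioned in the descriptive text (hypothetical structure)."""
    name: str                                   # e.g. "piano"
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. {"color": "black"}
    region: Optional[Region] = None             # coordinates, if available


@dataclass
class TrainingSample:
    """An image together with the text that describes it (hypothetical structure)."""
    image_path: str                             # placeholder path, not from the source
    text: str
    objects: List[ObjectAnnotation] = field(default_factory=list)


sample = TrainingSample(
    image_path="living_room.jpg",
    text="The piano is black. The piano chair is white.",
    objects=[
        # Region values copied from the piano example in the text.
        ObjectAnnotation("piano", {"color": "black"}, (100, 500, 400, 900)),
        ObjectAnnotation("piano chair", {"color": "white"}),
    ],
)
```

Keeping the free text alongside the structured annotations leaves both available to whichever extraction method is used in the later steps.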

[0042] S202, extract slot text that meets the training requirements from the text.

[0043] In this embodiment, the training requirements are pre-defined. Slot text can be extracted from the text according to the training requirements. Slot text is a portion of the text, specifically words such as the visual color of an object, object type, and directional adverbs. For example, if a training requirement asks the downstream task model to identify the directional information of each object in an image, then the slot text is the object's name or ID and its directional information. If a training requirement asks the downstream task model to identify the color information of each object in an image, then the slot text is the object's name or ID and its color information.

[0044] For example, referring to Figure 3, suppose the training requirement is to extract positional and color-related information from the text. Then the slot text extracted from the text in Figure 3 includes: piano, piano chair, dining table, dining chair, green plant, TV cabinet, TV set, direction, color, east and south, above, black, white, pink, red, green, and gray. If the training requirement is to extract information related to quantity from the text, then the slot text extracted from the text in Figure 3 is: piano, piano chair, dining table, dining chair, green plant, TV cabinet, TV set, 1 and 4.

[0045] Among them, various text information extraction methods, including dependency analysis, regular expressions, and sequence labeling models, can be used to extract slot text that meets the training requirements from the text.
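Of the extraction methods listed, the regular-expression route is the simplest to illustrate. The sketch below assumes sentences of the form "the <object> is <color>" and a small hand-written color vocabulary; the pattern, the `COLORS` list, and the function name `extract_color_slots` are illustrative assumptions, not part of the application.

```python
import re
from typing import List, Tuple

# Assumed color vocabulary, taken from the colors named in the Figure 3 example.
COLORS = ["black", "white", "pink", "red", "green", "gray"]

# Matches clauses of the assumed form "the <object> is <color>".
PATTERN = re.compile(r"the ([a-z ]+?) is (" + "|".join(COLORS) + r")\b")


def extract_color_slots(text: str) -> List[Tuple[str, str]]:
    """Return (object, color) slot pairs found in the descriptive text."""
    return PATTERN.findall(text.lower())


description = "The piano is black, the piano chair is white, the sofa is green."
slots = extract_color_slots(description)
# slots == [("piano", "black"), ("piano chair", "white"), ("sofa", "green")]
```

A pattern like this only covers one sentence shape; dependency analysis or a sequence labeling model would be needed for freer text.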

[0046] In this embodiment, the required slot text can be extracted according to the training requirements as key information for training the multimodal pre-trained model, avoiding the use of irrelevant words in the text as key information, thereby improving the training efficiency and quality of the multimodal pre-trained model.

[0047] S203, based on the slot value text, generate question text that meets the training requirements and the corresponding answer text.

[0048] This process generates multiple target question-and-answer texts, each consisting of a question text and its corresponding answer text. For example, Table 1 shows the generated target question-and-answer texts.

[0049] Table 1

[0050]

[0051] In this embodiment, the answer text includes slot text, which can describe objects in the image. The question text may or may not include slot text; the answer text is used to respond to the question text.

[0052] In this embodiment, the generated question text and answer text meet the preset training requirements, which facilitates the learning and convergence of the downstream task model during training.

[0053] S204, determine that the image and question text are the inputs to the multimodal pre-trained model, and the answer text is the label of the multimodal pre-trained model. Train the multimodal pre-trained model to obtain the trained multimodal pre-trained model.

[0054] In this process, the image and question text are input into the multimodal pre-trained model for processing to obtain a predicted output. Then, the loss value between the label and the predicted output is determined. If the loss value is greater than a preset loss threshold, the model parameters of the multimodal pre-trained model are adjusted using this loss value. If the loss value is less than the preset loss threshold, the multimodal pre-trained model can continue training using the image and the next question text and answer text. The loss value in this embodiment can be a cross-entropy loss value or another loss value, which is not limited herein. Furthermore, a larger loss value leads to a larger adjustment of the model parameters of the multimodal pre-trained model, while a smaller loss value leads to a smaller adjustment.
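The loss-threshold decision in paragraph [0054] can be sketched without any model code. The example below assumes the predicted output is a probability distribution over candidate answers and uses cross-entropy against a single correct answer; the function names and the threshold value 0.1 are illustrative assumptions.

```python
import math
from typing import Dict, Tuple


def cross_entropy(predicted: Dict[str, float], label: str) -> float:
    """Cross-entropy between a predicted answer distribution and a one-hot label."""
    # Clamp to avoid log(0) when the label received no probability mass.
    return -math.log(max(predicted.get(label, 0.0), 1e-12))


def training_step_decision(
    predicted: Dict[str, float], label: str, loss_threshold: float
) -> Tuple[float, bool]:
    """Return the loss and whether the parameters should be adjusted.

    Parameters are adjusted only when the loss exceeds the preset threshold;
    otherwise training moves on to the next question/answer pair.
    """
    loss = cross_entropy(predicted, label)
    return loss, loss > loss_threshold


# Uncertain prediction: the loss is ln(2), above the assumed threshold of 0.1.
loss_a, adjust_a = training_step_decision({"black": 0.5, "white": 0.5}, "black", 0.1)
# Confident, correct prediction: the loss falls below the threshold.
loss_b, adjust_b = training_step_decision({"black": 0.99, "white": 0.01}, "black", 0.1)
```

In a real training loop the adjustment itself would be a gradient step, whose magnitude grows with the loss as the paragraph notes.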

[0055] In this embodiment, by using a slot text extraction method, slot text describing images is extracted from the training data according to the training requirements of the downstream task, generating a series of question texts and answer texts. The images, question texts, and answer texts are then used to train the multimodal pre-training model, which can guide the multimodal pre-training model to focus on learning the slot information in the training data, thereby improving the training efficiency of the multimodal pre-training model.

[0056] Figure 4 is a flowchart of another model pre-training method provided by an exemplary embodiment of this application. The model pre-training method shown in Figure 4 includes the following steps:

[0057] S401, Obtain the training data corresponding to the multimodal pre-trained model to be trained.

[0058] The training data includes images and text describing the images. The specific implementation process for this step is described in S201 and will not be repeated here.

[0059] S402, Get the template to be filled.

[0060] In this embodiment, the template to be filled is pre-designed according to training requirements. The template includes multiple empty slots, each corresponding to a text attribute. Furthermore, the template includes template question-and-answer text, which includes a template question and its corresponding template answer. Referring to Figure 5, a template question and its corresponding template answer constitute a template question-and-answer text.

[0061] Referring to Figure 5, the template to be filled includes an empty image slot, question templates, and answer templates. The empty image slot is filled with the image shown in Figure 3. Each of the multiple question templates contains empty slots [], where each empty slot has a corresponding text attribute. Each question template also has a corresponding answer template, which also contains empty slots, each with a corresponding text attribute.

[0062] For example, referring to Figure 5, when the question template is "[Object] What is [Visual Attribute Type]", one empty slot has the text attribute "Object", where "object" refers to an object in the image, such as the piano, television, and sofa in the image above. The other empty slot has the text attribute "Visual Attribute Type"; for example, the visual attribute type may be color. The corresponding answer template for this question template is "[Visual Attribute Value]", such as white or black as mentioned above.

[0063] In this embodiment of the application, multiple question templates to be filled can be designed according to training requirements, and multiple answer templates can be provided. The question template can be set with multiple empty slots with text attributes as needed, so as to fill the slot text later.
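A template to be filled, as described in paragraphs [0060] to [0063], can be modeled as a pair of strings whose bracketed parts are the empty slots. The sketch below is an assumption about representation only: the class name `QATemplate`, the regular expression, and the English wording of the question template are illustrative and not taken from the application.

```python
import re
from dataclasses import dataclass
from typing import List

# An empty slot is written as "[text attribute]" inside a template string.
SLOT = re.compile(r"\[([^\[\]]+)\]")


@dataclass(frozen=True)
class QATemplate:
    """One template question-and-answer text (hypothetical structure)."""
    qa_type: str    # question-and-answer type, e.g. "pure visual"
    question: str   # template question, may contain empty slots
    answer: str     # template answer, contains empty slots

    def question_slots(self) -> List[str]:
        """Text attributes of the empty slots in the template question."""
        return SLOT.findall(self.question)

    def answer_slots(self) -> List[str]:
        """Text attributes of the empty slots in the template answer."""
        return SLOT.findall(self.answer)


template = QATemplate(
    qa_type="pure visual",
    question="What is the [visual attribute type] of the [object]?",
    answer="[visual attribute value]",
)
# template.question_slots() == ["visual attribute type", "object"]
# template.answer_slots() == ["visual attribute value"]
```

Listing the text attributes up front tells the extraction step which slot text it needs to find.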

[0064] Furthermore, the template to be filled defines the attributes of the training samples of the multimodal pre-trained model, which enables the rapid and accurate extraction of key information in the training data that is relevant to the training requirements, thereby improving the training efficiency of multimodal pre-training.

[0065] In one embodiment, obtaining the template to be filled includes: obtaining the application scenario of the downstream task model of the multimodal pre-trained model, wherein the downstream task model includes at least one of the following: a visual question answering model, an image title generation model, or an image and text retrieval model; determining the corresponding template to be filled according to the application scenario, wherein the similarity between the template question answering text in the template to be filled and the training samples of the downstream task model is greater than a preset threshold.

[0066] In this embodiment, multiple templates to be filled can be pre-set, establishing a correspondence between the templates and application scenarios. For example, application scenarios include at least one of color-related, space-related, quantity-related, and object type-related. Therefore, different templates to be filled can be designed for different application scenarios. After determining the application scenario of the downstream task model, the corresponding template to be filled can be determined based on that application scenario, and subsequently, the slot value text corresponding to the application scenario can be extracted based on the template to be filled. If the application scenario involves processing color-related information of objects in an image, then both the question template and the answer template in the template to be filled are related to the object's color. For example, in Figure 5 the application scenario involves spatial and color-related information in an image, so the template to be filled is the one shown in Figure 5.

[0067] Furthermore, the template question-and-answer text in the template to be filled has a high similarity to the training samples of the downstream task model. For example, if the downstream task model is used to identify the color of an object in an image, then the template question-and-answer text is about color. If the downstream task model is used to identify the position of an object in an image, then the template question-and-answer text is about position. Therefore, in this embodiment, the template to be filled can be determined based on the downstream task model, thereby determining the attributes of the extracted slot text to improve the multimodal pre-trained model.
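The application does not say how similarity between template text and downstream training samples is measured. Purely as an illustration, the sketch below uses Jaccard similarity over lowercase word sets and keeps the templates whose best match exceeds a preset threshold; the metric, the function names, and the threshold 0.5 are all assumptions.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two strings (assumed metric)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_templates(
    templates: List[str], downstream_samples: List[str], threshold: float
) -> List[str]:
    """Keep templates whose best similarity to any downstream sample exceeds the threshold.

    Assumes downstream_samples is non-empty.
    """
    return [
        t for t in templates
        if max(jaccard(t, s) for s in downstream_samples) > threshold
    ]


candidate_templates = [
    "what color is the [object]",
    "how many [object] are there",
]
downstream_questions = ["what color is the sofa"]

selected = select_templates(candidate_templates, downstream_questions, threshold=0.5)
# selected == ["what color is the [object]"]
```

With a color-focused downstream task, the quantity template is filtered out, which mirrors the point made in the text about not spending pre-training time on irrelevant attributes.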

[0068] If the downstream task model is used to identify the color of objects in an image, and existing technology is used to train a multimodal pre-trained model by using information describing the number of objects in the training data, then the multimodal pre-trained model's learning of the number of objects in the image does not help the downstream task model to be able to identify the color of objects, and will waste the training time of the multimodal pre-trained model.

[0069] S403, extract the slot text belonging to the text attribute from the text.

[0070] Specifically, based on the text attributes in a template question-and-answer text, various text information extraction methods, including dependency analysis, regular expressions, or sequence labeling models, are used to extract slot text from the text.

[0071] Referring to Figure 5, one template question-and-answer text is "[object] is what [visual property type], [visual property value]". From the text in Figure 3, we can extract the slot text "piano" corresponding to the text attribute "object", the slot text "color" corresponding to the text attribute "visual attribute type", and the slot text "black" corresponding to the text attribute "visual attribute value". Thus, one set of slot text is "piano", "color", and "black". Similarly, for this template question-and-answer text, we can obtain multiple sets of slot text, such as "sofa", "color", and "green". For another template question-and-answer text, "[object] where is it? [location]", we can also obtain multiple sets of slot text.

[0072] In this embodiment, the extraction of slot text is essentially the extraction of a portion of the text defined by the template to be filled, while irrelevant text is not extracted, in order to improve the training efficiency of the multimodal pre-trained model.

[0073] S404: Fill the empty slots in the template answer with the corresponding slot text to obtain the answer text.

[0074] Referring to Figure 5, the target template is obtained by filling the corresponding content into the template to be filled. Specifically, the answer text is obtained by filling the empty slots in the template answer with the corresponding slot value text.

[0075] S405, if there are no empty slots in the template question, determine the template question as the question text.

[0076] In this embodiment, the template question can be pre-set when designing the template to be filled, and there is no need to set empty slots in the template question. In this case, the template question is the question text. For example, a template question-and-answer text is "What is the color tone of this image? [Actual attribute value]"; that is, there are no empty slots in the template question, and empty slots are set only in the answer template, which replies to the template question.

[0077] Depending on requirements, the template question may contain no empty slots or one or more empty slots, and the template answer may contain one or more empty slots. The number of empty slots is not limited.

[0078] S406: If there are empty slots in the template question, fill the empty slots in the template question with the corresponding slot text to obtain the question text.

[0079] Referring to Figure 5, the empty slots in the question template are filled with the corresponding slot text to obtain the question text.
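Steps S404 to S406 amount to substituting slot text into bracketed placeholders. The following is a minimal sketch under that reading; the function names `fill` and `build_qa`, the `[attribute]` placeholder syntax, and the English template wording are illustrative assumptions.

```python
import re
from typing import Dict, Tuple

# An empty slot is written as "[text attribute]" inside a template string.
SLOT = re.compile(r"\[([^\[\]]+)\]")


def fill(template: str, slots: Dict[str, str]) -> str:
    """Replace every empty slot with the slot text for its text attribute."""
    return SLOT.sub(lambda match: slots[match.group(1)], template)


def build_qa(
    question_template: str, answer_template: str, slots: Dict[str, str]
) -> Tuple[str, str]:
    """Produce (question text, answer text) from one template and one set of slot text."""
    answer = fill(answer_template, slots)              # S404
    if SLOT.search(question_template) is None:         # S405: no empty slots
        question = question_template
    else:                                              # S406: fill the question
        question = fill(question_template, slots)
    return question, answer


slot_set = {
    "object": "piano",
    "visual attribute type": "color",
    "visual attribute value": "black",
}
question, answer = build_qa(
    "What is the [visual attribute type] of the [object]?",
    "[visual attribute value]",
    slot_set,
)
# question == "What is the color of the piano?", answer == "black"
```

A missing text attribute raises `KeyError` here, which makes an incomplete set of slot text visible early rather than producing a half-filled question.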

[0080] In this embodiment, both the question text and the answer text are natural language, which can maintain consistency with the language form of the training text of the downstream task model.

[0081] S407: Combine the question-and-answer type and the question text to obtain the target text.

[0082] The template to be filled also includes the question-and-answer type corresponding to the template question. As shown in Figure 5, the template to be filled includes various question-and-answer types.

[0083] For example, question-and-answer types include: purely visual, region-guided visual, location-guided visual, purely spatial, region-guided spatial, and visual-guided spatial. Purely visual indicates the question text is about color; region-guided visual indicates the question text is about color and related image regions; location-guided visual indicates the question text is about color and related content regarding location; purely spatial indicates the question text is about location; region-guided spatial indicates the question text is about location and related image regions; and visual-guided spatial indicates the question text is about the location of an object identified by its color.

[0084] In this embodiment, the purpose of the question-and-answer type is to instruct the multimodal pre-trained model to learn knowledge of that question-and-answer type. The question-and-answer type and the question text can be concatenated to obtain the target text, which is the target question input into the multimodal pre-trained model.

[0085] For example, referring to Figure 5, the target text obtained by concatenating the question-and-answer type "purely visual" and the question text "What color is a piano?" is "purely visual - What color is a piano?".

[0086] S408: Determine the image samples and target text as input samples for the multimodal pre-training model, and the answer text as the label for the multimodal pre-training model. Train the multimodal pre-training model to obtain the trained multimodal pre-training model.
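Steps S407 and S408 amount to assembling one pre-training sample per target question-and-answer text. The sketch below assumes a " - " separator (taken from the example above) and a plain dictionary layout with hypothetical keys `image`, `text`, and `label`; the image is represented by a file name only, which is also an assumption.

```python
def build_target_text(qa_type, question_text, separator=" - "):
    """S407: concatenate the question-and-answer type and the question text."""
    return f"{qa_type}{separator}{question_text}"

def build_pretraining_sample(image, qa_type, question_text, answer_text):
    """S408: image and target text are the input, the answer text is the label."""
    return {
        "image": image,
        "text": build_target_text(qa_type, question_text),
        "label": answer_text,
    }

sample = build_pretraining_sample(
    image="living_room.jpg",  # hypothetical image reference
    qa_type="purely visual",
    question_text="What color is a piano?",
    answer_text="black",
)
print(sample["text"])
```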

[0087] In an optional embodiment, there are multiple target question-and-answer texts, each consisting of a question text and an answer text. Training the multimodal pre-trained model involves: determining, for each target question-and-answer text, the number of slot texts it contains; and training the multimodal pre-trained model using the target question-and-answer texts in ascending order of the number of slot texts.

[0088] For example, referring to Figure 5, the target template corresponds to 6 target question-and-answer texts for the image. Target question-and-answer text A is "What color is the piano? White," containing 3 slot texts. Target question-and-answer text B is "What color is the piano in the upper left corner of the image? Black," containing 4 slot texts. Target question-and-answer text C is "What color are the flowers on the green plant on the dining table? Red," containing 5 slot texts. Target question-and-answer text D is "Where is the sofa located? South of the piano," containing 2 slot texts. Target question-and-answer text E is "Where is the TV in the lower right corner of the image located? Above the TV cabinet," containing 3 slot texts (here "lower right corner" can be replaced with coordinates such as "600, 100, 900, 500"). Target question-and-answer text F is "Where is the white piano chair located? South of the piano," containing 3 slot texts. When training the multimodal pre-trained model, the model is first trained using the image and target question-and-answer text D, which contains two slot texts. Then, it is trained using the image and target question-and-answer texts A, E, and F, which contain three slot texts. Next, it is trained using the image and target question-and-answer text B, which contains four slot texts. Finally, it is trained using the image and target question-and-answer text C, which contains five slot texts.

[0089] In this embodiment, the multimodal pre-trained model can be trained starting from low-difficulty target question-and-answer texts, based on the number of slot texts in each target question-and-answer text and following a curriculum learning approach. Training is less difficult when the number of slot texts is small, and more difficult when the number of slot texts is large. Therefore, by gradually increasing the training difficulty, the multimodal pre-trained model can converge to better parameters more quickly.
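The ordering by slot count can be sketched as a grouping step that runs before training. The tuple layout `(name, slot_count)` and the function name `curriculum_stages` are hypothetical; the slot counts are the ones given for texts A to F above.

```python
from itertools import groupby

# (name, number of slot texts) for the six target question-and-answer texts.
target_texts = [("A", 3), ("B", 4), ("C", 5), ("D", 2), ("E", 3), ("F", 3)]

def curriculum_stages(texts):
    """Group texts by slot count and return the groups in ascending order."""
    ordered = sorted(texts, key=lambda item: item[1])  # stable sort keeps A, E, F order
    return [
        (count, [name for name, _ in group])
        for count, group in groupby(ordered, key=lambda item: item[1])
    ]

stages = curriculum_stages(target_texts)
for count, names in stages:
    print(count, names)
```

Each stage would then be passed to the training step in turn, so the model sees D first, then A, E, and F, then B, and finally C.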

[0090] Furthermore, the embodiments of this application do not limit the specific structure of the multimodal pre-trained model.

[0091] Compared to existing text-based random mask prediction tasks, this application's multimodal pre-trained model can determine the corresponding template to be filled based on the application scenario of the downstream task model during training, thus more specifically guiding the multimodal pre-trained model to focus on slot text (important information) related to the downstream task model. Furthermore, compared to existing image-based random mask reconstruction tasks, the input and output of this application's multimodal pre-trained model are both in natural language form, consistent with the input and output of the downstream task model, which is beneficial for the convergence of the fine-tuning process during the formal training of the downstream task model after pre-training. In addition, this application also addresses the shortcoming of image-text matching pre-training tasks that consider complete sentence and image-level semantics while ignoring the features of words and objects in the image. It extracts a portion of the semantics (slot text) from the text using text extraction methods to generate question text to pose a question.

[0092] Furthermore, the model pre-training method proposed in this application does not require additional data annotation. By using text extraction methods and templates to be filled, a variety of target question-and-answer texts can be generated based on the training data. Training a multimodal pre-trained model with these target question-and-answer texts can guide the multimodal pre-trained model to focus on information (input and output) closely related to the downstream task model, helping the multimodal pre-trained model to converge better and faster.

[0093] Figure 6 is a flowchart illustrating the steps of a model training method provided by an exemplary embodiment of this application. As shown in Figure 6, the model training method includes the following steps:

[0094] S601, Obtain training samples.

[0095] The above embodiments are pre-training processes of downstream task models, while this embodiment is the formal training process of downstream task models. Downstream task models refer to downstream models of multimodal pre-trained models.

[0096] The downstream task model includes at least one of the following: a visual question answering model, an image title generation model, or an image and text retrieval model.

[0097] Furthermore, the training samples here are those used to train the downstream task model. These training samples can be generated during the actual question-answering process or manually labeled samples. For example, the pre-training samples for a multimodal pre-trained model are images and their corresponding target question-answer text, where the target question-answer text is "What color is this piano? White." If the downstream task model is a visual question-answering model, the training samples are images and the question-answer text, which could be "What color is this piano? The piano is white." If the downstream task model is an image title generation model, the training samples are images and title text, which could be "This image contains a white piano." If the downstream task model is an image-text retrieval model, the training samples are images and input text, which could be "A white piano."

[0098] In the embodiments of this application, obtaining formal training samples and training a multimodal pre-trained model can yield a downstream task model that can accurately achieve a specific task.

[0099] S602: Use the training samples to train the multimodal pre-trained model to obtain the downstream task model.

[0100] The multimodal pre-trained model is trained using any of the model pre-training methods mentioned above. This step involves fine-tuning the pre-trained multimodal pre-trained model using training samples to obtain the downstream task model.

[0101] For example, if the downstream task model is a visual question answering model, the image and the question text "What color is this piano?" are input into the visual question answering model to obtain the prediction result. The loss value between the prediction result and the answer text "The piano is white" is calculated. If the loss value is greater than a threshold, the model parameters of the visual question answering model are adjusted, and training continues. If the loss value is less than the threshold, the training of the visual question answering model is complete. If the downstream task model is an image title generation model, the training samples are input into the image title generation model to obtain the predicted title. The loss value between the predicted title and the title text "This image contains a white piano" is calculated. If the loss value is greater than a threshold, the model parameters of the image title generation model are adjusted, and training continues. If the loss value is less than the threshold, the training of the image title generation model is complete. If the downstream task model is an image-text retrieval model, the text "white piano" is input into the image-text retrieval model to obtain the predicted image. The loss value between the predicted image and an image containing a white piano is calculated. If the loss value is greater than a threshold, the model parameters of the image-text retrieval model are adjusted, and training continues. If the loss value is less than the threshold, the training of the image-text retrieval model is complete. In this embodiment, the loss value can be the cross-entropy loss value or other loss values, which are not limited herein. 
Furthermore, when the loss value is greater than the loss value threshold, the larger the loss value, the larger the adjustment made to the model parameters of the downstream task model; the smaller the loss value, the smaller the adjustment.
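The threshold-based stopping rule can be illustrated with a deliberately tiny stand-in model. Everything here is an assumption for the sake of a runnable sketch: a one-parameter model `y = weight * x`, a mean squared error loss in place of the cross-entropy loss mentioned above, plain gradient descent, and the names `mean_squared_error` and `fine_tune`. It is not the downstream task model itself.

```python
def mean_squared_error(weight, samples):
    """Loss of the toy model y = weight * x over (x, y) pairs."""
    return sum((weight * x - y) ** 2 for x, y in samples) / len(samples)

def fine_tune(weight, samples, threshold, learning_rate=0.1, max_steps=1000):
    """Adjust the parameter until the loss falls below the threshold."""
    for step in range(max_steps):
        loss = mean_squared_error(weight, samples)
        if loss < threshold:
            # Loss is below the threshold: training is complete.
            return weight, loss, step
        # Loss is still too large: adjust the parameter and continue.
        # The gradient grows with the error, so a larger loss gives a larger update.
        gradient = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= learning_rate * gradient
    return weight, mean_squared_error(weight, samples), max_steps

samples = [(1.0, 2.0), (2.0, 4.0)]  # generated by y = 2 * x
weight, loss, steps = fine_tune(0.0, samples, threshold=0.01)
print(weight, loss, steps)
```

The `max_steps` cap is an addition of this sketch; it guards against a threshold that is never reached.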

[0102] In this embodiment of the application, since important information (slot text) is extracted from the training data and used to pre-train the multimodal pre-trained model during the above pre-training process, the multimodal pre-trained model can converge better. Therefore, formal training of the multimodal pre-trained model can more efficiently yield a high-quality downstream task model.

[0103] Figure 7 is a flowchart illustrating the steps of a data processing method provided by an exemplary embodiment of this application. As shown in Figure 7, the data processing method specifically includes the following steps:

[0104] S701, Obtain the data to be processed.

[0105] The data to be processed includes images and / or text. In the embodiments of this application, when the downstream task model is a visual question answering model, the data to be processed is both images and text. When the downstream task model is an image title generation model, the data to be processed is images. When the downstream task model is an image-text retrieval model, the data to be processed is text.

[0106] S702: Input the data to be processed into the downstream task model for processing to obtain the output result.

[0107] If the data to be processed includes both images and text, the output is a response text for both the images and the text. If the data to be processed includes images, the output is the image's title information. If the data to be processed includes text, the output is the image described in the text. The downstream task model is trained using the model training method described above.
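The mapping from input modalities to output type can be written as a small dispatch function. The function name `expected_output_kind` and the returned labels are hypothetical and only restate the three cases above; the actual inference call of the downstream task model is not shown.

```python
def expected_output_kind(image=None, text=None):
    """Return the kind of output produced for the given data to be processed."""
    if image is not None and text is not None:
        # Visual question answering: image plus text in, response text out.
        return "response text"
    if image is not None:
        # Image title generation: image in, title information out.
        return "image title"
    if text is not None:
        # Image-text retrieval: text in, described image out.
        return "image"
    raise ValueError("the data to be processed must include an image and/or text")

print(expected_output_kind(image="room.jpg", text="What color is this piano?"))
print(expected_output_kind(image="room.jpg"))
print(expected_output_kind(text="A white piano"))
```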

[0108] In this embodiment of the application, the downstream task model obtained by training is a multimodal model. This multimodal model can perform related processing on images and text. In addition, the downstream task model is obtained by pre-training and formal training, and has high data processing performance.

[0109] In addition to the model pre-training method, the embodiments of this application also provide a model pre-training apparatus. As shown in Figure 8, the model pre-training apparatus 80 includes:

[0110] The acquisition module 81 is used to acquire the training data corresponding to the multimodal pre-trained model to be trained. The training data includes images and text used to describe the images.

[0111] Extraction module 82 is used to extract slot text that meets the training requirements from the text;

[0112] The generation module 83 is used to generate question text that meets the training requirements and the corresponding answer text based on the slot value text.

[0113] Training module 84 is used to determine that the image and question text are the inputs of the multimodal pre-trained model, and the answer text is the label of the multimodal pre-trained model, to train the multimodal pre-trained model and obtain the trained multimodal pre-trained model.

[0114] In an optional embodiment, the extraction module 82 is specifically used to obtain a template to be filled, the template including multiple empty slots, each slot corresponding to a text attribute; and to extract the text belonging to the text attribute from the text.

[0115] In an optional embodiment, the template to be filled includes template question and answer text, which includes template questions and template answers corresponding to the template questions. The generation module 83 is specifically used to fill the slot value empty spaces in the template answer with the corresponding slot value text to obtain the answer text; if there are no slot value empty spaces in the template question, the template question is determined to be question text.

[0116] In an optional embodiment, the generation module 83 is further configured to, when there are empty slots in the template question, fill the empty slots in the template question with the corresponding slot text to obtain the question text, and fill the empty slots in the template answer with the corresponding slot text to obtain the answer text.

[0117] In an optional embodiment, the template to be filled further includes a question-and-answer type corresponding to the template question, and the training module 84 is specifically used to concatenate the question-and-answer type and the question text to obtain the target text, and to determine the image sample and the target text as input samples for the multimodal pre-trained model.

[0118] In an optional embodiment, when the extraction module 82 is used to obtain the template to be filled, it is specifically used to: obtain the application scenario of the downstream task model of the multimodal pre-trained model, the downstream task model including at least one of the following: a visual question answering model, an image title generation model, or an image and text retrieval model; and determine the corresponding template to be filled according to the application scenario, wherein the similarity between the template question-and-answer text in the template to be filled and the training samples of the downstream task model is greater than a preset threshold.
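Template selection by similarity can be sketched as a filter over candidate templates. The application does not specify the similarity measure, so the Jaccard overlap of word sets used here is purely an assumed stand-in, as are the threshold value 0.3 and the names `tokens`, `jaccard`, and `select_templates`.

```python
import re

def tokens(text):
    """Lowercased word set of a text, ignoring [slot] placeholders."""
    without_slots = re.sub(r"\[[^\]]*\]", " ", text)
    return set(re.findall(r"[a-z]+", without_slots.lower()))

def jaccard(a, b):
    """Assumed similarity measure: overlap of the two word sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def select_templates(templates, downstream_samples, threshold=0.3):
    """Keep templates whose best similarity to any downstream sample exceeds the threshold."""
    selected = []
    for template in templates:
        best = max((jaccard(template, s) for s in downstream_samples), default=0.0)
        if best > threshold:
            selected.append(template)
    return selected

templates = [
    "What [visual attribute type] is the [object]? [visual attribute value]",
    "Where is the [object]? [location]",
]
downstream_samples = ["What color is this piano? The piano is white."]
chosen = select_templates(templates, downstream_samples)
print(chosen)
```

With this single color-related downstream sample, only the first template passes the assumed threshold; a retrieval or spatial scenario would supply different samples and select different templates.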

[0119] The model pre-training apparatus provided in this application embodiment can improve the training efficiency of multimodal pre-trained models. The specific implementation process is the same as described in the above method embodiment, and will not be repeated here.

[0120] Furthermore, in some of the processes described in the above embodiments and accompanying drawings, multiple operations appear in a specific order. However, it should be clearly understood that these operations may not be executed in the order they appear herein, or may be executed in parallel. The sequence numbers are merely used to distinguish different operations, and the sequence numbers themselves do not represent any execution order. Additionally, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that the descriptions such as "first," "second," etc., in this document are used to distinguish different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit "first" and "second" to different types.

[0121] Figure 9 is a schematic diagram of an electronic device provided by an exemplary embodiment of this application. The electronic device can be a cloud device and is used to run the aforementioned model training method and model pre-training method. As shown in Figure 9, the electronic device includes a memory 94 and a processor 95.

[0122] Memory 94 is used to store computer programs and can be configured to store various other data to support operation on electronic devices. This memory 94 may be object storage (OSS).

[0123] The memory 94 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk or optical disk.

[0124] The processor 95, coupled to the memory 94, is used to execute the computer program in the memory 94 for: acquiring training data corresponding to the multimodal pre-trained model to be trained, the training data including images and text used to describe the images; extracting slot text that meets the training requirements from the text; generating question text that meets the training requirements and answer text corresponding to the question text based on the slot text; determining that the images and question text are the inputs of the multimodal pre-trained model, and the answer text is the label of the multimodal pre-trained model; training the multimodal pre-trained model; and obtaining the trained multimodal pre-trained model.

[0125] Further optionally, when the processor 95 extracts slot text that meets the training requirements from the text, it is specifically used to: obtain a template to be filled, the template to be filled includes multiple slot empty spaces, and the slot empty spaces correspond to text attributes; and extract slot text that belongs to the text attributes from the text.

[0126] Further optionally, the template to be filled includes template question and answer text, which includes template question and template answer corresponding to the template question. When the processor 95 generates question text and answer text corresponding to the question text that meet the training requirements based on the slot value text, it is specifically used to: fill the slot value empty space in the template answer with the corresponding slot value text to obtain the answer text; and determine the template question as question text if there is no slot value empty space in the template question.

[0127] Further optionally, when the processor 95 generates question text and answer text corresponding to the question text that meet the training requirements based on the slot text, it is also used to: if there are empty slots in the template question, fill the empty slots in the template question with the corresponding slot text to obtain the question text, and fill the empty slots in the template answer with the corresponding slot text to obtain the answer text.

[0128] Further optionally, the template to be filled also includes a question-and-answer type corresponding to the template question. When determining that the image and question text are the inputs to the multimodal pre-trained model, the processor 95 is specifically used to: concatenate the question-and-answer type and the question text to obtain the target text; and determine the image sample and the target text as input samples for the multimodal pre-trained model.

[0129] Optionally, the processor 95 is further configured to acquire the application scenario of the downstream task model of the multimodal pre-trained model, the downstream task model including at least one of: visual question answering model, image title generation model or image and text retrieval model; and determine the corresponding template to be filled according to the application scenario, wherein the similarity between the template question answer text in the template to be filled and the training sample of the downstream task model is greater than a preset threshold.

[0130] In one optional embodiment, the processor 95, coupled to the memory 94, is used to execute a computer program in the memory 94, and is further used to: acquire training samples; use the training samples to train a multimodal pre-trained model to obtain a downstream task model, the downstream task model including at least one of a visual question answering model, an image title generation model, or an image and text retrieval model, wherein the multimodal pre-trained model is trained by the above-mentioned model pre-training method.

[0131] In one optional embodiment, the processor 95, coupled to the memory 94, is configured to execute a computer program in the memory 94, and is further configured to: acquire data to be processed, the data to be processed including images and / or text; input the data to be processed into a downstream task model for processing, and obtain an output result, wherein if the data to be processed includes images and text, the output result is a response text for the images and text; if the data to be processed includes images, the output result is the title information of the images; and if the data to be processed includes text, the output result is the image described in the text. The downstream task model is trained using the model training method described above.

[0132] Furthermore, as shown in Figure 9, the electronic device also includes other components such as a firewall 91, a load balancer 92, a communication component 96, and a power supply component 98. Figure 9 only shows some components schematically, which does not mean that the electronic device includes only the components shown in Figure 9.

[0133] Accordingly, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the method described above.

[0134] Accordingly, embodiments of this application also provide a computer program product, including a computer program / instructions, which, when executed by a processor, cause the processor to implement the steps in the method described above.

[0135] The communication component in Figure 9 above is configured to facilitate wired or wireless communication between the device containing the communication component and other devices. The device containing the communication component can access wireless networks based on communication standards, such as WiFi, 2G, 3G, 4G / LTE, 5G, or combinations thereof. In one exemplary embodiment, the communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component also includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID), Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

[0136] The power supply component in Figure 9 above provides power to the various components of the device in which it resides. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which it resides.

[0137] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0138] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0139] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0140] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0141] In a typical configuration, a computing device includes one or more processors (CPU and / or GPU), input / output interfaces, network interfaces, and memory.

[0142] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0143] Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0144] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0145] The above are merely embodiments of this application and are not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A model pre-training method, characterized in that it comprises: obtaining training data corresponding to a multimodal pre-trained model to be trained, wherein the training data includes images and text used to describe the images; obtaining a template to be filled, which includes multiple empty slots and template question-and-answer text, wherein each empty slot has a corresponding text attribute, and the template question-and-answer text includes a template question and a template answer corresponding to the template question; extracting slot text belonging to the text attribute from the text; if there are no empty slots in the template question, determining the template question as question text that meets the training requirements, and filling the empty slots in the template answer with the corresponding slot text to obtain the answer text corresponding to the question text; if there are empty slots in the template question, filling the empty slots in the template question with the corresponding slot text to obtain the question text, and filling the empty slots in the template answer with the corresponding slot text to obtain the answer text; and determining the image and the question text as the input of the multimodal pre-trained model and the answer text as the label of the multimodal pre-trained model, and training the multimodal pre-trained model to obtain the trained multimodal pre-trained model.

2. The model pre-training method according to claim 1, characterized in that the template to be filled also includes the question-and-answer type corresponding to the template question, and determining that the image and the question text are the inputs to the multimodal pre-trained model includes: concatenating the question-and-answer type and the question text to obtain the target text; and determining the image samples and the target text as input samples for the multimodal pre-trained model.

3. The model pre-training method according to claim 1, characterized in that obtaining the template to be filled includes: obtaining the application scenario of the downstream task model of the multimodal pre-trained model, wherein the downstream task model includes at least one of the following: a visual question answering model, an image title generation model, or an image and text retrieval model; and determining, based on the application scenario, a corresponding template to be filled, wherein the similarity between the template question-and-answer text in the template to be filled and the training samples of the downstream task model is greater than a preset threshold.

4. The model pre-training method according to claim 1, characterized in that there are a plurality of target question-and-answer texts for the multimodal pre-trained model, each target question-and-answer text including the question text and the answer text, and training the multimodal pre-trained model comprises: for each target question-and-answer text, determining the number of slot value texts in the target question-and-answer text; and training the multimodal pre-trained model with the target question-and-answer texts in ascending order of that number.
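The ordering in claim 4 amounts to sorting the target question-and-answer texts by slot count before training. The sketch below assumes each item is a dictionary carrying a `slots` list of the slot value texts it contains; this record layout is hypothetical.

```python
def order_by_slot_count(qa_items):
    """Sort target question-and-answer texts by their number of slot value texts.

    Python's sort is stable, so items with equal counts keep their original order.
    """
    return sorted(qa_items, key=lambda item: len(item["slots"]))


def training_schedule(qa_items):
    """Yield (question, answer) pairs, fewest slot values first."""
    for item in order_by_slot_count(qa_items):
        yield item["question"], item["answer"]
```

Feeding samples with fewer slot values first resembles a curriculum that moves from simpler to more complex samples, although the claim itself states only the ascending order.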

5. A model training method, characterized by comprising: obtaining training samples; and training a multimodal pre-trained model with the training samples to obtain a downstream task model, wherein the downstream task model includes at least one of a visual question answering model, an image title generation model, or an image and text retrieval model, and the multimodal pre-trained model is trained by the model pre-training method according to any one of claims 1 to 4.

6. A data processing method, characterized by comprising: acquiring data to be processed, wherein the data to be processed includes an image and/or text; and inputting the data to be processed into a downstream task model for processing to obtain an output result, wherein: if the data to be processed includes an image and text, the output result is a response text for the image and the text; if the data to be processed includes an image, the output result is title information of the image; if the data to be processed includes text, the output result is an image described by the text; and the downstream task model is trained by the model training method according to claim 5.
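The three input cases of claim 6 can be read as a dispatch on which modalities are present. The sketch below only selects the kind of output; the task labels are hypothetical names, and it assumes the image-only and text-only cases apply when the other modality is absent.

```python
def select_output_kind(image=None, text=None):
    """Return which kind of output result applies to the data to be processed."""
    if image is not None and text is not None:
        # Image and text: response text for the image and the text.
        return "response_text"
    if image is not None:
        # Image only: title information of the image.
        return "image_title"
    if text is not None:
        # Text only: the image described by the text.
        return "retrieved_image"
    raise ValueError("data to be processed must include an image and/or text")
```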

7. A model pre-training device, characterized by comprising: an acquisition module, configured to acquire training data corresponding to a multimodal pre-trained model to be trained, wherein the training data includes an image and text describing the image; an extraction module, configured to obtain a template to be filled, wherein the template to be filled includes a plurality of empty slots and template question-and-answer text, each empty slot has a corresponding text attribute, and the template question-and-answer text includes a template question and a template answer corresponding to the template question, and to extract, from the text, slot value text belonging to the text attribute; a generation module, configured to, if the template question contains no empty slot, determine the template question to be question text that meets the training requirement and fill the empty slots in the template answer with the corresponding slot value text to obtain answer text corresponding to the question text, and, if the template question contains an empty slot, fill the empty slot in the template question with the corresponding slot value text to obtain the question text and fill the empty slots in the template answer with the corresponding slot value text to obtain the answer text; and a training module, configured to determine the image and the question text as the input of the multimodal pre-trained model, determine the answer text as the label of the multimodal pre-trained model, and train the multimodal pre-trained model to obtain a trained multimodal pre-trained model.

8. A model pre-training system, characterized by comprising: a cloud server and a terminal device, wherein a multimodal pre-trained model is deployed on the cloud server; the terminal device is configured to acquire training data of the multimodal pre-trained model to be trained and send the training data to the cloud server, wherein the training data includes an image and text describing the image; and the cloud server is configured to: obtain a template to be filled, wherein the template to be filled includes a plurality of empty slots and template question-and-answer text, each empty slot has a corresponding text attribute, and the template question-and-answer text includes a template question and a template answer corresponding to the template question; extract, from the text, slot value text belonging to the text attribute; if the template question contains no empty slot, determine the template question to be question text that meets the training requirement, and fill the empty slots in the template answer with the corresponding slot value text to obtain answer text corresponding to the question text; if the template question contains an empty slot, fill the empty slot in the template question with the corresponding slot value text to obtain the question text, and fill the empty slots in the template answer with the corresponding slot value text to obtain the answer text; determine the image and the question text as the input of the multimodal pre-trained model, and determine the answer text as the label of the multimodal pre-trained model; and train the multimodal pre-trained model to obtain a trained multimodal pre-trained model.

9. An electronic device, characterized by comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the model pre-training method according to any one of claims 1 to 4, and/or the model training method according to claim 5, and/or the data processing method according to claim 6.