Self-supervised encoder training method and apparatus, storage medium and electronic device

By inputting image and text sequences into the encoder module and performing encoding and reconstruction loss calculations, the problem of the inability of autoencoders to learn in multiple modes is solved, and multimodal autoencoder training for images and text is realized, thereby enhancing the multimodal learning capability of the encoder.

CN115272786B · Active · Publication Date: 2026-06-23 · BEIJING XUEZHITU NETWORK TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING XUEZHITU NETWORK TECH
Filing Date: 2022-06-30
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

In existing technologies, autoencoders cannot achieve multimodal self-supervised learning of images and text, and cannot train generative autoencoders by combining images and text together.

Method used

By acquiring the test image and text sequence, inputting them into the trained encoder module, encoding them to obtain target image features and target text features, and using them for downstream tasks, the encoder module is trained by using masking operations and splicing to process the image and text sequence, and the decoder is used for reconstruction and loss calculation for backpropagation, thus realizing the training of a multimodal autoencoder.

Benefits of technology

We have achieved training of multimodal autoencoders for images and text, which enhances the multimodal learning capability of the encoder and enables effective utilization of image and text features in downstream tasks.

✦ Generated by Eureka AI based on patent content.

Figure CN115272786B_ABST

Abstract

The application discloses a self-supervised encoder training method and device, a storage medium and an electronic device. The method comprises the following steps: obtaining a to-be-tested image and a to-be-tested text sequence; inputting the to-be-tested image and the to-be-tested text sequence into a trained encoder module; obtaining a target image feature and a target text feature through encoding; and using the target image feature and the target text feature for a downstream task. The application solves the technical problem that an encoder cannot automatically learn in a multi-modal manner.

Description

Technical Field

[0001] This invention relates to the field of computers, and more specifically, to a self-supervised encoder training method, apparatus, storage medium, and electronic device.

Background Technology

[0002] Autoencoders are a type of generative model. With the advent of deep learning, autoencoders can achieve dimensionality reduction by stacking network layers to form deep autoencoders. By reducing the number of units in the hidden layers during the encoding process, dimensionality reduction can be achieved in a hierarchical manner, obtaining higher-level features in deeper hidden layers, thus better reconstructing the data during the decoding process. Image autoencoders can achieve self-supervised learning and learn relatively good image features. However, current technologies are contrastive learning methods and cannot combine images and text to achieve multimodal generative autoencoders.

Summary of the Invention

[0003] This invention provides a self-supervised encoder training method, apparatus, storage medium, and electronic device to at least solve the technical problem that encoders cannot automatically learn in multiple modes.

[0004] According to one aspect of the present invention, a self-supervised encoder training method is provided, comprising: acquiring a test image and a test text sequence; inputting the test image and the test text sequence into a trained encoder module; obtaining target image features and target text features through encoding; and using the target image features and the target text features for downstream tasks.

[0005] According to another aspect of the present invention, a self-supervised encoder training apparatus is provided, comprising: an acquisition module for acquiring a test image and a test text sequence; an input module for inputting the test image and the test text sequence into a trained encoder module; an encoding module for obtaining target image features and target text features through encoding; and a processing module for using the target image features and the target text features for downstream tasks.

[0006] As an optional example, the input module includes: an acquisition unit for acquiring a training image and a training text; a processing unit for performing a masking operation on the training image to obtain a first image sequence and performing a masking operation on the training text to obtain a first text sequence; and a training unit for training an encoder module using the first image sequence and the first text sequence to obtain the trained encoder module.

[0007] As an optional example, the above processing unit includes: a first segmentation subunit, used to divide the image to be trained into equal blocks to obtain a first number of image blocks; a first processing subunit, used to perform the masking operation on a first proportion of the first number of image blocks to obtain masked image blocks and unmasked image blocks; and a first stitching subunit, used to stitch the unmasked image blocks to obtain the first image sequence.

[0008] As an optional example, the above processing unit includes: a second segmentation subunit, used to divide the text to be trained into equal blocks to obtain a second number of text blocks; a second processing subunit, used to perform the masking operation on a second proportion of the second number of text blocks to obtain masked text blocks and unmasked text blocks; and a second splicing subunit, used to splice the unmasked text blocks to obtain the first text sequence.

[0009] As an optional example, the training unit includes: a third splicing subunit for splicing the first image sequence and the first text sequence to obtain a first sequence; an input subunit for inputting the first sequence into the encoder module; an encoding subunit for obtaining encoded first image features and first text features through encoding; and a training subunit for training the encoder module using the first image features and the first text features to obtain the trained encoder module.

[0010] As an optional example, the training subunit is further configured to: concatenate the masked image block in the image to be trained with the first image feature to obtain a second image feature, wherein the masked image block is an image block on which the masking operation was performed, among the first number of image blocks obtained by dividing the image to be trained into equal blocks; concatenate the masked text block in the text to be trained with the first text feature to obtain a second text feature, wherein the masked text block is a text block on which the masking operation was performed, among the second number of text blocks obtained by dividing the text to be trained into equal blocks; and train the encoder module using the second image feature and the second text feature to obtain the trained encoder module.

[0011] As an optional example, the training subunit is further configured to: input the second image features and the second text features into the decoder module; obtain the decoded third image features and the third text features through decoding; reconstruct the image and the text based on the third image features and the third text features through the decoder module; calculate the loss by comparing the reconstructed image and the image to be trained, and the reconstructed text and the text to be trained; and backpropagate the loss to the encoder module to obtain the trained encoder module.

[0012] As an optional example, the above processing module includes: a classification unit, configured to classify the test image according to the target image features and classify the test text sequence according to the target text features; or to identify objects in the test image through the target image features and identify the content in the test text sequence according to the test text sequence.

[0013] According to another aspect of the present invention, a storage medium is also provided, wherein a computer program is stored in the storage medium, and the computer program is executed by a processor to perform the above-described self-supervised encoder training method.

[0014] According to another aspect of the present invention, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to execute the self-supervised encoder training method described above through the computer program.

[0015] In this embodiment of the invention, when the above-described self-supervised encoder training method is applied to computer vision using deep learning technology, the following steps are employed: acquiring a test image and a test text sequence; inputting the test image and the test text sequence into a trained encoder module; obtaining target image features and target text features through encoding; and using the target image features and target text features for downstream tasks. Since the above method obtains target image features and target text features by inputting the image and text sequence into a trained encoder module, the purpose of a multimodal autoencoder is achieved, thereby solving the technical problem that the encoder cannot automatically learn in multiple modalities.

Attached Figure Description

[0016] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:

[0017] Figure 1 is a flowchart of an optional self-supervised encoder training method according to an embodiment of the present invention;

[0018] Figure 2 is a model diagram of an optional self-supervised encoder training method according to an embodiment of the present invention;

[0019] Figure 3 is a schematic diagram of an optional self-supervised encoder training device according to an embodiment of the present invention;

[0020] Figure 4 is a schematic diagram of an optional electronic device according to an embodiment of the present invention.

Detailed Implementation

[0021] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0022] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0023] According to a first aspect of the present invention, a self-supervised encoder training method is provided. Optionally, as shown in Figure 1, the method includes:

[0024] S102, obtain the image to be tested and the text sequence to be tested;

[0025] S104, input the image to be tested and the text sequence to be tested into the trained encoder module;

[0026] S106, obtain target image features and target text features through encoding;

[0027] S108, use the target image features and target text features for downstream tasks.

[0028] Optionally, in this embodiment, the encoder module, also an autoencoder, employs self-supervised learning, using the same data for both the input and output layers of the three-layer neural network. The downstream task is a supervised learning task utilizing a pre-trained model or component.

[0029] Optionally, in this embodiment, the image to be tested and the text sequence to be tested are acquired and input into a trained encoder module for encoding. The output is the target image features corresponding to the image to be tested and the target text features corresponding to the text sequence to be tested. The target image features and target text features are used for downstream tasks. The image to be tested is classified according to the target image features, and the text sequence to be tested is classified according to the target text features. Alternatively, objects in the image to be tested can be identified by the target image features, and the content in the text sequence to be tested can be identified by the text sequence to be tested.

[0030] Optionally, in this embodiment, by inputting the image and text sequence into the trained encoder module, the target image features and target text features are obtained, thereby achieving the purpose of a multimodal autoencoder and solving the technical problem that the encoder cannot learn multimodally automatically.

[0031] As an optional example, inputting the image and text sequence to be tested into the trained encoder module includes:

[0032] Obtain the image and text to be trained;

[0033] A masking operation is performed on the training images to obtain the first image sequence, and a masking operation is performed on the training text to obtain the first text sequence;

[0034] The encoder module is trained using the first image sequence and the first text sequence to obtain the trained encoder module.

[0035] Optionally, in this embodiment, the mask is used to occlude (fully or partially) the image or text sequence to be processed using a selected image, graphic, or object, thereby controlling the area or process of image processing. The training image and training text are acquired, and random masking is performed on them to obtain a first image sequence corresponding to the training image and a first text sequence corresponding to the training text. The encoder module is then trained using the first image sequence and the first text sequence to obtain a trained encoder module.

[0036] As an optional example, performing a masking operation on the images to be trained to obtain the first image sequence includes:

[0037] The image to be trained is divided into equal blocks to obtain the first number of image blocks;

[0038] Perform a masking operation on a first proportion of image blocks from a first number of image blocks to obtain masked image blocks and unmasked image blocks;

[0039] The unmasked image blocks are stitched together to obtain the first image sequence.

[0040] Optionally, in this embodiment, the first quantity can be 6 or 9, and the first ratio can be 10% or 20%. The image to be trained is divided into N*N blocks. If N is 3, then the first quantity is 9 image blocks. If the first ratio is 20%, a masking operation is performed on 2 random blocks (rounded) of the 9 image blocks to obtain 2 masked image blocks and 7 unmasked image blocks.
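The block division and random masking in the example above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the image is represented as a square 2D list of pixel values, the helper names `split_image` and `mask_blocks` are hypothetical, and rounding uses a round-half-up convention, which is only one possible reading of "rounded".

```python
import random


def split_image(image, n):
    """Split a square 2D list into n*n equal blocks, in row-major order."""
    side = len(image)
    if side % n != 0 or any(len(row) != side for row in image):
        raise ValueError("image must be square with a side divisible by n")
    b = side // n  # side length of one block
    blocks = []
    for br in range(n):
        for bc in range(n):
            rows = image[br * b:(br + 1) * b]
            blocks.append([row[bc * b:(bc + 1) * b] for row in rows])
    return blocks


def mask_blocks(blocks, ratio, seed=None):
    """Randomly mask a proportion of blocks.

    Returns the sorted masked positions and the first image sequence
    (the unmasked blocks, kept in their original order).
    """
    num_masked = int(len(blocks) * ratio + 0.5)  # assumed: round half up
    rng = random.Random(seed)
    masked_positions = sorted(rng.sample(range(len(blocks)), num_masked))
    masked = set(masked_positions)
    first_image_sequence = [blk for i, blk in enumerate(blocks) if i not in masked]
    return masked_positions, first_image_sequence


# A 6x6 toy image divided into 3*3 = 9 blocks, with a 20% mask ratio.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
blocks = split_image(image, 3)
masked_positions, first_image_sequence = mask_blocks(blocks, 0.2, seed=0)
print(len(blocks), len(masked_positions), len(first_image_sequence))  # 9 2 7
```

Keeping the masked positions alongside the unmasked blocks matters later, because the masked blocks are re-inserted at those same positions before decoding.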

[0041] As an optional example, performing a masking operation on the training text to obtain the first text sequence includes:

[0042] The text to be trained is divided into equal blocks to obtain a second number of text blocks;

[0043] Perform a masking operation on the second proportion of text blocks in the second number of text blocks to obtain masked text blocks and unmasked text blocks;

[0044] The unmasked text blocks are concatenated to obtain the first text sequence.

[0045] Optionally, in this embodiment, the second quantity can be 8 or 10, and the second ratio can be 10% or 20%. The text to be trained is divided into N equal blocks. If N is 8, then 8 text blocks are obtained. If the second ratio is 20%, a masking operation is performed on 2 random text blocks (rounded to the nearest whole number) out of the 8 text blocks to obtain 2 masked text blocks and 6 unmasked text blocks.
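The text-side example above can be sketched in the same way. The sketch assumes the text has already been tokenized into a list whose length is divisible by the number of blocks; the source does not say how text is tokenized or how uneven lengths are handled, and the names `split_text` and `mask_text_blocks` are hypothetical.

```python
import random


def split_text(tokens, num_blocks):
    """Split a token list into num_blocks equal-sized blocks."""
    if num_blocks <= 0 or len(tokens) % num_blocks != 0:
        raise ValueError("token count must be divisible by num_blocks")
    size = len(tokens) // num_blocks
    return [tokens[i * size:(i + 1) * size] for i in range(num_blocks)]


def mask_text_blocks(blocks, ratio, seed=None):
    """Randomly mask a proportion of text blocks.

    Returns the sorted masked positions and the first text sequence
    (the unmasked blocks, kept in their original order).
    """
    num_masked = int(len(blocks) * ratio + 0.5)  # assumed: round half up
    rng = random.Random(seed)
    masked_positions = sorted(rng.sample(range(len(blocks)), num_masked))
    masked = set(masked_positions)
    first_text_sequence = [b for i, b in enumerate(blocks) if i not in masked]
    return masked_positions, first_text_sequence


# 16 placeholder tokens divided into 8 blocks, with a 20% mask ratio.
tokens = [f"tok{i}" for i in range(16)]
blocks = split_text(tokens, 8)
masked_positions, first_text_sequence = mask_text_blocks(blocks, 0.2, seed=0)
print(len(blocks), len(masked_positions), len(first_text_sequence))  # 8 2 6
```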

[0046] As an optional example, training the encoder module using the first image sequence and the first text sequence to obtain the trained encoder module includes:

[0047] The first image sequence and the first text sequence are concatenated to obtain the first sequence;

[0048] The first sequence is input into the encoder module;

[0049] Through encoding, the encoded first image features and first text features are obtained;

[0050] The encoder module is trained using the first image features and the first text features to obtain the trained encoder module.

[0051] Optionally, in this embodiment, the first image sequence and the first text sequence are concatenated together to form a first sequence, and the first sequence is input to the encoder module. The encoder encodes the first sequence to produce the first image features and the first text features, and the encoder module is trained using the first image features and the first text features to obtain the trained encoder module.
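The concatenation step can be illustrated with a short Python sketch. The source does not specify the encoder architecture, so `toy_encoder` below is a purely hypothetical stand-in that returns one feature vector per input element while preserving order; the point of the sketch is only the bookkeeping that lets the encoder output be split back into first image features and first text features.

```python
def build_first_sequence(first_image_sequence, first_text_sequence):
    """Concatenate the two sequences; also return where the text part starts."""
    first_sequence = list(first_image_sequence) + list(first_text_sequence)
    return first_sequence, len(first_image_sequence)


def toy_encoder(sequence):
    """Hypothetical placeholder encoder: one 2-value feature per element."""
    return [[float(pos), float(len(str(elem)))] for pos, elem in enumerate(sequence)]


def split_features(features, boundary):
    """Split encoder output into image features and text features."""
    return features[:boundary], features[boundary:]


# 7 unmasked image blocks and 6 unmasked text blocks, as in the examples above.
image_seq = [f"img_block_{i}" for i in (0, 1, 3, 4, 5, 7, 8)]
text_seq = [f"txt_block_{i}" for i in (0, 2, 3, 4, 6, 7)]

first_sequence, boundary = build_first_sequence(image_seq, text_seq)
features = toy_encoder(first_sequence)
first_image_features, first_text_features = split_features(features, boundary)
print(len(first_sequence), len(first_image_features), len(first_text_features))  # 13 7 6
```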

[0052] As an optional example, training the encoder module using the first image features and the first text features to obtain the trained encoder module includes:

[0053] The masked image block in the image to be trained is concatenated with the first image feature to obtain the second image feature. The masked image block is an image block on which the masking operation was performed, among the first number of image blocks obtained by dividing the image to be trained into equal blocks.

[0054] The masked text block in the text to be trained is concatenated with the first text feature to obtain the second text feature. The masked text block is a text block on which the masking operation was performed, among the second number of text blocks obtained by dividing the text to be trained into equal blocks.

[0055] The encoder module is trained using the second image features and the second text features to obtain the trained encoder module.

[0056] Optionally, in this embodiment, a mask image block and a first image feature are concatenated, with the concatenation position of the mask image block being the location in the image to be trained where the masking operation is performed, to obtain a second image feature. Similarly, a mask text block and a first text feature are concatenated, with the concatenation position of the mask text block being the location in the text to be trained where the masking operation is performed, to obtain a second text feature. The encoder module is then trained using the second image feature and the second text feature to obtain a trained encoder module.
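The position-preserving concatenation described above can be sketched as follows. The sketch assumes each masked block is represented by a single placeholder value (`"[MASK]"` here); the source does not state how a masked block is represented at this stage, and `restore_with_mask` is a hypothetical helper name.

```python
def restore_with_mask(features, masked_positions, total, mask_token):
    """Merge encoded features and mask placeholders back into original order.

    `features` holds one entry per unmasked block, in original order.
    `masked_positions` are the block indices that were masked.
    """
    masked = set(masked_positions)
    if len(features) + len(masked) != total:
        raise ValueError("features and masked positions do not add up to total")
    remaining = iter(features)
    return [mask_token if i in masked else next(remaining) for i in range(total)]


# 9 image blocks, of which blocks 2 and 6 were masked before encoding.
first_image_features = ["f0", "f1", "f3", "f4", "f5", "f7", "f8"]
second_image_features = restore_with_mask(first_image_features, [2, 6], 9, "[MASK]")
print(second_image_features)
# ['f0', 'f1', '[MASK]', 'f3', 'f4', 'f5', '[MASK]', 'f7', 'f8']
```

The same helper applies unchanged to the text side, with the second number of text blocks as `total`.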

[0057] As an optional example, training the encoder module using the second image features and the second text features to obtain the trained encoder module includes:

[0058] The second image feature and the second text feature are input into the decoder module;

[0059] Through decoding, the decoded third image features and third text features are obtained;

[0060] Based on the third image features and the third text features, reconstruction is performed through the decoder module to obtain the reconstructed image and reconstructed text.

[0061] The loss is calculated by comparing the reconstructed image with the image to be trained and the reconstructed text with the text to be trained.

[0062] The loss is backpropagated to the encoder module to obtain the trained encoder module.

[0063] Optionally, in this embodiment, the decoder module decodes the image or text encoded data obtained by encoding the image or text, generating image or text decoded data. Loss, also known as a loss function, is used in machine learning to estimate the degree of inconsistency between the model's predicted values and the true values; the smaller the loss function, the better the model performance. The second image features and the second text features are input into the decoder module, which decodes them to obtain the third image features and the third text features. The decoder automatically learns the third image features and the third text features. After successful learning, reconstruction is achieved, resulting in a reconstructed image and reconstructed text. By comparing the reconstructed image with the training image and the reconstructed text with the training text, the loss function is calculated, and the calculation result is backpropagated to the encoder module to obtain the trained encoder module.
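The source does not name a specific loss function. As one possible instance, the sketch below combines a mean squared error over pixel values with a cross-entropy over predicted token distributions. Both choices, the `text_weight` parameter, and all function names are assumptions made for illustration. Backpropagation itself is omitted; in practice it would be handled by an automatic differentiation framework.

```python
import math


def mse(reconstructed, target):
    """Mean squared error between two equal-length lists of pixel values."""
    if len(reconstructed) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - t) ** 2 for r, t in zip(reconstructed, target)) / len(target)


def cross_entropy(pred_probs, target_ids, eps=1e-12):
    """Mean negative log-likelihood of the target tokens.

    pred_probs[i] is a probability distribution over the vocabulary
    for position i; target_ids[i] is the true token id at position i.
    """
    if len(pred_probs) != len(target_ids) or not target_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.log(p[t] + eps) for p, t in zip(pred_probs, target_ids))
    return -total / len(target_ids)


def reconstruction_loss(recon_pixels, pixels, recon_probs, token_ids, text_weight=1.0):
    """Assumed combined loss: image MSE plus weighted text cross-entropy."""
    return mse(recon_pixels, pixels) + text_weight * cross_entropy(recon_probs, token_ids)


# Toy values: a 4-pixel image and a 2-token text over a 3-word vocabulary.
loss = reconstruction_loss(
    recon_pixels=[0.1, 0.4, 0.9, 0.5],
    pixels=[0.0, 0.5, 1.0, 0.5],
    recon_probs=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    token_ids=[0, 1],
)
print(loss > 0.0)  # True
```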

[0064] As an optional example, using target image features and target text features for downstream tasks includes:

[0065] Classify the test image based on the target image features, and classify the test text sequence based on the target text features; or

[0066] The system identifies objects in the test image based on the features of the target image and identifies the content in the test text sequence based on the test text sequence.

[0067] Optionally, in this embodiment, the test image is classified according to the target image features, and the test text sequence is classified according to the target text features. That is, a label is assigned to the test image and the test text from a given set of classifications. Alternatively, the objects in the test image are identified according to the target image features, and the content in the test text sequence is identified according to the test text sequence.

[0068] Optionally, as illustrated by an example, this invention relates to a self-supervised encoder training method for pre-training a bimodal self-supervised encoder of images and text. Text possesses high semantic information, and when trained simultaneously with the image autoencoder, it enhances the representational ability of the image model, enabling the encoder to encode both images and text, thus achieving multimodality. The model implementation process is shown in Figure 2.

[0069] A. Training Phase:

[0070] Step 1: Divide the image to be trained into N*N blocks and apply random masks according to a ratio, Ratio. A smaller ratio simplifies the reconstruction task, making it difficult for the encoder model to learn true semantic information; a larger ratio results in missing information and affects reconstruction. The optimal ratio should be determined by the encoder model, and the ratio can be adjusted according to the actual situation. This yields N*N*Ratio masked image blocks and N*N*(1-Ratio) unmasked image blocks. Concatenate the unmasked image blocks to form the first image sequence.

[0071] Step 2: Divide the training text into M equal blocks. Randomly mask the blocks according to the ratio Ratio. The number of masked text blocks is M*Ratio, and the number of unmasked text blocks is M*(1-Ratio). Concatenate the unmasked text blocks into the first text sequence.

[0072] Step 3: The first image sequence that was not masked and the first text sequence are concatenated together to form a sequence of length (M+N*N)*(1-Ratio), and then sent to the Encoder module;

[0073] Step 4: The encoder module outputs the encoded first image features and first text features. The masked image blocks and masked text blocks are then added back at the positions where the image and text were previously masked.

[0074] Step 5: Concatenating the encoded features with the masked image blocks and masked text blocks yields the second image features and the second text features.

[0075] Step 6: Input the second image feature and the second text feature into the Decoder module, output the decoded features, and obtain the third image feature and the third text feature;

[0076] Step 7: Reconstruct the image and text. Calculate the loss by comparing the reconstructed image with the training image, and the reconstructed text with the training text. Backpropagate to the Encoder module to update the model parameters.
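The sequence lengths implied by Steps 1 to 5 can be checked with a small calculation. The sketch below assumes that the masked counts are rounded separately for each modality using round-half-up, as in the earlier 9-block and 8-block examples; the function name is hypothetical. Under that assumption the encoder input length can differ slightly from the closed-form value (M+N*N)*(1-Ratio), which is exact only when both products are whole numbers.

```python
def sequence_lengths(n, m, ratio):
    """Lengths of the encoder and decoder inputs for N*N image and M text blocks."""
    image_total = n * n
    image_masked = int(image_total * ratio + 0.5)  # assumed: round half up
    text_masked = int(m * ratio + 0.5)
    return {
        # Step 3: only unmasked blocks are sent to the encoder.
        "encoder_input": (image_total - image_masked) + (m - text_masked),
        # Closed-form value given in Step 3, without rounding.
        "closed_form": (m + image_total) * (1 - ratio),
        # Steps 4-5: masked positions are restored before decoding.
        "decoder_input": image_total + m,
    }


lengths = sequence_lengths(n=3, m=8, ratio=0.2)
print(lengths["encoder_input"], lengths["decoder_input"])  # 13 17
```

With N = 3, M = 8 and Ratio = 20%, the encoder receives 7 + 6 = 13 blocks, while the closed-form expression gives 17 * 0.8 = 13.6. With N = 4, M = 10 and Ratio = 50%, both give 13.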

[0077] B. Downstream Tasks:

[0078] The original test image and test text sequence are input into the Encoder module to obtain the target image features and target text features. These features are then used in downstream tasks.

[0079] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, because according to the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to the present invention.

[0080] According to another aspect of the embodiments of this application, a self-supervised encoder training device is also provided. As shown in Figure 3, the device includes:

[0081] The acquisition module 302 is used to acquire the image to be tested and the text sequence to be tested;

[0082] The input module 304 is used to input the image to be tested and the text sequence to be tested into the trained encoder module;

[0083] Encoding module 306 is used to obtain target image features and target text features through encoding;

[0084] Processing module 308 is used to apply target image features and target text features to downstream tasks.

[0085] Optionally, in this embodiment, the encoder module, also an autoencoder, employs self-supervised learning, using the same data for both the input and output layers of the three-layer neural network. The downstream task is a supervised learning task utilizing a pre-trained model or component.

[0086] Optionally, in this embodiment, the image to be tested and the text sequence to be tested are acquired and input into a trained encoder module for encoding. The output is the target image features corresponding to the image to be tested and the target text features corresponding to the text sequence to be tested. The target image features and target text features are used for downstream tasks. The image to be tested is classified according to the target image features, and the text sequence to be tested is classified according to the target text features. Alternatively, objects in the image to be tested can be identified by the target image features, and the content in the text sequence to be tested can be identified by the text sequence to be tested.

[0087] Optionally, in this embodiment, by inputting the image and text sequence into the trained encoder module, the target image features and target text features are obtained, thereby achieving the purpose of a multimodal autoencoder and solving the technical problem that the encoder cannot learn multimodally automatically.

[0088] As an optional example, the input module includes:

[0089] The acquisition unit is used to acquire the image and text to be trained.

[0090] The processing unit is used to perform a masking operation on the image to be trained to obtain a first image sequence and to perform a masking operation on the text to be trained to obtain a first text sequence.

[0091] The training unit is used to train the encoder module using the first image sequence and the first text sequence to obtain the trained encoder module.

[0092] Optionally, in this embodiment, the mask is used to occlude (fully or partially) the image or text sequence to be processed using a selected image, graphic, or object, thereby controlling the area or process of image processing. The training image and training text are acquired, and random masking is performed on them to obtain a first image sequence corresponding to the training image and a first text sequence corresponding to the training text. The encoder module is then trained using the first image sequence and the first text sequence to obtain a trained encoder module.

[0093] As an optional example, the processing unit includes:

[0094] The first segmentation subunit is used to divide the image to be trained into equal blocks to obtain a first number of image blocks;

[0095] The first processing subunit is used to perform a masking operation on a first proportion of image blocks in a first number of image blocks to obtain masked image blocks and unmasked image blocks.

[0096] The first stitching subunit is used to stitch together unmasked image blocks to obtain the first image sequence.

[0097] Optionally, in this embodiment, the first quantity can be 6 or 9, and the first ratio can be 10% or 20%. The image to be trained is divided into N*N blocks. If N is 3, then the first quantity is 9 image blocks. If the first ratio is 20%, a masking operation is performed on 2 random blocks (rounded) of the 9 image blocks to obtain 2 masked image blocks and 7 unmasked image blocks.

[0098] As an optional example, the processing unit includes:

[0099] The second segmentation subunit is used to divide the text to be trained into equal blocks to obtain a second number of text blocks.

[0100] The second processing subunit is used to perform a masking operation on a second proportion of text blocks in a second number of text blocks to obtain masked text blocks and unmasked text blocks;

[0101] The second splicing subunit splices the unmasked text blocks to obtain the first text sequence.

[0102] Optionally, in this embodiment, the second quantity can be 8 or 10, and the second ratio can be 10% or 20%. The text to be trained is divided into N equal blocks. If N is 8, then 8 text blocks are obtained. If the second ratio is 20%, a masking operation is performed on 2 random text blocks (rounded to the nearest whole number) out of the 8 text blocks to obtain 2 masked text blocks and 6 unmasked text blocks.

[0103] As an optional example, the training unit includes:

[0104] The third splicing subunit is used to splice the first image sequence and the first text sequence to obtain the first sequence;

[0105] The input subunit is used to input the first sequence into the untrained encoder module;

[0106] The encoding subunit is used to obtain the encoded first image features and the first text features through encoding;

[0107] The training subunit is used to train the encoder module using the first image features and the first text features to obtain the trained encoder module.

[0108] Optionally, in this embodiment, the first image sequence and the first text sequence are concatenated together to form a first sequence, and the first sequence is input to the encoder module. The encoder encodes the first sequence to produce the first image features and the first text features, and the encoder module is trained using the first image features and the first text features to obtain the trained encoder module.

[0109] As an optional example, the training subunit is also used for:

[0110] The masked image block of the image to be trained is concatenated with the first image features to obtain the second image features. The masked image block is an image block, among the first number of image blocks obtained by dividing the image to be trained into equal blocks, on which the masking operation has been performed.

[0111] The masked text block of the text to be trained is concatenated with the first text features to obtain the second text features. The masked text block is a text block, among the second number of text blocks obtained by dividing the text to be trained into equal blocks, on which the masking operation has been performed.

[0112] The encoder module is trained using the second image features and the second text features to obtain the trained encoder module.

[0113] Optionally, in this embodiment, a mask image block and a first image feature are concatenated, with the concatenation position of the mask image block being the location in the image to be trained where the masking operation is performed, to obtain a second image feature. Similarly, a mask text block and a first text feature are concatenated, with the concatenation position of the mask text block being the location in the text to be trained where the masking operation is performed, to obtain a second text feature. The encoder module is then trained using the second image feature and the second text feature to obtain a trained encoder module.
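The position-aware concatenation can be sketched as follows. The example assumes that each masked block is represented by a placeholder vector (the hypothetical `mask_token`) inserted at the block's original index, while the encoded features of the unmasked blocks return to their own original indices. The feature values and the helper name `restore_with_mask_tokens` are illustrative assumptions.

```python
def restore_with_mask_tokens(features, unmasked_idx, total, mask_token):
    """Rebuild a full-length sequence: encoded features go back to their
    original positions, and masked positions receive the mask placeholder."""
    restored = [list(mask_token) for _ in range(total)]
    for feature, position in zip(features, unmasked_idx):
        restored[position] = feature
    return restored

# 9 image blocks in total; blocks 2 and 6 were masked before encoding.
unmasked_idx = [0, 1, 3, 4, 5, 7, 8]
first_image_features = [[0.1 * (i + 1)] for i in range(len(unmasked_idx))]
mask_token = [0.0]  # hypothetical placeholder for a masked block

second_image_features = restore_with_mask_tokens(
    first_image_features, unmasked_idx, total=9, mask_token=mask_token
)
print(len(second_image_features))
```

The same function applies unchanged to the text side, using the indices of the unmasked text blocks and the first text features.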

[0114] As an optional example, the training subunit is also used for:

[0115] The second image feature and the second text feature are input into the decoder module;

[0116] Through decoding, the decoded third image features and third text features are obtained;

[0117] Based on the third image features and the third text features, reconstruction is performed through the decoder module to obtain the reconstructed image and reconstructed text.

[0118] The loss is calculated by comparing the reconstructed image with the image to be trained and the reconstructed text with the text to be trained.

[0119] The loss is backpropagated to the encoder module to obtain the trained encoder module.

[0120] Optionally, in this embodiment, the decoder module decodes the encoded image or text data produced by the encoder to generate decoded image or text data. The loss, computed by a loss function, is used in machine learning to estimate the degree of inconsistency between the model's predicted values and the true values; in general, the smaller the loss, the better the model performance. The second image features and the second text features are input into the decoder module, which decodes them to obtain the third image features and the third text features. The decoder module learns from the third image features and the third text features and then performs reconstruction, resulting in a reconstructed image and a reconstructed text. The loss is calculated by comparing the reconstructed image with the image to be trained and the reconstructed text with the text to be trained, and the loss is backpropagated to the encoder module to obtain the trained encoder module.
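The reconstruction loss and its backpropagation can be illustrated with a deliberately tiny sketch. This application does not name a specific loss function or network, so the example assumes a mean squared error for both modalities, numeric vectors for both the image and the text, and a one-weight "encoder" and one-weight "decoder" whose gradients are derived by hand. All names and values are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between a reconstruction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def reconstruct(w_enc, w_dec, x):
    """Encode then decode each value with scalar weights."""
    return [w_dec * (w_enc * v) for v in x]

def total_loss(w_enc, w_dec, image, text):
    """Sum of the image and text reconstruction losses."""
    return (mse(reconstruct(w_enc, w_dec, image), image)
            + mse(reconstruct(w_enc, w_dec, text), text))

def gradients(w_enc, w_dec, image, text):
    """Hand-derived gradients of total_loss w.r.t. both weights."""
    g_enc = g_dec = 0.0
    for x in (image, text):
        for v in x:
            err = w_dec * w_enc * v - v
            g_dec += 2 * err * w_enc * v / len(x)
            g_enc += 2 * err * w_dec * v / len(x)  # reaches the encoder weight
    return g_enc, g_dec

image = [0.2, 0.5, 0.9, 0.4]   # stand-in for the image to be trained
text = [1.0, 0.0, 1.0]         # stand-in for the text to be trained
w_enc, w_dec, lr = 0.5, 0.5, 0.1

initial_loss = total_loss(w_enc, w_dec, image, text)
for _ in range(200):
    g_enc, g_dec = gradients(w_enc, w_dec, image, text)
    w_enc -= lr * g_enc
    w_dec -= lr * g_dec
final_loss = total_loss(w_enc, w_dec, image, text)
print(initial_loss > final_loss)
```

The point of the sketch is only that the loss computed on the decoder's output produces a gradient for the encoder weight as well, so the encoder is updated by the reconstruction error. A practical system would rely on an automatic differentiation framework rather than hand-written gradients.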

[0121] As an optional example, the processing module includes:

[0122] The classification unit is used to classify the test image based on the target image features and to classify the test text sequence based on the target text features.

[0123] Alternatively, objects in the test image can be identified by the features of the target image, and the content in the test text sequence can be identified based on the test text sequence.

[0124] Optionally, in this embodiment, the test image is classified according to the target image features, and the test text sequence is classified according to the target text features. That is, a label is assigned to the test image and the test text from a given set of classifications. Alternatively, the objects in the test image are identified according to the target image features, and the content in the test text sequence is identified according to the test text sequence.
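A downstream classification step of this kind can be sketched with a nearest-centroid rule. This application does not specify the classifier, so the rule, the label sets, and the feature values below are illustrative assumptions only.

```python
def nearest_label(feature, centroids):
    """Return the label whose centroid is closest to the feature vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(feature, centroids[label]))

# Hypothetical per-class centroids in the encoder's feature space.
image_centroids = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
text_centroids = {"question": [1.0, 1.0], "statement": [-1.0, -1.0]}

# Hypothetical target features produced by the trained encoder module.
target_image_features = [0.9, 0.2]
target_text_features = [-0.7, -1.1]

image_label = nearest_label(target_image_features, image_centroids)
text_label = nearest_label(target_text_features, text_centroids)
print(image_label, text_label)
```

Each input is assigned one label from a given set of classes, which corresponds to the classification described in this paragraph.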

[0125] For other examples of this embodiment, please refer to the examples above, which will not be repeated here.

[0126] Figure 4 is a structural block diagram of an optional electronic device according to an embodiment of this application. As shown in Figure 4, the electronic device includes a processor 402, a communication interface 404, a memory 406, and a communication bus 408. The processor 402, communication interface 404, and memory 406 communicate with each other via the communication bus 408.

[0127] Memory 406 is used to store computer programs;

[0128] When processor 402 executes a computer program stored in memory 406, it performs the following steps:

[0129] Obtain the image and text sequence to be tested;

[0130] The image and text sequence to be tested are input into the trained encoder module;

[0131] Target image features and target text features are obtained through encoding;

[0132] The target image features and target text features are used for downstream tasks.

[0133] Optionally, in this embodiment, the communication bus can be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. This communication bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown in Figure 4 as a single thick line, but this does not indicate that there is only one bus or one type of bus. The communication interface is used for communication between the aforementioned electronic device and other devices.

[0134] The memory may include RAM, or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0135] As an example, the memory 406 described above may include, but is not limited to, the acquisition module 302, input module 304, encoding module 306, and processing module 308 from the self-supervised encoder training device described above. Furthermore, it may include, but is not limited to, other module units from the aforementioned processing device, which will not be elaborated upon in this example.

[0136] The processors mentioned above can be general-purpose processors, including but not limited to: CPU (Central Processing Unit), NP (Network Processor), etc.; they can also be DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), FPGA (Field-Programmable Gate Array), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0137] Optionally, specific examples in this embodiment can refer to the examples described in the above embodiments, and will not be repeated here.

[0138] Those skilled in the art will understand that the structure shown in Figure 4 is for illustrative purposes only. The device that implements the above self-supervised encoder training method can be a terminal device, such as a smartphone (e.g., Android phone, iOS phone), tablet computer, PDA, mobile Internet device (MID), PAD, etc. Figure 4 does not limit the structure of the aforementioned electronic device. For example, the electronic device may include more or fewer components than those shown in Figure 4 (such as network interfaces, display devices, etc.), or have a configuration different from that shown in Figure 4.

[0139] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, ROM, RAM, disk or optical disk, etc.

[0140] According to another aspect of the present invention, a computer-readable storage medium is also provided, wherein a computer program is stored therein, wherein the computer program is executed by a processor to perform the steps in the above-described self-supervised encoder training method.

[0141] Optionally, in this embodiment, those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing the hardware related to the terminal device. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.

[0142] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0143] If the integrated units in the above embodiments are implemented as software functional units and sold or used as independent products, they can be stored in the aforementioned computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause one or more computer devices (which may be personal computers, servers, or network devices, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention.

[0144] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0145] In the several embodiments provided in this application, it should be understood that the disclosed client can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.

[0146] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0147] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0148] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A self-supervised encoder training method, characterized in that it comprises: obtaining an image to be tested and a text sequence to be tested; inputting the image to be tested and the text sequence to be tested into a trained encoder module; obtaining target image features and target text features through encoding; and using the target image features and the target text features for downstream tasks;
wherein the step of inputting the image to be tested and the text sequence to be tested into the trained encoder module comprises: acquiring an image to be trained and a text to be trained; performing a masking operation on the image to be trained to obtain a first image sequence, and performing a masking operation on the text to be trained to obtain a first text sequence; and training the encoder module using the first image sequence and the first text sequence to obtain the trained encoder module;
wherein the step of training the encoder module using the first image sequence and the first text sequence to obtain the trained encoder module comprises: concatenating the first image sequence and the first text sequence to obtain a first sequence; inputting the first sequence into the encoder module; obtaining encoded first image features and first text features through encoding; and training the encoder module using the first image features and the first text features to obtain the trained encoder module;
wherein the step of training the encoder module using the first image features and the first text features to obtain the trained encoder module comprises: concatenating a masked image block of the image to be trained with the first image features to obtain second image features, wherein the masked image block is an image block, among a first number of image blocks obtained by dividing the image to be trained into equal blocks, on which the masking operation has been performed; concatenating a masked text block of the text to be trained with the first text features to obtain second text features, wherein the masked text block is a text block, among a second number of text blocks obtained by dividing the text to be trained into equal blocks, on which the masking operation has been performed; and training the encoder module using the second image features and the second text features to obtain the trained encoder module.

2. The method according to claim 1, characterized in that the step of performing a masking operation on the image to be trained to obtain the first image sequence comprises: dividing the image to be trained into equal blocks to obtain a first number of image blocks; performing the masking operation on a first proportion of the first number of image blocks to obtain masked image blocks and unmasked image blocks; and splicing the unmasked image blocks to obtain the first image sequence.

3. The method according to claim 1, characterized in that the step of performing a masking operation on the text to be trained to obtain the first text sequence comprises: dividing the text to be trained into equal blocks to obtain a second number of text blocks; performing the masking operation on a second proportion of the second number of text blocks to obtain masked text blocks and unmasked text blocks; and splicing the unmasked text blocks to obtain the first text sequence.

4. The method according to claim 1, characterized in that the step of training the encoder module using the second image features and the second text features to obtain the trained encoder module comprises: inputting the second image features and the second text features into a decoder module; obtaining decoded third image features and third text features through decoding; performing reconstruction through the decoder module based on the third image features and the third text features to obtain a reconstructed image and a reconstructed text; calculating a loss by comparing the reconstructed image with the image to be trained and the reconstructed text with the text to be trained; and backpropagating the loss to the encoder module to obtain the trained encoder module.

5. The method according to claim 1, characterized in that the step of using the target image features and the target text features for downstream tasks comprises: classifying the image to be tested according to the target image features, and classifying the text sequence to be tested according to the target text features; or identifying objects in the image to be tested according to the target image features, and identifying the content in the text sequence to be tested based on the text sequence to be tested.

6. A self-supervised encoder training device, characterized in that it comprises: an acquisition module, used to acquire an image to be tested and a text sequence to be tested; an input module, used to input the image to be tested and the text sequence to be tested into a trained encoder module; an encoding module, used to obtain target image features and target text features through encoding; and a processing module, used to apply the target image features and the target text features to downstream tasks;
wherein the step of inputting the image to be tested and the text sequence to be tested into the trained encoder module comprises: acquiring an image to be trained and a text to be trained; performing a masking operation on the image to be trained to obtain a first image sequence, and performing a masking operation on the text to be trained to obtain a first text sequence; and training the encoder module using the first image sequence and the first text sequence to obtain the trained encoder module;
wherein the step of training the encoder module using the first image sequence and the first text sequence to obtain the trained encoder module comprises: concatenating the first image sequence and the first text sequence to obtain a first sequence; inputting the first sequence into the encoder module; obtaining encoded first image features and first text features through encoding; and training the encoder module using the first image features and the first text features to obtain the trained encoder module;
wherein the step of training the encoder module using the first image features and the first text features to obtain the trained encoder module comprises: concatenating a masked image block of the image to be trained with the first image features to obtain second image features, wherein the masked image block is an image block, among a first number of image blocks obtained by dividing the image to be trained into equal blocks, on which the masking operation has been performed; concatenating a masked text block of the text to be trained with the first text features to obtain second text features, wherein the masked text block is a text block, among a second number of text blocks obtained by dividing the text to be trained into equal blocks, on which the masking operation has been performed; and training the encoder module using the second image features and the second text features to obtain the trained encoder module.

7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, performs the method described in any one of claims 1 to 5.

8. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is configured to execute the method described in any one of claims 1 to 5 through the computer program.