Mask image modeling self-supervised learning method and device based on unified information flow

By using a unified information flow-based method to segment images and construct fractal sequences, the problem of not considering the image block feature dependencies and two-dimensional characteristics in existing technologies is solved, resulting in more accurate image representation and better model applicability.

CN119206427B · Active · Publication Date: 2026-06-12 · INST OF AUTOMATION CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INST OF AUTOMATION CHINESE ACAD OF SCI
Filing Date
2024-08-08
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing masked image modeling methods fail to effectively address the dependencies between image patch features and the two-dimensional characteristics of image modalities during serialized feature construction, resulting in decreased representation learning ability and poor versatility.

Method used

A unified information flow-based approach is adopted, which divides the image into blocks using an image block encoder, decouples the information flow using a feature encoder, constructs fractal sequences by combining the fractal space filling curve paradigm, and finally performs feature decoding and loss value construction using a feature decoder.

Benefits of technology

It improves the accuracy of image representation and the versatility of the model, enabling more accurate feature extraction in various two-dimensional image scenes and enhancing the model's generalization performance.



Abstract

The application provides a mask image modeling self-supervised learning method and device based on unified information flow. The method comprises the following steps: performing image blocking on a to-be-processed image of a self-supervised learning task to obtain equally divided pixel blocks; calling a feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain deep representation information; constructing, based on a fractal space-filling curve paradigm, a fractal sequence from the deep representation information to obtain fractal sequence representation information, and calling a feature decoder to obtain a feature prediction value; and performing linear mapping processing on the feature prediction value to obtain a target prediction result of the self-supervised learning task, and constructing a self-supervised learning loss value for training the self-supervised learning model. The application thereby solves the problem that the prior art ignores the dependency relationship between image block features and the two-dimensional characteristics of the image modality when constructing image features, which degrades image representation learning in mask image modeling and results in poor universality.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a self-supervised learning method and apparatus for mask image modeling based on a unified information flow.

Background Technology

[0002] Masked image modeling, as a generative visual self-supervised learning task, has achieved significant academic progress and widespread applications due to its powerful representation learning capabilities. It typically relies on autoregressive models or autoencoders to reconstruct corrupted images without requiring any label information to complete the pre-training task.

[0003] Existing masked image modeling techniques often employ autoencoders and autoregressive models to model the data. Autoencoders typically treat all masked image patches equally and complete predictions at the same time step, employing an architecture with both bidirectional and symmetrical information flow. While this provides a degree of simplicity in prediction, it fails to consider the dependencies between image patch features. Autoregressive models, although taking into account the sequential nature of images by serializing image features to represent dependencies between them, still use a sequential scanning method (from left to right and then from top to bottom) similar to natural language for image modality. Although a random serialization method has been proposed for sequential scanning, neither the natural language-like sequential scanning method nor the random serialization method considers the two-dimensional characteristics of the image modality. This leads to a decline in image representation learning ability in masked image modeling and limits its applicability to specific target task scenarios, resulting in poor generality.

Summary of the Invention

[0004] This invention provides a self-supervised learning method and apparatus for mask image modeling based on a unified information flow, which addresses the technical problems in existing technologies that fail to consider the dependencies between image patch features and lack consideration for the two-dimensional characteristics of image modalities when constructing serialized features for mask image modeling. These technologies still employ natural language sequence scanning or random serialization methods, resulting in decreased image representation learning ability and poor versatility in mask image modeling.

[0005] This invention provides a self-supervised learning method for masked image modeling based on a unified information flow, applied to a self-supervised learning model. The self-supervised learning model includes an image patch encoder, a feature encoder, and a feature decoder. The method includes the following steps:

[0006] The image to be processed for the self-supervised learning task is obtained, and the image block encoder is called to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks.

[0007] The feature encoder is invoked to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain the deep representation information of the image to be processed.

[0008] Based on the fractal space filling curve paradigm, fractal sequences are constructed on the deep representation information to obtain fractal sequence representation information. The feature decoder is then called to perform feature decoding processing on the fractal sequence representation information to obtain the feature prediction value of the self-supervised learning task.

[0009] The feature prediction values are linearly mapped to obtain the target prediction result of the self-supervised learning task, and a self-supervised learning loss value is constructed based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model.

[0010] In some embodiments, the step of calling the image block encoder to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks includes:

[0011] The image to be processed is subjected to embedding processing to obtain a token embedding sequence;

[0012] The binarized mask sequence corresponding to the image to be processed is obtained, and image masking is performed on the token embedding sequence using the binarized mask sequence to obtain visible pixel blocks and mask pixel blocks, which serve as the multiple equally divided pixel blocks of the image to be processed.

[0013] In some embodiments, the information flow includes an ontology-aware information flow and an external-aware information flow;

[0014] The visible pixel blocks in the equally divided pixel blocks have combined features, which include image pixel features and pixel position features. The mask pixel blocks in the equally divided pixel blocks have mask position features.

[0015] The ontology-aware information flow includes a first mapping relationship between combined features or between mask position features;

[0016] The external-aware information flow includes a second mapping relationship between combined features and mask position features, or between mask position features and combined features.

[0017] In some embodiments, the feature encoder is invoked to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain deep representation information of the image to be processed, including:

[0018] When the equally divided pixel blocks have a first mapping relationship, the attention network of the feature encoder is invoked to process the equally divided pixel blocks using a self-attention mechanism to obtain attention output information;

[0019] When the equally divided pixel blocks have a second mapping relationship, the attention network of the feature encoder is invoked to perform cross-attention mechanism processing on the equally divided pixel blocks to obtain attention output information;

[0020] The attention output information is linearly mapped through the feedforward network in the feature encoder to obtain the deep representation information of the image to be processed.

[0021] In some embodiments, the fractal space-filling curve paradigm includes the Hilbert curve paradigm or the Z-order curve paradigm. The process of constructing a fractal sequence from the deep representation information based on the fractal space-filling curve paradigm to obtain fractal sequence representation information includes:

[0022] The deep representation information is scanned using the Hilbert curve paradigm, and the feature fractal sequences obtained from the sequence scan are spliced together to obtain the fractal sequence representation information. Alternatively, the deep representation information is scanned using the Z-order curve paradigm, and the feature fractal sequences obtained from the sequence scan are spliced together to obtain the fractal sequence representation information.

[0023] In some embodiments, the image to be processed has a ground truth label, which is constructed from the HOG feature map generated from the image to be processed. The step of constructing a self-supervised learning loss value based on the target prediction result includes:

[0024] Determine the difference between the true label and the target prediction result;

[0025] An L2 norm loss function is constructed based on the difference, and the L2 norm loss function is used as the self-supervised learning loss value.
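The loss construction in this embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name l2_loss is hypothetical, the HOG target values are placeholders rather than a computed HOG feature map, and averaging the squared differences over all elements is an assumption, since the text only states that an L2 norm loss is built from the difference.

```python
def l2_loss(prediction, target):
    """Mean of squared differences between a prediction and its target.

    Both arguments are flat lists of floats of equal length. Averaging over
    elements is an assumption; the source only names an L2 norm loss.
    """
    if len(prediction) != len(target):
        raise ValueError("prediction and target must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(prediction, target)]
    return sum(squared) / len(squared)


# Placeholder values standing in for a HOG-based ground-truth label.
hog_target = [1.0, 4.0]
model_output = [1.0, 2.0]
loss = l2_loss(model_output, hog_target)
print(loss)
```

In a real training loop the target would come from a HOG descriptor of the input image, and the loss would be minimized with respect to the model parameters.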

[0026] The present invention also provides a self-supervised learning device for mask image modeling based on a unified information flow, the device comprising the following modules:

[0027] The acquisition module is used to acquire the image to be processed for the self-supervised learning task, and call the image block encoder to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks;

[0028] The decoupling module is used to call the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain the deep representation information of the image to be processed.

[0029] The module is used to construct fractal sequences from the deep representation information based on the fractal space filling curve paradigm, obtain fractal sequence representation information, and call the feature decoder to perform feature decoding processing on the fractal sequence representation information to obtain the feature prediction value of the self-supervised learning task.

[0030] The training module is used to perform linear mapping processing on the feature prediction values to obtain the target prediction result of the self-supervised learning task, and to construct a self-supervised learning loss value based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model.

[0031] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the self-supervised learning method for mask image modeling based on unified information flow as described above.

[0032] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the self-supervised learning method for mask image modeling based on a unified information flow as described above.

[0033] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the self-supervised learning method for mask image modeling based on unified information flow as described above.

[0034] This invention provides a self-supervised learning method and apparatus for masked image modeling based on unified information flow. The method divides the image to be processed for the self-supervised learning task into multiple equally divided pixel blocks. Then, a feature encoder is invoked to decouple these equally divided pixel blocks based on information flow, obtaining deep representation information. This extraction of image features, based on the decoupling of information flow between equally divided pixel blocks, is more accurate than existing technologies that directly extract image features at the same time step without considering the dependencies between image blocks. Next, a fractal space-filling curve paradigm is used to construct a fractal sequence from the extracted deep representation information, resulting in fractal sequence representation information for feature decoding to obtain feature prediction values. Therefore, before feature decoding, a fractal space-filling curve paradigm is designed to construct fractal sequences of representation information with different information flows. Compared to existing technologies that use natural language scanning sequences or random serialization for feature sequence construction, this approach fully considers the two-dimensional characteristics between image blocks, further improving the accuracy of image representation. Furthermore, the fractal space-filling curve paradigm can be applied to various two-dimensional image scenarios, possessing a certain degree of versatility.

Description of the Drawings

[0035] To more clearly illustrate the technical solutions in this invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced one by one below. Obviously, the accompanying drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0036] Figure 1 is a flowchart illustrating the self-supervised learning method for mask image modeling based on unified information flow provided by the present invention.

[0037] Figure 2 is a schematic diagram of attention calculation based on information flow provided by the present invention.

[0038] Figure 3 is a schematic diagram of the autoencoding paradigm provided in the prior art.

[0039] Figure 4 is a schematic diagram of the autoregressive paradigm provided in the prior art.

[0040] Figure 5 is a schematic diagram of sequence processing using various autoregressive paradigms provided by the present invention.

[0041] Figure 6 is a schematic diagram of attention visualization analysis on images of different test datasets provided by the present invention.

[0042] Figure 7 is a schematic diagram of the structure of the self-supervised learning device for mask image modeling based on unified information flow provided by the present invention.

[0043] Figure 8 is a schematic diagram of the physical structure of an electronic device provided by the present invention.

Detailed Implementation

[0044] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0045] The self-supervised learning method for mask image modeling based on a unified information flow provided by this invention is described below with reference to Figure 1. Figure 1 is a flowchart of the method; as shown in Figure 1, the method includes the following steps.

[0046] Step 101: Obtain the image to be processed for the self-supervised learning task, and call the image block encoder to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks.

[0047] Firstly, the self-supervised learning method for masked image modeling based on unified information flow provided in this invention is applied to self-supervised learning models. Self-supervised learning models are generally trained through self-supervised learning tasks, such as image object detection (recognition), image semantic segmentation, and image (feature) description. The architecture of the self-supervised learning model specifically includes an image patch encoder, a feature encoder, and a feature decoder.

[0048] First, based on the self-supervised learning task, the image to be processed is obtained and serves as the input image of the self-supervised learning model, denoted as $x \in \mathbb{R}^{H \times W \times C}$, where H, W, and C are the height, width, and number of channels of the input image, respectively, and $\mathbb{R}$ is the set of real numbers. Then, the image block encoder in the self-supervised learning model is called to perform image block processing on the image to be processed, obtaining multiple equally divided pixel blocks. Here, the image block encoder is a convolutional neural network with a convolution kernel size of 16 and a stride of 16. The images to be processed are fed in batches into the convolutional layer of the convolutional neural network, where they undergo embedding processing to obtain a token embedding sequence of N tokens, where P is the side length of an image patch and $N = HW / P^2$ is the number of equally divided pixel blocks.
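The block partition in this step can be illustrated with a short sketch. It only shows the geometry (non-overlapping blocks of side length P, giving N = HW / P^2 blocks); the learned convolutional projection of the image block encoder is omitted, and the function name split_into_patches and the toy image are assumptions for illustration.

```python
def split_into_patches(image, patch=16):
    """Split an H x W x C nested-list image into non-overlapping square blocks.

    Returns N = (H // patch) * (W // patch) flattened blocks in row-major
    order, each holding patch * patch * C values. A convolution whose kernel
    size equals its stride visits exactly these blocks.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be multiples of the patch side length")
    blocks = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    block.extend(image[r][c])  # append the C channel values
            blocks.append(block)
    return blocks


# Toy 32 x 48 image with 3 channels (values are arbitrary).
H, W, C, P = 32, 48, 3, 16
image = [[[float(r + c + k) for k in range(C)] for c in range(W)] for r in range(H)]
blocks = split_into_patches(image, P)
print(len(blocks), len(blocks[0]))  # 6 768
```

With H = 32, W = 48, and P = 16 the sketch yields N = 6 blocks of 16 * 16 * 3 = 768 values each, matching N = HW / P^2.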

[0049] Next, the binarized mask sequence corresponding to the image to be processed is obtained and applied to the token embedding sequence to perform image masking. This yields visible pixel blocks and mask pixel blocks, which together serve as the multiple equally divided pixel blocks of the image to be processed. These equally divided pixel blocks constitute the feature map of the image to be processed. Furthermore, during image masking, a special [MASK] placeholder token can be determined.
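The masking step can be sketched as below. The helper names, the mask ratio of 0.75, and the fixed seed are assumptions for illustration; the source does not state how the binarized mask sequence is generated or what ratio is used.

```python
import random


def random_binary_mask(n_blocks, ratio, seed=0):
    """Return a list of 0/1 flags; 1 marks a masked block (ratio is assumed)."""
    rng = random.Random(seed)
    masked = set(rng.sample(range(n_blocks), round(n_blocks * ratio)))
    return [1 if i in masked else 0 for i in range(n_blocks)]


def apply_binary_mask(tokens, mask):
    """Split tokens into visible (index, token) pairs and masked positions.

    Visible blocks keep both content and position; masked blocks keep only
    their position, as described for the two kinds of blocks.
    """
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")
    visible = [(i, tok) for i, (tok, m) in enumerate(zip(tokens, mask)) if m == 0]
    masked_positions = [i for i, m in enumerate(mask) if m == 1]
    return visible, masked_positions


tokens = [[float(i)] for i in range(64)]  # placeholder token embeddings
mask = random_binary_mask(len(tokens), ratio=0.75)
visible, masked_positions = apply_binary_mask(tokens, mask)
print(len(visible), len(masked_positions))  # 16 48
```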

[0050] Step 102: Call the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain the deep representation information of the image to be processed.

[0051] After determining multiple equally divided pixel blocks, the feature encoder is then called to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain the deep representation information of the image to be processed.

[0052] Before extracting deep representation information, we need to consider the masking operation. Each equally divided pixel block has two types of metadata: pixel and position. Therefore, when the equally divided pixel block is a visible pixel block (the unmasked pixel block), the visible pixel block has combined features, which include image pixel features and pixel position features, meaning that the visible pixel block has two types of metadata. Correspondingly, when the equally divided pixel block is a masked pixel block (the masked pixel block), the masked pixel block only has mask position features, meaning that the masked pixel block has only one type of metadata.

[0053] Based on this, in this embodiment of the invention, the feature dependencies between equally divided pixel blocks are decoupled into four types of information flows, namely, ontology-aware information flows and external-aware information flows. The ontology-aware information flows are further divided into two types: the first mapping relationship between the combined features of visible pixel blocks, or the first mapping relationship between the mask position features of mask pixel blocks. This first mapping relationship represents the information flow between two adjacent visible pixel blocks or two adjacent mask pixel blocks. Therefore, the information flow between visible pixel blocks is a bidirectional semantic and positional information fusion flow, while the information flow between mask pixel blocks is only a bidirectional positional information fusion flow.

[0054] The external-aware information flow is also divided into two types: a second mapping relationship between the combined features of visible pixel blocks and the mask position features of mask pixel blocks, or a second mapping relationship between the mask position features of mask pixel blocks and the combined features of visible pixel blocks. This second mapping relationship represents the information flow between visible pixel blocks and adjacent mask pixel blocks, or between mask pixel blocks and adjacent visible pixel blocks. Therefore, the information flow from visible pixel blocks to mask pixel blocks is a unidirectional flow of semantic and positional information, while the information flow from mask pixel blocks to visible pixel blocks is a unidirectional flow of positional information.

[0055] Based on the decoupled information flows, the feature encoder in the self-supervised learning model is then invoked to extract deep representation information. The feature encoder (e.g., a Transformer) includes an attention network and a feedforward network (e.g., a multilayer perceptron). The attention network includes 12 decoupled fractal attention modules, and the feedforward network includes 12 feedforward modules; these 24 modules are stacked alternately to form the feature encoder. This invention applies a context-query-based dual-stream modeling approach to the equally divided pixel blocks; feature extraction is explained in detail below.

[0056] When the equally divided pixel blocks have a first mapping relationship, the attention network of the feature encoder is invoked to process the equally divided pixel blocks using a self-attention mechanism to obtain attention output information.

[0057] Here, when the equally divided pixel blocks have a first mapping relationship, the feature dependency between them has been decoupled into the ontology-aware information flow (i.e., the bidirectional semantic and positional information fusion flow between visible pixel blocks, and the bidirectional positional information fusion flow between mask pixel blocks). Therefore, self-attention computation is performed on the equally divided pixel blocks through the attention network of the feature encoder. Figure 2 is a schematic diagram of attention calculation based on information flow provided by the present invention. For the information flow that fuses bidirectional semantic and positional information between visible pixel blocks, the self-attention calculation is shown in Figure 2(1): the query vector, key vector, and value vector are calculated from the image pixel features and the [MASK] mask token (i.e., the special [MASK] placeholder), and self-attention is then computed to obtain the attention output information.

[0058] For the information flow that fuses only bidirectional positional information between mask pixel blocks, the self-attention calculation is shown in Figure 2(2): the query vector, key vector, and value vector are calculated from the positional features of the equally divided pixel blocks (including pixel position features and mask position features), and self-attention is then computed to obtain the attention output information.

[0059] During self-attention computation, the input visible pixel blocks and mask pixel blocks are linearly projected, through the transformation function of the linear layer in the attention network, onto three high-dimensional vector subspaces to obtain the corresponding query vector ($Q$), key vector ($K$), and value vector ($V$). Attention calculation is then performed to obtain the attention output information, as given in formula (1):

[0060] $$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i \cdot K_i^{T}}{\sqrt{d_k}}\right) \cdot V_i \qquad (1)$$

[0061] In formula (1), $\cdot$ denotes feature vector multiplication, $T$ denotes the matrix transpose, $\sqrt{d_k}$ is the feature modulation coefficient, $d_k$ is the feature vector dimension, and $i$ indexes the $i$-th equally divided pixel block.
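The attention calculation described here can be sketched in plain Python, assuming it is the standard scaled dot-product attention in which the feature modulation coefficient is the square root of the feature vector dimension. The function names are hypothetical and the linear projections that produce the query, key, and value vectors are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))  # assumed feature modulation coefficient
    outputs = []
    for q in queries:
        # Similarity of this query to every key, divided by the scale.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean value.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
print(result)
```

Passing the same sequence as queries, keys, and values gives self-attention; passing different sequences gives cross-attention.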

[0062] When the equally divided pixel blocks have a second mapping relationship, the attention network of the feature encoder is invoked to process the equally divided pixel blocks using a cross-attention mechanism to obtain attention output information. Here, a second mapping relationship indicates that the feature dependency between the equally divided pixel blocks has been decoupled into the external-aware information flow (i.e., the unidirectional propagation of semantic and positional information from visible pixel blocks to mask pixel blocks, and the unidirectional propagation of positional information from mask pixel blocks to visible pixel blocks). Therefore, the attention network of the feature encoder performs cross-attention calculation on the equally divided pixel blocks. For the information flow that propagates semantic and positional information unidirectionally from visible pixel blocks to mask pixel blocks, the cross-attention calculation is shown in Figure 2(3): the query vector is calculated from the image pixel features and the [MASK] mask token, the key vector and value vector are calculated from the positional features of the equally divided pixel blocks (including pixel position features and mask position features), and attention is then computed to obtain the attention output information. This realizes inference over the information flow from combined features to mask position features, achieving unidirectional propagation of semantic and positional information.

[0063] For the information flow that propagates positional information unidirectionally from mask pixel blocks to visible pixel blocks, the cross-attention calculation is shown in Figure 2(4): the query vector is calculated from the positional features of the equally divided pixel blocks (including pixel position features and mask position features), the key vector and value vector are calculated from the image pixel features and the [MASK] mask token, and attention is then computed to obtain the attention output information. This realizes inference over the information flow from mask position features to combined features, achieving unidirectional propagation of positional information.
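The four cases of Figure 2 differ only in which stream supplies the query and which supplies the key and value. The routing can be summarized in a small sketch; all names here (FLOW_SOURCES, the flow labels, "content", "position") are hypothetical, with "content" standing for the image pixel features plus the [MASK] token and "position" for the positional features of the blocks.

```python
# Hypothetical routing table mirroring the four cases of Figure 2.
# Each entry maps a flow to (stream for the query, stream for key and value).
FLOW_SOURCES = {
    "visible_to_visible": ("content", "content"),   # Figure 2(1), self-attention
    "mask_to_mask": ("position", "position"),       # Figure 2(2), self-attention
    "visible_to_mask": ("content", "position"),     # Figure 2(3), cross-attention
    "mask_to_visible": ("position", "content"),     # Figure 2(4), cross-attention
}


def attention_kind(flow):
    """Self-attention when query and key/value share a stream, else cross."""
    query_src, key_value_src = FLOW_SOURCES[flow]
    return "self" if query_src == key_value_src else "cross"


def select_inputs(flow, content_stream, position_stream):
    """Return (query input, key/value input) for the given information flow."""
    streams = {"content": content_stream, "position": position_stream}
    query_src, key_value_src = FLOW_SOURCES[flow]
    return streams[query_src], streams[key_value_src]


for flow in FLOW_SOURCES:
    print(flow, attention_kind(flow))
```

The selected inputs would then be projected to query, key, and value vectors and passed to the attention calculation.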

[0064] Once the attention output information corresponding to each information flow has been computed by the attention network, it is linearly mapped by the feedforward network in the feature encoder to obtain the deep representation information of the image to be processed.

[0065] In this embodiment of the invention, the image to be processed is divided into multiple equally divided pixel blocks, and then a feature encoder is invoked to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain deep representation information. Therefore, image features are extracted based on the decoupling of information flow between equally divided pixel blocks. Compared to existing technologies that directly extract image features at the same time step without considering the dependencies between image blocks, this method can extract image representations more accurately.

[0066] Referring again to Figure 1, step 103: based on the fractal space-filling curve paradigm, construct a fractal sequence from the deep representation information to obtain fractal sequence representation information, and call the feature decoder to perform feature decoding processing on the fractal sequence representation information to obtain the feature prediction value of the self-supervised learning task.

[0067] After the features are extracted in step 102, they must be decoded according to the learning objective of self-supervised learning and mapped to feature prediction values for that objective, so that the self-supervised learning task can be executed. Before feature decoding, however, the extracted deep representation information must be serialized. Serialization is generally accomplished using a serialization paradigm corresponding to the information flow; such paradigms include the autoencoding paradigm and the autoregressive paradigm. Figure 3 is a schematic diagram of the autoencoding paradigm provided in the prior art. As shown in Figure 3, the autoencoding paradigm typically employs an architecture that is both bidirectional and symmetrical ([M] denotes the mask token) and extracts the deep representation information z from the image pixel features x and the image position features p. However, it does not consider the dependencies between features. Figure 4 is a schematic diagram of the autoregressive paradigm provided in the prior art. The autoregressive paradigm focuses on the sequential nature of image pixel features, that is, the available contextual information should change dynamically with the position in the sequence. In Figure 4, starting from the image pixel features, contextual information is extracted T times until the deep representation information is obtained as the final contextual information, which is equivalent to processing a sequence of T elements one after another. The contextual information extracted at each step depends on the element's position in the sequence. Because adjacent elements in the sequence are connected, feature serialization can in theory express the dependencies between image pixel features. However, the autoregressive paradigm does not consider the two-dimensional nature of the image modality (the element sequence is only one-dimensional), and therefore misses dependencies that may exist between image pixel features that are not adjacent in the sequence.

[0068] Autoregressive paradigms generally require the input image feature information to be serialized first, and this serialization typically scans the feature information with some scanning method. Figure 5 is a schematic diagram of sequence processing using various autoregressive paradigms provided by this invention; it shows four serializations of the features of 64 equally divided pixel blocks (labeled 0-63). Figure 5(a) shows the traditional autoregressive paradigm, a natural language-like sequence scanning method that scans strictly line by line, so the resulting sequence representation information is fixed. Figure 5(b) shows a scanning method based on the random serialization paradigm, in which the direction of each scan is random, so the sequence representation information obtained from each scan is not fixed. Neither method can accurately express the dependency relationships between equally divided pixel blocks.
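The two baseline serializations can be sketched for an 8 x 8 grid of blocks labeled 0-63 in row-major order. The function names and the use of a seeded shuffle are assumptions for illustration; the source does not specify how the random serialization is drawn.

```python
import random


def raster_order(n):
    """Line-by-line scan: left to right, then top to bottom."""
    return [row * n + col for row in range(n) for col in range(n)]


def random_order(n, seed=None):
    """Random serialization: a shuffled visiting order (seed is assumed)."""
    order = raster_order(n)
    random.Random(seed).shuffle(order)
    return order


fixed = raster_order(8)
shuffled = random_order(8, seed=0)
print(fixed[:10])
```

The raster order is identical on every call, while the random order changes with the seed; in the raster order, vertically adjacent blocks are always n positions apart in the sequence.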

[0069] Based on this, this invention designs a fractal space-filling curve paradigm, and constructs fractal sequences from deep representation information based on the fractal space-filling curve paradigm to obtain fractal sequence representation information. The fractal space-filling curve paradigm includes the Hilbert curve paradigm or the Z-order curve paradigm, which are essentially also fractal autoregressive paradigms.

[0070] The Hilbert curve paradigm is designed from the definition of the Hilbert curve in fractal geometry. During fractal sequence construction, the deep representation information is scanned along the Hilbert curve, and the feature fractal sequences obtained from the scan are concatenated to obtain the fractal sequence representation information. The fractal sequence scan is illustrated in Figure 5(c). The extracted fractal sequence representation information can accurately express the dependency relationship between the equally divided pixel blocks without losing feature information.
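A Hilbert-curve scan of this kind can be sketched as follows. This is a minimal illustration and not the implementation of the invention: it uses the standard distance-to-coordinate conversion for a Hilbert curve on an n x n grid (n a power of two), the function names are hypothetical, and the orientation of the curve may differ from the one drawn in Figure 5(c). Blocks are assumed to be labeled row by row, so block (row, col) has label row * n + col.

```python
def hilbert_d2xy(n, d):
    """Map distance d along the Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the current sub-square before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Block labels (row * n + col) in the order the Hilbert curve visits them."""
    order = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)  # x is treated as the column, y as the row
        order.append(y * n + x)
    return order


def serialize(features, order):
    """Reorder a list of per-block feature vectors into the scanned sequence."""
    return [features[i] for i in order]


if __name__ == "__main__":
    print(hilbert_order(2))
```

Every pair of consecutive elements in the resulting sequence is a pair of neighboring blocks on the grid, which is the locality property that distinguishes this scan from a line-by-line scan.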

[0071] Similarly, the Z-order curve paradigm is designed from the definition of the Z-order curve commonly used in spatial indexing algorithms that partition space into grids. During fractal sequence construction, the deep representation information is scanned along the Z-order curve, and the feature fractal sequences obtained from the scan are concatenated to obtain the fractal sequence representation information. The fractal sequence scan is illustrated in Figure 5(d). The extracted fractal sequence representation information can accurately express the dependency relationship between the equally divided pixel blocks without losing feature information.
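The Z-order scan can be sketched in the same style. The code below is an illustrative assumption rather than the implementation of the invention: it computes the Morton index of each block by interleaving the bits of its row and column, then sorts the blocks by that index. The function names are hypothetical, blocks are assumed to be labeled row * n + col, and the choice of placing column bits in the even positions is one common convention that may differ from Figure 5(d).

```python
def morton_index(row, col, bits):
    """Interleave the bits of col (even positions) and row (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)
        z |= ((row >> i) & 1) << (2 * i + 1)
    return z


def z_order(n):
    """Block labels (row * n + col) sorted by their Morton index."""
    bits = max(1, (n - 1).bit_length())
    cells = [
        (morton_index(row, col, bits), row * n + col)
        for row in range(n)
        for col in range(n)
    ]
    return [label for _, label in sorted(cells)]


if __name__ == "__main__":
    print(z_order(8)[:8])
```

The scan visits each 2 x 2 group of blocks before moving on, and applies the same pattern recursively to larger groups, which gives the sequence its fractal structure.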

[0072] After the deep representation information has been serialized through the fractal space-filling curve paradigm to obtain the fractal sequence representation information, the feature decoder in the self-supervised learning model is called to decode it, yielding the feature prediction values for the self-supervised learning task. The feature decoder is structured like the feature encoder described above: both include an attention network and a feedforward network, with the same number of decoupled fractal attention modules and feedforward modules, so its structure is not elaborated here. Finally, the feedforward network of the feature decoder outputs the feature prediction values for the self-supervised learning task.

[0073] This invention utilizes a fractal space-filling curve paradigm to construct fractal sequences from extracted deep representation information, obtaining fractal sequence representation information for feature decoding to obtain feature prediction values. Therefore, before feature decoding, a fractal space-filling curve paradigm is designed to construct fractal sequences of representation information with different information flows. Compared to existing technologies that use natural language scanning sequences or random serialization for feature sequence construction, this approach fully considers the two-dimensional characteristics between image patches, further improving image representation accuracy. Furthermore, the fractal space-filling curve paradigm can be applied to various two-dimensional image scenarios, possessing a certain degree of versatility.

[0074] Step 104: Perform linear mapping on the feature prediction values to obtain the target prediction result of the self-supervised learning task, and construct the self-supervised learning loss value based on the target prediction result.

[0075] After the feature prediction values for the self-supervised learning task are determined, a simple output projection layer (e.g., a fully connected neural network) performs linear mapping on them to obtain the target prediction result of the self-supervised learning task, denoted as Y. The target prediction result can be, for example, the classification (recognition) probability in image target recognition. This completes the self-supervised learning task of the self-supervised learning model.
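The output projection can be pictured as a single affine map applied to each feature prediction vector. The sketch below is a minimal pure-Python illustration; the function name, the nested-list representation, and the presence of a bias term are assumptions, and the weights would in practice be learned parameters of the model.

```python
def linear_projection(features, weight, bias):
    """Apply Y = F W + b row by row.

    features: list of feature vectors, each of length d_in
    weight:   d_in x d_out nested list
    bias:     list of length d_out
    """
    d_out = len(bias)
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) + bias[j] for j in range(d_out)]
        for f in features
    ]


if __name__ == "__main__":
    feats = [[1.0, 2.0], [0.0, 1.0]]
    w = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    b = [0.0, 0.0, 0.5]
    print(linear_projection(feats, w, b))
```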

[0076] During training of the self-supervised learning model, this embodiment of the invention first generates a Histogram of Oriented Gradients (HOG) feature map from the image to be processed. Other pre-trained self-supervised learning models are then called to perform prediction on the generated HOG feature map and, by executing the corresponding self-supervised learning task, finally construct the true label of the image to be processed, denoted $Y_{\mathrm{gt}}$; the image to be processed is annotated with this true label. After the self-supervised learning model determines the target prediction result $Y$ of the self-supervised learning task, the difference between the true label $Y_{\mathrm{gt}}$ and the target prediction result $Y$ is used to construct an L2 norm loss function, which serves as the self-supervised learning loss value, denoted $\mathcal{L}$ and expressed by formula (2):

[0077] $$\mathcal{L} = \lVert Y_{\mathrm{gt}} - Y \rVert_2 \tag{2}$$

[0078] In formula (2), $\lVert \cdot \rVert_2$ denotes the calculation of the L2 norm.
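The L2 norm loss described for formula (2) can be sketched in a few lines. This is an illustration under assumptions: the function name is hypothetical, the true label and the target prediction result are represented as nested lists of equal shape, and the plain (unsquared) L2 norm of their difference is taken as the loss.

```python
import math


def l2_norm_loss(y_true, y_pred):
    """L2 norm of the element-wise difference between true labels and predictions."""
    total = 0.0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p in zip(row_true, row_pred):
            total += (t - p) ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    print(l2_norm_loss([[3.0, 0.0]], [[0.0, 4.0]]))
```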

[0079] The self-supervised learning loss value calculated by formula (2) is used for iterative training of the self-supervised learning model. In each iteration, the loss value is backpropagated through the self-supervised learning model, passing in turn through the feature decoder, the feature encoder, and the image block encoder; gradients are computed during backpropagation and the parameters are optimized by gradient descent, so that each iteration reduces the loss value. Training ends when a specified number of iterations is reached or the loss value converges (i.e., stops decreasing). The trained self-supervised learning model can then be used directly for prediction, performing the self-supervised learning task.
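The stopping rule described above (a fixed iteration budget or a loss that no longer decreases) can be illustrated with a deliberately tiny example. The sketch below is not the training procedure of the invention: it fits a single scalar weight by gradient descent, all names and hyperparameters are hypothetical, and it minimizes the squared L2 norm, an assumption made here so that the gradient stays smooth near the minimum. In the actual model the gradient would be obtained by backpropagation through the decoder, the encoder, and the image block encoder.

```python
def train_scalar_model(xs, targets, lr=0.01, max_iters=500, tol=1e-9):
    """Fit pred = w * x by gradient descent on the squared L2 loss.

    Stops when max_iters is reached or the loss decreases by less than tol.
    Returns the learned weight and the loss recorded at each iteration.
    """
    w = 0.0
    prev_loss = float("inf")
    history = []
    for _ in range(max_iters):
        preds = [w * x for x in xs]
        loss = sum((p - t) ** 2 for p, t in zip(preds, targets))
        history.append(loss)
        if prev_loss - loss < tol:
            break  # the loss has stopped decreasing: treat as converged
        prev_loss = loss
        # Gradient of the squared L2 loss with respect to w.
        grad = 2.0 * sum((p - t) * x for p, t, x in zip(preds, targets, xs))
        w -= lr * grad
    return w, history


if __name__ == "__main__":
    w, history = train_scalar_model([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
    print(round(w, 3), len(history))
```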

[0080] The self-supervised learning model trained in this embodiment of the invention can be applied to various two-dimensional image scenes and demonstrates its performance on different test datasets. Figure 6 is a schematic diagram of attention visualization analysis on images from different test datasets provided by the present invention. As shown in Figure 6, multiple different prediction targets are set in each target scene, and the prediction targets have certain dependencies on each other. The scenes are: a fish tank scene in the upper left, an office desk scene in the upper right, a bathroom scene in the lower left, and a street scene in the lower right. In each scene, the trained self-supervised learning model can be used for mask image modeling, and attention is computed over the different prediction targets in the scene to obtain the prediction results. Compared with existing mask image modeling techniques, the trained self-supervised learning model provides stronger feature representation capability, is applicable to different target scenes, is versatile, and improves the model's generalization performance.

[0081] In this embodiment of the invention, the image to be processed for a self-supervised learning task is divided into multiple equally divided pixel blocks. Then, a feature encoder is invoked to perform information flow-based decoupling processing on these equally divided pixel blocks to obtain deep representation information. Next, a fractal space-filling curve paradigm is used to construct a fractal sequence from the extracted deep representation information, resulting in fractal sequence representation information, which is then used for feature decoding to obtain feature prediction values. This method extracts image features by decoupling the information flow between equally divided pixel blocks. Compared to existing technologies that directly extract image features at the same time step without considering the dependencies between image blocks, this method can extract image representations more accurately. Before feature decoding, a fractal space-filling curve paradigm is designed to construct fractal sequences of representation information with different information flows. Compared to existing technologies that use natural language scanning sequences or random serialization to construct feature sequences, this method fully considers the two-dimensional characteristics between image blocks, further improving the accuracy of image representation. Furthermore, the fractal space-filling curve paradigm can be applied to various two-dimensional image scenarios, possessing a certain degree of versatility and improving the model's generalization performance.

[0082] The following describes the self-supervised learning device for mask image modeling based on unified information flow provided by the present invention. The self-supervised learning device for mask image modeling based on unified information flow described below can be referred to in correspondence with the self-supervised learning method for mask image modeling based on unified information flow described above.

[0083] Figure 7 is a schematic diagram of the structure of the self-supervised learning device for mask image modeling based on unified information flow provided by the present invention. As shown in Figure 7, the device includes the following modules: an acquisition module 701, which acquires the image to be processed for the self-supervised learning task and calls the image block encoder to perform image block processing on the image to be processed, obtaining multiple equally divided pixel blocks; a decoupling module 702, which calls the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks, obtaining deep representation information of the image to be processed; a construction module 703, which constructs a fractal sequence from the deep representation information based on the fractal space-filling curve paradigm, obtaining fractal sequence representation information, and calls the feature decoder to perform feature decoding processing on the fractal sequence representation information, obtaining feature prediction values for the self-supervised learning task; and a training module 704, which performs linear mapping processing on the feature prediction values to obtain the target prediction result of the self-supervised learning task, and constructs a self-supervised learning loss value based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model.

[0084] It should be noted that the beneficial effects of the self-supervised learning device for mask image modeling based on unified information flow correspond to those of the self-supervised learning method for mask image modeling based on unified information flow described above, and are therefore not elaborated here.

[0085] Figure 8 is a schematic diagram of the physical structure of an electronic device provided by the present invention. As shown in Figure 8, the electronic device may include: a processor 810, a communications interface 820, a memory 830, and a communications bus 840, wherein the processor 810, the communications interface 820, and the memory 830 communicate with each other through the communications bus 840. The processor 810 can call logic instructions in the memory 830 to execute a self-supervised learning method for mask image modeling based on a unified information flow. This method includes: acquiring an image to be processed for a self-supervised learning task; calling the image block encoder to perform image block processing on the image to be processed, obtaining multiple equally divided pixel blocks; calling the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain deep representation information of the image to be processed; constructing a fractal sequence from the deep representation information based on the fractal space-filling curve paradigm to obtain fractal sequence representation information; calling the feature decoder to perform feature decoding processing on the fractal sequence representation information to obtain feature prediction values for the self-supervised learning task; performing linear mapping processing on the feature prediction values to obtain the target prediction result of the self-supervised learning task; and constructing a self-supervised learning loss value based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model.

[0086] Furthermore, the logical instructions in the aforementioned memory 830 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0087] In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the self-supervised learning method for mask image modeling based on unified information flow provided by the above methods. The method includes: acquiring an image to be processed for a self-supervised learning task, and calling the image block encoder to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks; calling the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain deep representation information of the image to be processed; constructing a fractal sequence from the deep representation information based on the fractal space-filling curve paradigm to obtain fractal sequence representation information, and calling the feature decoder to perform feature decoding processing on the fractal sequence representation information to obtain feature prediction values for the self-supervised learning task; performing linear mapping processing on the feature prediction values to obtain the target prediction result of the self-supervised learning task, and constructing a self-supervised learning loss value based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model.

[0088] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the self-supervised learning method for mask image modeling based on a unified information flow provided by the methods described above. This method includes: acquiring an image to be processed for a self-supervised learning task; calling the image block encoder to perform image block processing on the image to be processed, obtaining multiple equally divided pixel blocks; calling the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks, obtaining deep representation information of the image to be processed; constructing a fractal sequence from the deep representation information based on the fractal space-filling curve paradigm, obtaining fractal sequence representation information; calling the feature decoder to perform feature decoding processing on the fractal sequence representation information, obtaining feature prediction values for the self-supervised learning task; performing linear mapping processing on the feature prediction values to obtain the target prediction result of the self-supervised learning task; and constructing a self-supervised learning loss value based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model.

[0089] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0090] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0091] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A self-supervised learning method for mask image modeling based on unified information flow, characterized in that the method is applied to a self-supervised learning model including an image block encoder, a feature encoder, and a feature decoder, and the method includes:
acquiring an image to be processed for a self-supervised learning task, and calling the image block encoder to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks;
calling the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain deep representation information of the image to be processed;
constructing a fractal sequence from the deep representation information based on a fractal space-filling curve paradigm to obtain fractal sequence representation information, and calling the feature decoder to perform feature decoding processing on the fractal sequence representation information to obtain feature prediction values for the self-supervised learning task;
performing linear mapping processing on the feature prediction values to obtain a target prediction result of the self-supervised learning task, and constructing a self-supervised learning loss value based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model;
wherein the information flow includes an ontology perception information flow and an external body perception information flow; the visible pixel blocks among the equally divided pixel blocks have combined features, which include image pixel features and pixel position features; and the mask pixel blocks among the equally divided pixel blocks have mask position features;
the ontology perception information flow includes a first mapping relationship between combined features or between mask position features;
the external body perception information flow includes a second mapping relationship between combined features and mask position features, or between mask position features and combined features;
the step of calling the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain the deep representation information of the image to be processed includes:
when the equally divided pixel blocks have the first mapping relationship, calling the attention network of the feature encoder to process the equally divided pixel blocks using a self-attention mechanism to obtain attention output information;
when the equally divided pixel blocks have the second mapping relationship, calling the attention network of the feature encoder to process the equally divided pixel blocks using a cross-attention mechanism to obtain attention output information;
performing linear mapping on the attention output information through the feedforward network in the feature encoder to obtain the deep representation information of the image to be processed.

2. The self-supervised learning method for mask image modeling based on unified information flow according to claim 1, characterized in that the step of calling the image block encoder to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks includes:
performing embedding processing on the image to be processed to obtain a token embedding sequence;
obtaining the binarized mask sequence corresponding to the image to be processed, and performing image masking on the token embedding sequence using the binarized mask sequence to obtain visible pixel blocks and mask pixel blocks, which serve as the multiple equally divided pixel blocks of the image to be processed.

3. The self-supervised learning method for mask image modeling based on unified information flow according to claim 1, characterized in that the fractal space-filling curve paradigm includes the Hilbert curve paradigm or the Z-order curve paradigm, and the step of constructing a fractal sequence from the deep representation information based on the fractal space-filling curve paradigm to obtain fractal sequence representation information includes:
performing a fractal sequence scan on the deep representation information using the Hilbert curve paradigm, and concatenating the feature fractal sequences obtained from the scan to obtain the fractal sequence representation information;
or, performing a fractal sequence scan on the deep representation information using the Z-order curve paradigm, and concatenating the feature fractal sequences obtained from the scan to obtain the fractal sequence representation information.

4. The self-supervised learning method for mask image modeling based on unified information flow according to claim 1, characterized in that the image to be processed has a true label, which is constructed from the HOG feature map generated from the image to be processed, and the step of constructing a self-supervised learning loss value based on the target prediction result includes:
determining the difference between the true label and the target prediction result;
constructing an L2 norm loss function based on the difference, and using the L2 norm loss function as the self-supervised learning loss value.

5. A self-supervised learning device for mask image modeling based on unified information flow, characterized in that the device includes:
an acquisition module, used to acquire the image to be processed for the self-supervised learning task, and to call the image block encoder to perform image block processing on the image to be processed to obtain multiple equally divided pixel blocks;
a decoupling module, used to call the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain the deep representation information of the image to be processed;
a construction module, used to construct a fractal sequence from the deep representation information based on the fractal space-filling curve paradigm to obtain fractal sequence representation information, and to call the feature decoder to perform feature decoding processing on the fractal sequence representation information to obtain the feature prediction values of the self-supervised learning task;
a training module, used to perform linear mapping processing on the feature prediction values to obtain the target prediction result of the self-supervised learning task, and to construct a self-supervised learning loss value based on the target prediction result, wherein the self-supervised learning loss value is used to train the self-supervised learning model;
wherein the information flow includes an ontology perception information flow and an external body perception information flow; the visible pixel blocks among the equally divided pixel blocks have combined features, which include image pixel features and pixel position features; and the mask pixel blocks among the equally divided pixel blocks have mask position features;
the ontology perception information flow includes a first mapping relationship between combined features or between mask position features;
the external body perception information flow includes a second mapping relationship between combined features and mask position features, or between mask position features and combined features;
the calling of the feature encoder to perform information flow-based decoupling processing on the equally divided pixel blocks to obtain the deep representation information of the image to be processed includes:
when the equally divided pixel blocks have the first mapping relationship, calling the attention network of the feature encoder to process the equally divided pixel blocks using a self-attention mechanism to obtain attention output information;
when the equally divided pixel blocks have the second mapping relationship, calling the attention network of the feature encoder to process the equally divided pixel blocks using a cross-attention mechanism to obtain attention output information;
performing linear mapping on the attention output information through the feedforward network in the feature encoder to obtain the deep representation information of the image to be processed.

6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the self-supervised learning method for mask image modeling based on a unified information flow as described in any one of claims 1 to 4.

7. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the self-supervised learning method for mask image modeling based on a unified information flow as described in any one of claims 1 to 4.

8. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the self-supervised learning method for mask image modeling based on a unified information flow as described in any one of claims 1 to 4.