Multi-granularity image feature extraction method, device and medium for image-text fusion
By extracting local and global features of images through sliding window embedding and multi-layer attention mechanism, this method solves the problems of losing key information and ignoring correlation in existing image feature extraction methods, realizes the extraction of multi-granular image features, and improves the effect of image-text fusion.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- UNIV OF ELECTRONICS SCI & TECH OF CHINA
- Filing Date
- 2023-05-22
- Publication Date
- 2026-06-26
AI Technical Summary
Existing image feature extraction methods suffer from the loss of key information during the pooling process in image-text fusion, ignore the correlation between the whole and the local, do not consider the correlation between text and image, and are unable to extract multi-granular features of the image.
We employ a sliding window to embed image vectors, combining window attention and multi-layer global attention mechanisms. We extract local features of the image through cross-correlation operations and CNN, extract global features of the image using the global attention mechanism, and output local and global features through a stitching layer.
The method takes the relationship between images and text into account, extracts multi-granular image features, and improves the effect of image-text fusion.
Figure CN116563859B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision recognition technology, and in particular to a method, device and medium for multi-granularity image feature extraction for image-text fusion.
Background Technology
[0002] With the rapid development of information technology and the Internet, the information that users receive and send is no longer limited to text alone but takes multimodal forms such as text and images. Data in one modality generally contains only the information belonging to that modality. For example, in the design, manufacturing, and maintenance of integrated aviation electronic equipment, images are often used to illustrate the design, manufacturing, or maintenance work, while text is used to describe the requirements; these different forms of data all describe the same thing. When the description of this thing in one modality is incomplete, the complementary nature of the modalities can be used to fuse their data reasonably, which supports better judgments. Therefore, image-text fusion has become a key research direction in the field of deep learning. Current image-text fusion methods mainly focus on two aspects: single-modal representation and image-text feature fusion. Among them, single-modal representation is the foundation and prerequisite for image-text fusion. Google proposed the BERT model in 2018. This model uses an attention mechanism to perform bidirectional modeling of text, thereby obtaining excellent multi-granularity text vector representations at the word level, sentence level, etc. However, current image feature extraction methods do not consider the information interaction between images and text, resulting in poor image-text fusion performance. In current research on image-text fusion methods, the mainstream image feature extraction methods mainly include GoogLeNet, VGGNet, Residual Networks, and the R-CNN series of networks. These methods are all based on convolutional neural networks to extract image features, and mainly include the following components:
[0003] (a) Convolutional layer: The convolutional kernel scans the pixel matrix of the image with a certain stride to extract image features and obtain a certain number of feature maps.
[0004] (b) Pooling layer: By using pooling functions, such as max-pooling and average-pooling, the feature dimension of the convolutional layer output is reduced.
[0005] (c) Fully connected layer: Flatten the feature vectors output by the pooling layer and output the global features of the image through the fully connected neural network.
[0006] The quality of data feature extraction determines the effectiveness of image-text fusion. The above-mentioned CNN-based image feature extraction methods mainly have the following technical drawbacks:
[0007] (a) The image feature extraction model based on CNN loses key information during the pooling process and also ignores the relationship between the whole and the local.
[0008] (b) Existing image feature extraction methods do not consider the relationship between text and images. They all extract global features of the image and cannot extract multi-granular features of the image.
Summary of the Invention
[0009] This invention provides a multi-granularity image feature extraction method, device, and medium for image-text fusion.
[0010] In a first aspect, the present invention provides a multi-granularity image feature extraction method for image-text fusion, comprising the following steps: preprocessing image data;
[0011] Using a sliding window to embed image vectors: In two-dimensional cross-correlation operations, the convolution kernel starts from the top left corner of the input image tensor and slides from left to right and from top to bottom. When the convolution kernel slides to a new position, the tensor part of the current window is cross-correlated with the tensor in the convolution kernel, thereby achieving local embedding of the image and obtaining the embedding vector.
[0012] The image embedding vector is input into the attention computation layer. First, the embedding vector is used to obtain the local features of the image using the window attention mechanism. Then, a multi-layer global attention mechanism is used to extract the global features of the image.
[0013] The stitching layer outputs both local and global features of the image simultaneously.
[0014] Furthermore, the cross-correlation operation is as follows: Given two functions f(x) and g(x), the mathematical definition of the convolution of f(x) and g(x) is as shown in equation (1):
[0015] $(f * g)(x) = \int f(z)\, g(x - z)\, dz$ (1)
[0016] When the object of convolution is a discrete object, the integration operation in the above formula becomes a summation operation. Taking the convolution operation of a two-dimensional tensor as an example, its specific calculation method is shown in formula (2):
[0017] $(f * g)(i, j) = \sum_{a} \sum_{b} f(a, b)\, g(i - a, j - b)$ (2)
[0018] Where $(a, b)$ represents the index of function $f$, and $(i - a, j - b)$ represents the index of function $g$.
[0019] Furthermore, in the step of obtaining the embedding vector, when performing cross-correlation operation, when the convolution kernel slides to the edge of the image, the edge of the image tensor needs to be padded with 0 elements. The number of elements in one slide is defined as the stride. The sliding stride of the image tensor is adjusted so that the number of text words is aligned with the image window.
[0020] Furthermore, a CNN is used for embedding image features. For an image dataset of size $m$, $I = \{i_1, i_2, \ldots, i_m\}$, where each image $i_m \in \mathbb{R}^{H \times W \times C}$, $H$ is the length of the image, $W$ is the width of the image, and $C$ is the number of channels in the image. For image $i_m$, assume the length of the text fused with the image is $n^2$. In this embodiment, the image is segmented into $n^2$ windows, the same number as the text length; the specific segmentation method is shown in Figure 2. Image segmentation and embedding are implemented using a convolutional neural network. First, based on the number of windows $n^2$, the size of the window, i.e. the size of the convolution kernel of the convolutional neural network, is calculated as shown in equations (3) and (4).
[0021] $k_h = H + p_h + s_h - n \cdot s_h$ (3)
[0022] $k_w = W + p_w + s_w - n \cdot s_w$ (4)
[0023] Where $k_h$ and $k_w$ represent the height and width of the convolution kernel, $p_h$ and $p_w$ represent the row and column padding, and $s_h$ and $s_w$ are hyperparameters representing the vertical stride and horizontal stride.
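Equations (3) and (4) agree with the usual output-size relation of a strided convolution. The short derivation below is a sketch under two assumptions that the text does not state explicitly: $p_h$ is the total number of padding rows (both edges together), and the stride divides the remaining extent exactly, so that $n$ windows fit along the vertical axis.

```latex
\begin{aligned}
n &= \frac{H + p_h - k_h}{s_h} + 1 \\
(n - 1)\, s_h &= H + p_h - k_h \\
k_h &= H + p_h + s_h - n \cdot s_h
\end{aligned}
```

The same steps applied to the horizontal axis, with $W$, $p_w$, $s_w$, and $k_w$, give equation (4).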
[0024] Furthermore, the calculation process for extracting local features of the image using the window attention mechanism is shown in equations (5) and (6):
[0025]
[0026]
[0027] Where MLP represents a fully connected neural network, $X_{I,[LOC]}$ represents the extracted local image features, and W-Attention represents window attention calculation, that is, the calculation of multi-head attention is restricted to a window. At the same time, a channel attention mechanism is introduced in the self-attention calculation process. First, features of different aspects of the image are extracted through multiple attention heads, and the specific calculation is shown in equations (7), (8), and (9):
[0028]
[0029]
[0030]
[0031] Where $W_{nj}$ represents the weight of the j-th attention head for the n-th image, j is a hyperparameter, and in this technical solution, j = 12;
[0032] After the multiple weight matrices $W_{nj}$ are obtained, these weight matrices are regarded as feature maps of the image channel dimension. The contribution of each feature map to the key information is calculated by the channel attention. The specific calculation method is shown in Equation (10).
[0033]
[0034] Where the result of Equation (10) represents the channel attention weight of the j-th attention head weight matrix for the n-th image; avg represents the global average pooling function, which calculates the average value of each weight matrix; max represents the global max pooling function, which calculates the maximum value of each weight matrix; σ represents the sigmoid function, which maps the result to a value in the interval from 0 to 1 to obtain the channel attention weights; and α and β are trainable parameters.
[0035] After obtaining the channel attention weights, the attention weights for the nth image are calculated, and the specific calculation method is shown in Equation (11):
[0036]
[0037] After the local feature vector representation $X_{I,[LOC]}$ of the image is obtained, a randomly initialized global feature label vector $E_{[GLO]}$ is added before this vector to obtain a new vector. The purpose of this label is to learn the global features of the image. The new vector is passed through a three-layer global attention mechanism to calculate the global features of the image. The specific calculation process is as follows:
[0038]
[0039]
[0040] Here, G-Attention represents global attention computation, which uses multi-head self-attention to compute global features of the image. In this case, the attention computation is no longer restricted to a fixed window. From the vector obtained after computation through l layers of global attention, the output at the label position is taken as the global feature representation of the image.
[0041] Furthermore, the local feature representation $X_{I,[LOC]}$ of the image is obtained through window attention calculation, and the global feature representation of the image is obtained through multi-layer global attention computation. Concatenating the local feature representation $X_{I,[LOC]}$ with the global feature representation yields a feature vector containing both local and global features of the image.
[0042] In a second aspect, the present invention also provides a multi-granularity image feature extraction system for image-text fusion, comprising: an image embedding layer, which embeds image vectors using a sliding window by performing a cross-correlation operation based on two-dimensional convolution. Specifically, in the two-dimensional cross-correlation operation, the convolution kernel starts from the upper left corner of the input image tensor and slides from left to right and from top to bottom; when the convolution kernel slides to a new position, the tensor part of the current window is cross-correlated with the tensor in the convolution kernel, thereby realizing local embedding of the image and obtaining the embedding vector;
[0043] Attention computation layer: The image embedding vector is input into the attention computation layer. First, the embedding vector is used to obtain the local features of the image using the window attention mechanism. Then, a multi-layer global attention mechanism is used to extract the global features of the image.
[0044] Output layer: Simultaneously outputs local and global features of the image through the stitching layer.
[0045] In a third aspect, the present invention also provides a computer device, the computer device including a memory and a processor;
[0046] The memory is used to store computer programs;
[0047] The processor is used to execute the computer program and, when executing the computer program, implement the multi-granularity image feature extraction method for image-text fusion as described above.
[0048] In a fourth aspect, the present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the above-described method for multi-granularity image feature extraction oriented towards image-text fusion.
[0049] This invention discloses a multi-granularity image feature extraction method for image-text fusion. It embeds image vectors through a sliding window, extracts local features of the image through a window attention mechanism and global features through a multi-layer global attention mechanism, and finally concatenates and outputs these features. This method takes the relationship between images and text into account and achieves multi-granularity image feature extraction.
Attached Figure Description
[0050] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0051] Figure 1 is an example of the convolution operation (cross-correlation operation).
[0052] Figure 2 is a schematic diagram of the multi-granularity image feature extraction model structure.
[0053] Figure 3 is a schematic diagram of the sliding segmentation of the image into windows.
Detailed Implementation
[0054] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0055] The flowchart shown in the attached diagram is for illustrative purposes only and does not necessarily include all content and operations / steps, nor does it necessarily have to be performed in the order described. For example, some operations / steps can be broken down, combined, or partially merged, so the actual execution order may change depending on the actual situation.
[0056] It should be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.
[0057] It should also be understood that the term “and / or” as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0058] This invention provides a multi-granularity image feature extraction method for image-text fusion. To fuse the text and image modalities, features must be extracted from the text and the image separately. While BERT-based text feature extraction methods are relatively mature, CNN-based image feature extraction methods have many shortcomings and do not adapt well to image-text fusion models. Therefore, this embodiment proposes an image feature extraction module for image-text fusion. In this module, a CNN implements sliding embedding of the image to obtain the embedding vector, the proposed window attention mechanism extracts local features of the image, and a multi-layer global attention mechanism extracts global features of the image, thereby realizing multi-granularity image feature extraction. This addresses the problems of existing image-text fusion methods: the pooling layer loses key information, the relationship between the local and the whole of the image is ignored, and the connection between text and image is not considered. The proposed image feature extraction method for image-text fusion involves: (1) preprocessing image data; (2) embedding image vectors using a sliding window; (3) inputting the embedded image vectors into an attention layer to extract local and global features; and (4) simultaneously outputting the local and global features of the image through a concatenation layer. The overall model is structured similarly to the image feature extraction part of an image-text fusion model. The model uses a CNN to implement sliding embedding of images and obtains local and global features through an attention layer. On this basis, it outputs the local and global features of the image, thereby achieving multi-granularity image feature extraction and providing image feature support for image-text fusion.
[0059] (1) Preprocessing the image data. This mainly involves denoising the image data and unifying the pixel values.
[0060] (2) Embed image vectors using a sliding window. The process corresponds to the embedding layer in Figure 1, and the window sliding method is shown in Figure 2.
[0061] First, the structure of the Convolutional Neural Network (CNN) model is reviewed. A CNN is a feedforward neural network based on multiple convolutional and pooling layers, commonly used for image recognition and classification tasks. Its core idea is to extract features through multiple convolutional layers and then downsample them through pooling layers to reduce the dimensionality of the features. This allows the network to recognize images more accurately while reducing computational cost and increasing processing speed. In the convolutional layers, each neuron is connected to only a small portion of the input data, and features are extracted through the convolution kernel. In the pooling layers, the maximum value or the average value within a certain range is selected to represent the features of that region. Through the stacking of multiple convolutional and pooling layers, the network can progressively extract features at different levels, achieving higher-level abstraction. This embodiment uses only the convolution operation, padding, and stride of the convolutional layers; therefore, only these three are described, as follows:
[0062] (a) Convolution operation (cross-correlation)
[0063] Given two functions f(x) and g(x), the mathematical definition of the convolution of f(x) and g(x) is shown in equation (1).
[0064] $(f * g)(x) = \int f(z)\, g(x - z)\, dz$ (1)
[0065] When the object of convolution is a discrete object, the integration operation in the above formula becomes a summation operation. Taking the convolution operation of a two-dimensional tensor as an example, its specific calculation method is shown in formula (2).
[0066] $(f * g)(i, j) = \sum_{a} \sum_{b} f(a, b)\, g(i - a, j - b)$ (2)
[0067] Where $(a, b)$ represents the index of function $f$, and $(i - a, j - b)$ represents the index of function $g$.
[0068] In the image feature extraction process, the method actually used in this embodiment is a simplified cross-correlation operation based on two-dimensional convolution. The specific calculation method is shown in Figure 1.
[0069] In two-dimensional cross-correlation operations, the convolution kernel starts from the top left corner of the input image tensor and slides from left to right and from top to bottom. When the convolution kernel slides to a new position, the tensor portion of the current window performs a cross-correlation operation with the tensor in the convolution kernel, thereby achieving local embedding of the image. This embodiment uses convolution operations to extract local features of the image, thereby achieving image embedding through a sliding window.
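The sliding computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation; the function names `cross_correlate2d` and `convolve2d` and the small example arrays are hypothetical. It also shows how cross-correlation relates to the convolution of formula (2): convolution is cross-correlation with the kernel flipped along both axes.

```python
import numpy as np

def cross_correlate2d(x, k):
    """Slide k over x (no padding, stride 1) and sum the element-wise products."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):          # top to bottom
        for j in range(ow):      # left to right
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def convolve2d(x, k):
    """Convolution as cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(x, k[::-1, ::-1])

# Hypothetical 3x3 input and 2x2 kernel.
x = np.arange(9, dtype=float).reshape(3, 3)
k = np.array([[0.0, 1.0], [2.0, 3.0]])
corr = cross_correlate2d(x, k)   # [[19, 25], [37, 43]]
conv = convolve2d(x, k)          # [[5, 11], [23, 29]]
print(corr)
print(conv)
```

Because the kernel weights of a CNN are learned, the flip makes no practical difference, which is why the text treats the two operations interchangeably.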
[0070] (b) Padding and stride
[0071] During convolution operations, when the convolution kernel slides to the edge of the image, the image edge may be lost due to the image size, meaning the convolution kernel cannot perform operations with the edge pixels. Therefore, this embodiment requires padding the image tensor. A common method is to pad the edges of the image tensor with zero elements, thereby enabling the image edges to participate in the convolution operation.
[0072] During the sliding of the convolutional kernel, it typically slides one element by default. However, to improve computational efficiency or reduce the sampling rate, the convolutional window can skip intermediate elements, thus sliding multiple elements at a time. The number of elements slid at one time is called the stride. This embodiment adjusts the image tensor through padding and stride to align the number of text words with the number of image windows.
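A minimal NumPy sketch of zero padding and stride follows. The function name `corr2d_padded` and the example sizes are hypothetical, and the `pad` argument here counts zeros added on each side of the tensor, which is a convention chosen for this sketch only.

```python
import numpy as np

def corr2d_padded(x, k, pad=(0, 0), stride=(1, 1)):
    """2D cross-correlation with zero padding and a configurable stride."""
    ph, pw = pad        # zeros added on EACH side (sketch convention)
    sh, sw = stride     # vertical and horizontal stride
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="constant", constant_values=0.0)
    kh, kw = k.shape
    oh = (xp.shape[0] - kh) // sh + 1
    ow = (xp.shape[1] - kw) // sw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = xp[i * sh:i * sh + kh, j * sw:j * sw + kw]
            out[i, j] = np.sum(window * k)
    return out

# A 4x4 image of ones, a 2x2 kernel of ones, one ring of zeros, stride 2.
out = corr2d_padded(np.ones((4, 4)), np.ones((2, 2)), pad=(1, 1), stride=(2, 2))
print(out.shape)  # (3, 3)
print(out)        # corner windows see one image pixel, the centre window sees four
```

With padding, the edge pixels contribute to the output; with a larger stride, fewer windows are produced, which is the lever used to match the window count to the text length.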
[0073] Therefore, this embodiment uses a CNN for image feature embedding. For an image dataset of size $m$, $I = \{i_1, i_2, \ldots, i_m\}$, where each image $i_m \in \mathbb{R}^{H \times W \times C}$, $H$ is the length of the image, $W$ is the width of the image, and $C$ is the number of channels in the image. For image $i_m$, assume the length of the text fused with the image is $n^2$. In this embodiment, the image is segmented into $n^2$ windows, the same number as the text length; the specific segmentation method is shown in Figure 2. Image segmentation and embedding are implemented using a convolutional neural network. First, based on the number of windows $n^2$, the size of the window, i.e. the size of the convolution kernel of the convolutional neural network, is calculated as shown in equations (3) and (4).
[0074] $k_h = H + p_h + s_h - n \cdot s_h$ (3)
[0075] $k_w = W + p_w + s_w - n \cdot s_w$ (4)
[0076] Where $k_h$ and $k_w$ represent the height and width of the convolution kernel, $p_h$ and $p_w$ represent the row and column padding, and $s_h$ and $s_w$ are hyperparameters representing the vertical stride and horizontal stride.
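Equations (3) and (4) can be checked numerically. The sketch below is illustrative only: the helper names `kernel_size` and `num_windows` and the 224-pixel example are hypothetical, and the padding argument is assumed to be the total number of padded rows or columns along one axis.

```python
def kernel_size(size, pad, stride, n):
    """Equations (3)/(4): kernel length that yields n windows along one axis."""
    k = size + pad + stride - n * stride
    if k <= 0:
        raise ValueError("stride is too large for the requested number of windows")
    return k

def num_windows(size, pad, stride, k):
    """Standard output-size relation of a strided convolution along one axis."""
    return (size + pad - k) // stride + 1

# Hypothetical setting: a 224-pixel axis, no padding, n = 7 windows per axis,
# i.e. n * n = 49 windows for a text of 49 tokens.
H, n = 224, 7
k_nonoverlap = kernel_size(H, pad=0, stride=32, n=n)   # 32: windows tile the axis
k_overlap = kernel_size(H, pad=0, stride=16, n=n)      # 128: windows overlap
print(k_nonoverlap, num_windows(H, 0, 32, k_nonoverlap))  # 32 7
print(k_overlap, num_windows(H, 0, 16, k_overlap))        # 128 7
```

Both stride choices give seven windows per axis; a smaller stride simply produces larger, overlapping windows.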
[0077] Then, the relevant parameters are calculated based on the above formulas, and a two-dimensional convolutional neural network is used to implement window sliding for image embedding. The H and W dimensions of the embedded output are flattened and moved to the first dimension to finally obtain the image embedding vector.
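The embedding and flattening step can be sketched as follows. This is an assumption-laden illustration in plain NumPy, not the authors' code: the function name `window_embed`, the kernel layout `(D, k_h, k_w, C)`, the absence of padding, and the random example data are all hypothetical.

```python
import numpy as np

def window_embed(image, kernels, stride):
    """Embed an (H, W, C) image into one D-dimensional vector per window.

    kernels has shape (D, k_h, k_w, C); the result has shape (n_h * n_w, D).
    """
    H, W, _ = image.shape
    D, kh, kw, _ = kernels.shape
    sh, sw = stride
    nh = (H - kh) // sh + 1
    nw = (W - kw) // sw + 1
    out = np.zeros((D, nh, nw))
    for i in range(nh):
        for j in range(nw):
            window = image[i * sh:i * sh + kh, j * sw:j * sw + kw, :]
            # Cross-correlate the window with each of the D kernels.
            out[:, i, j] = np.tensordot(kernels, window, axes=([1, 2, 3], [0, 1, 2]))
    # Flatten the two spatial axes and move them to the first dimension.
    return out.reshape(D, nh * nw).T

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))       # hypothetical 8x8 RGB image
kernels = rng.normal(size=(4, 2, 2, 3))  # hypothetical D = 4 kernels of size 2x2
tokens = window_embed(image, kernels, stride=(2, 2))
print(tokens.shape)  # (16, 4): 4 x 4 windows, one 4-dimensional vector each
```

Each row of `tokens` plays the role of one window embedding, so the sequence length matches the number of windows.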
[0078] (3) Input the embedded image vector into the attention calculation layer to extract the local and global features of the image;
[0079] To achieve image feature extraction, this embodiment proposes an image feature extraction model based on window attention and hierarchical attention mechanisms. First, image embedding is performed using a sliding window to obtain embedding vectors. Then, the embedding vectors are first processed using a window attention mechanism to obtain local features of the image, and then a multi-layer global attention mechanism is used to extract global features of the image.
[0080] The specific process is as follows:
[0081] After the image embedding vector $X_{I,e}$ is obtained, the local features of the image are extracted using the window attention mechanism. The model structure is shown in the dashed box on the right corresponding to the window attention mechanism in Figure 2. The overall calculation process is shown in equations (5) and (6).
[0082]
[0083]
[0084] Where MLP represents a fully connected neural network, $X_{I,[LOC]}$ represents the extracted local image features, and W-Attention represents window attention calculation, which means that the multi-head attention calculation is restricted to a window. At the same time, a channel attention mechanism is introduced in the self-attention calculation process. First, features of different aspects of the image are extracted through multiple attention heads, and the specific calculation is shown in equations (7), (8), and (9).
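Equations (5) to (9) are not reproduced in this text, so the following is only a sketch of the general idea of restricting attention to a window. It assumes standard scaled dot-product attention, a single head for brevity (the text uses multiple heads), and windows formed from contiguous groups of tokens; the function name `window_attention` and all shapes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(x, wq, wk, wv, window):
    """Single-head attention in which a token attends only within its own window."""
    t = x.shape[0]
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    group = np.arange(t) // window                 # window id of every token
    same_window = group[:, None] == group[None, :]
    scores = np.where(same_window, scores, -np.inf)  # block cross-window attention
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                        # 8 tokens of width 4
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = window_attention(x, wq, wk, wv, window=4)
print(out.shape)             # (8, 4)
print(weights[0].round(2))   # zeros for tokens 4..7, which lie in the other window
```

The mask is what distinguishes this from global attention: removing it lets every token attend to every other token.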
[0085]
[0086]
[0087]
[0088] Where $W_{nj}$ represents the weight of the j-th attention head in the n-th image, j is a hyperparameter, and in this technical solution, j = 12.
[0089] After the multiple weight matrices $W_{nj}$ are obtained, these weight matrices are considered as feature maps of the image channel dimension. The contribution of each feature map to the key information is calculated by channel attention, and the specific calculation method is shown in Equation (10).
[0090]
[0091] Where the result of Equation (10) represents the channel attention weight of the j-th attention head weight matrix for the n-th image; avg represents the global average pooling function, which calculates the average value of each weight matrix; max represents the global max pooling function, which calculates the maximum value of each weight matrix; σ represents the sigmoid function, which maps the result to a value in the interval from 0 to 1 to obtain the channel attention weights; and α and β are trainable parameters.
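Equation (10) itself is not reproduced in this text. Based only on the symbols described above, one plausible form is σ(α·avg(W) + β·max(W)) per attention head; the sketch below implements that assumed form, and both the formula and the function name `channel_attention` should be read as assumptions, not as the patented equation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(head_weights, alpha, beta):
    """Assumed form: one weight in (0, 1) per head from pooled statistics.

    head_weights has shape (J, T, T): one attention weight matrix per head.
    """
    avg = head_weights.mean(axis=(1, 2))   # global average pooling per head
    mx = head_weights.max(axis=(1, 2))     # global max pooling per head
    return sigmoid(alpha * avg + beta * mx)

rng = np.random.default_rng(0)
head_weights = rng.random(size=(12, 4, 4))   # 12 heads, as in the text; 4 tokens (hypothetical)
channel_weights = channel_attention(head_weights, alpha=1.0, beta=1.0)
print(channel_weights.shape)  # (12,)
```

In a trained model α and β would be learned; the fixed values here only serve to make the sketch runnable.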
[0092] After obtaining the channel attention weights, the attention weights of the nth image are calculated, and the specific calculation method is shown in Equation (11).
[0093]
[0094] After the local feature vector representation $X_{I,[LOC]}$ of the image is obtained, this embodiment adds a randomly initialized global feature label vector $E_{[GLO]}$ before this vector to obtain a new vector. The purpose of this label is to learn the global features of the image. The new vector is passed through a three-layer global attention mechanism to calculate the global features of the image. The specific calculation process is as follows:
[0095]
[0096]
[0097] Here, G-Attention represents global attention computation. In this embodiment, multi-head self-attention is used to compute the global features of the image. In this case, the computation of attention is no longer restricted to a fixed window. From the vector obtained after computation through l layers of global attention, the output at the label position is taken as the global feature representation of the image.
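The global step can be sketched as prepending a label vector and reading its position after several attention layers. The equations for this step are not reproduced in the text, so the sketch assumes standard scaled dot-product self-attention with a residual connection and uses a single head for brevity; the function names and the random weights are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Unrestricted attention: every token attends to every other token."""
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[1])) @ v

def global_feature(x_loc, e_glo, layers):
    """Prepend the [GLO] label, apply the layers, return the label-position output."""
    x = np.vstack([e_glo[None, :], x_loc])
    for wq, wk, wv in layers:
        x = x + self_attention(x, wq, wk, wv)   # residual connection: an assumption
    return x[0], x

rng = np.random.default_rng(0)
d = 8
x_loc = rng.normal(size=(16, d))     # 16 local window features (hypothetical)
e_glo = rng.normal(size=d)           # randomly initialized label vector
layers = [tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3)) for _ in range(3)]
x_glo, x_all = global_feature(x_loc, e_glo, layers)   # three layers, as in the text
print(x_glo.shape, x_all.shape)  # (8,) (17, 8)
```

Because the label attends to all window features in every layer, its final state summarizes the whole image.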
[0098] (4) Multi-granularity image feature output layer. The local feature representation $X_{I,[LOC]}$ of the image is obtained through window attention calculation, and the global feature representation of the image is obtained through multi-layer global attention computation. This embodiment concatenates the local feature representation $X_{I,[LOC]}$ with the global feature representation to obtain a feature vector containing both local and global features of the image.
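The output step amounts to a concatenation. The text does not say along which axis the features are joined; the sketch below assumes the global vector is stacked as an extra row in front of the per-window local features, and the function name `concat_local_global` and the shapes are hypothetical.

```python
import numpy as np

def concat_local_global(x_loc, x_glo):
    """Stack the global feature vector in front of the local window features."""
    return np.concatenate([x_glo[None, :], x_loc], axis=0)

x_loc = np.zeros((16, 8))   # hypothetical: 16 window features of width 8
x_glo = np.ones(8)          # hypothetical global feature vector
features = concat_local_global(x_loc, x_glo)
print(features.shape)  # (17, 8)
```

Under this assumption the output has one row per image window plus one global row, which lines up with a text sequence of the same length plus a sentence-level vector.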
[0099] The embodiments of this application also provide a computer-readable storage medium storing a computer program, the computer program including program instructions, and the processor executing the program instructions to implement any of the multi-granularity image feature extraction methods for image-text fusion provided in the embodiments of this application.
[0100] The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, SmartMedia Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the computer device.
[0101] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A multi-granularity image feature extraction method for image-text fusion, characterized in that it includes the following steps:
preprocessing the image data;
embedding image vectors using a sliding window: in the two-dimensional cross-correlation operation, the convolution kernel starts from the top left corner of the input image tensor and slides from left to right and from top to bottom; when the convolution kernel slides to a new position, the tensor part of the current window is cross-correlated with the tensor in the convolution kernel, thereby achieving local embedding of the image and obtaining the embedding vector;
inputting the image embedding vector into the attention computation layer: first, the window attention mechanism is applied to the embedding vector to obtain the local features of the image; then, a multi-layer global attention mechanism is used to extract the global features of the image;
outputting both local and global features of the image simultaneously through the stitching layer;
the cross-correlation operation is as follows: given two functions f(x) and g(x), the mathematical definition of the convolution of f(x) and g(x) is as shown in equation (1):
$(f * g)(x) = \int f(z)\, g(x - z)\, dz$ (1)
when the object of convolution is a discrete object, the integration operation in the above formula becomes a summation operation; taking the convolution operation of a two-dimensional tensor as an example, its specific calculation method is shown in formula (2):
$(f * g)(i, j) = \sum_{a} \sum_{b} f(a, b)\, g(i - a, j - b)$ (2)
where $(a, b)$ represents the index of function $f$, and $(i - a, j - b)$ represents the index of function $g$;
in the step of obtaining the embedding vector, when performing the cross-correlation operation, when the convolution kernel slides to the edge of the image, the edge of the image tensor needs to be padded with 0 elements; the number of elements in one slide is defined as the stride; the sliding stride of the image tensor is adjusted so that the number of text words is aligned with the image window;
a CNN is used to embed image features: for an image dataset of size $m$, $I = \{i_1, i_2, \ldots, i_m\}$, where each image $i_m \in \mathbb{R}^{H \times W \times C}$, $H$ is the length of the image, $W$ is the width of the image, and $C$ is the number of channels in the image; assuming the length of the text fused with the image is $n^2$, the image is segmented into $n^2$ windows, the same number as the text length; image segmentation and embedding are implemented using a convolutional neural network; first, based on the number of windows $n^2$, the size of the window, i.e. the size of the convolution kernel of the convolutional neural network, is calculated as shown in equations (3) and (4):
$k_h = H + p_h + s_h - n \cdot s_h$ (3)
$k_w = W + p_w + s_w - n \cdot s_w$ (4)
where $k_h$ and $k_w$ represent the height and width of the convolution kernel, $p_h$ and $p_w$ represent the row padding and column padding, and $s_h$ and $s_w$ are hyperparameters representing the vertical stride and horizontal stride;
the calculation process for extracting local features of the image using the window attention mechanism is shown in equations (5) and (6), where MLP represents a fully connected neural network, $X_{I,[LOC]}$ represents the extracted local features of the image, and W-Attention represents window attention calculation, which restricts the calculation of multi-head attention to within a window; at the same time, a channel attention mechanism is introduced into the self-attention calculation process; first, multiple attention heads are used to extract features from different aspects of the image, and the specific calculation is shown in equations (7), (8), and (9), where $W_{nj}$ represents the weight of the j-th attention head for the n-th image, j is a hyperparameter, and j = 12;
after the multiple weight matrices $W_{nj}$ are obtained, these weight matrices are considered as feature maps of the image channel dimension; the contribution of each feature map to the key information is calculated by the channel attention, and the specific calculation method is shown in Equation (10), whose result represents the channel attention weight of the j-th attention head weight matrix for the n-th image; avg represents the global average pooling function, which calculates the average value of each weight matrix; max represents the global max pooling function, which calculates the maximum value of each weight matrix; σ represents the sigmoid function, which maps the result to a value in the interval from 0 to 1 to obtain the channel attention weights; α and β are trainable parameters;
after the channel attention weights are obtained, the attention weights for the n-th image are calculated, and the specific calculation method is shown in Equation (11);
after the local feature vector representation $X_{I,[LOC]}$ of the image is obtained, a randomly initialized global feature label vector $E_{[GLO]}$ is added before this vector to obtain a new vector; the purpose of this label is to learn the global features of the image; the new vector is passed through a three-layer global attention mechanism to calculate the global features of the image, and the specific calculation process is shown in equations (12) and (13), where G-Attention represents global attention computation, which uses multi-head self-attention to compute global features of the image; in this case, the attention computation is no longer restricted to a fixed window; from the vector obtained after computation through l layers of global attention, the output at the label position is taken as the global feature representation of the image;
the local feature representation $X_{I,[LOC]}$ of the image is obtained through window attention calculation, and the global feature representation of the image is obtained through multi-layer global attention computation; concatenating the local feature representation with the global feature representation yields a feature vector containing both local and global features of the image.
2. A multi-granularity image feature extraction system for image-text fusion to implement the method of claim 1, characterized in that it includes: an image embedding layer, which embeds image vectors using a sliding window by performing cross-correlation operations based on two-dimensional convolution; specifically, in the two-dimensional cross-correlation operation, the convolution kernel starts from the top left corner of the input image tensor and slides from left to right and from top to bottom, and when the convolution kernel slides to a new position, the tensor part of the current window is cross-correlated with the tensor in the convolution kernel, thereby achieving local embedding of the image and obtaining the embedding vector; an attention computation layer, into which the image embedding vector is input, where the window attention mechanism is first applied to the embedding vector to obtain the local features of the image, and a multi-layer global attention mechanism is then used to extract the global features of the image; and an output layer, which simultaneously outputs local and global features of the image through the stitching layer.
3. A computer device, characterized in that the computer device includes a memory and a processor; the memory is used to store computer programs; the processor is configured to execute the computer program and, when executing the computer program, implement the multi-granularity image feature extraction method for image-text fusion as described in claim 1.
4. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to implement the multi-granularity image feature extraction method for image-text fusion as claimed in claim 1.