A single-stage medical image difference question answering method based on fine-grained alignment
By constructing a single-stage medical image difference question-answering model, the problems of semantic differences between image features and text features and the lack of fine-grained information mining were solved. The model achieved alignment of image and text features and accurate answer generation, thereby improving the prediction accuracy of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2024-05-31
- Publication Date
- 2026-06-30
AI Technical Summary
Existing two-stage medical image difference question answering methods suffer from semantic differences between image features and text features, and the failure to effectively mine fine-grained information from the text, resulting in inaccurate model predictions.
A single-stage medical image difference question-answering model based on fine-grained alignment is constructed, including an image difference feature extraction module, a text feature extraction module, a multimodal feature fusion module, and a differential decoder. The fine-grained feature alignment module realizes the alignment of image and text features, and the Transformer decoder is used to generate the answer.
The prediction accuracy of the medical image difference question answering model has been improved, and the fusion effect of multimodal information has been enhanced by improving the perception and alignment of fine-grained features of images and text.
Description
Technical Field
[0001] This invention belongs to the field of multimodal learning technology and relates to a single-stage medical image difference question answering method, specifically a single-stage medical image difference question answering method based on fine-grained alignment.
Background Technology
[0002] Image difference question answering combines image question answering with difference detection, aiming to understand the content and differences of images across different time dimensions and provide corresponding answers based on the questions. It requires a high level of understanding in both visual and textual modalities. Mainstream visual difference question answering models employ a multi-stream network structure, extracting image features from different time periods, processing them to obtain difference features, extracting question text features, and finally fusing these features before feeding them into the decoder to generate the answer.
[0003] Medical images play a crucial role in the medical field, serving as a vital basis for clinicians to assess patients' conditions. In today's society, medical resources remain strained, with a shortage of doctors compared to patients. With the development of medical image question answering and more practical clinical diagnosis and follow-up solutions, image difference question answering has been applied to the medical field, forming specialized medical image difference question answering tasks. These tasks aim to compare medical images of patients at different times, understand questions, and provide answers. However, in reality, medical images are sampled from different perspectives each time, and parameters such as brightness and contrast vary. Pixel subtraction alone cannot accurately perceive these differences. Furthermore, medical images contain a wealth of information, requiring models to perceive more fine-grained information and describe corresponding fine-grained answer information based on this image detail.
[0004] To address the issue of differing viewpoints and imaging parameters in medical images taken at different times, Xinyue Hu, Lin Gu, Qiyuan An, and others proposed a two-stage medical image difference question answering method based on multiple relationship awareness in their paper "Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering" (International Conference on Data Mining and Knowledge Discovery 2023). In the first stage, a pre-trained object detection network extracts fine-grained features of local images of medical symptoms and organs / tissues, along with the corresponding anchor box coordinates. Then, based on the anchor box coordinates, spatial and semantic relationships between medical symptoms and organs / tissues are established, ultimately transforming all medical images in the dataset into fine-grained image features, spatial relationship graphs of medical symptoms and organs / tissues, and semantic relationship graphs. In the second stage, an LSTM extracts question text features. A graph attention network then fuses the fine-grained image information, spatial relationship graph, and semantic relationship graph extracted in the first stage with the question text features. Finally, a feedforward neural network extracts difference features, which are input into the LSTM to generate the answer. However, because this method uses a two-stage network structure, backpropagation in stage two does not update the network model of stage one, which leaves a semantic gap between the image features extracted in stage one and the text features extracted in stage two. Furthermore, because only global text features are used, fine-grained text information cannot be mined effectively, so the fine-grained features of the image and the text are misaligned, ultimately leading to inaccurate model predictions.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of the existing technology and propose a single-stage medical image difference question answering method based on fine-grained alignment, which aims to improve the prediction accuracy of network models in medical image difference question answering tasks.
[0006] To achieve the above objectives, the technical solution adopted by the present invention mainly includes the following steps:
[0007] (1) Obtain the training sample set and the test sample set:
[0008] Two chest images $I_{bef}^{(n)}$ and $I_{aft}^{(n)}$ of each of N patients taken at different times, together with the question text $Q_n$ and answer text $A_n$, form the sample of the n-th patient. The samples of M patients form the training sample set, and the remaining $N - M$ samples form the test sample set, where N > 1000.
[0009] (2) Construct a single-stage medical image difference question-answering model based on fine-grained alignment:
[0010] A medical image differential question-answering model G is constructed, consisting of a feature extraction network composed of parallel image differential feature extraction modules and text feature extraction modules, as well as a multimodal feature fusion module, a parallel fine-grained feature alignment module, and a differential decoder cascaded thereon. The image differential feature extraction module includes a sequentially cascaded image feature extraction module and a differential feature enhancement module; the fine-grained feature alignment module includes a stacked normalization layer, a cross-attention layer, and a softmax layer; the differential decoder consists of a question selector and two parallel Transformer decoders cascaded together.
[0011] (3) Iteratively train the single-stage medical image difference question-answering model:
[0012] The single-stage medical image difference question-answering model is iteratively trained using the training sample set to obtain the trained question-answering model $G^*$;
[0013] (4) Obtain the results of a single-stage medical image difference question and answer:
[0014] The test sample set is used as input to the trained question-answering model for forward propagation, passing sequentially through a feature extraction network, a multimodal feature fusion module, and a differential decoder to obtain the answer text corresponding to the question text in the test sample set.
[0015] Compared with the prior art, the present invention has the following advantages:
[0016] First, the fine-grained feature alignment module constructed in this invention enhances the model's perception of coarse-grained and fine-grained features of image text, fully mines the fine-grained information of text, and aligns the fine-grained information of text with the fine-grained information of image, thereby realizing the modeling of the relationship between multimodal fine-grained information. Furthermore, by using a differential decoder, different Transformer decoders are used to generate answers for different categories of questions, thereby effectively improving the accuracy of the differential question answering model.
[0017] Second, the single-stage medical image differential question answering model constructed in this invention includes an image feature extraction module based on the Transformer structure, a differential feature enhancement module, a multimodal feature fusion module, and a differential decoder. This ensures the consistency of the structure of each module, makes the feature vectors of multiple modalities more semantically similar, reduces the semantic gap of multimodal features, and enables better fusion of various features, thereby further improving the accuracy of the differential question answering model.
Description of the Drawings
[0018] Figure 1 is a flowchart illustrating the implementation of the present invention;
[0019] Figure 2 is a schematic diagram of the structure of the single-stage medical image difference question-answering model constructed in this invention.
Detailed Implementation
[0020] The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0021] Referring to Figure 1, the present invention includes the following steps:
[0022] Step 1: Obtain the training sample set and the test sample set.
[0023] Obtain the MIMIC-Diff-VQA dataset, which contains two chest images $I_{bef}$ and $I_{aft}$ taken at different times for each of 164,324 patients. For each patient's pair of chest images, different questions Q are posed and corresponding answers A are provided, resulting in 164,324 samples $(I_{bef}, I_{aft}, Q, A)$, where $I_{bef}$ and $I_{aft}$ are the visual input of the model, Q is the text input, and A is the output. All samples are divided in an 8:2 ratio into a training set of 131,459 samples and a test set of 32,865 samples.
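The 8:2 split can be checked with a short sketch. This is only an illustration: the function name `split_samples`, the fixed seed, and the shuffling step are assumptions, since the text does not say how samples are assigned to each set. Integer indices stand in for the actual $(I_{bef}, I_{aft}, Q, A)$ samples.

```python
import random

def split_samples(samples, num=8, den=10, seed=0):
    """Shuffle the samples and split them into (train, test) by the ratio num:den-num."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption, not stated in the text
    n_train = len(shuffled) * num // den   # integer arithmetic avoids float rounding
    return shuffled[:n_train], shuffled[n_train:]

# Indices stand in for the 164,324 (I_bef, I_aft, Q, A) samples.
train_set, test_set = split_samples(range(164_324))
print(len(train_set), len(test_set))  # 131459 32865
```

Rounding the training share down reproduces the set sizes stated above.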
[0024] Step 2: Construct a single-stage medical image difference question-answering model based on fine-grained alignment, the structure of which is shown in Figure 2.
[0025] A medical image difference question-answering model is constructed, comprising a feature extraction network consisting of parallel image difference feature extraction modules and text feature extraction modules, as well as a cascaded multimodal feature fusion module, a parallel fine-grained feature alignment module, and a differential decoder.
[0026] The image difference feature extraction module includes a cascaded image feature extraction module and a difference feature enhancement module. The image feature extraction module includes an embedding layer and a 12-layer Transformer encoder. The embedding layer includes parallel image pixel embedding layers and position encoding embedding layers. The image pixel embedding layer is a convolutional layer with a kernel size of 24×24, 768 channels, and a stride of 24. Each Transformer encoder consists of a LayerNorm layer, a 12-head multi-head self-attention layer, a DropOut layer, a LayerNorm layer, and a feedforward network stacked sequentially. The feedforward network includes two cascaded fully connected layers, a DropOut layer, and a LayerNorm layer.
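As a quick check of the embedding layer's geometry, the sketch below applies the standard convolution output-size formula to the stated kernel size and stride. The 384×384 input resolution is taken from step (3b1) of the training procedure; the helper name `conv_output_size` and the zero padding are assumptions made for illustration.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution along one axis (zero padding assumed)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

side = conv_output_size(384, kernel_size=24, stride=24)  # patches per image side
k = side * side          # number of fine-grained patch vectors
num_tokens = k + 1       # plus one prefix vector for the global feature
print(side, k, num_tokens)  # 16 256 257
```

Because the stride equals the kernel size, the patches do not overlap, and each of the k vectors has 768 channels.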
[0027] The differential feature enhancement module includes a cascaded linear projection layer, a position encoding layer, and a 6-layer Transformer encoder. Each Transformer encoder includes a 6-head multi-head attention layer, a DropOut layer, a LayerNorm layer, and a feedforward network connected in sequence. The feedforward network includes two cascaded fully connected layers, a DropOut layer, and a LayerNorm layer.
[0028] The text feature extraction module consists of a tokenizer layer, a BertEmbedding layer, and three BertLayer modules stacked together. The BertEmbedding layer includes parallel word encoding embedding layers and position encoding embedding layers. The BertLayer structure is consistent with the Transformer, consisting of a 6-head multi-head attention layer, a DropOut layer, a LayerNorm layer, and a feedforward network connected in sequence. The feedforward network includes two cascaded fully connected layers, a DropOut layer, and a LayerNorm layer.
[0029] The multimodal feature fusion module includes a 3-layer Transformer encoder. Each Transformer encoder includes a LayerNorm layer, a 6-head multi-head cross-attention layer, and a feedforward network layer. The feedforward network includes two cascaded fully connected layers, a DropOut layer, and a LayerNorm layer.
[0030] The fine-grained feature alignment module includes a normalization layer, a cross-attention layer, and a softmax layer;
[0031] The differential decoder consists of a question selector and two parallel 3-layer Transformer decoders cascaded together. The question selector consists of a normalization layer and a fully connected layer cascaded together. Each Transformer decoder includes a LayerNorm layer, a 6-head multi-head cross-attention layer, and a feedforward network layer. The feedforward network includes two cascaded fully connected layers, a DropOut layer, and a LayerNorm layer.
[0032] Step 3: Iteratively train the single-stage medical image difference question-answering model:
[0033] (3a) Let t denote the iteration index and T the maximum number of iterations, with T = 200; let $G_t$ denote the single-stage medical image difference question-answering model in the t-th iteration and $\omega_t$ its learnable parameters; set t = 1;
[0034] (3b) Select B training samples from the training sample set as inputs to the feature extraction network for forward propagation, where B = 128:
[0035] (3b1) The chest images $I_{bef}$ and $I_{aft}$ of the b-th sample are downsampled to 384×384×3 (height × width × channels). Each image then passes through the image pixel embedding layer of the image feature extraction module to obtain a feature map, which is flattened into k = 16 × 16 vectors of dimension dim = 768. A prefix vector is prepended to the flattened feature map to extract the global features of the image; after concatenation, these vectors form the image input vectors of the Transformer encoder. A positional encoding embedding layer then embeds the index (0, 1, 2, ..., k) of each vector to obtain the positional input vectors of the Transformer encoder, which are added to the corresponding image input vectors to give the final Transformer input:
[0036]
[0037] Finally, the feature vectors $F_{bef}$ and $F_{aft}$ of the images $I_{bef}$ and $I_{aft}$ of the b-th sample are obtained through the 12-layer Transformer encoder. In each, the first feature vector represents the global features of the image, while the remaining k feature vectors represent fine-grained features. The difference features $F_{diff}$ are then obtained by subtracting these vectors:
[0038]
[0039] The features $F_{bef}$, $F_{aft}$, and $F_{diff}$ are labeled with the indices (0, 1, 2), respectively. Each index is broadcast to a dimension consistent with its feature and encoded by the position encoding layer of the differential feature enhancement module to obtain a category code, which is added to each vector of the corresponding feature:
[0040]
[0041] where embedding denotes the position encoding layer. The encoded features of the two images are then each concatenated with the encoded difference features and fed into the Transformer encoder of the differential feature enhancement module to obtain the final outputs $F'_{bef}$ and $F'_{aft}$:
[0042]
[0043] where $H_{diff}$ denotes the Transformer encoder of the differential feature enhancement module and ";" denotes the concatenation operation.
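The data flow of step (3b1) up to the encoder input can be sketched on toy vectors. Several details here are assumptions rather than statements from the text: the subtraction order ($F_{aft} - F_{bef}$), the values in `type_table` (which stand in for the learned category codes of indices 0, 1, 2), and the helper names. The Transformer encoder $H_{diff}$ itself is omitted.

```python
def subtract(a, b):
    """Element-wise difference of two equally shaped lists of vectors."""
    return [[x - y for x, y in zip(u, v)] for u, v in zip(a, b)]

def add_type_code(features, code):
    """Add the same category code to every vector (the 'broadcast' step)."""
    return [[x + c for x, c in zip(vec, code)] for vec in features]

# Toy features: 1 global vector followed by 2 fine-grained vectors, dim = 2.
f_bef = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
f_aft = [[1.0, 1.0], [0.5, 0.0], [2.0, 1.0]]
f_diff = subtract(f_aft, f_bef)  # subtraction order is an assumption

# Hypothetical category codes for the indices (0, 1, 2).
type_table = {0: [0.1, 0.1], 1: [0.2, 0.2], 2: [0.3, 0.3]}
e_bef = add_type_code(f_bef, type_table[0])
e_aft = add_type_code(f_aft, type_table[1])
e_diff = add_type_code(f_diff, type_table[2])

# ";" in the text: concatenate along the token axis before the encoder.
enc_in_bef = e_bef + e_diff
enc_in_aft = e_aft + e_diff
print(f_diff, len(enc_in_bef))
```

Each encoder input therefore carries both the image's own tokens and the difference tokens, tagged so the encoder can tell them apart.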
[0044] (3b2) The text feature extraction module extracts features from the question text $Q_b$ and answer text $A_b$ of each training sample. The tokenizer layer of the text feature extraction module segments and numbers $Q_b = \{q_1, q_2, ..., q_{qLen}\}$ and $A_b = \{a_1, a_2, ..., a_{aLen}\}$, where qLen and aLen are the sentence lengths of the question and answer, respectively. A special [ENC] symbol is added before the beginning of the sentence for global feature extraction, a [SEP] symbol is added after the end of the sentence, and the sentence length is uniformly extended to 90. The BertEmbedding layer of the text feature extraction module then embeds the words to obtain the question text vectors and answer text vectors, which are encoded by the BertLayer modules to obtain the respective feature vectors of $Q_b$ and $A_b$. In each, the first feature is the global feature and the remaining 90 are fine-grained features, all with a dimension of dim;
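The token layout described in step (3b2) might look like the sketch below. It is a simplified illustration, not the tokenizer used by the invention: whitespace splitting replaces subword tokenization, and the `[PAD]` symbol, the truncation rule, and the function name `prepare_tokens` are assumptions. One [ENC] token plus 90 further positions matches the stated split into one global and 90 fine-grained features.

```python
MAX_LEN = 90  # sentence length stated in the text

def prepare_tokens(sentence, max_len=MAX_LEN, pad="[PAD]"):
    """Return [ENC] + words + [SEP], padded so that 90 positions follow [ENC]."""
    words = sentence.lower().split()          # stand-in for a real subword tokenizer
    body = words[: max_len - 1] + ["[SEP]"]   # truncate long sentences (assumption)
    body += [pad] * (max_len - len(body))     # pad short sentences (assumption)
    return ["[ENC]"] + body

tokens = prepare_tokens("what has changed in the lungs")
print(len(tokens), tokens[:9])
```

After encoding, position 0 would give the global feature and positions 1 to 90 the fine-grained features.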
[0045] (3b3) The multimodal feature fusion module fuses the outputs $F'_{bef}$ and $F'_{aft}$ of the differential feature enhancement module with the features $F_q$ of $Q_b$. $F'_{bef}$ and $F'_{aft}$ are concatenated into $F_v$, which serves as the key and value of the attention, while the question features $F_q$ serve as the query. The features are fused through the Transformer decoder, where the cross-attention is calculated as follows:
[0046] $$key = W_{key} \cdot F_v, \quad value = W_{value} \cdot F_v, \quad query = W_{query} \cdot F_q$$
[0047] $$output = (key^{T} \cdot value) \cdot query$$
[0048] where $W_{key}$, $W_{value}$, and $W_{query}$ are all dim×dim parameter matrices, and output is the output of one layer of cross-attention, whose dimension is consistent with that of the question features $F_q$. After three layers of cross-attention and feedforward networks, the final fused feature $F_{fusion} = \{v_{cls}, v_1, v_2, ..., v_{90}\}$ is obtained, where the first feature vector $v_{cls}$ is the fused global feature and the remaining 90 feature vectors are the fused fine-grained features.
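To make the roles of query, key, and value concrete, here is a minimal single-head sketch in which question features attend over image features. It uses standard scaled dot-product attention with a softmax, which is an assumption: the formula above is written in a simplified form without normalization. Vectors are stored as rows, so the projections appear as `F · W`; the toy matrices and function names are illustrative only.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(f_q, f_v, w_query, w_key, w_value):
    """Question tokens (rows of f_q) attend over image tokens (rows of f_v)."""
    query = matmul(f_q, w_query)             # (n_q, dim)
    key = matmul(f_v, w_key)                 # (n_v, dim)
    value = matmul(f_v, w_value)             # (n_v, dim)
    dim = len(query[0])
    scores = matmul(query, transpose(key))   # (n_q, n_v)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, value)            # (n_q, dim): one output per question token

identity = [[1.0, 0.0], [0.0, 1.0]]
f_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # 3 question tokens
f_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # 4 image tokens
out = cross_attention(f_q, f_v, identity, identity, identity)
print(len(out), len(out[0]))  # 3 2
```

The output has one row per question token, which is why the fused feature keeps the layout of the question features (one global vector plus 90 fine-grained vectors).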
[0049] (3b4) The fine-grained feature alignment module aligns the fused fine-grained features of the b-th training sample with the fine-grained features of the answer text $A_r$ of each of the B samples, where r = 1, 2, ..., B. From the text features of the r-th sample's answer, the feature vector corresponding to the [SEP] symbol and the global feature are removed, giving the word-level fine-grained features of the answer text $A_r$. The last 90 fine-grained feature vectors of the fused feature $F_{fusion}$ are extracted. The two sets of features are normalized, and their dot product gives the similarity matrix:
[0050]
[0051] Then, using the similarity matrix sim, the cross-attention score between the i-th word feature of the answer text and the j-th fused fine-grained feature is calculated through a softmax layer:
[0052]
[0053] where $\tau_1$ is the temperature parameter, exp denotes the exponential function with base e, and $sim_{i,j}$ and $att_{i,j}$ denote the similarity and attention score between the i-th word feature and the j-th fused fine-grained feature, respectively. Using the attention scores as weights, the fine-grained perceptual features are obtained:
[0054]
[0055] Next, the dot product of each fine-grained perceptual feature $w'_i$ with the corresponding fine-grained text feature is computed, and the results are averaged to obtain the fine-grained similarity between the b-th sample and the r-th sample:
[0056]
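The alignment steps of (3b4) can be sketched as follows. This is one reading of the description, under stated assumptions: the softmax is taken over the fused features j for each word i, the temperature value 0.1 is hypothetical, normalization is L2 normalization, and the function names are illustrative.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def softmax(row, temperature):
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fine_grained_similarity(word_feats, fused_feats, temperature=0.1):
    """Return (similarity, attention) for one pair of answer text and fused features."""
    words = [l2_normalize(w) for w in word_feats]
    fused = [l2_normalize(f) for f in fused_feats]
    sim = [[dot(w, f) for f in fused] for w in words]     # similarity matrix
    att = [softmax(row, temperature) for row in sim]      # attention over fused features
    dim = len(fused[0])
    perceived = [[sum(a * f[d] for a, f in zip(att_row, fused)) for d in range(dim)]
                 for att_row in att]                      # attention-weighted sums w'_i
    scores = [dot(p, w) for p, w in zip(perceived, words)]
    return sum(scores) / len(scores), att

words = [[1.0, 0.0], [0.0, 1.0]]
matched, att = fine_grained_similarity(words, [[1.0, 0.0], [0.0, 1.0]])
mismatched, _ = fine_grained_similarity(words, [[-1.0, 0.0], [0.0, -1.0]])
print(matched > mismatched)  # True
```

A low temperature sharpens the attention, so each word is compared mostly with its best-matching fused feature.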
[0057] Similarly, the fused global feature of the b-th sample is extracted, and the global similarity $s_g$ between it and the global feature of the r-th sample's answer text is calculated:
[0058]
[0059] where $\langle \cdot , \cdot \rangle$ denotes the vector dot product and $|\cdot|$ denotes the vector norm;
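A dot product divided by the product of the two norms is the cosine similarity, which is assumed here to be the form of the global similarity. The sketch below shows that computation; the function name `global_similarity` and the toy vectors are illustrative.

```python
import math

def global_similarity(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

s_same = global_similarity([3.0, 4.0], [6.0, 8.0])   # same direction
s_orth = global_similarity([1.0, 0.0], [0.0, 1.0])   # orthogonal
print(s_same, s_orth)
```

Because the norms are divided out, only the direction of the two global features matters, not their magnitude.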
[0060] (3b5) The question selector performs binary classification on the global feature of the input question text $Q_b$ to determine which decoder to use. Based on the classification result, the fused feature $F_{fusion}$ is input into the corresponding Transformer decoder to obtain the predicted answer:
[0061]
[0062] where selector denotes the question selector; one of the two Transformer decoders handles difference-type questions and the other handles the remaining categories of questions;
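The routing in step (3b5) can be illustrated with a toy selector: a normalization step followed by a fully connected layer with two outputs, whose larger logit picks the decoder. The weights, the stub decoders, and all function names below are hypothetical stand-ins, and which output index corresponds to difference-type questions is an assumption.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def select_decoder(global_feature, weights, bias):
    """Normalization + fully connected layer; returns the index of the larger logit."""
    normed = layer_norm(global_feature)
    logits = [sum(w * x for w, x in zip(row, normed)) + b
              for row, b in zip(weights, bias)]
    return 0 if logits[0] >= logits[1] else 1

# Stub decoders standing in for the two 3-layer Transformer decoders.
def decode_difference(fused):
    return "answer from the difference-question decoder"

def decode_other(fused):
    return "answer from the other-question decoder"

DECODERS = (decode_difference, decode_other)

def answer(question_global, fused, weights, bias):
    return DECODERS[select_decoder(question_global, weights, bias)](fused)

weights, bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]   # hypothetical parameters
print(answer([0.9, 0.1], fused=None, weights=weights, bias=bias))
print(answer([0.2, 0.7], fused=None, weights=weights, bias=bias))
```

Only the selected decoder processes the fused features, so each decoder can specialize in its own question category.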
[0063] (3c) Based on the similarities and the predicted answer, respectively, calculate the loss $l_{itc}$ of the fine-grained feature alignment module and the loss $l_{lm}$ of the Transformer decoder selected by the question selector, then combine them to obtain the total loss of the single-stage medical image difference question-answering model:
[0064]
[0065]
[0066] where the loss terms are defined using the cross-entropy loss function for binary classification and the cross-entropy loss function, and γ is the weighting coefficient between $l_{itc}$ and $l_{lm}$. The SGD optimization algorithm is then used to update the learnable parameters $\omega_{t-1}$ of the model, giving the model $G_t$ for this iteration:
[0067] $$\omega_t = \omega_{t-1} - \eta \frac{\partial l}{\partial \omega_{t-1}}$$
[0068] where $\frac{\partial l}{\partial \omega_{t-1}}$ denotes the partial derivative of the total loss $l$ with respect to the learnable parameters of the model, and $\eta$ denotes the learning rate.
[0069] (3d) Determine whether t = T holds. If so, the trained question-answering model $G^*$ is obtained; otherwise, let t = t + 1 and return to step (3b).
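The control flow of steps (3a) to (3d) can be sketched as a plain SGD loop. The quadratic toy loss, its gradient function, and the learning rate 0.1 are assumptions used only to make the loop runnable; in the method, the gradient would come from backpropagating the total loss over a batch of B = 128 training samples.

```python
def train(omega, grad_fn, learning_rate, max_iters):
    """Repeat SGD updates until the iteration counter t reaches max_iters (T)."""
    t = 1                                   # step (3a): start at t = 1
    while True:
        grad = grad_fn(omega)               # steps (3b)-(3c): forward pass and loss gradient
        omega = [w - learning_rate * g      # SGD update of the learnable parameters
                 for w, g in zip(omega, grad)]
        if t == max_iters:                  # step (3d): stop when t = T
            return omega
        t += 1                              # otherwise continue with the next iteration

# Toy loss sum((w - target)^2) with gradient 2 * (w - target); purely illustrative.
target = [3.0, -1.0]
grad_fn = lambda omega: [2.0 * (w - c) for w, c in zip(omega, target)]

trained = train([0.0, 0.0], grad_fn, learning_rate=0.1, max_iters=200)
print(trained)
```

With T = 200 updates the toy parameters converge to the minimizer, which mirrors how the loop returns the trained model $G^*$ once the iteration limit is reached.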
[0070] Step 4, obtain the single-stage medical image difference question-and-answer results:
[0071] The test sample set is used as the input of the trained question-answering model $G^*$ and is passed sequentially through the feature extraction network, the multimodal feature fusion module, and the differential decoder for forward propagation to obtain the answer text corresponding to the question text in the test sample set.
[0072] The effects of the present invention will be further illustrated below with simulation experiments:
[0073] 1. Simulation experimental conditions:
[0074] The simulation experiment hardware platform of this invention is: AMD Ryzen 5900X CPU, 32 GB of memory, and an Nvidia GeForce RTX 4090 graphics card.
[0075] The simulation experiment software platform of this invention is: Ubuntu 20.04 operating system and Python 3.8.
[0076] This invention selects CIDEr, BLEU4, METEOR, and ROUGE-L as evaluation indicators for predicting answers.
[0077] 2. Simulation Result Analysis:
[0078] Referring to Table 1, the evaluation metrics are based on the paper "Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering" (EKAID). CIDEr, BLEU4, METEOR, and ROUGE-L are used to compare the similarity between the generated text and the labels. Specific simulation results are shown in the table below. Compared to existing technologies, this invention shows significant improvements in CIDEr, METEOR, and ROUGE-L scores, and also some improvement in the BLEU4 score.
[0079]
Claims
1. A single-stage medical image difference question-answering method based on fine-grained alignment, characterized in that it includes the following steps: (1) Obtain the training sample set and the test sample set: two chest images $I_{bef}^{(n)}$ and $I_{aft}^{(n)}$ of each of N patients taken at different times, together with the question text $Q_n$ and answer text $A_n$, form the sample of the n-th patient; the samples of M patients form the training sample set, and the remaining $N - M$ samples form the test sample set, where N > 1000; (2) Construct a single-stage medical image difference question-answering model based on fine-grained alignment: a medical image difference question-answering model is constructed, comprising a feature extraction network consisting of parallel image difference feature extraction modules and text feature extraction modules, as well as a cascaded multimodal feature fusion module, a parallel fine-grained feature alignment module, and a differential decoder. The image difference feature extraction module includes an image feature extraction module and a difference feature enhancement module cascaded together. The image feature extraction module is used to extract multiple fine-grained features and a global feature from each of the two images. The difference feature enhancement module is used to perform perceptual enhancement on the difference between the fine-grained features and the difference between the global features of the two images to obtain their respective enhanced difference features.
The fine-grained feature alignment module includes stacked normalization layers, cross-attention layers, and softmax layers; the differential decoder consists of a question selector and two parallel Transformer decoders cascaded together, and is used to select one of the Transformer decoders based on the global features of the question text to fuse the global features and fine-grained features fused by the multimodal feature fusion module and generate the answer text; (3) Iteratively train the single-stage medical image difference question-answering model: the single-stage medical image difference question-answering model is iteratively trained using the training sample set to obtain the trained question-answering model $G^*$; (4) Obtain the results of single-stage medical image difference question answering: the test sample set is used as input to the trained question-answering model and is passed through the feature extraction network, the multimodal feature fusion module, and the differential decoder for forward propagation to obtain the answer text corresponding to the question text in the test sample set.
2. The method according to claim 1, characterized in that, for the question text and answer text described in step (1), the question text includes inquiries about the symptoms in the two images and inquiries about the differences in symptoms, and the answer text includes a description of the symptoms in one of the images and an explanation of the differences in symptoms between the two images.
3. The method according to claim 1, characterized in that, in the single-stage medical image difference question-answering model constructed in step (2): the feature extraction network includes an image feature extraction module comprising a cascaded embedding layer and a multi-layer Transformer encoder, and a differential feature enhancement module consisting of a cascaded linear projection layer, a position encoding layer, and a multi-layer Transformer encoder, wherein the Transformer encoders of the image feature extraction module and the differential feature enhancement module consist of a cascaded multi-head self-attention layer and a feedforward network; the text feature extraction module consists of a tokenizer layer, a BertEmbedding layer, and multiple BertLayer modules stacked together; the multimodal feature fusion module includes a multi-layer Transformer decoder, which consists of a cascaded multi-head cross-attention layer and a feedforward network; the feedforward network in the Transformer encoder and decoder consists of two fully connected layers, a DropOut layer, and a normalization layer cascaded together.
4. The method according to claim 1, characterized in that the iterative training of the single-stage medical image difference question-answering model described in step (3) is implemented as follows: (4a) Let t denote the iteration index and T the maximum number of iterations; let $G_t$ denote the single-stage medical image difference question-answering model in the t-th iteration and $\omega_t$ its learnable parameters, and set t = 1; (4b) Select B training samples from the training sample set as input to the feature extraction network for forward propagation: (4b1) The image feature extraction module performs feature extraction separately on the chest images $I_{bef}$ and $I_{aft}$ of each training sample to obtain multiple fine-grained features and one global feature for each of $I_{bef}$ and $I_{aft}$; the differential feature enhancement module performs perceptual enhancement on the differences between the fine-grained features and between the global features of $I_{bef}$ and $I_{aft}$ to obtain the enhanced difference features of each; (4b2) The text feature extraction module performs feature extraction separately on the question text $Q_b$ and answer text $A_b$ of each training sample to obtain multiple fine-grained features and a global feature for each of $Q_b$ and $A_b$; (4c) The multimodal feature fusion module fuses the outputs of the differential feature enhancement module with the features of $Q_b$ to obtain the fused fine-grained features and global feature of the b-th training sample; (4d) The fine-grained feature alignment module aligns the fused fine-grained features of the b-th training sample with the fine-grained features of the answer text $A_r$ in each of the B samples, where r = 1, 2, ..., B, to obtain the aligned fused features, and then calculates, from the aligned fused features and the fine-grained features of $A_r$, the fine-grained similarity between the b-th sample and the r-th sample; the global similarity $s_g$ between the b-th training sample and the r-th sample is calculated from the fused global feature and the global feature of the answer text $A_r$; the question selector in the differential decoder selects one of the Transformer decoders based on the global feature of the question text $Q_b$, and the selected decoder fuses the fused global feature and fine-grained features and, guided by the special input character [DEC], generates the answer text, thus obtaining the predicted answer corresponding to each training sample; (4e) Based on the similarities and the predicted answer, respectively, calculate the loss $l_{itc}$ of the fine-grained feature alignment module and the loss $l_{lm}$ of the Transformer decoder selected by the question selector, and calculate from $l_{itc}$ and $l_{lm}$ the loss of the single-stage medical image difference question-answering model; then use the SGD optimization algorithm to update the learnable parameters $\omega_{t-1}$ of the model to obtain the model $G_t$ for this iteration; (4f) Determine whether t = T holds; if so, the trained question-answering model $G^*$ is obtained; otherwise, let t = t + 1 and return to step (4b).
5. The method according to claim 4, characterized in that the loss of the question-answering model described in step (4e) is calculated from the loss $l_{itc}$ of the fine-grained feature alignment module and the loss $l_{lm}$ of the selected Transformer decoder, using the cross-entropy loss function for binary classification and the cross-entropy loss function, with γ as the weighting coefficient between $l_{itc}$ and $l_{lm}$.
6. The method according to claim 3, characterized in that the update of the learnable parameters described in step (4e) is performed using the following formula: $\omega_t = \omega_{t-1} - \eta \frac{\partial l}{\partial \omega_{t-1}}$, where $\frac{\partial l}{\partial \omega_{t-1}}$ denotes the partial derivative of the loss with respect to the learnable parameters of the model and $\eta$ denotes the learning rate.