A radiology report generation method based on bidirectional cross-modal interaction

The method uses image encoding and knowledge aggregation modules to achieve bidirectional interaction between image features and knowledge features. It addresses the inconsistency of radiology reports generated by existing technologies, producing reports that are closer to real radiology reports.

CN122245596A · Pending · Publication Date: 2026-06-19 · QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES)
Filing Date
2026-05-22
Publication Date
2026-06-19

Abstract

This application discloses a radiology report generation method based on bidirectional cross-modal interaction, belonging to the technical field of radiology report generation. In this application, the radiology report generation network includes an image encoding module, a knowledge aggregation module, and a bidirectional cross-attention fusion module connected to both. The radiological image for which a report is to be generated is input into the image encoding module to obtain an image embedding sequence and an image global feature vector. The image global feature vector and a knowledge queue are the inputs of a two-stage knowledge retrieval and fact consistency reordering process, and the output of that process is input to the knowledge aggregation module. The bidirectional cross-attention fusion module performs bidirectional cross-modal interaction and collaborative optimization between image features and knowledge features on the image embedding sequence and the final knowledge embedding sequence output by the knowledge aggregation module, realizing knowledge enhancement of the image and image-to-knowledge reverse correction. The radiology report generated by the method described in this application is closer to a real clinical radiology report.

Description

Technical Field

[0001] This invention belongs to the technical field of radiology report generation methods, specifically relating to a radiology report generation method based on bidirectional cross-modal interaction.

Background Technology

[0002] Radiology reports are crucial for radiologists in making diagnostic and treatment decisions. Therefore, the task of generating radiology reports demands extremely high clinical consistency. Existing methods typically rely on image-text feature similarity for retrieval and integration into the report generation process, often lacking bidirectional cross-modal interaction between image and knowledge features. This results in significant discrepancies between the generated radiology reports and actual clinical reports. To address this, this invention proposes a radiology report generation method based on bidirectional cross-modal interaction.

Summary of the Invention

[0003] To address the above problems, this invention provides a radiology report generation method based on bidirectional cross-modal interaction.

[0004] To achieve the above objective, the present invention adopts the following technical solution. A radiology report generation method based on bidirectional cross-modal interaction includes the following steps: the radiological image for which a report is to be generated is input into the radiology report generation network model, and a predicted radiology report is obtained after one forward pass; the radiology report generation network model is obtained by training a radiology report generation network. The radiology report generation network includes an image encoding module and a knowledge aggregation module. Both the image encoding module and the knowledge aggregation module are connected to a bidirectional cross-attention fusion module, which is connected to the report generation module. The radiological image is input into the image encoding module to obtain the image embedding sequence E^(I) and the global feature vector f^(I) of the image; the report generation module outputs the predicted radiology report. The image embedding sequence E^(I) is input to the bidirectional cross-attention fusion module. The global feature vector f^(I) of the image, acting as the query vector, and the knowledge queue are the inputs of the two-stage knowledge retrieval and fact consistency reordering process; the result of that process is input into the knowledge aggregation module. The bidirectional cross-attention fusion module performs bidirectional cross-modal interaction and collaborative optimization between image features and knowledge features on the image embedding sequence E^(I) output by the image encoding module and the final knowledge embedding sequence H output by the knowledge aggregation module, thereby realizing knowledge enhancement of the image and image-to-knowledge reverse correction.

[0005] Preferably, the two-stage knowledge retrieval and fact consistency reordering process includes the following steps: 1) Calculate the cosine similarity between the global feature vector f^(I) of the image and each text feature vector Q_T in the knowledge queue to obtain the initial similarity score S1. Based on the initial similarity score S1, obtain a preliminary candidate knowledge set and a candidate disease label vector set. 2) Based on the initial similarity score S1 corresponding to each text feature vector in the preliminary candidate knowledge set, obtain the comprehensive score S corresponding to each text feature vector, and reorder the text feature vectors in the preliminary candidate knowledge set according to their comprehensive scores S. 3) Gate the candidate knowledge ranked in the top k by comprehensive score to obtain the top-k candidate knowledge text set.

[0006] Preferably, step 2) specifically includes the following steps: 2-1) Based on the global feature vector f^(I) of the image, obtain the predicted disease label vector query_concept of the current radiological image; 2-2) Calculate the fact consistency score concept_f1 between the predicted disease label vector query_concept and each candidate disease label vector in the candidate disease label vector set; 2-3) Linearly add the fact consistency score between the predicted disease label vector and each candidate disease label vector and the initial similarity score S1 of the corresponding text feature vector in the preliminary candidate knowledge set to obtain the comprehensive score S corresponding to each text feature vector in the preliminary candidate knowledge set; 2-4) Reorder the text feature vectors in the preliminary candidate knowledge set by fact consistency according to the comprehensive score S corresponding to each text feature vector.

[0007] Preferably, the fact consistency score concept_f1 is calculated as follows: first, the overlap between the predicted disease label vector query_concept and each candidate disease label vector in the candidate disease label vector set is calculated; then, the precision and recall are calculated based on the overlap; and finally, the harmonic mean is calculated based on the precision and recall, thus obtaining the fact consistency score between the predicted disease label vector and each candidate disease label vector.
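The three-step calculation above (overlap, then precision and recall, then harmonic mean) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name and array shapes are assumptions, the constant 1e-8 follows the value of b given in the embodiment, and the extra guard in the harmonic-mean denominator is an assumption added to avoid division by zero when there is no overlap.

```python
import numpy as np

EPS = 1e-8  # constant b from the embodiment, avoids a zero denominator


def concept_f1(query_concept, candidate_labels):
    """Fact consistency score of one predicted label vector against k1 candidates.

    query_concept:    (C,) binary vector predicted from the image
    candidate_labels: (k1, C) binary matrix, one row per candidate knowledge text
    returns:          (k1,) scores in [0, 1]
    """
    q = np.asarray(query_concept, dtype=bool)
    c = np.asarray(candidate_labels, dtype=bool)
    # Overlap: number of disease categories set in both vectors (logical AND)
    overlap = np.logical_and(q, c).sum(axis=1).astype(float)
    # As described in the embodiment, precision is normalized by the candidate's
    # label count and recall by the predicted vector's label count
    precision = overlap / (c.sum(axis=1) + EPS)
    recall = overlap / (q.sum() + EPS)
    # Harmonic mean; the EPS here is an assumed guard, not stated in the source
    return 2.0 * precision * recall / (precision + recall + EPS)
```

A candidate whose labels match the prediction exactly scores close to 1, and a candidate with no shared labels scores 0.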

[0008] Preferably, step 3) includes the following steps: First, the fact consistency score concept_f1 of the candidate knowledge ranked first is compared with the preset score threshold. When the fact consistency score concept_f1 of the first candidate knowledge is less than the preset score threshold, the comprehensive score of the first candidate knowledge is retained, and the comprehensive scores of the candidate knowledge ranked 2nd to kth are all replaced, through a masking operation, with a negative constant of very large magnitude. This negative constant serves as the masked comprehensive score of each candidate knowledge ranked 2nd to kth and forcibly blocks the allocation of attention weight to non-first candidate knowledge. Then, the comprehensive score of the first candidate knowledge and the masked comprehensive scores of the candidate knowledge ranked 2nd to kth are input into the Softmax activation function for normalization to obtain the normalized weights of the top k candidate knowledge items. In this application, the text feature vectors of the top k candidate knowledge items and their normalized weights constitute the top-k candidate knowledge text set, which is transmitted to the knowledge aggregation module. When the fact consistency score concept_f1 of the first candidate knowledge is greater than or equal to the preset score threshold, the comprehensive scores of the top k candidate knowledge items keep the values obtained in step 2). These comprehensive scores are input into the Softmax activation function for normalization to obtain the normalized weights of the top k candidate knowledge items. The text feature vectors of the top k candidate knowledge items and their normalized weights constitute the top-k candidate knowledge text set, which is then transmitted to the knowledge aggregation module.
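The gating rule in this step can be illustrated with a short NumPy sketch. The function name and defaults are assumptions made for illustration: the threshold of 0.5 follows the embodiment, while the mask value of -1e9 is an assumed stand-in for the large-magnitude negative constant.

```python
import numpy as np


def gate_top_k(scores, f1_scores, threshold=0.5, mask_value=-1e9):
    """Normalized weights for the top-k candidates after gating.

    scores:    (k,) comprehensive scores, already sorted in descending order
    f1_scores: (k,) fact consistency scores in the same order
    """
    s = np.asarray(scores, dtype=float).copy()
    if f1_scores[0] < threshold:
        # The best candidate is not fact-consistent enough: keep its score
        # and mask ranks 2..k so they receive (numerically) zero weight
        s[1:] = mask_value
    # Numerically stable Softmax
    s = s - s.max()
    w = np.exp(s)
    return w / w.sum()
```

With a low top-ranked fact consistency score the weights collapse onto the first candidate; otherwise they are an ordinary Softmax over the k comprehensive scores.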

[0009] Preferably, the bidirectional cross-attention fusion module includes an image-to-knowledge cross-attention unit and a knowledge-to-image cross-attention unit, both of which are connected to the fusion update and knowledge re-injection unit. Specifically, the feedforward neural network in the image-to-knowledge cross-attention unit is connected to two linear layers in the knowledge-to-image cross-attention unit, as well as to the global average pooling layer and a linear layer of the fusion update and knowledge re-injection unit. The feedforward neural network in the knowledge-to-image cross-attention unit is connected to the adaptive fusion module in the fusion update and knowledge re-injection unit. The image-to-knowledge cross-attention unit outputs the enhanced image embedding sequence E1^(I), and the knowledge-to-image cross-attention unit outputs the updated knowledge embedding sequence H1.

[0010] Preferably, the image-to-knowledge cross-attention unit comprises three linear layers, a residual connection network, a layer normalization layer, and a feedforward neural network. One of the linear layers projects the image embedding sequence E^(I) into the query matrix Query. The other two linear layers project the final knowledge embedding sequence H into a key matrix Key and a value matrix Value, respectively. The dot product of the query matrix Query and the key matrix Key is calculated and then normalized using the Softmax activation function to obtain the attention weight matrix. Then, the attention weight matrix is multiplied by the value matrix Value to obtain the cross-attention output A. Next, the residual connection network adds the cross-attention output A to the image embedding sequence E^(I) to obtain a preliminary fused cross-modal residual feature matrix. This matrix is normalized by the layer normalization layer and then input into the feedforward neural network, which processes the normalized output and outputs the enhanced image embedding sequence E1^(I).
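The data flow of this unit can be sketched in NumPy as a single-head attention block. This is a sketch under stated assumptions rather than the claimed implementation: the parameter dictionary `p` and its key names are hypothetical, the feedforward network is assumed to be two linear layers with a ReLU in between, and the dot product is passed to Softmax without a scaling factor because the text does not mention one.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def image_to_knowledge(E, H, p):
    """E: (N, d) image embedding sequence, H: (M, d) knowledge embedding sequence.

    p holds hypothetical weights: Wq, Wk, Wv of shape (d, d),
    W1 of shape (d, h) and W2 of shape (h, d) for the assumed feedforward network.
    """
    query = E @ p["Wq"]              # image tokens ask the questions
    key = H @ p["Wk"]                # knowledge tokens provide keys ...
    value = H @ p["Wv"]              # ... and values
    weights = softmax(query @ key.T)  # (N, M) attention weight matrix
    A = weights @ value              # cross-attention output A
    x = layer_norm(E + A)            # residual connection, then layer normalization
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]  # enhanced image embeddings, (N, d)
```

The output keeps the shape of the image embedding sequence, so it can be passed on to the knowledge-to-image unit and to the fusion update and knowledge re-injection unit.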

[0011] Preferably, the knowledge-to-image cross-attention unit comprises three linear layers, a residual connection network, a layer normalization layer, and a feedforward neural network. One linear layer projects the final knowledge embedding sequence H into a query matrix Query', and the other two linear layers project the enhanced image embedding sequence E1^(I) into a key matrix Key' and a value matrix Value', respectively. Then, the dot product of the query matrix Query' and the key matrix Key' is calculated to obtain the initial attention score matrix, which reflects the correlation between the knowledge and each region of the image. Next, a pathological region sparsification operation is performed on the initial attention score matrix to obtain the sparsified initial attention score matrix, which is normalized using the Softmax function to obtain the sparse attention weight matrix. The sparse attention weight matrix is multiplied by the value matrix Value' to obtain the cross-attention output B. The residual connection network then adds the cross-attention output B to the final knowledge embedding sequence H to obtain the preliminary fused cross-modal residual feature matrix. Finally, this matrix is normalized by the layer normalization layer and input into the feedforward neural network for feature mapping and nonlinear transformation, which outputs the updated knowledge embedding sequence H1.

[0012] Preferably, the fusion update and knowledge re-injection unit includes a global average pooling layer, a multilayer perceptron, and a Sigmoid activation layer connected in sequence, and also includes three linear layers, a residual connection network, a layer normalization layer, and a feedforward neural network. The enhanced image embedding sequence E1^(I) is fed into the global average pooling layer, which yields a global context vector ν_global. The multilayer perceptron performs deep semantic mapping on the global context vector ν_global, and the Sigmoid activation layer maps the output of the multilayer perceptron to the interval (0,1), giving the confidence gating weight α of the pathological image visual features contained in the enhanced image embedding sequence E1^(I), where α∈(0,1). Then, the confidence gating weight α, the updated knowledge embedding sequence H1, and the final knowledge embedding sequence H are adaptively and dynamically fused to obtain the fused knowledge representation H′. The enhanced image embedding sequence E1^(I) is input into one linear layer of the fusion update and knowledge re-injection unit, and the fused knowledge representation H′ is input into the other two linear layers: H′ is projected by these two linear layers into a new key matrix Key'' and a new value matrix Value'', respectively, and E1^(I) is projected by the remaining linear layer into a new query matrix Query''. Then, the dot product of the query matrix Query'' and the key matrix Key'' is calculated and normalized using the Softmax activation function to obtain the final cross-modal attention weight matrix, which is multiplied by the value matrix Value'' to obtain the visual enhancement features guided by the fused knowledge. Finally, the residual connection network adds these visual enhancement features to the enhanced image embedding sequence E1^(I) to obtain a cross-modal residual feature matrix guided by the fused knowledge. The layer normalization layer normalizes this matrix, and its output is fed into the feedforward neural network of the fusion update and knowledge re-injection unit for feature mapping and nonlinear transformation to obtain the final fused image embedding sequence E^(F). The final fused image embedding sequence E^(F) serves as the input of the report generation module.

[0013] Preferably, the confidence gating weight α, the updated knowledge embedding sequence H1, and the final knowledge embedding sequence H are adaptively and dynamically fused. Specifically, the adaptive dynamic fusion refers to weighting the updated knowledge embedding sequence H1 with the confidence gating weight α and weighting the final knowledge embedding sequence H with (1-α). Then, the two weighted results are added together to obtain the fused knowledge representation H′.
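As a minimal sketch of this gated fusion, the NumPy code below computes a gate from the enhanced image embeddings and then blends the two knowledge sequences. Several details are assumptions made for illustration: the gate is taken to be a single scalar, the multilayer perceptron is assumed to have one hidden ReLU layer, and the weight names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def confidence_gate(E1, W1, b1, W2, b2):
    """E1: (N, d) enhanced image embeddings; returns a scalar gate in (0, 1).

    W1: (d, h), b1: (h,), W2: (h,), b2: scalar -- hypothetical MLP parameters.
    """
    v_global = E1.mean(axis=0)                    # global average pooling over image tokens
    hidden = np.maximum(v_global @ W1 + b1, 0.0)  # assumed one-hidden-layer MLP
    return float(sigmoid(hidden @ W2 + b2))       # Sigmoid maps the output into (0, 1)


def fuse_knowledge(alpha, H1, H):
    """Fused knowledge representation: alpha * H1 + (1 - alpha) * H."""
    return alpha * H1 + (1.0 - alpha) * H
```

When α is close to 1 the image-corrected knowledge H1 dominates; when α is close to 0 the fused representation falls back to the retrieved knowledge H.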

[0014] Preferably, the pathological region sparsification operation is performed on any row vector in the initial attention score matrix, including the following steps: first, all elements in the row vector are extracted as the relevance score set of the knowledge features corresponding to these elements in all image regions; statistical calculation is performed on the relevance score set to obtain the statistical mean and standard deviation of the knowledge features; then, the dynamic activation threshold τ of the knowledge features is obtained based on the statistical mean and standard deviation; and then, elements in the row vector that are less than the dynamic activation threshold τ are replaced with negative infinity.
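The row-wise sparsification can be sketched as follows. The text only states that the threshold τ is derived from the mean and standard deviation of each row, so the specific form τ = mean + β · std, the parameter `beta`, and the rule that always keeps the row maximum (so that Softmax never sees a fully masked row) are all assumptions made for this illustration.

```python
import numpy as np


def sparsify_rows(scores, beta=0.5):
    """scores: (M, N) initial attention scores, one row per knowledge feature."""
    scores = np.asarray(scores, dtype=float)
    mu = scores.mean(axis=1, keepdims=True)
    sigma = scores.std(axis=1, keepdims=True)
    tau = mu + beta * sigma  # ASSUMED form of the dynamic activation threshold
    # Assumed safeguard: the largest score in each row is always kept
    keep = (scores >= tau) | (scores == scores.max(axis=1, keepdims=True))
    return np.where(keep, scores, -np.inf)


def sparse_attention_weights(scores, beta=0.5):
    """Softmax over the sparsified scores; masked regions get exactly zero weight."""
    s = sparsify_rows(scores, beta)
    s = s - s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)
```

Because masked entries are set to negative infinity before Softmax, each knowledge feature attends only to the image regions whose scores clear its own threshold.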

[0015] Compared with the prior art, the beneficial technical effects of this application are as follows: In this application, the setting of the image-to-knowledge cross-attention unit enables the application to use the image embedding sequence as the query subject and the final knowledge embedding sequence as the knowledge semantics provider, thereby enhancing the visual features of the image with external medical knowledge (including knowledge text and disease labels in the knowledge queue). The setting of the knowledge-to-image cross-attention unit also enables the application to use image visual information containing pathological details to reverse-retrieve and correct knowledge features. Furthermore, the setting of the fusion update and knowledge re-injection unit allows the application to adaptively and dynamically fuse the updated knowledge embedding sequence H1 and the final knowledge embedding sequence H based on the saliency of the pathological image visual features, and re-inject the fused knowledge representation into the pathological image visual features, achieving bidirectional cross-modal interaction and collaborative optimization between knowledge features and image features. Tests show that the text generated by the radiology report generation method described in this application (i.e., the predicted radiology report) is closer to the radiology report sample text in terms of sentence structure, semantic coherence, and consistency; the radiology report generated by the method described in this application has a higher sentence similarity to the reference text (i.e., the radiology report sample).

Attached Figure Description

[0016] Figure 1 is a schematic diagram of the network structure of the radiology report generation network in this application; Figure 2 is a flowchart of the two-stage knowledge retrieval and fact consistency reordering process in Figure 1; Figure 3 is a schematic diagram of the network structure of the bidirectional cross-attention fusion module in Figure 1.

Detailed Implementation

[0017] To more clearly illustrate the purpose, technical solution, and advantages of the present invention, further explanation is provided below in conjunction with the accompanying drawings and embodiments.

[0018] Example 1: A radiology report generation method based on bidirectional cross-modal interaction includes the following steps: S1. Construct a radiology report generation network. The structure of the radiology report generation network is shown in Figure 1. It includes an image encoding module, a knowledge aggregation module, a bidirectional cross-attention fusion module, and a report generation module; the outputs of the image encoding module and the knowledge aggregation module are connected to the input of the bidirectional cross-attention fusion module, and the output of the bidirectional cross-attention fusion module is connected to the input of the report generation module; the report generation module outputs the predicted radiology report. In this application, the image encoding module encodes the radiological image for which a report is to be generated, extracts the visual semantic features of the image, and outputs the image embedding sequence E^(I) and the global feature vector f^(I) of the image. The image encoding module used in this application has the same architecture and function as the ViT model disclosed in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". In this application, the image embedding sequence E^(I) serves as the image input of the bidirectional cross-attention fusion module; the global feature vector f^(I) of the image serves as the query vector input of the two-stage knowledge retrieval and fact consistency reordering process; the result obtained from that process is input to the knowledge aggregation module.
In this application, the knowledge queue includes a knowledge text set, a text feature vector set, and a disease label vector set. The knowledge text set is composed of knowledge texts and the text feature vector set is composed of text feature vectors; each text feature vector in the text feature vector set corresponds to a knowledge text in the knowledge text set, and each disease label vector in the disease label vector set also corresponds to a knowledge text in the knowledge text set. In this application, the disease label vectors represent the disease labels or medical concepts involved in the knowledge text. The two-stage knowledge retrieval and fact consistency reordering process is shown in Figure 2 and includes the following steps: 1) Calculate the cosine similarity between the global feature vector f^(I) of the image and each text feature vector Q_T in the knowledge queue to obtain an initial similarity score S1, and obtain a preliminary candidate knowledge set and a candidate disease label vector set based on S1. Specifically: all text feature vectors are sorted in descending order of their initial similarity scores S1, and the top k1 text feature vectors with the highest initial similarity scores S1 are selected. These top k1 text feature vectors and their corresponding cosine similarity scores constitute the preliminary candidate knowledge set, which is the knowledge most relevant in visual semantics to the global feature vector of the current radiological image. Then, the disease label vectors corresponding to the top k1 text feature vectors are obtained from the knowledge queue. These disease label vectors are used as candidate disease label vectors, and all candidate disease label vectors constitute the candidate disease label vector set.
2) Based on the initial similarity score S1 corresponding to each text feature vector in the preliminary candidate knowledge set, obtain the comprehensive score S corresponding to each text feature vector, and reorder the text feature vectors in the preliminary candidate knowledge set according to their comprehensive scores S. This includes the following steps: 2-1) Based on the global feature vector f^(I) of the image, obtain the predicted disease label vector query_concept of the current radiological image. Specifically: first, the first linear layer applies a matrix multiplication that maps the global feature vector f^(I) of the image to a high-dimensional hidden space. Then, the ReLU nonlinear activation function is applied to the output of the first linear layer to obtain the nonlinearly activated feature vector. The second linear layer maps the nonlinearly activated feature vector to a space whose dimension matches that of the disease label vector. Then, the Sigmoid activation function maps the output of the second linear layer to the interval (0,1) to obtain a probability vector giving, for each disease category, the probability that the disease is present in the current image. Finally, the probability vector is binarized to obtain the predicted disease label vector query_concept. Binarization means setting the elements of the probability vector that are greater than or equal to the threshold to 1 and setting the elements that are less than the threshold to 0 (the threshold is set to 0.5 in this embodiment).
2-2) Calculate the fact consistency score concept_f1 between the predicted disease label vector query_concept and each candidate disease label vector in the candidate disease label vector set. Specifically: first, the overlap between the predicted disease label vector query_concept and each candidate disease label vector in the candidate disease label vector set is calculated; then, the precision and recall are calculated based on the overlap; finally, the harmonic mean of the precision and recall is calculated, which gives the fact consistency score between the predicted disease label vector and each candidate disease label vector. Taking the j-th candidate disease label vector c_j in the candidate disease label vector set as an example, the overlap overlap_j between the predicted disease label vector query_concept and c_j is calculated as shown in equation (1):

$$\mathrm{overlap}_j = \sum_{i=1}^{C} \left( q_i \wedge c_{ji} \right) \tag{1}$$

In equation (1), C represents the total number of disease categories, q_i represents the i-th element of the predicted disease label vector query_concept, c_ji represents the i-th element of the j-th candidate disease label vector c_j, and ∧ represents the logical AND operation. Then, Precision_j and Recall_j are calculated from overlap_j as shown in equations (2) and (3), respectively:

$$\mathrm{Precision}_j = \frac{\mathrm{overlap}_j}{\sum_{i=1}^{C} c_{ji} + b} \tag{2}$$

In equation (2), Precision_j represents the precision between the predicted disease label vector query_concept and the j-th candidate disease label vector in the candidate disease label vector set; overlap_j represents the overlap between query_concept and the j-th candidate disease label vector; C represents the total number of disease categories; c_ji represents the i-th element of the j-th candidate disease label vector c_j; b is a constant set to avoid a zero denominator, and its value is 1×10^-8.

$$\mathrm{Recall}_j = \frac{\mathrm{overlap}_j}{\sum_{i=1}^{C} q_i + b} \tag{3}$$

In equation (3), Recall_j represents the recall between the predicted disease label vector query_concept and the j-th candidate disease label vector in the candidate disease label vector set; overlap_j represents the overlap between query_concept and the j-th candidate disease label vector; C represents the total number of disease categories; q_i represents the i-th element of the predicted disease label vector query_concept; b is a constant set to avoid a zero denominator, and its value is 1×10^-8. Then, the harmonic mean of Precision_j and Recall_j is calculated as shown in equation (4); this harmonic mean is the fact consistency score concept_f1_j between the predicted disease label vector and the j-th candidate disease label vector:

$$\mathrm{concept\_f1}_j = \frac{2 \cdot \mathrm{Precision}_j \cdot \mathrm{Recall}_j}{\mathrm{Precision}_j + \mathrm{Recall}_j} \tag{4}$$

In equation (4), Precision_j represents the precision and Recall_j represents the recall between the predicted disease label vector query_concept and the j-th candidate disease label vector in the candidate disease label vector set.
2-3) Linearly add the fact consistency score between the predicted disease label vector and each candidate disease label vector and the initial similarity score S1 of the corresponding text feature vector in the preliminary candidate knowledge set to obtain the comprehensive score S corresponding to each text feature vector in the preliminary candidate knowledge set. Taking the comprehensive score S_j of the j-th text feature vector in the preliminary candidate knowledge set (which corresponds to the j-th candidate disease label vector) as an example, S_j is calculated as shown in equation (5). In equation (5), λ represents the balance coefficient, with a value in the range (0,1); S1_j represents the initial similarity score corresponding to the j-th text feature vector; and concept_f1_j represents the fact consistency score between the predicted disease label vector and the j-th candidate disease label vector.
2-4) Reorder the text feature vectors in the preliminary candidate knowledge set by fact consistency according to the comprehensive score S corresponding to each text feature vector. Specifically, arrange the text feature vectors in the preliminary candidate knowledge set in descending order of comprehensive score, select the text feature vectors with the top k comprehensive scores, and use them as the top k candidate knowledge items.
3) Gate the top k candidate knowledge items by comprehensive score to obtain the top-k candidate knowledge text set. This includes the following steps: the fact consistency score concept_f1 of the candidate knowledge ranked first (referred to as the first candidate knowledge) is compared with the preset score threshold (set to 0.5 in this embodiment). When the fact consistency score concept_f1 of the first candidate knowledge is less than the preset score threshold, the comprehensive score of the first candidate knowledge is retained, and the comprehensive scores of the candidate knowledge ranked 2nd to kth are replaced, through a masking operation, with a negative constant of very large magnitude. This negative constant serves as the masked comprehensive score of each candidate knowledge ranked 2nd to kth (in this embodiment, the constant is -1×10^9) and forcibly blocks the allocation of attention weight to non-first candidate knowledge. Then, the comprehensive score of the first candidate knowledge and the masked comprehensive scores of the candidate knowledge ranked 2nd to kth are input into the Softmax activation function for normalization, giving the normalized weights of the top k candidate knowledge items. This setting makes the normalized weight of the first candidate knowledge approach 1 and the normalized weights of the other candidate knowledge items approach 0, effectively suppressing the attention contribution of non-first candidate knowledge. In this application, the text feature vectors of the top k candidate knowledge items and their normalized weights constitute the top-k candidate knowledge text set, which is then transmitted to the knowledge aggregation module. When the fact consistency score concept_f1 of the first candidate knowledge is greater than or equal to the preset score threshold, the comprehensive scores of the top k candidate knowledge items keep the values obtained in step 2). These comprehensive scores are input into the Softmax activation function for normalization to obtain the normalized weights of the top k candidate knowledge items. The text feature vectors of the top k candidate knowledge items and their normalized weights constitute the top-k candidate knowledge text set, which is then transmitted to the knowledge aggregation module.
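Steps 1), 2-1), and 2-3) to 2-4) of the process described above can be sketched in NumPy as three small functions. This is an illustrative sketch only: the function and parameter names are hypothetical, and the weighted form `lam * s1 + (1 - lam) * f1` used for the comprehensive score is an assumption, since the text only states that the two scores are linearly added with a balance coefficient λ in (0,1).

```python
import numpy as np


def cosine_top_k1(f, Q, k1):
    """Stage 1: indices and scores of the k1 knowledge entries most similar to f.

    f: (d,) global image feature vector, Q: (n, d) text feature vectors.
    """
    f_unit = f / (np.linalg.norm(f) + 1e-12)
    Q_unit = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12)
    s1 = Q_unit @ f_unit          # cosine similarity with every knowledge entry
    idx = np.argsort(-s1)[:k1]    # descending order, keep the top k1
    return idx, s1[idx]


def predict_labels(f, W1, b1, W2, b2, threshold=0.5):
    """Step 2-1: linear -> ReLU -> linear -> Sigmoid -> binarize at the threshold."""
    hidden = np.maximum(f @ W1 + b1, 0.0)
    probs = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
    return (probs >= threshold).astype(int)


def rerank(s1, f1, lam, k):
    """Steps 2-3 and 2-4: combine the scores and keep the top k candidates."""
    s = lam * np.asarray(s1) + (1.0 - lam) * np.asarray(f1)  # ASSUMED weighting
    order = np.argsort(-s)[:k]
    return order, s[order]
```

In use, the indices returned by `cosine_top_k1` select the candidate disease label vectors from the knowledge queue, the fact consistency scores are computed between those vectors and the output of `predict_labels`, and `rerank` produces the ordered top-k candidates that are passed to the gating step.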

[0019] In this application, the knowledge aggregation module fuses the candidate knowledge in the top-k candidate knowledge text set with the corresponding normalized weights, converting the candidate knowledge into a continuous tensor representation suitable for cross-modal fusion, to obtain the final knowledge embedding sequence H. The final knowledge embedding sequence H and the image embedding sequence E^(I) are input to the bidirectional cross-attention fusion module. The knowledge aggregation module used in this application has the same structure and function as the Memory Responding network disclosed in the paper "Cross-modal Memory Networks for Radiology Report Generation".
In this application, the bidirectional cross-attention fusion module performs bidirectional cross-modal interaction and collaborative optimization between image features and knowledge features on the image embedding sequence E^(I) output by the image encoding module and the final knowledge embedding sequence H output by the knowledge aggregation module. This enables the knowledge to enhance the image and the image to correct the knowledge in reverse, thereby improving the consistency between image features and knowledge features. The module outputs the final fused image embedding sequence E^(F), which is input to the report generation module.
The network structure of the bidirectional cross-attention fusion module is shown in Figure 3. The module includes an image-to-knowledge cross-attention unit and a knowledge-to-image cross-attention unit, both of which are connected to the fusion update and knowledge re-injection unit. Specifically, the output of the feedforward neural network in the image-to-knowledge cross-attention unit is connected to the inputs of the two linear layers in the knowledge-to-image cross-attention unit, and to the global average pooling layer and one linear layer in the fusion update and knowledge re-injection unit. The output of the feedforward neural network in the knowledge-to-image cross-attention unit is connected to the input of the adaptive fusion module in the fusion update and knowledge re-injection unit.
The image-to-knowledge cross-attention unit receives the image embedding sequence E^(I) and the final knowledge embedding sequence H as input. It consists of three linear layers, a residual connection network, a layer normalization layer, and a feedforward neural network. One linear layer projects the image embedding sequence E^(I) into the query matrix Query, and the other two linear layers project the final knowledge embedding sequence H into a key matrix Key and a value matrix Value, respectively. The dot product of Query and Key is calculated and normalized with the Softmax activation function to obtain the attention weight matrix, which is multiplied by Value to obtain the cross-attention output A. The residual connection network then adds the cross-attention output A to the image embedding sequence E^(I) to obtain a preliminary fused cross-modal residual feature matrix. This matrix is normalized by the layer normalization layer and input into the feedforward neural network, which outputs the enhanced image embedding sequence E_1^(I).
Within the feedforward neural network, one linear layer first performs a first linear mapping on the output of the layer normalization layer; the nonlinear activation function ReLU then performs nonlinear feature enhancement on the first linear mapping result; finally, another linear layer performs a second linear mapping on the features after nonlinear activation.
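The data flow of the image-to-knowledge cross-attention unit can be sketched with NumPy as follows. This is a minimal single-head illustration under stated assumptions: the weights are random placeholders, biases and learnable layer-normalization parameters are omitted, the dimensions are arbitrary, and the dot product is left unscaled because the text describes a plain dot product (many implementations divide by the square root of the feature dimension). The names `init_params` and `cross_attention_unit` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Layer normalization without learnable scale/shift (simplification).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_params(d_model, d_ff, rng):
    # Random placeholder weights; a trained network would learn these.
    scale = 0.1
    return {
        "Wq": rng.normal(size=(d_model, d_model)) * scale,
        "Wk": rng.normal(size=(d_model, d_model)) * scale,
        "Wv": rng.normal(size=(d_model, d_model)) * scale,
        "W1": rng.normal(size=(d_model, d_ff)) * scale,
        "W2": rng.normal(size=(d_ff, d_model)) * scale,
    }

def cross_attention_unit(query_seq, context_seq, params):
    """query_seq: (n_q, d) sequence that asks; context_seq: (n_c, d)."""
    query = query_seq @ params["Wq"]
    key = context_seq @ params["Wk"]
    value = context_seq @ params["Wv"]
    weights = softmax(query @ key.T)           # (n_q, n_c), rows sum to 1
    attended = weights @ value                 # cross-attention output A
    normed = layer_norm(query_seq + attended)  # residual + layer norm
    hidden = np.maximum(normed @ params["W1"], 0.0)  # linear + ReLU
    return hidden @ params["W2"], weights      # second linear mapping

rng = np.random.default_rng(0)
params = init_params(d_model=8, d_ff=16, rng=rng)
image_seq = rng.normal(size=(5, 8))      # stands in for E^(I): 5 regions
knowledge_seq = rng.normal(size=(3, 8))  # stands in for H: 3 knowledge items
enhanced, attn = cross_attention_unit(image_seq, knowledge_seq, params)
print(enhanced.shape, attn.shape)
```

The output keeps the shape of the image sequence, which is what allows the residual connection and the later reuse of the enhanced sequence as keys and values.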

[0020] With the image-to-knowledge cross-attention unit, this application uses the image embedding sequence E^(I) as the query subject and the final knowledge embedding sequence H as the knowledge semantics, so that external medical knowledge (the knowledge texts and disease labels in the knowledge queue) enhances the visual features of the image.

[0021] In this application, the knowledge-to-image cross-attention unit receives the final knowledge embedding sequence H and the enhanced image embedding sequence E_1^(I). The unit comprises three linear layers, a residual connection network, a layer normalization layer, and a feedforward neural network. One linear layer projects the final knowledge embedding sequence H into a query matrix Query', and the other two linear layers project the enhanced image embedding sequence E_1^(I) into a key matrix Key' and a value matrix Value', respectively. The dot product of Query' and Key' is then calculated to obtain an initial attention score matrix reflecting the correlation between each knowledge feature and each image region. The initial attention score matrix is two-dimensional: each row vector corresponds to a knowledge feature in the final knowledge embedding sequence, and each column vector corresponds to an image region feature in the enhanced image embedding sequence. A pathological region sparsification operation is then performed on the initial attention score matrix to obtain the sparsified initial attention score matrix. This operation suppresses the attention response of image regions that have low correlation with the current knowledge feature, thereby realizing accurate reverse correction of the knowledge features by the image and focusing the attention on pathological regions.
Pathological region sparsification is applied to every row vector of the initial attention score matrix. Taking any row vector as an example, the operation includes the following steps: First, all elements of the row vector are extracted as the set of relevance scores of the corresponding knowledge feature over all image regions. The mean and standard deviation of this relevance score set are then computed. Next, the dynamic activation threshold τ of the knowledge feature is obtained from the mean and standard deviation, as shown in equation (6). Finally, the elements of the row vector that are less than the dynamic activation threshold τ of that row's knowledge feature are replaced with negative infinity.
τ = μ + β·σ (6)
In equation (6), β represents the preset control coefficient, whose value range in this embodiment is [0.5, 2]; τ represents the dynamic activation threshold of the knowledge feature; μ represents the mean of the relevance scores of the knowledge feature; and σ represents the standard deviation of the relevance scores of the knowledge feature.
The sparsified initial attention score matrix is then normalized with the Softmax function to obtain a sparse attention weight matrix. The sparse attention weight matrix is multiplied by the value matrix Value' to obtain the cross-attention output B. The residual connection network then adds the cross-attention output B to the final knowledge embedding sequence H to obtain a preliminary fused cross-modal residual feature matrix.
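The row-wise sparsification of equation (6) followed by Softmax normalization can be sketched as follows. This is an illustrative NumPy sketch, not the applicant's implementation. Two points are assumptions: the standard deviation is taken as the population standard deviation, and, because a large β can push τ above every element of a row (which would leave Softmax undefined), the sketch always keeps the row maximum. The text does not address that corner case.

```python
import numpy as np

def sparsify_rows(scores, beta=1.0):
    """Mask entries below tau = mu + beta * sigma, computed per row.

    scores: (n_knowledge, n_regions) initial attention score matrix.
    """
    scores = np.asarray(scores, dtype=np.float64)
    mu = scores.mean(axis=1, keepdims=True)
    sigma = scores.std(axis=1, keepdims=True)  # population std (assumption)
    tau = mu + beta * sigma
    keep = scores >= tau
    # Assumption: always keep the row maximum so no row is fully masked.
    keep |= scores == scores.max(axis=1, keepdims=True)
    return np.where(keep, scores, -np.inf)

def sparse_attention_weights(scores, beta=1.0):
    """Softmax over each sparsified row; masked entries get weight 0."""
    masked = sparsify_rows(scores, beta)
    shifted = masked - masked.max(axis=1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) evaluates to 0.0
    return exp / exp.sum(axis=1, keepdims=True)

# One knowledge feature scored against four image regions (toy values).
weights = sparse_attention_weights([[0.1, 0.2, 0.3, 5.0]], beta=1.0)
print(weights)
```

In the toy row only the last region exceeds the threshold, so all of the attention weight concentrates on it.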
The preliminary fused cross-modal residual feature matrix obtained from the residual connection with H is then normalized by the layer normalization layer and input into the feedforward neural network, which processes the layer-normalized output and outputs the updated knowledge embedding sequence H1. Specifically, one linear layer in the feedforward neural network first performs a first linear mapping on the output of the layer normalization layer; the nonlinear activation function ReLU then performs nonlinear feature enhancement on the first linear mapping result; finally, another linear layer performs a second linear mapping on the features after nonlinear activation. The knowledge-to-image cross-attention unit thus enables this application to use image visual information containing pathological details to retrieve and correct the knowledge features in reverse.
In this application, the fusion update and knowledge re-injection unit receives the updated knowledge embedding sequence H1, the final knowledge embedding sequence H, and the enhanced image embedding sequence E_1^(I). The unit includes a global average pooling layer, a multilayer perceptron, and a Sigmoid activation layer connected in sequence, and further includes three linear layers, a residual connection network, a layer normalization layer, and a feedforward neural network. The multilayer perceptron consists of two linear transformation layers with a nonlinear activation layer (ReLU) between them. The residual connection networks used in this application are identical to the Residual Connection architecture disclosed in "Deep residual learning for image recognition", and the feedforward neural networks used in this application are identical to the Position-wise Feed-Forward Networks architecture disclosed in "Attention Is All You Need".
The enhanced image embedding sequence E_1^(I) is input into the global average pooling layer to obtain a global context vector ν_global. The multilayer perceptron performs deep semantic mapping on the global context vector ν_global, and the Sigmoid activation layer maps the output of the multilayer perceptron to the interval (0,1), yielding the confidence gating weight α of the pathological visual features contained in the enhanced image embedding sequence E_1^(I), where α ∈ (0,1). When the pathological visual features contained in E_1^(I) are salient, the pathology-related semantic information in ν_global is strong, and the confidence gating weight α obtained after the multilayer perceptron mapping is large. When the pathological visual features contained in E_1^(I) are blurred, the pathology-related semantic information in ν_global is weak, and the resulting confidence gating weight α is small. The confidence gating weight α, the updated knowledge embedding sequence H1, and the final knowledge embedding sequence H are then adaptively and dynamically fused, realizing a fusion update of the knowledge embedding sequence based on the confidence gating weight and yielding the fused knowledge representation H′. Specifically, the confidence gating weight α weights the updated knowledge embedding sequence H1, (1-α) weights the final knowledge embedding sequence H, and the two weighted results are added to obtain the fused knowledge representation H′, as shown in equation (7).
H′ = α·H1 + (1-α)·H (7)
In equation (7), α represents the confidence gating weight of the pathological visual features contained in the enhanced image embedding sequence E_1^(I), H1 represents the updated knowledge embedding sequence, and H represents the final knowledge embedding sequence.
The enhanced image embedding sequence E_1^(I) is input into one linear layer of the fusion update and knowledge re-injection unit, and the fused knowledge representation H′ is input into the other two linear layers of the unit. These two linear layers project the fused knowledge representation H′ into a new key matrix Key'' and a new value matrix Value'', respectively, and the first linear layer projects the enhanced image embedding sequence E_1^(I) into the query matrix Query''. The dot product of Query'' and Key'' is then calculated and normalized with the Softmax activation function to obtain the final cross-modal attention weight matrix, which is multiplied by the value matrix Value'' to obtain the visual enhancement features guided by fused knowledge. This realizes the re-injection of the fused knowledge representation H′ into the enhanced image embedding sequence E_1^(I). The residual connection network then adds the visual enhancement features guided by fused knowledge to the enhanced image embedding sequence E_1^(I) to obtain a cross-modal residual feature matrix guided by fused knowledge. The layer normalization layer normalizes this matrix, and its output is fed into the feedforward neural network of the fusion update and knowledge re-injection unit for feature mapping and nonlinear transformation, yielding the final fused image embedding sequence E^(F), which serves as the input of the report generation module.
Within this feedforward neural network, one linear layer first performs a first linear mapping on the output of the layer normalization layer, projecting it into a high-dimensional feature space; the nonlinear activation function ReLU then performs nonlinear feature enhancement on the first linear mapping result, introducing higher-order nonlinear expressive power; finally, another linear layer performs a second linear mapping on the features after nonlinear activation, completing the enhancement of higher-order semantic expression. The fusion update and knowledge re-injection unit thus enables this application to fuse the updated knowledge embedding sequence H1 and the final knowledge embedding sequence H adaptively and dynamically according to the saliency of the pathological visual features, and to re-inject the fused knowledge representation into the visual features of the image, thereby realizing bidirectional cross-modal interaction and collaborative optimization between knowledge features and image features.
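The confidence-gated fusion of equation (7) can be sketched as below. This is a minimal illustration under assumptions: the multilayer perceptron is taken to output a single scalar so that α is one gate shared by the whole sequence, and all weights and dimensions are random placeholders. The function names are hypothetical.

```python
import numpy as np

def confidence_gate(enhanced_image_seq, w1, b1, w2, b2):
    """Global average pooling -> MLP (linear, ReLU, linear) -> Sigmoid."""
    v_global = enhanced_image_seq.mean(axis=0)    # global context vector
    hidden = np.maximum(v_global @ w1 + b1, 0.0)  # first linear + ReLU
    logit = hidden @ w2 + b2                      # second linear, shape (1,)
    return float(1.0 / (1.0 + np.exp(-logit[0]))) # alpha in (0, 1)

def fuse_knowledge(alpha, h_updated, h_final):
    """Equation (7): H' = alpha * H1 + (1 - alpha) * H."""
    return alpha * h_updated + (1.0 - alpha) * h_final

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 4
enhanced = rng.normal(size=(5, d_model))  # stands in for E_1^(I)
w1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, 1)) * 0.1
b2 = np.zeros(1)

alpha = confidence_gate(enhanced, w1, b1, w2, b2)
h_updated = rng.normal(size=(3, d_model))  # stands in for H1
h_final = rng.normal(size=(3, d_model))    # stands in for H
h_fused = fuse_knowledge(alpha, h_updated, h_final)
print(alpha, h_fused.shape)
```

A larger α shifts the fused representation toward the image-corrected knowledge H1, while a smaller α keeps it closer to the retrieved knowledge H.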

[0022] In this application, the report generation module generates the predicted radiology report text R from the final fused image embedding sequence E^(F). The report generation module has the same network structure and function as the Image-grounded text decoder disclosed in the paper "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation".
S2. Train the radiology report generation network on the training set to obtain the radiology report generation network model, specifically including the following steps:
S2-1. Obtain the knowledge queue from the radiology report samples corresponding to the paired images in the training set of the IU-Xray dataset, as follows:
(1) Extract the radiology report samples corresponding to the paired images from the training set of the publicly available IU-Xray dataset. After data cleaning and denoising, extract from each radiology report sample the paragraphs describing the imaging findings and the paragraphs giving the final diagnostic conclusion to form a knowledge text. All knowledge texts together form the knowledge text set.
(2) Segment each knowledge text in the knowledge text set into a token sequence using the tokenizer, and input the token sequence into the pre-trained language model SciBERT to obtain a text feature vector. All text feature vectors form the text feature vector set. The pre-trained language model SciBERT used in this application has the same network structure and function as the model disclosed in "SciBERT: A Pretrained Language Model for Scientific Text".
(3) Use the CheXbert radiology report labeler to identify, for each knowledge text in the knowledge text set, 14 common disease categories (atelectasis, cardiomegaly, consolidation, pulmonary edema, pleural effusion, pneumonia, pneumothorax, enlarged cardiomediastinum, lung lesions, lung opacity, other pleural diseases, fracture, support devices, no findings). The presence status of each disease category is converted into a discrete numerical state: the state in which the disease category is determined to be present is mapped to 1, and all other states are mapped to 0. This generates a 14-dimensional binary vector for each knowledge text, which is the disease label vector of that knowledge text. All disease label vectors together constitute the disease label vector set.
The knowledge text set, the text feature vector set, and the disease label vector set together constitute the knowledge queue.
S2-2. Input the paired images from the training set of the IU-Xray dataset and the knowledge queue obtained in S2-1 into the radiology report generation network. Perform forward propagation to calculate the total loss L of the radiology report generation network, and perform backpropagation guided by the total loss L to update the network weight parameters. This completes one epoch of training. After 50 epochs, the training of the radiology report generation network is complete and the radiology report generation network model is obtained. The IU-Xray dataset is an existing dataset.
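The mapping from labeler states to a 14-dimensional binary disease label vector in step (3) can be sketched as follows. The state strings ("positive", "negative", "uncertain", or a missing entry) and the category spellings are assumptions for illustration, following the CheXpert label set; the labeler call itself is not shown.

```python
# Assumed category order; names follow the CheXpert label set.
CATEGORIES = (
    "atelectasis", "cardiomegaly", "consolidation", "edema",
    "pleural effusion", "pneumonia", "pneumothorax",
    "enlarged cardiomediastinum", "lung lesion", "lung opacity",
    "pleural other", "fracture", "support devices", "no finding",
)

def disease_label_vector(states):
    """Map per-category labeler states to a 14-dimensional 0/1 vector.

    states: dict from category name to a state string. Only "positive"
    (category determined to be present) maps to 1; every other state,
    including a missing entry, maps to 0.
    """
    return [1 if states.get(name) == "positive" else 0 for name in CATEGORIES]

# Hypothetical labeler output for one knowledge text.
example_states = {
    "cardiomegaly": "positive",
    "edema": "uncertain",
    "pneumonia": "negative",
}
vector = disease_label_vector(example_states)
print(vector)
```

Collapsing "uncertain" and "negative" to 0 matches the rule stated in the text that only a confirmed presence is mapped to 1.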

[0023] In this application, the radiology report generation network is trained with the AdamW optimizer with a weight decay coefficient of 0.02. The initial learning rate of the image encoding module is set to e⁻⁵, and the initial learning rate of all other trainable parts of the radiology report generation network is set to e⁻⁴. During training, a linear warm-up strategy is applied first, with the initial warm-up learning rate set to e⁻⁵; once the number of training steps reaches the preset warm-up step threshold (1000 in this embodiment), cosine annealing learning rate scheduling is adopted. In addition, gradient clipping is applied to all trainable parameters during training: the gradient norm produced at each parameter update is monitored and limited to a preset threshold of 0.1 to avoid gradient explosion. In the inference and testing phase, this application adopts a beam search strategy with a beam size of 3. When generating the radiology report text at inference, the maximum length is set to 100 and the minimum length is set to 5.
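The schedule described above, linear warm-up followed by cosine annealing, can be sketched as a pure function of the step index. The numeric learning rates, the total number of steps, and the final learning rate `min_lr` below are placeholder values chosen for illustration; the sketch does not reproduce the exact scheduler used by the applicant.

```python
import math

def learning_rate(step, base_lr, warmup_start_lr, warmup_steps,
                  total_steps, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Linear interpolation from warmup_start_lr to base_lr.
        fraction = step / warmup_steps
        return warmup_start_lr + fraction * (base_lr - warmup_start_lr)
    # Cosine annealing over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / remaining, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

# Placeholder configuration (assumed values, for illustration only).
config = dict(base_lr=1e-4, warmup_start_lr=1e-5,
              warmup_steps=1000, total_steps=10000)
for step in (0, 500, 1000, 5500, 10000):
    print(step, learning_rate(step, **config))
```

The learning rate rises linearly during the first 1000 steps, reaches its peak at the warm-up threshold, and then decays smoothly along a half cosine.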

[0024] In this application, the total loss L of the radiology report generation network includes the image-text contrastive loss L_ITC and the language modeling generation loss L_RG. The relationship between the total loss L, the image-text contrastive loss L_ITC, and the language modeling generation loss L_RG is shown in equation (8):
L = λ1·L_ITC + λ2·L_RG (8)
In equation (8), L represents the total loss of the radiology report generation network; L_ITC represents the image-text contrastive loss; L_RG represents the language modeling generation loss; and λ1 and λ2 represent the weight coefficients of the image-text contrastive loss and the language modeling generation loss, respectively. In this first embodiment, λ1 and λ2 take the values 0.2 and 1.0, respectively.

[0025] In this application, the image-text contrastive loss L_ITC is calculated in the same way as the Image-Text Contrastive Loss disclosed in the paper "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", and the language modeling generation loss L_RG is calculated in the same way as the Language Modeling Loss disclosed in the same paper.

[0026] S3. Input the radiological image for which a report is to be generated into the radiology report generation network model and perform one forward pass to obtain the predicted radiology report.

[0027] Embodiment 2: Embodiment 2 differs from Embodiment 1 in the following respects: (1) In step S2-1 of Embodiment 2, the knowledge queue is obtained from the radiology report samples corresponding to the paired images in the training set of the MIMIC-CXR dataset, using the same method as that used for the training set of the IU-Xray dataset. (2) In step S2-2 of Embodiment 2, the paired images in the training set of the MIMIC-CXR dataset and the knowledge queue obtained in step S2-1 of Embodiment 2 are input into the radiology report generation network. (3) In step S2-2 of Embodiment 2, training is completed after 30 epochs, and the preset warm-up step threshold is set to 5000.

[0028] (4) In step S3 of Embodiment 2, the radiological image for which a report is to be generated is input into the radiology report generation network model obtained in Embodiment 2, and one forward pass is performed to obtain the predicted radiology report.

[0029] Test 1: To verify the superiority of the radiology report generation method described in this application over other radiology report generation methods, the method described in this application and four existing radiology report generation methods were tested on the test set of the IU-Xray dataset using the same testing strategy. The four existing methods are CNN-RNN (from the paper "Show and tell: A neural image caption generator"), CARG (from the paper "Clinically Accurate Chest X-ray Report Generation"), Transformer (from the paper "Generating Radiology Reports via Memory-driven Transformer"), and DCL (from the paper "Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation"). Specifically, the test images in the test set of the IU-Xray dataset are input directly into the radiology report generation network model obtained by the method of Embodiment 1 of this application and into the trained neural network models disclosed for the four existing methods. The test results are shown in Table 1.

[0030] Table 1. Test results of different methods on the test set of the IU-Xray dataset.

[0031] In Table 1, "-" indicates that there is no test data.

[0032] Test 2: To verify the superiority of the radiology report generation method described in this application over other radiology report generation methods, the method described in this application and four existing radiology report generation methods were also tested on the test set of the MIMIC-CXR dataset using the same testing strategy. The four existing methods are Show-Tell (from the paper "Show and tell: A neural image caption generator"), AdaAtt (from the paper "Knowing when to look: Adaptive attention via a visual sentinel for image captioning"), Transformer (from the paper "Generating Radiology Reports via Memory-driven Transformer"), and R2GenCMN (from the paper "Cross-modal Memory Networks for Radiology Report Generation"). Specifically, the test images in the test set of the MIMIC-CXR dataset are input directly into the radiology report generation network model obtained by the method of Embodiment 2 of this application and into the trained neural network models disclosed for the four existing methods. The test results are shown in Table 2.

[0033] Table 2. Test results of different methods on the test set of the MIMIC-CXR dataset.

[0034] In Table 2, "-" indicates that there is no test data.

[0035] In Tables 1 and 2, "Ours" refers to the radiology report generation method proposed in this application. BLEU-1 to BLEU-4 measure the degree of matching of unigrams, bigrams, trigrams, and 4-grams, respectively, that is, of n-grams of progressively higher order. Higher BLEU-1, BLEU-2, BLEU-3, and BLEU-4 values indicate more accurate matching between the generated text and the reference text. ROUGE-L measures the structural similarity between the generated text and the reference text based on the longest common subsequence; a higher value indicates that the generated text is closer to the reference text in overall sentence structure and coherence. A higher METEOR value indicates higher accuracy and flexibility in the semantic expression of the generated text. CIDEr evaluates the sentence similarity between the generated report and the reference text, covering grammaticality, saliency, importance, and accuracy; a higher value indicates a higher degree of matching between the generated report and the reference report.
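To make the n-gram matching behind BLEU-1 to BLEU-4 concrete, the following sketch computes the clipped (modified) n-gram precision, which is the core quantity of BLEU. It deliberately omits the brevity penalty and the geometric averaging over n, so it is not a full BLEU implementation; the sentences are invented examples and whitespace tokenization is an assumption.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_ngram_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram count is clipped by its count in the reference,
    so repeating a matching n-gram does not inflate the score.
    """
    candidate_counts = Counter(ngrams(candidate, n))
    reference_counts = Counter(ngrams(reference, n))
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / total

generated = "the heart size is normal".split()
reference = "the heart is normal in size".split()
for n in (1, 2):
    print(n, modified_ngram_precision(generated, reference, n))
```

In this toy pair every generated word appears in the reference, but only half of the generated bigrams do, which illustrates why higher-order BLEU scores are harder to raise.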

[0036] As shown in Table 1, on the test set of the IU-Xray dataset the existing DCL method achieves better results than the other existing radiology report generation methods. This application therefore focuses on comparing the DCL method with the radiology report generation method described in this application. The method described in this application obtains a BLEU-1 of 0.404, a BLEU-2 of 0.267, a BLEU-3 of 0.200, a BLEU-4 of 0.159, a ROUGE-L of 0.365, a METEOR of 0.193, and a CIDEr of 0.735. The BLEU-1 obtained by the method described in this application is 4.66% higher than that of the DCL method, indicating stronger matching accuracy at the unigram level: the generated text (i.e., the radiology report) contains more words consistent with the reference text (i.e., the radiology report sample). The BLEU-2 is 5.95% higher than that of the DCL method, indicating an improved ability to construct bigrams (two-word combinations) and therefore better expression quality at the phrase level. The BLEU-3 is 8.11% higher than that of the DCL method, indicating stronger matching accuracy at the trigram level, that is, higher phrase matching accuracy. The BLEU-4 is 10.42% higher than that of the DCL method, indicating stronger matching accuracy at the 4-gram level: the generated radiology report has a clearer advantage in longer phrase collocation and semantic consistency. The ROUGE-L is 19.28% higher than that of the DCL method, indicating that the generated radiology report is closer to the radiology report sample in sentence structure and semantic coherence. The METEOR is 6.63% higher than that of the DCL method, indicating better accuracy and flexibility of semantic expression. The CIDEr is 16.85% higher than that of the DCL method, indicating higher sentence similarity between the generated radiology report and the reference text (i.e., the radiology report sample); in other words, the text generated by the method described in this application performs better in terms of grammaticality, saliency, importance, and accuracy.

[0037] The MIMIC-CXR dataset used in this application is a large dataset, almost 70 times the size of the IU-Xray dataset, with significantly greater data complexity and label noise than the IU-Xray dataset. As shown in Table 2, on the test set of the MIMIC-CXR dataset the method described in this application outperforms the R2GenCMN method on every evaluation metric for which Table 2 reports an R2GenCMN result. The BLEU-1 score obtained by the radiology report generation method described in this application is 4.53% higher than that obtained by the R2GenCMN method. This indicates stronger matching accuracy at the unigram (single-word) level: the generated text (i.e., the radiology report) contains more words that agree with the reference text (i.e., the radiology report sample). The BLEU-2 score is 4.13% higher than that of the R2GenCMN method. This indicates that the method constructs more accurate bigrams (two-word phrase combinations), which improves the expressive quality of the generated radiology report at the phrase level. The BLEU-3 score is 5.41% higher than that of the R2GenCMN method. This indicates stronger matching accuracy at the trigram level, that is, the generated radiology report has higher phrase matching accuracy. The BLEU-4 score is 6.60% higher than that of the R2GenCMN method. This indicates stronger matching accuracy at the 4-gram level; the generated radiology report has a clearer advantage in longer phrase collocations and semantic consistency, which further improves its quality. The ROUGE-L score is 2.88% higher than that of the R2GenCMN method. This indicates that the generated radiology report is closer to the radiology report sample in sentence structure and semantic coherence. The METEOR score is 0.70% higher than that of the R2GenCMN method. This indicates that the method described in this application is superior in the accuracy and flexibility of semantic representation. The radiology report generation method described in this application achieves a CIDEr score of 0.297. Because no CIDEr result is available for the R2GenCMN method, this application instead compares its CIDEr score with the better CIDEr results obtained by the other existing methods in Table 2. For example, the CIDEr score obtained by the AdaAtt method in Table 2 is 0.131, so the CIDEr score obtained by the method described in this application is 126.72% higher in relative terms. This indicates that, on the test set of the MIMIC-CXR dataset, the radiology report generated by the method described in this application has a higher sentence similarity to the reference text (i.e., the radiology report sample). In other words, the text generated by the method described in this application performs better in terms of grammatical correctness, salience, importance, and accuracy.
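As a check on the arithmetic, the 126.72% figure is consistent with a relative improvement computed from the two CIDEr scores quoted above (0.297 and 0.131). The helper name `relative_improvement` is hypothetical and used only for illustration; whether the BLEU, ROUGE-L, and METEOR percentages were computed the same way cannot be confirmed without the values in Table 2.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage by which `ours` exceeds `baseline`, relative to `baseline`."""
    return (ours - baseline) / baseline * 100.0


cider_ours = 0.297    # CIDEr score quoted for the method of this application
cider_adaatt = 0.131  # CIDEr score quoted for AdaAtt in Table 2

print(f"{relative_improvement(cider_ours, cider_adaatt):.2f}%")  # prints 126.72%
```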

[0038] In summary, the text generated by the radiology report generation method described in this application (i.e., the predicted radiology report) is closer to the radiology report sample in terms of sentence structure, semantic coherence, and consistency, and it has a higher sentence similarity to the reference text (i.e., the radiology report sample). This means that the radiology report generated by this application is closer to a real clinical radiology report.

Claims

1. A radiology report generation method based on bidirectional cross-modal interaction, characterized in that: The method includes the following steps: The radiology image for which a report is to be generated is input into a radiology report generation network model, and a predicted radiology report is obtained after one forward propagation; the radiology report generation network model is obtained by training a radiology report generation network. The radiology report generation network includes an image encoding module, a knowledge aggregation module, and a bidirectional cross-attention fusion module connected to both. The bidirectional cross-attention fusion module is connected to a report generation module. The radiology image is input into the image encoding module to obtain an image embedding sequence and a global feature vector of the image. The global feature vector of the image and a knowledge queue are the input of a two-stage knowledge retrieval and fact consistency reordering process, and the output of that process is input to the knowledge aggregation module. The bidirectional cross-attention fusion module is used to perform bidirectional cross-modal interaction and collaborative optimization between image features and knowledge features on the image embedding sequence and the final knowledge embedding sequence output by the knowledge aggregation module, so as to realize the enhancement of images by knowledge and the reverse correction of knowledge by images.

2. The radiology report generation method based on bidirectional cross-modal interaction according to claim 1, characterized in that: The two-stage knowledge retrieval and fact consistency reordering process includes the following steps: 1) Calculate the cosine similarity between the global feature vector of the image and the feature vector of each text in the knowledge queue to obtain initial similarity scores; based on the initial similarity scores, obtain a preliminary candidate knowledge set and a candidate disease label vector set. 2) Based on the initial similarity score of each text feature vector in the preliminary candidate knowledge set, obtain a comprehensive score for each text feature vector, and perform fact consistency reordering of the text feature vectors in the preliminary candidate knowledge set according to their comprehensive scores. 3) Gate the candidate knowledge ranked in the top k by comprehensive score to obtain the top-k candidate knowledge text set.
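As a non-limiting illustration, step 1) of claim 2 can be sketched as follows. The claim does not state how the preliminary candidate set is selected from the initial similarity scores; keeping the `n_candidates` highest-scoring entries is an assumption of this sketch, as are the function names and the representation of the knowledge queue as a list of (text feature vector, disease label vector) pairs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_candidates(image_vec, knowledge_queue, n_candidates):
    """Score every queue entry against the image vector and keep the best ones.

    knowledge_queue: list of (text_feature_vector, disease_label_vector) pairs.
    Returns (initial_similarity, text_feature_vector, disease_label_vector)
    tuples sorted by descending similarity. Top-n selection is an assumption.
    """
    scored = [
        (cosine_similarity(image_vec, text_vec), text_vec, label_vec)
        for text_vec, label_vec in knowledge_queue
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n_candidates]


# Toy example: three knowledge texts with 2-dimensional features and labels.
queue = [([1.0, 0.0], [1, 0]), ([0.0, 1.0], [0, 1]), ([1.0, 1.0], [1, 1])]
candidates = retrieve_candidates([1.0, 0.0], queue, n_candidates=2)
```

The returned tuples carry both the preliminary candidate knowledge (text feature vectors) and the matching candidate disease label vectors, so the reordering stage can consume them directly.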

3. The radiology report generation method based on bidirectional cross-modal interaction according to claim 2, characterized in that: Step 2) specifically includes the following steps: 2-1) Obtain the predicted disease label vector of the current radiology image based on the global feature vector of the image; 2-2) Calculate a fact consistency score between the predicted disease label vector and each candidate disease label vector in the candidate disease label vector set; 2-3) Linearly add each fact consistency score to the initial similarity score of the corresponding text feature vector in the preliminary candidate knowledge set to obtain the comprehensive score of each text feature vector in the preliminary candidate knowledge set; 2-4) Reorder the text feature vectors in the preliminary candidate knowledge set according to their comprehensive scores to achieve fact consistency reordering.

4. The radiology report generation method based on bidirectional cross-modal interaction according to claim 3, characterized in that: The fact consistency score is calculated as follows: First, the overlap between the predicted disease label vector and each candidate disease label vector in the candidate disease label vector set is calculated; then, the precision and recall are calculated based on the overlap; finally, the harmonic mean is calculated based on the precision and recall, which gives the fact consistency score between the predicted disease label vector and each candidate disease label vector.
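As a non-limiting illustration, the fact consistency score of claim 4 and the linear addition and reordering of claim 3 can be sketched as follows. The sketch assumes binary (0/1) disease label vectors and counts a label as overlapping when it is 1 in both vectors. The weights `sim_weight` and `fact_weight` are hypothetical, since the claims say only that the two scores are linearly added.

```python
def fact_consistency(pred_labels, cand_labels):
    """Harmonic mean of precision and recall over binary disease label vectors."""
    overlap = sum(1 for p, c in zip(pred_labels, cand_labels) if p == 1 and c == 1)
    if overlap == 0:
        return 0.0
    # The harmonic mean is symmetric, so it does not matter which of the two
    # vectors is treated as the reference when naming precision and recall.
    precision = overlap / sum(pred_labels)
    recall = overlap / sum(cand_labels)
    return 2.0 * precision * recall / (precision + recall)


def rerank(candidates, pred_labels, sim_weight=1.0, fact_weight=1.0):
    """Reorder candidates by a linear combination of the two scores.

    candidates: list of (initial_similarity, candidate_label_vector) pairs.
    Returns (comprehensive_score, fact_score, original_index) tuples,
    sorted by descending comprehensive score.
    """
    scored = []
    for idx, (similarity, cand_labels) in enumerate(candidates):
        fact = fact_consistency(pred_labels, cand_labels)
        scored.append((sim_weight * similarity + fact_weight * fact, fact, idx))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


predicted = [1, 0, 1, 0]
ranked = rerank(
    [(0.9, [0, 1, 0, 0]), (0.6, [1, 0, 1, 0]), (0.7, [1, 0, 0, 0])],
    predicted,
)
```

In this toy example the candidate with the highest initial similarity (0.9) shares no disease label with the prediction, so it drops to last place after reordering.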

5. The radiology report generation method based on bidirectional cross-modal interaction according to claim 2, characterized in that: Step 3) includes the following steps: First, the fact consistency score of the first-ranked candidate knowledge is compared with a preset score threshold. When the fact consistency score of the first-ranked candidate knowledge is less than the preset score threshold, the comprehensive score of the first-ranked candidate knowledge is retained, and the comprehensive scores of the candidate knowledge ranked 2nd to kth are each replaced, through a masking operation, with a very small (i.e., large-magnitude) negative constant, which serves as the masked comprehensive score of the candidate knowledge ranked 2nd to kth. The comprehensive score of the first-ranked candidate knowledge and the masked comprehensive scores of the candidate knowledge ranked 2nd to kth are then normalized to obtain the normalized weights of the top k candidate knowledge. When the fact consistency score of the first-ranked candidate knowledge is greater than or equal to the preset score threshold, the comprehensive scores obtained in step 2) are retained for the top k candidate knowledge and are normalized to obtain the normalized weights of the top k candidate knowledge. In both cases, the text feature vectors of the top k candidate knowledge and their normalized weights constitute the top-k candidate knowledge text set, which is transmitted to the knowledge aggregation module.
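As a non-limiting illustration, the gating of claim 5 can be sketched as follows. The claim does not name the normalization function or the value of the negative constant; Softmax normalization and the constant `NEG_LARGE = -1e9` are assumptions of this sketch.

```python
import math

NEG_LARGE = -1e9  # assumed masking constant; the claim gives no specific value


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_top_k(comprehensive_scores, top_fact_score, threshold):
    """Return normalized weights for the top-k candidates.

    comprehensive_scores: top-k comprehensive scores in descending order.
    top_fact_score: fact consistency score of the first-ranked candidate.
    """
    if top_fact_score < threshold:
        # Low confidence in the best match: keep it, mask ranks 2..k.
        gated = [comprehensive_scores[0]] + [NEG_LARGE] * (len(comprehensive_scores) - 1)
    else:
        gated = list(comprehensive_scores)
    return softmax(gated)


masked_weights = gate_top_k([1.6, 1.4, 0.9], top_fact_score=0.2, threshold=0.5)
full_weights = gate_top_k([1.6, 1.4, 0.9], top_fact_score=0.8, threshold=0.5)
```

With the masking branch active, the weights of ranks 2 to k underflow to zero after normalization, so only the first-ranked candidate contributes to the knowledge aggregation module.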

6. The radiology report generation method based on bidirectional cross-modal interaction according to claim 1, characterized in that: The bidirectional cross-attention fusion module includes an image-to-knowledge cross-attention unit and a knowledge-to-image cross-attention unit. Both the image-to-knowledge cross-attention unit and the knowledge-to-image cross-attention unit are connected to a fusion update and knowledge re-injection unit. The image-to-knowledge cross-attention unit outputs an enhanced image embedding sequence, and the knowledge-to-image cross-attention unit outputs an updated knowledge embedding sequence.

7. The radiology report generation method based on bidirectional cross-modal interaction according to claim 6, characterized in that: The image-to-knowledge cross-attention unit is used to project the image embedding sequence into a query matrix Query and to project the final knowledge embedding sequence into a key matrix Key and a value matrix Value, respectively. The dot product of the query matrix Query and the key matrix Key is calculated and then normalized to obtain the attention weight matrix; the attention weight matrix is then multiplied by the value matrix Value to obtain the cross-attention output A. The cross-attention output A is then residually connected to the image embedding sequence, followed by layer normalization, feature mapping, and nonlinear transformation, to output an enhanced image embedding sequence.
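As a non-limiting illustration, the attention, residual connection, and layer normalization of claim 7 can be sketched in plain Python as follows. The projection matrices `w_q`, `w_k`, and `w_v` stand in for learned parameters. Scaling the dot product by the square root of the feature dimension and using Softmax for the normalization are assumptions. The final feature mapping and nonlinear transformation are omitted, and the layer normalization has no learned scale or shift.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def cross_attention(query_source, context, w_q, w_k, w_v):
    """Attend from query_source rows to context rows; return (output, weights)."""
    q = matmul(query_source, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = math.sqrt(len(q[0]))  # assumed scaling, not stated in the claim
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)  # cross-attention output A
    # Residual connection with the query source, then layer normalization.
    output = [
        layer_norm([x + a for x, a in zip(src, att)])
        for src, att in zip(query_source, attended)
    ]
    return output, weights


image_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three image tokens
knowledge_seq = [[1.0, 0.0], [0.0, 1.0]]          # two knowledge tokens
identity = [[1.0, 0.0], [0.0, 1.0]]               # placeholder projections
enhanced, attn = cross_attention(image_seq, knowledge_seq, identity, identity, identity)
```

The same function covers the knowledge-to-image direction by swapping the roles of the two sequences, apart from the sparsification step that claim 8 adds before the Softmax.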

8. The radiology report generation method based on bidirectional cross-modal interaction according to claim 6, characterized in that: The knowledge-to-image cross-attention unit is used to project the final knowledge embedding sequence into a query matrix Query' and to project the enhanced image embedding sequence into a key matrix Key' and a value matrix Value', respectively. The initial attention score matrix is obtained by calculating the dot product of the query matrix Query' and the key matrix Key'. A pathological region sparsification operation is then performed on the initial attention score matrix to obtain the sparsified attention score matrix, which is normalized with the Softmax function to obtain the sparse attention weight matrix. The sparse attention weight matrix is multiplied by the value matrix Value' to obtain the cross-attention output B; the cross-attention output B is residually connected to the final knowledge embedding sequence, followed by layer normalization, feature mapping, and nonlinear transformation, to output the updated knowledge embedding sequence.

9. The radiology report generation method based on bidirectional cross-modal interaction according to claim 8, characterized in that: The fusion update and knowledge re-injection unit is used to sequentially perform global average pooling and deep semantic mapping on the enhanced image embedding sequence and then map the result to the interval (0,1) to obtain the confidence gating weights of the pathological image visual features contained in the enhanced image embedding sequence; the confidence gating weights, the updated knowledge embedding sequence, and the final knowledge embedding sequence are then adaptively and dynamically fused to obtain a fused knowledge representation. The fused knowledge representation is projected into a new key matrix Key'' and a new value matrix Value'', respectively, and the enhanced image embedding sequence is projected into a new query matrix Query''. The dot product of the query matrix Query'' and the key matrix Key'' is calculated and normalized to obtain the final cross-modal attention weight matrix; the final cross-modal attention weight matrix is multiplied by the value matrix Value'' to obtain the visual enhancement features guided by the fused knowledge. The visual enhancement features guided by the fused knowledge are residually connected with the enhanced image embedding sequence, followed by layer normalization, feature mapping, and nonlinear transformation, to obtain the final fused image embedding sequence, which is used as the input of the report generation module.
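As a non-limiting illustration, the confidence gating and fusion of claim 9 can be sketched as follows. The claim does not give the form of the deep semantic mapping or of the adaptive fusion. This sketch assumes a single linear layer with hypothetical parameters `w` and `b`, a sigmoid to reach (0,1), one scalar gate for the whole sequence, and a convex combination of the updated and final knowledge embedding sequences.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def confidence_gate(enhanced_image_seq, w, b):
    """Global average pooling, a linear map (assumed), then a sigmoid to (0,1)."""
    n_tokens = len(enhanced_image_seq)
    dim = len(enhanced_image_seq[0])
    pooled = [sum(row[j] for row in enhanced_image_seq) / n_tokens for j in range(dim)]
    return sigmoid(sum(wi * pi for wi, pi in zip(w, pooled)) + b)


def fuse_knowledge(gate, updated_knowledge, final_knowledge):
    """Assumed fusion rule: gate * updated + (1 - gate) * original, element-wise."""
    return [
        [gate * u + (1.0 - gate) * o for u, o in zip(upd_row, orig_row)]
        for upd_row, orig_row in zip(updated_knowledge, final_knowledge)
    ]


gate = confidence_gate([[1.0, 3.0], [3.0, 1.0]], w=[0.0, 0.0], b=0.0)  # 0.5
fused = fuse_knowledge(gate, [[2.0, 2.0]], [[0.0, 0.0]])               # [[1.0, 1.0]]
```

Under this assumed rule, a gate near 1 trusts the image-corrected knowledge, while a gate near 0 falls back to the knowledge embedding sequence produced by the knowledge aggregation module.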

10. The radiology report generation method based on bidirectional cross-modal interaction according to claim 8, characterized in that: The pathological region sparsification operation performed on any row vector of the initial attention score matrix includes the following steps: First, all elements in the row vector are extracted as the set of relevance scores between the knowledge feature corresponding to that row and all image regions. Next, statistics are computed on the relevance score set to obtain the mean and standard deviation for that knowledge feature. Then, the dynamic activation threshold of the knowledge feature is obtained based on the mean and standard deviation. Finally, elements in the row vector that are less than the dynamic activation threshold are replaced with negative infinity.
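As a non-limiting illustration, the row-wise sparsification of claim 10 followed by the Softmax of claim 8 can be sketched as follows. The claim says only that the dynamic activation threshold is based on the mean and standard deviation. The form mean + `lam` * standard deviation, the default `lam = 0.5`, and the safeguard that caps the threshold at the row maximum (so that at least one element survives) are assumptions of this sketch.

```python
import math


def sparsify_row(scores, lam=0.5):
    """Replace scores below the dynamic threshold with negative infinity."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    threshold = mean + lam * std  # assumed form of the dynamic threshold
    # Safeguard (assumption): never mask the largest score, otherwise the
    # Softmax below would be undefined for a fully masked row.
    threshold = min(threshold, max(scores))
    return [s if s >= threshold else -math.inf for s in scores]


def sparse_softmax(scores, lam=0.5):
    """Softmax over a sparsified row; masked positions receive weight 0."""
    masked = sparsify_row(scores, lam)
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# One knowledge token scored against four image regions.
row = [0.1, 0.2, 0.1, 2.0]
weights = sparse_softmax(row)  # only the fourth region stays active
```

Because masked positions receive exactly zero weight, each knowledge feature is updated only from the image regions whose relevance exceeds its own threshold.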