A chest X-ray image report generation method and device based on retrieval-augmented generation

By combining a multi-level visual fusion encoder and a deformable convolutional decoder with semantic denoising training, the problems of complex feature extraction and slow training convergence in existing technologies are solved, and efficient and accurate chest X-ray image report generation is achieved.

CN121964040B · Active · Publication Date: 2026-06-16 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2026-04-03
Publication Date: 2026-06-16

IPC: G16H15/00; G16H30/20; G16H50/70; G06T7/00; G06V20/70; G06V10/44; G06V10/74; G06V10/80; G06V10/764; G06V10/82; G06N3/084

AI Tagging

Application Domain: Medical data mining; Image analysis

AI Technical Summary

Technical Problem

Existing retrieval-enhanced generation methods suffer from complex feature extraction architectures, difficulty in capturing pathological information, insufficient decoupling capabilities of multi-view features, and slow training convergence when facing chest X-ray report generation tasks, thus affecting the practicality and deployment efficiency of the model.

Method used

We employ a multi-level visual fusion encoder, a deformable convolutional decoder, and a semantic denoising training strategy. The multi-level visual fusion encoder extracts multi-scale visual features, and the deformable convolution enables adaptive interaction between queries and visual features. Furthermore, a hybrid expert model layer automatically decouples the feature differences between frontal and lateral views, and the semantic denoising training strategy accelerates model convergence.

Benefits of technology

It achieves efficient and accurate chest X-ray report generation, significantly reduces model space usage, improves multi-view feature processing capabilities, reduces training iterations by 30%-50%, and improves localization accuracy and report generation accuracy.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a chest X-ray image report generation method and device based on retrieval-augmented generation, belonging to the field of intelligent medical image processing. The method first constructs a retrieval database containing atomized clinical findings; then uses a multi-level visual fusion encoder to extract and fuse multi-scale visual features from a pre-trained visual model; then uses a decoder based on deformable convolution to realize efficient interaction between queries and visual features, and automatically decouples multi-view feature differences by introducing a hybrid expert model (mixture of experts, MoE); finally, it introduces a semantic denoising training strategy to accelerate model convergence. The application improves the accuracy of chest X-ray reports generated from multi-view inputs, improves training efficiency, and provides an efficient solution for intelligent medical image analysis systems.

Description

Technical Field

[0001] This invention relates to the fields of intelligent medical image processing and computer vision, and in particular to a method and apparatus for generating chest X-ray image reports based on retrieval enhancement.

Background Technology

[0002] Chest X-rays are one of the most widely used medical imaging methods globally, playing a crucial role in chest disease screening and image analysis. Radiology reports, as important documents recording the results of image analysis, directly impact the accuracy of clinical decisions. However, where medical resources are limited, the reporting workload can easily lead to a backlog of imaging reports and diagnostic delays.

[0003] In recent years, deep learning-based intelligent report generation methods have developed rapidly. Retrieval-Augmented Generation (RAG) methods assist report generation by retrieving relevant visual-textual information from external knowledge bases, effectively alleviating the hallucination problem of large models while significantly reducing training and deployment costs. However, existing RAG methods still face the following key challenges in chest X-ray report generation:

[0004] First, feature extraction architectures are complex and struggle to capture pathological information: they typically employ multi-encoder or hybrid architectures to extract multi-scale features, which significantly increases model size and computational overhead. Second, decoders lack effective decoupling of multi-view features: traditional methods process frontal and lateral views with identical decoders and fail to separate view-specific feature differences, leading to training difficulties and performance degradation. Third, training converges slowly: retrieval architectures based on end-to-end detection models generally suffer from unstable bipartite graph matching and usually require long training to converge stably, which affects the model's practicality and deployment efficiency.

[0005] Therefore, there is an urgent need for an efficient, accurate, and stable intelligent generation technology for chest X-ray image reports to solve the above-mentioned technical challenges.

Summary of the Invention

[0006] To address the shortcomings of existing technologies, the present invention aims to provide a method for generating chest X-ray reports based on retrieval enhancement. This method achieves efficient, accurate, and stable chest X-ray report generation through a multi-level visual fusion encoder, a deformable convolution-based decoder enhanced by a hybrid expert model (mixture of experts, MoE), and a semantic denoising training strategy.

[0007] The objective of this invention is achieved through the following technical solution:

[0008] A first aspect of the present invention provides a method for generating chest X-ray image reports based on retrieval enhancement, comprising the following steps:

[0009] S1. Construct a standardized retrieval database containing atomic clinical findings;

[0010] S2. Input a single chest X-ray image to be processed into a pre-trained visual representation model, and extract multi-scale visual features from its multiple intermediate levels; input the multi-scale visual features into a multi-level visual fusion encoder for fusion to generate a fused visual feature representation;

[0011] S3. The fused visual feature representation, connected in series with the learnable queries, is input into a deformable convolutional decoder enhanced by a hybrid expert model. The deformable convolution enables adaptive interaction between the queries and the visual features, and the hybrid expert model layer automatically decouples the feature differences between the frontal and lateral views. The decoder predicts the semantic embeddings of key phrases and their confidence scores, and a set of matching key phrases is retrieved from the standardized retrieval database based on the semantic embeddings.

[0012] In the encoder and decoder training phase proposed in steps S2 and S3, a semantic denoising training strategy is introduced to optimize the learnable query.

[0013] S4. Input the set of key phrases, historical reports, and other view prediction key phrase sets as explicit contextual constraints into the large language model to guide the large language model to generate chest X-ray image reports that conform to clinical standards and are consistent with historical reports.

[0014] The historical report is the previous report of the individual to whom the current chest X-ray belongs. The other-view predicted key phrase set is obtained as follows: if the same individual has chest X-ray images taken at the same time from other views, the corresponding key phrase set is extracted from those images through steps S2 and S3.

[0015] Furthermore, step S2 specifically includes the following sub-steps:

[0016] S21. Input the chest X-ray image to be processed into the pre-trained visual representation model and extract multi-scale features from multiple intermediate levels.

[0017] S22. Extract the category labels output by the pre-trained visual representation model and expand them to the same spatial dimension as the feature maps of each layer;

[0018] S23. The expanded category labels are fused with the spatial feature maps of each layer to generate fused multi-layer features;

[0019] S24. The fused multi-layer features are spliced and compressed in the channel dimension to obtain the final visual feature representation.

[0020] Furthermore, step S3 specifically includes the following sub-steps:

[0021] S31, Deformable Self-Interaction: The learnable query is reshaped into a two-dimensional feature map, and after channel transformation by 1×1 convolution, it is self-interacted through deformable convolution to establish long-range dependencies between queries; wherein the learnable query is a set of trainable parameters, and after interaction with the decoder and fused visual features, the prediction result is output.

[0022] S32, Deformable Cross-Interaction: The query after self-interaction is upsampled so that its spatial size is the same as the aforementioned visual feature representation. Then, the upsampled query and the visual feature representation are fused. The fused features are then subjected to deformable convolution to achieve cross-interaction between the query and the visual features.

[0023] S33. Hybrid expert routing and feature decoupling: The cross-interaction features are input into the hybrid expert model layer. The activation probability of each routing expert is calculated through a gating network. The two routing experts with the highest probabilities are selected for calculation. The automatic decoupling of the features of the frontal and lateral images is achieved through the strategy of separating the shared experts and the routing experts.

[0024] S34, Semantic Embedding Output: The output of the hybrid expert model layer is passed through two parallel linear projection heads to generate the semantic embedding vector of the key phrase and the predicted probability of each position, respectively.

[0025] S35. Similarity Search: From the results generated in step S34, select all semantic embedding vectors whose predicted probability is higher than a preset threshold. Calculate the similarity between each selected semantic embedding vector and each entry in the vector database, and take the entry with the highest similarity as the matching result. The key phrases corresponding to all matching results are collected to form the final set of retrieved key phrases.
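The thresholding and nearest-entry lookup of step S35 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `retrieve_key_phrases`, the threshold value 0.5, the toy three-dimensional database, and the use of brute-force cosine similarity are all assumptions made for the example.

```python
import numpy as np

def retrieve_key_phrases(embeddings, probs, db_vectors, db_phrases, threshold=0.5):
    """Return the deduplicated key phrases matched by confident queries.

    embeddings: [N, D] predicted semantic embeddings (one per query)
    probs:      [N]    predicted probabilities (one per query)
    db_vectors: [M, D] database embeddings; db_phrases: list of M strings
    threshold:  hypothetical value; the source only says "preset threshold"
    """
    keep = probs > threshold                      # keep confident queries only
    if not keep.any():
        return []
    q = embeddings[keep]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    db = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    best = (q @ db.T).argmax(axis=1)              # most similar entry per query
    phrases = []
    for idx in best:
        phrase = db_phrases[int(idx)]
        if phrase not in phrases:                 # collect matches as a set
            phrases.append(phrase)
    return phrases

# Toy database with three entries (illustrative values only).
db_phrases = ["enlarged heart", "increased lung markings", "pleural effusion"]
db_vectors = np.eye(3)
embeddings = np.array([[0.9, 0.1, 0.0], [0.0, 0.1, 0.9],
                       [0.1, 0.8, 0.1], [0.8, 0.0, 0.2]])
probs = np.array([0.9, 0.2, 0.7, 0.8])
matched = retrieve_key_phrases(embeddings, probs, db_vectors, db_phrases)
print(matched)
```

The second query is dropped by the threshold, and the first and fourth queries collapse onto the same database entry, so the result contains two phrases.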

[0026] Specifically, in step S31, the deformable convolution operation is defined as:

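[0027] $y(p_0) = \sum_{k=1}^{K} w_k \cdot x(p_0 + p_k + \Delta p_k) \cdot \Delta m_k$;

[0028] where $x$ and $y$ are the input and output feature maps, $p_0$ is the output position, $p_k$ is the predefined k-th sampling point, $\Delta p_k$ is the learned spatial offset, $\Delta m_k$ is the modulation scalar used to suppress background noise, $w_k$ is the kernel weight, and $K = 81$ is the number of sampling points of the 9×9 convolution.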

[0029] Furthermore, in step S32, the upsampled query is fused with the visual feature representation as follows. First, the query feature map $Q_s$ is upsampled by bilinear interpolation from the query space to the visual feature space; the upsampled query is denoted $\hat{Q}$. Then, the upsampled query features and the visual features $F_v$ are added element-wise:

[0030] $F_{\text{fuse}} = \hat{Q} + F_v$;

[0031] Finally, a 9×9 deformable convolution is applied to the fused features $F_{\text{fuse}}$ for cross-interaction, outputting the cross-interaction features $F_{\text{cross}}$.
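The upsample-and-add fusion of step S32 can be sketched in NumPy as below. The helper names `bilinear_resize` and `fuse_query_with_visual`, the align-corners sampling convention, and the 16×16 visual grid are assumptions for illustration; the source states only that bilinear interpolation and element-wise addition are used.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize x of shape [h, w, C] to [out_h, out_w, C].

    Uses an align-corners convention (assumption): the corner samples of the
    input and output grids coincide.
    """
    h, w, _ = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :, None]          # horizontal interpolation weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def fuse_query_with_visual(query_map, visual_map):
    """Upsample the query map to the visual grid and add element-wise."""
    out_h, out_w, _ = visual_map.shape
    return bilinear_resize(query_map, out_h, out_w) + visual_map

# 50 queries reshaped to a 5x10 map, fused with a hypothetical 16x16 visual map.
rng = np.random.default_rng(0)
query_map = rng.standard_normal((5, 10, 768))
visual_map = rng.standard_normal((16, 16, 768))
fused = fuse_query_with_visual(query_map, visual_map)
print(fused.shape)
```

The fused map would then be passed to the 9×9 deformable convolution for cross-interaction.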

[0032] Specifically, in step S33, the hidden layer dimension of the hybrid expert model layer is 768-dimensional, the input is a two-dimensional feature map structure, and each spatial location of the feature map is processed sequentially; the layer contains 1 shared expert and 8 routing experts. The shared expert is always active to process general anatomical features across views, and the selected routing experts process view-specific detailed features.

[0033] The output of the hybrid expert model layer for the input feature x is calculated as follows:

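[0034] $y = \sum_{i=1}^{N_s} E_i^{s}(x) + \sum_{j \in \mathcal{T}} g_j(x)\, E_j^{r}(x)$, with $\mathcal{T} = \mathrm{TopK}(G(x), K)$;

[0035] where $G(x)$ is the routing probability output by the gating network, $\mathrm{TopK}(\cdot, K)$ selects the $K = 2$ routing experts with the highest probability, $N_s$ is the number of shared experts, $E_i^{s}$ is the i-th shared expert, $E_j^{r}$ is the j-th routing expert, and $g_j(x)$ is the activation weight of the j-th routing expert.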
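A shared-plus-routed expert layer with top-2 gating, as described for step S33, might look like the following NumPy sketch. The function names, the toy scaling "experts", and the choice to use the raw gate probability as the activation weight (without renormalizing over the selected experts) are assumptions; the source does not specify these details.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, gate_w, shared_experts, routed_experts, top_k=2):
    """x: [T, D] tokens (one per spatial location); gate_w: [D, E].

    Shared experts process every token; each token is additionally sent to
    its top_k routed experts, weighted by the gate probability (assumption).
    """
    probs = softmax(x @ gate_w)                          # [T, E] routing probabilities
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]     # [T, top_k] selected experts
    out = np.zeros_like(x)
    for expert in shared_experts:                        # always-active experts
        out += expert(x)
    for j, expert in enumerate(routed_experts):
        mask = (top_idx == j).any(axis=-1)               # tokens routed to expert j
        if mask.any():
            out[mask] += probs[mask, j:j + 1] * expert(x[mask])
    return out, top_idx

# 1 shared expert and 8 routed experts, as in the described layer;
# the experts here are toy scalings, not trained networks.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 4, 8)).reshape(-1, 8)   # flatten a 4x4 feature map
gate_w = rng.standard_normal((8, 8))
shared = [lambda t: t]
routed = [(lambda t, s=j + 1: s * t) for j in range(8)]
out, top_idx = moe_layer(tokens, gate_w, shared, routed)
print(out.shape, top_idx.shape)
```

Because only two routed experts run per token, the per-token compute stays fixed regardless of how many routed experts the layer holds.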

[0036] Furthermore, during the training of the encoder and decoder of steps S2 and S3, a semantic denoising training strategy is introduced to optimize the learnable query and accelerate model convergence, specifically including:

[0037] Ground-truth embedding acquisition: For each ground-truth key phrase in the training batch, a text encoder encodes it into a semantic embedding vector, which serves as the ground truth of the positive sample, $e_i$;

[0038] Semantic noise generation: Words are randomly selected from the ground-truth key phrases and randomly replaced or deleted to generate semantically perturbed key phrases;

[0039] Gaussian noise addition: Random noise is added to the ground-truth semantic embedding vectors of the positive samples to generate noisy queries $\tilde{q}_i$;

[0040] Denoising and Reconstruction: The noisy queries replace the learnable queries as input to the decoder, which outputs the denoised query representations $\hat{e}_i$;

[0041] Reconstruction loss calculation: Calculate the semantic reconstruction loss between the denoised query representation and the ground truth of the positive sample, as well as the model's prediction confidence loss for the denoised query; the semantic denoising loss is expressed as: $\mathcal{L}_{dn} = \frac{1}{N} \sum_{i \in \mathcal{Q}^{+}} \lVert \hat{e}_i - e_i \rVert_2 + \lambda_c\, \mathcal{L}_{BCE}(\hat{c}, c)$;

[0042] where $\mathcal{Q}^{+}$ is the set of positive-sample noisy queries, containing $N$ positive samples in total; $\hat{e}_i$ is the reconstructed query semantic embedding; $e_i$ is the ground-truth semantic embedding, i.e. the output of the text encoder for the real text phrase; $\lVert \cdot \rVert_2$ is the L2 distance metric; confidence calibration uses the binary cross-entropy loss $\mathcal{L}_{BCE}$; the target label $c$ is 1 for queries generated from real phrases and 0 for queries generated from pure random noise; $\hat{c}$ is the model's prediction confidence for the reconstructed query; and $\lambda_c$ is the confidence loss weighting coefficient;

[0043] Gradient backpropagation: Based on the reconstruction loss and the prediction confidence loss, gradient backpropagation is performed to update the model parameters;

[0044] The semantic denoising training and the main task training are performed alternately. In each training iteration, semantic denoising training is performed first, followed by the main task training.
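The loss side of the denoising step can be illustrated with a small NumPy sketch. It assumes a reconstruction term averaged over positive samples plus a weighted binary cross-entropy confidence term, in line with the description above; the function names, the noise scale `sigma`, the weight `lam`, and the averaging convention are hypothetical choices, and the decoder itself is not modeled.

```python
import numpy as np

def add_gaussian_noise(gt_emb, sigma, rng):
    """Build noisy queries from ground-truth embeddings (sigma is hypothetical)."""
    return gt_emb + sigma * rng.standard_normal(gt_emb.shape)

def binary_cross_entropy(conf, labels, eps=1e-7):
    conf = np.clip(conf, eps, 1 - eps)      # avoid log(0)
    return float(np.mean(-(labels * np.log(conf) + (1 - labels) * np.log(1 - conf))))

def denoising_loss(recon_emb, gt_emb, conf, labels, lam=1.0):
    """Mean L2 reconstruction distance over positives + lam * BCE over all queries.

    labels: 1 for queries built from real phrases, 0 for pure-noise queries.
    """
    pos = labels == 1
    if pos.any():
        rec = float(np.mean(np.linalg.norm(recon_emb[pos] - gt_emb[pos], axis=1)))
    else:
        rec = 0.0
    return rec + lam * binary_cross_entropy(conf, labels)

rng = np.random.default_rng(0)
gt = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # last row: pure-noise query
labels = np.array([1.0, 1.0, 0.0])
noisy_queries = add_gaussian_noise(gt, sigma=0.1, rng=rng)

# Here the noisy queries stand in for the decoder output (assumption for the demo).
conf = np.array([0.9, 0.8, 0.1])
print(denoising_loss(noisy_queries, gt, conf, labels))
```

In an actual training loop this loss would be computed in the denoising pass of each iteration, before the main-task pass.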

[0045] Furthermore, the final loss function for complete training consists of the following joint optimization of losses:

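[0046] $\mathcal{L} = \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{align} + \lambda_3 \mathcal{L}_{dn} + \lambda_4 \mathcal{L}_{bal} + \lambda_5 \mathcal{L}_{con}$;

[0047] where $\mathcal{L}_{cls}$ is the classification loss, for which focal loss is used to supervise the predicted retrieval confidence of each query; $\mathcal{L}_{align}$ is the alignment loss, for which the optimal match between predictions and ground-truth labels is established by the Hungarian algorithm; $\mathcal{L}_{dn}$ is the semantic denoising loss; $\mathcal{L}_{bal}$ is the load-balancing loss for training the hybrid expert model; $\mathcal{L}_{con}$ is the in-batch semantic contrastive loss; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$ are the weighting coefficients of the respective losses.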

[0048] A second aspect of the present invention provides a chest X-ray image report generation apparatus based on retrieval enhancement, the apparatus comprising the following modules:

[0049] Retrieval Database Construction Module: Constructs a standardized retrieval database containing atomic clinical findings;

[0050] Multi-level visual feature extraction module: Input a single chest X-ray image to be processed into a pre-trained visual representation model, extract multi-scale visual features from its multiple intermediate levels; input the multi-scale visual features into a multi-level visual fusion encoder for fusion, and generate a fused visual feature representation;

[0051] Multi-view feature retrieval module: The fused visual feature representation, connected in series with the learnable queries, is input into a deformable convolutional decoder enhanced by a hybrid expert model. The deformable convolution enables adaptive interaction between the queries and the visual features, and the hybrid expert model layer automatically decouples the feature differences between the frontal and lateral views. The decoder predicts the semantic embeddings of key phrases and their confidence scores, and a set of matching key phrases is retrieved from the standardized retrieval database based on the semantic embeddings.

[0052] In the encoder and decoder training phase proposed in steps S2 and S3, a semantic denoising training strategy is introduced to optimize the learnable query.

[0053] Report generation module: The key phrase set, historical reports, and other view prediction key phrase sets are used as explicit contextual constraints input into the large language model to guide the large language model to generate chest X-ray image reports that conform to clinical standards and are consistent with historical reports.

[0054] The historical report is the previous report of the individual to whom the current chest X-ray belongs. The other-view predicted key phrase set is obtained as follows: if the same individual has chest X-ray images taken at the same time from other views, the corresponding key phrase set is extracted from those images through steps S2 and S3.

[0055] A third aspect of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the aforementioned method for generating chest X-ray image reports based on retrieval enhancement.

[0056] A fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the aforementioned method for generating chest X-ray image reports based on retrieval enhancement.

[0057] Compared with the prior art, the beneficial effects achieved by the present invention are:

[0058] 1) High efficiency in feature extraction: By using a multi-level visual fusion scheme with a single encoder, the model space occupancy is significantly reduced while taking into account the needs of localizing small lesions and understanding pathological semantics. High efficiency in multi-scale feature fusion can be achieved without the need for a complex attention mechanism.

[0059] 2) Effective decoupling of multi-view features: By using a strategy of separating shared experts and routing experts in the hybrid expert model layer, automatic feature decoupling and specialized processing of frontal and lateral images are achieved, improving multi-view feature processing capabilities without increasing inference computation.

[0060] 3) Fast training convergence: Through semantic denoising training strategies, a stable optimization signal is provided for query convergence, effectively accelerating model training and significantly reducing the number of iterations required for training by 30%-50%;

[0061] 4) Adaptive sampling accuracy: Adaptive interaction and sampling of queries are achieved through deformable convolution, enabling the model to dynamically adjust the receptive field according to the irregular lesion morphology, significantly improving the localization accuracy, while eliminating the impact of the initial spatial distribution of the query on the final performance.

Attached Figure Description

[0062] Figure 1 is a flowchart of a method for generating chest X-ray image reports based on retrieval enhancement provided in an embodiment of the present invention;

[0063] Figure 2 is a schematic diagram of the multi-level visual fusion encoder process provided in an embodiment of the present invention;

[0064] Figure 3 is a schematic diagram of the MoE-enhanced deformable convolutional decoder structure provided in an embodiment of the present invention;

[0065] Figure 4 is a schematic diagram of the semantic denoising training strategy provided in an embodiment of the present invention.

Detailed Implementation

[0066] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0067] Example 1:

[0068] Embodiment 1 of this invention discloses a method for generating chest X-ray image reports based on retrieval enhancement. As shown in Figure 1, it includes the following steps:

[0069] S1: Database Construction

[0070] The retrieval database is built by extracting atomized clinical findings from raw radiology reports and constructing a structured query vector library. Radiology reports are typically free text, in which a single sentence may concisely summarize several findings or include complex expressions such as temporal comparisons with earlier reports.

[0071] S11. Text Preprocessing: The original radiology report is cleaned and segmented into sentences to identify and mark compound descriptive statements. In this embodiment, the original report text is first preprocessed to remove extra line breaks, redundant spaces, and special control characters generated by system export or manual editing, and to restore the text that breaks across lines to a continuous paragraph format, ensuring that the report content can be correctly segmented according to the period. The preprocessing operations include: (1) Line break processing - replacing line breaks in the text with spaces to eliminate sentence breaks caused by line breaks; (2) Space normalization - merging multiple consecutive spaces into a single space and removing spaces at the beginning and end of lines; (3) Special character filtering - removing invisible control characters and format marks. The preprocessed text is segmented into sentences using a sentence segmentation algorithm based on punctuation and a clinical terminology dictionary, and long paragraphs are split into independent sentences according to sentence end marks such as periods, question marks, and exclamation marks. For compound descriptions containing multiple findings concentrated in one sentence (such as "enlarged heart, increased lung markings, and possible heart failure"), special marking is performed for subsequent segmentation processing.
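The three cleaning operations and the punctuation-based sentence split of step S11 can be sketched with the Python standard library. The function names and the exact set of control characters removed are assumptions; the clinical terminology dictionary and the marking of compound statements mentioned above are not modeled here.

```python
import re

# Invisible control characters and format marks to drop (illustrative set).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\u200b\ufeff]")
# Split after a period, question mark, or exclamation mark followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")

def clean_report(text: str) -> str:
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\n", " ")            # (1) line breaks -> spaces
    text = _CONTROL_CHARS.sub("", text)       # (3) remove control characters
    text = re.sub(r"[ \t]+", " ", text)       # (2) merge consecutive spaces
    return text.strip()                       # (2) trim leading/trailing spaces

def split_sentences(text: str) -> list:
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]

raw = "Heart size is\nenlarged.  Lung markings\x00 are increased.\n No pleural effusion!"
cleaned = clean_report(raw)
sentences = split_sentences(cleaned)
print(sentences)
```

A purely punctuation-based split will mis-handle abbreviations ending in a period, which is one reason the described method also consults a terminology dictionary.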

[0072] S12. Key Phrase Extraction: Using a large language model with prompt engineering, candidate key phrases are extracted sentence by sentence from the segmented text. Compound descriptive statements are split into independent assertions, and temporal-comparison descriptions and references to previous studies are filtered out. This embodiment uses a large language model with GPT-4 or equivalent capabilities and a dedicated prompt template that guides the model in extracting structured clinical findings from radiology reports. The prompt template includes a task description, input format, output format requirements, and examples. It explicitly requires the model to: (1) identify and filter sentences containing temporal-comparison keywords (such as "compared to previous", "compared to previous film", "compared to before", "worsened", "improved", "progressed", "recovered", etc.); (2) identify and filter content that references previous studies (such as "compared to previous examinations", "refer to previous examinations", etc.); (3) ensure that the extracted key phrases describe independent findings of the current examination rather than their evolution over time. For example, when the input is "enlarged heart, increased lung markings, possibly indicating heart failure", the model outputs two independent key phrases: "enlarged heart" and "increased lung markings". When the input is "compared to previous films, lung infiltration has improved", the model identifies it as a temporal-comparison description and outputs an empty list. The extraction process follows these principles: (1) each key phrase should contain an anatomical location and a pathological description; (2) expressions of uncertainty (such as "possibly" and "consider") are excluded; (3) complex sentences are broken down into the smallest semantic units; and (4) all temporal comparisons and references to previous studies are filtered out.

[0073] S13. Entity Validation: Rule-based graph parsing is used to compare and validate the content before and after extraction, filtering out invalid phrases that do not conform to the clinical entity definition and retaining valid clinical findings descriptions. This embodiment uses RadGraph's rule-based graph parsing function to validate the extraction results. The validation process employs a differentiated strategy based on the sentence processing type: For sentences completely discarded by the large model, RadGraph is used for entity extraction, and the keywords are checked to see if they fall within the pre-collected time-comparison marker set. If they mainly belong to this set, the discard is confirmed as correct; otherwise, it is marked for manual review. For sentences that have been modified, the text before and after modification is input into RadGraph for graph parsing. The main entities (anatomical locations and disease entities) in the two parsing results are compared to see if they completely correspond. If the main entities are consistent, the modification is accepted; if they are inconsistent, it is marked for manual processing. For key phrases that pass the above validation, they are further checked to see if they contain complete clinical information (anatomical location + pathological description), filtering out overly vague or fragmented phrases to ensure that entries entering the vector database have clear clinical significance and retrieval value. This "post-generation validation" process ensures the clinical accuracy and consistency of database entries through RadGraph's objective validation.

[0074] S14. Vector Encoding: The verified key phrases are encoded into 768-dimensional semantic embedding vectors using a pre-trained text encoder, constructing a vector database that supports similarity retrieval, and establishing an index mapping relationship between the key phrases and the original report clauses. This embodiment uses MPNet as the text encoder. This model is pre-trained on a large-scale text corpus through masking and permutation language modeling, and has good modeling capabilities for sentence-level semantic representation. The encoding process is as follows: First, the key phrases are segmented based on a medical terminology dictionary and punctuation rules. Then, the segmented text is input into the MPNet model to obtain sentence-level semantic embeddings. Finally, the embeddings are L2 normalized to obtain semantic vectors of unit length. The vector database is constructed using FAISS (Facebook AI Similarity Search), which supports efficient approximate nearest neighbor retrieval. Each database entry contains: key phrase text (original string), semantic embedding vector (768-dimensional floating-point number), source report ID, source sentence position, and verification status marker. The database supports two similarity measures: cosine similarity and Euclidean distance, with cosine similarity used by default.
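Because the embeddings are L2-normalized before indexing, the two supported similarity measures rank database entries identically. A short derivation, using $u$ and $v$ as illustrative symbols for two unit-length embedding vectors:

```latex
\[
\lVert u - v \rVert_2^2
  = \lVert u \rVert_2^2 + \lVert v \rVert_2^2 - 2\, u^{\top} v
  = 2 - 2 \cos(u, v),
  \qquad \text{since } \lVert u \rVert_2 = \lVert v \rVert_2 = 1 .
\]
```

The entry with the highest cosine similarity is therefore also the one with the smallest Euclidean distance, so the choice of default measure affects only the reported score, not which entry is retrieved.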

[0075] S2: Multi-level visual feature extraction

[0076] The multi-level visual fusion encoder is responsible for extracting and fusing multi-scale visual features from the pre-trained visual representation model, as shown in Figure 2. This embodiment uses CXR-CLIP as the pre-trained visual representation model.

[0077] S21. Multi-scale Feature Extraction: The chest X-ray image to be processed is input into the pre-trained visual representation model. Intermediate feature maps are extracted from layers 1, 4, and 5 respectively, forming a multi-scale feature pyramid. In this embodiment, CXR-CLIP is used as the pre-trained visual representation model. Its visual encoder adopts the Vision Transformer architecture, and its text encoder adopts the Transformer architecture. This model has been pre-trained on large-scale chest X-ray-report pairs through contrastive learning. By aligning visual features with text semantics to a shared embedding space, it extracts clinically discriminative chest X-ray image features. The input image size is 512×512 pixels. After processing by the CXR-CLIP visual encoder, the number of channels in the output feature maps of each layer is 768 dimensions, and the spatial resolution remains constant. This embodiment mainly uses shallow feature maps (layers 1, 4, and 5) for subsequent processing, and deep features are expressed through CLS tokens.

[0078] S22. Category Label Expansion: Extract the category label (CLS token, a 768-dimensional vector) from the final output of the pre-trained visual representation model, and expand the category label to the same spatial dimension as the feature maps of each layer through a copy expansion operation. In this embodiment, the CLS token is located at the first position of the final layer output of CXR-CLIP and contains semantic summary information of the global image. The expansion operation is implemented through tensor broadcasting: the CLS token vector of shape [1, 768] is copied and expanded to [H×W, 768], then reshaped into a spatial tensor of shape [H, W, 768], where H and W are the height and width of the feature map. This expansion allows each spatial location to receive global semantic guidance from the CLS token, enhancing the global consistency of the feature map.

[0079] S23. Feature Fusion: The expanded category labels are added element-wise to the spatial feature maps of each layer to generate fused multi-layer features. The fusion formula is:

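[0080] $\tilde{F}_l = F_l + T_{cls}$;

[0081] where $F_l$ is the spatial feature map of the $l$-th layer, $T_{cls}$ is the expanded category label feature, and $\tilde{F}_l$ is the fused feature of the $l$-th layer.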

[0082] S24. Channel Compression: The fused multi-layer features are concatenated along the channel dimension, and channel compression is performed through a 1×1 convolution to obtain the final visual feature representation. In this embodiment, the fused features of the three layers (layer 1, layer 4, and layer 5, each with 768 channels) are concatenated along the channel dimension to obtain a concatenated feature with a total of 2304 channels. Then, channel compression is performed through a 1×1 convolutional layer to obtain the final visual feature representation. Its number of channels is consistent with the number of feature channels in the pre-trained visual representation model, and its spatial size is the same as the feature size extracted by the pre-trained visual representation model. The formula for channel compression is expressed as:

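[0083] $F_{vis} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(\hat{F}_1, \hat{F}_4, \hat{F}_5)\big)$;

[0084] where $\hat{F}_1$, $\hat{F}_4$, and $\hat{F}_5$ are the fused features of layers 1, 4, and 5, $\mathrm{Concat}(\cdot)$ denotes the channel-wise concatenation operation, $\mathrm{Conv}_{1\times 1}(\cdot)$ denotes the 1×1 convolution operation, and $F_{vis}$ is the final visual feature representation.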
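Steps S22 to S24 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function name `fuse_multilevel`, the toy 4×4 spatial size, and the random weights are assumptions made for illustration, and the 1×1 convolution is written as a per-pixel linear map, which is mathematically equivalent.

```python
import numpy as np

def fuse_multilevel(feature_maps, cls_token, conv_weight, conv_bias=None):
    """feature_maps: list of [H, W, C] arrays; cls_token: [C]; conv_weight: [C_out, L*C]."""
    h, w, c = feature_maps[0].shape
    # S22: copy the CLS token to every spatial location -> [H, W, C]
    cls_expanded = np.broadcast_to(cls_token.reshape(1, 1, c), (h, w, c))
    # S23: element-wise addition of the expanded token to each layer's feature map
    fused = [f + cls_expanded for f in feature_maps]
    # S24: concatenate along the channel axis -> [H, W, L*C]
    stacked = np.concatenate(fused, axis=-1)
    # A 1x1 convolution is a linear map applied independently at each pixel
    out = stacked @ conv_weight.T
    if conv_bias is not None:
        out = out + conv_bias
    return out

rng = np.random.default_rng(0)
H, W, C = 4, 4, 768                      # toy spatial size (assumption)
maps = [rng.standard_normal((H, W, C)) for _ in range(3)]   # stand-ins for layers 1, 4, 5
cls = rng.standard_normal(C)
weight = rng.standard_normal((C, 3 * C)) * 0.01             # 2304 -> 768 channels
visual = fuse_multilevel(maps, cls, weight)
print(visual.shape)  # (4, 4, 768)
```

The output keeps the spatial size of the input feature maps and returns to 768 channels, matching the description of the compressed visual feature representation.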

[0085] S3: Multi-view Feature Retrieval

[0086] The fused visual feature representation is input, together with a learnable query (the two are connected in series), into a deformable convolutional decoder enhanced by a hybrid expert model (mixture of experts, MoE). Adaptive interaction between the query and the visual features is achieved through deformable convolution, and the feature differences between anteroposterior and lateral views are automatically decoupled through the hybrid expert model layer. The semantic embeddings of key phrases and their confidences are predicted. Based on the semantic embeddings, a set of matching key phrases is retrieved from the standardized retrieval database, as shown in Figure 3.

[0087] S31. Deformable Self-Interaction: The learnable query is reshaped into a two-dimensional query feature map, which is passed through two 1×1 convolutions for channel transformation and then through a 9×9 deformable convolution for self-interaction, establishing long-range dependencies between queries. In this embodiment, the learnable query is a set of 50 randomly initialized embedding vectors, each of dimension 768, representing the basic templates of the potential key phrases that the model needs to retrieve. It is denoted as $Q \in \mathbb{R}^{N_q \times D}$, where $N_q$ is the number of queries and $D$ is the query dimension. During the self-interaction phase, the query is reshaped into a rectangular two-dimensional feature map (e.g., 5×10=50); feature transformation along the channel dimension is then performed through two consecutive 1×1 convolutional layers, each followed by normalization and the GELU activation function; finally, the result is processed by a 9×9 deformable convolution. The deformable convolution operation is defined as:

[0088] $y(p_0) = \sum_{k=1}^{K} w_k \cdot m_k \cdot x(p_0 + p_k + \Delta p_k)$;

[0089] where $p_0$ is the output position, $p_k$ is the predefined k-th sampling point, $\Delta p_k$ is the learned spatial offset, $m_k$ is the modulation scalar used to suppress background noise, $w_k$ represents the kernel weights, and K=81 represents the number of sampling points for the 9×9 convolution.
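As an illustration of the modulated deformable convolution described above, the following NumPy sketch evaluates a single output position on a single-channel map. The helper names, the zero-padding at the borders, and the 3×3 kernel used in the example (K=9 rather than K=81, for brevity) are assumptions; in a trained model the offsets and modulation scalars would be predicted by the network rather than supplied by hand.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample the 2D array x at fractional position (py, px); zero outside the borders."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Bilinear weights fall off linearly with distance to the grid point
                value += (1.0 - abs(py - yy)) * (1.0 - abs(px - xx)) * x[yy, xx]
    return value

def deform_conv_at(x, p0, weights, offsets, modulation, kernel_size=3):
    """Output at position p0: sum_k w_k * m_k * x(p0 + p_k + delta_p_k)."""
    r = kernel_size // 2
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]  # p_k
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        py = p0[0] + dy + offsets[k, 0]   # regular grid point plus learned offset
        px = p0[1] + dx + offsets[k, 1]
        out += weights[k] * modulation[k] * bilinear_sample(x, py, px)
    return out

x = np.arange(25.0).reshape(5, 5)
K = 9
w = np.full(K, 1.0 / K)
plain = deform_conv_at(x, (2, 2), w, np.zeros((K, 2)), np.ones(K))
shifted = deform_conv_at(x, (2, 2), w, np.tile([0.0, 0.5], (K, 1)), np.ones(K))
print(plain, shifted)
```

With zero offsets and unit modulation the operation reduces to an ordinary convolution; shifting every sampling point by half a pixel moves the receptive field, which is the mechanism that lets queries attend to irregular regions.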

[0090] S32. Deformable Cross-Interaction: The query after self-interaction is upsampled so that its spatial size is the same as that of the visual feature representation. The upsampled query and the visual feature representation are then fused, and the fused features are subjected to a 9×9 deformable convolution to achieve cross-interaction between the query and the visual features. In this embodiment, the query feature map is first upsampled by bilinear interpolation from the query space of size $H_q \times W_q$ to the visual feature space of size $H_v \times W_v$, and the upsampled query is denoted as $Q_{up}$. Then, the upsampled query features and the visual features $F_{vis}$ are fused by element-wise addition:

[0091] $F_{fuse} = Q_{up} + F_{vis}$;

[0092] Finally, a 9×9 deformable convolution is applied to the fused features $F_{fuse}$ to perform cross-interaction, and the cross-interaction features $F_{cross}$ are output.
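The upsampling and additive fusion of step S32 can be sketched as follows. This is a hedged example: the corner-aligned interpolation convention, the function name `bilinear_upsample`, and the 32×32 visual feature size are assumptions, since the source does not state them.

```python
import numpy as np

def bilinear_upsample(q, out_h, out_w):
    """q: [h, w, C] query map -> [out_h, out_w, C], corner-aligned bilinear interpolation."""
    in_h, in_w, _ = q.shape
    ys = np.linspace(0.0, in_h - 1, out_h)   # source row coordinate for each output row
    xs = np.linspace(0.0, in_w - 1, out_w)   # source column coordinate for each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = q[y0][:, x0] * (1 - wx) + q[y0][:, x1] * wx
    bottom = q[y1][:, x0] * (1 - wx) + q[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
queries = rng.standard_normal((50, 768))        # 50 learnable queries of dimension 768
q_map = queries.reshape(5, 10, 768)             # 5x10 query grid, as in the example of S31
visual = rng.standard_normal((32, 32, 768))     # hypothetical visual feature size
fused = bilinear_upsample(q_map, 32, 32) + visual   # element-wise additive fusion
print(fused.shape)  # (32, 32, 768)
```

Element-wise addition requires the two tensors to have identical shapes, which is why the query map is resampled to the visual feature resolution before fusion.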

[0093] S33. Hybrid Expert Routing and Feature Decoupling: The cross-interaction features are input into the hybrid expert model layer. A gating network calculates the activation probability of each routing expert, and the two routing experts with the highest probabilities are selected for computation. Automatic decoupling of anteroposterior and lateral view features is achieved through a strategy of separating shared experts from routing experts. In this embodiment, the hidden layer of the hybrid expert model layer has a dimension of 768, and the input is a two-dimensional feature map whose spatial locations are processed position by position. The layer contains one shared expert and eight routing experts. The shared expert is always active and processes general anatomical features across views, while the selected routing experts process view-specific detailed features. The output of the hybrid expert model layer for the input feature x is calculated as follows:

[0094] $y = \sum_{i=1}^{N_s} E_i^{s}(x) + \sum_{j \in \mathrm{TopK}(g(x))} g_j(x)\, E_j^{r}(x)$;

[0095] where $g(x)$ is the routing probability output by the gating network, $\mathrm{TopK}(\cdot)$ selects the K=2 routing experts with the highest probability, $N_s$ is the number of shared experts, $E_i^{s}$ denotes the i-th shared expert, $E_j^{r}$ denotes the j-th routing expert, and $g_j(x)$ denotes the activation weight of the j-th routing expert.
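The routing rule of the hybrid expert layer (one always-active shared expert plus the top-2 of eight routing experts) can be sketched in NumPy as below. The linear experts, the softmax gate, and the use of the raw gate probabilities as activation weights without renormalization over the selected experts are assumptions for illustration; the source does not specify these details.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, gate_w, shared_experts, routed_experts, top_k=2):
    """x: [N, D] flattened spatial positions; gate_w: [D, num_routed]."""
    probs = softmax(x @ gate_w)                          # routing probabilities, [N, E]
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]     # top-k routing experts per position
    out = sum(expert(x) for expert in shared_experts)    # shared experts are always active
    for j, expert in enumerate(routed_experts):
        mask = (top_idx == j).any(axis=-1)               # positions routed to expert j
        if mask.any():
            out[mask] += probs[mask, j:j + 1] * expert(x[mask])
    return out, top_idx

rng = np.random.default_rng(0)
D, E = 768, 8

def make_expert():
    w = rng.standard_normal((D, D)) * 0.02   # toy linear expert (assumption)
    return lambda v: v @ w

x = rng.standard_normal((6, D))              # six spatial positions
gate_w = rng.standard_normal((D, E)) * 0.02
y, chosen = moe_layer(x, gate_w, [make_expert()], [make_expert() for _ in range(E)])
print(y.shape, chosen.shape)  # (6, 768) (6, 2)
```

Because only two of the eight routing experts run for each position, the per-position compute stays close to that of a dense layer with three experts while the layer retains capacity for view-specific behavior.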

[0096] S34. Semantic Embedding Output: The output of the hybrid expert model layer is passed through two parallel linear projection heads to generate semantic embedding vectors for key phrases and predicted probabilities for each position. In this embodiment, the semantic embedding projection head outputs a 768-dimensional vector, consistent with the semantic embedding dimension of the text encoder, and is used for similarity matching with the retrieval database; the probability prediction head outputs the existence probability of each query position, used to filter valid queries. For 50 queries, the final output consists of 50 768-dimensional semantic embedding vectors and 50 corresponding probability values.

[0097] S35. Similarity Search: From the results generated in step S34, select all semantic embedding vectors whose predicted probability is higher than a preset threshold; for each selected semantic embedding vector, calculate its similarity with each entry in the vector database and select the entry with the highest similarity as the matching result; the key phrases corresponding to all matching results are collected to form the final set of retrieved key phrases. In this embodiment, cosine similarity is used to measure the similarity between semantic embedding vectors and database entries, and the similarity calculation formula is:

[0098] $\mathrm{sim}(q, d) = \dfrac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$;

[0099] where $q$ is the semantic embedding vector of the query, $d$ is the semantic embedding vector of the database entry, $\cdot$ represents the vector dot product, and $\lVert \cdot \rVert$ represents the L2 norm of a vector. Efficient nearest neighbor search is implemented using the FAISS library.
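The filtering and matching logic of step S35 can be sketched with brute-force NumPy in place of FAISS. The function name `retrieve_phrases`, the threshold value 0.5, and the three-entry toy database are assumptions for illustration only.

```python
import numpy as np

def retrieve_phrases(embeddings, probs, db_vectors, db_phrases, threshold=0.5):
    """Keep queries above the threshold and return the best-matching phrase for each."""
    keep = probs > threshold                              # filter valid queries
    q = embeddings[keep]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)      # L2-normalise queries
    d = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    sims = q @ d.T                                        # cosine similarity matrix
    best = sims.argmax(axis=1)                            # highest-similarity entry per query
    # Collect matches into a set of phrases, preserving first-seen order
    return list(dict.fromkeys(db_phrases[i] for i in best))

db_vectors = np.eye(3)
db_phrases = ["enlarged heart", "increased lung markings", "bilateral pleural effusion"]
embeddings = np.array([[2.0, 0.1, 0.0], [0.0, 0.2, 3.0], [0.0, 1.0, 0.0]])
probs = np.array([0.9, 0.8, 0.1])                         # third query falls below the threshold
print(retrieve_phrases(embeddings, probs, db_vectors, db_phrases))
# ['enlarged heart', 'bilateral pleural effusion']
```

After L2 normalization, cosine similarity equals the inner product, so the same search could be served by an inner-product index once the database grows beyond what brute force handles comfortably.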

[0100] S4: Report Generation

[0101] The key phrase set, the historical reports, and the key phrase sets predicted from other views are input into the large language model as explicit contextual constraints, guiding it to generate chest X-ray image reports that conform to clinical standards and are consistent with the historical reports. The historical reports are the previous reports of the individual to whom the current chest X-ray belongs. The key phrase sets predicted from other views are those extracted, through steps S2 and S3, from chest X-ray images of the same individual taken at the same time from other views.

[0102] In this embodiment, Llama-3-8B-Instruct, or a large language model with equivalent capabilities, is used as the report generator. The input is a structured prompt that includes the following parts: (1) Task description: explains that the model needs to generate a radiology report based on the retrieved key phrases; (2) Search results: lists the set of key phrases retrieved in S35; (3) Historical report search results: the set of key phrases extracted from the patient's historical examination reports, used to maintain report consistency; (4) Multi-view association results: when multi-view images such as anteroposterior and lateral views are available, the sets of key phrases retrieved from each view are summarized to achieve multi-view information fusion; (5) Output requirements: require a professional, coherent report text that conforms to clinical standards. Example prompt template: "Based on the following key phrases found in the chest X-ray: [enlarged heart, increased lung markings, bilateral pleural effusion], and the patient's historical report key phrase: [mildly enlarged heart], please generate a professional radiology report. The report should include two parts: image findings and image impressions, using standardized medical terminology, and maintaining consistency with historical reports." The large language model generates structured report text under the factual constraints of the search results. This design relies on the factual grounding of the search results to support the clinical accuracy of the generated report (avoiding hallucinations), while using the language organization capabilities of the large language model to output coherent and professional medical text.
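A structured prompt of the kind described in this paragraph could be assembled as in the sketch below. The function name `build_report_prompt` and the exact line layout are hypothetical; only the five parts and the example phrases come from the text, and the call to the language model itself is omitted.

```python
def build_report_prompt(key_phrases, history_phrases=None, other_view_phrases=None):
    """Assemble a structured prompt; the layout here is illustrative, not prescribed."""
    def fmt(phrases):
        return "[" + ", ".join(phrases) + "]"

    lines = [
        # (1) task description
        "Task: generate a professional radiology report based on the retrieved key phrases.",
        # (2) search results from step S35
        f"Key phrases found in the chest X-ray: {fmt(key_phrases)}",
    ]
    if history_phrases:        # (3) historical report key phrases, if any
        lines.append(f"Key phrases from the patient's historical report: {fmt(history_phrases)}")
    if other_view_phrases:     # (4) key phrases retrieved from other views, if any
        lines.append(f"Key phrases retrieved from other views: {fmt(other_view_phrases)}")
    # (5) output requirements
    lines.append(
        "The report should include two parts: image findings and image impressions, "
        "using standardized medical terminology, and maintaining consistency with "
        "historical reports."
    )
    return "\n".join(lines)

prompt = build_report_prompt(
    ["enlarged heart", "increased lung markings", "bilateral pleural effusion"],
    history_phrases=["mildly enlarged heart"],
)
print(prompt)
```

Optional sections are emitted only when the corresponding data exist, so a first-visit, single-view study yields a shorter prompt without empty placeholders.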

[0103] The above steps S1 to S4 describe the workflow of the model inference phase. During the model training phase, a semantic denoising training strategy is used to optimize the learnable query, as shown in Figure 4, specifically including the following steps:

[0104] Truth Embedding Acquisition: For each true key phrase in the training batch, it is encoded into a 768-dimensional semantic embedding vector using the same MPNet text encoder as S14, which serves as the positive sample truth value.

[0105] Semantic noise generation: Randomly select some words from the real key phrases and replace them with synonyms or randomly delete them to generate semantically perturbed key phrases;

[0106] Gaussian noise addition: Random Gaussian noise is added to the 768-dimensional semantic embedding vector of the true value of the positive sample to generate a noisy query;

[0107] Denoising and Reconstruction: The noisy query replaces the learnable query as the input to the decoder, and the decoder outputs the denoised query representation.

[0108] Reconstruction loss calculation: Calculate the semantic reconstruction loss between the denoised query representation and the ground truth value of the positive sample, as well as the model's prediction confidence loss for the denoised query; the expression for the semantic denoising loss is: $\mathcal{L}_{dn} = \dfrac{1}{N_{pos}} \sum_{i \in \mathcal{P}} \lVert \hat{e}_i - e_i \rVert_2 + \lambda_{conf}\, \mathcal{L}_{BCE}(\hat{p}, y)$;

[0109] where $\mathcal{P}$ represents the set of positive sample noisy queries, containing $N_{pos}$ positive samples in total; $\hat{e}_i$ is the reconstructed query semantic embedding; $e_i$ is the semantic embedding ground truth, that is, the output of the text encoder for the corresponding real text phrase; $\lVert \cdot \rVert_2$ is the L2 distance metric; $\mathcal{L}_{BCE}$ is the binary cross-entropy loss used for confidence calibration; the target label $y$ is defined as 1 for queries generated from real phrases and 0 for queries generated from pure random noise; $\hat{p}$ represents the model's prediction confidence for the reconstructed query; and $\lambda_{conf}$ is the confidence loss weight coefficient. Gradient backpropagation: gradient backpropagation is performed based on the reconstruction loss and the prediction confidence loss to update the model parameters.
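The semantic denoising objective can be sketched numerically as follows. This is an assumption-laden illustration: the reconstruction term is taken as the mean L2 distance over positive queries and the confidence term as a binary cross-entropy averaged over all noisy queries, which matches the verbal description but may differ in detail from the original formula; the noise scale `sigma` and the function names are hypothetical.

```python
import numpy as np

def make_noisy_queries(truth, sigma, rng):
    """Add Gaussian noise to ground-truth phrase embeddings to form noisy queries."""
    return truth + sigma * rng.standard_normal(truth.shape)

def denoising_loss(recon, truth, conf, labels, conf_weight=1.0, eps=1e-7):
    """recon, truth: [N, D]; conf: [N] predicted confidence; labels: [N] in {0, 1}."""
    pos = labels > 0.5                                   # queries generated from real phrases
    if pos.any():
        recon_loss = np.linalg.norm(recon[pos] - truth[pos], axis=1).mean()
    else:
        recon_loss = 0.0
    p = np.clip(conf, eps, 1.0 - eps)                    # avoid log(0)
    bce = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)).mean()
    return recon_loss + conf_weight * bce

rng = np.random.default_rng(0)
truth = rng.standard_normal((4, 768))                    # stand-in for text-encoder embeddings
noisy = make_noisy_queries(truth, sigma=0.1, rng=rng)    # would be fed to the decoder
labels = np.array([1.0, 1.0, 1.0, 0.0])                  # last query: pure random noise
conf = np.array([0.9, 0.8, 0.95, 0.1])
# Using the noisy queries as a stand-in for the decoder's reconstruction
print(denoising_loss(noisy, truth, conf, labels))
```

Only positive queries contribute to the reconstruction term, while the confidence term also sees the pure-noise queries, which is what teaches the model to assign low confidence to queries that correspond to no real phrase.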

[0110] The semantic denoising training and the main task training are performed alternately. In each training iteration, semantic denoising training is performed first, followed by the main task training.

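[0111] Furthermore, the final loss function for complete training is the following weighted sum of jointly optimized losses:

[0112] $\mathcal{L}_{total} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{align}\mathcal{L}_{align} + \lambda_{dn}\mathcal{L}_{dn} + \lambda_{bal}\mathcal{L}_{bal} + \lambda_{con}\mathcal{L}_{con}$;

[0113] where $\mathcal{L}_{cls}$ is the classification loss, for which focal loss is used to supervise the prediction of the retrieval confidence of each query; $\mathcal{L}_{align}$ is the alignment loss, for which the optimal matching between the predictions and the ground-truth labels is established using the Hungarian algorithm; $\mathcal{L}_{dn}$ is the semantic denoising loss; $\mathcal{L}_{bal}$ is the load balancing loss used in training the hybrid expert model; $\mathcal{L}_{con}$ is the in-batch semantic contrastive loss; and $\lambda_{cls}$, $\lambda_{align}$, $\lambda_{dn}$, $\lambda_{bal}$, $\lambda_{con}$ are the weighting coefficients of the respective losses.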

[0114] In this embodiment, the following configuration is used during model training:

[0115] Optimizer: AdamW, learning rate 1e-4, weight decay 1e-4;

[0116] Learning rate scheduling: cosine annealing, warm-up steps 500;

[0117] Batch size: 16;

[0118] Number of training epochs: 7 (with semantic denoising training) or 15 (without semantic denoising training);

[0119] Number of learnable queries: 50;

[0120] Retrieval database entries: derived from publicly available datasets such as MIMIC-CXR and IU X-ray.
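The learning rate schedule in the configuration above (base rate 1e-4, 500 warm-up steps, cosine annealing) can be written as a small function. The linear shape of the warm-up, the final learning rate of zero, and the total step count used in the example are assumptions, as the source does not specify them.

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, warmup_steps=500, min_lr=0.0):
    """Learning rate at a given step: linear warm-up, then cosine annealing."""
    if step < warmup_steps:
        # Linear warm-up: reaches base_lr on the last warm-up step
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 10_000   # hypothetical total number of optimizer steps
for s in (0, 249, 499, 500, 5_250, 10_000):
    print(s, lr_at(s, total))
```

The rate rises during the first 500 steps, peaks at the base value, and then decays smoothly along a half cosine, which avoids large early updates to the randomly initialized learnable queries.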

[0121] Through the above implementation methods, the present invention achieves efficient, accurate, and stable intelligent generation of chest X-ray image reports, effectively solving key problems in the prior art such as complex feature extraction architecture, difficulty in decoupling multi-view features, and slow training convergence.

[0122] Furthermore, the present invention also provides a chest X-ray image report generation device based on retrieval enhancement, the device comprising the following modules:

[0123] Retrieval Database Construction Module: Constructs a standardized retrieval database containing atomic clinical findings;

[0124] Multi-level visual feature extraction module: Input a single chest X-ray image to be processed into a pre-trained visual representation model to extract multi-scale visual features from its multiple intermediate levels; input the multi-scale visual features into a multi-level visual fusion encoder for fusion to generate a fused visual feature representation.

[0125] Multi-view feature retrieval module: The fused visual feature representation is input, together with a learnable query (the two are connected in series), into a deformable convolutional decoder enhanced by a hybrid expert model. The deformable convolution enables adaptive interaction between the query and the visual features. The hybrid expert model layer automatically decouples the feature differences between the frontal and lateral views. The module predicts the semantic embeddings of key phrases and their confidences, and retrieves a set of matching key phrases from the standardized retrieval database based on the semantic embeddings.

[0126] In the encoder and decoder training phase proposed in steps S2 and S3, a semantic denoising training strategy is introduced to optimize the learnable query.

[0127] Report generation module: The key phrase set, historical reports, and other view prediction key phrase sets are used as explicit contextual constraints input into the large language model to guide the large language model to generate chest X-ray image reports that conform to clinical standards and are consistent with historical reports.

[0128] The historical report is the previous report result of the individual to which the current chest X-ray belongs; the other view prediction key phrase set is, if the individual to which the current chest X-ray belongs has chest X-ray images taken at the same time from other perspectives, then the corresponding key phrase set is extracted from them through steps S2 and S3.

[0129] An electronic device is also provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the aforementioned method for generating chest X-ray image reports based on retrieval enhancement. A computer-readable storage medium is also provided, on which a computer program is stored. When executed by a processor, the program implements the aforementioned method for generating chest X-ray image reports based on retrieval enhancement.

[0130] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein.

[0131] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope.

Claims

1. A method for generating chest X-ray image reports based on retrieval enhancement, characterized in that it includes the following steps:
S1. Construct a standardized retrieval database containing atomic clinical findings;
S2. Input a single chest X-ray image to be processed into a pre-trained visual representation model and extract multi-scale visual features from its multiple intermediate levels; input the multi-scale visual features into a multi-level visual fusion encoder for fusion to generate a fused visual feature representation;
S3. Input the fused visual feature representation, together with a learnable query (the two are connected in series), into a deformable convolutional decoder enhanced by a hybrid expert model; adaptive interaction between the query and the visual features is achieved through deformable convolution; the feature differences between the frontal and lateral views are automatically decoupled through the hybrid expert model layer; the semantic embeddings of key phrases and their confidences are predicted; based on the semantic embeddings, a set of matching key phrases is retrieved from the standardized retrieval database; specifically, this includes the following sub-steps:
S31. Deformable self-interaction: the learnable query is reshaped into a two-dimensional feature map and, after channel transformation through 1×1 convolution, undergoes self-interaction through deformable convolution to establish long-range dependencies between queries; the learnable query is a set of trainable parameters which, through the decoder, interact with the fused visual features to output the prediction result;
S32. Deformable cross-interaction: the query after self-interaction is upsampled so that its spatial size is the same as that of the visual feature representation; the upsampled query and the visual feature representation are then fused; the fused features are then subjected to deformable convolution to achieve cross-interaction between the query and the visual features;
S33. Hybrid expert routing and feature decoupling: the cross-interaction features are input into the hybrid expert model layer; the activation probability of each routing expert is calculated through a gating network; the two routing experts with the highest probabilities are selected for computation; automatic decoupling of the features of the frontal and lateral views is achieved through the strategy of separating shared experts from routing experts;
S34. Semantic embedding output: the output of the hybrid expert model layer is passed through two parallel linear projection heads to generate, respectively, the semantic embedding vectors of the key phrases and the predicted probability of each position;
S35. Similarity search: from the results generated in step S34, select all semantic embedding vectors whose predicted probability is higher than a preset threshold; for each selected semantic embedding vector, calculate its similarity with each entry in the vector database and select the entry with the highest similarity as the matching result; the key phrases corresponding to all matching results are collected to form the final set of retrieved key phrases;
S4. Input the set of key phrases, the historical reports, and the key phrase sets predicted from other views into the large language model as explicit contextual constraints, to guide the large language model to generate chest X-ray image reports that conform to clinical standards and are consistent with the historical reports; the historical report is the previous report result of the individual to whom the current chest X-ray belongs; the key phrase set predicted from other views is obtained as follows: if the individual to whom the current chest X-ray belongs has chest X-ray images taken at the same time from other views, the corresponding key phrase set is extracted from them through steps S2 and S3.

2. The method for generating chest X-ray image reports based on retrieval enhancement as described in claim 1, characterized in that, Step S2 specifically includes the following sub-steps: S21. Input the chest X-ray image to be processed into the pre-trained visual representation model and extract multi-scale features from multiple intermediate levels. S22. Extract the category labels output by the pre-trained visual representation model and expand them to the same spatial dimension as the feature maps of each layer; S23. The expanded category labels are fused with the spatial feature maps of each layer to generate fused multi-layer features; S24. The fused multi-layer features are spliced and compressed in the channel dimension to obtain the final visual feature representation.

3. The method for generating chest X-ray image reports based on retrieval enhancement as described in claim 1, characterized in that, in step S31, the deformable convolution operation is defined as follows: $y(p_0) = \sum_{k=1}^{K} w_k \cdot m_k \cdot x(p_0 + p_k + \Delta p_k)$; where $p_0$ is the output position, $p_k$ is the predefined k-th sampling point, $\Delta p_k$ is the learned spatial offset, $m_k$ is the modulation scalar used to suppress background noise, $w_k$ represents the kernel weights, and K=81 represents the number of sampling points for the 9×9 convolution.

4. The method for generating chest X-ray image reports based on retrieval enhancement as described in claim 1, characterized in that, in step S32, fusing the upsampled query with the visual feature representation specifically involves: first upsampling the query feature map by bilinear interpolation from the query space of size $H_q \times W_q$ to the visual feature space of size $H_v \times W_v$, the upsampled query being denoted as $Q_{up}$; then fusing the upsampled query features and the visual features $F_{vis}$ by element-wise addition: $F_{fuse} = Q_{up} + F_{vis}$; finally, applying a 9×9 deformable convolution to the fused features $F_{fuse}$ to perform cross-interaction and output the cross-interaction features $F_{cross}$.

5. The method for generating chest X-ray image reports based on retrieval enhancement as described in claim 1, characterized in that, in step S33, the hidden layer of the hybrid expert model layer has a dimension of 768, and the input is a two-dimensional feature map whose spatial locations are processed position by position; the layer contains 1 shared expert and 8 routing experts; the shared expert is always active and processes the general anatomical features across views; the selected routing experts process view-specific detailed features; the output of the hybrid expert model layer for the input feature x is calculated as follows: $y = \sum_{i=1}^{N_s} E_i^{s}(x) + \sum_{j \in \mathrm{TopK}(g(x))} g_j(x)\, E_j^{r}(x)$; where $g(x)$ is the routing probability output by the gating network, $\mathrm{TopK}(\cdot)$ selects the K=2 routing experts with the highest probability, $N_s$ is the number of shared experts, $E_i^{s}$ denotes the i-th shared expert, $E_j^{r}$ denotes the j-th routing expert, and $g_j(x)$ denotes the activation weight of the j-th routing expert.

6. The method for generating chest X-ray image reports based on retrieval enhancement as described in claim 1, characterized in that the training phase of the encoder and decoder proposed in steps S2 and S3 further includes introducing a semantic denoising training strategy to optimize the learnable query; during the model training phase, the semantic denoising training strategy is introduced to accelerate model convergence, specifically including: for each ground truth key phrase in the training batch, a text encoder is used to encode it into a semantic embedding vector, which serves as the ground truth value of the positive sample; words in the real key phrases are randomly replaced or deleted to generate semantically perturbed key phrases; random noise is added to the semantic embedding vector of the positive sample ground truth to generate a noisy query; the noisy query replaces the learnable query as the input to the decoder, and the denoised query representation is output; the semantic reconstruction loss between the denoised query representation and the positive sample ground truth, and the model's prediction confidence loss for the denoised query, are calculated; the expression for the semantic denoising loss is: $\mathcal{L}_{dn} = \dfrac{1}{N_{pos}} \sum_{i \in \mathcal{P}} \lVert \hat{e}_i - e_i \rVert_2 + \lambda_{conf}\, \mathcal{L}_{BCE}(\hat{p}, y)$; where $\mathcal{P}$ represents the set of positive sample noisy queries, containing $N_{pos}$ positive samples; $\hat{e}_i$ is the reconstructed query semantic embedding; $e_i$ is the corresponding semantic embedding ground truth; $\lVert \cdot \rVert_2$ indicates the L2 distance; $\mathcal{L}_{BCE}$ is the binary cross-entropy loss; the target label $y$ is 1 for queries generated from real phrases and 0 for queries generated from pure random noise; $\hat{p}$ represents the model's prediction confidence for the reconstructed query; $\lambda_{conf}$ is the confidence loss weight coefficient; gradient backpropagation: gradient backpropagation is performed based on the reconstruction loss and the prediction confidence loss to update the model parameters; the semantic denoising training and the main task training are performed alternately; in each training iteration, semantic denoising training is performed first, followed by the main task training; the final loss function for complete training is: $\mathcal{L}_{total} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{align}\mathcal{L}_{align} + \lambda_{dn}\mathcal{L}_{dn} + \lambda_{bal}\mathcal{L}_{bal} + \lambda_{con}\mathcal{L}_{con}$; where $\mathcal{L}_{cls}$ is the classification loss; $\mathcal{L}_{align}$ is the alignment loss; $\mathcal{L}_{dn}$ is the semantic denoising loss; $\mathcal{L}_{bal}$ is the load balancing loss used in training the hybrid expert model; $\mathcal{L}_{con}$ is the in-batch semantic contrastive loss; and $\lambda_{cls}$, $\lambda_{align}$, $\lambda_{dn}$, $\lambda_{bal}$, $\lambda_{con}$ are the weighting coefficients of the respective losses.

7. A chest X-ray image report generation device based on retrieval enhancement, characterized in that the device includes the following modules:
Retrieval database construction module: constructs a standardized retrieval database containing atomic clinical findings;
Multi-level visual feature extraction module: inputs a single chest X-ray image to be processed into a pre-trained visual representation model and extracts multi-scale visual features from its multiple intermediate levels; inputs the multi-scale visual features into a multi-level visual fusion encoder for fusion to generate a fused visual feature representation;
Multi-view feature retrieval module: inputs the fused visual feature representation, together with a learnable query (the two are connected in series), into a deformable convolutional decoder enhanced by a hybrid expert model; the deformable convolution enables adaptive interaction between the query and the visual features; the hybrid expert model layer automatically decouples the feature differences between the frontal and lateral views; the module predicts the semantic embeddings of key phrases and their confidences, and retrieves a set of matching key phrases from the standardized retrieval database based on the semantic embeddings; in the training phase of the encoder and decoder proposed in steps S2 and S3, a semantic denoising training strategy is introduced to optimize the learnable query;
Report generation module: inputs the key phrase set, the historical reports, and the key phrase sets predicted from other views into the large language model as explicit contextual constraints, to guide the large language model to generate chest X-ray image reports that conform to clinical standards and are consistent with the historical reports; the historical report is the previous report result of the individual to whom the current chest X-ray belongs; the key phrase set predicted from other views is obtained as follows: if the individual to whom the current chest X-ray belongs has chest X-ray images taken at the same time from other views, the corresponding key phrase set is extracted from them through steps S2 and S3.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the chest X-ray image report generation method based on retrieval enhancement as described in any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the chest X-ray image report generation method based on retrieval enhancement as described in any one of claims 1 to 6.