Breast ultrasound panoramic image diagnosis system and training method thereof, and breast multi-modal diagnosis method

By combining a CNN-Transformer visual encoder with bias-reduced contrastive learning and structured ultrasound report text, the problems of difficult ROI annotation and information ambiguity in breast ultrasound diagnosis were solved, achieving high-precision panoramic image diagnosis and improving the accuracy and interpretability of early breast cancer detection.

CN122265166A · Pending · Publication Date: 2026-06-23 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2026-03-05
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing breast ultrasound diagnostic techniques rely on manual ROI annotation, which is difficult and time-consuming. The visual feature extraction architecture is simplistic and struggles to capture global structural information. Single-modal diagnosis suffers from information ambiguity, and multimodal fusion methods lack deep interaction and semantic alignment, resulting in insufficient diagnostic accuracy.

Method used

A CNN-Transformer visual encoder is used to extract panoramic ultrasound image features. Combined with a text feature extraction module and a feature fusion module, a hybrid architecture is used to capture global and local features. Structured ultrasound report text is introduced as semantic guidance. Debiased contrastive learning is used to achieve deep alignment between image and text. A frequency domain interactive feature fusion method and a lightweight cross-attention mechanism are designed.

Benefits of technology

It achieves high-precision full-field image diagnosis without ROI annotation, improves the accuracy and interpretability of early breast cancer detection, simplifies the diagnostic process, and enhances the robustness and reliability of the model.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application belongs to the technical field of medical image processing, and discloses a breast ultrasound panoramic image diagnosis system, a training method thereof, and a breast multi-modal diagnosis method. The original panoramic ultrasound image, without manual region of interest (ROI) cropping, is input into a hybrid visual encoder, in which a GSA module captures long-range dependencies and an LMA module extracts the high-frequency detail features of the lesion edge; a context detail frequency fusion module transforms the global and local features to the frequency domain, where their real and imaginary parts interact and are fused; in the multi-modal alignment stage, a debiased contrastive loss function is introduced, which corrects the false-negative bias in negative sampling by estimating the class prior probability, so that image and text features are robustly aligned; finally, the multi-modal diagnosis result is output by fusing the dual-modal features. The present application does not rely on expensive manual ROI labeling, can directly perform high-precision automatic diagnosis on the full-field-of-view image, and greatly simplifies the clinical deployment process.

Description

Technical Field

[0001] This invention belongs to the field of medical image processing technology and, more specifically, relates to a breast ultrasound panoramic image diagnostic system and its training method, and a breast multimodal diagnostic method.

Background Technology

[0002] Breast cancer has become one of the most common malignant tumors among women worldwide, posing a serious threat to women's health. Clinical practice shows that early detection, early diagnosis, and early treatment are key to improving the survival rate and prognosis of breast cancer patients. Breast ultrasound (BUS), with its advantages of being non-invasive, radiation-free, low-cost, real-time, and highly sensitive to dense breast tissue, has become the preferred imaging tool for breast cancer screening and diagnosis.

[0003] However, despite the widespread clinical application of ultrasound technology, its diagnostic accuracy largely depends on the operator's experience and subjective judgment. Typical malignant tumors often present with irregular margins, irregular shapes, and posterior echo attenuation, but these features may not be obvious in early or atypical cases, or may overlap with the characteristics of benign lesions (such as fibroadenomas). Furthermore, the inherent speckle noise, artifacts, and low contrast of ultrasound images often make the boundaries of lesions indistinct. This high degree of subjectivity and limitations in image quality lead to significant inter- and intra-observer diagnostic variability, increasing the risk of missed and misdiagnosed cases.

[0004] To address these issues, computer-aided diagnosis (CAD) systems aim to analyze medical images objectively using computer algorithms, providing quantitative diagnostic suggestions to assist physicians in decision-making. Although deep learning-based CAD technology has made significant progress in recent years, existing breast ultrasound diagnostic techniques still face several major technical bottlenecks. First, the limitations of relying on manually labeled regions of interest (ROIs): most mainstream breast ultrasound CAD systems currently employ a two-stage processing flow, "detect / segment first, then classify." This means that before lesions are classified as benign or malignant, ROIs must be extracted manually or with pre-trained detection models. This ROI-dependent approach has significant drawbacks. Obtaining pixel-level, finely annotated ROIs requires considerable time and effort from experienced physicians, making the construction of large-scale, high-quality medical datasets extremely difficult. Furthermore, different physicians may delineate the same lesion boundary differently, and this subjective bias can be learned by the model, affecting its generalization ability. Moreover, focusing solely on the ROI inevitably discards the background tissue information surrounding the lesion. In breast ultrasound, global features such as distortion of the surrounding structures are often important indicators of malignancy, so ignoring the contextual information of the entire image reduces diagnostic accuracy. Therefore, developing an end-to-end diagnostic model that requires no ROI annotation and can directly process raw panoramic images is an urgent need in current technological development.

[0005] Second, the limitation of a single visual feature extraction architecture: In deep learning-based visual feature extraction, existing methods primarily rely on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs). Traditional CNNs, with their local receptive fields and weight-sharing mechanisms, possess translation invariance and inductive bias, making them highly adept at extracting local textures and edge details from images. However, limited by the size of the convolutional kernel, CNNs struggle to effectively establish long-range pixel dependencies, resulting in a weaker ability to capture global structural information across the entire image. ViTs, through their self-attention mechanism, can establish global long-range dependencies, excelling at capturing the overall contextual information of an image. Simply using either CNN or Transformer is insufficient to perfectly handle the complex pathological features in breast ultrasound images—requiring both macroscopic structural assessment and microscopic edge analysis. While some research has attempted to combine both, effectively integrating features from these two paradigms deeply across feature channels and spatial dimensions, particularly addressing their semantic-level differences, remains a significant technical challenge.

[0006] Third, the limitations of single-modal information: relying solely on the image modality for diagnosis inherently suffers from information ambiguity. In clinical practice, doctors not only observe images during diagnosis but also refer to the patient's ultrasound report text (such as textual descriptions of lesion shape, location, echo patterns, and blood flow signals). This textual information contains the doctor's high-level semantic understanding of the lesion, providing clear semantic anchors for ambiguous visual features. However, existing multimodal fusion methods are relatively simple; most merely concatenate image feature vectors and text feature vectors, lacking deep interaction and semantic alignment during feature extraction. Currently popular vision-language pre-trained models typically employ a contrastive learning framework. In the natural image domain, randomly selected negative samples are usually semantically unrelated. In the medical field, however, data exhibit "high intra-class variance and low inter-class variance." Samples from different patients in a batch may all show malignant tumors, so they are semantically highly similar and are in effect "true positive" samples for one another. If the standard contrastive learning loss function (InfoNCE) is applied directly, these semantically similar samples are treated as negative samples and forcibly pushed apart. This "false negative" problem severely damages the structure of the feature space and prevents the model from learning robust discriminative features.

Summary of the Invention

[0007] In view of the above-mentioned defects or improvement needs of the existing technology, the present invention provides a breast ultrasound panoramic image diagnostic system and its training method, as well as a breast multimodal diagnostic method, the purpose of which is to achieve high-precision full-field image diagnosis without the need for ROI annotation.

[0008] To achieve the above objectives, the present invention provides a breast ultrasound panoramic image diagnostic system, comprising: a CNN-Transformer visual encoder, a text feature extraction module, a feature fusion module, and a classification module. The CNN-Transformer visual encoder is used to extract image features from the original panoramic ultrasound image without manual region of interest (ROI) cropping; the text feature extraction module is used to obtain the text features contained in the structured clinical ultrasound text report corresponding to the original panoramic ultrasound image; the feature fusion module is used to fuse the image features and the text features to obtain fused features; the classification module is used to obtain multimodal diagnostic results based on the fused features. The CNN-Transformer visual encoder includes a convolutional neural network backbone and multiple cascaded encoding networks. The convolutional neural network backbone is used to extract shallow features from the original panoramic ultrasound image. Each encoding network includes several stacked FFD-Blocks, and each FFD-Block comprises a GSA module, an LMA module, and an SAGM module. The GSA module is used to perform channel-dimensional self-attention on the shallow features to generate an original attention matrix, and then to filter the original attention matrix by channel, selecting the features of the top k channels with the highest feature response values as global features. The LMA module is used to obtain a feature map containing local texture information based on the shallow features, to apply global average pooling to that feature map to obtain a pooled feature map, and, after taking the dot product of the two feature maps, to apply an activation function followed by pointwise convolution to generate local features. The SAGM module is used to perform multi-scale processing on the feature obtained by fusing the global features and the local features, and to adaptively adjust the weights of features at different scales to obtain the image features.

[0009] Furthermore, the FFD-Block also includes a context detail frequency fusion module, which is used to perform a two-stage fusion of the global features and the local features to obtain the fused features. The two-stage fusion includes the following. In the first stage, a spatial attention mask is computed from the global features and used to weight the local features, giving a first feature; a channel attention mask is computed from the local features and used to weight the global features, giving a second feature; the first and second features are summed to obtain the first-stage fusion feature. In the second stage, the first-stage fusion feature is passed through a linear layer to amplify its high-frequency signal, and the resulting feature tensor is projected from the spatial domain to the frequency domain using a Fast Fourier Transform (FFT). The real and imaginary parts of the frequency domain features are concatenated along the channel dimension and then made to interact across the entire frequency band through a learnable linear layer, giving the fused frequency domain features. Finally, the frequency domain features are restored to the spatial domain using an Inverse Fast Fourier Transform (IFFT) to obtain the fused features.
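The two-stage fusion described above can be illustrated with a minimal pure-Python sketch on one-dimensional signals. It is an illustration under stated assumptions, not the patented implementation: the sigmoid masks, the use of channel and position means to build them, the naive DFT standing in for an FFT, and the helper names `stage_one`, `dft`, and `stage_two` are all hypothetical, and the linear layer that amplifies high frequencies before the transform is omitted.

```python
import cmath
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stage_one(global_feat, local_feat):
    """Cross-weighting. Each feature is a list of channels; a channel is a list of positions."""
    n_ch, n_pos = len(global_feat), len(global_feat[0])
    # Spatial mask from the global features: one weight per position (assumed form).
    spatial = [sigmoid(sum(global_feat[c][p] for c in range(n_ch)) / n_ch)
               for p in range(n_pos)]
    # Channel mask from the local features: one weight per channel (assumed form).
    channel = [sigmoid(sum(local_feat[c]) / n_pos) for c in range(n_ch)]
    # First feature (masked local) plus second feature (masked global).
    return [[spatial[p] * local_feat[c][p] + channel[c] * global_feat[c][p]
             for p in range(n_pos)] for c in range(n_ch)]

def dft(x, inverse=False):
    """Naive discrete Fourier transform; an FFT would give the same values faster."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def stage_two(fused_channel, weight):
    """Frequency interaction for one channel.

    weight is a (2n x 2n) matrix acting on the concatenation [real parts, imaginary parts],
    standing in for the learnable linear layer.
    """
    spec = dft(fused_channel)
    n = len(spec)
    stacked = [v.real for v in spec] + [v.imag for v in spec]
    mixed = [sum(weight[i][j] * stacked[j] for j in range(2 * n)) for i in range(2 * n)]
    spec_out = [complex(mixed[k], mixed[n + k]) for k in range(n)]
    # Back to the spatial domain; the real part is kept as the fused feature.
    return [v.real for v in dft(spec_out, inverse=True)]
```

With an identity matrix as `weight`, `stage_two` reproduces its input, which is a convenient sanity check before substituting trained weights.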

[0010] Furthermore, the fusion feature is given by:

[0011]

[0012]

[0013] where the formula uses the sigmoid activation function and the dot product, PW represents pointwise convolution, and GELU represents the GELU activation function.

[0014] Furthermore, the original attention matrix is subjected to channel filtering using a diagnostic saliency operator, specifically including: the shallow features are linearly transformed and activated to generate cue guidance features; these cue guidance features are flattened into low-rank vectors, and the global average of these low-rank vectors is calculated to generate a global scaling factor; the original attention matrix is then filtered using this global scaling factor, retaining only the features from the k channels with the highest response values as the global features; wherein the global features are given by:

[0015] In the formula, the filtering operation is the diagnostic saliency operator, T represents matrix transpose, d represents the dimension of the feature channel, Softmax represents the Softmax function, and the Q matrix, K matrix, and V matrix are generated by pointwise convolution and depthwise convolution of the shallow features. The original attention matrix is generated by the dot product of the Q matrix and the K matrix.

[0016] Furthermore, the SAGM module includes parallel multi-scale convolutional branches and a gating mechanism, wherein the parallel multi-scale convolutional branches are used to perform multi-scale processing on the fused features so as to perceive lesions of different sizes, and the feature weights of different scales are adaptively adjusted through the gating mechanism to obtain the image features. The feature fusion module employs a lightweight cross-attention mechanism to fuse the image features and the text features. Specifically, it includes: using the image features as the query vector and the text features as the key and value vectors, calculating the image-text relevance weight, and integrating the text features into the image features according to the image-text relevance weight to obtain the fused features.
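The cross-attention fusion described above (image features as queries, text features as keys and values) can be sketched as follows. This is a minimal illustration only: it assumes scaled dot-product scoring, omits the learned query, key, and value projections, and uses a residual addition to "integrate" the text into the image features; none of these details is confirmed by the source, and the function name `cross_attention` is hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_tokens, text_tokens):
    """image_tokens act as queries; text_tokens act as keys and values.

    Both are lists of vectors with the same length d.
    Returns the fused tokens and the image-text relevance weights.
    """
    d = len(image_tokens[0])
    fused, relevance = [], []
    for q in image_tokens:
        # Image-text relevance: scaled dot product of the query with every text key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in text_tokens]
        w = softmax(scores)
        # Weighted sum of the text values.
        context = [sum(w[j] * text_tokens[j][i] for j in range(len(text_tokens)))
                   for i in range(d)]
        # Residual addition is an assumption about how the text is integrated.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        relevance.append(w)
    return fused, relevance
```

The returned relevance weights are the quantities one would inspect to see which text tokens influenced a given image token.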

[0017] Further, the clinical ultrasound text report undergoes structured processing, including: according to the BI-RADS standard, the clinical ultrasound text report is parsed into multiple fixed semantic partitions, and a special delimiter is inserted between adjacent semantic partitions to clarify the semantic boundaries; missing data items in each semantic partition are detected, and null value markers are inserted at the corresponding positions to obtain the structured clinical ultrasound text report. The semantic partitions include lesion physical size, lesion morphology and edge features, echo and calcification patterns, and surrounding tissue association and blood flow signals.

[0018] The present invention also provides a training method for a breast ultrasound panoramic image diagnostic system, wherein the breast ultrasound panoramic image diagnostic system is any one of the breast ultrasound panoramic image diagnostic systems described above, and the training method includes: the breast ultrasound panoramic image diagnostic system is trained end-to-end using a training dataset with category labels; the network parameters are updated through the backpropagation algorithm; when the loss converges, the trained breast ultrasound panoramic image diagnostic system is obtained. The loss function used for training includes a debiased contrastive loss, wherein the debiased contrastive loss is given by:

[0019] In the formula, the function s represents the calculation of similarity; I represents the current anchor sample, i.e., the image feature; the positive sample is the "true positive" paired sample corresponding to the anchor sample, i.e., the text feature; the negative samples are the text features that are not paired with the image feature within a batch, where N represents the number of samples in a batch; a false negative sample is a text feature in the negative sample set that is estimated to be semantically similar to the anchor sample; M is the number of summation terms used to estimate the expectation over the false negative samples; and the remaining two symbols denote the temperature coefficient and the preset class prior probability, the latter representing the probability estimate of a false negative sample in a randomly sampled set of unpaired samples.
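Because the loss equation itself is not reproduced in this text, the following sketch assumes the estimator commonly used in the debiased contrastive learning literature, which matches the quantities listed above (similarity s, N negatives, M false-negative terms, a temperature, and a class prior). The function name, argument names, default values, and the lower bound applied to the corrected negative term are assumptions, and the patent's own formula may differ.

```python
import math

def debiased_contrastive_loss(pos_sim, neg_sims, fn_sims, tau=0.1, class_prior=0.1):
    """Debiased contrastive loss for one image anchor (assumed form).

    pos_sim     : similarity between the image feature and its paired text feature
    neg_sims    : similarities to the N unpaired text features in the batch
    fn_sims     : similarities to the M text features estimated to be false negatives
    tau         : temperature coefficient
    class_prior : preset probability that a random unpaired sample is a false negative
    """
    n, m = len(neg_sims), len(fn_sims)
    pos = math.exp(pos_sim / tau)
    neg_mean = sum(math.exp(s / tau) for s in neg_sims) / n
    fn_mean = sum(math.exp(s / tau) for s in fn_sims) / m
    # Subtract the estimated false-negative contribution from the negative term.
    corrected = (neg_mean - class_prior * fn_mean) / (1.0 - class_prior)
    # Lower bound keeps the estimate positive (valid for similarities in [-1, 1]).
    corrected = max(corrected, math.exp(-1.0 / tau))
    return -math.log(pos / (pos + n * corrected))
```

Setting `class_prior=0.0` recovers the standard InfoNCE loss, so the prior can be read as a dial controlling how strongly suspected same-class samples are removed from the denominator.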

[0020] This invention also provides a method for diagnosing breast cancer using panoramic ultrasound images, comprising: The original panoramic ultrasound image of the patient to be diagnosed and the corresponding clinical ultrasound text report are input into the trained breast ultrasound panoramic image diagnostic system to obtain multimodal diagnostic results; wherein, the breast ultrasound panoramic image diagnostic system is any of the breast ultrasound panoramic image diagnostic systems described above.

[0021] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the training method for the breast ultrasound panoramic image diagnostic system as described above, and / or implements the breast ultrasound panoramic image diagnostic method as described above.

[0022] The present invention also provides a computer program product, comprising a computer program that, when the computer program is run on a computer, causes the computer to execute the training method of the breast ultrasound panoramic image diagnostic system described above, and / or execute the breast ultrasound panoramic image diagnostic method described above.

[0023] In summary, the above-described technical solutions conceived in this invention can achieve the following beneficial effects: (1) This invention designs a hybrid CNN-Transformer visual encoder to capture both global and local features simultaneously, and introduces structured ultrasound report text as semantic guidance. It utilizes debiased contrastive learning to achieve deep alignment between images and text, thereby achieving high-precision full-field image diagnosis without the need for ROI annotation. Specifically, the system of this invention directly takes the original full-field image containing the complete anatomical background as input, abandoning the cumbersome and expensive manual ROI annotation process in traditional methods. It uses text semantic guidance to replace the traditional spatial ROI guidance. By converting natural language reports into structured semantic features, when cross-attention is used to fuse image features and text features in the fusion stage, these text features act as "semantic anchors." For example, when the text contains the description of "irregular shape," the model will automatically search for the corresponding visual region in the image through the cross-attention mechanism. This strong semantic supervision signal is sufficient to guide the model to focus on the lesion area, thereby achieving diagnostic performance comparable to models using fine ROI annotation without using pixel-level spatial annotation.

[0024] To achieve precise lesion localization in complex backgrounds, this invention deeply simulates the clinical process of reviewing ultrasound images. When interpreting ultrasound images, doctors first quickly scan the entire image, searching for obvious structural abnormalities or echo changes. After identifying suspicious areas, they then carefully observe the edge features, internal echo textures, and other subtle characteristics of those areas. This invention designs a hybrid CNN-Transformer architecture, employing a parallel collaborative mechanism between the GSA module and the Local Morphological Attention (LMA) module. GSA utilizes the global modeling capabilities of the Transformer to simulate a "global scan," identifying potential abnormal areas (such as structural distortions) through global channel scanning, equivalent to a doctor's "rough look." LMA focuses on depicting local texture details, equivalent to a doctor's "close inspection." Combined with the SAGM module, the model can automatically focus on lesions of different scales from the panoramic image, achieving truly "ROI-free" high-precision diagnosis.

[0025] (2) Furthermore, a deep fusion method based on frequency domain interaction was designed. This invention addresses the significant differences in semantics and frequency domain distribution between CNN features (local, high-frequency) and Transformer features (global, low-frequency), proposing a Contextual Detail Frequency Fusion (CDFF) module. It is not limited to simple splicing or addition in the spatial domain, but introduces frequency domain analysis. Through Fast Fourier Transform, features are decomposed into real and imaginary parts, allowing global contextual information to interact with local detail information in the frequency domain. This design is based on the physical characteristics of ultrasound images—the spiculated edges of malignant tumors correspond to high-frequency components, while the smooth boundaries of benign tumors correspond to low-frequency components. Fusion in the frequency domain can more effectively preserve these key diagnostic frequency features. The CDFF module acts as the brain's associative mechanism, integrating the outputs of the GSA and LMA modules at the frequency domain level to form a complete lesion cognition.

[0026] (3) Further, when calculating the global features, in order to suppress the large number of background channels representing normal glands or fat layers, the original attention matrix is filtered, once calculated, by a purpose-designed diagnostic saliency operator. Only the k key channels with the highest response values are retained, and redundant channels representing normal tissue background are suppressed. This enables the model to dynamically filter background noise and focus on sparse feature channels with diagnostic significance, so as to achieve rapid global localization of pathological features. When calculating the local features, spatial pooling and activation functions are used to adaptively enhance the key local features, thereby effectively compensating for the lack of global features in perceiving micro-textures while suppressing background noise such as normal tissues.

[0027] (5) Furthermore, the structured text preprocessing of this invention enhances interpretability. The CGSA method uses delimiters to divide the report into independent partitions that conform to clinical logic and fills missing items with null markers. This processing not only standardizes the input but also enhances the interpretability of the model. By analyzing the cross-attention weights in the final feature fusion stage, it is possible to trace which part of the clinical description the model's prediction is based on (e.g., based on "morphology" or "blood flow"), thereby increasing doctors' trust in the AI diagnostic results.

[0028] (6) Furthermore, in order to address the sampling bias problem in medical data (the false negative problem caused by the "high intra-class similarity" that is common in medical image datasets), a debiased contrastive loss between image features and text features is designed to replace the standard InfoNCE loss function. This loss function mathematically modifies the negative sample term, explicitly subtracting the estimated probability contribution of false negative samples (i.e., data that are actually of the same class but are treated as negative samples). Specifically, this is achieved by introducing the prior class distribution probability: when calculating the denominator of the contrastive loss, the contribution of the estimated false negative samples (i.e., semantically similar cases that were wrongly treated as negative samples) is subtracted from the sum over negative samples. This correction eliminates the bias in the optimization process, corrects the direction of gradient optimization, and enables the model to pull similar samples closer together, learn more compact and robust feature representations, and significantly improve classification accuracy.

[0029] In summary, this invention can directly process raw full-field breast ultrasound images without manually labeled regions of interest (ROIs). By incorporating clinical text reports as semantic guidance, it utilizes a hybrid CNN-Transformer architecture to extract multi-scale visual features and combines debiased contrastive learning technology to achieve deep alignment between image and text features. This enables high-precision automated diagnosis across the entire field of view, greatly simplifying the clinical deployment process. This invention is applicable to breast ultrasound screening scenarios, particularly for the early detection and diagnosis of tumors in dense breast tissue.

Attached Figure Description

[0030] Figure 1 is a flowchart illustrating the overall architecture of the multimodal breast cancer diagnostic framework proposed in this invention; Figure 2 is a detailed internal structure diagram of the FFD-Block; Figure 3 is a detailed flowchart of the Context Detail Frequency Fusion (CDFF) module; Figure 4 is a schematic diagram of the multimodal fusion module; Figure 5 is a schematic diagram of the "twin dataset" constructed in this invention.

Detailed Implementation

[0031] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0032] In one implementation of the present invention, a breast ultrasound full-field image diagnostic system based on multimodal fusion is provided, the specific implementation process of which is as follows: (1) Overall system architecture. As shown in Figure 1, the core processing flow of the breast ultrasound full-field image diagnostic system based on multimodal fusion provided in this embodiment is divided into two parallel streams, a visual feature extraction stream and a text feature extraction stream, followed by feature fusion and classification modules. The system acquires the patient's original panoramic ultrasound image and the corresponding ultrasound text report. The original panoramic ultrasound image (i.e., the image without manual region of interest (ROI) cropping, kept as the original panoramic image and only subjected to uniform size adjustment and normalization) first passes through a Stem module (the backbone network of a convolutional neural network), which extracts low-level (shallow) features. Subsequently, the shallow features enter a multi-level encoder to obtain visual features, each level containing several stacked FFD-Blocks. The text stream input is the unstructured ultrasound examination report text. First, it is preprocessed using a Clinically Guided Structured Approach (CGSA) to obtain a structured semantic sequence. Then, the structured semantic sequence is input into a pre-trained medical domain language model (PubMedBERT) to extract high-dimensional text feature vectors containing rich medical professional semantics. Visual features and text features are aligned through debiased contrastive learning and fused through a cross-attention feature fusion module (Modal-Fusion) to obtain fused features. Finally, the fused features pass through a classification module, which outputs the benign / malignant classification result; in this embodiment, the classification module is a fully connected layer.

[0033] (2) Data preprocessing details. All input breast ultrasound images were uniformly resized to 256x256 pixels. To enhance the model's robustness, data augmentation strategies were employed during training, including random brightness and contrast adjustments, random horizontal and vertical flipping, and random rotations ranging from -15° to +15°. These operations simulated the diversity of clinical acquisitions.
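The augmentation strategy above can be sketched as a parameter sampler plus a simple application step. Only the flip options and the ±15° rotation range come from the text; the 0.5 flip probability, the brightness and contrast ranges of 0.8 to 1.2, and the function names are assumptions made for illustration. The rotation angle is only sampled here, on the assumption that the rotation itself would be applied by an image library.

```python
import random

def sample_augmentation(rng):
    """Draw one set of augmentation parameters."""
    return {
        "hflip": rng.random() < 0.5,            # assumed probability
        "vflip": rng.random() < 0.5,            # assumed probability
        "angle_deg": rng.uniform(-15.0, 15.0),  # rotation range stated in the text
        "brightness": rng.uniform(0.8, 1.2),    # assumed range
        "contrast": rng.uniform(0.8, 1.2),      # assumed range
    }

def apply_flips_and_intensity(image, params):
    """Apply flips, then contrast and brightness, to an image given as rows of floats in [0, 1]."""
    rows = [row[::-1] for row in image] if params["hflip"] else [row[:] for row in image]
    if params["vflip"]:
        rows = rows[::-1]
    mean = sum(map(sum, rows)) / (len(rows) * len(rows[0]))
    out = []
    for row in rows:
        # Contrast scales deviations from the mean; brightness scales the result.
        out.append([min(1.0, max(0.0, ((v - mean) * params["contrast"] + mean)
                                 * params["brightness"]))
                    for v in row])
    return out
```

Passing a seeded `random.Random` instance as `rng` makes a training run reproducible.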

[0034] The original ultrasound report text is free text, for example: "Hypoechoic nodule in the right breast, approximately 15 mm in size, irregular in shape, with spiculated edges, no obvious blood flow." This embodiment uses a Clinically Guided Structured Approach (CGSA) to convert it into a structured sequence. Based on the BI-RADS standard (Breast Imaging Reporting and Data System), the unstructured free text report is parsed and reconstructed into a structured semantic sequence containing specific delimiters [SEP] and blank fillers ([NULL]). In this embodiment, the text is mapped to the following four partitions.

[0035] Lesion and size: extract the maximum diameter and aspect ratio. Shape, orientation, and edges: extract shape (round / irregular), orientation (parallel / non-parallel), and edges (smooth / not smooth). Echo and calcification: extract echo type (low / high / mixed), posterior echo (attenuation / enhancement), and calcification status (microcalcification / coarse calcification). Surrounding structures and blood flow: extract structural distortion and blood flow grading. This structured processing enables the model to clearly distinguish clinical features across different dimensions.
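The four partitions above, together with the [SEP] delimiter and the [NULL] filler, can be assembled as in the sketch below. It assumes the findings have already been extracted from the free text into a dictionary; the field names, the partition keys, and the function name `to_structured_sequence` are hypothetical and chosen only for illustration.

```python
SEP, NULL = "[SEP]", "[NULL]"

# Partition layout following the four groups described above (field names are hypothetical).
PARTITIONS = [
    ("lesion_and_size", ["max_diameter", "aspect_ratio"]),
    ("shape_orientation_edges", ["shape", "orientation", "edges"]),
    ("echo_and_calcification", ["echo_type", "posterior_echo", "calcification"]),
    ("surroundings_and_blood_flow", ["structural_distortion", "blood_flow_grade"]),
]

def to_structured_sequence(findings):
    """Build the delimiter-separated sequence, filling missing items with [NULL]."""
    parts = []
    for _, fields in PARTITIONS:
        values = []
        for field in fields:
            value = findings.get(field)
            values.append(NULL if value is None else str(value))
        parts.append(" ".join(values))
    return f" {SEP} ".join(parts)

# Findings extracted by hand from the example report in the text.
example = {
    "max_diameter": "15 mm",
    "shape": "irregular",
    "edges": "spiculated",
    "echo_type": "low",
    "blood_flow_grade": "no obvious blood flow",
}
print(to_structured_sequence(example))
```

Because every partition always emits the same number of slots, a downstream tokenizer sees a fixed layout regardless of which items the radiologist actually reported.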

[0036] The structured sequence is explicitly divided into feature partitions with independent clinical semantics, such as "lesion physical size", "morphological characteristics", "echo and calcification characteristics", and "surrounding tissue association and blood flow signal", as shown in Table 1.

Table 1. Preprocessing logic of the Clinically Guided Structured Approach (CGSA) of this invention

[0037] (3) Construction and implementation of the hybrid CNN-Transformer visual encoder FFD-Net. This embodiment details the internal logic of the core component of the visual encoder, the FFD-Block, as shown in Figure 1. First, the shallow features of the image are extracted using the Stem block of a convolutional neural network. In this embodiment, the Stem module of FFD-Net consists of the first two residual stages of ResNet (a type of CNN), which downsample the input image and extract basic features. The dimension of the output shallow features is expressed in terms of H, W, and C, where H and W represent the height and width of the image (both 256 pixels in this embodiment) and C represents the number of channels. The features then enter a multi-level encoding stage, with each level containing several stacked FFD-Blocks. In this embodiment, the encoder contains 4 stages, each stacking 2 FFD-Blocks (i.e., the per-stage block counts are {2,2,2,2}), used to extract image features. Within each FFD-Block, two feature extraction branches are executed in parallel. The Global Structural Attention (GSA) branch calculates a self-attention map along the channel dimension to model long-range dependencies between feature channels, capturing global anatomical information of the image; at the same time, the dynamic Diagnostic Saliency Operator (DSO) adaptively calculates channel importance weights, filtering out redundant channels representing background noise and retaining pathological feature channels with diagnostic significance. The Local Morphological Attention (LMA) branch uses a depthwise separable convolutional network to extract high-frequency local morphological details such as spiculations and microcalcifications at lesion edges. The Contextual Detail Frequency Fusion (CDFF) module deeply fuses the features from the GSA and LMA branches. The fusion process includes cross-paradigm alignment and frequency domain deep fusion, achieving complementarity between global semantics and local texture at the frequency domain level. Finally, the Scale Adaptive Gating (SAGM) module performs multi-scale processing on the fused features: lesions of different sizes are perceived through parallel multi-scale convolutional branches (such as 3x3 and 5x5 convolutional kernels), and the feature weights of different scales are adaptively adjusted through a gating mechanism.

[0038] Figure 2 is a detailed internal structure diagram of the FFD-Block, and the specific implementation is as follows. (3.1) Implementation of the Global Structural Attention (GSA) module. The GSA module establishes long-range dependencies by computing self-attention along the channel dimension. Specifically, the input feature F (the shallow feature output by the Stem module) is layer-normalized (LN) and then processed by pointwise convolution and depthwise convolution to generate the query (Q), key (K), and value (V) matrices. The dot product of Q and K gives the attention map (the original attention matrix). To suppress the large number of background channels that represent normal glands or fat layers, this embodiment designs a Diagnostic Saliency Operator (DSO): the input feature F is passed through global average pooling and two fully connected layers to generate a channel weight vector, which modulates the original attention matrix. This factor drives a Top-k filtering of the original attention matrix that retains only the key channels with the highest response values (the most prominent features), thereby achieving rapid global localization of pathological features. The filtered attention is applied as shown in the following formula: $F_g = \mathrm{Softmax}\left(\mathcal{T}_k\left(\frac{QK^{T}}{\sqrt{d}}\right)\right)V$; where $F_g$ represents the global features; $\mathcal{T}_k$ represents the Top-k selection operation, i.e., the Diagnostic Saliency Operator (DSO), which lets the model focus on the sparse feature channels with diagnostic significance; T represents the matrix transpose; d represents the dimension of the feature channels; and Softmax represents the Softmax function.
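As a non-limiting illustration, the Top-k channel attention of paragraph [0038] can be sketched in NumPy. This is a sketch under stated assumptions rather than the embodiment's implementation: the ReLU and sigmoid activations in `dso_weights`, the column-wise way the weight vector modulates the attention matrix, the reading of d as the length of each flattened channel vector, and all function and variable names are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dso_weights(f, w1, w2):
    """Channel weights from global average pooling plus two fully connected layers.

    f: (C, L) flattened features. ReLU and sigmoid are assumed activations.
    """
    pooled = f.mean(axis=1)                       # global average pooling -> (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)         # first FC layer + ReLU
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second FC layer + sigmoid -> (C,)


def topk_channel_attention(q, k, v, channel_w, top_k):
    """q, k, v: (C, L). Returns global features (C, L) and sparse weights (C, C)."""
    d = q.shape[1]                                 # assumed meaning of d
    attn = (q @ k.T) / np.sqrt(d)                  # original attention matrix, (C, C)
    attn = attn * channel_w[None, :]               # assumed modulation: scale each key channel
    kth = np.sort(attn, axis=-1)[:, -top_k][:, None]  # k-th largest score in each row
    masked = np.where(attn >= kth, attn, -np.inf)  # discard everything below it
    weights = softmax(masked, axis=-1)             # masked entries receive weight 0
    return weights @ v, weights


rng = np.random.default_rng(0)
C, L, R = 8, 16, 4                                 # channels, positions, hidden width
f = rng.standard_normal((C, L))
q, k, v = rng.standard_normal((3, C, L))
w = dso_weights(f, rng.standard_normal((R, C)), rng.standard_normal((C, R)))
f_g, weights = topk_channel_attention(q, k, v, w, top_k=3)
print(f_g.shape, (weights > 0).sum(axis=1))
```

Masking with negative infinity before the Softmax gives the discarded channels a weight of exactly zero, so each output channel is a mixture of only its k most responsive channels.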

[0039] (3.2) Implementation of the Local Morphology Attention (LMA) module. The LMA module focuses on local high-frequency details. It uses depthwise separable convolution to confine the receptive field to a local neighborhood, enabling precise extraction of high-frequency morphological details such as spiculations and microcalcifications at the edges of lesions in breast ultrasound images. The input feature F passes through a linear layer, a depthwise convolution and a pointwise convolution to give a feature map $X$ containing local texture information. To highlight key edge information, global average pooling (GAP) is applied to $X$ to obtain the feature map $X'$. The dot product of $X$ and $X'$ is passed through the GELU activation function and then a pointwise convolution to generate the local features. The formulas are as follows: $X' = \mathrm{GAP}(X)$; $F_l = \mathrm{PW}(\mathrm{GELU}(X \odot X'))$; where $F_l$ indicates the local features, $\odot$ represents the dot product, and PW represents pointwise convolution. This enhances the expression of high-frequency morphological details such as spiculations and microcalcifications at the edge of lesions, and compensates for the global branch's weak perception of micro-texture.
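As a non-limiting illustration, the LMA data flow of paragraph [0039] can be sketched in NumPy. The sketch rests on assumptions: the linear layer is modelled as a channel-mixing matrix, the depthwise kernel is 3x3 with zero padding, GELU uses its tanh approximation, and every name and shape below is hypothetical.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def depthwise_conv3x3(x, kernels):
    """x: (C, H, W); kernels: (C, 3, 3). Zero padding keeps H and W unchanged."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def pointwise_conv(x, weight):
    """1x1 convolution. x: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def local_morphology_attention(f, w_in, dw, pw1, pw2):
    x = pointwise_conv(f, w_in)                        # linear layer over channels
    x = pointwise_conv(depthwise_conv3x3(x, dw), pw1)  # local texture feature map
    pooled = x.mean(axis=(1, 2), keepdims=True)        # GAP -> (C, 1, 1)
    return pointwise_conv(gelu(x * pooled), pw2)       # gate, activate, project


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
params = (rng.standard_normal((C, C)), rng.standard_normal((C, 3, 3)),
          rng.standard_normal((C, C)), rng.standard_normal((C, C)))
f_l = local_morphology_attention(rng.standard_normal((C, H, W)), *params)
print(f_l.shape)
```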

[0040] In this embodiment of the invention, spatial pooling and activation functions are used to adaptively enhance these key local features, thereby effectively compensating for the deficiency of global features in perceiving micro-textures while suppressing background noise such as normal tissue.

[0041] (3.3) Implementation of the Context Detail Frequency Fusion (CDFF) module. To address the significant differences in semantics and frequency-domain distribution between the local features extracted by the CNN and the global features extracted by the Transformer, the FFD-Block incorporates a CDFF module, as shown in Figure 3. This module employs a two-stage fusion strategy based on frequency-domain interaction. The first stage performs cross-paradigm alignment: a spatial attention mask computed from the global features generated by GSA weights the local features generated by LMA, while a channel attention mask computed from the local features generated by LMA weights the global features generated by GSA. The two kinds of features thus complement each other in both the spatial and channel dimensions.

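[0042] $F_1 = \sigma(\mathrm{SA}(F_g)) \odot F_l$; $F_2 = \sigma(\mathrm{CA}(F_l)) \odot F_g$; $F_a = F_1 + F_2$; where $F_g$ and $F_l$ are the global and local features, $\mathrm{SA}(\cdot)$ and $\mathrm{CA}(\cdot)$ denote the spatial and channel attention mask computations, $\sigma$ represents the sigmoid activation function, $\odot$ represents the dot product, and $F_a$ represents the features after the first stage of cross-paradigm alignment.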

[0043] In the second stage, deep frequency fusion is performed, which is one of the core innovations of this embodiment. To exploit frequency-domain information, the aligned features $F_a$ from the first stage are first passed through a linear layer that amplifies the high-frequency signal; the amplified feature tensor is then projected from the spatial domain to the frequency domain by a fast Fourier transform (FFT) and decomposed into real and imaginary components.

[0044] $X_{re} + i\,X_{im} = \mathrm{FFT}(\mathrm{Linear}(F_a))$; The model concatenates the real and imaginary parts along the channel dimension and lets them interact across the entire frequency band through learnable linear layers. This couples global low-frequency structural information (such as lesion contours) with local high-frequency detail information (such as edge spiculations) in the frequency domain. Finally, the fused frequency-domain features are restored to the spatial domain by an inverse fast Fourier transform (IFFT), which outputs the fused features $F_f$.

[0045] $F_f = \mathrm{IFFT}(\mathrm{Linear}(\mathrm{Concat}(X_{re}, X_{im})))$; where Concat indicates the concatenation operation.
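As a non-limiting illustration, the two CDFF stages of paragraphs [0041] to [0045] can be sketched in NumPy. This rests on assumptions: the spatial mask is taken as the sigmoid of the channel mean and the channel mask as the sigmoid of the spatial mean (the embodiment's mask sub-networks are not reproduced here), the linear layers are plain channel-mixing matrices, the real part of the inverse transform is kept, and all names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_paradigm_align(f_g, f_l):
    """Stage 1. f_g, f_l: (C, H, W) global and local features."""
    spatial_mask = sigmoid(f_g.mean(axis=0, keepdims=True))       # (1, H, W), from global features
    channel_mask = sigmoid(f_l.mean(axis=(1, 2), keepdims=True))  # (C, 1, 1), from local features
    return spatial_mask * f_l + channel_mask * f_g                # sum of the two weighted features


def frequency_fuse(f_a, w_high, w_freq):
    """Stage 2. w_high: (C, C) pre-FFT linear layer; w_freq: (2C, 2C) frequency-domain mixer."""
    c = f_a.shape[0]
    x = np.einsum("oc,chw->ohw", w_high, f_a)                  # linear layer before the FFT
    spec = np.fft.fft2(x, axes=(-2, -1))                       # spatial -> frequency domain
    stacked = np.concatenate([spec.real, spec.imag], axis=0)   # concat real/imag on channels
    mixed = np.einsum("oc,chw->ohw", w_freq, stacked)          # interaction across the whole band
    fused_spec = mixed[:c] + 1j * mixed[c:]                    # back to a complex spectrum
    return np.fft.ifft2(fused_spec, axes=(-2, -1)).real        # frequency -> spatial domain


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
f_g = rng.standard_normal((C, H, W))
f_l = rng.standard_normal((C, H, W))
f_a = cross_paradigm_align(f_g, f_l)
f_f = frequency_fuse(f_a, rng.standard_normal((C, C)), rng.standard_normal((2 * C, 2 * C)))
print(f_f.shape)
```

With identity matrices for both linear layers, `frequency_fuse` reduces to an FFT followed by its inverse and returns its input unchanged, which is a convenient sanity check on the round trip.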

[0046] (3.4) Implementation of the Scale Adaptive Gating Module (SAGM). A scale-adaptive gating module is attached at the end of the FFD-Block to process the fused features. After being added to the shallow features output by the Stem module and normalized, the fused features are processed by parallel 3x3 and 5x5 convolutional branches with a gating mechanism, which replace the feed-forward network (FFN) of a conventional Transformer. This allows the weights of the different receptive fields to be adjusted adaptively, so that the model can capture lesions of various scales at once, from tiny nodules a few millimeters in diameter to large masses several centimeters in size.
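As a non-limiting illustration, the parallel 3x3 and 5x5 branches with gating can be sketched in NumPy. This rests on assumptions: both branches are modelled as depthwise convolutions, the gate is a per-channel softmax over the globally pooled branch responses, the residual addition and normalization are omitted for brevity, and all names are hypothetical.

```python
import numpy as np


def depthwise_conv(x, kernels):
    """x: (C, H, W); kernels: (C, k, k) with odd k. Zero padding keeps H and W."""
    _, h, w = x.shape
    k = kernels.shape[-1]
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def scale_adaptive_gate(x, k3, k5, gate_w):
    """gate_w: (2, C) assumed learnable gate parameters, one row per branch."""
    b3 = depthwise_conv(x, k3)                       # small receptive field
    b5 = depthwise_conv(x, k5)                       # larger receptive field
    pooled = np.stack([b3.mean(axis=(1, 2)), b5.mean(axis=(1, 2))])  # (2, C)
    logits = gate_w * pooled
    logits = logits - logits.max(axis=0, keepdims=True)
    gates = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over branches
    out = gates[0][:, None, None] * b3 + gates[1][:, None, None] * b5
    return out, gates


rng = np.random.default_rng(0)
C, H, W = 4, 16, 16
x = rng.standard_normal((C, H, W))
out, gates = scale_adaptive_gate(
    x,
    rng.standard_normal((C, 3, 3)),
    rng.standard_normal((C, 5, 5)),
    rng.standard_normal((2, C)),
)
print(out.shape, gates.sum(axis=0))
```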

[0047] (4) Specific implementation of multimodal alignment and debiased contrastive learning (DCL). This embodiment details how the "false negative" problem in medical data is solved.

[0048] (4.1) Problem definition. After obtaining the feature representations of the images and the text, this embodiment improves diagnostic accuracy by constructing a multimodal joint learning framework. Medical image data commonly exhibit "high intra-class similarity" (i.e., malignant tumors from different patients are highly similar in both image and text description). The traditional contrastive learning loss function (InfoNCE) therefore tends to treat these semantically similar but unpaired samples as negative samples, which leads to a "false negative" sampling bias. Specifically: $\mathcal{L}_{InfoNCE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s(v_i, t_i)/\tau)}{\sum_{j=1}^{N}\exp(s(v_i, t_j)/\tau)}$; where $\mathcal{L}_{InfoNCE}$ represents the traditional contrastive learning loss function. In a medical batch, if sample j and sample i both contain malignant tumors, they are semantically similar: $s(v_i, t_j)$ (which expresses the similarity between $v_i$ and $t_j$) can be very high, yet InfoNCE still pushes the two samples apart, which confuses the model. N represents the number of samples in a batch; $v_i$ represents the i-th image feature; $t_i$ represents the i-th text feature; $t_j$ represents the j-th text feature; and $\tau$ represents the temperature coefficient (a hyperparameter).
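As a non-limiting illustration, the false negative effect described in paragraph [0048] can be reproduced with a small NumPy sketch of InfoNCE. Cosine similarity as the function s, the temperature of 0.07 and the toy feature vectors are assumptions chosen for the example; none of them is taken from the embodiment.

```python
import numpy as np


def info_nce(img, txt, tau=0.07):
    """img, txt: (N, D) paired features; row i of img is paired with row i of txt."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                            # s(v_i, t_j) / tau, shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # positives sit on the diagonal


# Toy batch: samples 0 and 1 stand for two malignant cases with identical features,
# sample 2 for a benign case. Features are already perfectly aligned across modalities.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
loss = info_nce(feats, feats)
print(loss)
```

Even with perfectly aligned features the loss cannot approach zero here, because each malignant anchor is penalized for being close to the other malignant report; each of those two rows contributes roughly log 2.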

[0049] (4.2) Derivation and application of the DCL loss function. This invention introduces the Debiased Contrastive Loss (DCL). It assumes that negative samples (samples randomly selected as negatives) are drawn from the data distribution (the set of samples in a batch), and that with probability $\tau^{+}$ they are in reality positive samples (false negatives). The debiased contrastive loss in this embodiment of the invention is therefore computed as follows: when calculating the denominator of the contrastive loss, the unpaired samples within the batch are not all simply regarded as negative samples. Instead, a preset class prior probability ($\tau^{+}$) is used to estimate the proportion of the negative sample set that belongs to the same pathological category as the anchor sample (false negatives). The estimated contribution of these false negatives is subtracted from the sum over negative samples in the denominator, so that only the true negative samples are pushed away in the feature space and the semantic compactness of samples of the same type is preserved.

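[0050] In this embodiment of the invention, the expectation term in the denominator is decomposed and corrected: $\mathbb{E}_{T\sim p}\left[e^{s(I,T)/\tau}\right] = \tau^{+}\,\mathbb{E}_{T\sim p^{+}}\left[e^{s(I,T)/\tau}\right] + (1-\tau^{+})\,\mathbb{E}_{T\sim p^{-}}\left[e^{s(I,T)/\tau}\right]$; where $\mathbb{E}_{T\sim p}$ is the expectation over all samples in the sample set, $\mathbb{E}_{T\sim p^{+}}$ is the expectation over the positive samples in the sample set, and $\mathbb{E}_{T\sim p^{-}}$ is the expectation over the negative samples in the sample set; I is the anchor sample, s is the similarity function, $\tau$ is the temperature coefficient, and $\tau^{+}$ is the class prior probability.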

[0051] This leads to the corrected negative sample term, which is then substituted into the loss function to correct the direction of gradient optimization: $\mathcal{L}_{DCL} = -\log\frac{e^{s(I,T^{+})/\tau}}{e^{s(I,T^{+})/\tau} + \frac{N}{1-\tau^{+}}\left(\frac{1}{N}\sum_{j=1}^{N} e^{s(I,T_j^{-})/\tau} - \tau^{+}\,\frac{1}{M}\sum_{k=1}^{M} e^{s(I,T_k^{+})/\tau}\right)}$; where $\mathcal{L}_{DCL}$ is the debiased contrastive loss designed for this invention; the function s calculates similarity; I represents the current anchor sample, i.e., the extracted multi-paradigm fused image features; $T^{+}$ represents the "true positive" paired sample corresponding to the anchor sample, i.e., the extracted structured text features; $\tau$ is the temperature coefficient, a hyperparameter in contrastive learning that controls the smoothness of the probability distribution and the model's attention to difficult samples; $\tau^{+}$ is the predefined class prior probability, i.e., the estimated probability that a sample in a randomly sampled set of unpaired samples actually belongs to the same pathological category as the anchor sample (a false negative); $T_j^{-}$ are the unpaired text features within the batch, all of which the traditional InfoNCE loss treats as negative samples; N is the total number of unpaired samples (initial negative samples) that participate in the summation in the denominator; $T_k^{+}$ are the text features in the negative sample set that are estimated to be semantically similar to the anchor sample (belonging to the same disease category), i.e., false negative samples; and M is the number of summation terms used to estimate the expectation over the false negative samples.

[0052] This correction effectively removes the influence of false negative samples from the denominator, allowing the model to push away only the truly negative samples (such as between benign and malignant), while maintaining the clustering of similar samples in the feature space.
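As a non-limiting illustration, the correction of paragraphs [0049] to [0052] can be sketched for a single anchor. This rests on assumptions: the caller supplies the similarities of the samples it has estimated to be false negatives, the corrected negative term is clamped at exp(-1/tau) as a numerical safeguard that this document does not describe, and the temperature, prior and similarity values below are made up for the example.

```python
import numpy as np


def debiased_contrastive_loss(sim_pos, sim_neg, sim_fn, tau=0.5, tau_plus=0.1):
    """Debiased contrastive loss for one anchor.

    sim_pos: s(I, T+), similarity to the paired text.
    sim_neg: N similarities to the unpaired texts in the batch.
    sim_fn:  M similarities to texts estimated to be false negatives.
    """
    pos = np.exp(sim_pos / tau)
    n = len(sim_neg)
    neg_mean = np.mean(np.exp(np.asarray(sim_neg) / tau))  # biased negative estimate
    fn_mean = np.mean(np.exp(np.asarray(sim_fn) / tau))    # false negative estimate
    g = (neg_mean - tau_plus * fn_mean) / (1.0 - tau_plus)  # corrected negative term
    g = max(g, np.exp(-1.0 / tau))                          # assumed clamp, keeps g positive
    return -np.log(pos / (pos + n * g))


sim_pos = 0.9
sim_neg = [0.85, 0.1, -0.2]  # the first unpaired text resembles the anchor
sim_fn = [0.9]               # stand-in estimate for the false negative similarity
biased = debiased_contrastive_loss(sim_pos, sim_neg, sim_fn, tau_plus=0.0)
debiased = debiased_contrastive_loss(sim_pos, sim_neg, sim_fn, tau_plus=0.1)
print(biased, debiased)
```

Setting `tau_plus` to zero recovers the uncorrected denominator; a positive prior removes part of the look-alike sample's contribution, so the anchor is penalized less for being close to it.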

[0053] (5) Feature fusion and classifier design. In the feature fusion stage, the feature fusion module uses a lightweight cross-attention mechanism to fuse visual and textual features, as shown in Figure 4. The image features (the output of FFD-Net) serve as the query vectors, and the text features (the output of PubMedBERT) serve as the key and value vectors. Image-text relevance weights are calculated, and based on these weights, key diagnostic semantics from the text (such as "irregular shape") are dynamically injected into the visual features, generating a multimodal feature representation that incorporates semantic guidance. In this embodiment of the invention, a joint loss function is used: $\mathcal{L} = \mathcal{L}_{Focal} + \lambda\,\mathcal{L}_{DCL}$; the classification task uses Focal Loss ($\mathcal{L}_{Focal}$) to address the imbalance between benign and malignant samples, $\mathcal{L}_{DCL}$ is the cross-modal alignment DCL loss, and $\lambda$ is a balancing factor; experiments have shown that setting it to 0.1 yields the best results. The model is trained end-to-end using a training dataset labeled with benign and malignant categories. Network parameters are updated via backpropagation, enabling the model to simultaneously learn accurate classification decision boundaries and consistent cross-modal semantic representations.
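As a non-limiting illustration, the cross-attention fusion of paragraph [0053] can be sketched in NumPy with image tokens as queries and text tokens as keys and values. A single attention head, the residual addition used to inject the attended text into the image tokens, and all names, shapes and random weights are assumptions of the sketch.

```python
import numpy as np


def cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v):
    """img_tokens: (Ni, D) queries; txt_tokens: (Nt, D) keys and values."""
    q = img_tokens @ w_q
    k = txt_tokens @ w_k
    v = txt_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])              # image-text relevance scores
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    fused = img_tokens + weights @ v                     # assumed residual injection
    return fused, weights


rng = np.random.default_rng(0)
D = 8
img_tokens = rng.standard_normal((6, D))   # e.g. flattened visual features
txt_tokens = rng.standard_normal((4, D))   # e.g. one token per report partition
w_q, w_k, w_v = rng.standard_normal((3, D, D)) / np.sqrt(D)
fused, weights = cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how strongly one image token draws on each text token, which is one way to inspect which report partition guided a given visual feature.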

[0054] (6) Model training configuration and evaluation metrics. The training and inference process in this embodiment is based on Python 3.8 and the PyTorch 2.4 framework, and the hardware environment includes two NVIDIA GeForce RTX 4090 D graphics processors. To accommodate the differences between pre-trained weights and newly added layers, a differentiated learning rate strategy is adopted: the backbone network (Stem) uses a relatively low learning rate, while the newly designed FFD-Net layers and fusion modules use a relatively high initial learning rate. The learning rates are decayed with a cosine annealing strategy. Training uses the AdamW optimizer with a batch size of 16 and a weight decay factor of 0.05, for a total of 100 epochs. The loss function is a weighted combination of the Focal Loss for the classification task ($\mathcal{L}_{Focal}$) and the DCL loss for cross-modal alignment ($\mathcal{L}_{DCL}$); experiments have shown that setting the weight coefficient $\lambda$ to 0.1 yields the best results.
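As a non-limiting illustration, the joint loss and the differentiated, cosine-annealed learning rates of paragraph [0054] can be sketched with the Python standard library. Only the weight of 0.1 and the 100 epochs come from this document; the focal loss parameters `gamma` and `alpha`, the two base learning rates and all function names are hypothetical placeholders.

```python
import math


def focal_loss(p_malignant, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample; gamma and alpha are assumed values."""
    p_t = p_malignant if label == 1 else 1.0 - p_malignant
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def joint_loss(l_focal, l_dcl, lam=0.1):
    """Classification loss plus weighted cross-modal alignment loss."""
    return l_focal + lam * l_dcl


def cosine_annealed_lr(base_lr, epoch, total_epochs=100, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


# Hypothetical base rates: lower for the pre-trained Stem, higher for the new layers.
BASE_LR = {"stem": 1e-5, "new_layers": 1e-4}

for epoch in (0, 50, 100):
    rates = {name: cosine_annealed_lr(lr, epoch) for name, lr in BASE_LR.items()}
    print(epoch, rates)
```

Both parameter groups follow the same cosine curve, so the ratio between the Stem rate and the new-layer rate stays constant throughout training.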

[0055] The model performance evaluation employed a rigorous "twin dataset" validation paradigm, comparing the diagnostic performance for the same group of patients on full-view images (the input of this invention) and manually annotated ROI images (the traditional input), as shown in Figure 5. The first column is the original image; the second column is the original full-field panoramic image with the complete anatomical background preserved, representing the input of this invention; and the third column is the manually annotated ROI image containing only the lesion area, representing the input of the traditional method. This figure is used to compare the two data formats visually. Considering the small dataset size, five-fold cross-validation was used to ensure the reliability of the experimental results and to minimize the adverse effects of random data partitioning, and the average over the folds was taken as the final experimental result. The evaluation metrics used in the experiment included AUC (area under the ROC curve), ACC (accuracy), Specificity, Precision, Recall, and F1-score. The experiment demonstrates that this invention, by designing a hybrid CNN-Transformer visual encoder to simultaneously capture global and local features, and introducing structured ultrasound report text as semantic guidance, utilizes debiased contrastive learning to achieve deep alignment between image and text, thereby achieving high-precision full-field image diagnosis without the need for ROI annotation.
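As a non-limiting illustration, the threshold-based metrics of paragraph [0055] and their averaging over folds can be sketched with the Python standard library. AUC is left out because it needs predicted scores rather than hard labels; the function names and the toy labels below are made up for the example and are not experimental data.

```python
def binary_metrics(y_true, y_pred):
    """ACC, specificity, precision, recall and F1, with malignant encoded as 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero on empty classes

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "acc": ratio(tp + tn, len(pairs)),
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def average_over_folds(fold_metrics):
    """Mean of each metric across the cross-validation folds."""
    return {k: sum(m[k] for m in fold_metrics) / len(fold_metrics) for k in fold_metrics[0]}


# Toy labels for two folds (illustrative only).
fold_a = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
fold_b = binary_metrics([1, 0, 0, 1], [1, 0, 0, 1])
print(average_over_folds([fold_a, fold_b]))
```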

[0056] In practical applications, the breast ultrasound panoramic image of the patient to be diagnosed and the corresponding ultrasound text report are input into the trained breast ultrasound panoramic image diagnostic system to obtain the multimodal diagnostic result, namely the benign or malignant classification of the breast lesion.

[0057] In summary, the core design concept of this invention stems from in-depth simulation of the clinical physician's image reading process and profound insight into the characteristics of medical data.

[0058] (1) Simulate the cognitive process of “global scanning-local focusing” When interpreting ultrasound images, doctors first quickly scan the entire image, looking for obvious structural abnormalities or echo changes. After identifying suspicious areas, they carefully observe the edge features, internal echo textures, and other subtle characteristics of those areas. To simulate this process, this invention designs a hybrid CNN-Transformer architecture. The GSA module utilizes the global modeling capabilities of the Transformer to simulate a "global scan," filtering out feature channels related to the lesion through channel attention. The LMA module utilizes the local perception capabilities of the CNN to simulate "local focusing," characterizing the morphological details of the lesion. The CDFF module acts as the brain's associative mechanism, integrating these two types of information at the frequency domain level to form a complete understanding of the lesion.

[0059] (2) Using the constraint of "semantic consistency" to compensate for "visual ambiguity". The visual features of ultrasound images are often ambiguous and uncertain (for example, some benign nodules may also appear hypoechoic). However, physician text reports are high-level semantic summaries of these visual features after professional judgment. This invention argues that multimodal fusion is not merely the superposition of information, but rather the use of the deterministic semantics of text to constrain and clarify the visual features of images. By constructing a joint learning framework for images and text, the model learns that "blurred dark areas in the image" correspond to "hypoechoic nodules in the text," thereby using textual information to eliminate visual uncertainty.

[0060] (3) Use semantic guidance instead of spatial guidance. Traditional CAD relies on expensive ROI annotation to tell the model "where to look." This invention proposes a paradigm shift: using readily available ultrasound reports to guide the model. Although the text does not directly provide the coordinates of the lesion, its descriptive features (such as "edge spiculation") correspond to specific regions in the image. Through debiased contrastive learning and cross-attention mechanisms, the model can automatically discover these correspondences, thereby maintaining high-precision localization and diagnostic capabilities while reducing annotation costs.

[0061] (4) Mathematical corrections to adapt to the distribution characteristics of medical data Addressing the high incidence of false negatives in medical data, this invention avoids blindly using general-purpose deep learning components. Instead, it starts from the underlying mathematical principles of the loss function, introducing debiased contrastive learning. By estimating class prior probabilities, it corrects the false negative bias in negative sample sampling, achieving robust alignment of image and text features. This approach embodies the concept of "data-driven algorithm design," meaning that algorithm design must respect and adapt to the inherent distribution characteristics of the data, eliminating sampling bias through mathematical corrections, thereby unleashing the potential of deep learning models on small-sample medical data.

[0062] This invention also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the methods described above.

[0063] Specifically, the memory may include high-speed random access memory, as well as non-volatile memory, such as hard disks, RAM, plug-in hard disks, smart media cards (SMC), secure digital (SD) cards, flash cards, at least one disk storage device, flash memory device, or other volatile solid-state storage devices.

[0064] The relevant technical solutions are the same as above, and will not be repeated here.

[0065] This invention also provides a computer program product, including a computer program that, when run on a computer, causes the computer to perform the steps of the methods described in the above embodiments.

[0066] The relevant technical solutions are the same as above, and will not be repeated here.

[0067] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A panoramic ultrasound imaging diagnostic system for breast cancer, characterized in that it comprises: a CNN-Transformer visual encoder, a text feature extraction module, a feature fusion module, and a classification module; the CNN-Transformer visual encoder is used to extract image features from the original panoramic ultrasound image without manual region of interest (ROI) cropping; the text feature extraction module is used to obtain the text features contained in the structured clinical ultrasound text report corresponding to the original panoramic ultrasound image; the feature fusion module is used to fuse the image features and the text features to obtain fused features; the classification module is used to obtain multimodal diagnostic results based on the fused features; the CNN-Transformer visual encoder includes a convolutional neural network backbone and multiple cascaded encoding networks; the convolutional neural network backbone is used to extract shallow features from the original panoramic ultrasound image; the encoding networks include several stacked FFD-Blocks, each FFD-Block comprising: a GSA module, used to perform channel-dimensional self-attention on the shallow features to generate an original attention matrix, and then to filter the original attention matrix by channel, selecting the features of the top k channels with the highest feature response values as the global features; an LMA module, used to obtain a feature map containing local texture information based on the shallow features, to apply global average pooling to that feature map to obtain a pooled feature map, and to take the dot product of the two feature maps, apply an activation function, and then apply pointwise convolution to generate the local features; and a SAGM module, used to perform multi-scale processing on the features obtained by fusing the global features and the local features, and to adaptively adjust the weights of the features at different scales, to obtain the image features.

2. The breast ultrasound panoramic imaging diagnostic system according to claim 1, characterized in that the FFD-Block also includes a context detail frequency fusion module; the context detail frequency fusion module is used to perform a two-stage fusion of the global features and the local features to obtain the features after fusion of the global features and the local features; the two-stage fusion includes: in the first stage, a spatial attention mask is computed based on the global features and used to weight the local features, giving the first feature; a channel attention mask is computed based on the local features and used to weight the global features, giving the second feature; the first and second features are summed to obtain the fusion feature of the first stage; in the second stage, the fusion feature of the first stage is passed through a linear layer to amplify the high-frequency signal, and the resulting feature tensor is projected from the spatial domain to the frequency domain using a Fast Fourier Transform (FFT); the real and imaginary parts of the frequency domain features are concatenated along the channel dimension and then interact across the entire frequency band through a learnable linear layer to obtain the fused frequency domain features; finally, the fused frequency domain features are restored to the spatial domain using an Inverse Fast Fourier Transform (IFFT) to obtain the fused features.

3. The breast ultrasound panoramic imaging diagnostic system according to claim 2, characterized in that the fusion feature of the first stage is computed by a formula in which $\sigma$ represents the sigmoid activation function, $\odot$ represents the dot product, PW represents pointwise convolution, and GELU represents the GELU activation function.

4. The breast ultrasound panoramic imaging diagnostic system according to any one of claims 1-3, characterized in that channel filtering of the original attention matrix is performed using a diagnostic saliency operator, specifically including: the shallow features are linearly transformed and activated to generate cue guidance features; these cue guidance features are flattened into low-rank vectors, and the global average of these low-rank vectors is calculated to generate a global scaling factor; the original attention matrix is then filtered using this global scaling factor, retaining only the features from the k channels with the highest response values as the global features; wherein the global features $F_g$ are: $F_g = \mathrm{Softmax}\left(\mathcal{T}_k\left(\frac{QK^{T}}{\sqrt{d}}\right)\right)V$; in the formula, $\mathcal{T}_k$ is the diagnostic saliency operator, T represents matrix transpose, d represents the dimension of the feature channel, Softmax represents the Softmax function, and the Q matrix, K matrix, and V matrix are generated by pointwise convolution and depthwise convolution of the shallow features; the original attention matrix is generated by the dot product of the Q matrix and the K matrix.

5. The breast ultrasound panoramic imaging diagnostic system according to claim 1, characterized in that the SAGM module includes parallel multi-scale convolutional branches and a gating mechanism; the parallel multi-scale convolutional branches are used to perform multi-scale processing on the fused features so as to perceive lesions of different sizes, and the feature weights of the different scales are adaptively adjusted through the gating mechanism to obtain the image features; the feature fusion module employs a lightweight cross-attention mechanism to fuse the image features and the text features, specifically including: using the image features as the query vectors and the text features as the key and value vectors, calculating the image-text relevance weights, and integrating the text features into the image features according to the image-text relevance weights to obtain the fused features.

6. The breast ultrasound panoramic imaging diagnostic system according to claim 5, characterized in that, The structured processing of the clinical ultrasound text report includes: According to the BI-RADS standard, the clinical ultrasound text report is parsed into multiple fixed semantic partitions, and a special delimiter is inserted between each semantic partition to clarify the semantic boundaries; missing data items in each semantic partition are detected, and null value markers are inserted at the corresponding positions to obtain the structured clinical ultrasound text report. The semantic partitioning includes lesion physical size, lesion morphology and edge features, echo and calcification patterns, surrounding tissue association and blood flow signals.

7. A training method for a breast ultrasound panoramic image diagnostic system, characterized in that the breast ultrasound panoramic image diagnostic system is the breast ultrasound panoramic image diagnostic system according to any one of claims 1-6, and the training method includes: the breast ultrasound panoramic image diagnostic system is trained end-to-end using a training dataset with category labels; the network parameters are updated through the backpropagation algorithm; when the loss converges, the trained breast ultrasound panoramic image diagnostic system is obtained; the loss function used for training includes a debiased contrastive loss, wherein the debiased contrastive loss $\mathcal{L}_{DCL}$ is: $\mathcal{L}_{DCL} = -\log\frac{e^{s(I,T^{+})/\tau}}{e^{s(I,T^{+})/\tau} + \frac{N}{1-\tau^{+}}\left(\frac{1}{N}\sum_{j=1}^{N} e^{s(I,T_j^{-})/\tau} - \tau^{+}\,\frac{1}{M}\sum_{k=1}^{M} e^{s(I,T_k^{+})/\tau}\right)}$; in the formula, the function s represents the calculation of similarity; I represents the current anchor sample, i.e., the image feature; $T^{+}$ represents the "true positive" paired sample corresponding to the anchor sample, i.e., the text feature; $T_j^{-}$ are the text features within a batch that are not paired with the image features, and N represents the number of samples in a batch; $T_k^{+}$ is a false negative sample, that is, a text feature in the negative sample set that is estimated to be semantically similar to the anchor sample; M is the number of summation terms used to estimate the expectation of the false negative samples; $\tau$ is the temperature coefficient; and $\tau^{+}$ is the preset class prior probability, representing the probability estimate of a false negative sample in a randomly sampled set of unpaired samples.

8. A method for diagnosing breast cancer using panoramic ultrasound images, characterized in that it comprises: inputting the original panoramic ultrasound image of the patient to be diagnosed and the corresponding clinical ultrasound text report into the trained breast ultrasound panoramic image diagnostic system to obtain multimodal diagnostic results; wherein the breast ultrasound panoramic image diagnostic system is the breast ultrasound panoramic image diagnostic system according to any one of claims 1-6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the program is executed by the processor, it implements the training method of the breast ultrasound panoramic image diagnostic system as described in claim 7, and / or the breast ultrasound panoramic image diagnostic method as described in claim 8.

10. A computer program product, characterized in that it includes a computer program that, when run on a computer, causes the computer to perform the training method for the breast ultrasound panoramic image diagnostic system of claim 7, and / or perform the breast ultrasound panoramic image diagnostic method of claim 8.