An image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance

By employing a cosine consistency screening and dual-stream complementary semantic-guided image segmentation method, the problem of insufficient feature interaction in prostate cancer MRI images was solved, thereby improving segmentation accuracy and lesion localization capabilities.

CN122244075A · Pending · Publication Date: 2026-06-19 · HUAIYIN INSTITUTE OF TECHNOLOGY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: HUAIYIN INSTITUTE OF TECHNOLOGY
Filing Date: 2026-03-20
Publication Date: 2026-06-19

Abstract

An image segmentation method based on cosine consistency filtering and dual-stream complementary semantic guidance includes: extracting shallow features from the input image; computing cross-branch consistency metrics from cosine similarity in the channel and spatial dimensions, and introducing lightweight channel attention to obtain channel importance scores; and jointly ranking the consistency and importance scores to form a cosine consistency sparse filtering module (CSSB), which suppresses redundant information and background noise and improves the semantic quality of the shallow discriminative features. In the cross-scale fusion stage, a dual-stream complementary multi-scale semantic guidance module (FBAC) is constructed: it generates foreground and background attention maps from the high-level fusion features, explicitly decomposes deep semantics into a foreground enhancement stream and a background suppression stream, uses multi-scale dilated convolution to adaptively model contextual information from different receptive fields, and re-injects the result into the decoder through a residual connection to enhance lesion localization and boundary detail restoration. The invention can further improve the overall accuracy and stability of image segmentation tasks.

Description

Technical Field

[0001] This invention relates to the fields of medical image segmentation and deep learning technology, specifically to an image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance.

Background Technology

[0002] Prostate cancer has become the second most common cancer among men worldwide and is rapidly rising to become the fifth leading cause of cancer death in men. With an aging population and unhealthy lifestyles, the incidence of prostate cancer is increasing year by year, and the age of onset is trending younger, making early screening for prostate cancer essential. With the rapid development of artificial intelligence technology, deep neural network-based medical image segmentation methods have achieved excellent performance in segmenting various organs and lesions. However, in the complex scenario of prostate cancer, many challenges remain, requiring further improvement and optimization of the algorithm.

[0003] Convolutional neural networks model local texture and spatial hierarchy efficiently, while attention mechanisms compensate for the shortcomings of pure convolution in global dependency modeling and fine-grained boundary characterization by adaptively focusing on key regions and discriminative features in the channel, spatial and cross-scale dimensions. Attention-based methods significantly improve segmentation performance in global modeling, detail enhancement and small target localization, but they rely on complex multi-module stacking and fine-grained feature fusion design, resulting in high computational and parameter overhead. Furthermore, they still have shortcomings in cross-dataset generalization and in robustness to the boundaries of small, irregularly shaped lesions.

[0004] Multi-scale feature fusion, by extracting and interacting features at different resolutions and receptive fields, enables the model to simultaneously preserve shallow texture details and deep semantic context. It is an important strategy for improving the segmentation of small targets, blurred boundaries, and structures with varied morphologies. However, these multi-scale feature fusion strategies often have limitations in the sufficiency of feature interaction and integration, failing to optimize the balance between global semantics and local details during the decoding process, thus affecting the fine segmentation of irregular or blurred boundaries.

[0005] Medical image segmentation aims to accurately separate target tissues from the background in images, which is a key technology for achieving accurate lesion diagnosis and intraoperative operation assistance. The above methods improve the discrimination ability in scenes with blurred boundaries, small targets and noise interference by explicitly or implicitly modeling the foreground and background and promoting complementary information fusion. However, most of them rely on additional strategies and are more sensitive to the quality of intermediate results, which can easily introduce error accumulation. There is still room for improvement towards a simpler and more robust integrated modeling approach.

[0006] While existing methods have made some progress in medical image segmentation, they still fail to fully utilize the potential of techniques such as multi-scale feature combination, feature similarity filtering, and attention mechanisms. This deficiency means that models may not be able to comprehensively capture and integrate key features when processing complex prostate cancer images, thus affecting the accuracy and reliability of diagnosis.

Summary of the Invention

[0007] Existing medical image segmentation methods, when applied to prostate cancer MRI images, suffer from significant background-noise interference in shallow features and from insufficient multi-scale feature interaction, which lead to blurred boundaries and inaccurate localization of small lesions. To address these technical problems, this technical solution provides an image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance. The method uses a joint metric of channel importance and cosine similarity to select the top 50% most relevant features in both the channel and spatial dimensions, preserving shallow discriminative regions and suppressing redundant features. In addition, it explicitly models low-level local detail information and high-level semantic information and employs multi-scale contextual information to produce guidance features that steer the decoder's recovery of shallow feature maps, thereby effectively solving the above technical problems.

[0008] This invention is achieved through the following technical solution:

[0009] An image segmentation method based on cosine consistency filtering and dual-stream complementary semantic guidance includes the following steps:

[0010] The backbone architecture of an MRI image segmentation network model is constructed based on the classic encoder-decoder paradigm. Its core structure includes an encoder, a bottleneck layer, a convolutional decoder unit, and a cross-fusion layer. The first two layers of the encoder extract shallow features from the input image.

[0011] A cosine consistency sparse filtering module (CSSB) is constructed and embedded in the intermediate layer of the encoder of the segmentation network model. Cross-branch consistency is obtained by calculating the cosine similarity of the dual-branch features in the channel dimension and the spatial dimension, and lightweight channel attention is introduced to obtain channel importance scores. The consistency scores and importance scores are then jointly ranked to decide which features are retained, which completes the cosine consistency sparse filtering module (CSSB).

[0012] In the deep and shallow feature interaction stage of the segmentation network model, a dual-stream complementary multi-scale semantic guidance module (FBAC) is constructed and connected. The FBAC module is configured to generate a foreground attention map from high-resolution features and construct a complementary background attention map to form a dual-stream branch. Semantic guidance features are obtained through dynamic multi-scale modeling to guide the decoder to recover details. This completes the construction of the entire MRI image segmentation network model.

[0013] Furthermore, the backbone architecture of the MRI image segmentation network model includes:

[0014] (1) The shallow layer of the encoder is the first layer of the encoder. It contains a double convolution block (two 3×3 convolutional layers with BN normalization and ReLU activation), which is used to obtain early local detail features that are relatively similar within a local area;

[0015] (2) The intermediate layer of the encoder is the second and third layers of the encoder, which embeds the cosine consistent sparse filtering module CSSB, which is used to perform filtering on shallow features of the same scale in the two branches;

[0016] (3) The deep layer of the encoder is the fourth layer of the encoder, which contains double convolution, and passes the filtered features to the next stage;

[0017] (4) The bottleneck layer is located in the deep and shallow feature interaction stage between the encoder and decoder;

[0018] (5) The convolutional decoder unit is used to receive features and gradually restore the spatial resolution of the image. It combines skip connections to achieve the step-by-step restoration of features.

[0019] (6) The cross-fusion layer is a dual-stream complementary multi-scale semantic guidance module FBAC between the bottleneck layer of the encoder and the shallow layer of the decoder, which is used for the guidance and fusion of foreground and background features.

[0020] Furthermore, the encoder is used to perform multi-scale feature extraction on the input MRI image, and the shallow / first layer of the encoder contains a double convolutional layer to obtain early local details that are more similar; the decoder is used to receive the features and gradually restore the spatial resolution.

[0021] Furthermore, the specific construction and feature processing steps of the backbone architecture of the MRI image segmentation network model include:

[0022] S1. Construct the overall segmentation network CSFBNet, with an overall structure based on an encoder-decoder architecture; the input image size is 256×256, and the encoder uses a convolutional module for feature extraction;

[0023] S2. Introduce a CSSB module in the shallow layer of the encoder: construct two parameter-independent double convolutional branches to obtain dual-branch features of the same scale, and perform joint selection based on the cosine consistency of the two features in the channel and spatial dimensions and on the channel importance weights; after completing the selection of Top-K channels and Top-M spatial positions, perform mask weighting, concatenation, and 1×1 convolutional fusion, and make a residual connection with the initial features to obtain filtered and enhanced shallow features;

[0024] S3. Introduce the FBAC module at the cross-scale semantic interaction position: Input the high-resolution features output by the encoder and the low-resolution deep features of the bottleneck layer into the FBAC module, generate a foreground-background complementary attention map based on the high-resolution features; after channel alignment and upsampling of the deep features, construct the foreground stream and background stream, and perform multi-scale modeling through adaptive dilated convolution (ADC); then perform complementary fusion of the two stream features, and inject them back into the high-resolution fused features in a residual manner to form an enhanced cross-scale semantic representation;

[0025] S4. The multi-scale features processed by CSSB and FBAC are input into the decoder, and the features are upsampled step by step through the decoding path and fused with skip connection features to output the final segmentation result.

[0026] Furthermore, the CSSB module is configured to perform channel-level and spatial pixel-level consistency filtering on shallow features of the same scale in both branches, and use channel importance scores for sorting judgment, thereby retaining shallow discrimination regions and suppressing redundant features; then the filtered features are passed to the bottleneck layer of the encoder through the fourth layer of double convolution for fusion in the next step of the FBAC module.

[0027] Furthermore, the Cosine Consistency Sparse Filtering (CSSB) module is used to perform consistency filtering and redundant feature suppression on similar features of early dual-path shallow signals. Its specific feature extraction and processing methods include the following steps:

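[0028] 1) Extract initial feature information from the first layer of the encoder through two independent double-convolution paths, and then downsample each path to obtain the feature tensors, as shown in the following formula:

[0029] $F_a = \mathrm{Down}(\mathrm{DoubleConv}_a(X)), \quad F_b = \mathrm{Down}(\mathrm{DoubleConv}_b(X)), \quad F_a, F_b \in \mathbb{R}^{B \times C \times H \times W}$;

[0030] where $X$ denotes the initial features from the first layer of the encoder, $\mathbb{R}$ is the real number field, $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, respectively; the kernel size ([size]) and the convolution stride are those of the downsampling convolution; Down indicates downsampling; DoubleConv represents two 3×3 convolution layers plus BN normalization and ReLU activation functions, and the subscripts $a$ and $b$ mark the two parameter-independent branches.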

[0031] 2) Next, the two branch features $F_a$ and $F_b$ are linearly projected and aligned to obtain the feature representations used for the consistency calculation, as shown in the following formula:

[0032] $\hat{F}_a = W_p F_a, \quad \hat{F}_b = W_p F_b, \quad \hat{F}_a, \hat{F}_b \in \mathbb{R}^{B \times C \times N}, \quad N = H \times W$;

[0033] where $\hat{F}_a$ and $\hat{F}_b$ are the features of the two independent path branches after linear projection, alignment, and flattening of the spatial dimensions; for the $c$-th channel, $\hat{F}_a$ and $\hat{F}_b$ are flattened in the spatial dimension into vectors $a_c$ and $b_c$ of length $N$, which are used for the subsequent cosine similarity calculation; $W_p$ is a learnable linear projection matrix, implemented in the network through a 1×1 convolution with shared weights, which performs information exchange and alignment of the two features along the channel dimension.

[0034] The spatial response of each channel is vectorized, and the cosine similarity between the two branches on that channel is calculated to obtain the channel consistency score, as shown in the following formula:

[0035] $s_c = \dfrac{\langle a_c, b_c \rangle}{\lVert a_c \rVert_2 \, \lVert b_c \rVert_2 + \epsilon}, \quad c \in \{1, \dots, C\}, \quad a_c, b_c \in \mathbb{R}^{N}, \quad N = H \times W$;

[0036] where $s_c$ indicates the cosine similarity of the two branch features on the $c$-th channel, which measures the consistency of the spatial response distribution of that channel; the larger the value, the more consistent the representation of the two features on that channel; $\{1, \dots, C\}$ is the channel index set and $c$ denotes the $c$-th channel; $a_c$ and $b_c$ are the feature vectors of the two branches on the $c$-th channel, flattened over the spatial dimensions; $N$ is the product of the height $H$ and the width $W$ of the feature map, i.e. the total number of pixels in a single feature channel; $\epsilon$ is a small constant that prevents division by zero.
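The channel consistency score described in this step can be illustrated with a minimal pure-Python sketch. The function names `cosine` and `channel_consistency`, the nested-list tensor layout, the value of `eps`, and the toy inputs are assumptions made for illustration only; they are not taken from the filing.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity of two equal-length vectors; eps guards the denominator."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + eps)

def channel_consistency(feat_a, feat_b, eps=1e-8):
    """feat_a, feat_b: lists of C channels, each an H x W nested list.
    Returns one cosine score per channel."""
    scores = []
    for ch_a, ch_b in zip(feat_a, feat_b):
        flat_a = [v for row in ch_a for v in row]  # flatten H x W into length N
        flat_b = [v for row in ch_b for v in row]
        scores.append(cosine(flat_a, flat_b, eps))
    return scores

# Toy input (made-up values): 2 channels of size 2 x 2.
# Channel 0 of branch b is a scaled copy of branch a; channel 1 is orthogonal.
feat_a = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 0.0]]]
feat_b = [[[2.0, 4.0], [6.0, 8.0]], [[0.0, 1.0], [0.0, 0.0]]]
scores = channel_consistency(feat_a, feat_b)
```

In this toy case the first channel scores close to 1 (the two branches agree up to scale) and the second scores 0 (the responses do not overlap), which is the behavior the consistency score is meant to capture.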

[0037] A channel description vector is constructed from the two branch features by weighted fusion, layer normalization, and global average pooling. Channel importance scores are then obtained through a two-layer MLP and Softmax, as shown in the formula below:

[0038] $z = \mathrm{GAP}\big(\alpha \, \mathrm{LN}(\hat{F}_a) + (1 - \alpha) \, \mathrm{LN}(\hat{F}_b)\big), \quad w = \mathrm{Softmax}\big(W_2 \, \mathrm{GELU}(W_1 z)\big)$;

[0039] where $w$ represents the channel attention score; Softmax is the normalization function; $W_1$ and $W_2$ are learnable parameters whose linear mappings are implemented by 1×1 convolutions; GELU is the activation function; $z$ describes the global importance of the channels; GAP represents global average pooling and LN represents the layer normalization function; $\alpha$ and $1 - \alpha$ are the respective weights of the layer-normalized features $\hat{F}_a$ and $\hat{F}_b$ of the two branches;

[0040] 3) Multiply the channel consistency score by the channel importance weight to obtain a comprehensive score, sort the channels in descending order of comprehensive score, and select the Top-$K$ channel index set $\mathcal{I}_K$, as shown in the following formula:

[0041] $u_c = w_c \cdot s_c, \quad \pi = \mathrm{Argsort}_{\downarrow}(u_1, \dots, u_C), \quad \mathcal{I}_K = \{\pi_1, \dots, \pi_K\}, \quad K = C/2$;

[0042] where $w_c$ indicates the importance score of the $c$-th channel; $s_c$ indicates the cosine similarity of the two branches on the $c$-th channel; $u_c$ represents the combined score that fuses the importance score and the consistency score of the channel; $\pi$ is the channel ranking obtained by arranging the comprehensive scores of all channels in descending order; Argsort is the sorting function and $\downarrow$ denotes descending order; $\mathcal{I}_K$ selects, from each of the two path branches, the $K$ top-scoring feature groups that are both important and consistent; $K$ indicates the number of high-scoring channels to be retained; here the top 50% of the total number of channels $C$ is used, i.e. $K = C/2$; the resulting set of top-scoring channel indexes is denoted $\mathcal{I}_K$;
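The joint ranking in this step can be sketched as follows. This is a hedged illustration: the function name `select_top_k_channels`, the `keep_ratio` parameter, the guard that keeps at least one channel, and the score values are all hypothetical; only the product-then-sort-then-keep-top-half logic follows the text.

```python
def select_top_k_channels(importance, consistency, keep_ratio=0.5):
    """Rank channels by importance * consistency and keep the top fraction.

    importance, consistency: one score per channel.
    Returns (kept channel indices in rank order, combined scores).
    """
    combined = [w * s for w, s in zip(importance, consistency)]
    # Argsort in descending order of the combined score.
    order = sorted(range(len(combined)), key=lambda c: combined[c], reverse=True)
    # Assumption: always keep at least one channel.
    k = max(1, int(len(combined) * keep_ratio))
    return order[:k], combined

# Made-up scores for 4 channels.
importance = [0.10, 0.40, 0.30, 0.20]
consistency = [0.90, 0.20, 0.80, 0.70]
kept, combined = select_top_k_channels(importance, consistency)
```

With these values, channel 1 has the highest importance but is dropped because the two branches disagree on it, while channels 2 and 3 are kept. This shows why the ranking uses the product of the two scores rather than either score alone.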

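[0043] 4) After completing the Top-K channel filtering, take the channel vectors $p_{ij}$ and $q_{ij}$ of the two branches at each spatial location $(i, j)$, and calculate the spatial cosine similarity to obtain the spatial consistency map $S$, as shown in the following formula:

[0044] $S(i, j) = \dfrac{\langle p_{ij}, q_{ij} \rangle}{\lVert p_{ij} \rVert_2 \, \lVert q_{ij} \rVert_2 + \epsilon}$;

[0045] where $p_{ij}, q_{ij} \in \mathbb{R}^{K}$ are the vectors formed by the $K$ retained channels of the two branches at location $(i, j)$, and $\epsilon$ is a small constant that prevents division by zero;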

[0046] The spatial consistency map $S$ is flattened into a vector of length $H \times W$, and Top-M spatial location selection is performed to obtain the set $\Omega$. A binary mask $\mathcal{M}$ is then constructed from this set, specifically as shown in the following formula:

[0047] $\Omega = \mathrm{ArgTopM}\big(\mathrm{vec}(S)\big), \quad \mathcal{M}(i, j) = \mathbb{1}\big[(i, j) \in \Omega\big]$;

[0048] where $(i, j) \in \{1, \dots, H\} \times \{1, \dots, W\}$; the indicator function $\mathbb{1}[\cdot]$ equals 1 when the spatial position $(i, j)$ belongs to the set $\Omega$, and 0 otherwise; $\mathrm{vec}(S) \in \mathbb{R}^{HW}$ is the flattened vector of $S$; $M$ indicates the number of retained spatial locations;

[0049] The retention ratio is set to $\rho$, so that $M = \rho \cdot H \cdot W$; when $\rho = 0.5$, the top 50% of spatial locations with the highest similarity are retained. The operation is executed independently on each sample in the batch; ArgTopM returns the set of indices corresponding to the $M$ largest elements of the input vector;
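The spatial consistency map and the Top-M binary mask can be sketched together in pure Python. The helper names `spatial_consistency` and `top_m_mask`, the nested-list layout (channels × height × width), the truncation used to turn the retention ratio into a count, and the toy features are assumptions for illustration.

```python
import math

def spatial_consistency(feat_a, feat_b, eps=1e-8):
    """feat_a, feat_b: K channels, each H x W. Returns an H x W cosine map
    computed across the channel vector at every spatial position."""
    k, h, w = len(feat_a), len(feat_a[0]), len(feat_a[0][0])
    sim = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            p = [feat_a[c][i][j] for c in range(k)]
            q = [feat_b[c][i][j] for c in range(k)]
            dot = sum(x * y for x, y in zip(p, q))
            norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
            sim[i][j] = dot / (norm + eps)
    return sim

def top_m_mask(sim, keep_ratio=0.5):
    """Binary H x W mask that keeps the keep_ratio fraction of most similar positions."""
    h, w = len(sim), len(sim[0])
    flat = [(sim[i][j], i, j) for i in range(h) for j in range(w)]
    m = int(keep_ratio * h * w)  # assumption: truncate to an integer count
    ranked = sorted(flat, key=lambda t: t[0], reverse=True)[:m]
    kept = {(i, j) for _, i, j in ranked}
    return [[1 if (i, j) in kept else 0 for j in range(w)] for i in range(h)]

# Toy features (made up): K = 2 channels, 2 x 2 spatial grid.
feat_a = [[[1.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]]
feat_b = [[[1.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
sim = spatial_consistency(feat_a, feat_b)
mask = top_m_mask(sim, keep_ratio=0.5)
```

Here the two branches agree on the top row and disagree on the bottom row, so with a retention ratio of 0.5 the mask keeps exactly the top row.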

[0050] 5) Broadcast the binary mask to the channel dimension, weight the two features element by element, concatenate them, and then perform 1×1 convolutional fusion; finally, a residual connection is made with the initial features entering the CSSB module to obtain the CSSB output. The specific formula is shown below:

[0051] $F_{\mathrm{CSSB}} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\mathcal{M} \odot F_a^{K}, \ \mathcal{M} \odot F_b^{K})\big) + \mathrm{Residual}$;

[0052] where $F_a^{K}$ and $F_b^{K}$ are the two branch features restricted to the Top-K channels and $\mathcal{M}$ is the binary spatial mask; $\mathrm{Conv}_{1 \times 1}$ indicates a convolution with a kernel size of 1, used to restore the channels from 2K to K for fusion; Concat indicates feature concatenation along the channel dimension; $\odot$ indicates element-wise multiplication; Residual indicates the residual connection made with the initial features entering the CSSB module.
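The masking, concatenation, 1×1 fusion, and residual addition of this step can be illustrated with a small pure-Python sketch. The names `conv1x1` and `cssb_fuse`, the bias-free convolution, the fixed weights, and the assumption that the residual already has the same shape as the fused output are all choices made for this example and are not specified by the text.

```python
def conv1x1(x, weight):
    """Bias-free 1x1 convolution: x is C_in x H x W, weight is C_out x C_in."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wt[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
             for i in range(h)] for wt in weight]

def cssb_fuse(feat_a, feat_b, mask, weight, residual):
    """Mask both branches, concatenate along channels (2K), fuse back to K, add residual."""
    def masked(feat):
        # The H x W mask is reused for every channel (broadcast over channels).
        return [[[v * m for v, m in zip(row, mrow)] for row, mrow in zip(ch, mask)]
                for ch in feat]
    stacked = masked(feat_a) + masked(feat_b)  # list concatenation = channel concat
    fused = conv1x1(stacked, weight)
    return [[[f + r for f, r in zip(frow, rrow)] for frow, rrow in zip(fch, rch)]
            for fch, rch in zip(fused, residual)]

# Toy example (made-up values): K = 1 channel, 1 x 2 grid, averaging weights.
feat_a = [[[2.0, 4.0]]]
feat_b = [[[6.0, 8.0]]]
mask = [[1, 0]]                # keep the first position, drop the second
weight = [[0.5, 0.5]]          # 2K = 2 input channels -> K = 1 output channel
residual = [[[1.0, 1.0]]]
out = cssb_fuse(feat_a, feat_b, mask, weight, residual)
```

At the masked-out position only the residual survives, so the shallow input is never lost entirely; at the retained position the two branches are blended and added on top of it.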

[0053] Furthermore, the dual-stream complementary multi-scale semantic guidance module FBAC is used to explicitly model the complementary semantic relationship between the foreground and background, and utilizes multi-scale contextual information to guide deep feature fusion. Its specific construction and processing steps include:

[0054] A1: The final new feature map $F_{\mathrm{CSSB}}$ is passed through a double convolution layer to obtain the deep feature map $F_d$. The high-resolution features output by the encoder are denoted $F_h$; a single-channel foreground attention map $A_{fg}$ is generated from $F_h$, and the background attention map $A_{bg}$ is obtained in a complementary manner. The specific formula is as follows:

[0055] $A_{fg} = \sigma\big(\mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}(F_h))\big), \quad A_{bg} = 1 - A_{fg}$;

[0056] where $\sigma$ represents the Sigmoid activation function; $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolution used to adjust the channels for fusion; DWConv represents the depthwise separable convolution used to generate the foreground attention map; $A_{fg}$ and $A_{bg}$ are single-channel maps with the same spatial size as $F_h$;

[0057] At the same time, $F_d$ is channel-aligned and upsampled to the same scale as $F_h$:

[0058] $\tilde{F}_d = \mathrm{Up}\big(\mathrm{Conv}_{1 \times 1}(F_d)\big)$;

[0059] where $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolution used to adjust channels and perform feature fusion; Up denotes upsampling; $F_d$ represents the deep semantic feature map output by the last layer of the encoder, which contains rich semantic information but has low spatial resolution.

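[0060] A2: Construct the foreground stream $F_{fg}$ and the background stream $F_{bg}$ by multiplying the foreground attention map $A_{fg}$ and the background attention map $A_{bg}$ element-wise with the aligned deep features $\tilde{F}_d$. The specific formula is as follows:

[0061] $F_{fg} = A_{fg} \odot \tilde{F}_d, \quad F_{bg} = A_{bg} \odot \tilde{F}_d$;

[0062] where $\odot$ denotes element-wise multiplication, with the single-channel attention map automatically broadcast along the channel dimension;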
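The complementary construction of the two streams can be sketched as below. The sketch assumes the background map is one minus the sigmoid foreground map, which is one common reading of "complementary"; the function name `dual_streams`, the use of raw logits as input, and the toy values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dual_streams(logits, deep):
    """logits: H x W attention logits; deep: C x H x W aligned deep features.
    Returns (foreground stream, background stream)."""
    a_fg = [[sigmoid(v) for v in row] for row in logits]
    a_bg = [[1.0 - v for v in row] for row in a_fg]  # complementary map (assumed)

    def weight(att):
        # The single-channel map is reused for every channel of `deep`.
        return [[[x * a for x, a in zip(row, arow)] for row, arow in zip(ch, att)]
                for ch in deep]

    return weight(a_fg), weight(a_bg)

# Toy example (made-up values): one channel, 1 x 2 grid.
logits = [[0.0, 10.0]]   # uncertain at position 0, confident foreground at position 1
deep = [[[2.0, 4.0]]]
fg, bg = dual_streams(logits, deep)
```

Under this assumption the two streams always sum back to the aligned deep feature, so the split redistributes the deep semantics between foreground and background without discarding any of it before the multi-scale modeling step.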

[0063] Subsequently, multi-scale dilated convolutions are applied to the foreground stream and the background stream separately to obtain the output of each dilation-rate branch; the branch weights $\alpha$ are then calculated from global average pooling and the channel description function MLP, and the branch outputs are dynamically weighted and fused to obtain the ADC output, as shown in the following formula:

[0064] $Y_i = \mathrm{DConv}_{r_i}(F), \quad F \in \{F_{fg}, F_{bg}\}, \quad i = 1, 2, 3$; $\quad e_i = \mathrm{MLP}\big(\mathrm{GAP}(Y_i)\big), \quad \alpha = (\alpha_1, \alpha_2, \alpha_3) = \mathrm{Softmax}(e_1, e_2, e_3)$; $\quad \mathrm{ADC}(F) = \mathrm{ReLU}\Big(\sum_{i=1}^{3} \alpha_i \odot Y_i\Big)$;

[0065] where $\mathrm{DConv}_{r_i}$ indicates a dilated convolution with dilation rate $r_i$; $Y_i$ is the multi-scale output feature of the $i$-th branch, obtained by dilated convolution of the foreground features $F_{fg}$ or the background features $F_{bg}$ at dilation rate $r_i$; GAP indicates a global average pooling operation, used to extract channel-level global statistics; MLP represents a lightweight multilayer perceptron used to generate the weight description $e_i$ of each branch; Softmax represents the normalization function, used to map the responses of the branches to weight coefficients that sum to 1; $\alpha$ represents the dynamic weights corresponding to the three dilated convolution branches, and $\alpha_i$ is the scalar weight of the $i$-th branch; $\odot$ represents element-wise multiplication; ReLU represents the linear rectification activation function; $\mathrm{ADC}(F)$ indicates the output features obtained from the foreground features $F_{fg}$ or the background features $F_{bg}$ after multi-scale dilated convolution extraction and dynamic weighted fusion.
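The dynamic weighting of the dilation-rate branches can be illustrated without implementing the dilated convolutions themselves. In this sketch the branch outputs and the per-branch logits (standing in for the output of the lightweight MLP) are supplied directly; the function names `softmax` and `fuse_branches` and all values are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_branches(branch_outputs, branch_logits):
    """branch_outputs: equally shaped H x W maps, one per dilation rate.
    branch_logits: one scalar per branch (stand-in for the MLP output).
    Returns (ReLU of the weighted sum, branch weights)."""
    weights = softmax(branch_logits)
    h, w = len(branch_outputs[0]), len(branch_outputs[0][0])
    fused = [[max(0.0, sum(wt * out[i][j] for wt, out in zip(weights, branch_outputs)))
              for j in range(w)] for i in range(h)]
    return fused, weights

# Toy example (made-up values): three branches, each a 1 x 1 map.
branches = [[[3.0]], [[6.0]], [[-3.0]]]
fused, weights = fuse_branches(branches, [0.0, 0.0, 0.0])      # equal weights
fused_neg, _ = fuse_branches([[[-1.0]], [[-2.0]], [[-3.0]]], [0.0, 0.0, 0.0])
```

Because the weights come from a softmax they always sum to 1, so the fusion is a convex combination of the branch outputs followed by ReLU; a larger logit for one branch shifts the output toward that branch's receptive field.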

[0066] A3: The foreground and background streams share the same ADC module for enhancement. After the two multi-scale features are obtained, the complementary fusion feature map $F_M$ is obtained by adding them element-wise, as shown in the following formula:

[0067] $F_M = \mathrm{ADC}(F_{fg}) + \mathrm{ADC}(F_{bg})$;

[0068] A4: The complementary fusion feature map $F_M$ is resized by interpolation to the same resolution as the original low-resolution deep features $F_d$ and refined by a convolutional block; the result is concatenated with $F_d$ along the channel dimension and then fused by a convolutional block to obtain $F_r$. The formula is as follows:

[0069] $F_r = \mathrm{Conv}_2\Big(\mathrm{Concat}\big(\mathrm{Conv}_1(\mathrm{Interp}(F_M, (H_d, W_d))), \ F_d\big)\Big)$;

[0070] where $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are implemented by convolution to adjust channel dimensions and perform feature fusion; Concat indicates a concatenation operation along the channel dimension; Interp indicates an interpolation scaling operation; $H_d$ and $W_d$ represent the spatial height and width of the deep feature map $F_d$, i.e. the target resolution to which the complementary fusion feature map $F_M$ is resampled so that it has the same size as $F_d$;

[0071] Finally, the fused feature $F_r$ is upsampled back to high resolution and summed with the high-resolution encoder features $F_h$ through a residual connection to obtain the FBAC output enhancement feature, as shown in the following formula:

[0072] $F_{\mathrm{FBAC}} = \mathrm{Up}(F_r) + F_h$;

[0073] where $F_{\mathrm{FBAC}}$ indicates the high-resolution enhanced features obtained after processing by the Foreground-Background Complementary Enhancement Module (FBAC), which is used to fuse shallow detail information with deep semantic information, thereby improving feature representation capabilities.

[0074] Furthermore, the deep and shallow feature interaction stage of the segmentation network model consists of the bottleneck layer of the encoder and the shallow layer of the decoder.

[0075] Furthermore, the steps also include:

[0076] Obtain the image dataset to be processed, divide the dataset into training and testing sets, and uniformly adjust the size of the input images to a preset size;

[0077] Based on the constructed segmentation network model, the model is iteratively trained and its parameters are optimized using the partitioned training set; the trained model is tested using the test set, and the model is then used to perform accurate segmentation of MRI images.

[0078] Furthermore, the image dataset to be processed is a medical image dataset containing multiple MRI images of prostate cancer; the samples in the image dataset are scaled so that all samples are uniformly resized to 256×256.

[0079] Beneficial effects

[0080] The image segmentation method proposed in this invention, based on cosine consistency screening and dual-stream complementary semantic guidance, has the following advantages compared with existing technologies:

[0081] (1) This invention retains the encoding and decoding feature extraction and reconstruction capabilities of the original CSFBNet backbone network, and combines feature filtering and semantic guidance strategies (cosine consistent sparse filtering module CSSB and dual-stream complementary multi-scale semantic guidance module FBAC) to more effectively filter out redundant and interference information in the shallow stage of the encoder, and explicitly model the complementary semantic relationship between the foreground and background in the deep stage. It uses multi-scale context information to guide the decoding and reconstruction process, thereby enhancing the model's ability to express and locate the features of the lesion area.

[0082] (2) The Cosine Consistency Sparse Filtering Module (CSSB) designed in this invention is the core innovation of this patent. Its architecture is mainly composed of a dual-branch feature extraction unit, a consistency and importance measurement unit, and a sparse filtering fusion unit. In terms of specific construction steps: First, an independent dual-convolution path is constructed to extract dual-branch features of the same scale, and channel alignment is performed through convolution; Second, the cosine consistency of the features between branches in the channel dimension and spatial dimension is calculated respectively, and the channel importance measurement is obtained by using the attention mechanism formed by global average pooling and multilayer perceptron (MLP); Finally, on this basis, a joint scoring and ranking strategy is adopted to complete the sparse selection of Top-K channels and Top-M spatial positions, focusing on retaining shallow semantic channels and spatial responses with high consistency and strong importance, and splicing and residual fusion output of the filtered features to achieve effective retention of shallow discriminative information and suppression of redundant background responses.

[0083] (3) The dual-stream complementary multi-scale semantic guidance module FBAC designed in this invention is also the core innovation of this invention. Its overall architecture is mainly composed of a foreground and background attention generation unit, a dual-stream feature mapping unit, and a multi-scale adaptive fusion unit. In terms of specific construction steps: First, an attention generation branch is constructed to generate a foreground attention map from high-resolution features and construct its complementary background attention map; Second, a dual-stream feature processing path is constructed to fuse the aligned deep semantic features with the aforementioned attention map, mapping them as a foreground enhancement stream and a background suppression stream, respectively; Next, an adaptive multi-scale convolution is introduced to perform differentiated modeling and dynamic fusion of contextual information at different scales in the dual-stream branch, realizing adaptive expression of lesion scale changes and contextual dependencies; Finally, a residual connection path is constructed to ultimately inject the fused and enhanced features back into the decoding path in a residual manner to complete the guidance of deep semantics on the recovery of shallow details.

[0084] (4) This invention comprehensively considers segmentation performance and computational complexity: it embeds the CSSB module into the second and third layers of the encoder and introduces the multi-scale semantic guidance module FBAC into the cross-fusion layer, which is set between the bottleneck layer of the encoder and the shallow layer of the decoder. The CSSB module filters shallow features of the same scale in the two branches, uses the channel importance score for ranking, retains the shallow discriminative regions, and suppresses redundant features; the filtered features are then introduced into the FBAC module for guidance and fusion of foreground and background features. This achieves the collaborative optimization of shallow information filtering and deep semantic guidance without changing the backbone of the end-to-end network structure, thereby improving the feature representation quality and overall segmentation performance of the prostate cancer MRI image segmentation task.

Attached Figure Description

[0085] Figure 1 is the overall flowchart of the present invention.

[0086] Figure 2 is a diagram of the overall architecture of the CSFBNet segmentation network in this invention.

[0087] Figure 3 is an architecture diagram of the cosine consistency sparse filtering (CSSB) module in this invention.

[0088] Figure 4 is an architecture diagram of the dual-stream complementary multi-scale semantic guidance module FBAC in this invention.

[0089] Figure 5 is a graph of the Dice results of the five-fold cross-validation on the PICAI dataset in this embodiment.

[0090] Figure 6 is a graph of the IoU results of the five-fold cross-validation on the PICAI dataset in this embodiment.

[0091] Figure 7 is a graph of the Precision results of the five-fold cross-validation on the PICAI dataset in this embodiment.

[0092] Figure 8 is a graph of the Recall results of the five-fold cross-validation on the PICAI dataset in this embodiment.

[0093] Figure 9 is a visualization of the segmentation results of the model of this patent compared with other models in this embodiment.

[0094] Figure 10 is a schematic diagram of the attention visualization results of the model in this embodiment.

Detailed Implementation

[0095] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments. The described embodiments are merely some embodiments of the present invention, and not all embodiments. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the design concept of the present invention should fall within the protection scope of the present invention.

[0096] Example 1:

[0097] As shown in Figure 1, an image segmentation method based on cosine consistency filtering and dual-stream complementary semantic guidance includes the following steps:

[0098] Step 1: Obtain the image dataset to be processed, divide the dataset into training and testing sets, and uniformly adjust the input image size to a preset size.

[0099] In this embodiment, the publicly available PICAI dataset, the proprietary HY prostate cancer MRI dataset, and the publicly available PROMISE12 dataset are each divided into a training set and a testing set. Specifically:

[0100] Step 1.1: Collect relevant prostate cancer MRI segmentation datasets.

[0101] This implementation uses three datasets:

[0102] (1) The PICAI public dataset provides multi-parameter magnetic resonance imaging of prostate tumors based on T2WI, DWI, and apparent diffusion coefficient modalities, covering a total of 1295 clinical cases, including benign and malignant pathological types. Among them, only 220 cases (17% of the total) of clinically significant prostate tumor regions were accurately labeled by professional physicians; the remaining 1075 cases (83%) were benign tissue or indolent prostate cancer, and no lesion labeling information was included. In this embodiment, the above-mentioned manually labeled 220 cases were selected for segmentation training and evaluation.

[0103] (2) The HY Prostate MRI dataset was constructed by the First People's Hospital of Huai'an, Jiangsu Province, China. It includes 398 MRI images of prostate cancer from 169 different patients and provides corresponding segmentation and annotation information.

[0104] (3) The PROMISE12 public dataset is the 2012 prostate MRI image public segmentation challenge dataset, which provides prostate MRI images and their segmentation annotation information.

[0105] Step 1.2: Divide the above three datasets into training and test sets.

[0106] (1) PICAI dataset: The selected 220 labeled cases were divided using a five-fold cross-validation method and a fixed random seed of 42; each fold contained approximately 174 cases (80%) in the training set and 46 cases (20%) in the test set.

[0107] (2) HY Prostate MRI dataset: The dataset is divided into training and test sets according to a 4:1 ratio of patients to ensure that different slice images of the same patient only exist in the training or test set; the training set contains 320 images of 136 patients and the test set contains 78 images of 33 patients.

[0108] (3) PROMISE12 dataset: The dataset was divided according to the official requirements. The training set consisted of 1277 prostate images (containing 50 patients), and the test set consisted of 795 prostate images (containing 30 patients).
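The patient-level 4:1 split described for the HY dataset can be sketched as follows. This is a minimal illustration, not the procedure used in the embodiment: the function name `split_by_patient`, the slice and patient identifiers, and the use of seed 42 for this dataset are assumptions made for the example.

```python
import random


def split_by_patient(slice_to_patient, test_fraction=0.2, seed=42):
    """Split slice IDs so that all slices of a patient land in exactly one subset."""
    patients = sorted(set(slice_to_patient.values()))
    rng = random.Random(seed)          # fixed seed for a reproducible split (assumed value)
    rng.shuffle(patients)
    n_test = int(len(patients) * test_fraction)   # floor of the test share
    test_patients = set(patients[:n_test])
    train = [s for s, p in slice_to_patient.items() if p not in test_patients]
    test = [s for s, p in slice_to_patient.items() if p in test_patients]
    return train, test


# Toy data with hypothetical identifiers: 398 slices spread over 169 patients.
slice_to_patient = {f"slice_{i:03d}": f"patient_{i % 169:03d}" for i in range(398)}
train_ids, test_ids = split_by_patient(slice_to_patient)
```

Splitting on patient identifiers rather than on slices is what keeps slices of one patient from appearing on both sides of the split.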

[0109] Step 1.3: Preprocess the samples of the three datasets by resizing only, adjusting all sample images to a size of 256×256 to facilitate observation of the model's performance.

[0110] The MRI slices and their corresponding annotations in the PICAI, HY, and PROMISE12 datasets were scaled uniformly to ensure that the network input size was consistent, thus guaranteeing that subsequent model training and comparative experiments were conducted at the same input scale.

[0111] Step 2: Construct the backbone architecture of the MRI image segmentation network model; the backbone architecture of this MRI image segmentation network model is an improved network structure based on encoding and decoding, and CSSB and FBAC modules are inserted in the encoding and decoding stages according to a preset strategy to improve the performance of the CSFBNet segmentation network structure.

[0112] As shown in Figure 2, the overall architecture of the CSFBNet segmentation network is built upon the classic encoder-decoder paradigm. Its macroscopic network architecture is primarily constructed from four core units: a convolutional encoder unit for multi-scale feature extraction; a cosine consistent sparse filtering module (CSSB), cascaded and embedded in the shallow layers of the encoder, for feature redundancy removal; a dual-stream complementary multi-scale semantic guidance module (FBAC), bridging the bottleneck between deep and shallow features, for foreground and background feature guidance; and a convolutional decoder unit incorporating skip connections for progressively restoring spatial resolution. The core structure and layers of the MRI image segmentation network model include:

[0113] (1) The shallow layer of the encoder is the first layer of the encoder. It contains a double convolutional layer (two 3×3 convolutional layers with BN normalization and a ReLU activation function), which is used to obtain similar local detail features at the early stage.

[0114] (2) The intermediate layer of the encoder is the second and third layers of the encoder, which embeds the cosine consistent sparse filtering module CSSB, used to perform filtering on shallow features of the same scale in the two branches.

[0115] (3) The deep layer of the encoder is the fourth layer of the encoder, which contains double convolution, and passes the filtered features to the next stage.

[0116] (4) The bottleneck layer is located in the deep and shallow feature interaction stage between the encoder and the decoder.

[0117] (5) The convolutional decoder unit is used to receive features and gradually restore the spatial resolution of the image. It combines skip connections to achieve the step-by-step restoration of features.

[0118] (6) The cross-fusion layer is a dual-stream complementary multi-scale semantic guidance module FBAC between the bottleneck layer of the encoder and the shallow layer of the decoder, which is used for the guidance and fusion of foreground and background features.

[0119] The encoder is used to extract multi-scale features from the input MRI image. The shallow / first layer of the encoder contains a double convolutional layer to obtain early local details that are similar. The decoder is used to receive the features and gradually restore the spatial resolution.

[0120] The specific construction and feature processing steps of the backbone architecture of the MRI image segmentation network model include:

[0121] Step 2.1: Construct the overall segmentation network CSFBNet, with an overall structure based on an encoder-decoder architecture; the input image size is 256×256, and the encoder uses a convolutional module for feature extraction.

[0122] Step 2.2: Introduce the CSSB module in the shallow layer of the encoder: construct two parameter-independent double convolutional branches to obtain dual-branch features of the same scale, and perform joint selection based on the cosine consistency of the two features in the channel and spatial dimensions and on the channel importance weights; after selecting the Top-K channels and Top-M spatial positions, perform mask weighting, concatenation, and convolutional fusion, followed by a residual connection with the initial features, to obtain filtered and enhanced shallow features.

[0123] Step 2.3: Introduce the FBAC module at the cross-scale semantic interaction position: Input the high-resolution features output by the encoder and the low-resolution deep features of the bottleneck layer into the FBAC module, and generate a foreground-background complementary attention map based on the high-resolution features; after channel alignment and upsampling of the deep features, construct the foreground stream and background stream, and perform multi-scale modeling through adaptive dilated convolution (ADC); then perform complementary fusion of the dual-stream features, and inject them back into the high-resolution fused features in a residual manner to form an enhanced cross-scale semantic representation.

[0124] Step 2.4: The multi-scale features processed by CSSB and FBAC are input into the decoder. The features are upsampled step by step through the decoding path and fused with skip connection features to output the final segmentation result.

[0125] Step 3: Construct and embed the Cosine Consistent Sparse Filtering (CSSB) module in the second and third layers of the encoder of the segmentation network model. As shown in Figure 3, the CSSB module is configured to perform channel-level and spatial pixel-level consistency filtering on shallow features of the same scale in both branches, and to use channel importance scores for ranking, thereby preserving discriminative shallow regions and suppressing redundant features; the filtered features are then passed through the fourth-layer double convolution to the bottleneck layer of the encoder for fusion in the subsequent FBAC module.

[0126] The Cosine Consistency Sparse Filtering (CSSB) module performs consistency filtering and sparsity preservation on the shallow dual-branch features of the encoder, used for consistency filtering and redundant feature suppression of similar features in early dual-path shallow signals. In terms of architecture, the CSSB module includes: a dual-branch independent convolution extraction unit for obtaining initial features at the same scale; a channel similarity calculation module (CSM) and a channel importance evaluation module (CIM) for calculating channel-level consistency and importance scores; a spatial similarity module for generating spatial dimension masks; and a concatenation and residual fusion unit for outputting the final features. The specific feature extraction and processing methods of the CSSB module include the following steps:

[0127] Step 3.1: Extract initial feature information from the first layer of the encoder through two independent paths using double convolution, and then downsample the path to obtain the feature tensor as shown in the following formula:

[0128] The input features from the first layer of the encoder are fed into two independent paths, and the initial features are extracted through double convolution and downsampled to obtain a dual-branch feature tensor. The specific content of the module is as follows:

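[0129] $F_1 = \mathrm{Down}_{k,s}(\mathrm{DoubleConv}_1(X)), \quad F_2 = \mathrm{Down}_{k,s}(\mathrm{DoubleConv}_2(X))$;

[0130] where $X$ denotes the input features from the first layer of the encoder; $F_1, F_2 \in \mathbb{R}^{B \times C \times H \times W}$ are the dual-branch feature tensors; $\mathbb{R}$ is the real number field, $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, respectively; $k$ is the kernel size and $s$ is the convolution stride; Down indicates downsampling; and DoubleConv is two layers of 3×3 convolution with BN normalization and a ReLU activation function.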

[0131] Step 3.2: Next, process the two branch features. After linear projection and alignment, the feature representation used for consistency calculation is obtained, as shown in the following formula:

[0132] $\tilde{F}_1 = W_p F_1, \quad \tilde{F}_2 = W_p F_2$;

[0133] where $\tilde{F}_1$ and $\tilde{F}_2$ are the aligned features obtained from the two independent path branches after linear projection; for the $c$-th channel, $\tilde{F}_1$ and $\tilde{F}_2$ are flattened along the spatial dimensions into vectors $f_1^c, f_2^c \in \mathbb{R}^{N}$ of length $N = H \times W$, which are used for the subsequent cosine similarity calculation; $W_p$ is a learnable linear projection matrix, implemented in the network as a 1×1 convolution with shared weights, which performs channel-dimension information exchange and alignment between the two feature paths.

[0134] The spatial response of each channel is vectorized, and the cosine similarity between the two branches on that channel is calculated to obtain the channel consistency score, as shown in the following formula:

[0135] $S_c = \dfrac{\langle f_1^c, f_2^c \rangle}{\lVert f_1^c \rVert_2 \, \lVert f_2^c \rVert_2 + \varepsilon}, \quad c \in \{1, \dots, C\}$;

[0136] where $S_c$ denotes the cosine similarity of the two branch features on the $c$-th channel, which measures the consistency of the spatial response distribution of that channel; the larger the value, the more consistent the representations of the two features on that channel. $c$ is the channel index; $f_1^c, f_2^c \in \mathbb{R}^{N}$ are the feature vectors of the two branches on the $c$-th channel after flattening the spatial dimensions; $N = H \times W$ is the product of the feature-map height $H$ and width $W$, i.e., the total number of pixels in a single feature channel; and $\varepsilon$ is a very small constant that prevents division by zero.
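The channel consistency score described above can be illustrated with a short NumPy sketch. It assumes the two branch features have already been projected and aligned to the same shape (B, C, H, W); the function name and the value of `eps` are illustrative choices, not taken from the embodiment.

```python
import numpy as np


def channel_cosine_similarity(f1, f2, eps=1e-8):
    """Per-channel cosine similarity between two (B, C, H, W) feature tensors."""
    b, c, h, w = f1.shape
    v1 = f1.reshape(b, c, h * w)       # flatten the spatial dimensions of each channel
    v2 = f2.reshape(b, c, h * w)
    dot = (v1 * v2).sum(axis=-1)
    norm = np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1)
    return dot / (norm + eps)          # shape (B, C); eps guards against division by zero


rng = np.random.default_rng(0)
feat_a = rng.normal(size=(2, 4, 8, 8))
feat_b = rng.normal(size=(2, 4, 8, 8))
scores = channel_cosine_similarity(feat_a, feat_b)   # one score per sample and channel
```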

[0137] Step 3.3: To avoid relying solely on consistency screening and causing the model to favor channels that are "stable but not necessarily relevant to the target," a lightweight channel attention branch is introduced: First, the two features are fused by weight and layer normalized, then global average pooling is used to obtain the channel-level semantic description vector; two layers of MLP are used to extract the dependencies between channels, and Softmax normalization is applied to the channel dimension to obtain the channel importance weights (channel attention scores). As shown in the following formula:

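[0138] $z = \mathrm{GAP}(\alpha \cdot \mathrm{LN}(F_1) + (1 - \alpha) \cdot \mathrm{LN}(F_2)), \quad A = \mathrm{Softmax}(W_2 \, \mathrm{GELU}(W_1 z))$;

[0139] where $A$ represents the channel attention score; Softmax is the normalization function; $W_1$ and $W_2$ are learnable parameters whose linear mappings are implemented by 1×1 convolutions; GELU is the activation function; $z$ is the channel-level vector describing the global importance of the channels; GAP represents global average pooling; LN represents the layer normalization function; and $\alpha$ and $1-\alpha$ are the respective fusion weights of the layer-normalized features $F_1$ and $F_2$.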
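A minimal NumPy sketch of the lightweight channel attention branch is given below. Several details are assumptions made for illustration only: layer normalization is taken over the (C, H, W) axes of each sample, the fusion weight `alpha` is fixed at 0.5, GELU uses the tanh approximation, and the two MLP layers are plain matrices `w1` and `w2` with a hypothetical hidden width.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each sample over its (C, H, W) axes (assumed normalization axes)."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def channel_importance(f1, f2, w1, w2, alpha=0.5):
    """Weighted fusion -> layer norm -> GAP -> two-layer MLP -> softmax over channels."""
    fused = alpha * layer_norm(f1) + (1.0 - alpha) * layer_norm(f2)
    z = fused.mean(axis=(2, 3))            # global average pooling, shape (B, C)
    hidden = gelu(z @ w1.T)                # shape (B, C_hidden)
    return softmax(hidden @ w2.T, axis=1)  # shape (B, C), each row sums to 1


rng = np.random.default_rng(0)
f1 = rng.normal(size=(2, 8, 4, 4))
f2 = rng.normal(size=(2, 8, 4, 4))
w1 = rng.normal(size=(4, 8))   # hypothetical hidden width of 4
w2 = rng.normal(size=(8, 4))
importance = channel_importance(f1, f2, w1, w2)
```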

[0140] Step 3.4: Multiply the channel consistency score by the channel importance weight to obtain a comprehensive score, and sort the comprehensive scores in descending order. Select the Top-K (top 50% in this paper) channel index set for each of the two branches. The filtered bi-branch features are obtained as shown in the following formula:

[0141] $G_c = A_c \cdot S_c, \quad \pi = \mathrm{Argsort}_{\downarrow}(G), \quad \Omega_K = \{\pi_1, \dots, \pi_K\}, \quad F_1^{K} = F_1[:, \Omega_K], \quad F_2^{K} = F_2[:, \Omega_K]$;

[0142] where $A_c$ is the importance score of the $c$-th channel; $S_c$ is the cosine similarity of the two branches on the $c$-th channel; $G_c$ is the comprehensive score that combines the importance score and the consistency score of the channel; $\pi$ is the channel ordering obtained by arranging the comprehensive scores of all channels in descending order; Argsort is the sorting function and $\downarrow$ denotes descending order; $F_1^{K}$ and $F_2^{K}$ are the important and consistent feature groups formed by the highest-scoring channels selected from each of the two path branches; $K$ is the number of high-scoring channels to be retained, here the top 50% of the total number of channels $C$, i.e., $K = C/2$; and the set of the $K$ highest-scoring channel indexes is denoted $\Omega_K$.
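The joint ranking and Top-K channel selection can be sketched as follows. The sketch assumes the consistency and importance scores are already available as (B, C) arrays; the function name and the per-sample selection are illustrative assumptions.

```python
import numpy as np


def select_top_k_channels(f1, f2, consistency, importance, keep_ratio=0.5):
    """Keep the channels with the highest consistency * importance score."""
    b, c, _, _ = f1.shape
    k = max(1, int(c * keep_ratio))            # 50% of the channels by default
    score = consistency * importance           # comprehensive score, shape (B, C)
    order = np.argsort(-score, axis=1)         # descending order per sample
    top_idx = order[:, :k]                     # shape (B, K)
    batch_idx = np.arange(b)[:, None]          # pairs each sample with its own indices
    return f1[batch_idx, top_idx], f2[batch_idx, top_idx], top_idx


# Toy example: channel c of f1 is filled with the value c, f2 with c + 10.
f1 = np.arange(4, dtype=float).reshape(1, 4, 1, 1) * np.ones((1, 4, 2, 2))
f2 = f1 + 10.0
consistency = np.array([[0.2, 0.9, 1.0, 0.6]])
importance = np.array([[0.5, 1.0, 0.5, 0.5]])
f1_k, f2_k, kept = select_top_k_channels(f1, f2, consistency, importance)
```

Here the combined scores are 0.1, 0.9, 0.5, and 0.3, so channels 1 and 2 are kept from both branches, even though channel 2 alone has the highest consistency.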

[0143] Step 3.5: After completing the Top-K channel filtering, continue by selecting the positions with the highest consistency in the spatial dimension: first, normalize along the channel dimension; then, for each spatial position $(i, j)$, take the corresponding channel vectors $u_{ij}$ and $v_{ij}$ of the two branches and calculate their cosine similarity to obtain the spatial consistency map $P$, as shown in the following formula:

[0144] $P(i, j) = \dfrac{\langle u_{ij}, v_{ij} \rangle}{\lVert u_{ij} \rVert_2 \, \lVert v_{ij} \rVert_2 + \varepsilon}$;

[0145] where $u_{ij}, v_{ij} \in \mathbb{R}^{K}$, and $\varepsilon$ is a very small constant that prevents division by zero.

[0146] Step 3.6: Flatten the spatial consistency map $P$ into a vector of length H×W, select the Top-M spatial locations with the highest consistency to form the set $\Omega_M$, and construct a binary mask $\mathbf{M}$ from this set (this embodiment sets the retention ratio to 50%), as shown in the following formula:

[0147] $\Omega_M = \mathrm{ArgTopM}(\mathrm{vec}(P), M), \quad \mathbf{M}(i, j) = \begin{cases} 1, & (i, j) \in \Omega_M \\ 0, & \text{otherwise} \end{cases}$;

[0148] where $(i, j) \in \Omega_M$; the mask value is $\mathbf{M}(i, j) = 1$ when the spatial position $(i, j)$ belongs to the set $\Omega_M$, and $\mathbf{M}(i, j) = 0$ otherwise; $\mathrm{vec}(P) \in \mathbb{R}^{HW}$ is the flattened vector of $P$; and $M$ indicates the number of spatial locations to be retained.

[0149] In this embodiment, the retention ratio is set to $\rho = 0.5$, therefore $M = \rho \cdot H \cdot W$; that is, the top 50% of spatial locations with the highest similarity are retained. The operation is executed independently on each sample in the batch; ArgTopM returns the set of indices corresponding to the $M$ largest elements of the input vector.
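The spatial consistency map and the Top-M binary mask of Steps 3.5 and 3.6 can be sketched together in NumPy. The function name, the use of floor rounding for the number of kept positions, and the value of `eps` are assumptions for this illustration.

```python
import numpy as np


def spatial_consistency_mask(f1_k, f2_k, keep_ratio=0.5, eps=1e-8):
    """Cosine similarity across channels at each position, then a Top-M binary mask."""
    b, _, h, w = f1_k.shape
    dot = (f1_k * f2_k).sum(axis=1)
    norm = np.linalg.norm(f1_k, axis=1) * np.linalg.norm(f2_k, axis=1)
    sim = dot / (norm + eps)                     # spatial consistency map, shape (B, H, W)
    m = int(keep_ratio * h * w)                  # number of positions to keep (floor, assumed)
    flat = sim.reshape(b, h * w)
    top = np.argsort(-flat, axis=1)[:, :m]       # indices of the M most consistent positions
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, top, 1.0, axis=1)    # selection is done per sample
    return sim, mask.reshape(b, 1, h, w)         # mask broadcasts over the channel axis


rng = np.random.default_rng(0)
branch_a = rng.normal(size=(2, 3, 4, 4))
branch_b = rng.normal(size=(2, 3, 4, 4))
sim_map, binary_mask = spatial_consistency_mask(branch_a, branch_b)
```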

[0150] Step 3.7: Broadcast the binary mask $\mathbf{M}$ along the channel dimension, weight the two features element-wise with it, concatenate them, and fuse them by convolution; finally, apply a residual connection with the initial features entering the CSSB module to obtain the CSSB output. The specific formula is shown below:

[0151] $F_{\mathrm{CSSB}} = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(\mathbf{M} \odot F_1^{K}, \ \mathbf{M} \odot F_2^{K})) + \mathrm{Residual}$;

[0152] where $\mathrm{Conv}_{1 \times 1}$ indicates a convolution with a kernel size of 1, used to restore the channels from 2K to K for fusion; Concat indicates feature concatenation along the channel dimension; $\odot$ indicates element-wise multiplication; $F_1^{K}$ and $F_2^{K}$ are the two branch features after Top-K channel selection; and Residual indicates the residual connection with the initial features entering the CSSB module.
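The masking, concatenation, 1×1 convolution, and residual steps can be sketched as below. The 1×1 convolution is written as a channel-mixing matrix product without bias, and the residual tensor is assumed to already have the same shape as the fused output; both are simplifications for illustration, and the function name is hypothetical.

```python
import numpy as np


def cssb_fuse(f1_k, f2_k, mask, weight, residual):
    """Mask both branches, concatenate on channels, mix with a 1x1 conv, add residual."""
    stacked = np.concatenate([mask * f1_k, mask * f2_k], axis=1)   # (B, 2K, H, W)
    fused = np.einsum("oc,bchw->bohw", weight, stacked)            # 1x1 conv: 2K -> K channels
    return fused + residual                                        # residual assumed (B, K, H, W)


rng = np.random.default_rng(0)
f1_k = rng.normal(size=(1, 2, 3, 3))
f2_k = rng.normal(size=(1, 2, 3, 3))
residual = rng.normal(size=(1, 2, 3, 3))
weight = rng.normal(size=(2, 4))                 # (K, 2K) mixing matrix
mask = np.ones((1, 1, 3, 3))
mask[0, 0, 0, :] = 0.0                           # drop the first row of positions
out = cssb_fuse(f1_k, f2_k, mask, weight, residual)
```

At masked-out positions the fused term is zero, so the output there equals the residual; this is how the residual path keeps the initial information available after sparse filtering.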

[0153] Step 4: Construct and connect a dual-stream complementary multi-scale semantic guidance module (FBAC) in the deep and shallow feature interaction stage of the segmentation network model (i.e., between the bottleneck layer of the encoder and the shallow layer of the decoder). As shown in Figure 4, the FBAC module includes: a foreground/background attention generation unit (FBGM) for generating complementary masks, a two-stream feature mapping unit for constructing independent branches, an adaptive dilated convolutional unit (ADC) for capturing contextual information from different receptive fields, and a residual connection fusion unit for final semantic reconstruction. The FBAC module is configured to generate foreground attention maps from high-resolution features and construct complementary background attention maps, forming two-stream branches. Semantic guidance features are obtained through dynamic multi-scale modeling to guide the decoder in recovering details, thus completing the construction of the entire MRI image segmentation network model.

[0154] The dual-stream complementary multi-scale semantic guidance module FBAC is used to explicitly model the complementary semantic relationship between foreground and background, and to guide deep feature fusion using multi-scale contextual information. Its specific construction and processing process includes the following steps:

[0155] Step 4.1: Let the high-resolution features output by the encoder's first layer be $F_h$ and the low-resolution deep features of the bottleneck layer be $F_l$. First, a single-channel foreground attention map $A_f$ is generated from $F_h$, and the background attention map $A_b$ is obtained in a complementary manner. The specific formula is as follows:

[0156] $A_f = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}(F_h))), \quad A_b = 1 - A_f$;

[0157] where $\sigma$ represents the Sigmoid activation function; $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolution used to adjust the channels for fusion; DWConv represents the depthwise separable convolution used to generate the foreground attention map; and $A_f, A_b \in \mathbb{R}^{B \times 1 \times H \times W}$.

[0158] Subsequently, after the attention maps have been constructed with depthwise separable convolution, the low-resolution deep features $F_l$ undergo channel alignment and are upsampled to the same scale as the high-resolution features $F_h$, as shown in the following formula:

[0159] $\hat{F}_l = \mathrm{Up}(\mathrm{Conv}_{1 \times 1}(F_l))$;

[0160] where $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolution used to adjust channels and perform feature fusion; Up denotes upsampling; and $F_l$ represents the deep semantic feature map output by the last layer of the encoder, which contains rich semantic information but has a low spatial resolution.

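[0161] Step 4.2: Multiply each attention map element-wise with the aligned deep features to construct the foreground stream $F_{fg}$ and the background stream $F_{bg}$; the two complementary paths are formed by element-wise multiplication, as shown in the following formula:

[0162] $F_{fg} = A_f \otimes \hat{F}_l, \quad F_{bg} = A_b \otimes \hat{F}_l$;

[0163] where $A_f$ and $A_b$ are the foreground and background attention maps, $\hat{F}_l$ denotes the aligned deep features, and $\otimes$ denotes element-wise multiplication, which is automatically broadcast along the channel dimension.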
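The complementary attention maps and the two streams can be illustrated with a few lines of NumPy. The sketch assumes the single-channel attention logits have already been produced by the convolutional layers and that the deep features are already aligned to the same spatial size; the function and variable names are hypothetical.

```python
import numpy as np


def foreground_background_streams(attn_logits, deep_aligned):
    """Sigmoid foreground map, complementary background map, and the two gated streams."""
    a_fg = 1.0 / (1.0 + np.exp(-attn_logits))    # foreground attention, shape (B, 1, H, W)
    a_bg = 1.0 - a_fg                            # complementary background attention
    f_fg = a_fg * deep_aligned                   # broadcast over the channel axis
    f_bg = a_bg * deep_aligned
    return f_fg, f_bg, a_fg, a_bg


rng = np.random.default_rng(0)
attn_logits = rng.normal(size=(2, 1, 4, 4))
deep_aligned = rng.normal(size=(2, 6, 4, 4))
f_fg, f_bg, a_fg, a_bg = foreground_background_streams(attn_logits, deep_aligned)
```

Because the two attention maps sum to one at every position, the two streams form an exact decomposition of the aligned deep features before any further processing.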

[0164] Step 4.3: Perform multi-scale dilated convolution on the foreground and background streams respectively, introducing an ADC module that applies the dilation rate set $D = \{d_1, d_2, d_3\}$ and fuses the branch outputs with dynamic weights:

[0165] $Y_i = \mathrm{DConv}_{d_i}(F_s), \quad d_i \in D, \quad s \in \{fg, bg\}$;

[0166] First, multi-scale dilated convolution is performed to reconstruct features. Then, channel descriptions are obtained through operations such as global average pooling, importance scores are generated for each branch, and the branch outputs are dynamically weighted and fused to obtain the ADC output, as shown in the following formula:

[0167] $w = \mathrm{Softmax}(\mathrm{MLP}([\mathrm{GAP}(Y_1), \mathrm{GAP}(Y_2), \mathrm{GAP}(Y_3)])), \quad \mathrm{ADC}(F_s) = \mathrm{ReLU}\left(\sum_{i=1}^{3} w_i \otimes Y_i\right)$;

[0168] where $\mathrm{DConv}_{d_i}$ indicates a dilated convolution with dilation rate $d_i$; $d_i \in D$; $Y_i$ indicates the output features of the $i$-th multi-scale branch obtained by dilated convolution, at dilation rate $d_i$, of the foreground features $F_{fg}$ or the background features $F_{bg}$; GAP indicates a global average pooling operation, used to extract channel-level global statistics; MLP represents a lightweight multilayer perceptron used to generate weight descriptions for each branch; Softmax represents the normalization function, used to map the responses of each branch to weight coefficients that sum to 1; $w$ represents the dynamic weights corresponding to the three dilated convolution branches; $w_i$ indicates the scalar weight corresponding to the $i$-th branch; $\otimes$ represents element-wise multiplication; ReLU represents the linear rectification activation function; and $\mathrm{ADC}(F_s)$ indicates the output features obtained from the foreground features $F_{fg}$ or the background features $F_{bg}$ after multi-scale dilated convolution extraction and dynamic weighted fusion.
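The dynamic weighting of dilated-convolution branches can be sketched in NumPy as follows. This is a simplified illustration rather than the module of the embodiment: the dilation rates (1, 2, 3) are placeholder values, the convolution is a depthwise 3×3 with zero padding, each branch is summarized by one scalar descriptor, and a single matrix `w_mlp` stands in for the lightweight MLP.

```python
import numpy as np


def dilated_conv3x3(x, kernel, dilation):
    """Depthwise 3x3 dilated convolution with zero padding.

    x has shape (B, C, H, W); kernel has shape (C, 3, 3).
    """
    _, _, h, w = x.shape
    d = dilation
    padded = np.pad(x, ((0, 0), (0, 0), (d, d), (d, d)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            tap = kernel[None, :, i, j, None, None]          # shape (1, C, 1, 1)
            out += tap * padded[:, :, i * d:i * d + h, j * d:j * d + w]
    return out


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def adaptive_dilated_fusion(x, kernels, w_mlp, dilations=(1, 2, 3)):
    """Run the dilated branches, weight them with softmax scores, sum, apply ReLU."""
    branches = [dilated_conv3x3(x, k, d) for k, d in zip(kernels, dilations)]
    # One scalar descriptor per branch via global averaging (simplifying assumption).
    desc = np.stack([y.mean(axis=(1, 2, 3)) for y in branches], axis=1)   # (B, 3)
    weights = softmax(desc @ w_mlp, axis=1)                               # rows sum to 1
    fused = sum(weights[:, i, None, None, None] * y for i, y in enumerate(branches))
    return np.maximum(fused, 0.0), weights


rng = np.random.default_rng(0)
stream = rng.normal(size=(2, 3, 8, 8))
kernels = [rng.normal(size=(3, 3, 3)) for _ in range(3)]
adc_out, branch_weights = adaptive_dilated_fusion(stream, kernels, rng.normal(size=(3, 3)))
```

Larger dilation rates sample the same nine taps over a wider neighborhood, so the softmax weights decide, per sample, how much each receptive field contributes to the fused output.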

[0169] Step 4.4: Enhance the foreground stream $F_{fg}$ and the background stream $F_{bg}$ with the shared ADC module to obtain two multi-scale features, and then combine them by element-wise addition to obtain the complementary fusion feature map FM, as shown in the following formula:

[0170] $FM = \mathrm{ADC}(F_{fg}) + \mathrm{ADC}(F_{bg})$.

[0171] Step 4.5: Fuse the complementary feature map $FM$ semantically with the original low-resolution features $F_l$: $FM$ is resized by interpolation to the same resolution as the original low-resolution features and refined with a convolutional block; the result is concatenated with $F_l$ along the channel dimension and then compressed and reorganized by a second convolutional block to obtain the fused semantic features $F_{sem}$. The formula is as follows:

[0172] $F_{sem} = \mathrm{Conv}_{b}(\mathrm{Concat}(\mathrm{Conv}_{a}(\mathrm{Interp}(FM, (H_l, W_l))), \ F_l))$

[0173] where $\mathrm{Conv}_{a}$ and $\mathrm{Conv}_{b}$ are implemented by convolution to adjust channel dimensions and perform feature fusion; Concat indicates a channel-dimensional splicing operation; Interp indicates an interpolation scaling operation; and $H_l$ and $W_l$ represent the spatial height and width of the deep feature map $F_l$, i.e., $(H_l, W_l)$ is the target resolution to which the complementary fusion feature map $FM$ is resampled so that it has the same size as $F_l$.

[0174] Step 4.6: Upsample the result $F_{sem}$ obtained in Step 4.5 back to high resolution and sum it with the high-resolution features $F_h$ in a residual manner to obtain the FBAC output enhanced features, as shown in the following formula:

[0175] $F_{\mathrm{FBAC}} = \mathrm{Up}(F_{sem}) + F_h$;

[0176] where $F_{\mathrm{FBAC}}$ indicates the high-resolution enhanced features obtained after processing by the Foreground-Background Complementary Enhancement Module (FBAC), which fuses shallow detail information with deep semantic information, thereby improving feature representation capabilities.

[0177] Step 5: Based on the segmentation network model built in Steps 2 to 4, iteratively train and optimize the model using the training set divided in Step 1; test the trained model using the test set, and use the model to perform accurate segmentation of prostate cancer MRI images.

[0178] The proposed modules are integrated to form a complete end-to-end processing flow, enabling the medical image segmentation task. Details are as follows:

[0179] Step 5.1: Input the medical image to be processed into the network (input size is...); the data is then fed into the encoder for layer-by-layer feature extraction.

[0180] Step 5.2: The input image enters the first layer of the encoder, and obtains shallow features through double convolution branches; the CSSB process is executed to complete the consistency filtering of channels and space, and the filtered shallow features are obtained and output through residual connections.

[0181] Step 5.3: Continue to extract mid-to-high-level semantic features through the encoder; in the stage of cross-scale fusion with the decoder, input the high-resolution features and low-resolution deep semantic features into the FBAC module to complete foreground / background complementary attention construction, multi-scale modeling, dual-stream fusion and residual backinjection, and obtain the enhanced cross-scale fused features.

[0182] Step 5.4: Feed the enhanced multi-scale features into the decoder to restore the spatial resolution step by step, output the segmentation prediction results, and complete the model training and testing process.

[0183] Comparative experiment:

[0184] To verify the effectiveness of the method of the present invention, comparative experiments were conducted with various existing medical image segmentation methods under the same experimental environment and unified training strategy. Furthermore, the impact of key module configuration on the results was verified through ablation experiments and visualization analysis.

[0185] The proposed algorithm was compared with several state-of-the-art (SOTA) and recent models on two prostate cancer datasets (HY and PICAI) and one prostate dataset (PROMISE12). Evaluation metrics included the Dice coefficient (DSC), Intersection over Union (IoU), HD95, precision, and recall. Comparison methods included TransUNet, AttentionUNet, UNet++, CATNet, H2former, Hiformer, MDSAUNet, DAMAF, PMFSNet, and MAUNet, with comparisons performed under the same partitioning and training strategies.
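For reference, the overlap-based metrics named above can be computed from a pair of binary masks as in the sketch below. It shows the standard definitions only and is not the evaluation code of the embodiment; HD95 is omitted because it requires boundary distance computations, and the function name and `eps` value are illustrative.

```python
import numpy as np


def segmentation_metrics(pred, target, eps=1e-8):
    """Dice, IoU, precision, and recall for two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = np.logical_and(pred, target).sum()      # true positives
    fp = np.logical_and(pred, ~target).sum()     # false positives
    fn = np.logical_and(~pred, target).sum()     # false negatives
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }


pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [1, 0]]
metrics = segmentation_metrics(pred_mask, true_mask)
```

For a single pair of masks the two overlap scores are linked by IoU = DSC / (2 − DSC), so either one determines the other.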

[0186] To verify the effectiveness and advancement of the method of this invention, this embodiment compares and tests the invention with several existing advanced medical image segmentation networks on three prostate cancer MRI datasets: PROMISE12, HY, and PICAI.

[0187] Experimental results show that the present invention achieves optimal overall performance in various complex image scenarios: On the PROMISE12 dataset with relatively clear boundaries, the present invention achieves a Dice score of 0.9017. The dual-stream complementary multi-scale semantic guidance module (FBAC) effectively overcomes the limitations of existing models (such as UNet++) in balancing precision and recall, significantly reducing the false detection rate. On the HY dataset with small sample size and difficult feature extraction, the Dice score of the present invention jumps significantly to 0.6539. The cosine consistent sparse filtering module (CSSB) efficiently removes redundant interference in the early stage, successfully avoiding the feature alignment bottleneck that existing complex attention models (such as HiFormer) are prone to fall into, and exhibits extremely strong noise robustness. On the PICAI dataset containing a large number of small lesions and severe artifacts, the five-fold cross-validation test proves that the index distribution of the present invention is more concentrated and the stability is the strongest.

[0188] Overall, this invention employs a dual-end collaborative strategy of "shallow discriminative feature filtering" and "complementary modeling of shallow and deep semantic foreground and background," which not only locates small lesion areas more accurately but also greatly suppresses false positives caused by background artifacts. It outperforms existing technologies in terms of overall accuracy and generalization ability for prostate cancer image segmentation. The results are shown in Tables 1 and 2 and Figures 5 to 8:

[0189] Table 1. Segmentation results of various algorithms on the PROMISE12 dataset.

[0190]
Model | DSC (F1 Score) | Intersection over Union (IOU) | HD95 (pixel) | Precision | Recall
TransUnet | 0.8580 | 0.7513 | 32.0593 | 0.8809 | 0.8363
AttentionUnet | 0.8932 | 0.8071 | 20.4641 | 0.9068 | 0.8800
Unet++ | 0.8946 | 0.8093 | 27.7194 | 0.8866 | 0.8933
CATNet | 0.8899 | 0.8017 | 31.6395 | 0.8866 | 0.8933
H2former | 0.8327 | 0.7133 | 38.8425 | 0.8353 | 0.8301
Hiformer | 0.8459 | 0.7329 | 39.4061 | 0.8922 | 0.8041
MDSAUNet | 0.8785 | 0.7833 | 39.5247 | 0.8754 | 0.8816
DAMAF | 0.8373 | 0.7201 | 38.5186 | 0.8326 | 0.8420
PMFSNet | 0.8783 | 0.7831 | 31.6309 | 0.8875 | 0.8693
MAUNet | 0.8793 | 0.7846 | 28.4060 | 0.8877 | 0.8710
CSFBNet(Ours) | 0.9017 | 0.8210 | 23.3712 | 0.8988 | 0.9046

[0191] Table 2. Segmentation results of various algorithms on the HY dataset.

[0192]
Model | DSC (F1 Score) | Intersection over Union (IOU) | HD95 (pixel) | Precision | Recall
TransUnet | 0.5772 | 0.4057 | 55.6917 | 0.5428 | 0.6162
AttentionUnet | 0.6003 | 0.4288 | 43.4560 | 0.5072 | 0.7352
Unet++ | 0.6128 | 0.4418 | 32.0605 | 0.5628 | 0.6727
CATNet | 0.6008 | 0.4294 | 42.2291 | 0.5575 | 0.6515
H2former | 0.4989 | 0.3323 | 42.5718 | 0.4514 | 0.5575
Hiformer | 0.4505 | 0.2907 | 38.8074 | 0.4828 | 0.4222
MDSAUNet | 0.5860 | 0.4144 | 41.8127 | 0.5249 | 0.6632
DAMAF | 0.4989 | 0.3323 | 38.2232 | 0.5581 | 0.4510
PMFSNet | 0.4686 | 0.3060 | 57.5306 | 0.3816 | 0.6070
MAUNet | 0.5743 | 0.4028 | 45.3057 | 0.5059 | 0.6639
CSFBNet(Ours) | 0.6539 | 0.4857 | 30.8618 | 0.5834 | 0.7437

[0193] On the PROMISE12 dataset (Table 1), the proposed model (CSFBNet) achieved a Dice coefficient (DSC) of 0.9017 and an Intersection over Union (IOU) of 0.8210, improving on the second-best model, UNet++ (DSC 0.8946, IOU 0.8093), in both core evaluation metrics. Furthermore, the HD95 of this invention is 23.3712 pixels, lower than that of most comparative models (only AttentionUnet, at 20.4641, is lower), indicating a good lesion edge fit and supporting the effectiveness of the dual-stream complementary multi-scale semantic guidance module (FBAC) in enhancing lesion localization and boundary detail restoration.

[0194] Secondly, on the private HY dataset, which has high feature extraction difficulty and a small sample size (as shown in Table 2), the performance advantage of this invention is even more significant. Its DSC and IOU reach 0.6539 and 0.4857 respectively, a substantial improvement over UNet++ (DSC 0.6128, IOU 0.4418), the best-performing comparison model. Furthermore, its HD95 drops to 30.8618, the lowest among the compared models in Table 2 and well below that of more complex models (such as H2former's 42.5718). This performance supports the ability of the Cosine Consistency Sparse Selection Module (CSSB) to remove redundant interference and improve the discriminative power of shallow features in the early stages, giving the model strong noise robustness in small-sample and high-noise scenarios.

[0195] Finally, for the PICAI public dataset, which contains a large number of tiny lesions and complex artifacts, Figures 5 to 8 present the box plot distributions of the DICE, IOU, Precision, and Recall metrics for each comparative model under five-fold cross-validation. The charts show that the model of this invention (CSFBNet) not only leads in the five-fold mean (the diamond-shaped markers in the charts) for all evaluation metrics, but also has the most compact box distribution (the interquartile range, reflecting the fluctuation of the data) compared with models such as AttentionUNet and UNet++, and its lower whisker is significantly higher than that of the other models. This distribution indicates that the algorithm of this invention exhibits high predictive stability and generalization ability on unseen samples across different data folds, reducing the performance fluctuations and false positives that existing models are prone to on small and irregular lesions.

[0196] The algorithm model proposed in this invention appropriately handles shallow feature selection, deep multi-scale feature fusion, and the highlighting of important features. It achieved excellent results on three datasets, demonstrating its feasibility.

[0197] Visualization of segmentation results:

[0198] Referring to Figure 9, this invention visualizes the prediction masks of various typical segmentation models and of the CSFBNet model of this invention side by side on three datasets: HY, PROMISE12, and PICAI. This allows a direct comparison of the models in terms of target region localization, contour fitting, and background false detection. On the PROMISE12 dataset, because the organ boundaries are relatively clear, most methods obtain a relatively complete prostate contour, but some models still exhibit local discontinuities or slight deformations at boundary transitions. In contrast, the contour predicted by the model of this invention is closer to the labeled area, and its boundary connectivity is more stable. On the more challenging HY and PICAI datasets, the visualization results show that some comparative models are prone to two typical phenomena: first, missed detection of target regions, resulting in an incomplete predicted foreground; second, scattered artifacts in complex backgrounds or missegmentation of adjacent non-target tissues as foreground, forming false detection regions. In contrast, the foreground region predicted by the model of this invention is more concentrated, with fewer stray artifacts, and it maintains connectivity and boundary fitting more stably in small target regions, thus demonstrating more reliable segmentation output quality.

[0199] Attention visualization:

[0200] Referring to Figure 10, to further demonstrate how the model's response changes during feature learning, this embodiment visualizes the feature responses at different stages of the network. In the figure, from left to right: (a) original prostate image; (b) first-layer convolutional features of the encoder; (c) feature response after introducing CSSB in the corresponding layer of the encoder; (d) feature response after introducing FBAC; (e) final output features. Sample results from the HY, PROMISE12, and PICAI datasets are given from top to bottom. The visualization results show that the model's response evolves from dispersed to focused. In the shallow stage (b), the response is distributed mostly over high-frequency structures such as contours and textures, with wide overall coverage but limited discrimination for low-contrast or small-scale target regions. After introducing the CSSB module (c), through joint screening of cross-branch consistency and channel importance, the feature response is more concentrated in the candidate target-related regions and background pseudo-activations are reduced, indicating that shallow redundancy and noise information are suppressed. After introducing the FBAC module (d), combined with foreground-background complementary attention and multi-scale semantic guidance, the model's response to the target region is more stable and background interference is consistently suppressed, making the representation of the target's interior and boundary regions clearer. Finally, in the output stage (e), the response region is more compact, the shape is more coherent, and high responses in non-target regions are further weakened. This trend also holds in small-lesion and strong-interference scenarios such as PICAI, which reflects the synergistic effect of CSSB and FBAC in "shallow noise filtering" and "deep and shallow semantic guidance", prompting the model to gradually form a more reliable target focus and boundary characterization.

[0201] Ablation experiments: To further confirm that the proposed modules can effectively improve the segmentation performance of the backbone network, the performance of each module was evaluated individually. The results for the three datasets are shown in Tables 3, 4, and 5, respectively.

[0202] Table 3. Impact of CSSB and FBAC on overall model performance on the PROMISE12 dataset.

[0203]
Model | DSC (F1 score) | IoU | HD95 (pixel) | Precision | Recall
Baseline | 0.8901 | 0.8019 | 25.3095 | 0.8952 | 0.8851
Baseline+CSSB | 0.8992 | 0.8168 | 30.6925 | 0.8940 | 0.9044
Baseline+FBAC | 0.8986 | 0.8159 | 27.6168 | 0.9114 | 0.8861
Baseline+CSSB+FBAC (CSFBNet) | 0.9017 | 0.8210 | 23.3712 | 0.8988 | 0.9046

[0204] Table 4. Impact of CSSB and FBAC on overall model performance on the HY dataset.

[0205]
Model | DSC (F1 score) | IoU | HD95 (pixel) | Precision | Recall
Baseline | 0.5971 | 0.4265 | 49.4635 | 0.4851 | 0.7792
Baseline+CSSB | 0.6375 | 0.4679 | 37.3705 | 0.5727 | 0.7188
Baseline+FBAC | 0.6390 | 0.4695 | 28.5115 | 0.5340 | 0.7953
Baseline+CSSB+FBAC (CSFBNet) | 0.6539 | 0.4857 | 30.8618 | 0.5834 | 0.7437

[0206] Table 5. Impact of CSSB and FBAC on overall model performance on the PICAI dataset.

[0207]
Model | DSC (F1 score) | IoU | HD95 (pixel) | Precision | Recall
Baseline | 0.4571 | 0.2963 | 128.9189 | 0.4530 | 0.4614
Baseline+CSSB | 0.4632 | 0.3014 | 104.9758 | 0.4034 | 0.5439
Baseline+FBAC | 0.4732 | 0.2867 | 112.8617 | 0.4707 | 0.4231
Baseline+CSSB+FBAC (CSFBNet) | 0.4784 | 0.3144 | 109.3276 | 0.4431 | 0.5197

[0208] The cosine consistency sparse filtering module CSSB and the dual-stream complementary multi-scale semantic guidance module FBAC proposed in this invention improve segmentation performance on multiple datasets. After either CSSB or FBAC is introduced into the baseline network, the DSC improves on the PROMISE12, HY, and PICAI datasets, indicating that each module brings a performance gain on its own. When CSSB and FBAC are combined to form the complete CSFBNet, the model reaches the highest DSC and IoU on all three datasets, demonstrating the effectiveness and practical value of the two designs and of their combination.
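
The overlap metrics reported in the ablation tables (DSC, IoU, precision, recall) can be illustrated with a minimal sketch for binary masks. The function name `overlap_metrics`, the smoothing constant `eps`, and the toy masks are assumptions made for illustration; the text does not state how the metrics were aggregated over cases, and HD95 is not covered here.

```python
import numpy as np

def overlap_metrics(pred, target, eps=1e-8):
    """Return DSC, IoU, precision and recall for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # foreground in both masks
    fp = np.logical_and(pred, ~target).sum()   # predicted but not labeled
    fn = np.logical_and(~pred, target).sum()   # labeled but not predicted
    return {
        "dsc": float(2 * tp / (2 * tp + fp + fn + eps)),
        "iou": float(tp / (tp + fp + fn + eps)),
        "precision": float(tp / (tp + fp + eps)),
        "recall": float(tp / (tp + fn + eps)),
    }

# Toy example (hypothetical masks): 2 true positives, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
metrics = overlap_metrics(pred, target)
print(metrics)
```

For a single mask pair, DSC equals the harmonic mean of precision and recall, which is why the tables label it "F1 score".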

[0209] To further demonstrate the applicability and feasibility of the proposed cosine consistency sparse filtering module (CSSB), it was embedded into different encoder-layer configurations of the baseline UNet for comparative verification, as shown in Tables 6, 7, and 8:

[0210] Table 6. Impact of the number of CSSB layers on overall model performance on the PROMISE12 dataset.

[0211]
Model | DSC (F1 score) | IoU | HD95 (pixel) | Precision | Recall
Baseline + CSSB (first layer) | 0.8956 | 0.8109 | 25.9549 | 0.9071 | 0.8844
Baseline + CSSB (first and second layers) | 0.8992 | 0.8168 | 30.6925 | 0.8940 | 0.9044
Baseline + CSSB (first, second, and third layers) | 0.8984 | 0.8156 | 28.5764 | 0.8979 | 0.8990

[0212] Table 7. Impact of the number of CSSB layers on the overall model performance on the HY dataset.

[0213]
Model | DSC (F1 score) | IoU | HD95 (pixel) | Precision | Recall
Baseline + CSSB (first layer) | 0.6291 | 0.4589 | 27.5169 | 0.5447 | 0.7444
Baseline + CSSB (first and second layers) | 0.6375 | 0.4679 | 37.3705 | 0.5727 | 0.7188
Baseline + CSSB (first, second, and third layers) | 0.6056 | 0.4343 | 37.7858 | 0.5230 | 0.7191

[0214] Table 8. Impact of the number of CSSB layers on overall model performance on the PICAI dataset.

[0215]
Model | DSC (F1 score) | IoU | HD95 (pixel) | Precision | Recall
Baseline + CSSB (first layer) | 0.4662 | 0.3040 | 100.7347 | 0.4415 | 0.4939
Baseline + CSSB (first and second layers) | 0.4687 | 0.3061 | 110.3379 | 0.4511 | 0.4878
Baseline + CSSB (first, second, and third layers) | 0.4664 | 0.3041 | 120.7470 | 0.4574 | 0.4757

[0216] Three CSSB insertion schemes (layer 1 only, layers 1+2, and layers 1+2+3) were evaluated on the PROMISE12, HY, and PICAI datasets. Every configuration achieves a higher DSC than the baseline on all three datasets, and the "layers 1+2" configuration gives the highest DSC on each of them. This demonstrates that the CSSB module proposed in this invention can stably improve the segmentation performance of the baseline model under different insertion configurations and has good applicability.

[0217] To further demonstrate the applicability and feasibility of the dual-stream complementary multi-scale semantic guidance module FBAC proposed in this invention, variants of its key internal components were compared, as shown in Tables 9, 10, and 11:

[0218] Table 9. Impact of each FBAC submodule on model performance on the PROMISE12 dataset

[0219]
Method | FPAM | Dual-stream | ADC dynamic weights | DSC | IoU | HD95 (pixel) | Precision | Recall
w/o FPAM | × | √ | √ | 0.8984 | 0.8155 | 27.9480 | 0.9047 | 0.8922
w/o ADC-weight | √ | √ | × | 0.8986 | 0.8159 | 26.1099 | 0.9072 | 0.8902
Single-stream | √ | × | √ | 0.8927 | 0.8062 | 22.9093 | 0.9258 | 0.8619
Full (Ours) | √ | √ | √ | 0.8986 | 0.8159 | 27.6168 | 0.9114 | 0.8861

[0220] Table 10. Impact of each FBAC submodule on model performance on the HY dataset

[0221]
Method | FPAM | Dual-stream | ADC dynamic weights | DSC | IoU | HD95 (pixel) | Precision | Recall
w/o FPAM | × | √ | √ | 0.6081 | 0.4368 | 24.9648 | 0.5324 | 0.7087
w/o ADC-weight | √ | √ | × | 0.6332 | 0.4633 | 18.4697 | 0.5726 | 0.7081
Single-stream | √ | × | √ | 0.6267 | 0.4563 | 31.6139 | 0.5660 | 0.7019
Full (Ours) | √ | √ | √ | 0.6390 | 0.4695 | 28.5115 | 0.5340 | 0.7953

[0222] Table 11. Impact of each FBAC submodule on model performance on the PICAI dataset

[0223]
Method | FPAM | Dual-stream | ADC dynamic weights | DSC | IoU | HD95 (pixel) | Precision | Recall
w/o FPAM | × | √ | √ | 0.4680 | 0.3055 | 119.0857 | 0.4328 | 0.5095
w/o ADC-weight | √ | √ | × | 0.4645 | 0.3025 | 113.2640 | 0.4397 | 0.4923
Single-stream | √ | × | √ | 0.4682 | 0.3056 | 109.2007 | 0.4271 | 0.5181
Full (Ours) | √ | √ | √ | 0.4732 | 0.2867 | 112.8617 | 0.4707 | 0.4231

[0224] Variants were constructed by removing the foreground-first attention map (w/o FPAM), removing the ADC dynamic weights (w/o ADC-weight), and replacing the foreground/background dual-stream structure with a single-stream structure, and were compared with the full structure on the PROMISE12, HY, and PICAI datasets. The full FBAC structure achieves the highest DSC on HY and PICAI and ties for the highest DSC on PROMISE12, while the variants show varying degrees of performance change relative to the full structure. This demonstrates that the FBAC module proposed in this invention remains effective under different compositions and that its full structure gives more stable overall performance, reflecting the feasibility and applicability of the module design.

[0225] This invention can be combined with a medical image computer diagnostic system to assist doctors in the segmentation and diagnosis of medical images in clinical practice, providing valuable reference.

[0226] The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement it accordingly. They should not be construed as limiting the scope of protection of the present invention. All equivalent changes or modifications made in accordance with the spirit and essence of the present invention should be covered by the present invention.

Claims

1. An image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance, characterized in that it includes the following steps: The backbone architecture of an MRI image segmentation network model is constructed based on the classic encoder-decoder paradigm. The core structure and layers include: encoder, bottleneck layer, convolutional decoder unit, and bridging fusion layer. The first two layers of the encoder extract shallow features from the input image. A cosine consistency sparse filtering module (CSSB) is constructed and embedded in the intermediate layer of the encoder of the segmentation network model. Cross-branch consistency is obtained by calculating the cosine similarity of the dual-branch features in the channel dimension and the spatial dimension, and lightweight channel attention is introduced to obtain channel importance scores. The consistency scores and importance scores are jointly ranked to obtain the cosine consistency sparse filtering module (CSSB). In the deep and shallow feature interaction stage of the segmentation network model, a dual-stream complementary multi-scale semantic guidance module (FBAC) is constructed and connected. The FBAC module is configured to generate a foreground attention map from high-resolution features and construct a complementary background attention map to form a dual-stream branch. Semantic guidance features are obtained through dynamic multi-scale modeling to guide the decoder to recover details. This completes the construction of the entire MRI image segmentation network model.

2. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 1, characterized in that: The backbone architecture of the MRI image segmentation network model includes:
(1) The shallow layer of the encoder is the first layer of the encoder, which contains a double convolutional layer: two 3×3 convolutional layers + BN normalization function + ReLU activation function, which are used to obtain local detail features that are relatively similar in the early local area;
(2) The intermediate layer of the encoder is the second and third layers of the encoder, which embeds the cosine consistency sparse filtering module CSSB, which is used to perform filtering on shallow features of the same scale in the two branches;
(3) The deep layer of the encoder is the fourth layer of the encoder, which contains double convolution, and passes the filtered features to the next stage;
(4) The bottleneck layer is located in the deep and shallow feature interaction stage between the encoder and decoder;
(5) The convolutional decoder unit is used to receive features and gradually restore the spatial resolution of the image. It combines skip connections to achieve the step-by-step restoration of features;
(6) The cross-fusion layer is a dual-stream complementary multi-scale semantic guidance module FBAC between the bottleneck layer of the encoder and the shallow layer of the decoder, which is used for the guidance and fusion of foreground and background features.

3. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 1 or 2, characterized in that: The encoder is used to extract multi-scale features from the input MRI image. The shallow / first layer of the encoder contains a double convolutional layer to obtain early local details that are similar. The decoder is used to receive the features and gradually restore the spatial resolution.

4. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 1, characterized in that: The specific construction and feature processing steps of the backbone architecture of the MRI image segmentation network model include:
S1. Construct the overall segmentation network CSFBNet, with an overall structure based on an encoder-decoder architecture; the input image size is 256×256, and the encoder uses a convolutional module for feature extraction;
S2. Introduce a CSSB module in the shallow layer of the encoder: Construct two parameter-independent double convolutional branches to obtain dual-branch features of the same scale, and perform joint selection based on the cosine consistency of the two features in the channel and spatial dimensions and the channel importance weights; after completing the selection of Top-K channels and Top-M spatial positions, perform mask weighting, concatenation, and convolutional fusion, and make a residual connection with the initial features to obtain filtered and enhanced shallow features;
S3. Introduce the FBAC module at the cross-scale semantic interaction position: Input the high-resolution features output by the encoder and the low-resolution deep features of the bottleneck layer into the FBAC module, and generate a foreground-background complementary attention map based on the high-resolution features; after channel alignment and upsampling of the deep features, construct the foreground stream and background stream, and perform multi-scale modeling through adaptive dilated convolution (ADC); then perform complementary fusion of the two stream features, and inject them back into the high-resolution fused features in a residual manner to form an enhanced cross-scale semantic representation;
S4. The multi-scale features processed by CSSB and FBAC are input into the decoder, and the features are upsampled step by step through the decoding path and fused with skip connection features to output the final segmentation result.

5. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 1, characterized in that: The CSSB module is configured to perform channel-level and spatial pixel-level consistency filtering on shallow features of the same scale in both branches, and to use channel importance scores for sorting and judgment, thereby preserving shallow discrimination regions and suppressing redundant features. The filtered features are then passed to the bottleneck layer of the encoder via a fourth-layer double convolution for fusion in the next FBAC module.

6. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 1 or 5, characterized in that: The cosine consistency sparse filtering module (CSSB) is used to perform consistency filtering and redundant feature suppression on the similar early shallow features of the two paths. Its specific feature extraction and processing include the following steps:
1) The initial feature information $X$ from the first layer of the encoder is passed through two independent double-convolution paths and then downsampled to obtain the feature tensors, as in the following formula: $F_i = \mathrm{Down}(\mathrm{DoubleConv}_i(X)),\ F_i \in \mathbb{R}^{B \times C \times H \times W},\ i \in \{1, 2\}$; where $\mathbb{R}$ is the real number field, $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, respectively; $\mathrm{Down}$ denotes downsampling with a given kernel size and stride; $\mathrm{DoubleConv}$ denotes two 3×3 convolution layers + BN normalization function + ReLU activation function.
2) The two branch features are linearly projected and aligned to obtain the feature representation used for consistency calculation, as in the following formula: $\tilde{F}_i = W_p F_i,\ f_{i,c} = \mathrm{vec}(\tilde{F}_i[c]) \in \mathbb{R}^{HW},\ i \in \{1, 2\}$; where $f_{1,c}$ and $f_{2,c}$ are the spatially flattened vectors of the $c$-th channel of the two independent path branches after linear projection and alignment, each of length $HW$, used for the subsequent cosine similarity calculation; $W_p$ is a learnable linear projection matrix, implemented in the network through a 1×1 convolution with shared weights, which exchanges information between and aligns the two features along the channel dimension. The spatial response of each channel is vectorized, and the cosine similarity between the two branches on that channel is calculated to obtain the channel consistency score, as in the following formula: $s_c = \dfrac{\langle f_{1,c}, f_{2,c} \rangle}{\lVert f_{1,c} \rVert_2 \, \lVert f_{2,c} \rVert_2 + \varepsilon},\ c \in \{1, \dots, C\}$; where $s_c$ is the cosine similarity of the two branch features on the $c$-th channel and measures the consistency of the spatial response distribution of that channel (the larger the value, the more consistent the representation of the two features on that channel); $c$ is the channel index; $HW$ is the product of the height $H$ and width $W$ of the feature map, that is, the total number of pixels in a single feature channel; $\varepsilon$ is a very small constant that prevents division by zero. A channel description vector is constructed by weighted fusion of the two features, layer normalization, and global average pooling, and the channel importance scores are then obtained through a two-layer MLP and Softmax, as in the following formula: $z = \mathrm{GAP}\big(\alpha \, \mathrm{LN}(\tilde{F}_1) + (1 - \alpha) \, \mathrm{LN}(\tilde{F}_2)\big),\ a = \mathrm{Softmax}\big(W_2 \, \mathrm{GELU}(W_1 z)\big)$; where $a$ is the channel attention score; Softmax is the normalization function; $W_1$ and $W_2$ are learnable parameters whose linear mappings are implemented by 1×1 convolution; GELU is the activation function; $z$ describes the global importance of the channels; GAP denotes global average pooling and LN denotes the layer normalization function; $\alpha$ and $1 - \alpha$ are the respective weights of the layer-normalized $\tilde{F}_1$ and $\tilde{F}_2$.
3) The channel consistency score is multiplied by the channel importance weight to obtain a comprehensive score, the channels are sorted in descending order of comprehensive score, and the Top-K channel index set is selected, as in the following formula: $r_c = a_c \cdot s_c,\ \pi = \mathrm{Argsort}_{\downarrow}(r),\ \mathcal{I}_K = \pi[1{:}K]$; where $a_c$ is the importance score of the $c$-th channel; $s_c$ is the cosine similarity of the two branches on the $c$-th channel; $r_c$ is the comprehensive score that fuses the channel importance score and the consistency score; $r$ is the vector of comprehensive scores of all channels; $\pi$ is the channel ordering obtained by sorting in descending order; Argsort is the sorting function and $\downarrow$ denotes descending order; $K$ is the number of high-scoring channels to be retained, taken as a fraction of the total number of channels $C$; $\mathcal{I}_K$ is the set of indices of the top $K$ channels, which selects the important and consistent feature groups from each of the two path branches.
4) After the Top-K channel filtering, the channel vectors $u_1(i,j)$ and $u_2(i,j)$ are taken at each spatial location $(i,j)$, and the spatial cosine similarity is calculated to obtain the spatial consistency map $S$, as in the following formula: $S(i,j) = \dfrac{\langle u_1(i,j), u_2(i,j) \rangle}{\lVert u_1(i,j) \rVert_2 \, \lVert u_2(i,j) \rVert_2 + \varepsilon}$; where $u_1(i,j), u_2(i,j) \in \mathbb{R}^{K}$ and $\varepsilon$ is a very small constant that prevents division by zero. The spatial consistency map $S$ is flattened into a vector of length $H \times W$, Top-M spatial location selection is performed to obtain the set $\Omega$, and a binary mask $\mathcal{M}$ is constructed from this set, as in the following formula: $\Omega = \mathrm{ArgTopM}(\mathrm{vec}(S)),\ \mathcal{M}(i,j) = \mathbb{1}[(i,j) \in \Omega]$; where the indicator function $\mathbb{1}[\cdot]$ equals 1 when the spatial position $(i,j)$ belongs to the set $\Omega$ and 0 otherwise; $\mathrm{vec}(S) \in \mathbb{R}^{HW}$ is the flattened vector of $S$; $M$ is the number of retained spatial locations; with the retention ratio set to $\rho$, $M = \rho \cdot HW$; when $\rho = 0.5$, the top 50% of spatial locations with the highest similarity are retained. The operation is executed independently on each sample in the batch; ArgTopM returns the set of indices of the $M$ largest elements of the input vector.
5) The binary mask is broadcast along the channel dimension, the two features are weighted element by element and concatenated, 1×1 convolutional fusion is performed, and finally a residual connection is made with the initial features entering the CSSB module to obtain the CSSB output, as in the following formula: $F_{\mathrm{CSSB}} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\mathcal{M} \odot \tilde{F}_1^{K},\ \mathcal{M} \odot \tilde{F}_2^{K})\big) + \mathrm{Residual}(X)$; where $\tilde{F}_1^{K}$ and $\tilde{F}_2^{K}$ are the two branch features restricted to the channels in $\mathcal{I}_K$; $\mathrm{Conv}_{1 \times 1}$ is a convolution with a kernel size of 1, used to restore the channels from $2K$ to $K$ for fusion; Concat denotes feature concatenation along the channel dimension; $\odot$ denotes element-wise multiplication; Residual denotes the residual connection made with the initial features entering the CSSB module.
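
The screening steps of this claim (channel-wise cosine consistency, joint ranking with channel importance, Top-K channel selection, spatial cosine consistency, and Top-M masking) can be illustrated with a minimal NumPy sketch for a single sample. The function name `cssb_select`, the argument names, and the toy inputs are assumptions made for illustration; the learnable projection, the channel-attention MLP, the 1×1 convolutional fusion, and the residual connection are deliberately omitted, so this is not the claimed implementation.

```python
import numpy as np

def cssb_select(f1, f2, importance, k, m, eps=1e-8):
    """Toy cosine-consistency screening for one sample.

    f1, f2     : (C, H, W) same-scale features from the two branches
    importance : (C,) channel importance scores (assumed given, e.g. a softmax output)
    k, m       : number of channels / spatial positions to keep
    """
    c, h, w = f1.shape
    a = f1.reshape(c, -1)  # each channel flattened to length H*W
    b = f2.reshape(c, -1)

    # Channel consistency: cosine similarity between the branches, per channel.
    chan_sim = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)

    # Joint score = importance * consistency; keep the k highest-scoring channels.
    score = importance * chan_sim
    top_k = np.argsort(-score)[:k]
    a_k, b_k = a[top_k], b[top_k]  # (k, H*W)

    # Spatial consistency: cosine similarity of the k-dim channel vectors per position.
    spat_sim = (a_k * b_k).sum(axis=0) / (
        np.linalg.norm(a_k, axis=0) * np.linalg.norm(b_k, axis=0) + eps)

    # Binary mask over the m most consistent positions, broadcast over channels.
    top_m = np.argsort(-spat_sim)[:m]
    mask = np.zeros(h * w)
    mask[top_m] = 1.0
    mask = mask.reshape(1, h, w)
    return top_k, mask, f1[top_k] * mask, f2[top_k] * mask

rng = np.random.default_rng(0)
f1 = rng.normal(size=(8, 4, 4))
f2 = f1 + 0.1 * rng.normal(size=(8, 4, 4))  # second branch: noisy copy of the first
importance = np.full(8, 1 / 8)              # uniform importance for the demo
top_k, mask, g1, g2 = cssb_select(f1, f2, importance, k=4, m=8)
print(top_k, int(mask.sum()), g1.shape)
```

With a retention ratio of 0.5 on a 4×4 map, `m=8` keeps half of the spatial positions, matching the 50% example given in the claim.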

7. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 6, characterized in that: The dual-stream complementary multi-scale semantic guidance module FBAC is used to explicitly model the complementary semantic relationship between the foreground and background, and to guide deep feature fusion using multi-scale contextual information. Its specific construction and processing include the following steps:
A1: The final new feature map is passed through a double convolution layer to obtain the deep feature map $F_d$. The high-resolution features output by the encoder are denoted $F_h$; a single-channel foreground attention map $A_{fg}$ is generated from $F_h$, and the background attention map $A_{bg}$ is obtained in a complementary manner. The specific formula is as follows: $A_{fg} = \sigma\big(\mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}(F_h))\big),\ A_{bg} = 1 - A_{fg}$; where $\sigma$ is the Sigmoid activation function; $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolution used to adjust the channels for fusion; $\mathrm{DWConv}$ is the depthwise separable convolution used to generate the foreground attention map. At the same time, $F_d$ undergoes channel alignment and upsampling to the same scale as $F_h$: $\tilde{F}_d = \mathrm{Up}\big(\mathrm{Conv}_{1 \times 1}(F_d)\big)$; where Up denotes the upsampling operation; $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolution used to adjust channels and perform feature fusion; $F_d$ is the deep semantic feature map output by the last layer of the encoder, which contains rich semantic information but has low spatial resolution.
A2: The foreground stream $F_{fg}$ and the background stream $F_{bg}$ are constructed by multiplying the attention maps element-wise with the aligned deep features. The specific formula is as follows: $F_{fg} = A_{fg} \odot \tilde{F}_d,\ F_{bg} = A_{bg} \odot \tilde{F}_d$; where $\odot$ is element-wise multiplication, automatically broadcast along the channel dimension. Multi-scale dilated convolutions are then performed on the foreground and background streams respectively to obtain the output of each dilation-rate branch, the branch weights $w$ are calculated based on global average pooling and a channel description function, and the branch outputs are dynamically weighted and fused to obtain the ADC output, as in the following formula: $Y_i = \mathrm{DConv}_{r_i}(F),\ w = \mathrm{Softmax}\big(\mathrm{MLP}(\mathrm{GAP}(F))\big),\ \mathrm{ADC}(F) = \mathrm{ReLU}\Big(\sum_{i=1}^{3} w_i \cdot Y_i\Big),\ F \in \{F_{fg}, F_{bg}\}$; where $\mathrm{DConv}_{r_i}$ is the dilated convolution with dilation rate $r_i$; $Y_i$ is the output feature map of the $i$-th multi-scale branch, obtained by dilated convolution of the foreground features $F_{fg}$ or the background features $F_{bg}$ at dilation rate $r_i$; GAP is the global average pooling operation, used to extract channel-level global statistics; MLP is a lightweight multilayer perceptron used to generate the weight description of each branch; Softmax is the normalization function, used to map the responses of the branches to weight coefficients that sum to 1; $w$ is the vector of dynamic weights corresponding to the three dilated convolution branches, and $w_i$ is the scalar weight of the $i$-th branch; $\cdot$ denotes element-wise multiplication; ReLU is the linear rectification activation function; $\mathrm{ADC}(F)$ is the output feature obtained from $F_{fg}$ or $F_{bg}$ after multi-scale dilated convolution extraction and dynamic weighted fusion.
A3: The foreground and background streams share the same ADC module for enhancement. After the two multi-scale features are obtained, the complementary fusion feature map $F_M$ is obtained by adding them element-wise, as in the following formula: $F_M = \mathrm{ADC}(F_{fg}) + \mathrm{ADC}(F_{bg})$.
A4: $F_M$ is resized by interpolation to the same resolution as the original low-resolution features and refined by a convolutional block; the result is concatenated along the channel dimension and fused by a further convolutional block to obtain $F_{fuse}$. The formula is as follows: $F_{fuse} = \mathrm{Conv}_b\Big(\mathrm{Concat}\big(\mathrm{Conv}_a(\mathrm{Interp}(F_M; H_d, W_d)),\ F_d\big)\Big)$; where $\mathrm{Conv}_a$ and $\mathrm{Conv}_b$ are implemented by convolution to adjust channel dimensions and perform feature fusion; Concat is the concatenation operation along the channel dimension; Interp is the interpolation scaling operation; $H_d$ and $W_d$ are the spatial height and width of the deep feature map $F_d$, that is, the target resolution to which the complementary fusion feature map $F_M$ is resampled so that it has the same size as $F_d$. Finally, $F_{fuse}$ is upsampled back to high resolution and summed with $F_h$ as a residual to obtain the FBAC output enhancement feature, as in the following formula: $F_{out} = \mathrm{Up}(F_{fuse}) + F_h$; where $F_{out}$ is the high-resolution enhanced feature obtained after processing by the foreground-background complementary enhancement module (FBAC), which fuses shallow detail information with deep semantic information and thereby improves feature representation capability.
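
The dual-stream idea of this claim (complementary foreground and background attention, a shared multi-scale branch module with softmax-normalized dynamic weights, and element-wise fusion of the two streams) can be sketched in NumPy as follows. Several details are assumptions made for illustration and are not stated in the claim: the dilation rates `(1, 2, 3)`, a fixed averaging kernel with wrap-around padding standing in for learned dilated convolutions, a single matrix `w_mlp` standing in for the lightweight MLP, and global average pooling applied to the stream input. The interpolation, concatenation, and residual re-injection of step A4 are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dilated_box(x, rate):
    """3x3 dilated convolution with a fixed averaging kernel (wrap-around padding)."""
    out = np.zeros_like(x)
    for dy in (-rate, 0, rate):
        for dx in (-rate, 0, rate):
            out += np.roll(x, shift=(dy, dx), axis=(1, 2))
    return out / 9.0

def adc(stream, w_mlp, rates=(1, 2, 3)):
    """Weight the dilation branches dynamically and fuse them with ReLU."""
    outs = [dilated_box(stream, r) for r in rates]
    desc = stream.mean(axis=(1, 2))      # global average pooling -> (C,)
    weights = softmax(w_mlp @ desc)      # one scalar weight per branch, sums to 1
    fused = sum(wi * o for wi, o in zip(weights, outs))
    return np.maximum(fused, 0.0), weights

def complementary_fuse(att_logits, deep, w_mlp):
    """att_logits: (1, H, W) pre-sigmoid attention; deep: (C, H, W) aligned deep features."""
    a_fg = sigmoid(att_logits)           # foreground attention map
    a_bg = 1.0 - a_fg                    # complementary background attention map
    fg, _ = adc(a_fg * deep, w_mlp)      # both streams share the same ADC parameters
    bg, _ = adc(a_bg * deep, w_mlp)
    return fg + bg                       # complementary fusion feature map

rng = np.random.default_rng(0)
deep = rng.normal(size=(4, 8, 8))
att_logits = rng.normal(size=(1, 8, 8))
w_mlp = rng.normal(size=(3, 4))          # hypothetical stand-in for the MLP
fm = complementary_fuse(att_logits, deep, w_mlp)
_, branch_weights = adc(deep, w_mlp)
print(fm.shape, branch_weights)
```

Because the background map is defined as one minus the foreground map, the two streams always partition the aligned deep features, so no location is dropped before the multi-scale branches are applied.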

8. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 1, characterized in that: The deep and shallow feature interaction stage of the segmentation network model consists of the bottleneck layer of the encoder and the shallow layer of the decoder.

9. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 1, characterized in that: The steps also include: Obtain the image dataset to be processed, divide the dataset into training and testing sets, and uniformly adjust the size of the input images to a preset size; Based on the constructed segmentation network model, the model is iteratively trained and its parameters are optimized using the partitioned training set; the trained model is tested using the test set, and the model is then used to perform accurate segmentation of MRI images.

10. The image segmentation method based on cosine consistency screening and dual-stream complementary semantic guidance according to claim 9, characterized in that: The image dataset to be processed is a medical image dataset, which contains multiple MRI images of prostate cancer. The samples in the image dataset to be processed are scaled, and all samples are uniformly resized to 256×256.
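
The data preparation described in claims 9 and 10 (splitting the dataset into training and test sets and resizing every sample to 256×256) can be sketched as follows. The 80/20 split ratio, the random seed, nearest-neighbour resampling, and the helper names `resize_nearest` and `split_dataset` are assumptions made for illustration; the claims do not specify them.

```python
import numpy as np

def resize_nearest(img, size=(256, 256)):
    """Resize a 2D image (or label mask) by nearest-neighbour index sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]  # source row for each output row
    cols = np.arange(size[1]) * w // size[1]  # source column for each output column
    return img[rows][:, cols]

def split_dataset(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled (train_indices, test_indices) for n_samples items."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]

# Toy usage with a hypothetical 300x200 slice and 10 samples.
slice_2d = np.arange(300 * 200, dtype=np.float32).reshape(300, 200)
resized = resize_nearest(slice_2d)
train_idx, test_idx = split_dataset(10)
print(resized.shape, len(train_idx), len(test_idx))
```

Nearest-neighbour sampling keeps label masks binary after resizing; for the images themselves, a smoother interpolation may be preferable in practice.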