A stem cell image confluence calculation method and system based on a U-Net network
An improved U-Net network that combines a strip channel spatial attention structure with a strip ternary attention structure reduces the sensitivity of stem cell image segmentation to illumination and noise, enabling high-precision calculation of stem cell confluence and improving the robustness and accuracy of segmentation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QINGDAO UNIV OF SCI & TECH
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for stem cell image segmentation and confluence calculation are sensitive to changes in illumination and noise interference, depend heavily on parameter adjustment, and are ill-suited to in-situ real-time monitoring. Furthermore, traditional U-Net networks cannot capture the long-distance dependencies of the slender spindle-shaped structure of stem cells, resulting in large errors and underestimation of confluence.
An improved U-Net network is adopted, embedding a strip channel spatial attention structure and a strip ternary attention structure. Combined with full-process standardized preprocessing, a high-precision binary confluence mask image is generated through multi-level feature extraction and reconstruction by the encoder and decoder, and the pixel ratio of the stem cell region is calculated.
It significantly improves the accuracy and robustness of stem cell confluence calculation, can adapt to different light and noise levels, accurately distinguishes narrow gaps between adherent cells, improves the clarity of segmentation boundaries and regional consistency, and ensures high accuracy of confluence calculation.
Figure CN122200652A_ABST
Description
Technical Field
[0001] This invention relates to the fields of medical image processing and computer vision technology, and in particular to a method and system for calculating the confluence of stem cell images based on a U-Net network.
Background Technology
[0002] Confluence calculation during stem cell culture provides a key quantitative indicator for assessing cell growth status, determining passage timing, and screening for drug efficacy. Traditional methods mainly rely on manual microscopic observation or indirect measurement using cell counters, which suffer from high subjectivity and low efficiency and make in-situ, real-time monitoring difficult. With the development of computer vision and deep learning technologies, image processing-based automated analysis methods have gradually become a research hotspot.
[0003] However, existing methods for stem cell image segmentation and confluence calculation still have the following shortcomings. First, traditional image processing techniques (such as thresholding, watershed algorithms, and active contour models) require manual parameter adjustment for different images and are sensitive to changes in illumination, noise interference, and cell morphological heterogeneity. They are prone to oversegmentation or undersegmentation when cells are densely clustered or the background contains debris, leading to significant errors in confluence calculation. Second, although conventional U-Net and its variants introduce skip connections and attention mechanisms, their convolutional kernels usually have square receptive fields, which makes it difficult to capture the long-distance dependencies of the elongated spindle-shaped structures characteristic of stem cells. In addition, existing attention modules often recalibrate only a single dimension (channel or space) and cannot jointly model the anisotropic geometric features of cells and their global topological coherence. Third, when cell boundaries blur and gaps narrow in the later stages of high-density growth, traditional decoders tend to over-smooth during upsampling, so adjacent individuals within cell clusters cannot be accurately separated. Ultimately, overlapping areas in the mask are merged into a single connected region, directly leading to an underestimation of confluence.
[0004] How to solve the above-mentioned technical problems is the challenge facing this invention.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a U-Net-based method and system for calculating stem cell confluence in images, which significantly improves the accuracy and robustness of stem cell confluence calculation.
[0006] The technical solution adopted by this invention to solve its technical problem is as follows: This invention provides a method for calculating the confluence of stem cell images based on a U-Net network, comprising the following steps:
S1. Acquire stem cell images and preprocess them to obtain preprocessed stem cell images.
S2. Construct a stem cell image segmentation network and input the preprocessed stem cell image into the stem cell image segmentation network; the stem cell image segmentation network adopts an improved U-Net network, which includes several layers of symmetrical encoders and decoders; the encoder embeds a strip channel spatial attention structure, and the decoder embeds a strip ternary attention structure.
S3. The preprocessed stem cell image is downsampled and initial features are extracted by the encoder, and anisotropic features are extracted by the strip channel spatial attention structure to obtain an enhanced encoded feature map.
S4. The enhanced encoded feature map is upsampled by the decoder and weighted by the strip ternary attention structure to obtain the enhanced decoded feature map.
S5. Based on the enhanced decoded feature map, generate a binary confluence mask image, and calculate the pixel ratio of the stem cell region according to the binary confluence mask image to obtain the confluence of the stem cell image.
[0007] Preferably, the preprocessing in step S1 includes: region cropping, grayscale conversion and normalization, noise reduction and size standardization.
[0008] Preferably, the convolutional blocks in the encoder of the stem cell image segmentation network are replaced with residual blocks, wherein the residual blocks include batch normalization layers, activation functions, convolutional layers, and identity mapping connections. The stem cell image segmentation network employs a composite loss function, which includes a weighted binary cross-entropy loss and a weighted intersection-over-union (IoU) loss.
[0009] Preferably, the strip channel spatial attention structure includes a first encoding path, a second encoding path, and a third encoding path arranged in parallel, wherein the third encoding path includes a first sub-branch, a second sub-branch, and a third sub-branch. The extraction of anisotropic features through the strip channel spatial attention structure in step S3 includes:
In the first encoding path, the feature map is strip pooled along the horizontal and vertical directions respectively, and the results are fused after one-dimensional convolution and upsampling; the encoding spatial attention weights are then generated by convolution and Sigmoid activation and multiplied element-wise with the preprocessed stem cell image to output the first encoding enhancement feature.
In the second encoding path, the feature map is output directly as an identity mapping and used as the second encoding enhancement feature.
In the third encoding path, a spatial weight map is generated by the first sub-branch, spatial features are extracted by the second sub-branch, and features are extracted by the third sub-branch and then channel-recalibrated by a cascaded dual SE structure to obtain the output of the third sub-branch; the spatial weight map is multiplied element-wise with the spatial features of the second sub-branch, and the product is matrix multiplied with the output of the third sub-branch to output the third encoding enhancement feature.
The first, second, and third encoding enhancement features are fused by element-wise addition to obtain the enhanced encoded feature map.
The cascaded dual SE structure includes a first part and a second part. The first part performs channel recalibration guided by spatial topology through strip pooling combined with global average pooling, and the second part performs channel recalibration guided by salient detail through global max pooling.
[0010] Preferably, the strip ternary attention structure includes a first decoding path, a second decoding path, and a third decoding path arranged in parallel. The weighted processing using the strip ternary attention structure in step S4 includes:
In the first decoding path, the feature map is rotated 90° counterclockwise along the image height axis; global max pooling and global average pooling are performed on the rotated feature map and the results are concatenated along the channel; the first weight is generated by convolution and Sigmoid activation; the first weight is multiplied element-wise with the rotated feature map, and the product is rotated 90° clockwise along the height axis to output the first weighted feature.
In the second decoding path, the feature map is rotated 90° counterclockwise along the image width axis; global max pooling and global average pooling are performed on the rotated feature map and the results are concatenated along the channel; the second weight is generated by convolution and Sigmoid activation; the second weight is multiplied element-wise with the rotated feature map, and the product is rotated 90° clockwise along the width axis to output the second weighted feature.
In the third decoding path, strip pooling and max pooling are performed in parallel on the feature map; the pooling results are concatenated and then processed by convolution and depthwise separable strip convolution to output the third weighted feature.
After the first, second, and third weighted features are fused element-wise, the decoding spatial attention weights are generated by the Sigmoid activation function and multiplied element-wise with the decoded feature map to obtain the enhanced decoded feature map.
[0011] Preferably, in step S4, the decoder further includes a cross-layer splicing mechanism, which includes: The output of the i-th decoder level is concatenated with the output of the (i+1)-th decoder level along the channel, and then fused by convolutional dimensionality reduction to obtain the fused result. The fused result is used as the input of the i-th decoder level.
[0012] Preferably, the generation of the binary confluence mask image in step S5 includes: The enhanced decoded feature map is mapped to a probability map through an activation function. The probability value of each pixel in the probability map is compared with a preset threshold. When the probability value is greater than or equal to the preset threshold, the pixel is determined to belong to the stem cell region and is assigned a first pixel value; otherwise, it is determined to belong to the background region and is assigned a second pixel value, thus obtaining the binary confluence mask image.
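The thresholding and pixel-ratio steps can be illustrated with a minimal NumPy sketch. The threshold of 0.5 and the pixel values 255 and 0 are assumptions chosen for illustration; the text only specifies "a preset threshold", "a first pixel value", and "a second pixel value". The function name is hypothetical.

```python
import numpy as np

def confluence_from_probability(prob_map, threshold=0.5, fg_value=255, bg_value=0):
    """Binarize a probability map and return (mask, confluence).

    threshold, fg_value and bg_value are illustrative assumptions.
    """
    prob_map = np.asarray(prob_map, dtype=np.float64)
    is_cell = prob_map >= threshold                      # "greater than or equal to" rule
    mask = np.where(is_cell, fg_value, bg_value).astype(np.uint8)
    confluence = float(is_cell.mean())                   # stem cell pixels / all pixels
    return mask, confluence

# Toy 2x2 probability map: two of the four pixels reach the threshold.
prob = np.array([[0.9, 0.2],
                 [0.5, 0.1]])
mask, confluence = confluence_from_probability(prob)
print(mask)
print(confluence)
```

Computing the ratio from the boolean map rather than from the stored pixel values keeps the result independent of which two values encode foreground and background.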
[0013] This invention also provides a stem cell image confluence calculation system based on a U-Net network, comprising:
The image acquisition and preprocessing module is used to acquire stem cell images and preprocess the stem cell images to obtain preprocessed stem cell images.
The network management module is used to construct a stem cell image segmentation network and input the preprocessed stem cell images into the stem cell image segmentation network.
The encoding execution module is used to downsample the preprocessed stem cell image, extract initial features, and extract anisotropic features through a strip channel spatial attention structure to obtain an enhanced encoded feature map.
The decoding execution module is used to upsample the enhanced encoded feature map through the decoder and perform weighted processing through a strip ternary attention structure to obtain the enhanced decoded feature map.
The confluence calculation module is used to generate a binary confluence mask image based on the enhanced decoded feature map, and calculate the pixel ratio of the stem cell region based on the binary confluence mask image to obtain the confluence of the stem cell image.
[0014] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the above-described method for calculating the confluence of stem cell images based on the U-Net network.
[0015] The present invention also provides a computer storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the above-described method for calculating the confluence of stem cell images based on a U-Net network.
[0016] The beneficial effects of this invention are as follows: it significantly improves the accuracy and robustness of stem cell confluence calculation. By employing an end-to-end deep learning architecture combined with standardized image preprocessing throughout the process, it eliminates the need for manual parameter adjustment and adapts to stem cell microscopic images under different lighting and noise levels, enhancing the method's generalization ability and resistance to interference. A strip channel spatial attention structure is embedded in the encoder, using strip pooling and depthwise separable strip convolution to adapt to the anisotropic morphology of stem cells. Combined with the third encoding path and the cascaded dual SE structure, it fuses long-distance spatial dependence with channel feature recalibration, improving the network's ability to capture stem cell morphological features and its segmentation accuracy. A strip ternary attention structure is embedded in each level of the decoder; it establishes cross-dimensional long-range dependence through a multi-branch interactive architecture, strengthens the stem cell boundary feature response, suppresses the boundary smoothing effect during upsampling, and distinguishes the narrow gaps between adherent cells. Furthermore, a cross-layer splicing mechanism compensates for details lost during upsampling, further improving the clarity and regional consistency of segmentation boundaries and supporting accurate confluence calculation.
Attached Figure Description
[0017] Figure 1 is a diagram illustrating the method steps of the present invention.
[0018] Figure 2 is a system module diagram of the present invention.
[0019] Figure 3 is a diagram of the stem cell image segmentation network architecture of the present invention.
[0020] Figure 4 is a diagram of the encoder residual block structure of the stem cell image segmentation network of the present invention.
[0021] Figure 5 is a diagram of the encoder strip channel spatial attention structure of the stem cell image segmentation network of the present invention.
[0022] Figure 6 is a diagram of the encoder cascaded dual SE structure of the stem cell image segmentation network of the present invention.
[0023] Figure 7 is a diagram of the decoder strip ternary attention structure of the stem cell image segmentation network of the present invention.
[0024] Figure 8 is a schematic diagram comparing the segmentation results on stem cell growth images with those of other network models in Embodiment 3 of the present invention: (a) compares with the U-Net model, U-Net++ model and SegNet model; (b) compares with the Attention U-Net model, Fcn8s model and Fcn32s model; (c) compares with the LinkNet model, R2U-Net model and CE-Net model.
[0025] Figure 9 is a diagram of the internal structure of a computer device according to Embodiment 4 of the present invention.
Detailed Implementation
[0026] To clearly illustrate the technical features of this solution, the following detailed implementation method will be used to explain the solution.
[0027] Embodiment 1: As shown in Figures 1, 3, 4, 5, and 6, this embodiment is a method for calculating the confluence of stem cell images based on a U-Net network, including the following steps:
S1. Acquire stem cell images and preprocess them to obtain preprocessed stem cell images.
The core function of this step is to complete the compliant acquisition and standardized processing of the input images for the segmentation network, providing a standardized, noise-controlled, and feature-complete input data source for the subsequent network model, and ensuring at the data source the stability and consistency of the subsequent segmentation accuracy and confluence calculation results.
[0028] It should be noted that the stem cell images in this embodiment are bright-field color microscopic images acquired by an inverted biological microscope during the growth cycle of mesenchymal stem cells cultured in vitro. The images completely cover the effective adherent growth area of the stem cell culture container and can fully present the inherent spindle-shaped and elongated anisotropic morphological characteristics and real-time growth distribution of stem cells.
[0029] Furthermore, in this step, image acquisition adopts standardized imaging parameter settings. Specifically, a 10× plan achromatic objective lens can be used, the exposure time is fixed at 15 ms, the image gain is locked at 1×, and the acquired raw images are stored in lossless TIFF format. The pixel resolution of a single raw image is not less than 1024×1024, so as to fully preserve key morphological details such as the edge texture and pseudopodia outline of stem cells.
[0030] It should also be noted that after the original images are acquired, an initial image screening operation is performed to remove invalid images that are overexposed, severely out of focus, affected by large areas of culture medium contamination, or shifted in field of view. Only valid original images are retained for subsequent preprocessing, thus avoiding systematic interference of invalid data with the confluence calculation results.
[0031] Furthermore, a standardized preprocessing procedure is performed on the screened valid original stem cell images, specifically including:
1. Image effective area cropping and format standardization: The core purpose of this operation is to remove irrelevant and invalid areas such as black borders, scales, and character annotations generated during the microscope imaging process, and retain only the valid image content containing the stem cell growth area; at the same time, all original images are uniformly converted into single-channel 8-bit grayscale images to eliminate image format differences caused by different imaging devices and different acquisition batches.
[0032] 2. Image grayscale conversion and normalization processing: The core objective of this operation is to convert the acquired bright-field microscopic color images into single-channel grayscale images and unify the grayscale distribution range of the images, eliminating grayscale value shifts caused by differences in illumination and fluctuations in exposure parameters between different acquisition batches. Specifically, the pixel grayscale values of the single-channel grayscale image are linearly mapped to a standardized range of 0-1, completing full-image grayscale normalization. This ensures consistent grayscale distribution across different batches of images, improving the subsequent network's ability to generalize and adapt to images from different sources.
[0033] 3. Image background noise suppression processing: The core objective of this operation is to filter out Gaussian noise, salt-and-pepper noise, and random background interference from suspended debris in the culture medium generated during the acquisition of microscopic images, while preserving the edge and morphological details of stem cells. Specifically, an adaptive median filter is used to smooth the normalized image, with a filter window size of 3×3. This suppresses background noise while avoiding excessive smoothing and blurring of the elongated edge features of the stem cells.
[0034] 4. Image size normalization processing: The core objective of this operation is to ensure that the dimensions of all input images perfectly match the specifications of the input layer of the subsequent segmentation network, thereby guaranteeing the normal execution of the network's forward propagation. Specifically, a bicubic interpolation algorithm is used to uniformly scale the filtered images to the 512×512 pixel resolution specified by the network's input layer, resulting in the final preprocessed stem cell image.
[0035] It should be noted that all preprocessing operations in this step follow a standardized process with no parameter drift: identical processing steps are applied to all input images without any manual intervention, thus ensuring the full automation and repeatability of the entire confluence calculation method.
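The preprocessing chain described above (grayscale conversion, normalization to 0-1, 3×3 noise filtering, size standardization) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the embodiment's implementation: the luma weights, the use of min-max scaling, a plain (non-adaptive) median filter in place of the adaptive median filter, and nearest-neighbor resizing in place of bicubic interpolation are all substitutions made to keep the sketch dependency-free. Region cropping is omitted, and all function names are hypothetical.

```python
import numpy as np

def to_gray(rgb):
    """H x W x 3 color image -> single-channel grayscale (BT.601 luma weights, assumed)."""
    return rgb.astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def normalize01(gray):
    """Linearly map gray values to [0, 1] (min-max scaling, assumed); constant images map to 0."""
    lo, hi = gray.min(), gray.max()
    if hi == lo:
        return np.zeros_like(gray, dtype=np.float64)
    return (gray - lo) / (hi - lo)

def median3x3(img):
    """Plain 3x3 median filter with edge replication (stand-in for the adaptive variant)."""
    padded = np.pad(img, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1))

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize (stand-in for bicubic interpolation)."""
    rows = (np.arange(out_h) * img.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * img.shape[1] / out_w).astype(int)
    return img[np.ix_(rows, cols)]

def preprocess(rgb, size):
    """Apply the steps in the order given in the text: gray, normalize, denoise, resize."""
    return resize_nearest(median3x3(normalize01(to_gray(rgb))), size, size)

# Toy input: a black 8x8 image with a single bright (salt-noise) pixel.
rgb = np.zeros((8, 8, 3), dtype=np.uint8)
rgb[4, 4] = 255
out = preprocess(rgb, 4)   # the embodiment uses 512; 4 keeps the example small
print(out.shape, out.max())
```

In the toy input the isolated bright pixel is removed by the median filter, which is the behavior the noise suppression step relies on for salt-and-pepper noise.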
[0036] S2. Construct a stem cell image segmentation network and input the preprocessed stem cell image into the stem cell image segmentation network; the stem cell image segmentation network adopts an improved U-Net network, which includes several layers of symmetrical encoders and decoders; the encoder embeds a strip channel spatial attention structure, and the decoder embeds a strip ternary attention structure.
It should be noted that the stem cell image segmentation network in this embodiment is an SCSA-Net segmentation network based on the classic U-Net architecture. The network adopts a 6-layer symmetrical encoder-decoder architecture in which the number of layers and the feature map sizes of the encoder and decoder correspond exactly, and cross-layer feature fusion between the two is achieved through skip connections at the same layer. The input of the network is the preprocessed stem cell image output in step S1, a standardized single-channel grayscale image of 512×512 pixels. The output of the network is a pixel-level segmentation probability map of the same size as the input image, which provides the basis for subsequent binary mask generation and confluence calculation.
[0037] The loss function of the stem cell image segmentation network is defined as follows: (formula not reproduced in this text)
[0038] In the formula, the two component terms are the weighted binary cross-entropy loss and the weighted intersection-over-union (IoU) loss.
[0039] In these terms, the symbols denote, respectively, the probability predicted by the model, the ground-truth label mask, the standard binary cross-entropy function, and the pixel coordinates.
[0040] The edge enhancement weights are calculated as follows: (formula not reproduced in this text)
[0041] where the operator in the formula denotes the average pooling operation.
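Because the formulas themselves are not reproduced in this text, the following NumPy sketch shows only one plausible form of such a composite loss, under loudly stated assumptions: the edge weight is taken as 1 + alpha·|avgpool(mask) − mask|, the two terms are summed with equal weight, and the values alpha = 5, a 3×3 pooling window, and the +1 smoothing constants are illustrative choices borrowed from commonly used weighted BCE plus weighted IoU formulations. None of these specifics is confirmed by the source.

```python
import numpy as np

def box_average(mask, k):
    """Stride-1 average pooling with zero padding; output has the input size (k must be odd)."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.mean(axis=(-2, -1))

def edge_weight(mask, k=3, alpha=5.0):
    """Assumed edge enhancement weight: large where the local average differs from the label."""
    return 1.0 + alpha * np.abs(box_average(mask, k) - mask)

def composite_loss(prob, mask, k=3, alpha=5.0, eps=1e-7):
    """Assumed composite loss: weighted BCE + weighted IoU, equally weighted."""
    prob = np.clip(prob, eps, 1.0 - eps)
    w = edge_weight(mask, k, alpha)
    bce = -(mask * np.log(prob) + (1.0 - mask) * np.log(1.0 - prob))
    w_bce = (w * bce).sum() / w.sum()
    inter = (w * prob * mask).sum()
    union = (w * (prob + mask)).sum() - inter
    w_iou = 1.0 - (inter + 1.0) / (union + 1.0)
    return float(w_bce + w_iou)

# Toy label: a 2x2 cell region inside a 6x6 background.
mask = np.zeros((6, 6))
mask[2:4, 2:4] = 1.0
weight = edge_weight(mask)
good = composite_loss(mask.copy(), mask)   # prediction equal to the label
bad = composite_loss(1.0 - mask, mask)     # prediction inverted
print(weight[0, 0], weight[2, 2], good, bad)
```

Under these assumptions, pixels near the cell boundary receive a larger weight than pixels deep in the background, so boundary errors dominate both loss terms.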
[0042] Furthermore, the encoder, serving as the network's feature extraction entity, employs a hierarchical structure with progressive downsampling. Its core function is to extract multi-scale features from the input stem cell image, ranging from shallow details to deep semantics, generating multi-level encoded feature maps. It should also be noted that this network replaces all the basic double convolutional blocks of the encoder in the classic U-Net with residual block structures. Specifically, the core components of the residual block include two batch normalization layers, two ReLU activation functions, two convolutional layers, and one identity mapping connection. This residual structure, on the one hand, alleviates the gradient vanishing problem that may arise from increased network depth through shortcut connections, and on the other hand, promotes direct gradient propagation and cross-layer feature reuse, thereby enhancing the model's feature extraction capabilities in complex stem cell image segmentation tasks.
[0043] It should be noted that a strip channel spatial attention module (SCSA structure) is embedded in each level of the encoder. This module is used to anisotropically enhance the high semantic feature map output by the deep layers of the encoder to accurately adapt to the spindle-shaped and elongated geometric features of stem cells.
[0044] Furthermore, the SCSA structure employs three parallel feature extraction paths, denoted as path a, path b, and path c. The outputs of these three paths are fused to obtain a feature map that is jointly enhanced in the spatial and channel dimensions. Specifically: Path a: This path implements a strip spatial attention mechanism through strip pooling and depthwise separable strip convolution. Its core operations include: first, performing strip pooling on the input encoded feature map along the horizontal and vertical directions to obtain intermediate features that capture long-distance contextual dependencies; then, restoring the feature map size through one-dimensional convolution and upsampling to obtain feature maps in both directions.
[0045] The feature maps from the two directions are then concatenated along the channel dimension to obtain a channel-dimensional concatenated feature map.
[0046] Finally, the channel-dimensional concatenated feature map is sequentially processed through 1×1 convolution, two depthwise separable strip convolutions, 1×1 convolution, and the Sigmoid activation function to generate a path a spatial weight map. The path a spatial weight map is then multiplied element-wise with the input encoded feature map to obtain the path a feature map.
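The core of path a (strip pooling in two directions, size restoration, channel concatenation, a sigmoid weight map, and element-wise multiplication) can be sketched in NumPy as below. This is a reduced illustration under assumptions: the one-dimensional convolutions and the two depthwise separable strip convolutions are omitted, size restoration is done by broadcasting, and a single 1×1 convolution with hypothetical weights `w_fuse` stands in for the full convolution stack.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def strip_spatial_attention(feat, w_fuse):
    """feat: (C, H, W) feature map; w_fuse: (C, 2C) weights of an assumed 1x1 convolution."""
    c, h, w = feat.shape
    row_strip = feat.mean(axis=2, keepdims=True)     # (C, H, 1): pooled along the width
    col_strip = feat.mean(axis=1, keepdims=True)     # (C, 1, W): pooled along the height
    row_map = np.broadcast_to(row_strip, (c, h, w))  # restore the H x W size
    col_map = np.broadcast_to(col_strip, (c, h, w))
    stacked = np.concatenate([row_map, col_map], axis=0)   # (2C, H, W)
    logits = np.einsum("oc,chw->ohw", w_fuse, stacked)     # 1x1 convolution
    weights = sigmoid(logits)                              # spatial weight map in (0, 1)
    return weights * feat, weights

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))
w_fuse = 0.1 * rng.normal(size=(4, 8))
enhanced, weights = strip_spatial_attention(feat, w_fuse)
print(enhanced.shape)
```

Each strip-pooled value summarizes an entire row or column, which is why this kind of pooling suits elongated structures better than a square window of the same area.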
[0047] It should be noted that this approach can effectively reduce interference from irrelevant areas such as background debris and artifacts, accurately capture the spindle-shaped morphology of stem cells, and maintain their topological coherence.
[0048] Path b: This path is designed as an identity mapping branch, which does not perform any nonlinear transformation on the input encoded feature map and directly passes the original features to the fusion node. It should be noted that this design draws on the idea of residual learning to compensate for the effective information that may be lost in complex operations in paths a and c, and to alleviate the gradient vanishing problem in deep networks.
[0049] Path c: This path is an improved self-attention branch used to establish global long-distance dependencies between rows and columns in the feature map. In path c, the input encoded feature map passes through three sub-branches: q, k, and v. In the q-branch, the input encoded feature map passes through two consecutive 1×1 convolutions, a BN layer, and a Sigmoid function to generate the q-branch spatial weight map.
[0050] In the k-branch, the input encoded feature map is processed by two 3×3 convolutions and a BN layer to extract feature information and obtain the k-branch feature map.
[0051] In the v-branch, the input encoded feature map is processed by two 3×3 convolutions and BN layers to extract features, and then input into a cascaded dual SE structure (SDSE module) for channel recalibration to obtain the v-branch feature map.
[0052] Furthermore, the v branch embeds a cascaded dual SE structure (SDSE module). The cascaded dual SE structure consists of two parts, A and B, connected in series, used for progressive recalibration of the feature channels. In part A, the input feature map (the feature map obtained by extracting features from two 3×3 convolutions and BN layers in the v branch) first undergoes strip pooling and global average pooling in sequence to generate the weight vector of part A.
[0053] It should be noted that this strip pooling operation constructs long-distance context modeling in both horizontal and vertical directions, effectively overcoming the defect of traditional global average pooling in easily losing spatial morphological information when processing slender targets.
[0054] Subsequently, the channel weight vector is normalized by a fully connected layer and a Sigmoid function to obtain the normalized A-part weight vector. This normalized weight vector is then multiplied element-wise with the input feature map (the feature map obtained by extracting features through two 3×3 convolutions and BN layers in the v branch) to obtain the A-part feature map.
[0055] In part B, global max pooling is performed on the feature map of part A to extract the most salient information (such as cell edges, textures, etc.) in the feature map, resulting in the weight vector of part B. Then, the weight vector of part B is normalized by passing it through a 1×1 convolutional layer (replacing the traditional fully connected layer) and the Sigmoid function to obtain the normalized weight vector of part B. Then, it is multiplied element-wise with the feature map of part A to obtain the final feature map of the SDSE module (i.e., the v-branch feature map).
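The two-stage recalibration of the SDSE module can be sketched in NumPy as follows. This is a simplified illustration under assumptions: the weight matrices `w_a` and `w_b` are hypothetical, the fully connected layer and the 1×1 convolution are each written as one matrix-vector product (a 1×1 convolution on a C×1×1 tensor reduces to exactly that), and a ReLU is inserted after strip pooling. The ReLU is an assumption of this sketch, added because plain strip averaging followed by global averaging would otherwise collapse to ordinary global average pooling.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cascaded_dual_se(feat, w_a, w_b):
    """feat: (C, H, W); w_a, w_b: (C, C) hypothetical channel-mixing weights."""
    # Part A: strip pooling in both directions, then global average pooling.
    row_strip = np.maximum(feat.mean(axis=2), 0.0)   # (C, H), assumed ReLU
    col_strip = np.maximum(feat.mean(axis=1), 0.0)   # (C, W), assumed ReLU
    desc_a = 0.5 * (row_strip.mean(axis=1) + col_strip.mean(axis=1))   # (C,)
    weights_a = sigmoid(w_a @ desc_a)                # fully connected layer + Sigmoid
    part_a = feat * weights_a[:, None, None]

    # Part B: global max pooling on the part A output, then 1x1 convolution + Sigmoid.
    desc_b = part_a.max(axis=(1, 2))                 # (C,)
    weights_b = sigmoid(w_b @ desc_b)
    return part_a * weights_b[:, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))
out = cascaded_dual_se(feat, 0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 4)))
print(out.shape)
```

Because part B pools the output of part A rather than the raw input, the second set of channel weights depends on the first, which is what makes the recalibration progressive rather than two independent gates.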
[0056] It should be noted that, through the spatial topology restoration in Part A and the significant detail enhancement in Part B, the SDSE module achieves progressive channel attention calibration, enabling the network to adaptively highlight the most discriminative feature channels, thereby achieving more accurate segmentation performance in high-density overlapping cell scenes.
[0057] Furthermore, after extracting the three branches q, k, and v, the spatial weight map of branch q is multiplied element-wise with the feature map of branch k, and then matrix multiplied with the feature map of branch v to achieve deep fusion of spatial attention and channel attention, and output the feature map of path c.
[0058] It should also be noted that the outputs of the above three paths are fused by element-wise addition to obtain an enhanced coding feature map. This feature map not only integrates spatial long-distance dependence and channel recalibration capabilities, but also retains the original effective information by means of identity mapping, providing feature inputs with rich semantic and morphological information for subsequent decoders.
[0059] Furthermore, the decoder, as the main body of feature reconstruction in the network, adopts a hierarchical structure of progressive upsampling, symmetrical to the encoder's hierarchy. Its core function is to restore the spatial resolution of the deep semantic features output by the encoder. Between the encoder and decoder, the decoder receives shallow detail features from the encoder at the same level through skip connections, achieving cross-level fusion of deep semantic information and shallow detail information. Between different levels of the decoder, the output feature maps of two adjacent layers (such as level i and level i+1) are concatenated along the channel dimension, which can effectively compensate for the spatial details and semantic information lost during upsampling, significantly improving the clarity of segmentation boundaries and region consistency.
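The concatenation of adjacent decoder levels followed by convolutional dimensionality reduction can be sketched in NumPy as below. The text does not state how the two levels are brought to the same spatial size, so the 2× nearest-neighbor upsampling of the coarser level is an assumption, as are the channel counts and the hypothetical 1×1 convolution weights `w_reduce`.

```python
import numpy as np

def upsample2x_nearest(feat):
    """(C, H, W) -> (C, 2H, 2W) by repeating each pixel (assumed size-matching step)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def cross_layer_fuse(fine, coarse, w_reduce):
    """Concatenate two decoder levels along the channel axis and reduce with a 1x1 convolution.

    fine: (C1, H, W); coarse: (C2, H/2, W/2); w_reduce: (C_out, C1 + C2).
    """
    coarse_up = upsample2x_nearest(coarse)
    stacked = np.concatenate([fine, coarse_up], axis=0)    # (C1 + C2, H, W)
    return np.einsum("oc,chw->ohw", w_reduce, stacked)     # channel reduction

rng = np.random.default_rng(0)
fine = rng.normal(size=(4, 8, 8))      # level i output (illustrative shape)
coarse = rng.normal(size=(8, 4, 4))    # level i+1 output (illustrative shape)
# Illustrative weights that pass the fine features through and ignore the coarse ones.
w_reduce = np.hstack([np.eye(4), np.zeros((4, 8))])
fused = cross_layer_fuse(fine, coarse, w_reduce)
print(fused.shape)
```

In a trained network the reduction weights would be learned, letting each output channel mix detail from the finer level with context from the coarser one.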
[0060] It should be noted that the decoder also employs deep supervision techniques. Specifically, auxiliary classifiers are introduced at different levels of the decoding path (e.g., levels 4, 5, and 6). This involves connecting a 1×1 convolutional layer and an upsampling layer after the output feature map of the corresponding level to generate an auxiliary segmentation prediction of the same size as the input image, which then participates in the loss function calculation. This technique allows gradients to be directly backpropagated to shallower networks, alleviating the gradient vanishing problem, while simultaneously forcing the network to learn features at multiple scales, improving the detail completeness and semantic consistency of the final segmentation result.
[0061] It should also be noted that each stage of the decoder contains an embedded striped ternary attention structure (STA structure). This module is used to enhance the boundary response of dense cell regions in the decoded feature map, suppress the smoothing effect that may be produced by traditional large-size convolution, thereby improving the clarity and accuracy of segmentation boundaries.
[0062] Furthermore, the STA structure adopts a three-branch interactive structure, specifically including path l, path m, and path n: Path l: First, the input feature map (the enhanced encoded feature map from the encoder) is rotated 90° counterclockwise along the image height axis (i.e., a height axis dimension permutation operation is performed), resulting in the rotated feature map. Then, global max pooling and global average pooling are performed on the rotated feature map, and the pooling results are concatenated along the channel dimension to obtain the concatenated feature map of path l. Next, the concatenated feature map of path l is processed through a 7×7 convolutional layer and a sigmoid function to generate the path l weights. Finally, the path l weights are multiplied element-wise with the rotated feature map, and then rotated 90° clockwise along the height axis to restore the original shape, resulting in the path l feature map. This path is used to establish long-range dependencies across dimensions along the height direction.
[0063] Path m: Using operations symmetrical to path l, rotation, pooling, and attention calculations are performed along the image width axis to establish cross-dimensional long-range dependencies along the width direction.
[0064] Path n: This path does not perform tensor rotation, but instead performs strip pooling and max pooling on the input feature map (the enhanced encoded feature map from the encoder), and concatenates the results along the channel to obtain the concatenated feature map of path n; then, the concatenated feature map of path n is reconstructed by a 1×1 convolution to obtain the reconstructed feature map; then, the reconstructed feature map is sequentially passed through two depthwise separable strip convolutions (1×3 and 3×1 respectively) and a 1×1 convolution to extract supplementary spatial context features, thus obtaining the feature map of path n.
[0065] It should also be noted that the outputs of the three paths mentioned above are fused by element-wise addition, and then a spatial attention weight map with values ranging from 0 to 1 is generated by the Sigmoid activation function. Finally, this weight map is multiplied element-wise with the original input feature map to obtain the enhanced decoding feature map of the STA structure output.
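The data flow of path l and of the final three-path fusion can be sketched in plain Python as follows. This is a simplified illustration, not the network itself: feature maps are nested lists indexed `x[c][h][w]`, the rotation is modelled as a permutation of the channel and height dimensions, and the 7×7 convolution is replaced by a simple average of the two pooled maps (an assumption made to keep the sketch short). The function names are hypothetical.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def path_l(x):
    """Height-axis branch: permute, pool, gate, permute back."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # "Rotation" along the height axis, modelled as (C, H, W) -> (H, C, W).
    rot = [[[x[c][h][w] for w in range(W)] for c in range(C)] for h in range(H)]
    # Max and average pooling over the leading (height) dimension.
    mx = [[max(rot[h][c][w] for h in range(H)) for w in range(W)] for c in range(C)]
    av = [[sum(rot[h][c][w] for h in range(H)) / H for w in range(W)] for c in range(C)]
    # Stand-in for "7x7 convolution + sigmoid": average the two maps, then sigmoid.
    weight = [[sigmoid(0.5 * (mx[c][w] + av[c][w])) for w in range(W)] for c in range(C)]
    # Gate the rotated map, then permute back to (C, H, W).
    gated = [[[rot[h][c][w] * weight[c][w] for w in range(W)] for c in range(C)]
             for h in range(H)]
    return [[[gated[h][c][w] for w in range(W)] for h in range(H)] for c in range(C)]


def sta_fuse(x, out_l, out_m, out_n):
    """Add the three path outputs, squash to (0, 1), and gate the original input."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[x[c][h][w] * sigmoid(out_l[c][h][w] + out_m[c][h][w] + out_n[c][h][w])
              for w in range(W)] for h in range(H)] for c in range(C)]
```

Since every weight lies strictly between 0 and 1, the gated output never exceeds the input in magnitude; the structure only re-weights existing responses.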
[0066] Through the above three-branch design, the STA structure can jointly establish deep dependencies between the channel and spatial dimensions without dimensionality reduction, accurately capture and enhance the boundary information of long strip or directional target regions, and effectively solve the problems of blurred intercellular spaces and oversegmentation in high-density stem cell clusters.
[0067] It should also be noted that the parameter settings (such as kernel size, stride, padding method, etc.) of all convolutional layers, batch normalization layers, pooling layers and activation functions in the above network are all determined by those skilled in the art based on conventional deep learning practices, and will not be elaborated here.
[0068] In step S2, a complete stem cell image segmentation network is constructed, and the preprocessed stem cell images are input into the network, providing a model basis and input data for feature extraction, attention enhancement, and confluence calculation in subsequent steps S3 to S5.
[0069] S3. The preprocessed stem cell image is downsampled and initial features are extracted by the encoder, and anisotropic features are extracted by the strip channel spatial attention structure to obtain an enhanced encoded feature map. In this step, the preprocessed stem cell image from step S1 is downsampled and its features extracted step by step by an encoder to generate multi-level encoded feature maps. The encoded feature maps output by each level of the encoder are then input into the strip channel spatial attention module (i.e., SCSA structure) embedded in that level. This module enhances the anisotropic features according to the structure described in step S2 and outputs an enhanced encoded feature map.
[0070] It should be noted that the encoder adopts the 6-layer symmetrical structure described in step S2, with each layer containing a residual block and a downsampling layer. The first stage of the encoder receives the preprocessed stem cell image output from step S1, which has a size of 512×512 pixels and 1 channel.
[0071] Furthermore, in each encoding level, residual blocks first extract features from the input feature map. The structure of the residual block has been described in detail in step S2 and will not be repeated here. The residual block performs nonlinear transformation and refinement on the features through its internal convolution, batch normalization, and activation functions. At the same time, its identity mapping connection ensures the direct transmission of input features, effectively guaranteeing the integrity of gradient flow and feature information.
[0072] It should also be noted that after feature extraction is completed in the residual block, the encoder halves the spatial resolution of the feature map through a 2×2 max pooling operation with a stride of 2, while doubling the number of channels through a convolution operation. After six successive downsampling passes, the encoder generates six levels of encoded feature maps with spatial resolutions as follows: Level 1: 256×256 pixels, Level 2: 128×128 pixels, Level 3: 64×64 pixels, Level 4: 32×32 pixels, Level 5: 16×16 pixels, and Level 6: 8×8 pixels. Among them, the shallow feature maps (Levels 1 and 2) mainly retain spatial details such as the edges and textures of stem cells; the middle feature maps (Levels 3 and 4) begin to show local shape and structural features; and the deep feature maps (Levels 5 and 6) are rich in global semantic information and can characterize the overall distribution and category attributes of stem cell regions.
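The resolution schedule above follows directly from halving a 512×512 input six times, which can be checked with a short calculation. The channel schedule in the sketch is an inference, not a stated fact: it works backwards from the 512 channels reported for Level 6 and assumes the channel count doubles between every pair of adjacent levels.

```python
def encoder_resolutions(input_size=512, levels=6):
    """Side length of the feature map after each 2x2, stride-2 pooling step."""
    sizes = []
    size = input_size
    for _ in range(levels):
        size //= 2
        sizes.append(size)
    return sizes


def encoder_channels(deepest_channels=512, levels=6):
    """Channel count per level, ASSUMING a doubling between all adjacent levels."""
    return [deepest_channels // (2 ** (levels - k)) for k in range(1, levels + 1)]


print(encoder_resolutions())  # [256, 128, 64, 32, 16, 8]
print(encoder_channels())     # [16, 32, 64, 128, 256, 512]
```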
[0073] Furthermore, the feature maps output by each level of the encoder are saved through skip connections for subsequent fusion by the corresponding levels of the decoder.
[0074] It should be noted that the feature map output from the deepest level (level 6) of the encoder has a size of 8×8 pixels and 512 channels, giving it the largest receptive field and the richest semantic information. This feature map is then input into the SCSA structure embedded at that level of the encoder.
[0075] Furthermore, after the SCSA structure receives the feature map, anisotropic feature enhancement is performed according to the three parallel paths described in step S2 (the strip-shaped spatial attention path of path a, the identity mapping path of path b, and the spatial-channel joint self-attention path of path c) and their respective specific operations, resulting in an enhanced encoded feature map. This module's processing aims to accurately capture the long-range dependencies of the stem cell's spindle-shaped morphology, suppress background debris, bubbles, and other noise interference, and simultaneously enhance key features in both the spatial and channel dimensions.
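The idea behind the strip-shaped spatial attention of path a can be illustrated on a single-channel map. The following is a reduced sketch under stated assumptions: it keeps only the horizontal and vertical strip pooling, the broadcast fusion, the sigmoid, and the element-wise weighting, and it omits the one-dimensional and final convolutions. The function name `strip_attention` is hypothetical.

```python
import math


def strip_attention(fmap):
    """fmap is a 2-D list (H x W) for one channel; returns (weights, weighted map)."""
    H, W = len(fmap), len(fmap[0])
    # Horizontal strip pooling: one value per row (H x 1).
    row_means = [sum(row) / W for row in fmap]
    # Vertical strip pooling: one value per column (1 x W).
    col_means = [sum(fmap[h][w] for h in range(H)) / H for w in range(W)]
    # Broadcast both strips back to H x W, add them, and squash to (0, 1).
    weights = [[1.0 / (1.0 + math.exp(-(row_means[h] + col_means[w])))
                for w in range(W)] for h in range(H)]
    weighted = [[fmap[h][w] * weights[h][w] for w in range(W)] for h in range(H)]
    return weights, weighted
```

A row that contains an elongated bright structure raises the weight of every pixel along that row, which is how a strip-shaped window can link the two distant ends of a spindle-shaped cell where a small square kernel cannot.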
[0076] In step S3, the encoder completes the extraction of deep semantic features from the original stem cell image and uses the SCSA structure to achieve targeted enhancement for the anisotropic morphology of stem cells, providing high-quality feature input for the decoder's subsequent accurate boundary restoration and segmentation mask generation.
[0077] S4. The enhanced encoded feature map is upsampled by the decoder and weighted by the striped ternary attention structure to obtain the enhanced decoded feature map. In this step, the enhanced encoded feature map output from step S3 is input to the decoder. The decoder recovers the spatial resolution of the feature map by upsampling step by step, and fuses the feature maps of the same level of the encoder by skip connections at each decoding level. At the same time, the decoded feature map is weighted by the striped ternary attention module (i.e., STA structure) embedded in each decoding level, and the enhanced decoded feature map is output.
[0078] It should be noted that the decoder adopts a 6-layer structure symmetrical to the encoder, with each layer containing an upsampling layer and an STA structure. In each decoding layer, the spatial resolution of the feature map is first doubled through an upsampling layer. In this embodiment, transposed convolution is used as the upsampling operation, with a kernel size of 2×2 and a stride of 2. After upsampling, the size of the feature map is restored to the resolution of the corresponding encoder feature map at that layer.
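That a 2×2 transposed convolution with a stride of 2 exactly doubles the resolution can be verified with the standard output-size formula. The sketch below assumes no padding, no output padding, and a dilation of 1, none of which are specified in this description.

```python
def transposed_conv_output_size(size, kernel=2, stride=2, padding=0):
    """Output side length of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def decoder_resolutions(start=8, levels=6):
    """Side length after each of the decoder's upsampling layers."""
    sizes = []
    size = start
    for _ in range(levels):
        size = transposed_conv_output_size(size)
        sizes.append(size)
    return sizes


print(decoder_resolutions())  # [16, 32, 64, 128, 256, 512]
```

Starting from the 8×8 deepest encoder map, six such layers return to 512×512, matching the size of the preprocessed input image.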
[0079] It should be noted that an STA structure is embedded in each decoding layer. This module adaptively weights the decoding feature map of the current layer according to the three-branch interaction structure described in step S2 (path l rotates along the height axis, path m rotates along the width axis, and path n uses a combination of strip pooling and max pooling), resulting in an enhanced decoding feature map. The core function of the STA structure is to strengthen the boundary response of dense cell regions and suppress the smoothing effect that may be produced by traditional convolution operations, thereby improving the clarity and accuracy of segmentation boundaries.
[0080] It should also be noted that the decoder includes a cross-layer concatenation mechanism. Specifically, the output feature maps of two adjacent layers (e.g., level i and level i+1) in the decoder are concatenated along the channel dimension, and then a 1×1 convolutional layer is used for channel dimensionality reduction and feature fusion to form a vertical feature transfer path. This structure can effectively compensate for the spatial details and semantic information lost during upsampling, significantly improving the clarity of segmentation boundaries and regional consistency.
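A minimal sketch of the concatenate-then-reduce operation follows. Two points are assumptions rather than statements of this description: the deeper map is brought to the shallower map's resolution by 2× nearest-neighbour upsampling before concatenation, and the 1×1 convolution is written as a per-pixel linear combination without bias. Feature maps are nested lists indexed `[c][h][w]`, and all function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Repeat every pixel twice along height and width."""
    H, W = len(fmap[0]), len(fmap[0][0])
    return [[[ch[h // 2][w // 2] for w in range(2 * W)] for h in range(2 * H)]
            for ch in fmap]


def concat_channels(a, b):
    """Concatenate two maps of equal spatial size along the channel dimension."""
    return a + b


def conv1x1(fmap, weights):
    """1x1 convolution: weights[out_c][in_c] mixes channels at every pixel."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weights[o][c] * fmap[c][h][w] for c in range(C))
              for w in range(W)] for h in range(H)] for o in range(len(weights))]


def cross_layer_fuse(level_i, level_i_plus_1, weights):
    """Align the deeper map, concatenate with the shallower one, reduce channels."""
    aligned = upsample_nearest_2x(level_i_plus_1)
    return conv1x1(concat_channels(level_i, aligned), weights)
```

The number of rows in `weights` sets the channel count after fusion, which is how the 1×1 convolution performs the dimensionality reduction mentioned above.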
[0081] In step S4, the decoder completes the reconstruction from deep semantic features to full-resolution features, and achieves precise focusing on dense cell boundaries with the help of the STA structure, laying the feature foundation for generating a high-precision stem cell segmentation mask.
[0082] S5. Based on the enhanced decoding feature map, generate a binary confluence mask image, and calculate the pixel ratio of the stem cell region according to the binary confluence mask image to obtain the confluence of the stem cell image.
[0083] The input for this step is the boundary enhancement decoding feature map that is the final output of step S4 and perfectly matches the size of the input stem cell image. The overall execution process is divided into two core stages: the first stage is the binary confluence mask image generation stage, which transforms the high-dimensional decoding feature map into a binary image that accurately distinguishes the stem cell foreground from the culture medium background; the second stage is the confluence quantification calculation stage, which completes pixel-level statistics and proportion calculation based on the binary mask and outputs the final standardized confluence result.
[0084] Furthermore, the binary confluence mask image is generated first, and the specific execution process is as follows: The first step is to perform pixel-level probability mapping processing on the boundary enhancement decoding feature map output in step S4, normalizing the feature value of each pixel in the feature map to a probability range of 0-1, and generating a single-channel stem cell foreground probability map.
[0085] It should be noted that the Sigmoid activation function is used to complete the probability mapping in this embodiment. For each pixel in the probability map, the closer the output value is to 1, the higher the confidence that the pixel belongs to the stem cell region; the closer the output value is to 0, the higher the confidence that the pixel belongs to the culture medium background.
[0086] The second step is to set a standardized segmentation threshold, perform binarization judgment on the foreground probability map, and generate the final binary confluence mask image.
[0087] Furthermore, in this embodiment, the segmentation threshold is fixed at 0.5, and point-by-point judgment is performed on each pixel in the foreground probability map: if the foreground probability value corresponding to the pixel is greater than or equal to 0.5, then the pixel is judged as the stem cell foreground region and assigned a value of 1; if the foreground probability value corresponding to the pixel is less than 0.5, then the pixel is judged as the non-cellular background region and assigned a value of 0; finally, a binary confluence mask image with the same size as the input stem cell image and containing only two gray values, 0 and 1, is generated.
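The two-step mask generation (Sigmoid probability mapping followed by thresholding at 0.5) can be sketched as follows. The input is assumed to be a single-channel map of raw feature values given as a 2-D list; the function names are hypothetical.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def binarize(feature_map, threshold=0.5):
    """Map each value to a foreground probability, then assign 1 if it is >= threshold."""
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in feature_map]


print(binarize([[2.0, -1.0], [0.0, -3.0]]))  # [[1, 0], [1, 0]]
```

A feature value of exactly 0 maps to a probability of 0.5 and is therefore assigned to the foreground, consistent with the "greater than or equal to" rule above.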
[0088] It should also be noted that in the binary confluence mask image, the region with a pixel value of 1 completely corresponds to the stem cell adherent growth region in the original image, while the region with a pixel value of 0 corresponds to non-cellular invalid regions such as culture medium background, suspended debris, and imaging artifacts. This achieves pixel-level precise differentiation between the stem cell region and the background region, providing an unbiased statistical basis for subsequent quantitative calculation of confluence.
[0089] Furthermore, based on the generated binary confluence mask image, the quantitative calculation of stem cell confluence is performed. The specific execution process is as follows: The first step is to perform full-image pixel statistics on the binary confluence mask image to obtain the total number of pixels in the entire image and the number of effective pixels identified as stem cell foreground regions.
[0090] It should be noted that the total number of pixels in the image is the product of the height and width pixel values of the mask image, which is exactly the same as the total number of pixels in the preprocessed stem cell image in step S1; the effective number of pixels in the stem cell foreground region is the cumulative statistical value of all pixels in the mask image that are assigned a value of 1.
[0091] The second step is to calculate the pixel area ratio of the stem cell region based on the statistically obtained number of two types of pixels, and obtain the standardized confluence value corresponding to the stem cell image.
[0092] Furthermore, the formula for quantifying the confluence is: Confluence = (Number of effective pixels in the stem cell foreground region / Total number of pixels in the image) × 100%.
[0093] It should be noted that the final output confluence is presented as a percentage, ranging from 0% to 100%. A higher value indicates a larger adherent area and higher growth density of stem cells on the bottom of the culture container. Through step S5, the SCSA-Net segmentation network completes the entire process from raw image input to final quantitative output. This method transforms the complex stem cell image segmentation problem into end-to-end automated computation, and the final output confluence value is accurate and objective, which can be directly used for cell culture status assessment, experimental comparison, or process monitoring.
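The confluence formula of step S5 amounts to counting the pixels equal to 1 and dividing by the image area. A minimal sketch, assuming the mask is a 2-D list containing only 0 and 1 (the function name is hypothetical):

```python
def confluence_percent(mask):
    """Percentage of pixels in a binary mask that belong to the stem cell foreground."""
    total = sum(len(row) for row in mask)  # height x width
    if total == 0:
        raise ValueError("mask must contain at least one pixel")
    foreground = sum(sum(row) for row in mask)  # number of pixels equal to 1
    return 100.0 * foreground / total


print(confluence_percent([[1, 1, 0, 0], [1, 0, 0, 0]]))  # 37.5
```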
[0094] Example 2: As shown in Figure 2, this embodiment is a stem cell image confluence calculation system based on a U-Net network, including: The image acquisition and preprocessing module is used to acquire stem cell images and preprocess the stem cell images to obtain preprocessed stem cell images. The network management module is used to construct a stem cell image segmentation network and input the preprocessed stem cell images into the stem cell image segmentation network. The encoding execution module is used to downsample the preprocessed stem cell image, extract initial features, and extract anisotropic features through a strip channel spatial attention structure to obtain an enhanced encoded feature map. The decoding execution module is used to upsample the enhanced encoded feature map through the decoder and perform weighted processing through a striped ternary attention structure to obtain the enhanced decoded feature map. The confluence calculation module is used to generate a binary confluence mask image based on the enhanced decoded feature map, and calculate the pixel ratio of the stem cell region based on the binary confluence mask image to obtain the confluence of the stem cell image.
[0095] Example 3: As shown in Figure 8 (a)-(c), to verify the effectiveness of the stem cell image confluence calculation method based on U-Net network proposed in this invention, this embodiment compares the segmentation effect of the designed SCSA-Net method with the U-Net, U-Net++, Attention U-Net, SegNet, Fcn32s, Fcn8s, CE-Net, R2U-Net, and LinkNet models on labeled images (the labeled content of the images is the stem cells to be identified).
[0096] Table 1. Comparison of segmentation results of this method with other network models for stem cell growth images.
[0097] As shown in Table 1 above, the SCSA-Net model achieved the best performance in all three metrics: DSC, Precision, and Accuracy. In particular, its Precision reached 84.12%, outperforming the U-Net, U-Net++, Attention U-Net, SegNet, Fcn32s, Fcn8s, CE-Net, R2U-Net, and LinkNet models by 5.29%, 5.21%, 6.46%, 7.47%, 14.93%, 25.58%, 8.78%, 0.48%, and 12.04%, respectively. This demonstrates that the SCSA-Net model can effectively suppress background noise interference and thus identify real stem cell regions even when stem cells are densely distributed and the background is complex.
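For reference, the three metrics can be computed from a predicted mask and a labeled mask as sketched below. The definitions used are the conventional ones (DSC = 2TP / (2TP + FP + FN), Precision = TP / (TP + FP), Accuracy = (TP + TN) / all pixels); this description does not spell them out, so they are an assumption, as is the function name.

```python
def segmentation_metrics(pred, truth):
    """Return (dsc, precision, accuracy) for two binary masks of equal size."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    # Guard against empty denominators (e.g. no predicted foreground at all).
    dsc = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return dsc, precision, accuracy
```

Precision penalizes only false positives, so a high value indicates that few background pixels such as debris or artifacts are mistaken for cells.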
[0098] The core technical principle of this invention lies in the construction of an anisotropic feature enhancement mechanism targeting the spindle-shaped morphology and high-density distribution characteristics of stem cells through the deep coupling and synergistic driving of the SCSA module, SDSE module and STA module.
[0099] Geometric perception and progressive calibration of the SCSA and SDSE modules: Utilizing the unique three-path parallel topology of the SCSA module, the network first breaks through the limitations of the conventional square receptive field by using striped modeling. This allows for precise geometric adaptation to the elongated anisotropic features of stem cells, establishing global long-distance dependencies. Building upon this, the SDSE module, nested within it, plays a role in feature purification. Through the cascaded design of parts A and B, progressive channel recalibration is achieved: strip pooling restores the spatial topology, and global max pooling enhances significant texture and boundary information, thereby dynamically suppressing artifact noise in complex culture medium backgrounds and achieving preliminary refinement of target features.
[0100] Cross-dimensional interaction and boundary semantic focusing of the STA module: After feature purification by the encoder, the STA module on the decoder side further enhances its ability to focus on dense cell boundaries through its unique three-branch cross-dimensional interaction architecture. The STA module utilizes a rotation tensor mechanism to jointly establish channel-space dependencies without losing dimensionality, and introduces hybrid compressed pooling and depthwise separable strip convolutions, effectively suppressing the smoothing effect produced by traditional large-size convolutions. This design can accurately identify narrow gaps between adjacent stem cells in high-density late-stage growth images, effectively solving the technical challenge of traditional convolutions' difficulty in separating boundaries from dense cells due to limited receptive fields.
[0101] This collaborative mechanism, where SCSA/SDSE handles "geometric morphology adaptation and progressive channel purification," and STA handles "cross-dimensional boundary focusing and spatial detail restoration," combined with a 6-layer deepened U-Net architecture, residual learning structure to protect feature propagation, and deep supervision and decoder cross-layer fusion techniques to guide multi-scale gradients, significantly enhances the network's ability to express stem cell morphology. Ultimately, this invention achieves pixel-level automated segmentation and confluence quantification analysis with high boundary clarity, structural consistency, and regional accuracy even under complex background interference and in densely distributed scenes.
[0102] Example 4: This embodiment provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described method embodiments.
[0103] This computer device can be a server, and its internal structure may be as shown in Figure 9. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores server data. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a stem cell image confluence calculation method based on a U-Net network.
[0104] Those skilled in the art will understand that the structure shown in Figure 9 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0105] Example 5: This embodiment provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps in the above-described method embodiments.
[0106] If the functions implemented by the method are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0107] The logic and / or steps represented in the flowchart or otherwise described herein, for example, can be considered as a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a processor-including system, or other system that can fetch and execute instructions from, an instruction execution system, apparatus, or device). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.
[0108] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer diskettes (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, computer-readable media can even be paper or other suitable media on which the program can be printed, because the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in computer memory.
[0109] The technical features of this invention not described can be implemented by or using existing technology, and will not be repeated here. Of course, the above description is not a limitation of this invention, and this invention is not limited to the examples above. Any changes, modifications, additions or substitutions made by those skilled in the art within the scope of this invention should also be within the protection scope of this invention.
Claims
1. A method for calculating the confluence of stem cell images based on U-Net networks, characterized in that it includes the following steps: S1. Acquire stem cell images and preprocess them to obtain preprocessed stem cell images; S2. Construct a stem cell image segmentation network and input the preprocessed stem cell image into the stem cell image segmentation network; the stem cell image segmentation network adopts an improved U-Net network, which includes several layers of symmetrical encoders and decoders; the encoder embeds a strip channel spatial attention structure, and the decoder embeds a strip ternary attention structure; S3. The preprocessed stem cell image is downsampled and initial features are extracted by the encoder, and anisotropic features are extracted by the strip channel spatial attention structure to obtain an enhanced encoded feature map; S4. The enhanced encoded feature map is upsampled by the decoder and weighted by the striped ternary attention structure to obtain the enhanced decoded feature map; S5. Based on the enhanced decoding feature map, generate a binary confluence mask image, and calculate the pixel ratio of the stem cell region according to the binary confluence mask image to obtain the confluence of the stem cell image.
2. The method for calculating the confluence of stem cell images based on U-Net network according to claim 1, characterized in that the preprocessing in step S1 includes: region cropping, grayscale conversion and normalization, noise reduction, and size standardization.
3. The method for calculating the confluence of stem cell images based on U-Net network according to claim 2, characterized in that the convolutional blocks in the encoder of the stem cell image segmentation network are replaced with residual blocks, which include batch normalization layers, activation functions, convolutional layers, and identity mapping connections. The stem cell image segmentation network employs a composite loss function, which includes weighted binary cross-entropy loss and weighted intersection-over-union (IoU) loss.
4. The method for calculating the confluence of stem cell images based on U-Net network according to claim 3, characterized in that, The strip-shaped channel spatial attention structure includes a first encoding path, a second encoding path, and a third encoding path. The third encoding path includes a first sub-branch, a second sub-branch, and a third sub-branch. The first encoding path, the second encoding path, and the third encoding path are parallel path structures. The extraction of anisotropic features through the strip-channel spatial attention structure in step S3 includes: In the first encoding path, the feature map is strip pooled along the horizontal and vertical directions respectively, and then fused after one-dimensional convolution and upsampling. Then, the encoding spatial attention weights are generated by convolution and Sigmoid activation. The encoding spatial attention weights are multiplied element-wise with the preprocessed stem cell image to output the first encoding enhancement feature. In the second encoding path, the feature map is directly output as an identity mapping and used as the second encoding enhancement feature; In the third encoding path, a spatial weight map is generated through the first sub-branch, the spatial features of the second sub-branch are extracted through the second sub-branch, and the features are extracted through the third sub-branch. Then, the channel is recalibrated through a cascaded dual-SE structure to obtain the output of the third sub-branch. The spatial weight map is then multiplied element-wise with the spatial features of the second sub-branch, and then matrix multiplied with the output of the third sub-branch to output the third encoding enhancement feature. The first, second, and third encoding enhancement features are added and fused element by element to obtain the enhanced encoded feature map; The cascaded dual-SE structure includes a first part and a second part. The first part is used to complete channel recalibration guided by spatial topology through strip pooling combined with global average pooling, and the second part is used to complete channel recalibration guided by significant detail through global max pooling.
5. The method for calculating the confluence of stem cell images based on U-Net network according to claim 4, characterized in that, The striped ternary attention structure includes a first decoding path, a second decoding path, and a third decoding path, wherein the first decoding path, the second decoding path, and the third decoding path are parallel path structures; The weighted processing using the striped ternary attention structure in step S4 includes: In the first decoding path, the feature map is rotated 90° counterclockwise along the image height axis. Global max pooling and global average pooling are performed on the rotated feature map and concatenated along the channel. The first weight is generated by convolution and Sigmoid activation. The first weight is multiplied element-wise with the rotated feature map and then rotated 90° clockwise along the height axis to output the first weighted feature. In the second decoding path, the feature map is rotated 90° counterclockwise along the image width axis. Global max pooling and global average pooling are performed on the rotated feature map and concatenated along the channel. The second weight is generated by convolution and Sigmoid activation. The second weight is multiplied element-wise with the rotated feature map and then rotated 90° clockwise along the width axis to output the second weighted feature. In the third decoding path, strip pooling and max pooling are performed in parallel on the feature map. The pooling results are concatenated and then processed by convolution and depthwise separable strip convolution to output the third weighted feature. After the first, second, and third weighted features are fused by element-wise addition, the decoding spatial attention weights are generated by the Sigmoid activation function. The decoding spatial attention weights are then multiplied element-wise with the decoding feature map to obtain the enhanced decoding feature map.
6. The method for calculating the confluence of stem cell images based on a U-Net network according to claim 5, characterized in that, in step S4, the decoder further includes a cross-layer splicing mechanism, in which the output of the i-th decoder level is concatenated with the output of the (i+1)-th decoder level along the channel dimension and then fused by convolutional dimensionality reduction to obtain a fused result, and the fused result is used as the input of the i-th decoder level.
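As a non-limiting sketch, the channel concatenation and convolutional dimensionality reduction of claim 6 might look as follows. It assumes that the two feature maps already share the same spatial size (any upsampling is taken to have happened beforehand) and that the dimensionality reduction is a 1×1 convolution; the weight matrix `weights` and both function names are hypothetical.

```python
def concat_channels(a, b):
    """Concatenate two [C][H][W] feature maps along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0]), "spatial sizes must match"
    return a + b

def conv1x1(x, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(w_o[c] * x[c][i][j] for c in range(len(x))) for j in range(w)] for i in range(h)]
        for w_o in weights
    ]

# Two single-channel 2x2 maps standing in for the two decoder-level outputs.
level_i = [[[1.0, 2.0], [3.0, 4.0]]]
level_next = [[[10.0, 20.0], [30.0, 40.0]]]

stacked = concat_channels(level_i, level_next)   # 2 channels
fused = conv1x1(stacked, [[0.5, 0.5]])           # reduced back to 1 channel
print(fused)  # [[[5.5, 11.0], [16.5, 22.0]]]
```

The reduction restores the channel count expected by the receiving decoder level, so the fused result can be passed on without changing the downstream layer widths.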
7. The method for calculating the confluence of stem cell images based on a U-Net network according to claim 6, characterized in that the generation of the binary confluence mask image in step S5 includes: the enhanced decoded feature map is mapped to a probability map through an activation function; the probability value of each pixel in the probability map is compared with a preset threshold; when the probability value is greater than or equal to the preset threshold, the pixel is determined to belong to a stem cell region and is assigned a first pixel value; otherwise, the pixel is determined to belong to a background region and is assigned a second pixel value, thereby obtaining the binary confluence mask image.
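For illustration, the thresholding of claim 7 and the subsequent pixel-ratio calculation can be sketched as below. The claim leaves the activation function, the preset threshold, and the two pixel values open; the Sigmoid activation, the threshold of 0.5, and the values 255 and 0 used here are assumptions, as are the function names.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def binary_mask(feature_map, threshold=0.5, cell_value=255, background_value=0):
    """Map each value to a probability and threshold it into a two-valued mask."""
    return [
        [cell_value if sigmoid(v) >= threshold else background_value for v in row]
        for row in feature_map
    ]

def confluence(mask, cell_value=255):
    """Fraction of pixels in the mask that are marked as stem cell region."""
    total = sum(len(row) for row in mask)
    cells = sum(1 for row in mask for v in row if v == cell_value)
    return cells / total

# Toy 2x3 single-channel decoder output (hypothetical values).
feature_map = [
    [2.0, -1.0, 0.0],
    [-3.0, 1.5, -0.2],
]
mask = binary_mask(feature_map)
print(mask)              # [[255, 0, 255], [0, 255, 0]]
print(confluence(mask))  # 0.5
```

Because the comparison is "greater than or equal to", a pixel whose probability equals the threshold exactly (the value 0.0 above) is counted as stem cell region.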
8. A stem cell image confluence calculation system based on a U-Net network, characterized in that the system is used to perform the steps of the stem cell image confluence calculation method based on a U-Net network according to any one of claims 1 to 7, and comprises: an image acquisition and preprocessing module, used to acquire stem cell images and preprocess the stem cell images to obtain preprocessed stem cell images; a network management module, used to construct a stem cell image segmentation network and input the preprocessed stem cell images into the stem cell image segmentation network; an encoding execution module, used to downsample the preprocessed stem cell images, extract initial features, and extract anisotropic features through a strip channel spatial attention structure to obtain an enhanced encoded feature map; a decoding execution module, used to upsample the enhanced encoded feature map through the decoder and perform weighted processing through a strip ternary attention structure to obtain an enhanced decoded feature map; and a confluence calculation module, used to generate a binary confluence mask image based on the enhanced decoded feature map, and to calculate the pixel ratio of the stem cell region based on the binary confluence mask image to obtain the confluence of the stem cell image.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the stem cell image confluence calculation method based on a U-Net network according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the stem cell image confluence calculation method based on a U-Net network according to any one of claims 1 to 7.