An unsupervised cell segmentation method based on Transformer

An unsupervised, Transformer-based cell segmentation method that combines multimodal alignment with a low-rank attention mechanism removes the dependence on labeled data found in traditional methods and achieves efficient segmentation in complex cellular environments, including effective segmentation of cell types not seen in training.

CN120047460B · Active · Publication Date: 2026-06-30 · HANGZHOU DIANZI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HANGZHOU DIANZI UNIV
Filing Date: 2024-11-21
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Traditional supervised learning methods perform poorly in segmenting rare or highly heterogeneous cell types, mainly due to their reliance on large amounts of labeled data, which leads to insufficient segmentation capabilities under unsupervised conditions.

Method used

The method is an unsupervised, Transformer-based cell segmentation approach that combines the multimodally aligned OWL-ViT model with a low-rank attention mechanism. The MMX module provides zero-shot feature detection and adaptive segmentation, and the SAM segmentation module performs the final segmentation.

Benefits of technology

It significantly improves segmentation performance in complex cellular environments under unsupervised conditions, especially when dealing with cell types that have not been directly trained.



Abstract

This invention relates to an unsupervised cell segmentation method based on Transformer. The multimodal text-image alignment module aims to effectively fuse text and image data, achieving high-level alignment of multimodal information through the Transformer architecture. A low-rank attention mechanism enhances the interrelationship of data, enabling more accurate extraction and alignment of multimodal features in an unsupervised environment. The matching matrix feature optimization module further processes the aligned feature data. Utilizing a unique matching matrix optimization algorithm, the accuracy of feature matching is significantly improved. Through the optimized matching matrix, the parameters for segmenting cells are extracted and adjusted, providing more accurate initial cues for subsequent segmentation. The optimized features are input into the SAM segmentation module. SAM leverages its powerful segmentation capabilities to achieve high-precision cell segmentation. This module fully utilizes the advantages of the low-rank attention mechanism and matching matrix optimization to ensure the accuracy and robustness of the segmentation results.

Description

Technical Field

[0001] This invention relates to an unsupervised cell segmentation method based on Transformer, belonging to the field of medical image segmentation technology.

Background Technology

[0002] Traditional supervised learning has made some progress in cell segmentation, but its effectiveness is limited for rare or highly heterogeneous cell types, mainly because these methods rely heavily on large amounts of well-annotated data. To address this problem, this invention proposes a Transformer-based unsupervised cell segmentation framework that combines the benefits of multimodal alignment with the open-domain applicability of the OWL-ViT model. High-dimensional features from a pre-trained encoder are fed directly into the image encoder as input. The MMX module improves the model's ability to handle complex cellular environments and, through its adaptive zero-shot segmentation mechanism, enables the framework to segment cell types it was not directly trained on. Furthermore, a low-rank attention mechanism is implemented within the framework, improving computational efficiency and optimizing the feature recognition process.

[0003] In summary, how to obtain high-quality complex reasoning datasets at low cost is a topic worthy of in-depth research. This work explores the issue from the perspective of chain-of-thought and model self-enhancement, aiming to resolve the difficulties of current methods and form a complete unsupervised cell segmentation method based on Transformer.

Summary of the Invention

[0004] To overcome the shortcomings of existing models in unsupervised cell segmentation, this invention provides a Transformer-based unsupervised cell segmentation method. It applies a low-rank attention mechanism in a novel way and embeds an MMX module that optimizes cell parameters after zero-shot feature detection. Experimental verification shows that the framework performs better and generalizes to different cell types in segmentation tasks. In particular, for cell categories that were not directly trained on, it achieves significant improvements in segmentation performance on three public datasets compared with strong existing zero-shot segmentation models.

[0005] An unsupervised cell segmentation method based on Transformer includes a multimodal text-image alignment module, a MatchMatrix (MMX) feature optimization module, and a SAM segmentation module. It includes the following steps:

[0006] Step 1: Prepare the dataset, including three cell and tissue datasets using hematoxylin and eosin (H&E) and immunohistochemical staining (IHC);

[0007] Step 2: Perform data preprocessing, which includes preprocessing of images and text content. The main purpose is to enhance the features of cell images, increase data diversity, and obtain text word vectors.

[0008] Step 3: After preprocessing, perform multimodal alignment of image-text pairs. The initial data is passed through a Transformer encoder to obtain high-dimensional text and visual features. A projection layer projects the text and visual features onto a shared multimodal representation space. A contrastive loss function keeps unrelated images and texts far apart in the representation space and brings semantically related images and texts closer together.

[0009] Step 4: After data alignment, the feature vectors are mapped to the detection heads, which comprise a classification head and a regression head. For images, the classification head produces the confidence scores of cell categories, and the regression head performs linear regression of the initial target detection box vector. For text, the text embedding is obtained.

[0010] Step 5: Feed the regressed target detection locations into the Matching Matrix Feature Optimization (MMX) module. Extract local image features based on the detection boxes, mainly for areas with many cells. Use the visual Transformer encoder to obtain a new round of target locations. Combine these with the initial target detection boxes and apply the non-maximum suppression (NMS) algorithm to eliminate overlapping bounding boxes, obtaining the final target boundaries that are fed into the SAM model for segmentation.

[0011] Step 6: After obtaining the final target boundary, use it as the SAM prompt box and use SAM's segmentation capabilities to segment out the text-constrained cells.
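The contrastive objective in Step 3 can be illustrated with a small sketch. The patent does not state the exact loss, so the symmetric InfoNCE form, the `temperature` value, and all function names below are assumptions chosen for illustration only; the embeddings stand for the outputs of the projection layer.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def log_softmax_at(logits, target):
    """Log-probability of index `target` under a softmax over `logits`."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[target] - log_sum

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive pair.

    Assumed form (not taken from the patent): cosine similarity scaled by a
    temperature, cross-entropy in both image->text and text->image directions.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    img_to_txt = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n
    txt_to_img = -sum(
        log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (img_to_txt + txt_to_img)
```

With matched pairs the loss is close to zero, and it grows when each image is paired with the wrong text, which is the "pull related pairs together, push unrelated pairs apart" behavior described in Step 3.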

[0012] The data preparation steps in step one are as follows:

[0013] 1.1: Prepare three publicly available cell datasets: the Kaggle dataset, the Kumar dataset, and the Lizard dataset. The three datasets contain 665, 226, and 30 stained images, respectively.

[0014] 1.2: For each dataset, the images are uniformly converted to PNG format to facilitate subsequent image feature extraction.

[0015] 1.3: The original annotations, provided in RLE, MAT, and CSV formats, are converted to a uniform CSV format.

[0016] The data preprocessing method in step two is as follows:

[0017] 2.1: The image is scaled to a uniform size of 1024×1024, and the weighted average of the three RGB channels is used to obtain a single-channel grayscale image. This removes color redundancy, improves robustness, and reduces computational complexity.

[0018] 2.2: Gaussian filtering is applied to the grayscale image to remove image noise and enhance the edge features of the cell image. Then, normalization is applied to scale the pixel values of the filtered image to the standard range [0,1] to improve model efficiency.

[0019] 2.3: For the stained Kaggle (H&E), Kumar (H&E), and Lizard (IHC) images, text describing cell shape and color is obtained by close observation and machine recognition; this text is then tokenized and converted into word vectors.
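Steps 2.1 and 2.2 can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the grayscale weights (ITU-R BT.601), the kernel radius and sigma, the border handling, and min-max normalization are all assumed choices, and resizing to 1024×1024 is omitted. A real pipeline would more likely use an image library.

```python
import math

def to_grayscale(rgb):
    """Weighted average of the R, G, B channels (assumed BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb]

def gaussian_kernel(radius=1, sigma=1.0):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(values, kernel):
    """Convolve one row or column, clamping indices at the borders."""
    radius = len(kernel) // 2
    last = len(values) - 1
    return [
        sum(w * values[min(max(i + k - radius, 0), last)]
            for k, w in enumerate(kernel))
        for i in range(len(values))
    ]

def gaussian_blur(img, radius=1, sigma=1.0):
    """Separable Gaussian filter: blur rows first, then columns."""
    kernel = gaussian_kernel(radius, sigma)
    rows = [blur_1d(row, kernel) for row in img]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def normalize(img):
    """Min-max scale pixel values to [0, 1] (a constant image maps to 0)."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in img]

def preprocess(rgb):
    """Grayscale -> Gaussian filter -> normalization, as in steps 2.1 and 2.2."""
    return normalize(gaussian_blur(to_grayscale(rgb)))
```

The filter is applied separably (rows, then columns) because a 2-D Gaussian factors into two 1-D passes, which keeps the sketch short.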

[0020] The multimodal alignment method in step three is as follows:

[0021] Dataset $D_N$ contains $n$ images $I_{N,n}$. Each image is divided into $M \times M$ patches of fixed size $P \times P$, $\mathrm{Patch}_{N,n,i} \in \mathbb{R}^{M \times M \times d}$ (where $d$ is the dimension of the image feature vector). Each patch is flattened into a one-dimensional vector, standardized, and then a corresponding position encoding $\mathrm{PositionEncoding}_i$ is added to preserve the spatial information of the image. These image patches are grouped into batch data $P_{N,n}$ for easier processing by the Transformer. The vector of the $i$-th image patch is represented as $P_{N,n,i}$.

[0022] $P_{N,n,i} = \mathrm{Flatten}(\mathrm{Patch}_{N,n,i}) + \mathrm{PositionEncoding}_i$

[0023] $P_{N,n} = \{P_{N,n,i} \mid i \in \{1, 2, \dots, M \times M\}\}$

[0024] The text $E_{N,n}$ corresponding to the image is embedded as word vectors, and corresponding positional encodings are added to the word vectors. The embedding of the $j$-th word in the $n$-th text is $E_{N,n,j} \in \mathbb{R}^{l \times d}$ (where $l$ is the number of words in the $n$-th text, and $d$ is the dimension of the feature vector). Specifically, $E_{N,n,j}$ is expressed as follows:

[0025] $E_{N,n,j} = \mathrm{WordEmbedding}(\mathrm{Word}_{N,n,j}) + \mathrm{PositionEncoding}_j$

[0026] The image patch vectors and word vectors are used to generate the image feature sequence $I_{n,M \times M}$ and the text feature sequence $T_{n,l}$.

[0027] $I_{n,M \times M} = [P_{N,n,1}, P_{N,n,2}, \dots, P_{N,n,M \times M}]$

[0028] $T_{n,l} = [E_{N,n,1}, E_{N,n,2}, \dots, E_{N,n,l}]$
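The patch-sequence construction in [0021] to [0023] (split into patches, flatten, standardize, add a position encoding) can be sketched as below. The patent does not say which position encoding is used, so the sinusoidal form here is an assumption, as are the function names; learned position embeddings would be an equally plausible choice.

```python
import math

def patchify(img, patch):
    """Split an H x W image (list of rows) into flattened patches, row-major."""
    height, width = len(img), len(img[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([img[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def standardize(vec):
    """Zero-mean, unit-variance scaling of one flattened patch."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    std = math.sqrt(var) or 1.0
    return [(v - mean) / std for v in vec]

def position_encoding(index, dim):
    """Sinusoidal encoding for position `index` (assumed, not from the patent)."""
    return [
        math.sin(index / 10000 ** (k / dim)) if k % 2 == 0
        else math.cos(index / 10000 ** ((k - 1) / dim))
        for k in range(dim)
    ]

def image_sequence(img, patch):
    """Build [P_1, ..., P_{M*M}] with P_i = Flatten(Patch_i) + PositionEncoding_i."""
    sequence = []
    for i, flat in enumerate(patchify(img, patch)):
        encoding = position_encoding(i, len(flat))
        sequence.append([a + b for a, b in zip(standardize(flat), encoding)])
    return sequence
```

For a 4×4 image and 2×2 patches this yields four vectors of length four, one per patch, in row-major patch order.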

[0029] This method mainly focuses on extracting and optimizing specific features that need to be processed in each image.

[0030] The method for obtaining the initial coordinate frame and its confidence score in step four is as follows:

[0031] After the input image and text feature sequences are processed by the self-attention and feedforward layers of the Transformer encoder, the image feature output $F_{N,n,i}$ and the text feature output $\mathrm{Text}_{N,n,i}$ are obtained. The feature $F_{N,n,i}$ is mapped to a detection head, which includes a classification head and a regression head. The classification head predicts the target's category and generates a confidence score $C_{N,n,i}$; the regression head predicts the bounding box $B_{N,n,i}$ of the target.

[0032] $F_{N,n,i} = \mathrm{ViT}(I_{n,M \times M})$

[0033] $\mathrm{Text}_{N,n,i} = \mathrm{TTE}(T_{n,l})$

[0034] $C_{N,n,i} = \mathrm{softmax}(F_{N,n,i} \cdot W_c)$, whose $k$-th component is $\dfrac{\exp\big((F_{N,n,i} \cdot W_c)_k\big)}{\sum_{m=1}^{c} \exp\big((F_{N,n,i} \cdot W_c)_m\big)}$

[0035] $c$ is the number of categories for each image patch, $W_c$ is the weight matrix of the classification branch, and $(F_{N,n,i} \cdot W_c)_k$ is the $k$-th element of the category score vector $F_{N,n,i} \cdot W_c$, i.e. the score of the $k$-th category.
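Claim 5 states that the confidence score is a softmax over the category score vector F·W_c. A small worked sketch of that classification head follows; the weight layout (one row per feature dimension, one column per category) and the function names are assumptions made for illustration.

```python
import math

def category_scores(features, w_c):
    """Score vector F . W_c, with w_c given as d rows of c columns (assumed layout)."""
    num_categories = len(w_c[0])
    return [sum(f * row[k] for f, row in zip(features, w_c))
            for k in range(num_categories)]

def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, w_c):
    """Per-category confidence C = softmax(F . W_c)."""
    return softmax(category_scores(features, w_c))
```

For example, with features `[1.0, 0.0]` and `w_c = [[2.0, 0.0], [0.0, 2.0]]` the scores are `[2.0, 0.0]`, so the first category receives confidence 1 / (1 + e^-2), roughly 0.88.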

[0036] The bounding box parameters $BP$ (center coordinates $x, y$ and width and height $w, h$) are predicted for each image patch, and the output of the regression branch, $B_{N,n,i}$, is mapped back to the coordinate system of the original image.

[0037] $BP = F_{N,n,i} \cdot W_R$

[0038]

[0039] $W_R$ is the weight matrix of the regression branch.
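The regression branch in [0036] and [0037] can be sketched as a matrix product followed by a coordinate conversion. The text does not say how BP is scaled, so this sketch assumes the four outputs are normalized to [0, 1] relative to the image and converts them to corner coordinates in pixels; the function names and the clamping step are likewise illustrative assumptions.

```python
def regress_box(features, w_r):
    """BP = F . W_R, with w_r given as d rows of 4 columns (cx, cy, w, h)."""
    return [sum(f * row[k] for f, row in zip(features, w_r)) for k in range(4)]

def to_original_coords(bp, orig_width, orig_height):
    """Map a normalized (cx, cy, w, h) box to pixel corners (x0, y0, x1, y1).

    Assumes bp is expressed as fractions of the image size; the result is
    clamped so the box stays inside the original image.
    """
    cx, cy, w, h = bp
    x0 = (cx - w / 2) * orig_width
    y0 = (cy - h / 2) * orig_height
    x1 = (cx + w / 2) * orig_width
    y1 = (cy + h / 2) * orig_height

    def clamp(value, upper):
        return min(max(value, 0.0), float(upper))

    return (clamp(x0, orig_width), clamp(y0, orig_height),
            clamp(x1, orig_width), clamp(y1, orig_height))
```

A centered box covering half of each dimension, `(0.5, 0.5, 0.5, 0.5)`, on a 200×100 image maps to `(50.0, 25.0, 150.0, 75.0)`.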

[0040] The Matching Matrix Feature Optimization (MMX) module method in step five is as follows:

[0041] The MMX module uses the bounding boxes $B_{N,n}$ as the initial mapping to crop the local image features $I^{\mathrm{Tailor}}_{N,n,i}$.

[0042]

[0043] Here $i$ denotes the $i$-th bounding box in the bounding box set, and $B_{N,n}$ is the set of bounding boxes output by the regression branch.

[0044] The cropped local image features are resized and then padded to obtain the local image $R_{N,n,i}$. It is then unfolded into $W \times W$ patches of size $P \times P$ (16×16 pixel patches are used here); each patch is flattened into a one-dimensional patch vector $P^{(t)}_{N,n,i,j}$ and a position encoding is added.

[0045] $R_{N,n,i} = \mathrm{Pad}(\mathrm{Resize}(I^{\mathrm{Tailor}}_{N,n,i}))$

[0046] $\mathrm{Patch}_{N,n,i,j} = \mathrm{Patchify}(R_{N,n,i})$

[0047] $P^{(t)}_{N,n,i,j} = \mathrm{Flatten}(\mathrm{Patch}_{N,n,i,j}) + \mathrm{PositionEncoding}_j$

[0048] $P^{(t)}_{N,n,i} = \{P^{(t)}_{N,n,i,j} \mid j \in \{1, 2, \dots, W \times W\}\}$

[0049] $\mathrm{Patch}_{N,n,i,j}$ is the image patch vector of the $j$-th patch, $P^{(t)}_{N,n,i}$ is the set of flattened patch vectors of the image, and $D_{\mathrm{Patch}}$ is the vector dimension after the image patch is flattened.
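The crop, resize, and pad operations in [0041] to [0045] can be sketched as follows. The interpolation method, the padding value, the target side length, and the placement of the padding (right and bottom) are not given in the text, so nearest-neighbor resizing, zero padding, and the `target` parameter below are assumptions; boxes are taken as integer pixel corners `(x0, y0, x1, y1)`.

```python
def crop(img, box):
    """Cut the region (x0, y0, x1, y1) out of an image stored as a list of rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in img[y0:y1]]

def resize_nearest(img, new_height, new_width):
    """Nearest-neighbor resize (an assumed interpolation choice)."""
    height, width = len(img), len(img[0])
    return [[img[r * height // new_height][c * width // new_width]
             for c in range(new_width)]
            for r in range(new_height)]

def resize_and_pad(img, target, fill=0.0):
    """Scale the longer side to `target`, then pad right and bottom to a square."""
    height, width = len(img), len(img[0])
    scale = target / max(height, width)
    new_height = max(1, min(target, round(height * scale)))
    new_width = max(1, min(target, round(width * scale)))
    resized = resize_nearest(img, new_height, new_width)
    padded = [row + [fill] * (target - new_width) for row in resized]
    padded += [[fill] * target for _ in range(target - new_height)]
    return padded

def local_image(img, box, target=32):
    """R = Pad(Resize(crop)): the square local image fed to the second round."""
    return resize_and_pad(crop(img, box), target)
```

With 16×16 patches, a local image of side `target` unfolds into (target / 16) × (target / 16) patches, so `target=32` gives W = 2 and four patch vectors.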

[0050] $P^{(t)}_{N,n,i,j}$ is linearly projected into a high-dimensional feature vector in a fixed feature space, which is used as the input feature of the Transformer to obtain the second-round output features $F^{(t)}_{N,n,i}$.

[0051] $\mathrm{Proj}_{N,n,i} = W_p \cdot P^{(t)}_{N,n,i} + b_p$

[0052] $F^{(t)}_{N,n,i} = \mathrm{ViT}(\mathrm{Proj}_{N,n,i})$

[0053] $\mathrm{Proj}_{N,n,i}$ is the high-dimensional feature vector after projection, $W_p$ is the weight matrix of the linear projection, $b_p$ is the bias vector, and $D$ is the dimension after projection.
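The linear projection in [0051] is an affine map applied to each flattened patch vector. A tiny worked sketch, assuming `w_p` is stored as D rows of D_Patch columns (the layout and function name are illustrative):

```python
def linear_projection(patch_vec, w_p, b_p):
    """Proj = W_p . P + b_p for one flattened patch vector.

    w_p has D rows (projected dimension), each of length D_Patch;
    b_p has length D.
    """
    return [sum(w * x for w, x in zip(row, patch_vec)) + b
            for row, b in zip(w_p, b_p)]
```

For instance, projecting `[1.0, 2.0, 3.0]` with `w_p = [[1, 0, 0], [0, 1, 1]]` and `b_p = [0.5, -1.0]` gives `[1.5, 4.0]`, a map from D_Patch = 3 to D = 2.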

[0054] $F^{(t)}_{N,n,i}$ is mapped to the detection head to yield the corresponding category score $C^{(t)}_{N,n,i}$ and bounding box parameters $B^{(t)}_{N,n,i}$.

[0055] $C^{(t)}_{N,n,i} = \mathrm{softmax}(F^{(t)}_{N,n,i} \cdot W_c)$, $\quad B^{(t)}_{N,n,i} = F^{(t)}_{N,n,i} \cdot W_R$

[0056] $c$ is the number of categories for each image patch, $W_c$ is the weight matrix of the classification branch, $W_R$ is the weight matrix of the regression branch, and $(F^{(t)}_{N,n,i} \cdot W_c)_k$ is the $k$-th element of the category score vector $F^{(t)}_{N,n,i} \cdot W_c$, i.e. the score of the $k$-th category.
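Step 5 combines the initial detection boxes with the second-round boxes and removes overlaps with non-maximum suppression. A minimal sketch of greedy NMS is given below; the IoU threshold of 0.5 and the corner box format `(x0, y0, x1, y1)` are assumptions, since the text does not specify them.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much.

    Returns the indices of the kept boxes, ordered by decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept

def merge_rounds(initial, refined, iou_threshold=0.5):
    """Pool (box, score) pairs from both detection rounds and apply NMS."""
    pooled = list(initial) + list(refined)
    boxes = [box for box, _ in pooled]
    scores = [score for _, score in pooled]
    return [boxes[i] for i in nms(boxes, scores, iou_threshold)]
```

The surviving boxes are the final target boundaries that would be passed to SAM as box prompts.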

[0057] The SAM segmentation method in step six:

[0058] By using prompts in image segmentation tasks, SAM borrows prompting strategies from natural language processing (NLP) to rapidly segment arbitrary targets and to adapt to different downstream tasks.

[0059] The preprocessed image is passed through a convolutional base that divides it into fixed-size patches. After standardization, each patch is flattened into a one-dimensional vector and a corresponding position encoding ($\mathrm{PositionEncoding}_i$) is added, yielding the vector set $P_{N,n}$ defined above. The feature vectors are then input into the Transformer to obtain the high-dimensional feature output $F^{\mathrm{SAM}}_{N,n,i}$.

[0060] $P_{N,n} = \{P_{N,n,i} \mid i \in \{1, 2, \dots, M \times M\}\}$

[0061] $F^{\mathrm{SAM}}_{N,n,i} = \mathrm{ViT}(P_{N,n,i})$

[0062] The Prompt Encoder converts each bounding box into a feature vector $BB_{N,n,j}$.

[0063] $BB_{N,n,j} = \mathrm{PromptEncoder}(B^{(t)}_{N,n,i})$

[0064] By combining the image embedding and the prompt embedding, the Mask Decoder generates a segmentation mask $M_{N,n,j}$.

[0065] $M_{N,n,j} = \mathrm{MaskDecoder}(F^{\mathrm{SAM}}_{N,n,i}, BB_{N,n,j})$

[0066] Because the generated mask includes connectivity values, it can be further refined to obtain a more accurate prediction mask.

Compared to existing technologies, the advantages of this invention are:

[0067] This invention comprises three core modules: a multimodal text-image alignment module, a matching matrix feature optimization module (MMX), and a SAM segmentation module. First, the multimodal text-image alignment module fuses text and image data, achieving high-level alignment of multimodal information through a Transformer architecture. This process strengthens the interrelationship of the data through a low-rank attention mechanism, enabling more accurate extraction and alignment of multimodal features in an unsupervised environment. Second, the matching matrix feature optimization module further processes the aligned feature data. This module uses a matching matrix optimization algorithm to improve the accuracy of feature matching. Through the optimized matching matrix, the parameters for segmenting cells are extracted and adjusted, providing more accurate initial prompts for subsequent segmentation. Finally, the optimized features are input into the SAM segmentation module. SAM (Segment Anything Model) performs high-precision cell segmentation based on the input bounding box prompts. This module draws on the low-rank attention mechanism and the matching matrix optimization to ensure the accuracy and robustness of the segmentation results.

Attached Figure Description

[0068] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0069] Figure 1 is a schematic diagram of the structure of the Transformer-based unsupervised cell segmentation method according to the present invention;

[0070] Figure 2 shows an input sample (left) and a comparison of the cell segmentation result mask with the ground-truth label mask for the Transformer-based unsupervised cell segmentation method of the present invention (right).

Detailed Implementation

[0071] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0072] An unsupervised cell segmentation method based on Transformer comprises three core modules: a multimodal text-image alignment module, a MatchMatrix (MMX) feature optimization module, and a SAM segmentation module. The method includes the following steps:

[0073] Step 1: Prepare the dataset, including three cell and tissue datasets using hematoxylin and eosin (H&E) and immunohistochemical staining (IHC);

[0074] Step 2: Perform data preprocessing, which includes preprocessing of images and text content. The main purpose is to enhance the features of cell images, increase data diversity, and obtain text word vectors.

[0075] Step 3: After preprocessing, perform multimodal alignment of image-text pairs. The initial data is passed through a Transformer encoder to obtain high-dimensional text and visual features. A projection layer projects the text and visual features onto a shared multimodal representation space. A contrastive loss function keeps unrelated images and texts far apart in the representation space and brings semantically related images and texts closer together.

[0076] Step 4: After data alignment, the feature vectors are mapped to the detection heads, which comprise a classification head and a regression head. For images, the classification head produces the confidence scores of cell categories, and the regression head performs linear regression of the initial target detection box vector. For text, the text embedding is obtained.

[0077] Step 5: Feed the regressed target detection locations into the Matching Matrix Feature Optimization (MMX) module. Extract local image features based on the detection boxes, mainly for areas with many cells. Use the visual Transformer encoder to obtain a new round of target locations. Combine these with the initial target detection boxes and apply the non-maximum suppression (NMS) algorithm to eliminate overlapping bounding boxes, obtaining the final target boundaries that are fed into the SAM model for segmentation.

[0078] Step 6: After obtaining the final target boundary, use it as the SAM prompt box and use SAM's segmentation capabilities to segment out the text-constrained cells.

[0079] The data preparation steps in step one are as follows:

[0080] 1.1: Prepare three publicly available cell datasets: the Kaggle dataset, the Kumar dataset, and the Lizard dataset. The three datasets contain 665, 226, and 30 stained images, respectively.

[0081] 1.2: For each dataset, the images are uniformly converted to PNG format to facilitate subsequent image feature extraction.

[0082] 1.3: The original annotations, provided in RLE, MAT, and CSV formats, are converted to a uniform CSV format.

[0083] The data preprocessing method in step two is as follows:

[0084] 2.1: The image is scaled to a uniform size of 1024×1024, and the weighted average of the three RGB channels is used to obtain a single-channel grayscale image. This removes color redundancy, improves robustness, and reduces computational complexity.

[0085] 2.2: Gaussian filtering is applied to the grayscale image to remove image noise and enhance the edge features of the cell image. Then, normalization is applied to scale the pixel values of the filtered image to the standard range [0,1] to improve model efficiency.

[0086] 2.3: For the stained Kaggle (H&E), Kumar (H&E), and Lizard (IHC) images, text describing cell shape and color is obtained by close observation and machine recognition; this text is then tokenized and converted into word vectors.

[0087] The multimodal alignment method in step three is as follows:

[0088] Dataset $D_N$ contains $n$ images $I_{N,n}$. Each image is divided into $M \times M$ patches of fixed size $P \times P$, $\mathrm{Patch}_{N,n,i} \in \mathbb{R}^{M \times M \times d}$ (where $d$ is the dimension of the image feature vector). Each patch is flattened into a one-dimensional vector, standardized, and then a corresponding position encoding $\mathrm{PositionEncoding}_i$ is added to preserve the spatial information of the image. These image patches are grouped into batch data $P_{N,n}$ for easier processing by the Transformer. The vector of the $i$-th image patch is represented as $P_{N,n,i}$.

[0089] $P_{N,n,i} = \mathrm{Flatten}(\mathrm{Patch}_{N,n,i}) + \mathrm{PositionEncoding}_i$

[0090] $P_{N,n} = \{P_{N,n,i} \mid i \in \{1, 2, \dots, M \times M\}\}$

[0091] The text $E_{N,n}$ corresponding to the image is embedded as word vectors, and corresponding positional encodings are added to the word vectors. The embedding of the $j$-th word in the $n$-th text is $E_{N,n,j} \in \mathbb{R}^{l \times d}$ (where $l$ is the number of words in the $n$-th text, and $d$ is the dimension of the feature vector). Specifically, $E_{N,n,j}$ is expressed as follows:

[0092] $E_{N,n,j} = \mathrm{WordEmbedding}(\mathrm{Word}_{N,n,j}) + \mathrm{PositionEncoding}_j$

[0093] The image patch vectors and word vectors are used to generate the image feature sequence $I_{n,M \times M}$ and the text feature sequence $T_{n,l}$.

[0094] $I_{n,M \times M} = [P_{N,n,1}, P_{N,n,2}, \dots, P_{N,n,M \times M}]$

[0095] $T_{n,l} = [E_{N,n,1}, E_{N,n,2}, \dots, E_{N,n,l}]$

[0096] This method mainly focuses on extracting and optimizing specific features that need to be processed in each image.

[0097] The method for obtaining the initial coordinate frame and its confidence score in step four is as follows:

[0098] After the input image and text feature sequences are processed by the self-attention and feedforward layers of the Transformer encoder, the image feature output $F_{N,n,i}$ and the text feature output $\mathrm{Text}_{N,n,i}$ are obtained. The feature $F_{N,n,i}$ is mapped to a detection head, which includes a classification head and a regression head. The classification head predicts the target's category and generates a confidence score $C_{N,n,i}$; the regression head predicts the bounding box $B_{N,n,i}$ of the target.

[0099] $F_{N,n,i} = \mathrm{ViT}(I_{n,M \times M})$

[0100] $\mathrm{Text}_{N,n,i} = \mathrm{TTE}(T_{n,l})$

[0101] $C_{N,n,i} = \mathrm{softmax}(F_{N,n,i} \cdot W_c)$, whose $k$-th component is $\dfrac{\exp\big((F_{N,n,i} \cdot W_c)_k\big)}{\sum_{m=1}^{c} \exp\big((F_{N,n,i} \cdot W_c)_m\big)}$

[0102] $c$ is the number of categories for each image patch, $W_c$ is the weight matrix of the classification branch, and $(F_{N,n,i} \cdot W_c)_k$ is the $k$-th element of the category score vector $F_{N,n,i} \cdot W_c$, i.e. the score of the $k$-th category.

[0103] The bounding box parameters $BP$ (center coordinates $x, y$ and width and height $w, h$) are predicted for each image patch, and the output of the regression branch, $B_{N,n,i}$, is mapped back to the coordinate system of the original image.

[0104] $BP = F_{N,n,i} \cdot W_R$

[0105]

[0106] $W_R$ is the weight matrix of the regression branch.

[0107] The Matching Matrix Feature Optimization (MMX) module method in step five is as follows:

[0108] The MMX module uses the bounding boxes $B_{N,n}$ as the initial mapping to crop the local image features $I^{\mathrm{Tailor}}_{N,n,i}$.

[0109]

[0110] Here $i$ denotes the $i$-th bounding box in the bounding box set, and $B_{N,n}$ is the set of bounding boxes output by the regression branch.

[0111] The cropped local image features are resized and then padded to obtain the local image $R_{N,n,i}$. It is then unfolded into $W \times W$ patches of size $P \times P$ (16×16 pixel patches are used here); each patch is flattened into a one-dimensional patch vector $P^{(t)}_{N,n,i,j}$ and a position encoding is added.

[0112] $R_{N,n,i} = \mathrm{Pad}(\mathrm{Resize}(I^{\mathrm{Tailor}}_{N,n,i}))$

[0113] $\mathrm{Patch}_{N,n,i,j} = \mathrm{Patchify}(R_{N,n,i})$

[0114] $P^{(t)}_{N,n,i,j} = \mathrm{Flatten}(\mathrm{Patch}_{N,n,i,j}) + \mathrm{PositionEncoding}_j$

[0115] $P^{(t)}_{N,n,i} = \{P^{(t)}_{N,n,i,j} \mid j \in \{1, 2, \dots, W \times W\}\}$

[0116] $\mathrm{Patch}_{N,n,i,j}$ is the image patch vector of the $j$-th patch, $P^{(t)}_{N,n,i}$ is the set of flattened patch vectors of the image, and $D_{\mathrm{Patch}}$ is the vector dimension after the image patch is flattened.

[0117] $P^{(t)}_{N,n,i,j}$ is linearly projected into a high-dimensional feature vector in a fixed feature space, which is used as the input feature of the Transformer to obtain the second-round output features $F^{(t)}_{N,n,i}$.

[0118] $\mathrm{Proj}_{N,n,i} = W_p \cdot P^{(t)}_{N,n,i} + b_p$

[0119] $F^{(t)}_{N,n,i} = \mathrm{ViT}(\mathrm{Proj}_{N,n,i})$

[0120] $\mathrm{Proj}_{N,n,i}$ is the high-dimensional feature vector after projection, $W_p$ is the weight matrix of the linear projection, $b_p$ is the bias vector, and $D$ is the dimension after projection.

[0121] $F^{(t)}_{N,n,i}$ is mapped to the detection head to yield the corresponding category score $C^{(t)}_{N,n,i}$ and bounding box parameters $B^{(t)}_{N,n,i}$.

[0122] $C^{(t)}_{N,n,i} = \mathrm{softmax}(F^{(t)}_{N,n,i} \cdot W_c)$, $\quad B^{(t)}_{N,n,i} = F^{(t)}_{N,n,i} \cdot W_R$

[0123] $c$ is the number of categories for each image patch, $W_c$ is the weight matrix of the classification branch, $W_R$ is the weight matrix of the regression branch, and $(F^{(t)}_{N,n,i} \cdot W_c)_k$ is the $k$-th element of the category score vector $F^{(t)}_{N,n,i} \cdot W_c$, i.e. the score of the $k$-th category.

[0124] The SAM segmentation method in step six:

[0125] By using prompts in image segmentation tasks, SAM borrows prompting strategies from natural language processing (NLP) to rapidly segment arbitrary targets and to adapt to different downstream tasks.

[0126] The preprocessed image is passed through a convolutional base that divides it into fixed-size patches. After standardization, each patch is flattened into a one-dimensional vector and a corresponding position encoding ($\mathrm{PositionEncoding}_i$) is added, yielding the vector set $P_{N,n}$ defined above. The feature vectors are then input into the Transformer to obtain the high-dimensional feature output $F^{\mathrm{SAM}}_{N,n,i}$.

[0127] $P_{N,n} = \{P_{N,n,i} \mid i \in \{1, 2, \dots, M \times M\}\}$

[0128] $F^{\mathrm{SAM}}_{N,n,i} = \mathrm{ViT}(P_{N,n,i})$

[0129] The Prompt Encoder converts each bounding box into a feature vector $BB_{N,n,j}$.

[0130] $BB_{N,n,j} = \mathrm{PromptEncoder}(B^{(t)}_{N,n,i})$

[0131] By combining the image embedding and the prompt embedding, the Mask Decoder generates a segmentation mask $M_{N,n,j}$.

[0132] $M_{N,n,j} = \mathrm{MaskDecoder}(F^{\mathrm{SAM}}_{N,n,i}, BB_{N,n,j})$

[0133] Because the generated mask contains connectivity values, it can be further refined to obtain a more accurate prediction mask.

[0134] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. For those skilled in the art, various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention, and these variations still fall within the protection scope of the present invention.

Claims

1. An unsupervised cell segmentation method based on Transformer, characterized in that it includes the following steps:
Step 1: Prepare the dataset, including three cell and tissue datasets stained with hematoxylin and eosin (H&E) and by immunohistochemistry (IHC);
Step 2: Perform data preprocessing, including preprocessing of image and text content;
Step 3: After preprocessing, perform multimodal alignment of image-text pairs. The initial data is passed through a Transformer encoder to obtain high-dimensional text and visual features. A projection layer projects the text and visual features onto a shared multimodal representation space. A contrastive loss function keeps unrelated images and texts far apart in the representation space and brings semantically related images and texts closer together;
Step 4: After data alignment, the feature vectors are mapped to the detection heads, which comprise a classification head and a regression head. For images, the classification head produces the confidence scores of cell categories, and the regression head performs linear regression of the initial target detection box vector. For text, the text embedding is obtained;
Step 5: Feed the regressed target detection positions into the matching matrix feature optimization module, extract local image features based on the detection boxes, use the visual Transformer encoder to obtain a new round of target positions, combine these with the initial target detection boxes, apply the non-maximum suppression algorithm to eliminate overlapping bounding boxes, and obtain the final target boundary sent to the SAM model for segmentation;
Step 6: Use the final target boundary as the prompt box of the SAM model, and use the SAM model to segment out the text-constrained cells.

2. The Transformer-based unsupervised cell segmentation method according to claim 1, characterized in that: The dataset preparation steps in step one are as follows: 1.1: Prepare three publicly available cell datasets: the Kaggle dataset, the Kumar dataset, and the Lizard dataset; 1.2: Convert all images in each dataset to PNG format; 1.3: Output the labels corresponding to the original images as CSV files.

3. The Transformer-based unsupervised cell segmentation method according to claim 1, characterized in that: The data preprocessing method in step two is as follows: 2.1: Scale the image to a uniform size of 1024×1024 pixels, and perform a weighted average of the three RGB channels to obtain a single-channel grayscale image; 2.2: Gaussian filtering is applied to the grayscale image to remove image noise and enhance the edge features of the cell image. Then, normalization is applied to scale the pixel values of the filtered image to the standard range [0, 1] to improve model efficiency; 2.3: For the stained Kaggle, Kumar, and Lizard datasets, text describing cell shape and color is obtained by observation and machine recognition; this text is then tokenized and converted into word vectors.

4. The Transformer-based unsupervised cell segmentation method according to claim 1, characterized in that: The multimodal alignment method in step three is as follows: Dataset $D_N$ contains $n$ images $I_{N,n}$. Each image is divided into $M \times M$ patches of fixed size $P \times P$, $\mathrm{Patch}_{N,n,i} \in \mathbb{R}^{M \times M \times d}$, where $d$ is the dimension of the image feature vector. Each patch is flattened into a one-dimensional vector, standardized, and then the corresponding positional encoding $\mathrm{PositionEncoding}_i$ is added to preserve the spatial information of the image. For processing by the Transformer, the image patches are combined into batch data $P_{N,n}$, and the vector of the $i$-th image patch is represented as $P_{N,n,i}$: $P_{N,n,i} = \mathrm{Flatten}(\mathrm{Patch}_{N,n,i}) + \mathrm{PositionEncoding}_i$; $P_{N,n} = \{P_{N,n,i} \mid i \in \{1, 2, \dots, M \times M\}\}$; The text $E_{N,n}$ corresponding to the image is embedded as word vectors, with corresponding positional encodings added to the word vectors. The embedding of the $j$-th word in the $n$-th text is $E_{N,n,j} \in \mathbb{R}^{l \times d}$, where $l$ is the number of words in the $n$-th text and $d$ is the dimension of the feature vector. Specifically, it is expressed as follows: $E_{N,n,j} = \mathrm{WordEmbedding}(\mathrm{Word}_{N,n,j}) + \mathrm{PositionEncoding}_j$; The image patch vectors and word vectors are used to generate the image feature sequence $I_{n,M \times M}$ and the text feature sequence $T_{n,l}$: $I_{n,M \times M} = [P_{N,n,1}, P_{N,n,2}, \dots, P_{N,n,M \times M}]$; $T_{n,l} = [E_{N,n,1}, E_{N,n,2}, \dots, E_{N,n,l}]$.

5. The Transformer-based unsupervised cell segmentation method according to claim 1, characterized in that: Step four specifically includes: The input image and text feature sequences are processed by the self-attention and feedforward layers of the Transformer encoder to obtain the image feature output $F_{N,n,i}$ and the text feature output $\mathrm{Text}_{N,n,i}$. The feature $F_{N,n,i}$ is mapped to a detection head, which includes a classification head and a regression head. The classification head predicts the target's category and generates a confidence score $C_{N,n,i}$; the regression head predicts the bounding box $B_{N,n,i}$ of the target: $F_{N,n,i} = \mathrm{ViT}(I_{n,M \times M})$; $\mathrm{Text}_{N,n,i} = \mathrm{TTE}(T_{n,l})$; $C_{N,n,i} = \mathrm{softmax}(F_{N,n,i} \cdot W_c)$; $c$ is the number of categories for each image patch, $W_c$ is the weight matrix of the classification branch, and $(F_{N,n,i} \cdot W_c)_k$ is the $k$-th element of the category score vector $F_{N,n,i} \cdot W_c$, i.e. the score of the $k$-th category. The bounding box parameters $BP$ are predicted for each image patch, and the output of the regression branch, $B_{N,n,i}$, is mapped back to the coordinate system of the original image: $BP = F_{N,n,i} \cdot W_R$; =( , ); $W_R$ is the weight matrix of the regression branch.

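6. The Transformer-based unsupervised cell segmentation method according to claim 1, characterized in that: The matching matrix feature optimization module method in step five is as follows: The MMX module uses the bounding boxes $B_{N,n}$ as the initial mapping to crop the local image features $I^{\mathrm{Tailor}}_{N,n,i}$, where $i$ denotes the $i$-th bounding box in the bounding box set and $B_{N,n}$ is the set of bounding boxes output by the regression branch. The cropped local image features are resized and then padded to obtain the local image $R_{N,n,i}$, which is unfolded into $W \times W$ patches of size $P \times P$; each patch is flattened into a one-dimensional patch vector $P^{(t)}_{N,n,i,j}$ and a positional encoding is added: $R_{N,n,i} = \mathrm{Pad}(\mathrm{Resize}(I^{\mathrm{Tailor}}_{N,n,i}))$; $\mathrm{Patch}_{N,n,i,j} = \mathrm{Patchify}(R_{N,n,i})$; $P^{(t)}_{N,n,i,j} = \mathrm{Flatten}(\mathrm{Patch}_{N,n,i,j}) + \mathrm{PositionEncoding}_j$; $P^{(t)}_{N,n,i} = \{P^{(t)}_{N,n,i,j} \mid j \in \{1, 2, \dots, W \times W\}\}$; where $\mathrm{Patch}_{N,n,i,j}$ is the image patch vector of the $j$-th patch, $P^{(t)}_{N,n,i}$ is the set of flattened patch vectors of the image, and $D_{\mathrm{Patch}}$ is the vector dimension after the image patch is flattened. $P^{(t)}_{N,n,i,j}$ is linearly projected into a high-dimensional feature vector in a fixed feature space, which is used as the input feature of the Transformer to obtain the second-round output features $F^{(t)}_{N,n,i}$: $\mathrm{Proj}_{N,n,i} = W_p \cdot P^{(t)}_{N,n,i} + b_p$; $F^{(t)}_{N,n,i} = \mathrm{ViT}(\mathrm{Proj}_{N,n,i})$; $\mathrm{Proj}_{N,n,i}$ is the high-dimensional feature vector after projection, $W_p$ is the weight matrix of the linear projection, $b_p$ is the bias vector, and $D$ is the dimension after projection. $F^{(t)}_{N,n,i}$ is mapped to the detection head to yield the corresponding category score $C^{(t)}_{N,n,i}$ and bounding box parameters $B^{(t)}_{N,n,i}$: $C^{(t)}_{N,n,i} = \mathrm{softmax}(F^{(t)}_{N,n,i} \cdot W_c)$; $B^{(t)}_{N,n,i} = F^{(t)}_{N,n,i} \cdot W_R$ =( , ); $c$ is the number of categories for each image patch, $W_c$ is the weight matrix of the classification branch, $W_R$ is the weight matrix of the regression branch, and $(F^{(t)}_{N,n,i} \cdot W_c)_k$ is the $k$-th element of the category score vector $F^{(t)}_{N,n,i} \cdot W_c$, i.e. the score of the $k$-th category.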

7. The Transformer-based unsupervised cell segmentation method according to claim 1, characterized in that: The SAM model segmentation method in step six specifically includes: By using prompts in image segmentation tasks, the SAM model borrows prompting strategies from the natural language processing domain to rapidly segment any target and to adapt to different downstream tasks. The preprocessed image is passed through a convolutional base that divides it into fixed-size patches. After standardization, each patch is flattened into a one-dimensional vector and its corresponding position encoding is added, yielding the vector set $P_{N,n}$. The feature vectors are input into the Transformer to obtain the high-dimensional feature output $F^{\mathrm{SAM}}_{N,n,i}$: $P_{N,n} = \{P_{N,n,i} \mid i \in \{1, 2, \dots, M \times M\}\}$; $F^{\mathrm{SAM}}_{N,n,i} = \mathrm{ViT}(P_{N,n,i})$; The Prompt Encoder converts each bounding box into a feature vector $BB_{N,n,j}$: $BB_{N,n,j} = \mathrm{PromptEncoder}(B^{(t)}_{N,n,i})$; By combining the image embedding and the prompt embedding, the Mask Decoder generates a segmentation mask $M_{N,n,j}$: $M_{N,n,j} = \mathrm{MaskDecoder}(F^{\mathrm{SAM}}_{N,n,i}, BB_{N,n,j})$.