A thyroid cell sample grading method based on multi-modal fusion

This multimodal fusion-based thyroid cell sample grading method uses the YOLOv9 and Chinese CLIP models to identify and match suspicious regions in thyroid cell samples, and an XGBoost model to grade the samples. It addresses the limited accuracy and high subjectivity of existing thyroid cell pathological diagnosis, achieving more efficient and accurate pathological grading.

CN120953987B · Active · Publication Date: 2026-06-19 · WUHAN LANDING INTELLIGENCE MEDICAL CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: WUHAN LANDING INTELLIGENCE MEDICAL CO LTD
Filing Date: 2025-07-31
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In current thyroid cell pathology diagnosis, the detection accuracy for suspicious areas is low and multi-category classification relies on human experience, which makes it difficult to quickly and accurately distinguish benign lesions from malignant ones such as papillary carcinoma and medullary carcinoma. Furthermore, the grading results are highly subjective and lack multi-dimensional correlation analysis.

Method used

A multimodal fusion approach was adopted, using the YOLOv9 target detection model to perform multi-scale scanning to identify suspicious areas of thyroid cells. The Chinese CLIP model was used to extract image features and match them with Chinese pathological text labels. The XGBoost model was combined to classify the samples, and multi-dimensional features were integrated for quantitative evaluation.

Benefits of technology

It improves the accuracy of thyroid cell sample grading, reduces the need for repeated punctures, decreases reliance on human experience, and provides more objective pathological diagnostic support.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention provides a thyroid cell sample grading method based on multimodal fusion, relating to the fields of medical image processing and artificial intelligence-assisted diagnosis. The method comprises the following steps: S10, acquiring and preprocessing annotated panoramic images of thyroid cell samples; S20, using a trained YOLOv9 object detection model to perform multi-scale scanning on the preprocessed panoramic images of thyroid cells; identifying 34 types of suspicious thyroid cell regions based on the multi-scale scanning and outputting these 34 types of suspicious thyroid cell regions; S30, using a trained Chinese CLIP image retrieval model to extract image features from the 34 types of suspicious thyroid cell regions output in step S20. Through the coordination of the above structures, compared with existing methods, it has the following advantages: First, it improves the recognition accuracy of suspicious thyroid cell regions by leveraging the multi-scale detection capability of YOLOv9; second, it achieves accurate matching of image features with Chinese pathological text labels using Chinese CLIP; and third, it integrates multi-dimensional features through the XGBoost model for grading determination.

Description

Technical Field

[0001] This invention relates to the fields of medical image processing and artificial intelligence-assisted diagnosis, and in particular to a thyroid cell sample grading method based on multimodal fusion.

Background Technology

[0002] In the pathological diagnosis of thyroid cells, the detection accuracy of suspicious areas is low, and the classification of multiple categories relies on human experience, making it difficult to quickly and accurately distinguish between benign and malignant lesions such as papillary carcinoma and medullary carcinoma.

[0003] Existing thyroid cell grading methods rely on single morphological or textual descriptions and lack multi-dimensional correlation analysis of lesion features. This results in highly subjective qualitative results at the final sample level, insufficient ability to identify complex morphologies, poor grading consistency, and difficulty in accurately distinguishing between papillary carcinoma and follicular tumors.

Summary of the Invention

[0004] To address the shortcomings of the existing technologies, the technical problem to be solved by this invention is to provide a thyroid cell sample grading method based on multimodal fusion, which can achieve accurate localization and quantitative correlation analysis of suspicious areas through multi-scale detection and cross-modal semantic fusion, thereby improving grading accuracy and reducing unnecessary repeated puncture operations.

[0005] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows: The present invention provides a thyroid cell sample grading method based on multimodal fusion, comprising the following steps:

[0006] S10. Obtain and preprocess a panoramic view of the labeled thyroid cell sample;

[0007] S20. Use the trained YOLOv9 object detection model to perform multi-scale scanning on the panoramic thyroid cell image preprocessed in step S10.

[0008] Based on the multi-scale scanning, identify 34 types of suspicious thyroid cell regions and output them.

[0009] S30. Use the trained Chinese CLIP image retrieval model to extract the image features of the 34 types of suspicious thyroid cell regions output in step S20.

[0010] The extracted image features are semantically matched with 34 preset categories of Chinese pathological text tags to generate fine-grained feature vectors.

[0011] Based on fine-grained feature vectors, semantic associations between image features and text tags are established to obtain retrieval results;

[0012] S40. Based on the retrieval results obtained in step S30 for each thyroid cell sample, the XGBoost model is used to determine the sample grading, and a ratio is calculated according to preset rules to determine whether the sample is positive or negative and how severe it is.

[0013] In a preferred embodiment, in step S10, each type of suspicious region includes bounding box coordinates and a category count; the suspicious regions include one or more of the following: papillary arrangement, psammoma body, follicular cluster, or multinucleated giant cell.

[0014] In a preferred embodiment, step S40 further includes the following step: counting the number of categories for each of the 34 suspicious regions in the thyroid cell sample;

[0015] The above statistical results are input into the trained XGBoost model, and the sample classification result is determined based on the number distribution of 34 suspicious regions in the sample. The sample classification result is one of benign, papillary carcinoma, medullary carcinoma, AUS or follicular tumor.

[0016] In a preferred embodiment, let the number of suspicious regions of type $i$ be $n_i$ ($i = 1, 2, \dots, 34$). The total number of suspicious regions $N$ is then calculated as: $N = \sum_{i=1}^{34} n_i$;

[0017] where $n_i$ represents the number of suspicious regions of type $i$, and $N$ represents the total number of all suspicious regions in the sample.

[0018] In a preferred embodiment, step S40 further includes the following step: normalizing the counts of the 34 types of suspicious regions to construct the input feature vector of the XGBoost model, $X = [x_1, x_2, \dots, x_{34}]$.

[0019] The normalized value $x_i$ of the $i$-th type of suspicious region is calculated as: $x_i = \dfrac{n_i}{N + \epsilon}$;

[0020] where $\epsilon$ is a very small positive number, used to avoid a zero denominator when $N$ is zero, and $x_i$ represents the proportion of suspicious regions of type $i$ in the total number.

[0021] In a preferred embodiment, the constructed feature vector $X$ is input into the trained XGBoost model to obtain the preliminary grading probability distribution of the sample, $P = [p_1, p_2, \dots, p_5]$;

[0022] where $p_j$ represents the probability that the sample belongs to the $j$-th class of grading result;

[0023] Based on XGBoost, the Softmax function is used to convert the predicted score into a probability, with the following formula:

[0024] $p_j = \dfrac{e^{z_j}}{\sum_{k=1}^{5} e^{z_k}}$;

[0025] where $z_j$ denotes the predicted score for class $j$.

[0026] The final grading result is determined using the argmax function. The formula is:

[0027] $\hat{y} = \arg\max_{j} \, p_j$;

[0028] where $p_j$ is the probability that the sample belongs to the $j$-th class, and $\hat{y}$ is the final grading result output by the model.

[0029] In a preferred embodiment, the values of $\hat{y}$ correspond, in order, to benign, papillary carcinoma, medullary carcinoma, AUS, and follicular tumor.
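As an illustration only, the score-to-probability conversion and the argmax selection described in [0023] to [0029] can be sketched in Python. The function names, the class-name list, and the example scores are assumptions made for this sketch and do not appear in the patent; only the order of the five classes is taken from the text.

```python
import math

# Class order as listed in the text; the list itself is an illustrative helper.
CLASS_NAMES = [
    "benign",
    "papillary carcinoma",
    "medullary carcinoma",
    "AUS",
    "follicular tumor",
]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    peak = max(scores)
    exps = [math.exp(z - peak) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grade(scores):
    """Return (class name, probability) for the most probable class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs[best]


# Hypothetical scores for one sample, not taken from the patent.
example_scores = [0.2, 2.1, 0.4, -0.3, 1.0]
label, prob = grade(example_scores)
```

Subtracting the largest score before exponentiation is a common numerical safeguard and does not alter the resulting probabilities.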

[0030] In a preferred embodiment, the total number $M$ of regions belonging to malignant lesion categories among the 34 types of suspicious regions is calculated. Based on the feature vector $X$, define the index set $\mathcal{I}_{\mathrm{mal}}$ corresponding to the malignant categories. The formula is:

[0031] $M = \sum_{i \in \mathcal{I}_{\mathrm{mal}}} n_i$;

[0032] Calculate the malignant proportion $R$. The formula is: $R = \dfrac{M}{N}$;

[0033] Severity is graded according to the malignant proportion $R$:

[0034] if $R < 0.3$, the sample is determined to be relatively severe;

[0035] if $0.3 \le R < 0.5$, the sample is determined to be severe;

[0036] if $R \ge 0.5$, the sample is determined to be extremely severe;

[0037] where $n_i$ is the number of suspicious regions of type $i$, $N$ is the total number of suspicious regions, $M$ represents the total count for the malignancy-associated categories, and $R$ represents the proportion of malignancy-associated regions in the total number of regions.
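A minimal Python sketch of the malignant proportion and the three severity levels follows. It assumes the 0.3 and 0.5 thresholds given for the ratio rule in Example 1. The malignant index set, the function names, and the label strings are hypothetical, since the text does not list which of the 34 categories count as malignant.

```python
def malignant_ratio(counts, malignant_indices):
    """Share of suspicious regions that fall in malignant categories.

    counts: list of 34 per-category region counts.
    malignant_indices: hypothetical set of category indices treated as malignant.
    """
    total = sum(counts)
    if total == 0:
        # No suspicious regions at all: report a zero proportion (an assumption).
        return 0.0
    malignant = sum(counts[i] for i in malignant_indices)
    return malignant / total


def severity(ratio):
    """Map a proportion to a severity label using the 0.3 / 0.5 thresholds."""
    if ratio < 0.3:
        return "relatively severe"
    if ratio < 0.5:
        return "severe"
    return "extremely severe"


# Toy sample: 6 regions in category 0 and 4 regions in category 1.
example_counts = [0] * 34
example_counts[0] = 6
example_counts[1] = 4
example_ratio = malignant_ratio(example_counts, {1})
example_level = severity(example_ratio)
```

Whether the rule is applied inside the classifier or as a post-processing step is left open by the text; the sketch treats it as post-processing.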

[0038] In a preferred embodiment, the present invention provides a computer non-transitory readable storage medium storing a computer program / instruction thereon, characterized in that the computer program / instruction, when executed by a processor, implements the steps of the thyroid cell sample grading method based on multimodal fusion described above.

[0039] In a preferred embodiment, the present invention further provides a computer program product, including a computer program / instructions, characterized in that, when the computer program / instructions are executed by one or more processors, they implement the steps of the thyroid cell sample grading method based on multimodal fusion described above.

[0040] This invention provides a thyroid cell sample grading method based on multimodal fusion. Through the cooperation of the steps described above, the method has the following advantages over existing methods:

[0041] First, by leveraging the multi-scale detection capabilities of YOLOv9, the accuracy of identifying suspicious areas of thyroid cells can be improved, reducing missed and false detections.

[0042] Second, Chinese CLIP is used to achieve accurate matching between image features and Chinese pathological text labels, reducing the reliance on human experience in multi-category classification and improving classification consistency.

[0043] Third, the XGBoost model integrates multi-dimensional features for grading and combines the proportion of lesion features to achieve a quantitative assessment of severity, effectively reducing the subjectivity of grading results and providing more reliable auxiliary support for thyroid cell pathology diagnosis.

Attached Figure Description

[0044] The present invention will be further described below with reference to the accompanying drawings and embodiments:

[0045] Figure 1 is a diagram of the overall process of the present invention;

[0046] Figure 2 is a flowchart illustrating step S40 of the present invention;

[0047] Figure 3 is a top-view panoramic image of a thyroid cell sample of the present invention;

[0048] Figure 4 is an image of a follicular cluster of the present invention;

[0049] Figure 5 is an image of a multinucleated giant cell of the present invention;

[0050] Figure 6 is an image of a papillary arrangement of the present invention;

[0051] Figure 7 is an image of a psammoma body of the present invention;

[0052] Figure 8 is a schematic diagram of the structure of the computer device of the present invention.

Detailed Implementation

[0053] To aid understanding of the purpose, system architecture, and functional implementation of this embodiment, note that the embodiments of this application and the features in those embodiments can be combined with each other where no conflict arises. The exemplary embodiments disclosed in this application are described below with reference to the accompanying drawings. The specific technical details are included to aid understanding and should be considered exemplary rather than restrictive. Those skilled in the art should therefore understand that various improvements and adjustments can be made to the embodiments described herein without departing from the scope and core ideas of the invention. Likewise, for clarity, detailed descriptions of well-known technologies, functions, and structures (such as standard image processing algorithms and common communication protocols) are omitted from the following description.

[0054] In the field of thyroid cytopathology diagnostic technology, thyroid cytopathology diagnosis is an important means of clinical disease screening. Traditional diagnostic methods mainly rely on pathologists to manually interpret the morphological characteristics of cell samples, classifying them by observing morphological features such as nuclear atypia, nuclear grooves, and psammoma bodies. However, such methods have significant limitations: on the one hand, manual marking of suspicious areas is easily affected by subjective experience and observational errors, resulting in a false negative rate of up to 23% and a false positive rate of approximately 17% for tiny lesions (<0.5 mm); on the other hand, the accuracy rate for distinguishing similar lesions such as papillary thyroid carcinoma (PTC) and follicular neoplasm (FN) is less than 85%, and the error rate for judging intermediate states such as atypical lesions (AUS) is as high as 30%. In addition, existing technologies are mostly based on single features (such as morphology or text description) for grading, lacking multi-dimensional correlation analysis of 34 key suspicious area features, resulting in highly subjective grading results that are difficult to meet the clinical demand for rapid and accurate diagnosis.

[0055] With the development of artificial intelligence technology, thyroid pathology analysis has gradually expanded from single image recognition to automated grading combining multiple features. An ideal assisted diagnostic system should be able to integrate image features and semantic information to achieve objective and accurate grading of complex lesions (such as distinguishing between papillary carcinoma and follicular tumors). In high-level automated grading tasks of thyroid cell samples, the system typically needs to correlate and analyze multiple types of lesion features (such as the morphology of suspicious areas in the image, the distribution of multiple categories of pathological features, and the statistical ratios of key pathological indicators) to achieve reliable sample-level qualitative analysis. However, existing automated methods usually focus only on single-modality feature extraction (such as relying solely on morphological features or solely on text label classification), making it difficult to effectively integrate and correlate multi-dimensional information from image detection results and semantic classification statistics.

[0056] When implementing thyroid cell sample grading, related technologies have limitations in the analysis dimensions of lesion features: in terms of feature source dimension, they are usually limited to directly extracting morphological features from the original image or relying on limited text information labeled manually; in terms of feature association dimension, there is a lack of a mechanism to effectively integrate and analyze fine-grained suspicious region detection results, multi-category semantic classification statistics, and key pathological index ratios.

[0057] Therefore, the thyroid cell sample grading methods in related technologies struggle to overcome insufficient detection accuracy, reliance on experience for multi-category classification, and highly subjective final grading results, and so cannot meet the need for high-precision, high-efficiency, and objective thyroid pathology diagnosis.

[0058] Example 1

[0059] To address the aforementioned problems, Example 1 proposes a multimodal fusion method for grading thyroid cell samples. Through a three-level processing architecture of "detection → matching → grading," it provides intelligent diagnosis across the entire process, from detection of suspicious areas to pathological grading. The specific steps are as follows:

[0060] S10. Dataset Collection and Annotation: Obtain thyroid cell sample image datasets annotated by professional pathologists from medical institutions.

[0061] The samples must cover a variety of pathological states, including benign, papillary carcinoma, medullary carcinoma, atypical lesions of indeterminate significance (AUS), and follicular tumors. Additionally, each sample must have suspicious thyroid cell regions (34 categories in total) labeled in the images.

[0062] Approximately 4,100 thyroid cell sample images were collected and divided into 90% samples (approximately 3,690) for model training and 10% samples (approximately 410) for model testing and validation.
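The 90% / 10% division can be reproduced with a seeded shuffle. The sketch below is an assumption about how such a split might be done; the patent does not state the splitting procedure, and `split_dataset`, the seed, and the integer sample IDs are illustrative.

```python
import random


def split_dataset(sample_ids, train_fraction=0.9, seed=0):
    """Shuffle sample IDs reproducibly and split them into train and test lists."""
    ids = list(sample_ids)
    # A dedicated Random instance keeps the split reproducible and side-effect free.
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# About 4,100 images, identified here by hypothetical integer IDs.
train_ids, test_ids = split_dataset(range(4100))
```

With 4,100 samples this yields 3,690 training IDs and 410 test IDs, matching the approximate figures in the text.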

[0063] S20. Data preprocessing: The input thyroid cell sample images are resized, cropped, and standardized to adapt to the model input requirements and improve the model robustness.

[0064] Specifically, size adjustment: unify the image scale to fit the size of the subsequent Yolov9 model input, and eliminate detection bias caused by differences in image size;

[0065] Center cropping: Preserve the core area of the image (including the main thyroid cell structure), remove redundant background at the edges, and reduce interference from irrelevant information;

[0066] Standardization: Normalize the pixel values of an image (e.g., map pixel values to the [0,1] range) to eliminate pixel value fluctuations caused by differences in lighting and devices;

[0067] The diagnostic results (benign / malignant type) and suspicious area category labeling information provided by professional pathologists are converted into the digital label format required for model training.
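The three image operations above (size adjustment, center cropping, and mapping pixel values to [0,1]) can be sketched in plain Python on a grayscale image stored as a list of pixel rows. This is a simplified illustration rather than the patented implementation: the nearest-neighbor resize, the intermediate size of 256, and the function names are assumptions, and a production pipeline would normally use an imaging library.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a list-of-rows image with nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def center_crop(image, size):
    """Keep the central size x size window and drop the border."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def normalize(image, max_value=255):
    """Map integer pixel values to the [0, 1] range."""
    return [[p / max_value for p in row] for row in image]


def preprocess(image, resize_to=256, crop_to=224):
    """Resize, center-crop, then normalize. Both sizes are assumed values."""
    resized = resize_nearest(image, resize_to, resize_to)
    return normalize(center_crop(resized, crop_to))
```

For example, a 200 x 300 input passed to `preprocess` comes back as a 224 x 224 grid of floats between 0 and 1.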

[0068] S30, Detection of suspicious regions in thyroid cells: The YOLOv9 target detection algorithm is used. Its advantages lie in higher detection accuracy, faster speed, stronger cross-scale detection capability, and optimized network structure, which can effectively adapt to complex cell image scenarios.

[0069] The preprocessed sample images are used for binary classification object detection training. The detection target is the "suspicious region", where:

[0070] Benign suspicious areas are labeled 0; non-benign suspicious areas are labeled 1.

[0071] Training settings:

[0072] The loss function is the cross-entropy loss function;

[0073] The optimizer is stochastic gradient descent (SGD);

[0074] Training parameters: batch size = 32, learning rate = 0.001, training epochs = 200, weight decay = 0.001;

[0075] The output is a trained YOLOv9 detection model that can accurately locate and select all suspicious cell regions from the input sample image (distinguishing between benign and non-benign).

[0076] S40. Classification and matching of suspicious regions in thyroid cells: The Chinese_CLIP image-text retrieval algorithm is used. This algorithm integrates a visual base model (processing image modalities) and a large language model (processing text modalities), excels at learning the association between images and Chinese text descriptions, and can effectively solve multi-class image classification problems.

[0077] Based on the dataset from step S10, professional pathologists marked a total of approximately 200,000 specific suspicious cell regions (regions located by the detection model in step S30) in the sample images, and labeled each region with one of the Chinese pathological text labels for the 34 types of suspicious thyroid cell regions, such as "papillary arrangement" and "psammoma body".

[0078] The Chinese_CLIP model was used for training, with the goal of enabling the model to learn to accurately match detected suspicious region image patches to their corresponding 34 categories of Chinese pathological text labels.

[0079] Training settings:

[0080] The loss function is the cross-entropy loss function.

[0081] The optimizer is momentum gradient descent (SGD with Momentum).

[0082] Training parameters: batch size = 1024, learning rate = 0.00001, epochs = 30, weight decay = 0.001
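For reference, the two sets of training settings stated in this example (for the YOLOv9 detector and for the Chinese_CLIP matcher) can be collected as plain configuration records. The `TrainConfig` class and its field names are hypothetical and are not tied to any training framework; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed in the text."""
    model: str
    loss: str
    optimizer: str
    batch_size: int
    learning_rate: float
    epochs: int
    weight_decay: float


# Settings stated for suspicious-region detection.
YOLOV9_DETECTION = TrainConfig(
    model="YOLOv9", loss="cross-entropy", optimizer="SGD",
    batch_size=32, learning_rate=1e-3, epochs=200, weight_decay=1e-3,
)

# Settings stated for image-text matching.
CHINESE_CLIP_MATCHING = TrainConfig(
    model="Chinese_CLIP", loss="cross-entropy", optimizer="SGD with momentum",
    batch_size=1024, learning_rate=1e-5, epochs=30, weight_decay=1e-3,
)
```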

[0083] The output is the trained Chinese_CLIP matching model, which can output the specific category (i.e. Chinese pathological text label) of the 34 types that best match each detected suspicious region image patch.

[0084] S50, Qualitative grading of thyroid cell samples: The XGBoost gradient boosting decision tree algorithm is used. This ensemble learning algorithm combines multiple weak classifiers (decision trees) to progressively optimize the model, reduce errors, and significantly improve classification accuracy, and it is well suited to processing structured feature data.

[0085] According to feature engineering, the specific steps of multimodal fusion are as follows:

[0086] S51. Using the Chinese_CLIP matching model trained in step S40, classify all suspicious regions detected in step S30 in the entire sample image to obtain the specific type label (one of 34 categories) for each region.

[0087] S52. Count the number of each of the 34 suspicious region categories that appear in the sample image.

[0088] S53. Calculation of key pathological feature ratios: For specific categories with grading significance, such as "papillary arrangement" or "psammoma body", calculate their proportion (ratio) in the total number of all suspicious areas in the sample.

[0089] For example: Papillary arrangement proportion = (Number of papillary arrangement regions) / (Total number of suspicious regions)

[0090] Similarly, calculate the proportion of psammoma bodies.

[0091] Input features: The counts of the 34 types of suspicious regions obtained in step S52 and the proportions of key categories calculated in step S53 are combined to form a structured feature vector, which is used as the input to the XGBoost model.
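A rough sketch of how the structured feature vector might be assembled is given below: 34 per-category counts followed by the proportions of the key categories. The category indices assigned to the two key labels are placeholders, because the text does not say which index each label has, and `build_features` is a hypothetical name.

```python
from collections import Counter

NUM_CATEGORIES = 34

# Placeholder indices: the patent does not state the index of each label.
KEY_CATEGORIES = {"papillary arrangement": 0, "psammoma body": 1}


def build_features(region_labels):
    """Build [34 counts] + [key-category proportions] for one sample.

    region_labels: category index (0..33) of every suspicious region in the sample.
    """
    tally = Counter(region_labels)
    counts = [tally.get(i, 0) for i in range(NUM_CATEGORIES)]
    total = sum(counts)
    # Guard against samples with no suspicious regions.
    ratios = [counts[idx] / total if total else 0.0 for idx in KEY_CATEGORIES.values()]
    return counts + ratios


# Toy sample with 10 regions: 3 papillary arrangements and 1 psammoma body.
example_features = build_features([0, 0, 0, 1, 5, 5, 7, 7, 7, 7])
```

The resulting list has 36 entries in this sketch and could be passed as one row of the tabular input to a gradient-boosted classifier.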

[0092] Grading rule definition (based on ratio): In model training and final grading, the ratio of key features is used to assist in grading the severity of the disease. Integrating this rule into the model logic, then:

[0093] When the ratio is < 0.3: defined as the "relatively severe" level.

[0094] When 0.3 ≤ ratio < 0.5: defined as the "severe" level.

[0095] When the ratio is ≥ 0.5: defined as the "extremely severe" level.

[0096] It should be noted that the "ratio" here refers to the ratio of a specific key feature, such as the proportion of papillary arrangement. Which ratio to apply needs to be determined according to its pathological significance. The rule can be integrated into XGBoost decision-making or used as a post-processing rule.

[0097] Use XGBoost to train a classifier. The input is the structured feature vector mentioned above, which integrates multi-class quantitative statistics and key ratios. The output is the final pathological grade label of the thyroid cell sample (benign, papillary carcinoma, medullary carcinoma, AUS, or follicular tumor) and / or its severity level (relatively severe, severe, or extremely severe).

[0098] The output is a trained XGBoost grading model that can objectively and accurately classify the pathological type and assess the severity of thyroid cell samples based on the global statistical features and key pathological feature ratios of various suspicious regions within the sample.

[0099] Example 2

[0100] The following describes a specific application of the thyroid cell sample grading method based on multimodal fusion provided in Example 1.

[0101] As shown in Figures 1 and 2, the thyroid cell sample grading method based on multimodal fusion provided in this application includes the following steps:

[0102] Step S10: Acquisition and preprocessing of panoramic images of thyroid cell samples, the specific steps are as follows:

[0103] S11: Obtain the raw sample: Extract labeled panoramic images of thyroid cell samples from medical institution databases or storage systems. These images typically contain a large number of cellular structures and may be accompanied by problems such as noise, uneven lighting, or inconsistent sizes.

[0104] S12: Perform preprocessing: Perform preprocessing operations on the acquired panoramic image of the thyroid cell sample. Preprocessing includes image resizing, center cropping, and standardization to eliminate noise and enhance key features for subsequent analysis. Preprocessing includes the following sub-steps:

[0105] S121: Image Normalization: Adjust the image size and resolution to a uniform standard (e.g., 224x224 pixels) to ensure that all input images are of the same scale and avoid errors in subsequent models due to size differences.

[0106] S122: Noise Removal: Apply Gaussian filtering or median filtering algorithms to reduce image noise (e.g., background impurities between cells). This step addresses noise that may be present in the original image (such as artifacts introduced by medical imaging equipment).

[0107] S123: Contrast Enhancement: Histogram equalization is used to enhance the contrast of cell regions, making cell boundaries clearer. This step operates on the normalized image so that key features (such as cell nuclei) are easier to identify in subsequent detection.

[0108] S124: Color Correction: Convert the image to RGB or grayscale format (depending on model requirements) and correct uneven lighting. This step operates on the denoised image to eliminate the influence of environmental factors on cell color.

[0109] S13: Output: Preprocessed panoramic view of thyroid cells (image data).

[0110] Among them, size adjustment is used to unify the image scale to adapt to the input requirements of subsequent detection models, center cropping is used to preserve the core region of the image to reduce interference from irrelevant background, and standardization is used to eliminate feature deviations caused by differences in image pixel values.

[0111] Specifically, image quality is optimized through preprocessing to remove redundant information, providing standardized, high-quality image input for subsequent multi-scale suspicious region detection, thus ensuring the stability and accuracy of the detection model.

[0112] Step S20: Multi-scale suspicious region detection based on the Yolov9 model, the specific steps are as follows:

[0113] S21: Multi-scale scan initialization: A pre-trained YOLOv9 object detection model is used to perform multi-scale scanning on the input image. Multi-scale scanning is achieved by generating image pyramids of different resolutions (e.g., scaling factors of 0.5, 1.0, and 1.5). This step operates on the image preprocessed in S10; preprocessing ensures uniform image size and quality, which makes multi-scale scanning more efficient.
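To make the pyramid idea concrete, the sketch below computes the pixel size of each pyramid level for the example scaling factors and maps a box found on a scaled level back to original-image coordinates. It is a simplified illustration under stated assumptions: the helper names are hypothetical, and the detector itself is not modeled.

```python
SCALE_FACTORS = (0.5, 1.0, 1.5)  # example factors given in the text


def pyramid_sizes(height, width, scales=SCALE_FACTORS):
    """Pixel size (height, width) of each pyramid level."""
    return [(round(height * s), round(width * s)) for s in scales]


def to_original(box, scale):
    """Map box coordinates from a scaled level back to the original image.

    Dividing every coordinate by the scale works for corner-style boxes and
    for center/size-style boxes alike, since all values are lengths in pixels.
    """
    return tuple(v / scale for v in box)


levels = pyramid_sizes(224, 224)
restored = to_original((10, 20, 30, 40), 0.5)
```

Detections gathered from all levels would typically be merged after this coordinate mapping; the merging strategy is not specified in the text.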

[0114] S22: Suspicious Region Identification: The preprocessed panoramic image is scanned at multiple scales using a trained Yolov9 target detection model; the Yolov9 model, after training, has the ability to extract features across scales and can accurately identify 34 types of suspicious thyroid cell regions in the image;

[0115] Low-probability areas are filtered out based on a confidence threshold, while 34 types of suspicious areas with high confidence are retained. Preferably, the confidence threshold is 0.5.

[0116] Statistical information for each type of suspicious area includes: bounding box coordinates, category label, quantity, and area percentage;

[0117] Bounding box coordinates: four values indicating the location of the region in the image.

[0118] Class count: the number of instances of each type of suspicious region in the sample. For example, 5 regions are detected in the "papillary carcinoma" category.

[0119] The Yolov9 model uses a sliding window across multiple scales to detect and identify 34 types of suspicious regions in thyroid cells, such as normal cells, cancer cells, and inflammatory cells.

[0120] The model output includes the bounding box coordinates and class label for each detection region. Multi-scale scanning covers cellular structures of different sizes (such as small cell clusters or large cell groups), avoiding missed detections caused by single-scale scanning.

[0121] S23: Outputs the detection results of suspicious regions in 34 types of thyroid cells.

[0122] Specifically, by using multi-scale detection to cover lesion areas of different sizes and shapes, the system achieves accurate localization and preliminary identification of 34 types of suspicious areas, providing specific lesion area objects for subsequent cross-modal feature fusion, which is the basis for subsequent feature analysis.
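The confidence filtering and the per-category statistics described in this step can be sketched as follows. The `Detection` record, the corner-style box format, and the inclusive comparison at the 0.5 threshold are assumptions made for illustration; the area percentage here simply sums box areas, so overlapping boxes are counted twice.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple         # (x1, y1, x2, y2) in pixels; corner format is assumed
    category: int      # index of one of the 34 categories
    confidence: float  # detector confidence in [0, 1]


def filter_detections(detections, threshold=0.5):
    """Keep detections whose confidence reaches the threshold."""
    return [d for d in detections if d.confidence >= threshold]


def category_stats(detections, image_area):
    """Per-category region count and share of the image area (in percent)."""
    counts = Counter(d.category for d in detections)
    area = Counter()
    for d in detections:
        x1, y1, x2, y2 = d.box
        area[d.category] += max(0, x2 - x1) * max(0, y2 - y1)
    return {
        c: {"count": counts[c], "area_pct": 100.0 * area[c] / image_area}
        for c in counts
    }


# Toy detections: the third one falls below the threshold and is dropped.
raw = [
    Detection((0, 0, 10, 10), 3, 0.9),
    Detection((10, 10, 20, 30), 3, 0.6),
    Detection((0, 0, 5, 5), 7, 0.4),
]
kept = filter_detections(raw)
stats = category_stats(kept, image_area=100 * 100)
```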

[0123] Step S30: Image-text cross-modal feature fusion and retrieval based on the Chinese CLIP model. The specific steps are as follows:

[0124] S31: Image Feature Extraction: Using a pre-trained Chinese CLIP image-text retrieval model, feature extraction is performed on each suspicious region output from S20. Specifically:

[0125] S311: Based on the bounding box coordinates, crop out a small image of each suspicious area from the original panoramic image.

[0126] S312: The Chinese CLIP model (based on the ViT architecture) extracts a deep feature vector (e.g., a 512-dimensional vector) for each small image. This step relies on the bounding box coordinates from S20, which are used to accurately locate the cropping region.
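Cropping a region from the panoramic image by its bounding box might look like the following sketch. It assumes corner-style coordinates (x1, y1, x2, y2) and an image stored as a list of pixel rows; both are illustrative choices, and `crop_region` is a hypothetical helper.

```python
def crop_region(image, box):
    """Cut the patch described by box = (x1, y1, x2, y2) out of the image."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    # Clamp to the image bounds so boxes that touch the border do not fail.
    x1, x2 = max(0, int(x1)), min(width, int(x2))
    y1, y2 = max(0, int(y1)), min(height, int(y2))
    return [row[x1:x2] for row in image[y1:y2]]


# Toy 8 x 10 image whose pixel value encodes its position (row * 10 + column).
toy_image = [[r * 10 + c for c in range(10)] for r in range(8)]
patch = crop_region(toy_image, (2, 1, 5, 4))
```

Each patch produced this way would then be passed to the image encoder to obtain its feature vector.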

[0127] S32: Text tag preparation: 34 types of Chinese pathological text tags are preset, such as "papillary carcinoma cells" and "benign follicular cells". These tags correspond one-to-one with the 34 types of suspicious areas and serve as a reference for semantic matching.

[0128] S34: Semantic Matching and Feature Generation

[0129] S341: Input the extracted image features and the preset 34 categories of Chinese pathological text labels into the Chinese CLIP model, and calculate the image-text similarity score using cosine similarity.

[0130] S342: Generate a fine-grained feature vector based on the similarity score. This vector encodes the association strength between image features and text labels. For example, a 34-dimensional vector, where each element represents the matching confidence of the corresponding category.

[0131] S343: Establish semantic association: If the similarity score is higher than the threshold (preferably 0.7), the region is confirmed to match the text label;

[0132] otherwise, it is considered a low-confidence region. This step builds on the category information from S20, because the initial category of each region is used to guide the matching process.
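The matching logic of S341 to S343 can be illustrated with a small Python sketch: cosine similarity between one image feature and each text-label feature, a score vector with one entry per label, and the 0.7 threshold. The feature vectors below are toy three-dimensional values rather than real Chinese CLIP embeddings, and the function names and the returned dictionary layout are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_region(image_feature, text_features, threshold=0.7):
    """Score one region against every text label and apply the threshold."""
    scores = [cosine_similarity(image_feature, t) for t in text_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    matched = scores[best] > threshold
    return {
        "scores": scores,  # one entry per label (34 entries in the patent)
        "label_index": best if matched else None,  # None marks low confidence
        "confidence": scores[best],
    }


# Toy label features; real ones would come from the text encoder.
toy_text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
confident = match_region([0.9, 0.1, 0.0], toy_text_features)
uncertain = match_region([1.0, 1.0, 1.0], toy_text_features)
```

In this toy run the first region clearly aligns with label 0, while the second is equally similar to all labels and falls below the threshold.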

[0133] S35: Search Result Generation: Summarize all matching results, output the final category label and confidence score for each suspicious region, and form a search result list. For example, a region was initially detected as "suspicious type A" by YOLOv9, but was confirmed as "papillary carcinoma" through semantic matching.

[0134] S36: Output the search results, including the fine-grained feature vector for each suspicious region, the confirmed category label, and the confidence level.

[0135] Input: the 34 types of suspicious thyroid cell regions output from step S20.

[0136] Specifically, by using cross-modal fusion of images and text, the visual features of images are associated with the semantic features of text, enabling fine-grained feature descriptions of suspicious areas. This provides a feature basis that integrates visual and semantic information for subsequent classification and judgment, thereby improving the richness and accuracy of feature expression.

[0137] Step S40: Based on the output of S30, derive statistical decision and classification judgment. The specific steps are as follows:

[0138] Step S41: Count the number of suspicious areas

[0139] Based on the search results of step S30, the suspicious regions of 34 types of thyroid cells are traversed, and the number of each type of region in the current sample is counted.

[0140] Let the number of suspicious regions of type i be $n_i$ ($i = 1, 2, \dots, 34$). Then the total number of suspicious regions $N$ is calculated as: $N = \sum_{i=1}^{34} n_i$;

[0141] where $n_i$ represents the number of suspicious regions of type i, used to quantify the distribution of a single type of region in the sample; $N$ represents the total number of all suspicious regions in the sample and is the basic parameter for the subsequent calculation of proportions.

[0142] Specifically, the total quantity is obtained through this formula, which provides denominator data for subsequent calculations of the proportion of each category.
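
As a non-limiting illustration, the counting of step S41 may be sketched as follows. The sketch assumes that each confirmed region from step S30 is represented by a 0-based category index; that representation and the function name `count_regions` are hypothetical.

```python
from collections import Counter

NUM_CATEGORIES = 34  # number of suspicious-region categories stated in the text

def count_regions(category_ids):
    """Return (per-category counts, total count) for a list of category indices."""
    for cid in category_ids:
        if not 0 <= cid < NUM_CATEGORIES:
            raise ValueError(f"category index out of range: {cid}")
    counter = Counter(category_ids)
    counts = [counter.get(i, 0) for i in range(NUM_CATEGORIES)]
    return counts, sum(counts)
```

For a sample with 12, 14, 60 and 3 regions in four categories, as in the worked example of this description, the total is 89.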

[0143] Step S42: Construct feature vectors

[0144] The counts of the 34 categories of suspicious regions are normalized to construct the input feature vector of the XGBoost model, $X = (x_1, x_2, \dots, x_{34})$.

[0145] The normalized value $x_i$ of the i-th type of suspicious region is calculated as: $x_i = \frac{n_i}{N + \epsilon}$;

[0146] where $n_i$ is the number of suspicious regions of type i and $N$ is the total number of suspicious regions from step S41; $\epsilon$ is a very small positive number used to avoid a zero denominator when $N = 0$; $x_i$ represents the proportion of the number of suspicious regions of type i to the total number, eliminating the influence of sample size differences on the features.

[0147] Specifically, quantitative features are converted into relative proportion features, making the features of different samples comparable and facilitating the model's learning of patterns.
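
As a non-limiting illustration, the normalization of step S42 may be sketched as follows. The sketch assumes that the small positive constant is added to the total count in the denominator; the default value `1e-6` and the function name `normalize_counts` are assumptions made for illustration only.

```python
def normalize_counts(counts, eps=1e-6):
    """Convert per-category counts into relative proportions.

    eps is a small positive constant (assumed value) that keeps the
    denominator non-zero when the sample contains no suspicious regions.
    """
    total = sum(counts)
    return [c / (total + eps) for c in counts]
```

Because the features are proportions rather than raw counts, two samples with the same category mix but different numbers of detected regions yield nearly identical feature vectors.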

[0148] Step S43: XGBoost Model Inference

[0149] The feature vector $X$ constructed in step S42 is input to the trained XGBoost model to obtain the preliminary grading probability distribution of the sample, $P = (P_1, P_2, \dots, P_5)$, where $P_j$ represents the probability that the sample belongs to the j-th grading result;

[0150] XGBoost uses the Softmax function to convert predicted scores into probabilities:

[0151] $P_j = \frac{e^{z_j}}{\sum_{k=1}^{5} e^{z_k}}$;

[0152] where $z_j$ denotes the predicted score for class j.

[0153] The grade with the highest probability is determined by the argmax function to obtain the final grading result $\hat{y}$. The formula is: $\hat{y} = \arg\max_{j} P_j$;

[0154] where $P_j$ is the probability that the sample belongs to the j-th class, and $\hat{y}$ is the final grading result output by the model.

[0155] Further, $\hat{y} = 1$ corresponds to benign, $\hat{y} = 2$ to papillary carcinoma, $\hat{y} = 3$ to medullary carcinoma, $\hat{y} = 4$ to AUS, and $\hat{y} = 5$ to follicular tumor.

[0156] Specifically, automatic classification is achieved by determining the most likely classification level of a sample through the principle of maximizing probability.
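
As a non-limiting illustration, the Softmax conversion and argmax decision of step S43 may be sketched as follows. The sketch assumes that the five raw class scores are already available as a list ordered benign, papillary carcinoma, medullary carcinoma, AUS, follicular tumor; the names `softmax` and `grade` are hypothetical. In the XGBoost library this conversion is normally performed internally when the `multi:softprob` objective is used, so the code below only illustrates the arithmetic.

```python
import math

GRADES = [
    "benign",
    "papillary carcinoma",
    "medullary carcinoma",
    "AUS",
    "follicular tumor",
]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow; the result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grade(scores):
    """Return the grade with the highest probability and the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return GRADES[best], probs
```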

[0157] Step S44: Determination of Malignancy and Calculation of Severity

[0158] Calculate the total number of malignancy-associated regions among the 34 categories of suspicious regions. According to the feature vector $X$, define the index set $S_{mal}$ corresponding to the malignant categories; then: $M = \sum_{i \in S_{mal}} n_i$;

[0159] Calculate the malignant proportion $R$: $R = \frac{M}{N}$;

[0160] Severity is classified based on the malignant proportion $R$:

[0161] if $R < 0.3$, the sample is determined to be relatively severe;

[0162] if $0.3 \le R \le 0.5$, the sample is determined to be severe;

[0163] if $R > 0.5$, the sample is determined to be very severe.

[0164] where $n_i$ is the number of suspicious regions of type i, $N$ is the total number of suspicious regions, $M$ represents the total number of regions in malignancy-associated categories, and $R$ represents the proportion of malignancy-associated regions to all regions.

[0165] Specifically, by quantifying the proportion of malignant areas, a precise assessment of the severity of the lesions can be achieved.
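
As a non-limiting illustration, the severity determination of step S44 may be sketched as follows. The thresholds 0.3 and 0.5 are the ones referred to in the worked cases of this description; how the boundary values are treated (a ratio of exactly 0.3 graded as severe, only ratios above 0.5 graded as very severe) is an assumption inferred from those cases, and the function names and the choice of malignant category indices are hypothetical.

```python
def malignant_ratio(counts, malignant_indices):
    """Return (M, N, R): malignant region count, total count, and their ratio."""
    total = sum(counts)
    malignant = sum(counts[i] for i in malignant_indices)
    ratio = malignant / total if total else 0.0
    return malignant, total, ratio

def severity(ratio, low=0.3, high=0.5):
    """Map the malignant proportion R to a severity level (assumed boundaries)."""
    if ratio > high:
        return "very severe"
    if ratio >= low:
        return "severe"
    return "relatively severe"
```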

[0166] In one feasible implementation, as shown in Figures 3 to 7, the input is the panoramic image of the thyroid cell sample shown in Figure 3;

[0167] Obtain panoramic images from medical institutions, annotated by pathologists, showing the sample status (benign / papillary carcinoma / medullary carcinoma / AUS / follicular tumor) and the location of 34 types of suspicious areas.

[0168] The panoramic images of thyroid cell samples were uniformly scaled to 2048×2048 pixels; invalid areas at the edges were removed; and pixel values were normalized to the range [0, 1].

[0169] Suspicious region detection and identification: the input is the preprocessed panoramic image;

[0170] The YOLOv9 detection output and multi-scale scanning results are shown in Table 1 below:

[0171] Table 1

[0172]

[0173] Total identified: 89 suspicious areas (12+14+60+3).

[0174] Multimodal semantic association construction, Chinese_CLIP processing flow:

[0175] Input: the 4 types of suspicious region images output by S20 (Figures 4-7);

[0176] The follicular cluster image matches the text label "follicular clusters of varying sizes";

[0177] The papillary arrangement image matches the text label "papillary arrangement";

[0178] The output is a fine-grained feature vector; for example, the similarity for the papillary arrangement is 98.7%.

[0179] Lesion grading and severity assessment, XGBoost qualitative diagnosis: Input the distribution of the number of suspicious areas in 34 categories, follicular clusters: 12, multinucleated giant cells: 14, papillary arrangement: 60, psammoma bodies: 3;

[0180] The output result is papillary carcinoma (a malignant type among the five types of lesions).

[0181] Formula for the proportion of malignant features:

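[0182] $R = \frac{M}{N}$, where $M$ is the number of malignancy-associated regions and $N$ is the total number of suspicious regions;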

[0183] Grading determination:

[0184] Lesion nature: if the malignant proportion is greater than 30%, the lesion is considered malignant.

[0185] Severity: If the percentage of total malignant cases is greater than 0.5, it is considered very severe.

[0186] Example 3

[0187] This embodiment uses a multimodal fusion-based thyroid cell sample grading method, selecting 30 typical samples from 4389 data points. All samples underwent rigorous manual review by pathology experts to ensure their authority and accuracy.

[0188] During the experiment, the pre-trained model of Example 1, the algorithm of Example 2, and the system of Example 3 were used to automatically analyze the above samples. Simultaneously, the analysis results of this scheme were compared with the judgment results of traditional methods to comprehensively evaluate the effectiveness and reliability of this scheme in medical image quality assessment.

[0189] The following section will elaborate on the specific implementation process of this scheme in medical image quality assessment, based on specific experimental data, and present the corresponding experimental results to provide data support and practical basis for the practical application of this scheme. As shown in Table 2 below, the panoramic image of thyroid cells clinically labeled as papillary carcinoma shows typical "ground-glass nuclei," "nuclear grooves," and papillary structures in the pathological section, which is a typical malignant sample for algorithm verification.

[0190] Table 2

[0191]

[0192] Based on Table 2, several abnormal or typical cases are explained in detail as follows:

[0193] In this embodiment, case LD0003871 is papillary carcinoma, graded very severe.

[0194] Original data characteristics: M=76, N=86, R=0.88 (far higher than the 0.5 threshold);

[0195] The output classification is papillary carcinoma, with a severity level of "very severe";

[0196] The manual review was marked as "yes". Pathological microscopy revealed numerous nuclear grooves and ground-glass nuclei, consistent with typical characteristics of papillary carcinoma.

[0197] The algorithm accurately identifies cases with high R values (an extremely high proportion of malignant regions), demonstrating the advantages of multi-model collaboration: YOLOv9 accurately captures highly correlated regions, XGBoost quickly locks onto the subtype based on the feature vector, and Chinese CLIP semantic matching enhances feature correlation.

[0198] In this embodiment, case LD0001056 is a severe case of papillary carcinoma.

[0199] Original data features: M=27, N=90, R=0.30 (just reaching the 0.3 threshold);

[0200] The output classification is papillary carcinoma, with a severity level of "severe".

[0201] The manual review was marked as "No". Manual microscopic examination revealed that some areas contained inflammatory cells, and the actual malignant proportion was slightly less than 0.3.

[0202] The algorithm exhibits bias at the threshold boundary: it fails to distinguish between "malignancy-associated regions" and "inflammatory interference regions". This is attributed to YOLOv9's insufficient detection accuracy in low-contrast regions, which inflates the value of M.

[0203] In this embodiment, case LD0002901 is AUS, graded relatively severe.

[0204] Original data characteristics: M=16, N=55, R=0.291 (close to the 0.3 threshold).

[0205] The output rating is AUS, with a severity level of "severe".

[0206] The manual review was marked as "yes," and the proportion of atypical cells was manually confirmed to meet the AUS diagnostic criteria.

[0207] The algorithm accurately identifies AUS cases near the threshold, demonstrating the advantages of Chinese CLIP semantic matching. By precisely associating "atypical" text labels with image features, it avoids misjudging such cases as malignant.

[0208] In this embodiment, case LD0000349 is a follicular tumor, which is severe.

[0209] Original data features: M=15, N=50, R=0.30 (just reaching the 0.3 threshold);

[0210] The output classification is follicular tumor, with a severity level of "severe".

[0211] The manual review is marked as "yes", indicating that the proportion of abnormal areas in the follicle structure meets the standard.

[0212] The algorithm is stable in threshold judgment for follicular tumors. Its advantage lies in the specialized training of the XGBoost model for the feature of "abnormal follicular structure", which improves the recognition accuracy of rare subtypes.

[0213] Key case conclusions: The algorithm performs well on malignant cases with high R values (advantage), but is susceptible to interference near the threshold boundary (defect); the anti-interference ability of YOLOv9 and the feature weight allocation of XGBoost need to be optimized.
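
As a non-limiting illustration, the R values quoted for the four cases above can be recomputed from their M and N values, taking R as M divided by N. The dictionary `CASES` simply restates the figures given in this embodiment; the function name `ratios` is hypothetical.

```python
# (M, N) pairs as reported for each case in this embodiment
CASES = {
    "LD0003871": (76, 86),
    "LD0001056": (27, 90),
    "LD0002901": (16, 55),
    "LD0000349": (15, 50),
}

def ratios(cases):
    """Return the malignant proportion R = M / N for each case, rounded to 3 places."""
    return {case_id: round(m / n, 3) for case_id, (m, n) in cases.items()}
```

The recomputed values are 0.884, 0.3, 0.291 and 0.3 respectively, which agree with the reported 0.88, 0.30, 0.291 and 0.30 and show that two of the four cases sit exactly on the 0.3 threshold.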

[0214] Table 3: Comparison of Recognition Accuracy

[0215]

[0216] In this embodiment, the accuracy of this scheme is significantly higher than that of traditional methods, with lower false negative and false positive rates. Its advantage lies in the fact that multi-model collaboration compensates for the feature extraction deficiencies of a single model, and semantic matching reduces subjective human error.

[0217] Table 4: Comparison of Processing Efficiency

[0218]

[0219] In this embodiment, the processing efficiency of this solution is close to that of a single model and far exceeds that of manual microscopic examination. Although the training time is slightly longer, a balance between efficiency and accuracy is achieved through multi-model fusion, making it suitable for clinical batch testing scenarios. Areas for improvement lie in optimizing the model's lightweight design to further shorten the processing time per sample.

[0220] Table 5: Abnormal Cell Detection Capability Table

[0221]

[0222] In this embodiment, the detection accuracy for benign, medullary carcinoma, and follicular tumors reached 100%, while papillary carcinoma and AUS showed slight errors (such as one case being missed by AUS). Overall, the detection accuracy met the preset threshold (≥80%), and the detection capability was reliable.

[0223] Table 6: Processing Efficiency Analysis Table

[0224]

[0225] In this embodiment, the processing efficiency far exceeds the preset threshold, making it suitable for high-throughput clinical scenarios, and the automated process significantly reduces labor costs.

[0226] Table 7: Validation Table of Human Intervention Rate and Pathological Gold Standard

[0227]

[0228] In this embodiment, both the rate of human intervention and the consistency with the gold standard met the requirements, indicating that the algorithm reduced unnecessary human intervention while ensuring diagnostic reliability.

[0229] Example 4

[0230] This embodiment is further explained in conjunction with Example 1, with the structure shown in Figure 8. Figure 8 is a schematic diagram of the structure of a computer device provided in an embodiment of this application. The computer device includes:

[0231] Processor, memory, communication bus, and computer programs stored in memory that can run on the processor.

[0232] The processor can call the computer program in the memory to implement the thyroid cell sample grading method based on multimodal fusion provided in the above embodiments when executing the program. The method includes: S10, acquiring and preprocessing the labeled panoramic image of the thyroid cell sample; S20, using a trained YOLOv9 target detection model to perform multi-scale scanning on the thyroid cell panoramic image after preprocessing in step S10; identifying 34 types of suspicious thyroid cell regions based on the multi-scale scanning and outputting the 34 types of suspicious thyroid cell regions; S30, using a trained Chinese CLIP image retrieval model to extract the image features of the 34 types of suspicious thyroid cell regions output in step S20; performing semantic matching between the extracted image features and preset 34 types of Chinese pathological text tags to generate fine-grained feature vectors; establishing semantic association between image features and text tags based on the fine-grained feature vectors to obtain retrieval results; S40, based on the retrieval results obtained in step S30 for each thyroid cell sample, using the XGBoost model to determine the sample grading, and calculating the ratio according to preset rules to determine the positive / negative and severity of the sample.

[0233] Furthermore, computer equipment also includes:

[0234] The Communications Interface (CI) is used for communication between the memory and the processor.

[0235] The memory may include high-speed RAM, and may also include non-volatile memory, such as at least one disk drive.

[0236] If the memory, processor, and communication interface are implemented independently, they can be interconnected via a bus to communicate with each other. The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, the bus is represented by a single thick line in Figure 8, but this does not mean that there is only one bus or one type of bus.

[0237] Furthermore, when the logical instructions in the aforementioned memory can be implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0238] A processor may include one or more processing units, such as an application processor (AP), an application-specific integrated circuit (ASIC), a modem processor, a central processing unit (CPU), an image signal processor (ISP), a controller, memory, a video codec, a digital signal processor (DSP), a baseband processor, and / or a neural network processing unit (NPU). Different processing units may be independent devices or integrated into one or more processors. The controller may serve as a central nervous system and command center. The controller generates operation control signals based on instruction opcodes and timing signals to control instruction fetching and execution. The processor may also include memory for storing instructions and data. In some embodiments, the memory in the processor is a cache memory. This memory can store instructions or data that the processor has recently used or that is used repeatedly. If the processor needs to reuse the instruction or data, it can directly retrieve it from the memory. This avoids repeated access, reduces processor waiting time, and thus improves system efficiency.

[0239] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0240] Display devices are used to display images, videos, etc. Display devices may include display panels, which may employ liquid crystal displays (LCDs), organic light-emitting diodes (OLEDs), active-matrix organic light-emitting diodes (AMOLEDs), flexible light-emitting diodes (FLEDs), MiniLEDs, MicroLEDs, Micro-OLEDs, quantum dot light-emitting diodes (QLEDs), etc.

[0241] Alternatively, in a specific implementation, if the memory, processor, and communication interface are integrated on a single chip, then the memory, processor, and communication interface can communicate with each other through an internal interface.

[0242] On the other hand, this application embodiment also provides a computer non-transitory readable storage medium storing a computer program. When executed by a processor, the program implements the above-mentioned thyroid cell sample grading method based on multimodal fusion. The method includes: S10, acquiring and preprocessing a labeled panoramic image of thyroid cell samples; S20, using a trained YOLOv9 target detection model to perform multi-scale scanning on the thyroid cell panoramic image preprocessed in step S10; identifying 34 types of suspicious thyroid cell regions based on the multi-scale scanning and outputting the 34 types of suspicious thyroid cell regions; S30, using a trained Chinese CLIP image retrieval model to extract image features of the 34 types of suspicious thyroid cell regions output in step S20; performing semantic matching between the extracted image features and preset 34 types of Chinese pathological text tags to generate fine-grained feature vectors; establishing semantic association between image features and text tags based on the fine-grained feature vectors to obtain retrieval results; S40, based on the retrieval results obtained in step S30 for each thyroid cell sample, using an XGBoost model to determine the sample grading, and calculating a ratio according to preset rules to determine the positive / negative and severity of the sample.

[0243] In another aspect, embodiments of this application also provide a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. The computer program includes computer instructions. When the computer program is executed by a processor, the computer can execute the thyroid cell sample grading method based on multimodal fusion provided by the above methods. This method includes: S10, acquiring and preprocessing annotated panoramic images of thyroid cell samples; S20, using a trained YOLOv9 target detection model to perform multi-scale scanning on the thyroid cell panoramic images preprocessed in step S10; identifying 34 types of suspicious thyroid cell regions based on the multi-scale scanning, and outputting the 34 types of suspicious thyroid cell regions; S30, using a trained Chinese CLIP image retrieval model to extract image features of the 34 types of suspicious thyroid cell regions output in step S20; performing semantic matching between the extracted image features and 34 types of preset Chinese pathological text tags to generate fine-grained feature vectors; establishing semantic association between image features and text tags based on the fine-grained feature vectors to obtain retrieval results; S40, based on the retrieval results obtained in step S30 for each thyroid cell sample, using the XGBoost model to determine the sample grading, and calculating the ratio according to preset rules to determine the positive / negative and severity of the sample.

[0244] The logic and / or steps represented in the flowchart or otherwise described herein, for example, can be considered as a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus or device (such as a computer-based system, a processor-included system or other system that can fetch and execute instructions from, an instruction execution system, apparatus or device).

[0245] For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use in or in conjunction with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer disk drive (magnetic device), random access memory (RAM), read-only memory (ROM), erasable and editable read-only memory (EPROM or flash memory), fiber optic devices, and portable optical disc read-only memory (CDROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. Furthermore, a computer-readable medium can even be paper or other suitable media on which the program can be printed, since the program can be obtained electronically by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in a computer memory.

[0246] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0247] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0248] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0249] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other.

[0250] It should be understood that the various forms of processes shown above can be used to rearrange, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0251] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.

Claims

1. A thyroid cell sample grading method based on multimodal fusion, characterized in that it includes the following steps: S10. Obtain and preprocess a labeled panoramic image of the thyroid cell sample; S20. Use the trained YOLOv9 target detection model to perform multi-scale scanning on the thyroid cell panoramic image preprocessed in step S10; identify 34 types of suspicious thyroid cell regions based on the multi-scale scanning, and output the 34 types of suspicious thyroid cell regions; S30. Use the trained Chinese CLIP image retrieval model to extract the image features of the 34 types of suspicious thyroid cell regions output in step S20; semantically match the extracted image features with 34 preset categories of Chinese pathological text tags to generate fine-grained feature vectors; establish the semantic association between image features and text tags based on the fine-grained feature vectors to obtain retrieval results, wherein the retrieval results include the fine-grained feature vector of each suspicious region, the confirmed category label, and the confidence level; S40. Based on the retrieval results obtained in step S30 for each thyroid cell sample, use the XGBoost model to determine the sample grading: count the number of each of the 34 types of suspicious regions in the thyroid cell sample, input the statistical results into the trained XGBoost model, determine the sample grading result according to the number distribution of the 34 types of suspicious regions in the sample, and calculate the ratio according to preset rules to determine the positive / negative status and severity of the sample.

2. The method of claim 1, wherein, in step S10, each type of suspicious region includes bounding box coordinates and the number of categories; the suspicious regions include one or more of the following: papillary arrangement, psammoma body, follicular cluster, or multinucleated giant cell.

3. The method of claim 1, wherein, in step S40, the sample grading result is one of benign, papillary carcinoma, medullary carcinoma, AUS, or follicular tumor.

4. The method of claim 3, wherein the method further comprises: letting the number of suspicious regions of the i-th type be $n_i$, the total number of suspicious regions is calculated by the formula: $N = \sum_{i=1}^{34} n_i$; wherein $n_i$ represents the number of suspicious regions of the i-th type; $N$ represents the total number of suspicious regions in the sample.

5. The thyroid cell sample grading method based on multimodal fusion according to claim 4, characterized in that step S40 also includes the following steps: normalizing the counts of the 34 types of suspicious regions to construct the input feature vector of the XGBoost model, $X = (x_1, x_2, \dots, x_{34})$; the normalized value $x_i$ of the i-th type of suspicious region is calculated as: $x_i = \frac{n_i}{N + \epsilon}$; where $\epsilon$ is a very small positive number used to avoid a zero denominator when $N = 0$; $x_i$ represents the proportion of the number of suspicious regions of type i to the total number.

6. The method of claim 5, wherein the method further comprises: inputting the constructed feature vector $X$ into the trained XGBoost model to obtain the preliminary grading probability distribution of the sample, $P = (P_1, P_2, \dots, P_5)$; wherein $P_j$ denotes the probability that the sample belongs to the j-th class of grading results; based on XGBoost, the Softmax function is used to convert the predicted score into a probability, with the following formula: $P_j = \frac{e^{z_j}}{\sum_{k=1}^{5} e^{z_k}}$; wherein $z_j$ is the predicted score for the j-th class; the final grading result $\hat{y}$ is determined by the argmax function, with the formula: $\hat{y} = \arg\max_{j} P_j$; where $P_j$ is the probability that the sample belongs to the j-th class, and $\hat{y}$ is the final grading result output by the model.

7. The method of claim 6, wherein: $\hat{y} = 1$ corresponds to benign, $\hat{y} = 2$ to papillary carcinoma, $\hat{y} = 3$ to medullary carcinoma, $\hat{y} = 4$ to AUS, and $\hat{y} = 5$ to follicular tumor.

8. The method of claim 5, wherein the method further comprises: calculating the total number of malignancy-associated regions among the 34 categories of suspicious regions; according to the feature vector $X$, defining the index set $S_{mal}$ corresponding to the malignant categories, with the formula: $M = \sum_{i \in S_{mal}} n_i$; the malignant proportion $R$ is calculated as: $R = \frac{M}{N}$; severity is classified based on the malignant proportion $R$: if $R < 0.3$, determined to be relatively severe; if $0.3 \le R \le 0.5$, determined to be severe; if $R > 0.5$, determined to be very severe; where $M$ represents the total number of regions in malignancy-associated categories, and $R$ represents the proportion of malignancy-associated regions to all regions.

9. A computer non-transitory readable storage medium having stored thereon computer programs / instructions, characterized in that, When the computer program / instructions are executed by the processor, they implement the steps of the thyroid cell sample grading method based on multimodal fusion as described in any one of claims 1-8.

10. A computer program product comprising computer programs / instructions, characterized in that, When the computer program / instructions are executed by one or more processors, they implement the steps of the thyroid cell sample grading method based on multimodal fusion as described in any one of claims 1-8.