Multimodal endoscopic laryngeal cancer infiltration depth and lymph node metastasis intelligent prediction system
The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis solves the problem of insufficient integration of optical morphology and deep structural features in preoperative assessment of laryngeal cancer, and realizes synchronous intelligent prediction of laryngeal cancer invasion depth and lymph node metastasis, thus improving the accuracy and efficiency of assessment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- WEST CHINA HOSPITAL SICHUAN UNIV
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing preoperative assessment techniques for laryngeal cancer cannot achieve the synergistic integration of the surface optical morphology and deep tomographic structural features of laryngeal cancer lesions, nor can they simultaneously complete the integrated intelligent prediction of laryngeal cancer invasion depth and lymph node metastasis. This results in insufficient accuracy and efficiency of the assessment, and the reliance on manual interpretation leads to issues of subjectivity and poor consistency.
A multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis is adopted. Multimodal features are generated through image acquisition, preprocessing, annotation, enhancement, single-modal feature extraction and multimodal feature fusion. Finally, laryngeal cancer invasion depth and lymph node metastasis are predicted by the intelligent prediction network based on these multimodal features.
This method achieves the synergistic fusion of surface optical and deep tomographic imaging features of laryngeal cancer lesions, improving the accuracy and efficiency of preoperative assessment and supporting precision clinical diagnosis and treatment.
Abstract figure: CN122243978A_ABST
Description
Technical Field
[0001] This invention relates to the field of medical information technology, specifically to a multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis.
Background Technology
[0002] Laryngeal cancer is a common malignant tumor of the head and neck, which seriously threatens the life and health of patients. It can also impair key physiological functions such as vocalization and swallowing, significantly reducing the quality of life of patients.
[0003] Currently, routine preoperative assessment of laryngeal cancer in clinical practice largely relies on endoscopists' manual interpretation of optical images such as white light endoscopy and narrow-band imaging endoscopy, combined with the results of external imaging examinations such as computed tomography and magnetic resonance imaging. Existing intelligent analysis technologies for laryngeal cancer endoscopic images mostly focus on basic tasks such as lesion area identification and benign / malignant classification of lesions in single-modality endoscopic images. The few related studies on tumor invasion assessment are also mostly based on feature extraction and analysis of single-type image data, and have not formed a systematic multi-dimensional assessment scheme.
[0004] On the one hand, existing technologies cannot achieve the synergistic integration and utilization of the surface optical morphological features and deep tomographic structural features of laryngeal cancer lesions. Most analyses are based on single-modal images, which provide only limited, local, single-level information about the lesion. Nor can they complete the integrated intelligent prediction of laryngeal cancer invasion depth and lymph node metastasis at the same time, so preoperative assessment is insufficiently accurate and comprehensive and struggles to meet the core needs of precise clinical diagnosis and treatment. On the other hand, the current manual interpretation of endoscopic images is highly dependent on the clinical experience and professional level of physicians, which leads to strong subjectivity, poor inter-observer and intra-observer consistency, and low assessment efficiency. It is difficult to achieve standardized and quantitative extraction and analysis of lesion features, which can easily lead to underestimation or overestimation of preoperative staging and affect the rationality of clinical diagnosis and treatment decisions.
Summary of the Invention
[0005] The present invention aims to at least partially solve the technical problems in the above-mentioned technologies.
[0006] Therefore, this invention discloses a multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis, comprising:
[0007] The image acquisition module is used to receive surface optical color images and tomographic grayscale images of lesions obtained by endoscopy.
[0008] An image preprocessing module is used to preprocess the surface optical color image and the tomographic grayscale image;
[0009] The image annotation module is used to manually annotate the target lesion areas in the preprocessed surface optical color image and the tomographic grayscale image;
[0010] The image enhancement module is used to enhance the manually annotated surface optical color image and the tomographic grayscale image;
[0011] A single-modal feature extraction module is used to extract features of the enhanced surface optical color image and features of the tomographic grayscale image, respectively.
[0012] A multimodal feature fusion module is used to fuse the features of the surface optical color image and the features of the tomographic grayscale image to generate multimodal features;
[0013] The diagnostic module is used to generate a prediction result based on the multimodal features.
[0014] The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis disclosed in this invention can achieve the synergistic fusion of surface optical and deep tomographic imaging features of laryngeal cancer lesions, and simultaneously complete the integrated intelligent prediction of invasion depth and lymph node metastasis, significantly improving the accuracy and efficiency of preoperative assessment and supporting precise clinical diagnosis and treatment.
[0015] In addition, the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis disclosed in this invention may also have the following additional technical features:
[0016] Furthermore, in the image preprocessing module, the surface optical color image is preprocessed, specifically as follows:
[0017] S1.1: Perform privacy desensitization processing on the surface optical color image, and mask and remove the patient privacy information and irrelevant device watermarks embedded in the surface optical color image to obtain a desensitized image;
[0018] S1.2: Perform color space correction and white balance calibration on the desensitized image to unify the color reference of the surface optical color image and obtain a color-standardized image;
[0019] S1.3: Perform edge-preserving denoising on the color-standardized image. Use a bilateral filtering algorithm to filter out Gaussian noise and salt-and-pepper noise generated during image acquisition, while retaining the lesion edges, mucosal texture and microvascular features in the surface optical color image, to obtain a denoised image;
[0020] S1.4: Perform contrast normalization processing on the denoised image using the contrast-limited adaptive histogram equalization (CLAHE) algorithm to obtain a contrast-enhanced image;
[0021] S1.5: The contrast-enhanced image is subjected to size and resolution normalization processing, and the image is scaled to a uniform input size using a bicubic interpolation algorithm to obtain the preprocessed surface optical color image.
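The edge-preserving denoising of step S1.3 can be illustrated with a small NumPy sketch of a bilateral filter. This is a naive single-channel version written for readability, not the patented implementation; the function name `bilateral_filter` and the default `radius`, `sigma_space` and `sigma_range` values are illustrative assumptions.

```python
import numpy as np


def bilateral_filter(img, radius=2, sigma_space=2.0, sigma_range=25.0):
    """Naive bilateral filter for a 2-D grayscale image (illustrative only).

    Each output pixel is a weighted mean of its neighbours. The weight is the
    product of a spatial Gaussian (distance in pixels) and a range Gaussian
    (difference in intensity), so pixels across a strong edge contribute little.
    """
    img = img.astype(np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="reflect")
    acc = np.zeros_like(img)
    norm = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbour image shifted by (dy, dx), cropped back to HxW.
            shifted = padded[radius + dy: radius + dy + h,
                             radius + dx: radius + dx + w]
            w_space = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_space ** 2))
            w_range = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_range ** 2))
            weight = w_space * w_range
            acc += weight * shifted
            norm += weight
    return acc / norm


# Usage: a noisy step edge stays sharp while flat regions are smoothed.
rng = np.random.default_rng(0)
step = np.full((32, 32), 50.0)
step[:, 16:] = 200.0
noisy = step + rng.normal(0.0, 5.0, step.shape)
denoised = bilateral_filter(noisy)
```

For a color image the same function could be applied per channel; a production system would more likely call an optimized library routine.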
[0022] Furthermore, in the image preprocessing module, the tomographic grayscale image is preprocessed, specifically as follows:
[0023] S2.1: Perform privacy desensitization processing on the tomographic grayscale image, and mask and remove the patient privacy information and irrelevant device watermarks embedded in the tomographic grayscale image to obtain the desensitized tomographic image;
[0024] S2.2: The desensitized tomographic image is subjected to speckle noise suppression processing. An anisotropic diffusion filtering algorithm is used to filter out the inherent speckle noise of tomographic imaging, while preserving the laryngeal wall tissue layer boundary and lesion infiltration contour features to obtain a denoised tomographic image.
[0025] S2.3: Perform window width and window level normalization processing on the denoised tomographic image to map the image grayscale values to the standard grayscale range, thereby obtaining a grayscale normalized tomographic image;
[0026] S2.4: For the grayscale normalized tomographic images acquired in a continuous sequence, perform inter-layer rigid registration processing to obtain the registered tomographic images;
[0027] S2.5: The registered tomographic image is subjected to size and resolution standardization processing. A bicubic interpolation algorithm is used to scale the image to a uniform input size that matches the surface optical color image, thereby obtaining the preprocessed tomographic grayscale image.
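The speckle suppression of step S2.2 can be sketched with Perona-Malik anisotropic diffusion, the specific algorithm named in the detailed embodiment. The defaults below mirror the 10 iterations and conduction coefficient of 50 stated there. The step size of 0.2, the exponential conduction function and the function name `perona_malik` are assumptions made for illustration.

```python
import numpy as np


def perona_malik(img, n_iter=10, kappa=50.0, step=0.2):
    """Perona-Malik anisotropic diffusion on a 2-D image (illustrative sketch).

    kappa is the conduction coefficient: intensity differences much larger
    than kappa are treated as edges and diffuse very little.
    step must be <= 0.25 for the 4-neighbour scheme to remain stable.
    """
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # Differences to the four neighbours; replicated borders give zero flux.
        p = np.pad(u, 1, mode="edge")
        d_north = p[:-2, 1:-1] - u
        d_south = p[2:, 1:-1] - u
        d_west = p[1:-1, :-2] - u
        d_east = p[1:-1, 2:] - u
        # Conduction g(d) = exp(-(d / kappa)^2) weights each directional flux.
        flux = sum(np.exp(-(d / kappa) ** 2) * d
                   for d in (d_north, d_south, d_west, d_east))
        u += step * flux
    return u


# Usage: smooth noise inside two tissue layers without blurring their boundary.
rng = np.random.default_rng(1)
layers = np.full((32, 32), 40.0)
layers[:, 16:] = 220.0
speckled = layers + rng.normal(0.0, 5.0, layers.shape)
smoothed = perona_malik(speckled)
```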
[0028] Furthermore, in the image enhancement module, the enhancement of the manually annotated surface optical color image specifically includes:
[0029] S3.1: Based on the target lesion region in the surface optical color image completed by manual annotation, generate the corresponding lesion foreground binary mask;
[0030] S3.2: Based on the binary mask of the lesion foreground, perform foreground feature enhancement processing on the target lesion region. The mucosal texture, microvascular morphology and lesion edge details in the target lesion region are processed by an adaptive texture enhancement operator to obtain a foreground enhanced image.
[0031] S3.3: Perform background feature suppression processing on the background area outside the foreground binary mask of the lesion to obtain a background-suppressed image;
[0032] S3.4: Perform pixel value range normalization processing on the background-suppressed image to obtain the enhanced surface optical color image.
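Steps S3.1 to S3.4 can be sketched as mask-guided enhancement. The text does not specify the adaptive texture enhancement operator, so the sketch below substitutes simple unsharp masking as a stand-in. The function name `enhance_with_mask` and the `gain` and `background_scale` parameters are hypothetical.

```python
import numpy as np


def enhance_with_mask(image, mask, gain=1.5, background_scale=0.3):
    """Mask-guided enhancement of an HxWx3 image (illustrative sketch).

    mask is an HxW boolean lesion-foreground mask (S3.1).
    """
    img = image.astype(np.float64)
    m = mask.astype(bool)[..., None]          # broadcast over color channels
    h, w = img.shape[:2]
    # S3.2 stand-in: unsharp masking with a 3x3 box blur sharpens texture.
    p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    blur = sum(p[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)) / 9.0
    sharpened = img + gain * (img - blur)
    foreground_enhanced = np.where(m, sharpened, img)
    # S3.3: attenuate everything outside the lesion mask.
    suppressed = np.where(m, foreground_enhanced,
                          foreground_enhanced * background_scale)
    # S3.4: rescale pixel values to the range [0, 1].
    lo, hi = suppressed.min(), suppressed.max()
    if hi <= lo:
        return np.zeros_like(suppressed)
    return (suppressed - lo) / (hi - lo)


# Usage with a synthetic image and a square lesion mask.
rng = np.random.default_rng(2)
image = rng.uniform(50.0, 200.0, (16, 16, 3))
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True
enhanced = enhance_with_mask(image, mask)
```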
[0033] Furthermore, in the image enhancement module, the enhancement of the manually annotated tomographic grayscale image specifically involves:
[0034] S4.1: Based on the target lesion region in the tomographic grayscale image completed by manual annotation, and combined with the spatial location information of the continuous tomographic sequence, generate a binary mask of the lesion foreground at the corresponding tomographic level.
[0035] S4.2: Based on the binary mask of the lesion foreground, perform edge structure enhancement processing on the target lesion region, and process the lesion infiltration boundary and the laryngeal wall tissue layering interface through the edge-preserving gradient enhancement operator to obtain a foreground enhanced tomographic image;
[0036] S4.3: Perform background grayscale normalization processing on the background area outside the foreground binary mask of the lesion to obtain a background-suppressed tomographic image;
[0037] S4.4: For the background-suppressed tomographic images acquired in a continuous sequence, perform inter-layer feature consistency enhancement processing to obtain the enhanced tomographic grayscale image.
[0038] Furthermore, in the single-modal feature extraction module, the features of the enhanced surface optical color image are extracted, specifically as follows:
[0039] S5.1: Using the enhanced surface optical color image as input, and combining it with the lesion foreground binary mask corresponding to the surface optical color image, determine the target lesion region as the effective feature input;
[0040] S5.2: The target lesion region is hierarchically encoded by a convolutional neural network encoding backbone. First, the shallow layers of the backbone extract the low-level visual features of the target lesion region, which include lesion edge contour, mucosal texture, microvascular morphology, and gray-level gradient features.
[0041] S5.3: The deep layers of the convolutional neural network encoding backbone further encode the low-level visual features and extract the high-dimensional semantic features of the target lesion region, which include lesion morphology irregularity, the range of mucosal invasion, and the distribution of abnormal lesion areas.
[0042] S5.4: Perform multi-scale feature aggregation on the extracted low-level visual features and the high-dimensional semantic features to generate an initial surface feature set corresponding to the surface optical color image;
[0043] S5.5: Perform feature normalization and dimensionality reduction optimization on the initial surface feature set to generate the features of the surface optical color image.
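Steps S5.4 and S5.5 can be illustrated without a trained network by treating the shallow and deep feature maps as given arrays. In the sketch below, global average pooling performs the multi-scale aggregation, and z-scoring followed by PCA (via SVD) performs the normalization and dimensionality reduction. Both choices are assumptions, since the text does not name specific operators, and the random arrays merely stand in for backbone outputs.

```python
import numpy as np


def global_avg_pool(feature_map):
    """Collapse a (C, H, W) feature map to a length-C vector."""
    return feature_map.mean(axis=(1, 2))


def aggregate(shallow_map, deep_map):
    """S5.4 stand-in: pool each scale and concatenate into one vector."""
    return np.concatenate([global_avg_pool(shallow_map),
                           global_avg_pool(deep_map)])


def normalize_and_reduce(features, n_components):
    """S5.5 stand-in: z-score each dimension, then project with PCA.

    features has shape (N, D): one aggregated vector per lesion image.
    """
    centered = features - features.mean(axis=0)
    scaled = centered / (features.std(axis=0) + 1e-8)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return scaled @ vt[:n_components].T


# Usage: 6 images, 8 shallow channels at 32x32 and 16 deep channels at 8x8.
rng = np.random.default_rng(5)
batch = np.stack([
    aggregate(rng.normal(size=(8, 32, 32)), rng.normal(size=(16, 8, 8)))
    for _ in range(6)
])
surface_features = normalize_and_reduce(batch, n_components=4)
```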
[0044] Furthermore, in the single-modal feature extraction module, the features of the enhanced tomographic grayscale image are extracted, specifically as follows:
[0045] S6.1: Using the enhanced tomographic grayscale image as input, and combining it with the lesion foreground binary mask corresponding to the tomographic grayscale image, determine the target lesion region as an effective feature input, and simultaneously associate it with the adjacent layer images of the continuous tomographic sequence in which the target lesion region is located.
[0046] S6.2: Using a tomographic image feature coding network, structural features are extracted from the target lesion area on a single layer to obtain the gray-scale distribution features of the lesion, the tissue layer boundary features, the infiltration contour morphology features, and the gray-scale difference features between the lesion and the surrounding normal tissue within the single layer.
[0047] S6.3: Through the sequence coding branch of the tomographic image feature coding network, the adjacent layer features of the continuous tomographic sequence are correlated and coded to extract the cross-layer spatial features of the target lesion in the longitudinal direction. The cross-layer spatial features include the lesion infiltration depth extension range, tissue layer invasion continuity, and deep structure destruction degree features.
[0048] S6.4: Perform multi-dimensional feature aggregation on the extracted single-layer structural features and cross-layer spatial features to generate an initial deep feature set corresponding to the tomographic grayscale image;
[0049] S6.5: Perform feature normalization and dimensionality reduction optimization on the initial deep feature set to generate the features of the tomographic grayscale image.
[0050] Furthermore, in the multimodal feature fusion module, the features of the surface optical color image and the features of the tomographic grayscale image are fused to generate the multimodal features, specifically:
[0051] S7.1: Receive the features of the surface optical color image and the features of the tomographic grayscale image output by the single-modal feature extraction module, and complete the association and alignment of the feature spatial coordinates based on the spatial registration relationship of the target lesion area corresponding to the two types of features. At the same time, perform dimensional adaptation of the two types of features through a fully connected layer to generate surface feature vectors and deep feature vectors with unified dimensions.
[0052] S7.2: Input the surface feature vector and the deep feature vector into the cross-modal feature interaction coding network, model the complementary relationship between the two types of features through the cross attention mechanism, construct the corresponding mapping relationship between the surface boundary information of the lesion in the features of the surface optical color image and the deep infiltration information of the lesion in the features of the tomographic grayscale image, and generate a cross-modal interactive feature set;
[0053] S7.3: Based on the dual task objective of predicting the depth of laryngeal cancer invasion and lymph node metastasis, an adaptive dynamic weighting process is performed on the cross-modal interactive feature set. The weight coefficients of surface-related features and deep-related features are adjusted according to the feature contribution of different prediction tasks to generate a weighted optimized feature set.
[0054] S7.4: Perform global feature aggregation and normalization on the weighted optimized feature set to remove feature redundancy and noise interference, and generate the multimodal features.
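The cross-attention interaction of S7.2 and the task-dependent weighting of S7.3 can be sketched in NumPy as follows. This is a simplified illustration: it omits the learned query, key and value projections a trained network would have, and `task_logits` is a hypothetical pair of gating scores standing in for the adaptive weight coefficients of one prediction task.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention from one modality onto the other.

    queries: (Nq, d) tokens; keys_values: (Nk, d) tokens of the other modality.
    Returns the attended context (Nq, d) and the attention map (Nq, Nk).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return attn @ keys_values, attn


def fuse(surface, deep, task_logits):
    """S7.2 to S7.4 stand-in: interact, weight per task, pool and normalize."""
    surface_ctx, _ = cross_attention(surface, deep)   # surface attends to deep
    deep_ctx, _ = cross_attention(deep, surface)      # deep attends to surface
    w_surface, w_deep = softmax(np.asarray(task_logits, dtype=float))
    pooled = (w_surface * (surface + surface_ctx).mean(axis=0)
              + w_deep * (deep + deep_ctx).mean(axis=0))
    return pooled / (np.linalg.norm(pooled) + 1e-8)


# Usage: 5 surface tokens and 7 deep tokens in a shared 8-dimensional space.
rng = np.random.default_rng(3)
surface = rng.normal(size=(5, 8))
deep = rng.normal(size=(7, 8))
context, attention = cross_attention(surface, deep)
fused = fuse(surface, deep, task_logits=[0.2, 1.0])
```

Calling `fuse` once per task with different `task_logits` gives each prediction branch its own balance of surface and deep information, which is one plausible reading of the adaptive dynamic weighting described above.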
[0055] Furthermore, in the diagnostic module, the prediction result is generated based on the multimodal features, specifically as follows:
[0056] S8.1: Receive the multimodal features output by the multimodal feature fusion module, input the multimodal features into the dual-task intelligent prediction network, and set up a laryngeal cancer invasion depth prediction branch and a lymph node metastasis prediction branch respectively in the dual-task intelligent prediction network;
[0057] S8.2: Through the laryngeal cancer invasion depth prediction branch, decode and reason about the multimodal features to generate the submucosal invasion depth value, tissue invasion level, and laryngeal cartilage invasion risk value of the target lesion area, and simultaneously map them to the AJCC laryngeal cancer TNM staging standard to output the corresponding T stage prediction result.
[0058] S8.3: Through the lymph node metastasis prediction branch, decode and reason about the multimodal features to generate the probability of cervical lymph node metastasis risk, location information of suspicious metastatic lymph nodes, and the grading result of metastasis range. At the same time, map it to the AJCC laryngeal cancer TNM staging standard and output the corresponding N-stage prediction result.
[0059] S8.4: Perform probability calibration processing on the T-stage prediction results and the N-stage prediction results, and correct the inference probability deviation through the ordinal regression model to obtain the calibrated core prediction indicators;
[0060] S8.5: Integrate the core predictive indicators to generate the final intelligent predictive results of laryngeal cancer invasion depth and lymph node metastasis.
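The two prediction branches and the ordinal treatment of stages in S8.1 to S8.4 can be sketched with a cumulative-link ordinal model, in which each branch produces one scalar score and ordered thresholds split it into stage probabilities. The linear heads, the threshold values and the coarse stage labels (without sub-stages such as T1a or T4b) are simplifying assumptions, not values from the text.

```python
import numpy as np

T_STAGES = ("T1", "T2", "T3", "T4")
N_STAGES = ("N0", "N1", "N2", "N3")


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ordinal_probs(score, thresholds):
    """Cumulative-link model: P(stage <= k) = sigmoid(theta_k - score).

    thresholds must be strictly increasing so every stage probability is >= 0.
    """
    cumulative = sigmoid(np.asarray(thresholds, dtype=float) - score)
    cumulative = np.concatenate([[0.0], cumulative, [1.0]])
    return np.diff(cumulative)


def predict(features, w_t, w_n, thr_t, thr_n):
    """Two linear heads share the fused features; each yields ordered stages."""
    p_t = ordinal_probs(float(features @ w_t), thr_t)
    p_n = ordinal_probs(float(features @ w_n), thr_n)
    return {
        "T": T_STAGES[int(p_t.argmax())], "T_probs": p_t,
        "N": N_STAGES[int(p_n.argmax())], "N_probs": p_n,
    }


# Usage with hand-picked (hypothetical) weights and thresholds.
features = np.array([1.0, 0.5, -0.5])
result = predict(
    features,
    w_t=np.array([1.0, 1.0, 0.0]),
    w_n=np.array([-1.0, 0.0, 1.0]),
    thr_t=[-1.0, 0.0, 1.0],
    thr_n=[-1.0, 0.0, 1.0],
)
```

Because the probabilities come from differences of a monotone cumulative curve, neighbouring stages receive related probabilities, which is the property that makes ordinal models a natural fit for staging.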
[0061] Furthermore, the system also includes:
[0062] S9.1: Based on the intelligent prediction results, combined with the original images of the surface optical color image and the tomographic grayscale image, the infiltration boundary, invasion layer, and suspected metastatic lymph node location of the target lesion area are superimposed and mapped onto the corresponding original endoscopic image in the form of visual annotations to generate a lesion visual annotation map.
[0063] S9.2: Extract the core quantitative indicators, staging results, and risk warning information from the intelligent prediction results, and combine them with the lesion visualization annotation map to generate a standardized clinical report on intelligent prediction of endoscopic laryngeal cancer;
[0064] S9.3: Based on the feature attribution algorithm, locate and quantify the core contributing features that affect the prediction results among the multimodal features, generate an interpretable description of the prediction results, and simultaneously incorporate it into the clinical report.
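The feature attribution of S9.3 is not tied to a particular algorithm in the text. One simple option is occlusion: replace one feature at a time with a baseline value and record how much the prediction changes. The sketch below assumes a generic `predict_fn` that maps a feature vector to a scalar; the linear model in the usage lines is only a stand-in for the prediction network.

```python
import numpy as np


def occlusion_attribution(predict_fn, features, baseline=None):
    """Score each feature by the prediction drop when it is occluded.

    predict_fn: callable taking a 1-D feature vector and returning a float.
    baseline: replacement values for occluded features (zeros by default).
    """
    x = np.asarray(features, dtype=float)
    base = np.zeros_like(x) if baseline is None else np.asarray(baseline, dtype=float)
    reference = predict_fn(x)
    scores = np.empty_like(x)
    for i in range(x.size):
        occluded = x.copy()
        occluded[i] = base[i]
        scores[i] = reference - predict_fn(occluded)
    return scores


# Usage: for a linear model the attribution of feature i is exactly w[i] * x[i].
w = np.array([0.5, -2.0, 1.0])
x = np.array([2.0, 1.0, 3.0])
scores = occlusion_attribution(lambda v: float(v @ w), x)
top_feature = int(np.abs(scores).argmax())
```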
[0065] Additional features and advantages of this invention will be set forth in the description which follows, or may be learned by practicing the invention.
Attached Figure Description
[0066] The technical solution and beneficial effects of the present invention will become apparent and readily understood from the following description in conjunction with the accompanying drawings, wherein:
[0067] Figure 1 is a flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0068] Figure 2 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0069] Figure 3 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0070] Figure 4 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0071] Figure 5 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0072] Figure 6 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0073] Figure 7 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0074] Figure 8 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0075] Figure 9 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention;
[0076] Figure 10 is another flowchart of the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis according to the present invention.
Detailed Implementation
[0077] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.
[0078] The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis disclosed in this invention will now be described with reference to the accompanying drawings.
[0079] Example 1
[0080] As shown in Figure 1, a multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis includes:
[0081] The image acquisition module is used to receive surface optical color images and tomographic grayscale images of lesions obtained by endoscopy.
[0082] The image preprocessing module is used to preprocess the surface optical color image and the tomographic grayscale image;
[0083] The image annotation module is used for manually annotating the target lesion areas in the preprocessed surface optical color image and tomographic grayscale image;
[0084] The image enhancement module is used to enhance manually annotated surface optical color images and tomographic grayscale images;
[0085] A single-modal feature extraction module is used to extract features of the enhanced surface optical color image and the features of the tomographic grayscale image, respectively.
[0086] A multimodal feature fusion module is used to fuse features from surface optical color images and features from tomographic grayscale images to generate multimodal features;
[0087] The diagnostic module is used to generate prediction results based on multimodal features.
[0088] Specifically, the execution sequence of the system's modules is as follows: First, the image acquisition module synchronously receives and parses the formats of two types of image data. Then, the image preprocessing module performs standardization processing, the image annotation module manually annotates the target lesion area, the image enhancement module enhances lesion features, the single-modal feature extraction module extracts layered features specific to the two modalities, and the multimodal feature fusion module performs collaborative fusion of cross-modal features. Finally, the diagnostic module performs dual-task intelligent reasoning and outputs the prediction result. The image annotation module allows clinicians to annotate lesion areas using polygons, rectangles, and pen-style drawing through a user-friendly interface. The annotation results are synchronously linked to the corresponding surface optical color image and tomographic grayscale image of the same lesion, ensuring spatial consistency in subsequent feature extraction and fusion. This system is compatible with mainstream medical imaging and common image formats such as DICOM, BMP, JPG, and PNG, meeting the application needs of various scenarios including routine clinical examinations, precise preoperative assessment, and real-time intraoperative auxiliary judgment.
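The execution sequence described above can be summarized as a chain of stages applied in a fixed order. The sketch below uses placeholder stages that only record that they ran. The stage names, the `record` dictionary layout and the example file names are all hypothetical and serve only to show the ordering of the seven modules.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Dict], Dict]]

# Module order as described in the text, from acquisition to diagnosis.
MODULE_ORDER = [
    "acquisition",
    "preprocessing",
    "annotation",
    "enhancement",
    "single_modal_feature_extraction",
    "multimodal_feature_fusion",
    "diagnosis",
]


def make_stub(name: str) -> Callable[[Dict], Dict]:
    """Return a placeholder stage that appends its name to the history."""
    def stage(record: Dict) -> Dict:
        updated = dict(record)
        updated["history"] = record.get("history", []) + [name]
        return updated
    return stage


def build_pipeline() -> List[Stage]:
    return [(name, make_stub(name)) for name in MODULE_ORDER]


def run_pipeline(stages: List[Stage], record: Dict) -> Dict:
    """Feed the output of each stage into the next one."""
    for _, fn in stages:
        record = fn(record)
    return record


# Usage: one lesion with a surface image and a two-slice tomographic sequence.
result = run_pipeline(
    build_pipeline(),
    {"surface_image": "wl.png", "tomographic_sequence": ["s0.dcm", "s1.dcm"]},
)
```

In a real system each stub would be replaced by the corresponding module, while the runner and the ordering stay unchanged.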
[0089] Example 2
[0090] As shown in Figure 2 and Figure 3, in the image preprocessing module:
[0091] The surface optical color image is preprocessed as follows:
[0092] S1.1: Perform privacy desensitization processing on the surface optical color image, and mask and remove the patient privacy information and irrelevant device watermarks embedded in the surface optical color image to obtain the desensitized image;
[0093] S1.2: Perform color space correction and white balance calibration on the desensitized image to unify the color reference of the surface optical color image and obtain a color-standardized image;
[0094] S1.3: The color-standardized image is subjected to edge-preserving denoising processing. The bilateral filtering algorithm is used to filter out Gaussian noise and salt-and-pepper noise generated during the image acquisition process, while preserving the lesion edges, mucosal texture and microvascular features in the surface optical color image, to obtain the denoised image.
[0095] S1.4: Perform contrast normalization processing on the denoised image and use the contrast-limited adaptive histogram equalization algorithm to obtain a contrast-enhanced image;
[0096] S1.5: The contrast-enhanced image is normalized in size and resolution, and the image is scaled to a uniform input size using a bicubic interpolation algorithm to obtain a preprocessed surface optical color image.
[0097] The tomographic grayscale image is preprocessed as follows:
[0098] S2.1: Perform privacy desensitization processing on the tomographic grayscale image, and mask and remove the patient privacy information and irrelevant device watermarks embedded in the tomographic grayscale image to obtain the desensitized tomographic image;
[0099] S2.2: The desensitized tomographic images are subjected to speckle noise suppression processing. An anisotropic diffusion filtering algorithm is used to filter out the inherent speckle noise of tomographic imaging, while preserving the laryngeal wall tissue layer boundary and lesion infiltration contour features, to obtain the denoised tomographic images.
[0100] S2.3: Perform window width and window level normalization processing on the denoised tomographic image to map the image grayscale values to the standard grayscale range and obtain a grayscale normalized tomographic image.
[0101] S2.4: Perform rigid inter-slice registration processing on grayscale normalized tomographic images acquired in a continuous sequence to obtain registered tomographic images;
[0102] S2.5: The registered tomographic image is standardized in size and resolution. The image is scaled to a uniform input size that matches the surface optical color image using a bicubic interpolation algorithm to obtain a preprocessed tomographic grayscale image.
[0103] Specifically, the preprocessing of the surface optical color image is implemented as follows:
S1.1 Privacy desensitization: An OCR text detection algorithm locates embedded privacy information such as patient name, gender, age, examination number, equipment number, and examination time, as well as the endoscope manufacturer's logo and irrelevant watermark areas. These areas are masked and removed by solid-color pixel filling, and a desensitization log is generated so that patient data privacy is protected throughout the process.
S1.2 Color space correction and white balance calibration: Images are uniformly converted to the sRGB standard color space. Based on the normal laryngeal mucosa area in the endoscopic image, a gray world algorithm performs white balance calibration, eliminating color deviations between different equipment and examination environments and unifying the color baseline of all images.
S1.3 Edge-preserving denoising: The core parameters of the bilateral filtering algorithm are set as follows: the spatial domain kernel size is 15 and the grayscale kernel size is 25. This filters out Gaussian and salt-and-pepper noise generated during image acquisition while preserving key diagnostic features such as lesion edges, mucosal texture, and microvascular morphology as far as possible.
S1.4 Contrast normalization: The parameters of the contrast-limited adaptive histogram equalization (CLAHE) algorithm are set to a block size of 8×8 and a contrast limit threshold of 2.0. This improves the contrast between the lesion area and the normal mucosal area while avoiding the noise amplification caused by over-enhancement.
S1.5 Size and resolution standardization: A bicubic interpolation algorithm uniformly scales all images to an input size of 512×512 pixels and unifies the image resolution to 300 dpi, matching the input requirements of the subsequent feature extraction network.
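The gray world white balance mentioned in S1.2 can be sketched as follows: each color channel is scaled so that its mean matches the mean over all channels. The optional `reference_mask` argument is an assumption intended to reflect the use of the normal laryngeal mucosa area as the reference region; the function name and signature are illustrative.

```python
import numpy as np


def gray_world_white_balance(rgb, reference_mask=None):
    """Gray world white balance for an HxWx3 uint8 image (illustrative sketch).

    If reference_mask (HxW bool) is given, channel means are estimated only
    from that region; otherwise the whole image is used.
    """
    img = rgb.astype(np.float64)
    pixels = img[reference_mask] if reference_mask is not None else img.reshape(-1, 3)
    channel_means = pixels.mean(axis=0)
    # Scale each channel so that its mean equals the overall gray level.
    gains = channel_means.mean() / np.maximum(channel_means, 1e-8)
    return np.clip(np.rint(img * gains), 0, 255).astype(np.uint8)


# Usage: remove a reddish cast from a synthetic image.
rng = np.random.default_rng(4)
base = rng.uniform(60.0, 120.0, (16, 16, 1))
cast = np.concatenate([base * 1.4, base, base * 0.7], axis=2).astype(np.uint8)
balanced = gray_world_white_balance(cast)
channel_means_after = balanced.reshape(-1, 3).mean(axis=0)
```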
[0104] The preprocessing of the tomographic grayscale image is implemented as follows:
S2.1 Privacy desensitization: The process is the same as for the surface optical color image, fully masking and removing patient privacy information and irrelevant watermarks.
S2.2 Speckle noise suppression: The Perona-Malik anisotropic diffusion filtering algorithm is used, with the number of iterations set to 10 and the conduction coefficient set to 50. While filtering out the inherent speckle noise of tomographic imaging, it preserves the tissue layer boundaries of the laryngeal wall mucosa, submucosa, muscle layer, and cartilage layer, as well as the contour features of lesion infiltration.
S2.3 Window width and window level normalization: Based on the imaging grayscale characteristics of the laryngeal wall tissue, the window width is set to 200 HU and the window level to 50 HU, and the grayscale values of the original image are linearly mapped to the standard 8-bit grayscale range of 0-255. This eliminates the impact of differences in window width and level settings between devices and operators.
S2.4 Rigid registration: For tomographic images acquired in a continuous sequence, the middle slice of the sequence is used as the reference, and a mutual-information-based rigid registration algorithm completes the translation and rotation registration of adjacent slices. This eliminates the inter-slice displacement caused by the patient's swallowing and breathing movements during the examination and ensures the spatial continuity of the lesion in the depth direction.
S2.5 Size and resolution standardization: A bicubic interpolation algorithm uniformly scales the tomographic images to 512×512 pixels, matching the input size of the surface optical color image and laying the foundation for the subsequent spatial registration and fusion of cross-modal features.
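The window width and window level mapping of S2.3 is a clipped linear rescaling. A minimal sketch follows, using the stated width of 200 and level of 50 as defaults; the function name `apply_window` is an assumption.

```python
import numpy as np


def apply_window(img, width=200.0, level=50.0):
    """Map raw intensities to 8-bit grayscale using a display window.

    Values below level - width/2 become 0, values above level + width/2
    become 255, and values in between are scaled linearly.
    """
    lo = level - width / 2.0
    hi = level + width / 2.0
    clipped = np.clip(np.asarray(img, dtype=np.float64), lo, hi)
    scaled = (clipped - lo) / (hi - lo)
    return np.rint(scaled * 255.0).astype(np.uint8)


# Usage: with width 200 and level 50 the window spans -50 to 150.
raw = np.array([-100.0, -50.0, 0.0, 150.0, 400.0])
windowed = apply_window(raw)
```

With these defaults, a raw value of 0 sits a quarter of the way into the window and maps to gray level 64, while everything at or above 150 saturates at 255.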
[0105] Example 3
[0106] As shown in Figure 4 and Figure 5, in the image enhancement module:
[0107] The manually annotated surface optical color image is enhanced as follows:
[0108] S3.1: Based on the target lesion region in the manually annotated surface optical color image, generate the corresponding lesion foreground binary mask;
[0109] S3.2: Based on the binary mask of the lesion foreground, perform foreground feature enhancement processing on the target lesion area. The mucosal texture, microvascular morphology and lesion edge details in the target lesion area are processed by an adaptive texture enhancement operator to obtain the foreground enhanced image.
[0110] S3.3: Perform background feature suppression processing on the background region outside the binary mask of the lesion foreground to obtain a background suppressed image;
[0111] S3.4: Perform pixel value range normalization on the background suppression image to obtain the fully enhanced surface optical color image.
[0112] The manually annotated tomographic grayscale image is enhanced as follows:
[0113] S4.1: Based on the target lesion region in the manually annotated grayscale image of tomography, and combined with the spatial location information of the continuous tomographic sequence, generate a binary mask of the lesion foreground at the corresponding tomographic level.
[0114] S4.2: Based on the binary mask of the lesion foreground, edge structure enhancement processing is performed on the target lesion area. The gradient enhancement operator that preserves the edge is used to process the lesion infiltration boundary and the laryngeal wall tissue layering interface to obtain the foreground enhanced tomographic image.
[0115] S4.3: Perform background grayscale normalization processing on the background area outside the binary mask of the lesion foreground to obtain a background suppression tomographic image;
[0116] S4.4: For background-suppressed tomographic images acquired in a continuous sequence, perform inter-layer feature consistency enhancement processing to obtain an enhanced tomographic grayscale image.
[0117] Specifically, regarding the enhancement processing of surface optical color images: In S3.1, generation of the lesion foreground binary mask, a binary mask image with the same size as the original image is generated based on the manually labeled contour coordinates of the target lesion region. Pixels in the labeled lesion foreground region are assigned a value of 1, and pixels in the non-lesion background region are assigned a value of 0. The mask region coincides exactly with the labeled lesion region, with no offset or omissions. In S3.2, foreground feature enhancement, an adaptive texture enhancement operator based on guided filtering is used. The kernel size of the guided filter is set to 5×5, and the regularization parameter is 0.01. Within the lesion foreground region defined by the mask, key features such as the fine structure of the mucosal texture, the morphology and course of microvessels, and the contour gradient of the lesion edge are enhanced, addressing the clinical pain point of blurred lesion edges and insufficient identification of subtle features in endoscopic images. In S3.3, background feature suppression, a Gaussian blur algorithm with a kernel size of 3×3 is used to smooth the background area outside the mask, reducing the texture complexity of the background area and the interference from irrelevant features and further highlighting the core diagnostic features of the lesion foreground. In S3.4, pixel value range normalization, the RGB three-channel pixel values of the image are uniformly normalized to the [0,1] interval and then standardized per channel, with the mean set to [0.485,0.456,0.406] and the standard deviation set to [0.229,0.224,0.225], matching the input data distribution expected by the subsequent convolutional neural network.
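As an illustration of S3.3 and S3.4, the sketch below blurs the background with a 3×3 kernel, keeps the lesion foreground unchanged, and then applies the stated per-channel mean and standard deviation. The separable [1, 2, 1]/4 kernel and the edge padding are assumptions (the text gives only the kernel size), and the guided-filter enhancement of S3.2 is deliberately left out.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def gaussian_blur_3x3(img):
    """Separable 3x3 blur ([1, 2, 1] / 4 per axis) with edge padding."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    p = np.pad(img.astype(np.float64), ((1, 1), (1, 1), (0, 0)), mode="edge")
    rows = k[0] * p[:-2] + k[1] * p[1:-1] + k[2] * p[2:]          # vertical pass
    return k[0] * rows[:, :-2] + k[1] * rows[:, 1:-1] + k[2] * rows[:, 2:]

def suppress_background_and_normalize(img, mask):
    """img: uint8 (H, W, 3); mask: (H, W) with 1 = lesion, 0 = background."""
    m = mask.astype(np.float64)[..., None]
    mixed = m * img + (1.0 - m) * gaussian_blur_3x3(img)  # S3.3: blur background only
    scaled = mixed / 255.0                                # S3.4: map to [0, 1]
    return (scaled - MEAN) / STD                          # per-channel standardization
```

The output is a float array, so the foreground pixels keep their full precision before being passed to a network.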
[0118] For the enhancement processing of tomographic grayscale images: In S4.1, generation of the lesion foreground binary mask, based on the manually annotated target lesion region on a single layer, and combined with spatial location information such as the slice thickness and inter-slice spacing of the continuous tomographic sequence, the lesion region is spatially mapped onto adjacent layers, generating a lesion foreground binary mask for each tomographic layer and ensuring the spatial continuity of the mask in the depth direction of the lesion. In S4.2, edge structure enhancement, an edge-preserving gradient enhancement operator is used. First, the Canny edge detection algorithm is used to locate the lesion infiltration boundary and the layering interfaces of the laryngeal wall tissue. Then, gradient amplitude enhancement is performed on the edge region, with the enhancement coefficient set to 1.8, to strengthen the boundary differences between the lesion and normal tissue and the interface features between tissue layers, so that the infiltration contour of the lesion is clearly highlighted. In S4.3, background grayscale normalization, the grayscale values of the background areas outside the mask are uniformly normalized to the narrow grayscale range [50, 100] to reduce grayscale fluctuations and irrelevant interference in the background, while avoiding the loss of normal tissue layer information that excessive background suppression would cause. In S4.4, inter-layer feature consistency enhancement, for continuous tomographic sequence images, a three-dimensional Gaussian smoothing kernel with a kernel size of 3×3×3 is used to smooth the features of adjacent layers, eliminating random noise in individual layers while ensuring the continuity of the lesion's infiltration features in the depth direction and laying the foundation for subsequent cross-layer spatial feature extraction.
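A minimal NumPy sketch of S4.3 and S4.4 follows. The target range [50, 100] and the 3×3×3 kernel size are taken from the text; the linear min-max mapping of the background and the separable [1, 2, 1]/4 weights are assumptions, since the specification does not define either.

```python
import numpy as np

def normalize_background(img, mask, lo=50.0, hi=100.0):
    """Linearly rescale pixels outside the lesion mask into [lo, hi]."""
    out = img.astype(np.float64).copy()
    bg = mask == 0
    if bg.any():
        vals = out[bg]
        span = vals.max() - vals.min()
        if span > 0:
            out[bg] = lo + (vals - vals.min()) / span * (hi - lo)
        else:
            out[bg] = (lo + hi) / 2.0     # flat background: use the midpoint
    return out

def smooth_3x3x3(volume):
    """Separable 3x3x3 smoothing of a (D, H, W) stack with edge padding."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    v = volume.astype(np.float64)
    for axis in range(3):
        pad = [(0, 0)] * 3
        pad[axis] = (1, 1)
        p = np.pad(v, pad, mode="edge")
        n = v.shape[axis]
        # Weighted sum of the three shifted copies along this axis.
        v = sum(w * np.take(p, np.arange(i, i + n), axis=axis) for i, w in enumerate(k))
    return v
```

Foreground pixels are left untouched by `normalize_background`, so the enhanced lesion edges from S4.2 are not rescaled.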
[0119] Example 4
[0120] As shown in Figure 6 and Figure 7, in the single-modal feature extraction module:
[0121] The features of the enhanced surface optical color image are extracted as follows:
[0122] S5.1: Using the enhanced surface optical color image as input, and combining it with the lesion foreground binary mask corresponding to the surface optical color image, the target lesion region is determined as the effective feature input;
[0123] S5.2: The target lesion area is encoded hierarchically by a convolutional neural network encoding backbone. First, the low-level visual features of the target lesion area are extracted through the shallow layers of the network. The low-level visual features include the lesion edge contour, mucosal texture, microvascular morphology, and gray-level gradient features.
[0124] S5.3: Through the deep layers of the convolutional neural network encoding backbone, the low-level visual features are further encoded to extract high-dimensional semantic features of the target lesion area. The high-dimensional semantic features include the irregularity of the lesion shape, the extent of mucosal invasion, and the distribution characteristics of abnormal lesion areas.
[0125] S5.4: Perform multi-scale feature aggregation on the extracted low-level visual features and high-dimensional semantic features to generate the initial surface feature set corresponding to the surface optical color image;
[0126] S5.5: Perform feature normalization and dimensionality reduction optimization on the initial surface feature set to generate features for the surface optical color image.
[0127] The features of the enhanced tomographic grayscale image are extracted as follows:
[0128] S6.1: Using the enhanced tomographic grayscale image as input, and combining it with the binary mask of the lesion foreground corresponding to the tomographic grayscale image, the target lesion region is determined as an effective feature input, and the adjacent layer images of the continuous tomographic sequence in which the target lesion region is located are associated.
[0129] S6.2: Using a tomographic image feature coding network, structural features are extracted from the target lesion area on a single layer to obtain the gray-scale distribution features of the lesion, the tissue layer boundary features, the infiltration contour morphology features, and the gray-scale difference features between the lesion and the surrounding normal tissue within the single layer.
[0130] S6.3: Through the sequence coding branch of the tomographic image feature coding network, the features of adjacent layers of continuous tomographic sequences are correlated and encoded to extract the cross-layer spatial features of the target lesion in the longitudinal direction. The cross-layer spatial features include the extent of lesion infiltration depth, the continuity of tissue layer invasion, and the degree of damage to deep structures.
[0131] S6.4: Perform multi-dimensional feature aggregation on the extracted single-layer structural features and cross-layer spatial features to generate the initial deep feature set corresponding to the tomographic grayscale image;
[0132] S6.5: Perform feature normalization and dimensionality reduction optimization on the initial deep feature set to generate features for the tomographic grayscale image.
[0133] Specifically, for feature extraction of surface optical color images: In S5.1, effective feature input determination, the enhanced surface optical color image is used as the main input, combined with the corresponding lesion foreground binary mask. By pixel-wise multiplication with the mask, only the image content of the lesion foreground region is retained, while the pixel values of the background region are set to zero to suppress irrelevant background features, ensuring that the network encodes features only for the target lesion region. In S5.2, low-level visual feature extraction, ResNet50 pre-trained on a head and neck tumor endoscopic image dataset is used as the convolutional neural network encoding backbone. conv1, conv2_x, and conv3_x of the network are taken as the shallow encoding network, with output strides of 4, 8, and 16, respectively. Low-level visual features such as edge contours, mucosal texture, microvascular morphology, and gray-level gradients are extracted from the target lesion region to capture subtle morphological changes of the lesion. In S5.3, high-dimensional semantic feature extraction, conv4_x and conv5_x of the ResNet50 backbone are used as the deep encoding network, with output strides of 32 and 64, respectively. Further high-level semantic encoding is performed on the low-level visual features output by the shallow network to extract high-dimensional semantic features such as lesion morphology irregularity, mucosal invasion range, and distribution of abnormal lesion areas, capturing the overall pathological characteristics of the lesion. In S5.4, multi-scale feature aggregation, a feature pyramid network (FPN) structure is adopted to perform top-down multi-scale fusion of the low-level visual features output by the shallow network and the high-dimensional semantic features output by the deep network. Features at different levels are upsampled, linked through skip connections, and concatenated along the channel dimension to generate an initial surface feature set containing multi-scale information, so that feature extraction adapts to lesions of different sizes. In S5.5, feature normalization and dimensionality reduction, batch normalization (BatchNorm) is first performed on the initial surface feature set to unify the feature distribution to a standard distribution with a mean of 0 and a variance of 1. Then, global average pooling (GAP) is used to convert the two-dimensional feature maps into a one-dimensional feature vector. Finally, the feature vector is reduced to 256 dimensions through a fully connected layer to generate the final features of the surface optical color image, eliminating feature redundancy and improving the efficiency of subsequent fusion and inference.
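The input masking of S5.1 and the pooling and projection of S5.5 reduce to a few array operations, sketched below in NumPy. The ResNet50 and FPN stages are not reproduced, batch normalization is omitted for brevity, and the weight matrix is a random, untrained placeholder; the channel count of 64 in the usage lines is likewise hypothetical.

```python
import numpy as np

def masked_input(img, mask):
    """S5.1: zero out background pixels by multiplying with the binary mask."""
    return img * mask[..., None]

def pool_and_project(feature_map, weight, bias):
    """S5.5 without BatchNorm: global average pooling, then a linear layer.

    feature_map: (C, H, W); weight: (out_dim, C); bias: (out_dim,).
    """
    pooled = feature_map.mean(axis=(1, 2))    # (C,) one value per channel
    return weight @ pooled + bias             # (out_dim,)

# Usage with placeholder shapes: 64 channels reduced to a 256-d vector.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(64, 16, 16))
surface_vec = pool_and_project(feature_map, rng.normal(size=(256, 64)), np.zeros(256))
```

Because pooling removes the spatial axes, the projected vector has the same length regardless of the feature map resolution.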
[0134] For feature extraction from tomographic grayscale images: In S6.1, effective feature input determination, the enhanced single-layer tomographic grayscale image is used as the core input, and foreground features are selected by combining it with the corresponding lesion foreground binary mask. At the same time, the continuous tomographic images of the three adjacent layers above and below this layer are associated to construct a three-dimensional spatial input sequence for the lesion region, ensuring complete extraction of infiltration features in the depth direction. In S6.2, single-layer structural feature extraction, a dedicated 2D convolutional encoding network is built as the single-layer encoding branch of the tomographic image feature encoding network. The network includes four convolution-pooling modules, with a kernel size of 3×3 and a stride of 1. Through layer-by-layer encoding, it extracts the grayscale distribution features, tissue layer boundary features, infiltration contour morphology features, and grayscale difference features between the lesion and surrounding normal tissue within a single layer, capturing the structural abnormalities of the lesion in cross-section. In S6.3, cross-layer spatial feature extraction, a sequence encoding branch combining 3D convolution and ConvLSTM is built in the tomographic image feature encoding network, with a kernel size of 3×3×3. Local spatial correlations between adjacent layers are first extracted through 3D convolution. The features are then sequentially correlated and encoded by the ConvLSTM network to extract cross-layer spatial features such as the extent of lesion invasion in the longitudinal direction, the continuity of tissue layer invasion, and the degree of damage to deep structures, capturing the three-dimensional invasion characteristics of the lesion. In S6.4, multi-dimensional feature aggregation, a multi-branch feature fusion structure is adopted to concatenate and fuse the single-layer structural features and the cross-layer spatial features along the channel dimension. At the same time, a channel attention mechanism adaptively allocates weights to features of different dimensions, strengthening the features that contribute more to the prediction of invasion depth and generating an initial deep feature set containing cross-sectional structure and longitudinal spatial information. In S6.5, feature normalization and dimensionality reduction, layer normalization (LayerNorm) is first performed on the initial deep feature set to unify the feature distribution. Then, three-dimensional global average pooling is used to convert the features into a one-dimensional vector. Finally, the feature vector is reduced to 256 dimensions through a fully connected layer to generate the final features of the tomographic grayscale image, which have the same dimensionality as the surface optical color image features, in preparation for subsequent cross-modal fusion.
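The channel attention mentioned in S6.4 is not specified further. One common form is a squeeze-and-excitation style gate, and the NumPy sketch below assumes that form; the reduction weights `w1` and `w2` and the (C, D, H, W) layout of the fused features are hypothetical.

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style gating over a (C, D, H, W) feature block.

    w1: (C // r, C) reduction weights; w2: (C, C // r) expansion weights.
    Returns the reweighted features and the per-channel gates in (0, 1).
    """
    squeezed = feat.mean(axis=(1, 2, 3))               # (C,) global descriptor
    hidden = np.maximum(w1 @ squeezed, 0.0)            # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))       # sigmoid gate per channel
    return feat * gates[:, None, None, None], gates

# Usage: fuse single-layer and cross-layer features along the channel axis,
# then reweight the 16 fused channels (random placeholder weights).
rng = np.random.default_rng(0)
fused = np.concatenate([rng.normal(size=(8, 3, 4, 4)),
                        rng.normal(size=(8, 3, 4, 4))], axis=0)
weighted, gates = channel_attention(fused, rng.normal(size=(4, 16)), rng.normal(size=(16, 4)))
```

Each gate scales a whole channel, so channels that the trained weights judge more relevant to invasion depth would pass through with less attenuation.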
[0135] Example 5
[0136] As shown in Figure 8, in the multimodal feature fusion module, the features of the surface optical color image and the features of the tomographic grayscale image are fused to generate multimodal features, specifically:
[0137] S7.1: Receive the features of the surface optical color image and the features of the tomographic grayscale image output by the single-modal feature extraction module. Based on the spatial registration relationship of the target lesion area corresponding to the two types of features, complete the association and alignment of the feature spatial coordinates. At the same time, the two types of features are adapted to the dimensions through a fully connected layer to generate surface feature vectors and deep feature vectors with unified dimensions.
[0138] S7.2: Input the surface feature vector and the deep feature vector into the cross-modal feature interaction coding network, model the complementary relationship between the two types of features through the cross attention mechanism, construct the corresponding mapping relationship between the surface boundary information of the lesion in the features of the surface optical color image and the deep infiltration information of the lesion in the features of the tomographic grayscale image, and generate the cross-modal interactive feature set;
[0139] S7.3: Based on the dual task objective of predicting the depth of laryngeal cancer invasion and lymph node metastasis, adaptive dynamic weighting processing is performed on the cross-modal interaction feature set. The weight coefficients of surface-related features and deep-related features are adjusted according to the feature contribution of different prediction tasks to generate a weighted optimized feature set.
[0140] S7.4: Perform global feature aggregation and normalization on the weighted optimized feature set, remove feature redundancy and noise interference, and generate multimodal features.
[0141] Specifically, in the S7.1 feature space alignment and dimensional adaptation, firstly, based on the anatomical spatial correspondence between the same target lesion region marked in the surface optical color image and the tomographic grayscale image, the spatial coordinate association and alignment of the two types of features are completed. This ensures that the lesion boundary points in the surface features and the lesion infiltration boundaries in the deep features correspond one-to-one in the anatomical space, eliminating the fusion deviation caused by spatial misalignment. At the same time, a dimensional adaptation network consisting of two fully connected layers is built to encode and map the input 256-dimensional surface optical features and 256-dimensional tomographic imaging features respectively, and uniformly generate surface feature vectors and deep feature vectors with a dimension of 256, completing the dimensional unification of the two types of features and ensuring the feasibility of subsequent interactive encoding.
[0142] In S7.2 cross-modal feature interaction coding, a cross-modal feature interaction coding network based on the cross-attention mechanism is constructed. The network contains 3 layers of cross-attention Transformer coding blocks, each with 8 attention heads and a dropout ratio of 0.1. The surface feature vector and the deep feature vector are used as the query and key-value inputs to the cross-attention layer, respectively. Through attention weight calculation, the complementary relationship between the surface mucosal lesion features and the deep tissue infiltration features is modeled, accurately constructing the corresponding mapping relationship between the abnormal surface boundary of the lesion and the deep infiltration range. At the same time, the information interaction and complementary enhancement of the two types of features are completed, generating a cross-modal interactive feature set that integrates surface-deep bidirectional information, thus solving the limitation that single-modal features can only reflect information at a single level of the lesion.
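A single cross-attention layer of the kind described in S7.2 can be sketched in NumPy as below, with surface features as queries and deep features as keys and values, split over 8 heads. Treating each modality as a short sequence of 256-dimensional tokens is an assumption (the text describes feature vectors without stating a token layout), the projection matrices are random placeholders, and dropout, residual connections, normalization and the stacking of 3 blocks are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, wq, wk, wv, wo, n_heads=8):
    """q_tokens: (Nq, d) surface tokens; kv_tokens: (Nk, d) deep tokens.

    wq, wk, wv, wo: (d, d) projection matrices. Returns (Nq, d) outputs and
    the attention weights of shape (n_heads, Nq, Nk).
    """
    nq, d = q_tokens.shape
    dh = d // n_heads

    def split(x):                                   # (N, d) -> (heads, N, dh)
        return x.reshape(x.shape[0], n_heads, dh).transpose(1, 0, 2)

    q = split(q_tokens @ wq)
    k = split(kv_tokens @ wk)
    v = split(kv_tokens @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # scaled dot product
    merged = (attn @ v).transpose(1, 0, 2).reshape(nq, d)    # concatenate heads
    return merged @ wo, attn
```

Each row of the attention weights indicates how strongly one surface token draws on each deep token, which is the mapping between surface boundary and deep infiltration information that the paragraph describes.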
[0143] In the S7.3 adaptive dynamic weighting process, a task-driven dynamic weighting network is built based on the dual objectives of predicting laryngeal cancer invasion depth and lymph node metastasis. The training losses of the two tasks are used as the supervision signal. Lightweight fully connected layers learn the contribution of different features to each prediction task and automatically generate adaptive weight coefficients. Specifically, for the invasion depth prediction task, the weight of the deep tomographic features is increased, with a baseline weight of 0.6, while the baseline weight of the surface optical features is set to 0.4. For the lymph node metastasis prediction task, the weights of the two types of features are balanced, with the baseline weights of both the surface optical features and the deep tomographic features set to 0.5. The weights are then dynamically adjusted around these baselines according to the features of the individual lesion, generating a weighted optimized feature set tailored to the two tasks and maximizing the adaptability of the fused features to the two core prediction tasks.
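One way to combine fixed baselines with a learned adjustment is to add the adjustment to the log of the baseline and renormalize with a softmax, so that a zero adjustment reproduces the baseline exactly. The sketch below assumes that scheme; the specification gives the baselines (0.6/0.4 and 0.5/0.5) but not the adjustment rule, and the `adjustment` values would come from the lightweight fully connected layers, which are not modeled here.

```python
import numpy as np

# Baseline weights as [surface, deep], taken from the text.
BASELINES = {
    "invasion_depth": np.array([0.4, 0.6]),
    "lymph_node": np.array([0.5, 0.5]),
}

def task_weights(task, adjustment=(0.0, 0.0)):
    """Softmax over log-baseline plus a (hypothetical) learned adjustment."""
    logits = np.log(BASELINES[task]) + np.asarray(adjustment, dtype=np.float64)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def weight_features(task, surface, deep, adjustment=(0.0, 0.0)):
    """Scale the surface and deep feature vectors by the task weights."""
    w_surface, w_deep = task_weights(task, adjustment)
    return w_surface * surface, w_deep * deep
```

The two weights always sum to 1, so an adjustment that favors the deep features necessarily reduces the share given to the surface features.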
[0144] In the global feature aggregation and normalization process of S7.4, global feature splicing and aggregation are first performed on the weighted optimized feature set. Then, some neurons are randomly deactivated through the Dropout layer with a deactivation ratio of 0.2 to remove feature redundancy and noise interference and avoid model overfitting. Finally, through layer normalization, the distribution of the fused features is unified to the standard range to generate a final multimodal feature with a dimension of 512, which serves as the input to the dual-task prediction network of the subsequent diagnostic module.
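The aggregation in S7.4 can be sketched as concatenation, dropout and layer normalization. The dropout ratio of 0.2 and the output dimension of 512 are from the text; obtaining 512 by concatenating two 256-dimensional vectors is an assumption, as is the use of inverted dropout that is active only during training.

```python
import numpy as np

def aggregate(surface, deep, drop_p=0.2, training=False, rng=None, eps=1e-5):
    """Concatenate, optionally apply dropout, then layer-normalize."""
    fused = np.concatenate([surface, deep])              # 256 + 256 -> 512
    if training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(fused.shape) >= drop_p          # drop ~20% of entries
        fused = fused * keep / (1.0 - drop_p)             # inverted dropout scaling
    return (fused - fused.mean()) / np.sqrt(fused.var() + eps)
```

At inference time (`training=False`) the function is deterministic, which keeps repeated predictions for the same lesion identical.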
[0145] Example 6
[0146] As shown in Figure 9 and Figure 10, in the diagnostic module, a prediction result is generated based on the multimodal features, specifically as follows:
[0147] S8.1: Receive the multimodal features output by the multimodal feature fusion module, input the multimodal features into the dual-task intelligent prediction network, and set the laryngeal cancer invasion depth prediction branch and lymph node metastasis prediction branch respectively in the dual-task intelligent prediction network;
[0148] S8.2: Through the laryngeal cancer invasion depth prediction branch, decode and reason about multimodal features to generate the submucosal invasion depth value, tissue invasion level, and laryngeal cartilage invasion risk value of the target lesion area. At the same time, map it to the AJCC laryngeal cancer TNM staging standard and output the corresponding T stage prediction result.
[0149] S8.3: Through the lymph node metastasis prediction branch, decode and reason about multimodal features to generate the probability of cervical lymph node metastasis risk, location information of suspicious metastatic lymph nodes, and the grading results of metastasis range. At the same time, it is mapped to the AJCC laryngeal cancer TNM staging standard and outputs the corresponding N-stage prediction results.
[0150] S8.4: Perform probability calibration on the T-stage and N-stage prediction results, correcting the bias of the inferred probabilities through an order-preserving (isotonic) regression model to obtain the calibrated core prediction indicators;
[0151] S8.5: Integrate the core prediction indicators to generate the final intelligent prediction results for laryngeal cancer invasion depth and lymph node metastasis.
[0152] The diagnostic module further includes:
[0153] S9.1: Based on the intelligent prediction results, combined with the original images of surface optical color map and tomographic grayscale map, the infiltration boundary, invasion level and suspected metastatic lymph node location of the target lesion area are superimposed and mapped onto the corresponding original endoscopic image in the form of visual annotation, generating a lesion visual annotation map;
[0154] S9.2: Extract the core quantitative indicators, staging results, and risk warning information from the intelligent prediction results, and combine them with the lesion visualization annotation map to generate a standardized clinical report on intelligent prediction of endoscopic laryngeal cancer;
[0155] S9.3: Based on the feature attribution algorithm, the core contributing features that affect the prediction results in multimodal features are located and quantified, and interpretable descriptions of the prediction results are generated and simultaneously included in the clinical report.
[0156] Specifically, for the dual-task prediction process of the diagnostic module: In the construction of the S8.1 dual-task intelligent prediction network, a network architecture with a shared feature encoding layer and dual-branch independent prediction heads is adopted. The shared encoding layer contains two fully connected layers with 256 and 128 neurons respectively, which are used to perform unified deep encoding of the input multimodal features. After the shared encoding layer, two independent parallel branches are set up, namely the laryngeal cancer invasion depth prediction branch and the lymph node metastasis prediction branch. The two branches do not interfere with each other and complete the inference calculation synchronously to achieve integrated prediction of the two core tasks.
[0157] In the S8.2 branch inference for predicting the invasion depth of laryngeal cancer, the branch network contains three fully connected layers, coupled with a ReLU activation function and a Dropout layer, ultimately outputting three core results: First, the regression head outputs the submucosal invasion depth of the target lesion region in millimeters, retained to two decimal places; second, the multi-classification head outputs the tissue invasion level classification results of the lesion, including five categories: mucosal layer, submucosal layer, laryngeal muscle layer, laryngeal cartilage layer, and extralaryngeal tissue invasion, while also outputting the predicted probability of each category; third, the binary classification head outputs the risk value of laryngeal cartilage invasion, ranging from 0 to 1, with higher values indicating a higher risk of cartilage invasion; simultaneously, based on the output invasion depth, tissue invasion level, and cartilage invasion status, it is strictly mapped to the T-staging standard of laryngeal cancer TNM staging, outputting the T-staging prediction results corresponding to Tis, T1, T2, T3, T4a, and T4b, as well as the confidence level of the corresponding stage.
[0158] In the S8.3 lymph node metastasis prediction branch inference, the branch network also contains 3 fully connected layers, combined with ReLU activation function and Dropout layer, and finally outputs three core results: First, the binary classification head outputs the overall risk probability of cervical lymph node metastasis, with a value range of 0-1. When the probability is ≥0.5, it is judged as high risk of lymph node metastasis. Second, the localization and classification head outputs the cervical anatomical partition location information of suspicious metastatic lymph nodes, covering cervical regions I-VI, and outputs the metastasis risk probability of each partition, as well as the predicted number of metastatic lymph nodes. Third, the grading head outputs the grading results of the lymph node metastasis range, corresponding to the N0, N1, N2, and N3 grading standards of N staging; at the same time, it strictly maps to the N staging standard of AJCC 8th edition TNM staging of laryngeal cancer, and outputs the corresponding N staging prediction results and confidence levels.
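The structure described in S8.1 to S8.3 (a shared encoder of 256 and 128 units, followed by two independent branches with several output heads) can be sketched as a NumPy forward pass. All parameters are random and untrained, the branch hidden sizes of 64 and 32 are hypothetical, dropout is omitted, and the mapping of outputs to AJCC T and N stages is not attempted because the text does not give the mapping rules.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def init_linear(rng, n_in, n_out):
    # Random, untrained parameters used only to exercise the shapes.
    return rng.normal(0.0, 0.05, (n_out, n_in)), np.zeros(n_out)

def linear(x, params):
    w, b = params
    return w @ x + b

def build_network(rng):
    return {
        "shared": [init_linear(rng, 512, 256), init_linear(rng, 256, 128)],
        # Two hidden layers per branch; each output head is the third layer.
        "depth_trunk": [init_linear(rng, 128, 64), init_linear(rng, 64, 32)],
        "node_trunk": [init_linear(rng, 128, 64), init_linear(rng, 64, 32)],
        "depth_mm": init_linear(rng, 32, 1),
        "invasion_layer": init_linear(rng, 32, 5),
        "cartilage_risk": init_linear(rng, 32, 1),
        "node_risk": init_linear(rng, 32, 1),
        "level_risk": init_linear(rng, 32, 6),
        "n_stage": init_linear(rng, 32, 4),
    }

def run_trunk(x, layers):
    for params in layers:
        x = relu(linear(x, params))
    return x

def predict(net, features):
    shared = run_trunk(features, net["shared"])
    d = run_trunk(shared, net["depth_trunk"])
    n = run_trunk(shared, net["node_trunk"])
    node_risk = float(sigmoid(linear(n, net["node_risk"]))[0])
    return {
        # ReLU keeps the regressed depth non-negative (an assumption).
        "depth_mm": round(float(relu(linear(d, net["depth_mm"]))[0]), 2),
        "invasion_layer_probs": softmax(linear(d, net["invasion_layer"])),  # 5 classes
        "cartilage_risk": float(sigmoid(linear(d, net["cartilage_risk"]))[0]),
        "node_risk": node_risk,
        "node_high_risk": node_risk >= 0.5,             # threshold from the text
        "level_risk": sigmoid(linear(n, net["level_risk"])),   # cervical levels I-VI
        "n_stage_probs": softmax(linear(n, net["n_stage"])),   # N0, N1, N2, N3
    }
```

Both branches read the same 128-dimensional shared encoding but have separate parameters, so a single forward pass yields the invasion depth outputs and the lymph node outputs together.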
[0159] In the S8.4 probability calibration process, for the T-stage and N-stage prediction results and corresponding predicted probabilities output by the two branches, an order-preserving (isotonic) regression model pre-trained on a multi-center clinical validation dataset is used to calibrate the raw probability values inferred by the model. This corrects the overconfidence or underestimation of probability that neural network inference can introduce, bringing the output probability values closer to the actual clinical risk and improving the clinical reliability of the prediction results.
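Assuming the order-preserving regression in S8.4 refers to isotonic regression, the calibration map can be fitted with the pool-adjacent-violators algorithm, sketched here in NumPy. The function names and the linear interpolation between fitted points are illustrative choices; the validation scores and outcomes in the usage lines are made-up toy values, not data from the specification.

```python
import numpy as np

def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit of a non-decreasing map from score to outcome."""
    order = np.argsort(scores, kind="stable")
    x = np.asarray(scores, dtype=np.float64)[order]
    y = np.asarray(outcomes, dtype=np.float64)[order]
    blocks = []                                   # each block: [sum of y, count]
    for v in y:
        blocks.append([v, 1.0])
        # Merge while the previous block's mean exceeds the current one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = np.concatenate([np.full(int(c), s / c) for s, c in blocks])
    return x, fitted

def calibrate(p, x, fitted):
    """Map a raw probability to its calibrated value by linear interpolation."""
    return np.interp(p, x, fitted)

# Toy usage: raw scores with observed binary outcomes.
x, fitted = fit_isotonic([0.1, 0.4, 0.6, 0.9], [0, 1, 0, 1])
calibrated = calibrate(0.5, x, fitted)
```

The fitted values are non-decreasing in the raw score, so calibration never reverses the ranking of two predictions.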
[0160] In the final prediction result integration of S8.5, core indicators such as calibrated invasion depth, tissue invasion level, cartilage invasion risk, T staging result, lymph node metastasis risk, suspected metastatic lymph node location, and N staging result are structurally integrated to generate standardized intelligent prediction results of laryngeal cancer invasion depth and lymph node metastasis, providing quantitative and referable core basis for clinical diagnosis and treatment decisions.
[0161] Regarding the visualization and clinical report generation process for the predicted results: In the generation of the S9.1 lesion visualization annotation map, based on the intelligent prediction results, the boundary contour and mucosal invasion range of the lesion are superimposed on the original surface optical color image, and the infiltration boundary, invaded tissue layer, and infiltration depth measurement line of the lesion are superimposed on the original tomographic grayscale image. At the same time, the location and partition of the suspected metastatic lymph nodes are marked on the cervical lymph node partition diagram. All annotations use a combination of semi-transparent masks, colored outlines, and text annotations to avoid obscuring the lesion details of the original image, making it convenient for clinicians to intuitively view the lesion condition and prediction results.
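The semi-transparent mask overlay mentioned in S9.1 amounts to alpha blending inside the annotated region. The sketch below assumes a fixed highlight color and an opacity of 0.4; neither value is given in the text, and outlines and text labels are not drawn.

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.4):
    """Blend a highlight color into the masked region of a uint8 RGB image."""
    img = image.astype(np.float64)
    m = mask.astype(bool)
    out = img.copy()
    # Inside the mask: weighted mix of original pixel and highlight color.
    out[m] = (1.0 - alpha) * img[m] + alpha * np.array(color, dtype=np.float64)
    return np.round(out).astype(np.uint8)
```

Because the original pixel still contributes 60% of each blended value at this opacity, the underlying lesion texture remains visible beneath the highlight.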
[0162] In the S9.2 standardized clinical report generation process, a pre-set clinical report template for intelligent prediction of endoscopic laryngeal cancer, conforming to clinical diagnosis and treatment guidelines, is provided. The template includes eight modules: basic patient information, examination equipment information, original endoscopic images, lesion visualization annotation, core prediction quantitative indicators, TNM staging prediction results, risk warning information, and interpretability explanation. The core indicators and visualization annotations in the prediction results are automatically extracted and filled into the corresponding positions in the report template to automatically generate a standardized clinical report. The report supports exporting to PDF format, direct printing, and integration with the hospital's electronic medical record system (EMR) and image archiving and communication system (PACS).
[0163] In the generation of interpretability explanations for the S9.3 prediction results, the Gradient Weighted Class Activation Mapping (Grad-CAM) feature attribution algorithm is used. For the two prediction tasks of invasion depth and lymph node metastasis, the core contributing features that affect the final prediction results in the multimodal features are located and quantified, generating feature heatmaps that clearly mark the key areas in the lesion region that contribute the most to the prediction results. At the same time, a textual interpretability explanation is generated, which clearly explains the core basis of the prediction results, including key lesion morphological features, invasion features, abnormal areas, etc., to solve the black box problem of deep learning models, improve clinicians' trust in the prediction results, and finally incorporate the heatmap and textual explanations into the standardized clinical report.
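The core Grad-CAM computation in S9.3 is short once the activations of a convolutional layer and the gradients of the target score with respect to them are available. The NumPy sketch below takes both as plain arrays; in practice they would come from an automatic differentiation framework, which is not shown, and the choice of layer is left open by the text.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one layer's activations and gradients.

    activations, gradients: (K, H, W) arrays for K feature channels.
    Returns an (H, W) map scaled to [0, 1] (all zeros if nothing is positive).
    """
    weights = gradients.mean(axis=(1, 2))                 # (K,) channel importance
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)  # ReLU of weighted sum
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting map is usually upsampled to the input resolution before being drawn over the image as a heatmap.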
[0164] In summary, the multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis disclosed in this invention can achieve the synergistic fusion of surface optical and deep tomographic imaging features of laryngeal cancer lesions, and simultaneously complete the integrated intelligent prediction of invasion depth and lymph node metastasis, significantly improving the accuracy and efficiency of preoperative assessment and supporting precise clinical diagnosis and treatment.
[0165] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.
Claims
1. A multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis, characterized in that it comprises: an image acquisition module, used to receive surface optical color images and tomographic grayscale images of lesions obtained by endoscopy; an image preprocessing module, used to preprocess the surface optical color image and the tomographic grayscale image; an image annotation module, used to manually annotate the target lesion areas in the preprocessed surface optical color image and tomographic grayscale image; an image enhancement module, used to enhance the manually annotated surface optical color image and tomographic grayscale image; a single-modal feature extraction module, used to extract features of the enhanced surface optical color image and features of the enhanced tomographic grayscale image, respectively; a multimodal feature fusion module, used to fuse the features of the surface optical color image and the features of the tomographic grayscale image to generate multimodal features; and a diagnostic module, used to generate a prediction result based on the multimodal features.
2. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 1, characterized in that, in the image preprocessing module, the surface optical color image is preprocessed, specifically as follows:
S1.1: Perform privacy desensitization processing on the surface optical color image, and mask and remove the patient privacy information and device-independent watermark embedded in the surface optical color image to obtain a desensitized image;
S1.2: Perform color space correction and white balance calibration on the desensitized image to unify the color reference of the surface optical color image and obtain a color-standardized image;
S1.3: Perform edge-preserving noise reduction processing on the color-standardized image, using a bilateral filtering algorithm to filter out Gaussian noise and salt-and-pepper noise generated during image acquisition while retaining the lesion edges, mucosal texture and microvascular features in the surface optical color image, to obtain a denoised image;
S1.4: Perform contrast normalization processing on the denoised image, using the contrast-limited adaptive histogram equalization (CLAHE) algorithm, to obtain a contrast-enhanced image;
S1.5: Perform size and resolution normalization processing on the contrast-enhanced image, scaling the image to a uniform input size using a bicubic interpolation algorithm, to obtain the preprocessed surface optical color image.
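By way of non-limiting illustration, the white balance calibration of step S1.2 could be sketched as follows. The claim does not name a white balance algorithm; the gray-world assumption used here, the function name gray_world_white_balance, and the nested-list image layout are assumptions made for this example only.

```python
def gray_world_white_balance(image):
    """Scale each RGB channel so that every channel mean equals the overall mean.

    `image` is a list of rows; each row is a list of (r, g, b) tuples in 0..255.
    Gray-world is an assumed stand-in for the unspecified calibration method.
    """
    pixels = [px for row in image for px in row]
    n = len(pixels)
    means = [sum(px[c] for px in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    # Guard against an all-zero channel to avoid division by zero.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [
        [
            tuple(min(255, max(0, round(px[c] * gains[c]))) for c in range(3))
            for px in row
        ]
        for row in image
    ]


# A frame with a red cast: the red mean is twice the green and blue means.
balanced = gray_world_white_balance([[(200, 100, 100), (100, 50, 50)]])
print(balanced)
```

In this toy input the three channels differ only by a constant gain, so the correction returns neutral gray pixels. Real endoscopic frames would typically be calibrated against a reference target rather than image statistics alone.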
3. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 2, characterized in that, in the image preprocessing module, the tomographic grayscale image is preprocessed, specifically as follows:
S2.1: Perform privacy desensitization processing on the tomographic grayscale image, and mask and remove the patient privacy information and device-independent watermark embedded in the tomographic grayscale image to obtain a desensitized tomographic image;
S2.2: Perform speckle noise suppression processing on the desensitized tomographic image, using an anisotropic diffusion filtering algorithm to filter out the inherent speckle noise of tomographic imaging while preserving the laryngeal wall tissue layer boundaries and lesion infiltration contour features, to obtain a denoised tomographic image;
S2.3: Perform window width and window level normalization processing on the denoised tomographic image to map the image grayscale values to a standard grayscale range, thereby obtaining a grayscale-normalized tomographic image;
S2.4: For the grayscale-normalized tomographic images acquired in a continuous sequence, perform inter-layer rigid registration processing to obtain registered tomographic images;
S2.5: Perform size and resolution normalization processing on the registered tomographic image, using a bicubic interpolation algorithm to scale the image to a uniform input size that matches the surface optical color image, thereby obtaining the preprocessed tomographic grayscale image.
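By way of non-limiting illustration, steps S2.2 and S2.3 could be sketched as follows. The claim names anisotropic diffusion but not a specific scheme; the Perona-Malik formulation with an exponential conduction function is an assumption, as are the function names and the parameters n_iter, kappa, step, level and width.

```python
import math


def perona_malik(img, n_iter=10, kappa=20.0, step=0.2):
    """Perona-Malik anisotropic diffusion on a 2D list of gray values.

    Small differences (noise) are smoothed; differences much larger than
    `kappa` (edges) get a near-zero conduction coefficient and are preserved.
    `step` must stay <= 0.25 for the 4-neighbour scheme to remain stable.
    """
    h, w = len(img), len(img[0])
    cur = [[float(v) for v in row] for row in img]
    for _ in range(n_iter):
        nxt = [row[:] for row in cur]
        for y in range(h):
            for x in range(w):
                total = 0.0
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        grad = cur[ny][nx] - cur[y][x]
                        total += math.exp(-((grad / kappa) ** 2)) * grad
                nxt[y][x] = cur[y][x] + step * total
        cur = nxt
    return cur


def window_normalize(img, level, width):
    """Map gray values inside the window [level - width/2, level + width/2]
    linearly to [0, 1], clipping values outside the window."""
    lo = level - width / 2.0
    hi = level + width / 2.0
    return [[min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in row] for row in img]


# A single noisy pixel in a flat region is pulled toward its neighbours.
smoothed = perona_malik([[100, 100, 100], [100, 105, 100], [100, 100, 100]])
print(smoothed[1][1])
```

The edge-stopping behaviour is what distinguishes this filter from plain Gaussian smoothing: a step of 200 gray levels with kappa = 20 is left essentially untouched, which corresponds to the requirement that tissue layer boundaries be preserved.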
4. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 3, characterized in that, in the image enhancement module, the enhancement of the manually annotated surface optical color image specifically includes:
S3.1: Based on the manually annotated target lesion region in the surface optical color image, generate the corresponding lesion foreground binary mask;
S3.2: Based on the lesion foreground binary mask, perform foreground feature enhancement processing on the target lesion region, processing the mucosal texture, microvascular morphology and lesion edge details in the target lesion region with an adaptive texture enhancement operator, to obtain a foreground-enhanced image;
S3.3: Perform background feature suppression processing on the background region outside the lesion foreground binary mask to obtain a background-suppressed image;
S3.4: Perform pixel value range normalization processing on the background-suppressed image to obtain the enhanced surface optical color image.
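By way of non-limiting illustration, the mask-guided flow of steps S3.1 to S3.4 could be sketched on a single-channel image as follows. The rectangular annotation, the contrast stretch around the foreground mean (a simple stand-in for the unspecified adaptive texture enhancement operator), and the parameters fg_gain and bg_scale are all assumptions made for this example.

```python
def rect_mask(height, width, top, left, bottom, right):
    """S3.1 stand-in: binary mask from a hypothetical rectangular annotation."""
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]


def enhance_with_mask(img, mask, fg_gain=1.5, bg_scale=0.3):
    """Stretch contrast inside the mask, attenuate outside, then rescale to [0, 1]."""
    fg = [v for row, mrow in zip(img, mask) for v, m in zip(row, mrow) if m]
    fg_mean = sum(fg) / len(fg) if fg else 0.0
    out = [
        [
            # S3.2: amplify deviations from the foreground mean inside the lesion.
            fg_mean + fg_gain * (v - fg_mean) if m
            # S3.3: attenuate everything outside the lesion.
            else bg_scale * v
            for v, m in zip(row, mrow)
        ]
        for row, mrow in zip(img, mask)
    ]
    # S3.4: min-max normalization of the pixel value range.
    flat = [v for row in out for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in out]


img = [[100, 100, 100, 100], [100, 120, 160, 100], [100, 100, 100, 100]]
mask = rect_mask(3, 4, top=1, left=1, bottom=2, right=3)
print(enhance_with_mask(img, mask))
```

After normalization the background collapses to the bottom of the value range while the two lesion pixels keep a widened separation, which is the intended effect of combining foreground enhancement with background suppression.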
5. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 4, characterized in that, in the image enhancement module, the enhancement of the manually annotated tomographic grayscale image specifically includes:
S4.1: Based on the manually annotated target lesion region in the tomographic grayscale image, combined with the spatial location information of the continuous tomographic sequence, generate a lesion foreground binary mask for the corresponding tomographic layer;
S4.2: Based on the lesion foreground binary mask, perform edge structure enhancement processing on the target lesion region, processing the lesion infiltration boundary and the laryngeal wall tissue layer interfaces with an edge-preserving gradient enhancement operator, to obtain a foreground-enhanced tomographic image;
S4.3: Perform background grayscale normalization processing on the background region outside the lesion foreground binary mask to obtain a background-suppressed tomographic image;
S4.4: For the background-suppressed tomographic images acquired in a continuous sequence, perform inter-layer feature consistency enhancement processing to obtain the enhanced tomographic grayscale image.
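By way of non-limiting illustration, one simple reading of the inter-layer consistency processing of step S4.4 is a weighted average of each slice with its two neighbours. The claim does not specify the operation; this smoothing scheme, the function name smooth_across_slices and the center_weight parameter are assumptions made for this example.

```python
def smooth_across_slices(volume, center_weight=0.5):
    """Blend each slice with its neighbours along the sequence axis.

    `volume` is a list of slices; each slice is a 2D list of gray values.
    The first and last slices reuse themselves as the missing neighbour.
    """
    side = (1.0 - center_weight) / 2.0
    last = len(volume) - 1
    out = []
    for k, cur in enumerate(volume):
        prev = volume[max(k - 1, 0)]
        nxt = volume[min(k + 1, last)]
        out.append(
            [
                [side * p + center_weight * c + side * q for p, c, q in zip(pr, cr, nr)]
                for pr, cr, nr in zip(prev, cur, nxt)
            ]
        )
    return out


# Three one-pixel slices: the bright middle slice is pulled toward its neighbours.
print(smooth_across_slices([[[10.0]], [[20.0]], [[10.0]]]))
```

This presumes the slices are already rigidly registered, as in step S2.4; without registration, averaging across slices would blur the layer boundaries that step S4.2 is meant to sharpen.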
6. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 5, characterized in that, in the single-modal feature extraction module, the features of the enhanced surface optical color image are extracted, specifically as follows:
S5.1: Using the enhanced surface optical color image as input, and combining it with the lesion foreground binary mask corresponding to the surface optical color image, determine the target lesion region as the effective feature input;
S5.2: Hierarchically encode the target lesion region with a convolutional neural network encoding backbone; first, extract the low-level visual features of the target lesion region with the shallow layers of the backbone, the low-level visual features including lesion edge contour, mucosal texture, microvascular morphology, and gray-level gradient features;
S5.3: Use the deep layers of the convolutional neural network encoding backbone to further encode the low-level visual features and extract the high-dimensional semantic features of the target lesion region, the high-dimensional semantic features including the lesion morphology irregularity, the range of mucosal invasion, and the distribution features of abnormal lesion regions;
S5.4: Perform multi-scale feature aggregation on the extracted low-level visual features and the high-dimensional semantic features to generate an initial surface feature set corresponding to the surface optical color image;
S5.5: Perform feature normalization and dimensionality reduction optimization on the initial surface feature set to generate the features of the surface optical color image.
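By way of non-limiting illustration, the aggregation and normalization of steps S5.4 and S5.5 could be sketched as follows, taking the shallow and deep feature maps as given. Global average pooling followed by concatenation and L2 normalization is one common aggregation scheme and is an assumption here; the dimensionality reduction of S5.5 is omitted, and the function names are hypothetical.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each channel (a 2D list) to its mean, giving one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def aggregate_features(shallow_maps, deep_maps):
    """Pool maps of different spatial sizes, concatenate, and L2-normalize."""
    vec = global_average_pool(shallow_maps) + global_average_pool(deep_maps)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# One 2x2 shallow channel and one 1x1 deep channel, as a toy stand-in
# for the outputs of the shallow and deep stages of the backbone.
shallow = [[[1.0, 3.0], [5.0, 7.0]]]
deep = [[[3.0]]]
print(aggregate_features(shallow, deep))
```

Pooling before concatenation is what lets feature maps with different spatial resolutions be combined into a single vector of fixed length.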
7. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 6, characterized in that, in the single-modal feature extraction module, the features of the enhanced tomographic grayscale image are extracted, specifically as follows:
S6.1: Using the enhanced tomographic grayscale image as input, and combining it with the lesion foreground binary mask corresponding to the tomographic grayscale image, determine the target lesion region as the effective feature input, and simultaneously associate it with the adjacent layer images of the continuous tomographic sequence in which the target lesion region is located;
S6.2: Using a tomographic image feature encoding network, extract structural features from the target lesion region on a single layer to obtain, within the single layer, the gray-scale distribution features of the lesion, the tissue layer boundary features, the infiltration contour morphology features, and the gray-scale difference features between the lesion and the surrounding normal tissue;
S6.3: Through the sequence encoding branch of the tomographic image feature encoding network, correlate and encode the adjacent layer features of the continuous tomographic sequence to extract the cross-layer spatial features of the target lesion in the longitudinal direction, the cross-layer spatial features including the lesion infiltration depth extension range, tissue layer invasion continuity, and deep structure destruction degree features;
S6.4: Perform multi-dimensional feature aggregation on the extracted single-layer structural features and cross-layer spatial features to generate an initial deep feature set corresponding to the tomographic grayscale image;
S6.5: Perform feature normalization and dimensionality reduction optimization on the initial deep feature set to generate the features of the tomographic grayscale image.
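By way of non-limiting illustration, the notion of a cross-layer spatial feature in step S6.3 can be made concrete with a hand-computed proxy: the longest run of consecutive slices whose mask contains lesion pixels, multiplied by the slice spacing. This is not the learned sequence encoding of the claim; the proxy, the function names and the slice_spacing_mm parameter are assumptions made for this example.

```python
def longest_lesion_run(slice_masks):
    """Length of the longest run of consecutive slices that contain lesion pixels."""
    best = run = 0
    for mask in slice_masks:
        has_lesion = any(any(row) for row in mask)
        run = run + 1 if has_lesion else 0
        best = max(best, run)
    return best


def cross_slice_extent_mm(slice_masks, slice_spacing_mm):
    """Proxy for the lesion's extension range along the slice sequence."""
    return longest_lesion_run(slice_masks) * slice_spacing_mm


# Five 1x2 masks: the lesion is present in slices 1-2 and again in slice 4.
masks = [[[0, 0]], [[0, 1]], [[1, 1]], [[0, 0]], [[1, 0]]]
print(cross_slice_extent_mm(masks, slice_spacing_mm=0.5))
```

A gap in the run, as at slice 3 in the toy input, is the kind of discontinuity that the invasion continuity feature named in the claim would be expected to capture.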
8. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 7, characterized in that, in the multimodal feature fusion module, the features of the surface optical color image and the features of the tomographic grayscale image are fused to generate the multimodal features, specifically as follows:
S7.1: Receive the features of the surface optical color image and the features of the tomographic grayscale image output by the single-modal feature extraction module, and associate and align the feature spatial coordinates based on the spatial registration relationship of the target lesion region corresponding to the two types of features; at the same time, adapt the dimensions of the two types of features through a fully connected layer to generate a surface feature vector and a deep feature vector with unified dimensions;
S7.2: Input the surface feature vector and the deep feature vector into a cross-modal feature interaction encoding network, model the complementary relationship between the two types of features through a cross-attention mechanism, construct the mapping between the surface boundary information of the lesion in the features of the surface optical color image and the deep infiltration information of the lesion in the features of the tomographic grayscale image, and generate a cross-modal interactive feature set;
S7.3: Based on the dual-task objective of predicting laryngeal cancer invasion depth and lymph node metastasis, perform adaptive dynamic weighting on the cross-modal interactive feature set, adjusting the weight coefficients of surface-related features and deep-related features according to their contribution to each prediction task, to generate a weighted optimized feature set;
S7.4: Perform global feature aggregation and normalization on the weighted optimized feature set to remove feature redundancy and noise interference, and generate the multimodal features.
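By way of non-limiting illustration, the cross-attention interaction of step S7.2 could be sketched as follows. Scaled dot-product attention is used in both directions (surface attending to deep, and deep attending to surface); the learned query, key and value projections of a trained network are replaced by the identity, and the task-specific weighting of S7.3 is omitted. The function names and the mean-pooling step are assumptions made for this example.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes `values` by key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return out


def mean_pool(tokens):
    return [sum(t[c] for t in tokens) / len(tokens) for c in range(len(tokens[0]))]


def fuse(surface_tokens, deep_tokens):
    """Bidirectional cross-attention followed by pooling and concatenation.

    Both token sets must share one feature dimension (see step S7.1).
    """
    surface_ctx = cross_attention(surface_tokens, deep_tokens, deep_tokens)
    deep_ctx = cross_attention(deep_tokens, surface_tokens, surface_tokens)
    return mean_pool(surface_ctx) + mean_pool(deep_ctx)


print(fuse([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]))
```

Running attention in both directions lets the surface features query the deep infiltration features and vice versa, which is one way to realize the mapping between surface boundary information and deep infiltration information described in the claim.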
9. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 8, characterized in that, in the diagnostic module, the prediction result is generated based on the multimodal features, specifically as follows:
S8.1: Receive the multimodal features output by the multimodal feature fusion module and input them into a dual-task intelligent prediction network, in which a laryngeal cancer invasion depth prediction branch and a lymph node metastasis prediction branch are provided;
S8.2: Through the laryngeal cancer invasion depth prediction branch, decode the multimodal features and perform inference to generate the submucosal invasion depth value, tissue invasion level, and laryngeal cartilage invasion risk value of the target lesion region, and simultaneously map them to the AJCC laryngeal cancer TNM staging standard to output the corresponding T-stage prediction result;
S8.3: Through the lymph node metastasis prediction branch, decode the multimodal features and perform inference to generate the risk probability of cervical lymph node metastasis, the location information of suspicious metastatic lymph nodes, and the grading result of the metastasis range, and simultaneously map them to the AJCC laryngeal cancer TNM staging standard to output the corresponding N-stage prediction result;
S8.4: Perform probability calibration processing on the T-stage prediction result and the N-stage prediction result, correcting the inference probability deviation through an ordinal regression model, to obtain calibrated core prediction indicators;
S8.5: Integrate the calibrated core prediction indicators to generate the final intelligent prediction results for laryngeal cancer invasion depth and lymph node metastasis.
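By way of non-limiting illustration, the ordinal treatment of stage predictions in step S8.4 could be sketched with a cumulative-link model, in which a single score is compared against increasing thresholds to yield one probability per ordered stage. The claim does not specify the form of the ordinal regression model; the cumulative-logit form, the threshold values and the simplified stage labels below are assumptions made for this example and do not reproduce any AJCC criterion.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probabilities(score, thresholds):
    """Cumulative-logit model: P(stage > k) = sigmoid(score - thresholds[k]).

    `thresholds` must be increasing; returns len(thresholds) + 1 probabilities,
    one per ordered stage, which sum to 1.
    """
    exceed = [sigmoid(score - t) for t in thresholds]
    bounds = [1.0] + exceed + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


def predict_stage(score, thresholds, labels):
    """Return the most probable stage label together with all stage probabilities."""
    probs = ordinal_probabilities(score, thresholds)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical thresholds and simplified labels, for illustration only.
THRESHOLDS = [-1.0, 0.0, 1.0]
LABELS = ["T1", "T2", "T3", "T4"]
stage, probs = predict_stage(2.5, THRESHOLDS, LABELS)
print(stage, [round(p, 3) for p in probs])
```

Because all stages share one score and differ only in their thresholds, the model cannot assign high probability to two non-adjacent stages at once, which is the property that makes an ordinal formulation attractive for staging tasks.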
10. The multimodal endoscopic intelligent prediction system for laryngeal cancer invasion depth and lymph node metastasis as described in claim 9, characterized in that it further comprises:
S9.1: Based on the intelligent prediction results, combined with the original surface optical color image and the original tomographic grayscale image, superimpose the infiltration boundary, invasion layer, and suspected metastatic lymph node locations of the target lesion region onto the corresponding original endoscopic image in the form of visual annotations to generate a lesion visual annotation map;
S9.2: Extract the core quantitative indicators, staging results, and risk warning information from the intelligent prediction results, and combine them with the lesion visual annotation map to generate a standardized clinical report on intelligent prediction of endoscopic laryngeal cancer;
S9.3: Based on a feature attribution algorithm, locate and quantify the core contributing features among the multimodal features that affect the prediction results, generate an interpretable description of the prediction results, and incorporate it into the clinical report.
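By way of non-limiting illustration, the feature attribution of step S9.3 could be sketched with occlusion: each feature is replaced in turn by a baseline value and the resulting change in the prediction is recorded as that feature's contribution. The claim does not name an attribution algorithm; occlusion, the function name occlusion_attribution, the baseline parameter and the toy linear predictor are assumptions made for this example.

```python
def occlusion_attribution(predict, features, baseline=0.0):
    """Attribution of each feature = prediction drop when it is set to `baseline`."""
    reference = predict(features)
    scores = []
    for i in range(len(features)):
        occluded = list(features)
        occluded[i] = baseline
        scores.append(reference - predict(occluded))
    return scores


def toy_predict(f):
    """Hypothetical stand-in for the prediction network: a fixed linear score."""
    return 2.0 * f[0] + 0.0 * f[1] - 1.0 * f[2]


# Feature 0 raises the score, feature 1 is ignored, feature 2 lowers it.
print(occlusion_attribution(toy_predict, [1.0, 5.0, 3.0]))
```

Positive scores mark features that push the prediction up and negative scores mark features that push it down; ranking features by absolute score gives a list of core contributing features that could feed the interpretable description in the clinical report.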