A multimodal ultrasonic thyroid lesion intelligent identification method and precise diagnosis and treatment system

By performing three-dimensional spatial registration and temporal alignment on multimodal ultrasound images of the thyroid region, and then extracting and fusing depth features, the method addresses the problems of spatiotemporal alignment and feature fusion in multimodal image analysis, enabling accurate lesion identification and multi-task diagnosis.

CN122289134A · Pending · Publication Date: 2026-06-26 · CHONGQING JIULONGPO DISTRICT HOSPITAL OF TRADITIONAL CHINESE MEDICINE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHONGQING JIULONGPO DISTRICT HOSPITAL OF TRADITIONAL CHINESE MEDICINE
Filing Date
2026-02-27
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies in thyroid ultrasound image analysis have failed to achieve high-precision spatiotemporal alignment and cross-modal depth feature fusion of multimodal images, resulting in insufficient ability to distinguish multiple attributes of lesions and an inability to simultaneously meet multiple clinical decision-making needs such as lesion localization, benign or malignant assessment, and risk grading.

Method used

By acquiring grayscale ultrasound image sequences, elastography maps, and contrast time-intensity curve parameter maps of the thyroid region, three-dimensional spatial registration and temporal alignment are performed. Depth features of multimodal images are extracted and cross-modal fusion is carried out. A multi-task learning head is used to perform parallel inference of lesion segmentation, benign and malignant probability values, and risk classification.

Benefits of technology

It achieves high-precision spatiotemporal alignment and cross-modal depth feature fusion of multimodal ultrasound images, significantly improving the boundary accuracy of lesion segmentation, the ability to capture multidimensional correlation patterns for benign and malignant differentiation, and the overall reliability of multi-task output results, thereby improving the accuracy and efficiency of intelligent diagnosis of thyroid lesions.



Abstract

This application provides a multimodal ultrasound intelligent identification method and precision diagnosis system for thyroid lesions. The method determines image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment. Based on the grayscale ultrasound image sequence in the image data blocks, a first depth feature map of the target user's thyroid region is determined. A second depth feature map of the target user's thyroid region is determined based on the elastography map in the image data blocks. The hemodynamic spatial distribution characteristics of the parametric map in the image data blocks are captured to determine a third depth feature map of the target user's thyroid region. Cross-modal fusion is performed on the first, second, and third depth feature maps, and then the lesion segmentation mask, benign / malignant probability value, and risk classification of the target user's thyroid region are inferred and output in parallel. Using the scheme of this application, high-precision spatiotemporal alignment and cross-modal depth feature fusion of multimodal ultrasound images can be achieved to complete end-to-end intelligent identification of thyroid lesions.

Description

Technical Field

[0001] This application relates to the field of smart medical technology and, more specifically, to a multimodal ultrasound intelligent identification method for thyroid lesions and a precision diagnosis and treatment system.

Background Technology

[0002] Clinical diagnosis of thyroid diseases relies heavily on ultrasound imaging technology. With the development of medical imaging, modalities such as grayscale ultrasound, elastography, and contrast-enhanced ultrasound have gradually been applied to the detection and qualitative analysis of thyroid lesions. Grayscale ultrasound can clearly present the morphological characteristics of lesions, elastography can reflect the stiffness properties of tissues, and contrast-enhanced ultrasound can reveal the hemodynamic state of lesions.

[0003] Currently, thyroid ultrasound image analysis is mostly based on a single modality or on simple stitching of multiple modalities. Common practices include manual feature extraction from grayscale images and the use of convolutional neural networks for lesion segmentation and classification. Some studies have introduced elastography or contrast images as auxiliary information. However, such methods often ignore the misalignment between multimodal images in three-dimensional space and in time, so that morphological, stiffness, and functional information is matched inaccurately both spatially and temporally. In addition, existing methods usually process the features of each modality independently and then apply a simple fusion or concatenation, failing to achieve deep, adaptive cross-modal feature interaction and synergistic enhancement, which limits the model's ability to jointly discriminate multiple attributes of lesions. Moreover, most systems support only a single output task, such as segmentation or classification, and cannot address multiple clinical decision-making needs such as lesion localization, benign or malignant determination, and risk grading in parallel within a single inference, resulting in low diagnostic efficiency and a lack of systematic results. Therefore, how to achieve high-precision spatiotemporal alignment and cross-modal deep feature fusion of multimodal ultrasound images, so as to complete end-to-end intelligent identification of thyroid lesions, has become a challenge for the industry.

Summary of the Invention

[0004] This application provides a multimodal ultrasound intelligent identification method for thyroid lesions and a precision diagnosis and treatment system, which can achieve high-precision spatiotemporal alignment of multimodal ultrasound images and cross-modal depth feature fusion to complete end-to-end intelligent identification of thyroid lesions.

[0005] In a first aspect, this application provides a multimodal ultrasound intelligent identification method for thyroid lesions, including the following steps. A grayscale ultrasound image sequence, an elastography map, and a parametric map of the contrast time-intensity curve at the same anatomical location in the thyroid region of the target user are acquired, thereby obtaining multimodal image data. Three-dimensional spatial registration and temporal alignment are performed on the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment. Spatiotemporally dependent lesion morphological features are extracted from the grayscale ultrasound image sequence in the image data block to obtain the first depth feature map of the target user's thyroid region. Local feature enhancement and extraction are performed on the elastography map in the image data block to obtain the second depth feature map of the target user's thyroid region. The hemodynamic spatial distribution features of the parametric map in the image data block are captured to obtain the third depth feature map of the target user's thyroid region. Cross-modal fusion is performed on the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the thyroid lesion region of the target user. The comprehensive feature map is input into a multi-task learning head for feature adaptation and optimization, and the lesion segmentation mask, benign / malignant probability value, and risk classification category of the target user's thyroid region are then inferred and output in parallel.

[0006] In some embodiments, performing three-dimensional spatial registration and temporal alignment on the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment specifically includes the following. A three-dimensional spatial coordinate system is constructed using keyframes with anatomical landmarks within the grayscale ultrasound image sequence of the multimodal image data as the spatial reference. Based on the three-dimensional spatial coordinate system, the elastography map and the parametric map in the multimodal image data are spatially transformed to obtain the spatially registered elastography map and parametric map. Temporal interpolation and resampling are performed on the spatially registered parametric map to align it in time with the grayscale ultrasound image sequence, resulting in a time-aligned parametric map. Tensor stitching and dimensional reorganization are performed on the grayscale ultrasound image sequence, the time-aligned parametric map, and the spatially registered elastography map to generate the image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment.

[0007] In some embodiments, extracting spatiotemporally dependent lesion morphological features from the grayscale ultrasound image sequence in the image data block to obtain a first depth feature map of the target user's thyroid region specifically includes the following. The grayscale ultrasound image sequence in the image data block is input into a spatiotemporal feature coding network to extract continuous local texture and spatial structure features in the image sequence, generating a preliminary spatiotemporal feature map. Based on the spatiotemporal feature map, an enhanced feature map with global spatiotemporal context information is determined. The enhanced feature map is then subjected to feature projection and aggregation to generate the first depth feature map of the target user's thyroid region.

[0008] In some embodiments, performing local feature enhancement and extraction on the elastography map in the image data block to obtain a second depth feature map of the target user's thyroid region specifically includes the following. The elastography map in the image data block is input into a convolutional network based on refocusing convolution kernels for local feature enhancement, generating an initial elastic feature map. By enhancing the lesion-related feature channels and suppressing the irrelevant noise channels in the initial elastic feature map, a channel-weighted feature map is obtained. Multi-scale contextual information is captured in the channel-weighted feature map, yielding an elastic feature map containing multi-scale contextual information. The elastic feature map is upsampled so that its spatial resolution is consistent with that of the first depth feature map, thereby obtaining the second depth feature map of the target user's thyroid region.
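
The channel weighting and upsampling steps of this embodiment can be illustrated with a minimal NumPy sketch. The application does not specify how channels are weighted or which upsampling method is used, so the squeeze-and-excitation style gating, the nearest-neighbour upsampling, the random weights `w1` and `w2`, and all shapes below are assumptions made for illustration. The refocusing-convolution network and the multi-scale context step are not modelled.

```python
import numpy as np

def channel_weighting(feat, w1, w2):
    """Reweight the channels of a (C, H, W) map with gates in (0, 1) (assumed SE-style)."""
    squeezed = feat.mean(axis=(1, 2))               # per-channel global average, shape (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)         # bottleneck with ReLU, shape (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate per channel, shape (C,)
    return feat * gates[:, None, None], gates

def upsample_nearest(feat, out_hw):
    """Nearest-neighbour upsampling of a (C, H, W) map to (C, out_h, out_w)."""
    _, h, w = feat.shape
    out_h, out_w = out_hw
    rows = np.arange(out_h) * h // out_h            # source row for each output row
    cols = np.arange(out_w) * w // out_w            # source column for each output column
    return feat[:, rows[:, None], cols[None, :]]

rng = np.random.default_rng(0)
elastic_feat = rng.normal(size=(8, 16, 16))         # hypothetical initial elastic feature map
w1 = rng.normal(size=(2, 8))                        # hypothetical gate weights (reduction r = 4)
w2 = rng.normal(size=(8, 2))
weighted, gates = channel_weighting(elastic_feat, w1, w2)
second_map = upsample_nearest(weighted, (32, 32))   # match an assumed 32 x 32 first depth feature map
print(second_map.shape)
```

In a trained network the gate weights would be learned, so that channels carrying lesion-related stiffness information receive gates close to 1 and noise channels receive gates close to 0.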

[0009] In some embodiments, capturing the hemodynamic spatial distribution features of the parametric map in the image data block to obtain the third depth feature map of the target user's thyroid region specifically includes the following. The parametric map of the contrast time-intensity curve in the image data block is input into a multi-scale spatial convolutional network to extract the spatial distribution features of the hemodynamic parameters and generate a parametric spatial feature map. Global average pooling over the spatial dimensions is performed on the parametric spatial feature map to obtain globally aggregated quantitative features. The globally aggregated quantitative features are fused with the parametric spatial feature map at the channel level to obtain a fused feature vector that integrates the hemodynamic spatial distribution and the global quantitative information. The fused feature vector is subjected to dimensional transformation and feature integration to generate a third depth feature map of the target user's thyroid region with the same spatial resolution as the first depth feature map.
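
The global average pooling and channel-level fusion described in this embodiment can be sketched as follows. This is a minimal illustration under assumptions: the application does not state how the pooled features are combined with the spatial map, so broadcasting the pooled vector over the spatial grid and concatenating along the channel axis is one plausible reading, and the function name and shapes are hypothetical.

```python
import numpy as np

def fuse_global_descriptor(param_feat):
    """Concatenate a (C, H, W) map with its broadcast global-average descriptor."""
    c, h, w = param_feat.shape
    global_desc = param_feat.mean(axis=(1, 2))                      # pooled features, shape (C,)
    tiled = np.broadcast_to(global_desc[:, None, None], (c, h, w))  # repeat over the spatial grid
    fused = np.concatenate([param_feat, tiled], axis=0)             # channel-level fusion, (2C, H, W)
    return fused, global_desc

rng = np.random.default_rng(1)
param_feat = rng.normal(size=(4, 8, 8))    # hypothetical parametric spatial feature map
fused, global_desc = fuse_global_descriptor(param_feat)
print(fused.shape)
```

A subsequent learned projection (not shown) would reduce the doubled channel count and bring the result to the resolution of the first depth feature map.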

[0010] In some embodiments, cross-modal fusion of the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the thyroid lesion region of the target user specifically includes the following. The first depth feature map, the second depth feature map, and the third depth feature map are input into a cross-modal collaborative attention fusion network to determine the attention weight matrix between each pair of depth feature maps. Based on all attention weight matrices, collaborative attention fusion is performed on the first depth feature map, the second depth feature map, and the third depth feature map to form a high-dimensional fused feature map. The high-dimensional fused feature map is reduced in dimension and deeply integrated to obtain the comprehensive feature map of the thyroid lesion region of the target user.
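
One way to picture the pairwise attention weight matrices and the subsequent fusion is the NumPy sketch below. The application does not disclose the internal structure of the collaborative attention fusion network, so the scaled dot-product attention, the residual sum, the linear projection `proj`, and all shapes are assumptions chosen for brevity rather than the claimed design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention_fuse(maps, proj):
    """Fuse equally shaped (C, H, W) maps; proj has shape (C_out, M * C) for M maps."""
    c, h, w = maps[0].shape
    tokens = [m.reshape(c, h * w).T for m in maps]       # each modality as (N, C) tokens
    fused, weights = [], {}
    for i, query in enumerate(tokens):
        attended = query.copy()                          # residual: keep the modality's own features
        for j, other in enumerate(tokens):
            if i == j:
                continue
            attn = softmax(query @ other.T / np.sqrt(c)) # (N, N) weights from modality i to j
            weights[(i, j)] = attn
            attended = attended + attn @ other           # gather complementary information
        fused.append(attended)
    high_dim = np.concatenate(fused, axis=1)             # high-dimensional fused features, (N, M * C)
    reduced = high_dim @ proj.T                          # dimension reduction, (N, C_out)
    return reduced.T.reshape(-1, h, w), weights

rng = np.random.default_rng(2)
depth_maps = [rng.normal(size=(4, 6, 6)) for _ in range(3)]  # hypothetical first/second/third maps
proj = rng.normal(size=(4, 12))                              # hypothetical reduction weights
comprehensive, attn_weights = co_attention_fuse(depth_maps, proj)
print(comprehensive.shape, len(attn_weights))
```

With three modalities there are six ordered pairs, so six attention matrices are produced. Each row of a matrix sums to 1 and indicates where in the other modality a given spatial position looks for supporting evidence.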

[0011] In some embodiments, inputting the comprehensive feature map into a multi-task learning head for feature adaptation and optimization, and then inferring and outputting in parallel the lesion segmentation mask, benign / malignant probability value, and risk classification category of the target user's thyroid region specifically includes the following. Channel adaptation and multi-scale context enhancement are performed on the comprehensive feature map to obtain an optimized feature map. The optimized feature map is input in parallel to the three task prediction heads of the multi-task learning head. Semantic segmentation is performed by the segmentation prediction head, which outputs the lesion segmentation mask of the target user's thyroid region. Binary classification is performed by the benign / malignant classification head, which outputs the benign / malignant probability value of the corresponding lesion. Multi-category classification is performed by the risk grading head, which outputs the risk grading category of the corresponding lesion.
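
The three parallel prediction heads can be sketched as below. The head architectures are not disclosed, so this minimal example assumes a 1 x 1 convolution for segmentation and linear layers on globally pooled features for the two classification heads. The weights are random, the number of risk grades (five) is hypothetical, and the channel adaptation and multi-scale enhancement steps are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multitask_heads(feat, w_seg, w_cls, w_risk, threshold=0.5):
    """Run three heads on one shared (C, H, W) optimized feature map."""
    seg_prob = sigmoid(np.tensordot(w_seg, feat, axes=([0], [0])))  # 1x1 conv -> (H, W) probabilities
    pooled = feat.mean(axis=(1, 2))                                 # shared global descriptor, (C,)
    malignancy = float(sigmoid(w_cls @ pooled))                     # benign / malignant probability
    risk_probs = softmax(w_risk @ pooled)                           # distribution over risk grades
    return seg_prob >= threshold, malignancy, int(np.argmax(risk_probs)), risk_probs

rng = np.random.default_rng(3)
optimized = rng.normal(size=(4, 6, 6))   # hypothetical optimized feature map
w_seg = rng.normal(size=4)               # hypothetical segmentation head weights
w_cls = rng.normal(size=4)               # hypothetical benign / malignant head weights
w_risk = rng.normal(size=(5, 4))         # hypothetical risk head weights, 5 assumed grades
mask, malignancy, risk_grade, risk_probs = multitask_heads(optimized, w_seg, w_cls, w_risk)
print(mask.shape, risk_grade)
```

Because all three heads read the same feature map, one forward pass yields the mask, the probability, and the grade together, which is the sense in which the outputs are inferred in parallel.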

[0012] In a second aspect, this application provides a multimodal ultrasound precision diagnosis and treatment system for thyroid lesions. The system includes a lesion intelligent recognition unit, which includes an acquisition module, a processing module, and an execution module. The acquisition module is used to acquire a grayscale ultrasound image sequence, an elastography map, and a parametric map of the contrast time-intensity curve at the same anatomical location in the thyroid region of the target user, thereby obtaining multimodal image data, and to perform three-dimensional spatial registration and temporal alignment on the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment. The processing module is used to extract spatiotemporally dependent lesion morphological features from the grayscale ultrasound image sequence in the image data block to obtain the first depth feature map of the target user's thyroid region, to perform local feature enhancement and extraction on the elastography map in the image data block to obtain the second depth feature map of the target user's thyroid region, and to capture the hemodynamic spatial distribution features of the parametric map in the image data block to obtain the third depth feature map of the target user's thyroid region. The processing module is further used to perform cross-modal fusion of the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the thyroid lesion region of the target user. The execution module is used to input the comprehensive feature map into the multi-task learning head for feature adaptation and optimization, and then to infer and output in parallel the lesion segmentation mask, benign / malignant probability value, and risk classification category of the thyroid region of the target user.

[0013] In a third aspect, this application provides a computer device including a memory and a processor, the memory storing code, and the processor being configured to acquire the code and execute the above-described multimodal ultrasound intelligent identification method for thyroid lesions.

[0014] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described multimodal ultrasound intelligent identification method for thyroid lesions.

[0015] The technical solutions provided by the embodiments disclosed in this application have the following beneficial effects. The multimodal ultrasound intelligent identification method and precision diagnosis and treatment system for thyroid lesions provided in this application first acquire a grayscale ultrasound image sequence, an elastography map, and a parametric map of the contrast time-intensity curve at the same anatomical location in the thyroid region of the target user, thereby obtaining multimodal image data. The multimodal image data is then subjected to three-dimensional spatial registration and temporal alignment to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment. Spatiotemporally dependent lesion morphological features are extracted from the grayscale ultrasound image sequence in the image data blocks to obtain the first depth feature map of the thyroid region of the target user. Local feature enhancement and extraction are performed on the elastography map in the image data blocks to obtain a second depth feature map of the target user's thyroid region. The hemodynamic spatial distribution features of the parametric map in the image data blocks are captured to obtain a third depth feature map of the target user's thyroid region. The first, second, and third depth feature maps are fused across modalities to obtain a comprehensive feature map of the target user's thyroid lesion region. The comprehensive feature map is input into a multi-task learning head for feature adaptation and optimization, and the lesion segmentation mask, benign / malignant probability value, and risk classification category of the target user's thyroid region are then inferred and output in parallel.

[0016] Therefore, this application inputs the comprehensive feature map into a multi-task learning head for feature adaptation and optimization, and then infers and outputs the lesion segmentation mask, benign / malignant probability value, and risk classification category of the target user's thyroid region in parallel. First, determining the image data block yields a structured data unit integrated from the multimodal images after spatial and temporal alignment. This constructs a standardized data unit with strict spatiotemporal synchronization and a well-defined information structure, eliminating geometric misalignment and temporal mismatch between multi-source data at the source. It provides a high-quality, unambiguous input basis for subsequent depth feature extraction and cross-modal fusion, so that the system performs feature learning on real, consistent anatomical locations and times, thereby significantly improving the boundary accuracy of lesion segmentation, the ability to capture multidimensional correlation patterns in benign / malignant discrimination, and the overall reliability and clinical interpretability of the final multi-task output results.
Then, determining the first depth feature map yields a morphological feature map that contains rich local details of the target user's thyroid region and has strong semantic representativeness. This overcomes the limitation of existing technologies that extract static morphological features only from single frames or simple sequences. Through spatiotemporally dependent lesion morphological feature extraction, it fuses the local textures and spatial structures in the grayscale ultrasound image sequence, together with their dynamic evolution over time (e.g., subtle movements of lesions with blood flow or swallowing), into a unified depth feature representation. This not only preserves key morphological details and dynamic information but also enhances semantic representativeness through global spatiotemporal context modeling, thus providing a stable, spatiotemporally and semantically rich morphological foundation for subsequent high-precision cross-modal fusion with the elastic feature map and the hemodynamic feature map.
Determining the second depth feature map yields an elastic feature map of the target user's thyroid region that fuses contextual information at different spatial scales and has the same spatial resolution as the first depth feature map. This ensures the precise correspondence between elastic and morphological features in anatomical location. Furthermore, by enhancing details of local stiffness abnormalities related to lesions, suppressing noise interference, and fusing local-to-global contextual information, the discriminative information of the elastic modality is fully extracted, significantly improving the synergy and robustness of cross-modal feature fusion. This provides a stable and highly discriminative stiffness attribute representation for the subsequent comprehensive feature map.
Determining the third depth feature map yields a two-dimensional feature map that represents the evolution of key hemodynamic phase patterns in the target user's thyroid region and has the same dimensions as the first depth feature map. This enables precise element-wise fusion of the temporal evolution of microvascular function in lesions with the grayscale feature map representing morphological structure and the elastic feature map representing tissue stiffness, at the same spatial dimension and resolution. It achieves spatiotemporal unification and deep synergy of functional information with morphological and stiffness information at the feature level, providing high-quality, alignable functional modal input for the subsequent cross-modal fusion network.
Finally, determining the comprehensive feature map yields a compact feature representation that combines the original complementary information from different physical sources in the target user's thyroid region with enhanced contextual information. Through the cross-modal collaborative attention mechanism, the morphological features, tissue stiffness features, and hemodynamic features that have undergone three-dimensional spatial registration and temporal alignment are adaptively and deeply fused. This effectively solves the problems of spatial and temporal misalignment of multimodal information and of shallow, isolated feature interactions in existing technologies. It achieves precise alignment and complementary enhancement of morphological, stiffness, and functional information at the feature level, thereby supporting parallel and collaborative inference of lesion segmentation, benign / malignant differentiation, and risk grading. This significantly improves the overall accuracy, efficiency, and clinical applicability of intelligent diagnosis of thyroid lesions. In summary, based on the above scheme, high-precision spatiotemporal alignment of multimodal ultrasound images and cross-modal deep feature fusion can be achieved to complete end-to-end intelligent identification of thyroid lesions.

Attached Figure Description

[0017] Figure 1 is an exemplary flowchart of a multimodal ultrasound intelligent identification method for thyroid lesions according to some embodiments of this application; Figure 2 is a flowchart illustrating the operation of determining image data blocks according to some embodiments of this application; Figure 3 is an exemplary flowchart illustrating the determination of a second depth feature map according to some embodiments of this application; Figure 4 is a schematic diagram of the structure of the intelligent lesion recognition unit according to some embodiments of this application; Figure 5 is an internal structural diagram of a computer device for implementing a multimodal ultrasound intelligent identification method for thyroid lesions, according to some embodiments of this application.

Detailed Implementation

[0018] To better understand the technical solution of this application, the technical solution of this application will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0019] Refer to Figure 1, which is an exemplary flowchart of a multimodal ultrasound intelligent identification method for thyroid lesions according to some embodiments of this application. The multimodal ultrasound intelligent identification method for thyroid lesions mainly includes the following steps. In step 101, a grayscale ultrasound image sequence, an elastography map, and a parametric map of the contrast time-intensity curve at the same anatomical location in the thyroid region of the target user are acquired to obtain multimodal image data. The multimodal image data is then subjected to three-dimensional spatial registration and temporal alignment to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment.

[0020] It should be noted that, in this application, the grayscale ultrasound image sequence is a dynamic image characterizing the two-dimensional anatomical structure and morphological features of the thyroid region of the target user. This grayscale ultrasound image sequence can provide high-resolution baseline spatial information, clearly showing the location, boundary, internal echo (e.g., hypoechoic), and key morphological indicators such as the presence of microcalcifications of the lesion. The elastography image is an image showing the stiffness or strain of the local tissue of the thyroid of the target user. This elastography image can provide biomechanical property information related to histopathology, helping to identify areas of abnormally increased tissue stiffness caused by malignant tumor cell proliferation, fibrosis, etc., thereby compensating for the shortcomings of simple morphological diagnosis. The parametric map of the contrast time-intensity curve is an image formed by quantitative parameters characterizing the microvascular blood flow perfusion dynamics state inside the thyroid lesion of the target user. This parametric map can provide blood flow functional information of the lesion, such as perfusion rate and pattern, thereby revealing the abnormal neovascularization characteristics often accompanied by malignant tumors, which is a key functional basis for differentiating between benign and malignant tumors.

[0021] In specific implementation, obtaining grayscale ultrasound image sequences, elastography maps, and parametric maps of contrast-enhanced ultrasound time-intensity curves at the same anatomical location in the thyroid region of the target user, and thus obtaining multimodal image data, can be achieved in the following way: Grayscale ultrasound image sequences, elastography maps, and raw dynamic sequences of contrast-enhanced ultrasound at the same anatomical location in the thyroid region of the target user can be obtained from the hospital database through the standard data interface of the hospital communication system. The raw dynamic sequences of contrast-enhanced ultrasound are then fitted pixel-by-pixel with time-intensity curves to obtain the contrast-enhanced ultrasound time-intensity curve. Hemodynamic parameters, including onset time, peak time, peak intensity, and clearance rate, are quantified and extracted from the contrast-enhanced ultrasound time-intensity curve to generate a two-dimensional parametric map. The set of grayscale ultrasound image sequences, elastography maps, and parametric maps is then used as multimodal image data. The raw dynamic sequences of contrast-enhanced ultrasound are continuous frame images characterizing the dynamic changes in microvascular blood flow perfusion in thyroid lesions, and are the original data for generating the contrast-enhanced ultrasound time-intensity curve parametric map.
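
The extraction of per-pixel hemodynamic parameters can be illustrated with the following NumPy sketch. It reads the four parameters directly from the sampled curve of each pixel instead of fitting a curve model, which is a simplification of the fitting step described above. The onset threshold (a fraction of peak enhancement), the use of the first frame as baseline, and the definition of clearance rate as the mean intensity drop per second after the peak are all assumptions, since the application does not define them.

```python
import numpy as np

def tic_parameter_maps(frames, times, onset_frac=0.1):
    """frames: (T, H, W) float contrast intensities; times: (T,) increasing seconds.

    Returns a (4, H, W) stack: onset time, peak time, peak intensity, clearance rate.
    """
    enhancement = frames - frames[0]                    # subtract the pre-contrast baseline frame
    peak_idx = enhancement.argmax(axis=0)               # index of the peak sample per pixel
    peak_intensity = enhancement.max(axis=0)
    peak_time = times[peak_idx]
    # Onset: first sample reaching the assumed fraction of the peak enhancement.
    above = enhancement >= onset_frac * peak_intensity[None]
    onset_time = times[above.argmax(axis=0)]
    # Clearance rate: mean intensity drop per second from the peak to the last frame.
    duration = times[-1] - peak_time
    drop = peak_intensity - enhancement[-1]
    clearance = np.divide(drop, duration, out=np.zeros_like(drop), where=duration > 0)
    return np.stack([onset_time, peak_time, peak_intensity, clearance])

times = np.arange(10, dtype=float)                      # hypothetical 1 s frame spacing
curve = np.array([0, 0, 1.5, 3, 6, 10, 8, 6, 4, 2], dtype=float)
frames = curve[:, None, None] * np.ones((1, 2, 2))      # same synthetic curve at every pixel
params = tic_parameter_maps(frames, times)
print(params[:, 0, 0])
```

For the synthetic curve the sketch reports an onset at 2 s, a peak of 10 at 5 s, and a clearance rate of 2 intensity units per second. Real contrast data are noisy, which is why fitting a smooth curve per pixel before reading off the parameters, as the paragraph above describes, is the more robust route.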

[0022] In some embodiments, refer to Figure 2, which is a flowchart illustrating the operation of determining image data blocks according to some embodiments of this application. In this application, the three-dimensional spatial registration and temporal alignment of the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment can be achieved by the following steps. A three-dimensional spatial coordinate system is constructed using keyframes with anatomical landmarks within the grayscale ultrasound image sequence of the multimodal image data as the spatial reference. Based on the three-dimensional spatial coordinate system, the elastography map and the parametric map in the multimodal image data are spatially transformed to obtain the spatially registered elastography map and parametric map. Temporal interpolation and resampling are performed on the spatially registered parametric map to align it in time with the grayscale ultrasound image sequence, resulting in a time-aligned parametric map. Tensor stitching and dimensional reorganization are performed on the grayscale ultrasound image sequence, the time-aligned parametric map, and the spatially registered elastography map to generate the image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment.
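
The final tensor stitching and dimensional reorganization step can be sketched as follows. The tensor layout is not specified in the application, so this minimal example assumes a grayscale sequence of shape (T, H, W), a time-aligned parametric map of shape (T, P, H, W) with P parameters per frame, a single elastography map of shape (H, W) repeated for every frame, and a channel-first output of shape (C, T, H, W).

```python
import numpy as np

def build_data_block(gray_seq, param_seq, elasto_map):
    """Stack registered, time-aligned modalities into one (C, T, H, W) block."""
    t, h, w = gray_seq.shape
    gray = gray_seq[:, None]                                         # (T, 1, H, W)
    elasto = np.broadcast_to(elasto_map[None, None], (t, 1, h, w))   # repeat the static map per frame
    block = np.concatenate([gray, param_seq, elasto], axis=1)        # stitch channels: (T, 2 + P, H, W)
    return block.transpose(1, 0, 2, 3)                               # reorganize to (C, T, H, W)

rng = np.random.default_rng(4)
gray_seq = rng.random((5, 8, 8))         # hypothetical 5-frame grayscale sequence
param_seq = rng.random((5, 4, 8, 8))     # hypothetical 4 hemodynamic parameters per frame
elasto_map = rng.random((8, 8))          # hypothetical registered elastography map
block = build_data_block(gray_seq, param_seq, elasto_map)
print(block.shape)
```

Keeping the modalities on one channel axis means that every spatial index and frame index in the block refers to the same anatomical position and time across all modalities, which is what the registration and alignment steps are meant to guarantee.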

[0023] In specific implementation, the construction of a three-dimensional spatial coordinate system using keyframes with anatomical landmarks within the grayscale ultrasound image sequence of the multimodal image data as the spatial reference can be achieved in the following way. First, clearly visible anatomical landmarks in the grayscale ultrasound image sequence, such as the boundary points of the thyroid capsule, the strong echo interface of the anterior tracheal wall, or the carotid artery wall, can be automatically identified using an edge detection algorithm based on image gradients, and the image frame with the most anatomical landmarks is determined as the spatial reference frame. Then, with the upper left corner of the spatial reference frame as the origin, the row and column directions of the image are defined as the X-axis and Y-axis of the three-dimensional coordinate system, respectively, and the probe scanning direction perpendicular to the imaging plane is defined as the Z-axis, thereby establishing a three-dimensional Cartesian coordinate system corresponding to physical space. Next, using the displacement field estimated between consecutive frames by the optical flow method, the pixel positions of all image frames in the grayscale ultrasound image sequence other than the spatial reference frame are mapped to the three-dimensional Cartesian coordinate system, completing the three-dimensional spatial reconstruction of the entire grayscale ultrasound image sequence. The three-dimensional Cartesian coordinate system is then used as the three-dimensional spatial coordinate system. The three-dimensional spatial coordinate system is a unified geometric reference frame that precisely defines the position of each pixel in three-dimensional space. 
This three-dimensional spatial coordinate system can provide a common and accurate spatial scale for grayscale, elastic, and contrast parameter images from different imaging principles and scanning times, thereby ensuring that all subsequent feature extraction and fusion analysis are based on completely consistent real anatomical positions, fundamentally eliminating spatial misalignment errors caused by patient movement or probe oscillation.

[0024] In specific implementation, spatially transforming the elastography map and the parametric map in the multimodal image data based on the three-dimensional spatial coordinate system, to obtain the spatially registered elastography map and parametric map, can be achieved as follows. Regions of interest covering the same anatomical region as the spatial reference benchmark frame are extracted from the elastography map and the parametric map, respectively, based on the three-dimensional spatial coordinate system. A non-rigid registration algorithm based on mutual information maximization and a B-spline free-form deformation model is then used to calculate the local deformation vector field from the elastography map and the contrast parametric map to the grayscale reference frame, by optimizing a cost function that includes an image similarity measure (i.e., mutual information) and a smoothness constraint on the deformation field. The calculated local deformation vector field is applied to the original elastography map and parametric map; bilinear interpolation is used to calculate pixel intensity values at non-integer pixel positions after deformation, and local intensity normalization based on Gaussian kernels is performed, yielding the spatially registered elastography map and parametric map. The spatially registered elastography map ensures that the extracted tissue stiffness features can be spatially matched with the morphological features in the grayscale image, making it possible to associate the stiffness abnormalities of a lesion with its morphological manifestations. The spatially registered parametric map enables parameters characterizing hemodynamic function (such as perfusion velocity) to be mapped to the correct anatomical location, thereby allowing functional and morphological information of the lesion area to be superimposed and compared at the spatial level.
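The cost function mentioned above, which combines mutual information with a smoothness constraint on the deformation field, can be sketched as follows. This is a hedged illustration only: the histogram-based estimator, the bin count, the weighting factor `lam`, and the function names are assumptions, and the B-spline parameterization and the optimizer are omitted.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information (in nats) between two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)       # marginal of image b
    nz = pxy > 0                              # skip empty bins (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def smoothness_penalty(field):
    """Mean squared spatial gradient of a (2, H, W) displacement field."""
    grads = np.gradient(field, axis=(1, 2))
    return float(sum(np.mean(g ** 2) for g in grads))

def registration_cost(fixed, warped, field, lam=0.1):
    """Lower is better: high similarity and a smooth deformation field."""
    return -mutual_information(fixed, warped) + lam * smoothness_penalty(field)

rng = np.random.default_rng(0)
fixed = rng.random((64, 64))
aligned = 1.0 - fixed                  # same structure, different intensity mapping
shifted = np.roll(fixed, 7, axis=1)    # misaligned copy of the same image
mi_aligned = mutual_information(fixed, aligned)
mi_shifted = mutual_information(fixed, shifted)
```

Mutual information is suited to this step because it rewards a consistent statistical relationship between intensities rather than equal intensities, which matters when the grayscale, elastography, and parametric maps encode different physical quantities.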

[0025] In specific implementation, performing temporal interpolation and resampling on the spatially registered parametric map so that it is temporally aligned with the grayscale ultrasound image sequence, to obtain the time-aligned parametric map, can be achieved as follows. The timestamp of each image frame in the grayscale ultrasound image sequence is extracted and used as the target alignment time axis. A temporal interpolation algorithm based on a third-order B-spline function is applied to the time series of parameter values of each spatial pixel in the spatially registered parametric map, fitting the trajectory of each pixel's parameter values over time. This trajectory is then resampled on the target time axis to calculate, for each pixel, a new set of parameter values whose times correspond exactly to those of the grayscale image frames. Finally, the resampled parameter values are reorganized along the time and space dimensions to generate a new parametric map, which serves as the time-aligned parametric map. The time-aligned parametric map synchronizes hemodynamic functional information with anatomical morphological information in both space and time: the parameters reflecting the microvascular perfusion characteristics of the lesion correspond, in the time dimension, to each frame of the grayscale ultrasound image sequence. This ensures that, in the subsequent multimodal fusion analysis, the morphological features of the lesion at a given time can be matched with the blood flow state of the lesion at the same time. It overcomes the mismatch between functional and morphological information that may be caused by misaligned acquisition times and provides a reliable foundation for the subsequent construction of a comprehensive feature map with spatiotemporal consistency.
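The per-pixel resampling onto the grayscale time axis can be sketched as below. Note the substitution: to keep the example dependency-free, piecewise linear interpolation stands in for the third-order B-spline named in the text. The function name, array layout, and timestamps are hypothetical.

```python
import numpy as np

def resample_in_time(param_maps, src_times, target_times):
    """Resample a (T_src, H, W) parametric sequence onto `target_times`.

    Linear interpolation per pixel (a stand-in for the cubic B-spline).
    `src_times` must be strictly increasing; targets outside the source
    range are clamped to the first or last source time.
    """
    src_times = np.asarray(src_times, dtype=float)
    tgt = np.clip(np.asarray(target_times, dtype=float),
                  src_times[0], src_times[-1])
    # Index of the source sample just after each target time.
    hi = np.clip(np.searchsorted(src_times, tgt, side="right"),
                 1, len(src_times) - 1)
    lo = hi - 1
    # Fractional position of each target between its two neighbours.
    w = (tgt - src_times[lo]) / (src_times[hi] - src_times[lo])
    w = w[:, None, None]
    return (1.0 - w) * param_maps[lo] + w * param_maps[hi]

# Example: every pixel changes linearly in time, so the result is exact.
base = np.arange(6, dtype=float).reshape(2, 3)
src_t = np.array([0.0, 1.0, 2.0, 3.0])        # contrast acquisition times
maps = src_t[:, None, None] * base            # (4, 2, 3)
gray_t = np.array([0.0, 0.5, 1.5, 3.0])       # grayscale frame timestamps
aligned_maps = resample_in_time(maps, src_t, gray_t)
```

After this step the parametric sequence has exactly one map per grayscale frame, which is the property the later channel-wise stitching relies on.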

[0026] It should be noted that in this application, the image data block is a structured data unit formed by integrating the multimodal images after spatial and temporal alignment. It is a standardized, spatiotemporally synchronized data object containing complete information on morphology, stiffness, and function, providing high-quality, unambiguous input for the subsequent deep neural networks, and is a key data foundation for high-performance, interpretable multimodal intelligent recognition. In specific implementation, performing tensor stitching and dimensionality reorganization on the grayscale ultrasound image sequence, the time-aligned parametric map, and the spatially registered elastography map, to generate the image data block with multimodal spatial registration and grayscale-contrast temporal alignment, can be achieved as follows. The grayscale ultrasound image sequence, the time-aligned parametric map, and the spatially registered elastography map are each treated as a four-dimensional tensor with time, height, width, and channel dimensions. The three four-dimensional tensors are stitched together along the channel dimension to form a composite tensor with an increased number of channels. For example, if the grayscale sequence has a single channel, the elastography map has a single channel, and the parametric map has N parameter channels, the total number of channels after stitching is N+2. The composite tensor is then dimensionally reorganized, that is, the channels of the different modalities are interleaved to enhance the model's ability to perceive cross-modal correlations at an early stage. The composite tensor after dimensional reorganization is used as the image data block with multimodal spatial registration and grayscale-contrast temporal alignment.
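The stitching and channel interleaving can be sketched as follows. The round-robin interleaving order is an assumption chosen for illustration (the text does not fix a particular order), and `build_data_block` is a hypothetical name.

```python
import numpy as np

def build_data_block(gray, elasto, params):
    """Stitch three (T, H, W, C_i) tensors along channels, then interleave.

    Returns the reorganized block and the channel order that was applied.
    Round-robin order (assumed): first channel of each modality, then the
    second channel of each modality that has one, and so on.
    """
    modalities = [gray, elasto, params]
    stacked = np.concatenate(modalities, axis=-1)       # channels: C_g + C_e + C_p
    counts = [m.shape[-1] for m in modalities]
    offsets = [0, counts[0], counts[0] + counts[1]]     # start of each modality
    order = [off + k
             for k in range(max(counts))
             for off, c in zip(offsets, counts) if k < c]
    return stacked[..., order], order

# Example from the text: 1 grayscale + 1 elastography + N parametric channels.
T, H, W, N = 4, 8, 8, 3
block, order = build_data_block(np.zeros((T, H, W, 1)),
                                np.zeros((T, H, W, 1)),
                                np.zeros((T, H, W, N)))
```

With single-channel grayscale and elastography inputs the interleaved order coincides with plain concatenation; the reordering only becomes visible when several modalities carry more than one channel.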

[0027] In step 102, the temporal and spatial dependent lesion morphological features of the grayscale ultrasound image sequence in the image data block are extracted to obtain the first depth feature map of the target user's thyroid region. The local feature enhancement and extraction of the elastography map in the image data block are performed to obtain the second depth feature map of the target user's thyroid region. The hemodynamic spatial distribution features of the parametric map in the image data block are captured to obtain the third depth feature map of the target user's thyroid region.

[0028] In some embodiments, extracting spatiotemporally dependent lesion morphological features from grayscale ultrasound image sequences in an image data block to obtain a first depth feature map of the target user's thyroid region can be achieved through the following steps: The grayscale ultrasound image sequence in the image data block is input into a spatiotemporal feature coding network to extract continuous local texture and spatial structure features in the image sequence, generating a preliminary spatiotemporal feature map. Based on the spatiotemporal feature map, an enhanced feature map with global spatiotemporal context information is determined; The enhanced feature map is then subjected to feature projection and aggregation to generate a first depth feature map of the target user's thyroid region.

[0029] In specific implementation, inputting the grayscale ultrasound image sequence in the image data block into the spatiotemporal feature coding network to extract continuous local texture and spatial structure features, generating the preliminary spatiotemporal feature map, can be achieved as follows. A three-dimensional tensor representing the grayscale ultrasound image sequence, with dimensions of image height, width, and time frame, is extracted from the image data block. This tensor is input into the spatiotemporal feature coding network, which consists of multiple cascaded three-dimensional convolutional layers and a feature projection and aggregation layer. Each three-dimensional convolutional kernel performs sliding convolution over multiple adjacent consecutive frames simultaneously, so that a single kernel captures both the spatial texture variations within a small pixel neighborhood, such as echo intensity and edges, and the subtle dynamic evolution of these textures across consecutive frames, such as slight movements with blood flow or pulsation. Through the nonlinear transformations and feature abstraction of the multiple three-dimensional convolutional layers, these are transformed into high-dimensional feature maps containing local spatiotemporal patterns, forming the preliminary spatiotemporal feature map. The spatiotemporal feature map is a primary feature representation that simultaneously encodes the local texture details of the target user's thyroid lesion and its subtle dynamic changes in the time dimension. It transforms two-dimensional static morphological information into three-dimensional spatiotemporal features containing dynamic cues, providing a basic analytical unit for capturing subtle movements such as lesion pulsation with blood flow or swallowing.
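A single three-dimensional convolution over a frame stack can be sketched as follows, to show how one kernel responds to both spatial texture and frame-to-frame change. The kernels and the helper name `conv3d_valid` are illustrative assumptions; a real network would learn many kernels and stack several layers.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(volume, kernel):
    """Slide a (kt, kh, kw) kernel over a (T, H, W) volume without padding.

    Implemented as cross-correlation, as is usual in deep learning layers.
    """
    windows = sliding_window_view(volume, kernel.shape)   # (T', H', W', kt, kh, kw)
    return np.einsum("thwijk,ijk->thw", windows, kernel)

# A kernel that averages a 3x3x3 spatiotemporal neighbourhood.
mean_kernel = np.ones((3, 3, 3)) / 27.0

# A kernel that measures change over time at the centre pixel:
# +1 on the last frame, -1 on the first frame of the window.
temporal_kernel = np.zeros((3, 3, 3))
temporal_kernel[0, 1, 1] = -1.0
temporal_kernel[2, 1, 1] = 1.0

static_clip = np.tile(np.arange(42, dtype=float).reshape(6, 7), (5, 1, 1))  # (5, 6, 7)
smoothed = conv3d_valid(static_clip, mean_kernel)
motion = conv3d_valid(static_clip, temporal_kernel)   # all zeros: nothing moves
features = np.maximum(smoothed, 0.0)                  # ReLU nonlinearity
```

On a clip whose frames are identical the temporal kernel returns zero everywhere, while any frame-to-frame change (for example pulsation) would produce a non-zero response at the affected pixels.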

[0030] In specific implementation, determining the enhanced feature map with global spatiotemporal context information based on the spatiotemporal feature map can be achieved as follows. The spatiotemporal feature map is flattened and reorganized along the three dimensions of height, width, and time and converted into a sequence of feature vectors, which is input into a multi-head spatiotemporal self-attention module. For each feature vector position in the sequence, a query vector, a key vector, and a value vector are calculated in parallel. The dot product similarity between the query vector of one position and the key vector of another is calculated for every pair of positions. After normalization, the dot product similarities are arranged in the order of the feature vectors to form a global attention weight matrix. The weight matrix is then used to perform a weighted summation over all value vectors, so that the new feature vector at each position aggregates relevant context information from all positions in the sequence. Finally, an enhanced feature map that deeply fuses global spatiotemporal semantics is output. The enhanced feature map is a feature representation obtained by fusing global context information in the grayscale ultrasound image sequence. It overcomes the local receptive field limitation of the convolution operation and establishes the spatiotemporal association between the lesion area and distant normal tissue, allowing the overall morphology and dynamic behavior pattern of the lesion to be characterized more accurately.
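The attention computation described above can be sketched for a single head as follows. The 1/sqrt(d) scaling followed by softmax is assumed as the normalization (the text only says "normalization"), the projection matrices are random placeholders for learned weights, and a multi-head module would run several such heads on channel subsets in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat, wq, wk, wv):
    """Single-head self-attention over a (T, H, W, C) feature map."""
    t, h, w, c = feat.shape
    seq = feat.reshape(t * h * w, c)            # flatten to a token sequence
    q, k, v = seq @ wq, seq @ wk, seq @ wv      # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])     # pairwise dot-product similarity
    weights = softmax(scores, axis=-1)          # global attention weight matrix
    out = weights @ v                           # weighted sum of value vectors
    return out.reshape(t, h, w, -1), weights

rng = np.random.default_rng(0)
C = 8
feat = rng.normal(size=(2, 4, 4, C))
wq, wk, wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
enhanced, attn = self_attention(feat, wq, wk, wv)
```

Each row of `attn` is a distribution over all 32 spatiotemporal positions, which is how a position inside the lesion can draw on context from distant tissue or from other frames.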

[0031] In specific implementation, the enhancement feature map is projected and aggregated to generate the first deep feature map of the target user's thyroid region. This can be achieved by the following method: the enhancement feature map is input into the feature projection and aggregation layer of the spatiotemporal feature coding network. One-dimensional convolution is used to interact and compress the channel dimension of the feature map, thereby filtering and fusing the most discriminative information between different channels. Global average pooling is then performed on the feature map after channel projection in the spatial and temporal dimensions to obtain a global feature vector of fixed dimension. The global feature vector is then expanded in dimension through a fully connected layer to reconstruct a two-dimensional feature map. Finally, the reconstructed two-dimensional feature map is added element-wise to the original enhancement feature map after downsampling, achieving complementary fusion of high-level semantic information and low-level detail information, thereby obtaining the first deep feature map of the target user's thyroid region.

[0032] It should be noted that in this application, the first deep feature map is a morphological feature map that contains rich local details of the target user's thyroid region and has strong semantic representativeness. This first deep feature map not only retains key local details and dynamic information, but also integrates global semantics, forming a compact and powerful deep feature representation, which can be used as the core input for representing the morphological and dynamic dimensions in subsequent cross-modal fusion.

[0033] In some embodiments, referring to Figure 3, which is an exemplary flowchart illustrating the determination of the second depth feature map according to some embodiments of this application, performing local feature enhancement and extraction on the elastography map in the image data block to obtain the second depth feature map of the target user's thyroid region can be achieved through the following steps: in step 1021, the elastography map in the image data block is input into a convolutional network based on refocusing convolution kernels for local feature enhancement to generate an initial elastic feature map; in step 1022, the lesion-related feature channels in the initial elastic feature map are enhanced and irrelevant noise channels are suppressed to obtain a channel-weighted feature map; in step 1023, multi-scale contextual information in the channel-weighted feature map is captured to obtain an elastic feature map containing multi-scale contextual information; in step 1024, the elastic feature map is upsampled so that its spatial resolution is consistent with that of the first depth feature map, thereby obtaining the second depth feature map of the target user's thyroid region.

[0034] In specific implementation, inputting the elastography map in the image data block into a convolutional network based on refocusing convolution kernels for local feature enhancement, to generate the initial elastic feature map, can be achieved as follows. A three-dimensional tensor representing the elastography map, with dimensions of image height, width, and time frame, is extracted from the image data block. This tensor is input into a lightweight convolutional network designed for medical images, and the output feature map is used as the initial elastic feature map. The first layer of this lightweight convolutional network is a refocusing convolutional layer, which forms its core. Its basic convolution kernel is obtained through pre-training on a large-scale natural image dataset, and a learnable refocusing transformation matrix is applied to it to adapt the kernel parameters specifically to capturing subtle contrast variations and texture differences in tissue stiffness images. The refocusing convolutional layer is followed by a normalization layer and a parametric rectified linear activation function to accelerate training convergence and introduce nonlinear expressive power. This is followed by a bottleneck structure consisting of two consecutive standard convolutional layers, which further refines and compresses the features, reduces computational redundancy, and provides an initial enhancement of the local texture and edge information related to tissue stiffness abnormalities in the original elastography map. The initial elastic feature map is thus an intermediate feature representation in which the local texture and edge information of tissue stiffness abnormalities in the target user's thyroid region is enhanced. It provides a rich source of detailed features for subsequent feature selection and fusion, avoiding the difficulty of extracting high-level semantics directly from the original image.

[0035] In specific implementation, strengthening the lesion-related feature channels and suppressing irrelevant noise channels in the initial elastic feature map to obtain the channel-weighted feature map can be achieved as follows. The initial elastic feature map is input into an improved residual channel attention module. This module first performs global average pooling and global max pooling on the input feature map simultaneously, obtaining two one-dimensional vectors that represent different global contexts. These two vectors are added together and fed into a multilayer perceptron with shared weights. Through a bottleneck structure consisting of a dimensionality reduction layer and a dimensionality increase layer, the nonlinear dependencies between the feature channels are learned and a channel attention weight vector is output. The channel attention weight vector is then passed through the Sigmoid activation function, which normalizes it to between 0 and 1 to obtain the scoring vector. The scoring vector is multiplied channel-by-channel with the initial elastic feature map to recalibrate the feature channels, resulting in a calibrated feature map. Finally, the calibrated feature map is added element-by-element to the input elastic feature map through a residual connection to output the channel-weighted feature map. The channel-weighted feature map is thus obtained by recalibrating the importance of each feature channel through an attention mechanism. It enhances the response of feature channels related to the hardness pattern of suspicious lesions while suppressing irrelevant feature channels generated by image noise or uniform background tissue, thereby improving the discriminative power and robustness of the feature representation.
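The residual channel attention step can be sketched as follows. The weights are random placeholders for learned parameters, the reduction ratio of 4 is an assumed value, and ReLU is assumed as the hidden activation of the shared multilayer perceptron.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_channel_attention(feat, w_down, w_up):
    """feat: (C, H, W); w_down: (C, C // r); w_up: (C // r, C)."""
    avg = feat.mean(axis=(1, 2))            # global average pooling -> (C,)
    mx = feat.max(axis=(1, 2))              # global max pooling     -> (C,)
    pooled = avg + mx                       # the two context vectors are added
    hidden = np.maximum(pooled @ w_down, 0.0)    # dimensionality reduction + ReLU
    scores = sigmoid(hidden @ w_up)              # dimensionality increase, squashed to (0, 1)
    calibrated = feat * scores[:, None, None]    # channel-by-channel recalibration
    return feat + calibrated, scores             # residual connection

rng = np.random.default_rng(0)
C, r = 8, 4
feat = rng.normal(size=(C, 16, 16))
w_down = rng.normal(size=(C, C // r)) * 0.1
w_up = rng.normal(size=(C // r, C)) * 0.1
weighted, scores = residual_channel_attention(feat, w_down, w_up)
```

Because of the residual connection, each channel is scaled by a factor between 1 and 2 in this sketch, so low-scoring channels are de-emphasized relative to high-scoring ones without being zeroed out.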

[0036] In specific implementation, capturing the multi-scale contextual information in the channel-weighted feature map to obtain an elastic feature map containing multi-scale contextual information can be achieved in the following way: the channel-weighted feature map can be input into a multi-scale context aggregation module, which adopts a parallel branch structure. One branch uses ordinary convolution with a dilation rate of 1 to capture detailed local contextual information. The second branch uses dilated convolution with a larger dilation rate to obtain a larger receptive field without increasing parameters or reducing resolution, so as to capture the interaction information between the lesion region and the surrounding tissue. The third branch uses global average pooling followed by an upsampling layer to obtain image-level global contextual semantics. The feature maps obtained from these three parallel branches are concatenated along the channel dimension to form a composite feature map that integrates local, mesoscopic, and global information. The composite feature map is then compressed by a 1x1 convolutional layer to achieve efficient information integration. The resulting fused and compressed feature map is output as an elastic feature map containing multi-scale contextual information. This elastic feature map is an integrated feature representation that further integrates contextual information from different spatial scales (such as local details, regional contrast, and global background). This elastic feature map not only contains local point information of hardness abnormalities but also its relationship with surrounding tissues, thus forming a more comprehensive and stable characterization of the hardness attributes of lesions.
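The three parallel branches and the 1x1 fusion can be sketched as follows. For brevity one 3x3 kernel is shared across all channels (a simplification; a real layer has a kernel per input and output channel pair), the dilation rate of 4 is an assumed value, and the weights are placeholders.

```python
import numpy as np

def dilated_conv3x3(feat, kernel, dilation):
    """Apply one 3x3 kernel to every channel of a (C, H, W) map.

    Zero padding keeps the spatial size. The receptive field is
    (2 * dilation + 1) pixels wide with the same nine weights.
    """
    c, h, w = feat.shape
    d = dilation
    padded = np.pad(feat, ((0, 0), (d, d), (d, d)))
    out = np.zeros(feat.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[:, i * d:i * d + h, j * d:j * d + w]
    return out

def multi_scale_context(feat, kernel, w_fuse, dilation=4):
    local = dilated_conv3x3(feat, kernel, 1)             # branch 1: fine detail
    regional = dilated_conv3x3(feat, kernel, dilation)   # branch 2: wider context
    # Branch 3: global average pooling, then upsampling back to (H, W)
    # by repeating the pooled value at every position.
    pooled = feat.mean(axis=(1, 2), keepdims=True)
    global_ctx = np.broadcast_to(pooled, feat.shape)
    stacked = np.concatenate([local, regional, global_ctx], axis=0)   # (3C, H, W)
    return np.einsum("oc,chw->ohw", w_fuse, stacked)     # 1x1 conv compression

rng = np.random.default_rng(0)
C = 4
feat = rng.normal(size=(C, 12, 12))
kernel = rng.normal(size=(3, 3))
w_fuse = rng.normal(size=(C, 3 * C)) * 0.1
elastic_feat = multi_scale_context(feat, kernel, w_fuse)
```

The dilated branch uses the same nine weights as the ordinary branch but spreads them over a 9x9 window at dilation 4, which is why the receptive field grows without adding parameters or reducing resolution.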

[0037] It should be noted that in this application, the second depth feature map fuses contextual information at different spatial scales of the target user's thyroid region, and its spatial resolution is the same as that of the first depth feature map. The second depth feature map ensures that the refined multi-scale depth features derived from the elastic modality can participate in subsequent cross-modal fusion with the correct spatial correspondence, and is one of the key inputs for accurate multimodal information collaboration. In specific implementation, upsampling the elastic feature map so that its spatial resolution is consistent with that of the first depth feature map, to obtain the second depth feature map of the target user's thyroid region, can be achieved as follows. The elastic feature map is input into a transposed convolutional upsampling layer. By setting specific stride, kernel size, and padding parameters, the transposed convolution operation expands the spatial size of the input feature map to a preset target size (e.g., the spatial size of the first depth feature map), and the upsampled feature map is output as the second depth feature map of the target user's thyroid region.
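How stride, kernel size, and padding determine the upsampled size can be sketched with the standard output-size relation for a transposed convolution (dilation 1), checked against a small one-dimensional implementation. The specific parameter values are examples, not values given in this application.

```python
import numpy as np

def transposed_conv_out_size(in_size, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution with dilation 1."""
    return (in_size - 1) * stride - 2 * padding + kernel + output_padding

def transposed_conv1d(x, kernel, stride, padding=0):
    """Minimal 1-D transposed convolution: scatter-add scaled kernel copies."""
    n, k = len(x), len(kernel)
    full = np.zeros((n - 1) * stride + k)
    for i, xi in enumerate(x):
        full[i * stride:i * stride + k] += xi * kernel
    # Padding in a transposed convolution crops the output borders.
    return full[padding:len(full) - padding]

# Two common settings that double the spatial size (16 -> 32).
size_a = transposed_conv_out_size(16, kernel=4, stride=2, padding=1)
size_b = transposed_conv_out_size(16, kernel=2, stride=2, padding=0)
upsampled = transposed_conv1d(np.ones(16), np.ones(4), stride=2, padding=1)
```

In a two-dimensional layer the same relation applies independently to height and width, so the parameters can be chosen to reach the spatial size of the first depth feature map.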

[0038] In some embodiments, capturing the hemodynamic spatial distribution characteristics of the parametric map in an image data block to obtain a third depth feature map of the target user's thyroid region can be achieved through the following steps: The parametric map of the angiography time-intensity curve in the image data block is input into a multi-scale spatial convolutional network to extract the spatial distribution features of hemodynamic parameters and generate a parametric spatial feature map. Global average pooling of the spatial dimension is performed on the parameter space feature map to obtain global spatial aggregated quantized features; The global spatial aggregated quantization features are fused with the parameter spatial feature map at the channel level to obtain a fused feature vector that integrates hemodynamic spatial distribution and global quantization information. The fused feature vector is subjected to dimensional transformation and feature integration to generate a third deep feature map with the same spatial resolution as the first deep feature map for the thyroid region of the target user.

[0039] In specific implementation, the parametric map of the contrast time-intensity curve in the image data block is input into a multi-scale spatial convolutional network to extract the spatial distribution features of hemodynamic parameters. The generation of the parameter spatial feature map can be achieved in the following way: a three-dimensional tensor representing the contrast time-intensity curve parametric map is extracted from the image data block. Its dimensions include the height, width, and parameter channels of the image. This three-dimensional tensor is input into a multi-scale spatial convolutional network, which consists of parallel 3×3 and 5×5 convolutional kernels and dilated convolutions. The spatial distribution features of hemodynamic parameters at different scales are extracted, such as the parameter gradient and parameter uniformity between the lesion area and normal tissue. The multi-scale features are concatenated in the channel dimension and then integrated through 1×1 convolution to generate the parameter spatial feature map. The parameter spatial feature map is a feature representation of the spatial distribution law of hemodynamic parameters in the thyroid anatomical region. This parameter spatial feature map can capture the spatial heterogeneity of blood perfusion in the lesion area and provide refined functional spatial features for subsequent fusion.

[0040] In specific implementation, the global average pooling of the spatial dimension of the parameter space feature map to obtain the global spatial aggregated quantization feature can be achieved in the following way: the global average pooling technique can be used to process the parameter space feature map, compressing the three-dimensional parameter space feature map into multiple one-dimensional feature sequences, and finally using the set of all one-dimensional feature sequences as the global spatial aggregated quantization feature; wherein, the global spatial aggregated quantization feature is a quantized statistical feature obtained by spatially aggregating the angiographic blood flow parameters of the target user's thyroid region. This feature can filter out local spatial noise and compress the multi-dimensional angiographic parameter map into a quantized vector representing the overall blood flow perfusion characteristics, providing global functional quantification information for subsequent fusion.

[0041] In specific implementation, the global spatial aggregation quantization feature and the parameter space feature map are fused at the channel level to obtain a fused feature vector that integrates hemodynamic spatial distribution and global quantization information. This can be achieved in the following way: the global spatial aggregation quantization feature can be dimensionally expanded to match the channel dimension of the parameter space feature map. Then, the global quantization information is integrated into the parameter space feature map by adding channels one by one to achieve channel-dimensional feature aggregation, resulting in a fused feature map as the fused feature vector that integrates hemodynamic spatial distribution and global quantization information. The fused feature vector is a dense vector that characterizes the hemodynamic spatial distribution and global quantization pattern of the target user's thyroid region. This vector can simultaneously encode the refined spatial features and overall quantization features of blood perfusion, achieving multi-level fusion of functional features.
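The pooling and channel-level fusion of the two preceding paragraphs can be sketched as follows. The array layout (channels first) and the function name are assumptions.

```python
import numpy as np

def fuse_global_into_map(param_feat):
    """param_feat: (C, H, W) parameter spatial feature map.

    Returns the fused map and the global quantized vector.
    """
    # Global average pooling over the spatial dimensions: one value per channel.
    global_vec = param_feat.mean(axis=(1, 2))           # (C,)
    # Expand to (C, 1, 1) so it broadcasts over H and W, then add channel by channel.
    fused = param_feat + global_vec[:, None, None]
    return fused, global_vec

rng = np.random.default_rng(0)
param_feat = rng.normal(size=(6, 10, 10))
fused, global_vec = fuse_global_into_map(param_feat)
```

Every position in a channel receives the same offset, so local perfusion contrast within the channel is preserved while the overall perfusion level of that channel is encoded at each location.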

[0042] In specific implementation, the dimensional transformation and feature integration of the fused feature vector to generate a third deep feature map with the same spatial resolution as the first deep feature map for the target user's thyroid region can be achieved in the following way: The fused feature vector can be input into a fully connected layer, and a learnable weight matrix can be used to perform linear transformation and non-linear activation on the input vector. The features are then integrated and the output dimension is adjusted to match the number of channels expected in subsequent fusion steps, such as the number of channels in the first deep feature map. The feature vector processed by the fully connected layer is then spatially reconstructed, that is, the feature vector is copied and expanded in spatial dimension to convert it into a two-dimensional feature map with the same spatial size as the first deep feature map. Finally, this two-dimensional feature map is used as the third deep feature map with the same spatial resolution as the first deep feature map for the target user's thyroid region.

[0043] It should be noted that in this application, the third depth feature map characterizes the spatial distribution and global quantitative pattern of hemodynamics in the thyroid region of the target user, and its size is the same as that of the first depth feature map. The third depth feature map integrates and spatializes the spatial distribution features of blood perfusion and the global quantification features, so that it can be fused element-by-element across modalities in the same structured format as the morphological feature map and the elasticity feature map, thereby unifying and coordinating functional, morphological, and stiffness information at the feature level.

[0044] In step 103, the first depth feature map, the second depth feature map, and the third depth feature map are fused across modalities to obtain a comprehensive feature map of the thyroid lesion region of the target user.

[0045] In some embodiments, cross-modal fusion of the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the thyroid lesion region of the target user can be achieved by the following steps: The first depth feature map, the second depth feature map, and the third depth feature map are input into a cross-modal collaborative attention fusion network to determine the attention weight matrix between each depth feature map. Based on all attention weight matrices, collaborative attention fusion is performed on the first deep feature map, the second deep feature map, and the third deep feature map to form a high-dimensional fused feature map. The high-dimensional fusion feature map is reduced in dimension and deeply integrated to obtain a comprehensive feature map of the thyroid lesion region of the target user.

[0046] In specific implementation, inputting the first, second, and third depth feature maps into the cross-modal collaborative attention fusion network to determine the attention weight matrix between each pair of depth feature maps can be achieved as follows. First, the feature vectors at each spatial location in the first depth feature map (morphological features), the second depth feature map (elasticity features), and the third depth feature map (hemodynamic features) are mapped to corresponding query vector groups, key vector groups, and value vector groups using three independent, learnable linear projection matrices. Then, for any two depth feature maps of different modalities, such as the morphological feature map and the elasticity feature map, the dot product similarity between the query vector group of one depth feature map and the key vector group of the other is calculated and scaled using a learnable scaling factor. The Softmax function is then applied to normalize all scaled dot product similarities, yielding the attention weight matrix between the two depth feature maps of different modalities. Repeating this for every pair yields the attention weight matrix between each pair of depth feature maps. The attention weight matrix quantifies the degree of semantic association between feature maps of different modalities at different spatial locations. It enables the system to focus adaptively and selectively on complementary key information from the morphology, stiffness, and blood flow modalities while suppressing redundancy and noise interference, thus providing precise guidance for subsequent fusion.

[0047] In specific implementation, performing collaborative attention fusion on the first depth feature map, the second depth feature map, and the third depth feature map based on all attention weight matrices to form the high-dimensional fusion feature map can be achieved as follows. The value vector groups corresponding to the first, second, and third depth feature maps are weighted and summed based on the attention weight matrices. Specifically, for a given target modality feature map (e.g., the first depth feature map), the attention weight matrices computed with the other modalities (e.g., the second and third depth feature maps) are used to weight and aggregate the value vectors of those other modalities, resulting in a set of intermediate feature maps enhanced by the contextual information of the other modalities. The original feature map of the target modality is then added element-wise to all of its intermediate enhanced feature maps to achieve residual fusion of information, resulting in an enhanced representation map of the target modality. Through the above steps, enhanced representation maps of the depth feature maps of all three modalities are obtained. Finally, the three enhanced representation maps are concatenated along the channel dimension to form a tensor with a multiplied number of channels, which serves as the high-dimensional fusion feature map. The high-dimensional fusion feature map is an intermediate feature representation with rich channel dimensions that characterizes the original complementary information and enhanced contextual information from the different physical sources in the thyroid region of the target user. It retains this information to the greatest extent, providing a sufficient data foundation for generating highly discriminative final features.
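The pairwise attention weights and the residual fusion of the two preceding paragraphs can be sketched together as follows. Feature maps are flattened to (positions, channels); the projection matrices are random placeholders for learned weights; the fixed scale 1/sqrt(C) stands in for the learnable scaling factor; and the value projection is assumed to keep the channel count so that the residual addition is well defined.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(feats, proj, scale):
    """feats: {name: (L, C)}; proj: {name: (wq, wk, wv)}, each (C, C)."""
    qkv = {n: tuple(f @ w for w in proj[n]) for n, f in feats.items()}
    enhanced = []
    for tgt, f in feats.items():
        q = qkv[tgt][0]
        out = f.copy()                             # start from the original map
        for src in feats:
            if src == tgt:
                continue
            _, k, v = qkv[src]
            attn = softmax(q @ k.T * scale)        # (L, L) cross-modal weights
            out = out + attn @ v                   # residual fusion of context
        enhanced.append(out)
    return np.concatenate(enhanced, axis=-1)       # channel count is tripled

rng = np.random.default_rng(0)
L, C = 16, 8                                       # 4x4 positions, 8 channels
names = ["morphology", "elasticity", "hemodynamics"]
feats = {n: rng.normal(size=(L, C)) for n in names}
proj = {n: tuple(rng.normal(size=(C, C)) * 0.1 for _ in range(3)) for n in names}
fused = cross_modal_fusion(feats, proj, scale=1.0 / np.sqrt(C))
```

Each modality keeps its own features and only adds what the other two contribute at the positions it attends to, which is why the concatenated result still contains the original information of every source.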

[0048] In specific implementation, the high-dimensional fusion feature map is reduced in dimensionality and deeply integrated to obtain a comprehensive feature map of the target user's thyroid lesion region. This can be achieved by the following method: the high-dimensional fusion feature map can be input into a 1x1 convolutional layer with a bottleneck structure. This convolutional layer forces the high-dimensional features to be compressed and filtered by setting the number of output channels to be much less than the number of input channels. A batch normalization layer and a nonlinear activation function layer are connected to stabilize the training process and introduce nonlinear transformation capability into the model, promoting the interaction and integration of deep information between different modalities. The activated features are then passed through a second 1x1 convolutional layer to adjust the number of channels to the preset dimension for output, thus obtaining a comprehensive feature map of the target user's thyroid lesion region.
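
The bottleneck structure described above can be sketched as follows. A 1x1 convolution is equivalent to applying the same linear map to the channel vector of every pixel, so the sketch operates on a list of per-pixel channel vectors. The batch normalization here is a simplified version without learned scale and shift, ReLU is assumed as the nonlinear activation, and the names `conv1x1`, `batch_norm`, `relu`, and `bottleneck` are hypothetical.

```python
import math

def conv1x1(pixels, weight, bias):
    """1x1 convolution: pixels is n x c_in, weight is c_out x c_in, bias has c_out entries."""
    return [[sum(w * x for w, x in zip(w_row, p)) + b
             for w_row, b in zip(weight, bias)] for p in pixels]

def batch_norm(pixels, eps=1e-5):
    """Normalize each channel over all pixels (no learned affine parameters)."""
    n, c = len(pixels), len(pixels[0])
    out = [[0.0] * c for _ in range(n)]
    for ch in range(c):
        col = [p[ch] for p in pixels]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][ch] = (col[i] - mean) / math.sqrt(var + eps)
    return out

def relu(pixels):
    """Nonlinear activation (ReLU is an assumption; the text does not name one)."""
    return [[max(0.0, v) for v in p] for p in pixels]

def bottleneck(pixels, w1, b1, w2, b2):
    """Compress channels, normalize and activate, then map to the preset output dimension."""
    hidden = relu(batch_norm(conv1x1(pixels, w1, b1)))
    return conv1x1(hidden, w2, b2)
```

The first weight matrix has far fewer rows than columns, which is what forces the compression; the second restores the channel count to whatever dimension the downstream heads expect.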

[0049] It should be noted that in this application, the comprehensive feature map is a compact feature representation that characterizes the original complementary information and enhanced contextual information of different physical sources in the thyroid region of the target user. This comprehensive feature map deeply integrates all-round discriminative information of morphology, hardness and function to form the most refined feature expression for the lesion, which can directly and efficiently drive multiple clinical decision-making tasks such as subsequent segmentation, classification and grading.

[0050] In step 104, the comprehensive feature map is input into the multi-task learning head for feature adaptation and optimization, and then the lesion segmentation mask, benign and malignant probability value and risk classification of the target user's thyroid region are inferred and output in parallel.

[0051] In some embodiments, the process of inputting the comprehensive feature map into a multi-task learning head for feature adaptation and optimization, and then inferring and outputting the lesion segmentation mask, benign / malignant probability value, and risk classification of the target user's thyroid region in parallel, can be achieved by the following steps: channel adaptation and multi-scale context enhancement are performed on the comprehensive feature map to obtain an optimized feature map; the optimized feature map is input in parallel to the three task prediction heads of the multi-task learning head; semantic segmentation is performed by the segmentation prediction head, which outputs the lesion segmentation mask of the target user's thyroid region; binary classification is performed by the benign / malignant classification head, which outputs the benign / malignant probability value of the corresponding lesion; and multi-category classification is performed by the risk grading head, which outputs the risk grading category of the corresponding lesion.

[0052] It should be noted that in this application, the multi-task learning head is the core decision-making module used to simultaneously generate multiple clinical diagnostic results. It consists of a shared feature decoding backbone network and three parallel task-specific prediction sub-networks (i.e., segmentation prediction head, benign / malignant classification head, and risk grading head). Through a unified model architecture and shared feature foundation, this multi-task learning head can simultaneously and efficiently complete three closely related clinical tasks: lesion localization, characterization, and risk assessment. This can significantly improve the overall efficiency of the diagnostic process and avoid the system complexity and data inconsistency problems caused by multiple independent models. Moreover, through knowledge sharing and joint optimization between tasks, each task can promote and regularize each other, thereby comprehensively improving the accuracy of each task and the generalization ability of the model, and finally outputting a complete, self-consistent diagnostic report that can directly guide clinical practice.

[0053] In specific implementation, channel adaptation and multi-scale context enhancement are performed on the comprehensive feature map to obtain an optimized feature map. This can be achieved by adapting the channel dimension of the comprehensive feature map and supplementing it with multi-scale features through the feature decoding backbone network of the multi-task learning head. This feature decoding backbone network consists of multiple alternating transposed convolutional layers and batch normalization layers. Each transposed convolutional layer enlarges the spatial size of the feature map by a fixed factor. After each upsampling operation, intermediate feature maps from the corresponding resolution level in the encoder network are introduced through skip connections to supplement details that may have been lost during encoding. The fused feature map is then passed through a nonlinear activation layer to improve the model's expressive power, and a high-resolution feature map is output as the optimized feature map. The optimized feature map is a deep feature representation that retains high spatial resolution after channel adaptation and multi-scale detail enhancement. It provides a unified, detailed, and semantically rich shared feature foundation for all subsequent parallel diagnostic tasks, ensuring that the prediction results of the different tasks are spatially and semantically consistent.
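
One decoding step of the kind described above can be sketched in one dimension as follows. This is a simplified illustration: a stride-2 transposed convolution doubles the length of the signal, the skip connection is assumed to be fused by element-wise addition (the text does not state whether addition or concatenation is used), batch normalization is omitted for brevity, and the names `transposed_conv1d` and `decoder_step` are hypothetical.

```python
def transposed_conv1d(x, kernel, stride):
    """1-D transposed convolution: each input sample scatters a scaled copy of the kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out

def decoder_step(x, skip, kernel=(1.0, 1.0), stride=2):
    """Upsample x, fuse the encoder feature of matching resolution, then apply ReLU."""
    up = transposed_conv1d(x, kernel, stride)
    fused = [u + s for u, s in zip(up, skip)]  # skip connection (addition assumed)
    return [max(0.0, f) for f in fused]
```

With a kernel of length 2 and a stride of 2, an input of length n produces an output of length 2n, which is the one-dimensional analogue of doubling the spatial size of a feature map.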

[0054] In specific implementation, inputting the optimized feature map in parallel to the three task prediction heads of the multi-task learning head, performing semantic segmentation through the segmentation prediction head, and outputting the lesion segmentation mask of the target user's thyroid region can be implemented in the following way: the optimized feature map is input in parallel to the three task-specific prediction sub-networks of the multi-task learning head. In the segmentation prediction head, a feature adaptation convolutional layer adjusts the number of channels of the input optimized feature map and performs local feature optimization to make it more suitable for pixel-level classification. The adapted feature map is then input into a multi-scale context-aware module. By using dilated convolutional layers with different dilation rates in parallel, this module captures the multi-scale contextual relationship between the lesion and the surrounding tissue without reducing resolution. The feature map output by the multi-scale context-aware module is then processed by a 1x1 convolution kernel in the pixel classification convolutional layer, which maps the feature vector of each pixel to a category score. A Sigmoid activation function is applied to the score of each pixel to generate a probability map giving the probability that each pixel belongs to the target lesion. Finally, the probability map is binarized with a threshold (e.g., 0.5), and the resulting binary mask is used as the lesion segmentation mask for the target user's thyroid region. The lesion segmentation mask is a binary image in the coordinate space of the original ultrasound image, in which the pixel regions identified by the system as thyroid lesions are marked.
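
The final two steps of the segmentation prediction head, Sigmoid activation followed by thresholding, can be sketched as follows. The sketch assumes the per-pixel category scores are already available as a 2-D list; the names `sigmoid` and `segmentation_mask` are hypothetical, and the default threshold of 0.5 is the example value given above.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def segmentation_mask(scores, threshold=0.5):
    """Convert per-pixel scores (H x W) into a probability map and a binary mask."""
    probs = [[sigmoid(s) for s in row] for row in scores]
    mask = [[1 if p > threshold else 0 for p in row] for row in probs]
    return probs, mask
```

Raising the threshold trades sensitivity for specificity at the lesion boundary, so in practice it would likely be tuned on validation data rather than fixed at 0.5.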

[0055] In specific implementation, performing binary classification through the benign / malignant classification head and outputting the benign / malignant probability value of the corresponding lesion can be achieved in the following way: in the benign / malignant classification head, a global average pooling layer and a global max pooling layer process the input optimized feature map in parallel, and their outputs are concatenated along the channel dimension so that global statistical features and the most salient features are aggregated at the same time. The aggregated global feature vector then undergoes nonlinear transformation and dimensionality reduction through a fully connected layer. The dimensionality-reduced high-level semantic feature vector is input to the final classification output layer, which is a fully connected layer with 2 nodes. The two node values of the output layer are normalized to a probability distribution by the Softmax activation function, and the resulting benign and malignant probabilities are output as the benign / malignant probability value of the corresponding lesion. The benign / malignant probability value is a continuous value between 0 and 1 representing the likelihood that the current thyroid lesion is malignant.
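
The benign / malignant classification head can be sketched as follows. This is an illustrative sketch only: the feature map is assumed to be a list of channels, each an H x W grid; ReLU is assumed as the nonlinearity of the hidden layer; the second output node is assumed to correspond to the malignant class; and the names `global_pool`, `dense`, and `benign_malignant_head` are hypothetical.

```python
import math

def global_pool(feature_map):
    """Concatenate global average pooling and global max pooling over each channel."""
    avg = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    mx = [max(max(r) for r in ch) for ch in feature_map]
    return avg + mx

def dense(x, weight, bias):
    """Fully connected layer: weight is out x in, bias has out entries."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(z):
    """Numerically stable Softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def benign_malignant_head(feature_map, w1, b1, w2, b2):
    """Return (p_benign, p_malignant); the index order is an assumption."""
    pooled = global_pool(feature_map)
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]  # ReLU assumed
    p_benign, p_malignant = softmax(dense(hidden, w2, b2))
    return p_benign, p_malignant
```

Because the two outputs sum to 1, reporting only the malignant probability loses no information, which is consistent with the single continuous value described above.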

[0056] In specific implementation, performing multi-class classification through the risk grading head and outputting the risk grading category of the corresponding lesion can be achieved in the following way: in the risk grading head, a global average pooling layer and a global max pooling layer process the input optimized feature map in parallel, and their outputs are concatenated along the channel dimension. These pooling and concatenation paths are independent of those in the benign / malignant classification head to ensure task-specific feature extraction. The concatenated feature vector is then input into a dedicated grading feature transformation network. A grading output layer whose number of nodes equals the number of risk levels, followed by a Softmax activation function, yields the discrete probability distribution of the thyroid lesion over the risk levels, and the level with the highest probability is output as the risk grading category of the corresponding lesion. The grading feature transformation network consists of fully connected layers whose depth and width can be adjusted according to the number of grading categories in order to learn more complex decision boundaries. Regularization techniques and nonlinear activations are also introduced into the feature transformation network to improve the generalization ability of the model. The risk grading category is a discrete result that stratifies the malignancy risk of thyroid lesions according to an internationally recognized standard (such as the grading categories of the American College of Radiology Thyroid Imaging Reporting and Data System).
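
The last step of the risk grading head, selecting the level with the highest Softmax probability, can be sketched as follows. The five labels TR1 to TR5 follow the ACR TI-RADS levels mentioned above as an example standard; their use as output classes, and the names `RISK_LEVELS` and `risk_grade`, are assumptions made for illustration.

```python
import math

# Example label set (assumed): the five ACR TI-RADS levels.
RISK_LEVELS = ["TR1", "TR2", "TR3", "TR4", "TR5"]

def softmax(z):
    """Numerically stable Softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def risk_grade(logits, levels=RISK_LEVELS):
    """Return the most probable risk level together with the full distribution."""
    if len(logits) != len(levels):
        raise ValueError("the output layer must have one node per risk level")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return levels[best], probs
```

Returning the full distribution alongside the selected level lets a reviewer see how close the runner-up level was, which may be useful when two adjacent grades are nearly tied.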

[0057] Furthermore, in another aspect of this application, some embodiments provide a multimodal ultrasound thyroid lesion precision diagnosis and treatment system, which includes a lesion intelligent identification unit. Figure 4 is a schematic diagram of the structure of a lesion intelligent identification unit 400 according to some embodiments of this application. The lesion intelligent identification unit 400 includes an acquisition module 401, a processing module 402, and an execution module 403, which are described below. The acquisition module 401 is mainly used to acquire grayscale ultrasound image sequences, elastography maps, and parametric maps of contrast time-intensity curves at the same anatomical location in the thyroid region of the target user, thereby obtaining multimodal image data, and to perform three-dimensional spatial registration and temporal alignment on the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment. The processing module 402 is mainly used to extract the spatiotemporally dependent lesion morphological features of the grayscale ultrasound image sequence in the image data block to obtain the first depth feature map of the target user's thyroid region, to perform local feature enhancement and extraction on the elastography map in the image data block to obtain the second depth feature map of the target user's thyroid region, and to capture the hemodynamic spatial distribution features of the parametric map in the image data block to obtain the third depth feature map of the target user's thyroid region.
It should be noted that the processing module 402 is also used to perform cross-modal fusion on the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the thyroid lesion region of the target user. The execution module 403 is mainly used to input the comprehensive feature map into the multi-task learning head for feature adaptation and optimization, and then to infer and output in parallel the lesion segmentation mask, benign / malignant probability value, and risk classification of the thyroid region of the target user.

[0058] Each module in the aforementioned multimodal ultrasound precision diagnosis and treatment system for thyroid lesions can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in the processor of a computer device in hardware form or independent of it, or stored in the memory of the computer device in software form, so that the processor can call and execute the corresponding operations of each module.

[0059] In another embodiment, this application provides a computer device, which may be a server, and whose internal structure may be as shown in Figure 5. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores data for the multimodal ultrasound intelligent identification method for thyroid lesions. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements the multimodal ultrasound intelligent identification method for thyroid lesions.

[0060] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of the part of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, combine certain components, or have different component arrangements.

[0061] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above embodiments of the multimodal ultrasound intelligent identification method for thyroid lesions.

[0062] In one embodiment, a computer-readable storage medium is provided storing a computer program that, when executed by a processor, implements the steps in the above-described embodiments of the multimodal ultrasound intelligent identification method for thyroid lesions.

[0063] In one embodiment, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the steps described in the embodiments of the multimodal ultrasound intelligent identification method for thyroid lesions.

[0064] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.

[0065] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0066] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A multimodal ultrasound thyroid lesion intelligent identification method, characterized in that it includes the following steps: acquiring grayscale ultrasound image sequences, elastography maps, and parametric maps of contrast time-intensity curves at the same anatomical location in the thyroid region of the target user, thereby obtaining multimodal image data, and performing three-dimensional spatial registration and temporal alignment on the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment; extracting spatiotemporally dependent lesion morphological features from the grayscale ultrasound image sequence in the image data block to obtain the first depth feature map of the target user's thyroid region, performing local feature enhancement and extraction on the elastography map in the image data block to obtain the second depth feature map of the target user's thyroid region, and capturing the hemodynamic spatial distribution features of the parametric map in the image data block to obtain the third depth feature map of the target user's thyroid region; performing cross-modal fusion on the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the thyroid lesion region of the target user; and inputting the comprehensive feature map into a multi-task learning head for feature adaptation and optimization, and then inferring and outputting in parallel the lesion segmentation mask, benign and malignant probability values, and risk classification categories of the target user's thyroid region.

2. The method of claim 1, wherein performing three-dimensional spatial registration and temporal alignment on the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment specifically includes: constructing a three-dimensional spatial coordinate system using keyframes with anatomical landmarks within the grayscale ultrasound image sequence of the multimodal image data as the spatial reference datum; based on the three-dimensional spatial coordinate system, spatially transforming the elastography map and the parametric map in the multimodal image data to obtain the spatially registered elastography map and parametric map; performing temporal interpolation and resampling on the spatially registered parametric map to align it temporally with the grayscale ultrasound image sequence, resulting in a temporally aligned parametric map; and performing tensor stitching and dimensional reorganization on the grayscale ultrasound image sequence, the temporally aligned parametric map, and the spatially registered elastography map to generate the image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment.

3. The method of claim 1, wherein extracting spatiotemporally dependent lesion morphological features from the grayscale ultrasound image sequence in the image data block to obtain the first depth feature map of the target user's thyroid region specifically includes: inputting the grayscale ultrasound image sequence in the image data block into a spatiotemporal feature coding network to extract continuous local texture and spatial structure features in the image sequence, generating a preliminary spatiotemporal feature map; determining, based on the spatiotemporal feature map, an enhanced feature map with global spatiotemporal context information; and performing feature projection and aggregation on the enhanced feature map to generate the first depth feature map of the target user's thyroid region.

4. The method of claim 1, wherein performing local feature enhancement and extraction on the elastography map in the image data block to obtain the second depth feature map of the target user's thyroid region specifically includes: inputting the elastography map in the image data block into a convolutional network based on refocusing convolution kernels for local feature enhancement to generate an initial elastic feature map; enhancing the lesion-related feature channels and suppressing irrelevant noise channels in the initial elastic feature map to obtain a channel-weighted feature map; capturing multi-scale contextual information in the channel-weighted feature map to obtain an elastic feature map containing multi-scale contextual information; and upsampling the elastic feature map to make its spatial resolution consistent with that of the first depth feature map, thereby obtaining the second depth feature map of the target user's thyroid region.

5. The method of claim 1, wherein capturing the hemodynamic spatial distribution features of the parametric map in the image data block to obtain the third depth feature map of the target user's thyroid region specifically includes: inputting the parametric map of the contrast time-intensity curve in the image data block into a multi-scale spatial convolutional network to extract the spatial distribution features of hemodynamic parameters and generate a parameter spatial feature map; performing global average pooling over the spatial dimensions of the parameter spatial feature map to obtain global spatial aggregated quantization features; fusing the global spatial aggregated quantization features with the parameter spatial feature map at the channel level to obtain a fused feature vector that integrates hemodynamic spatial distribution and global quantization information; and performing dimensional transformation and feature integration on the fused feature vector to generate the third depth feature map of the target user's thyroid region, which has the same spatial resolution as the first depth feature map.

6. The method of claim 1, wherein performing cross-modal fusion on the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the target user's thyroid lesion region specifically includes: inputting the first depth feature map, the second depth feature map, and the third depth feature map into a cross-modal collaborative attention fusion network to determine the attention weight matrix between each depth feature map; performing, based on all attention weight matrices, collaborative attention fusion on the first depth feature map, the second depth feature map, and the third depth feature map to form a high-dimensional fusion feature map; and reducing the dimension of the high-dimensional fusion feature map and deeply integrating it to obtain the comprehensive feature map of the thyroid lesion region of the target user.

7. The method of claim 1, wherein inputting the comprehensive feature map into a multi-task learning head for feature adaptation and optimization, and then inferring and outputting in parallel the lesion segmentation mask, benign / malignant probability value, and risk classification category of the target user's thyroid region specifically includes: performing channel adaptation and multi-scale context enhancement on the comprehensive feature map to obtain an optimized feature map; inputting the optimized feature map in parallel to the three task prediction heads of the multi-task learning head; performing semantic segmentation through the segmentation prediction head and outputting the lesion segmentation mask of the target user's thyroid region; performing binary classification through the benign / malignant classification head and outputting the benign / malignant probability value of the corresponding lesion; and performing multi-category classification through the risk grading head and outputting the risk grading category of the corresponding lesion.

8. A multimodal ultrasound precision diagnosis and treatment system for thyroid lesions, the system comprising an intelligent lesion identification unit, characterized in that the intelligent lesion identification unit includes: an acquisition module, used to acquire grayscale ultrasound image sequences, elastography maps, and parametric maps of contrast time-intensity curves at the same anatomical location in the thyroid region of the target user, thereby obtaining multimodal image data, and to perform three-dimensional spatial registration and temporal alignment on the multimodal image data to obtain image data blocks with multimodal spatial registration and grayscale-contrast temporal alignment; a processing module, used to extract the spatiotemporally dependent lesion morphological features from the grayscale ultrasound image sequence in the image data block to obtain the first depth feature map of the target user's thyroid region, to perform local feature enhancement and extraction on the elastography map in the image data block to obtain the second depth feature map of the target user's thyroid region, and to capture the hemodynamic spatial distribution features of the parametric map in the image data block to obtain the third depth feature map of the target user's thyroid region, the processing module being further used to perform cross-modal fusion of the first depth feature map, the second depth feature map, and the third depth feature map to obtain a comprehensive feature map of the thyroid lesion region of the target user; and an execution module, used to input the comprehensive feature map into the multi-task learning head for feature adaptation and optimization, and then to infer and output in parallel the lesion segmentation mask, benign and malignant probability value, and risk classification of the thyroid region of the target user.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the multimodal ultrasound intelligent identification method for thyroid lesions as described in any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the multimodal ultrasound intelligent identification method for thyroid lesions as described in any one of claims 1 to 7.