A vertebral classification system and method based on segmentation-guided coding for X-ray films
By developing a segmentation-guided coding-based X-ray vertebral classification system, the problems of semantic alignment and deep collaborative relationships between images and text in fine-grained classification of spinal vertebrae in multimodal image-text understanding frameworks were solved, achieving highly accurate and robust vertebral classification.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NINGBO UNIV
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-30
AI Technical Summary
Existing multimodal image-text understanding frameworks suffer from several problems in fine-grained classification of spinal vertebrae, including difficulty in achieving sufficient semantic alignment between images and text, difficulty in distinguishing subtle differences, and difficulty in effectively capturing deep bidirectional collaborative relationships between image and text information.
A segmentation-guided coding-based X-ray vertebral classification system is adopted. Through an image acquisition device, a prompt information acquisition device, a segmentation coding device, and a vertebral classification device, vertebral region segmentation, feature extraction, and feature fusion are performed. Combined with the operation rules of multi-level conditional random fields, the system achieves image-text semantic alignment and capture of deep bidirectional collaborative relationships.
It improves the accuracy and robustness of vertebral classification, ensures the spatial continuity and anatomical rationality of classification results, provides high-quality candidate regions and rich location features, and enhances clinical interpretability and reliability.
Figure CN121904487B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer modeling and systems technology, and more specifically, to a vertebral classification system and method based on segmentation guided coding for X-ray films.
Background Technology
[0002] Automatic vertebral detection in spinal imaging is crucial for various medical applications, including early diagnosis of spinal diseases, preoperative planning, and postoperative evaluation. This task requires not only precise localization of each vertebra but also fine-grained vertebral classification. However, fine-grained classification is challenging due to several factors: first, adjacent vertebrae are highly similar in morphology and lack clear boundary differences; second, the Field of View (FoV) variations common in X-ray imaging limit the model's ability to capture complete context; and third, vertebral edges are blurred and have low contrast in X-ray images, making it difficult to distinguish the vertebral region from background tissue.
[0003] In recent years, in order to further improve the fine-grained classification of vertebrae, existing technologies have adopted a technical solution that combines the semantic understanding module of the Contrastive Language-Image Pretraining (CLIP) model with the fine segmentation module of the Segment Anything Model (SAM) to construct a multimodal image and text understanding framework. This multimodal image and text understanding framework has shown good performance in tasks such as zero-shot detection, image and text segmentation, and medical structure recognition.
[0004] However, directly applying this multimodal image-text understanding framework to fine-grained classification of spinal vertebrae still faces significant challenges. First, the CLIP model and SAM are pre-trained for global semantics and local structure, respectively, and lack targeted modeling for targets such as vertebrae, which have repetitive structures and ambiguous boundaries. A modality gap exists where the features of the two models are joined, making it difficult to fully achieve image-text semantic alignment. Second, adjacent vertebrae often have highly similar semantic labels (e.g., "first thoracic vertebra," "second thoracic vertebra," "third thoracic vertebra"), making it difficult for this framework to distinguish such subtle differences. Third, existing fusion methods mostly rely on shallow concatenation or unidirectional guidance, making it difficult to effectively capture the deep bidirectional collaborative relationships between image and text information.
Summary of the Invention
[0005] The technical problem this invention aims to solve is how to overcome the technical defects of existing multimodal image and text understanding frameworks in completing fine-grained vertebral classification, such as difficulty in fully achieving semantic alignment between images and text, difficulty in reasonably distinguishing subtle differences, and difficulty in effectively capturing deep bidirectional collaborative relationships between image and text information. To overcome these defects of the prior art, this invention provides an X-ray vertebral classification system and method based on segmentation guided coding, specifically including an X-ray vertebral classification system based on segmentation guided coding and an X-ray vertebral classification method based on segmentation guided coding.
[0006] This invention provides a segmentation-guided coding-based X-ray vertebral classification system, comprising:
[0007] An image acquisition device, used to acquire X-ray images of the spine of the individual being tested;
[0008] A prompt information acquisition device, which communicates with the image acquisition device and is used to perform target detection and key point detection of the vertebrae in the spinal X-ray image to obtain prompt information indicating prompt boxes and prompt points;
[0009] The segmentation and encoding device communicates with both the image acquisition device and the prompt information acquisition device to extract features from the spinal X-ray image to obtain image feature representation. Simultaneously, it segments the spinal X-ray image into vertebral regions based on the prompt information, generates shape information and spatial location information of each vertebra, and encodes them as location embedding features. Then, it performs feature fusion based on attention weighting with the image feature representation and the location embedding features to obtain a fused image feature sequence.
[0010] The vertebral classification device communicates with the segmentation and coding device to calculate the image-text feature similarity score between the fused image feature sequence and the vertebral category list, and substitutes the image-text feature similarity score into the operation rules of a multilayer conditional random field to obtain the X-ray vertebral classification result.
[0011] The segmentation-guided coding-based X-ray vertebral classification system disclosed in this invention addresses the aforementioned technical deficiencies by employing an image acquisition unit, a prompt information acquisition unit, a segmentation coding device, and a vertebral classification device. The prompt information acquisition unit obtains prompt information displaying prompt boxes and points. The segmentation coding device extracts features from the spinal X-ray image and segments the vertebral region based on the prompt information, generating shape and spatial location information for each vertebra, which is then encoded as location embedding features. This segmentation-guided vertebral coding stage, by generating preliminary vertebral segmentation location embedding features and combining them with the image features obtained from feature extraction, ensures image enhancement during the fusion process. Furthermore, the segmentation coding device performs attention-weighted feature fusion of image feature representation and location embedding features to obtain a fused image feature sequence. This method of fusing positional shape information with semantic representation promotes image-text semantic alignment and image enhancement, providing high-quality candidate regions and rich location features for subsequent classification tasks. Simultaneously, the image-text feature similarity score between the fused image feature sequence and the vertebral category list is calculated using the configured vertebral classification device to distinguish subtle differences. The image-text feature similarity score is then substituted into the computational rules of a multi-level conditional random field (MLF) to obtain the X-ray vertebral classification results. 
This region-aware vertebral classification stage employs conditional random field-based computational rules to optimize spatial sequence alignment, recasting the vertebral classification task as a structured prediction problem. It models the joint distribution of vertebral sequences, decodes the optimal label sequence, and achieves image-text semantic alignment. This ensures that the predicted vertebral classification results satisfy spatial continuity and anatomical rationality and effectively captures the deep bidirectional synergistic relationship between image and text information, yielding vertebral classification results with good clinical interpretability and reliability.
[0012] In one possible implementation, the segmentation encoding device includes:
[0013] The segmentation model is used to segment the vertebral region of the spinal X-ray image according to the prompt information, generate shape information and spatial location information of each vertebra, and encode the obtained shape information and spatial location information into location embedding features;
[0014] An image semantic coding device is used to extract features from the spinal X-ray image to obtain an image feature representation with at least three levels of features;
[0015] The fusion module communicates with both the segmentation model and the image semantic encoding device to perform attention-weighted feature fusion of the image feature representation and the location embedding features, and integrates the fusion results at multiple scales to obtain a fusion image feature sequence composed of fusion image features of each vertebra.
[0016] The segmentation model segments the vertebral regions of the input spinal X-ray image, generating corresponding vertebral shape and spatial location information, which is then encoded as location embedding features. The image semantic encoding device extracts features from the input spinal X-ray image, obtaining image feature representations with medical semantic discriminative properties. The feature fusion module adaptively fuses the image feature representations and location embedding features through attention weighting, resulting in fused image features that integrate vertebral semantic and structural location information, providing stable and semantically consistent image feature representations for vertebral classification.
[0017] In one possible implementation, the segmentation model includes a prompt encoder, an image encoder, and a mask decoder. The mask decoder communicates simultaneously with the prompt encoder, the image encoder, and the fusion module. The image encoder performs feature segmentation encoding on the spinal X-ray image, the prompt encoder performs feature segmentation encoding on the prompt information, and the mask decoder decodes the encoding result obtained by the image encoder based on the encoding result obtained by the prompt encoder to obtain the location embedding features. Decoding the two encoding results in this way completes the vertebral region segmentation and provides high-quality candidate regions and rich location features for subsequent classification tasks.
[0018] In one possible implementation, the image semantic encoding device is a contrastive language-image pre-training image encoder obtained through image pre-training. This encoder can obtain image feature representations with medical semantic discriminative properties, and the obtained image feature representations have at least three levels of features, which effectively improves the working efficiency of the fusion module.
[0019] In one possible implementation, the fusion module is configured to operate as follows:
[0020] A1: Perform multi-head attention fusion between the location embedding feature and each level of the image feature representation to obtain the fusion results of the location embedding feature and each level of feature;
[0021] A2: Perform feature pyramid pooling operation on each of the fusion results to obtain the corresponding pooling operation results;
[0022] A3: Perform channel concatenation on all the pooling operation results obtained in step A2 to obtain the concatenation result;
[0023] A4: Perform multi-head attention fusion of the location embedding features and the stitching result to obtain the fused image feature sequence.
[0024] The feature fusion module that performs the above steps takes each level of image feature representation, namely image semantic features, as the main feature and introduces location embedding features. It achieves adaptive fusion of the two through attention weighting and performs multi-head attention fusion on the fusion result to obtain fused image features containing vertebral semantic information and structural location information, thus providing the vertebral classification module with stable semantic consistency image feature representation.
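The data flow of steps A1 to A4 can be illustrated with the following minimal sketch. It is not the disclosed implementation: it uses single-head scaled dot-product attention in place of multi-head attention, a one-dimensional average pooling over 1, 2, and 4 channel bins as a stand-in for the feature pyramid pooling operation, and a fixed folding map (`fold`) in place of a learned projection. All function names, dimensions, and example values are assumptions made for illustration only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention (simplification of multi-head attention)."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        fused.append([sum(w * v[c] for w, v in zip(weights, values))
                      for c in range(len(values[0]))])
    return fused


def pyramid_pool(vec, bins=(1, 2, 4)):
    """Assumed stand-in for pyramid pooling: average-pool a channel vector into 1, 2, 4 bins."""
    pooled = []
    for b in bins:
        size = len(vec) // b
        pooled += [sum(vec[k * size:(k + 1) * size]) / size for k in range(b)]
    return pooled


def fold(vec, dim):
    """Assumed stand-in for a learned projection: sum channels whose index is equal modulo dim."""
    out = [0.0] * dim
    for idx, x in enumerate(vec):
        out[idx % dim] += x
    return out


def fuse(location_embeddings, feature_levels):
    # A1: the location embeddings (one per vertebra) attend to each feature level.
    per_level = [attend(location_embeddings, level, level) for level in feature_levels]
    # A2: pool every fused row of every level.
    pooled = [[pyramid_pool(row) for row in fused] for fused in per_level]
    # A3: concatenate the pooled rows of all levels along the channel axis.
    concatenated = [sum((level[j] for level in pooled), [])
                    for j in range(len(location_embeddings))]
    # A4: the location embeddings attend to the concatenation result.
    dim = len(location_embeddings[0])
    keys = [fold(row, dim) for row in concatenated]
    return attend(location_embeddings, keys, concatenated)


# Two vertebrae, three feature levels with three 4-channel tokens each (made-up values).
location_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
feature_levels = [[[0.1 * (lvl + tok + ch) for ch in range(4)] for tok in range(3)]
                  for lvl in range(3)]
fused_sequence = fuse(location_embeddings, feature_levels)
```

In this sketch the result contains one fused feature vector per vertebra, which matches the description of the fused image feature sequence as being composed of the fused image features of each vertebra.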
[0025] In one possible implementation, the segmentation encoding device further includes:
[0026] The classifier communicates with the image semantic encoding device and executes an image-level classification algorithm to infer and analyze the global context of the image feature representation and dynamically generate region weighting coefficients.
[0027] The optimizer, which communicates with both the classifier and the vertebral classification device, is used to substitute the region weighting coefficients and the image-text feature similarity scores into the region-specific cross-entropy loss function, and optimize the model parameters of the segmentation model, the image semantic coding device, and the fusion module by solving for the optimal value of the region-specific cross-entropy loss function.
[0028] By adding a classifier and optimizer to the existing structure, the parameter optimization requirements of the segmentation coding device can be met, thereby improving the overall classification accuracy.
[0029] In one possible implementation, the region-specific cross-entropy loss function is calculated as follows:
[0030] ,
[0031] ,
[0032] ,
[0033] In the formula,
[0034] Represents the value of the region-specific cross-entropy loss function;
[0035] Represents the number of spinal X-ray images used during parameter optimization;
[0036] Represents the number of vertebrae;
[0037] Represents the number of vertebrae;
[0038] Represents the region weighting coefficient;
[0039] Represents the probability that a given vertebra in a given spinal X-ray image belongs to the thoracic vertebrae;
[0040] Represents the probability that a given vertebra in a given spinal X-ray image belongs to the lumbar vertebrae;
[0041] Represents the true label indicating whether a given vertebra in a given spinal X-ray image belongs to a given vertebral category;
[0042] Represents the image-text feature similarity score between a given vertebra in a given spinal X-ray image and a given vertebral category;
[0043] Represents the fused image feature of a given vertebra in a given spinal X-ray image;
[0044] Represents the text feature of a given vertebral category;
[0045] Represents the dot product operation of vectors;
[0046] Represents the scaling factor.
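The formulas themselves are not reproduced in this text. One plausible form that is consistent with the definitions listed above is sketched below as an illustration only. Every symbol is an assumption: $N$ images, $M$ vertebrae per image, $C$ vertebral categories (the text lists "the number of vertebrae" twice, and one of the two is assumed here to be the number of categories), region weighting coefficient $w_{i,j}$, true label $y_{i,j,c}$, similarity score $s_{i,j,c}$, fused image feature $f_{i,j}$, text feature $t_c$, and scaling factor $\tau$. Whether the scaling factor multiplies or divides the dot product is not recoverable, and division is assumed. The base-2 logarithm follows the convention stated elsewhere in the description. The relation between the region weighting coefficient and the thoracic and lumbar probabilities is not recoverable and is therefore not written out.

```latex
s_{i,j,c} = \frac{f_{i,j} \cdot t_c}{\tau}

\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{c=1}^{C}
    w_{i,j} \, y_{i,j,c} \,
    \log_2 \frac{\exp\left(s_{i,j,c}\right)}{\sum_{c'=1}^{C} \exp\left(s_{i,j,c'}\right)}
```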
[0047] The classifier determines the probability distribution of field-of-view categories and dynamically generates the region weighting coefficients. The region-specific cross-entropy loss function of the above form then uses these coefficients to dynamically weight the loss of different regions, realizing region-aware supervised optimization. This allows subtle differences to be reasonably distinguished and effectively improves the robustness and generalization ability of the model under standard and non-standard imaging conditions.
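The dynamic weighting of the loss described in this paragraph can be illustrated with a minimal sketch of a weighted cross-entropy computed over similarity scores. It is an assumption-laden illustration rather than the disclosed formula: the function name, the averaging over images, and the way the weights are supplied (as ready-made per-vertebra values instead of being derived from the classifier output) are all hypothetical. The base-2 logarithm follows the convention stated in the description.

```python
import math


def region_weighted_cross_entropy(scores, labels, weights):
    """Weighted cross-entropy over image-text similarity scores.

    scores[i][j][c]: similarity score of vertebra j in image i for category c.
    labels[i][j]:    index of the true category of that vertebra.
    weights[i][j]:   region weighting coefficient for that vertebra (assumed given).
    """
    total = 0.0
    for img_scores, img_labels, img_weights in zip(scores, labels, weights):
        for s, y, w in zip(img_scores, img_labels, img_weights):
            m = max(s)  # subtract the maximum for numerical stability
            denom = sum(math.exp(v - m) for v in s)
            prob_true = math.exp(s[y] - m) / denom
            total += -w * math.log2(prob_true)
    return total / len(scores)  # averaging over images is an assumption


# One image with two vertebrae and two categories (made-up values).
example_scores = [[[0.0, 0.0], [0.0, 0.0]]]
example_labels = [[0, 1]]
loss_equal = region_weighted_cross_entropy(example_scores, example_labels, [[1.0, 1.0]])
loss_upweighted = region_weighted_cross_entropy(example_scores, example_labels, [[1.0, 3.0]])
```

With uniform scores each vertebra contributes one bit of loss, so raising the weight of the second vertebra raises the total in proportion, which is the effect the region weighting coefficients are described as having.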
[0048] In one possible implementation, the vertebral classification device includes:
[0049] A text semantic encoding device is used to receive or generate a list of categories and perform feature extraction on the list of categories to obtain text features of each vertebra category in the list of categories;
[0050] The similarity calculation device communicates with both the text semantic encoding device and the fusion module to calculate the image-text feature similarity score between the fused image feature sequence and all the text features.
[0051] The vertebral body classification device communicates with the similarity calculation device to substitute the image and text feature similarity scores into the calculation rules of a multilayer conditional random field to obtain the vertebral classification results of the X-ray film.
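The role of the similarity calculation device can be illustrated with a minimal sketch that scores every fused image feature against every text feature using a scaled dot product. The function name, the division by a scaling factor `tau`, and the example vectors are assumptions for illustration; the description specifies only that a dot product and a scaling factor are involved.

```python
def similarity_scores(fused_features, text_features, tau=1.0):
    """Return scores[j][c] = (f_j . t_c) / tau for vertebra j and category c.

    Dividing by tau is an assumption; the source does not state whether the
    scaling factor multiplies or divides the dot product.
    """
    return [
        [sum(a * b for a, b in zip(f, t)) / tau for t in text_features]
        for f in fused_features
    ]


# Two vertebrae and three categories with 2-dimensional features (made-up values).
fused = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = similarity_scores(fused, texts, tau=0.5)
```

The resulting matrix has one row per vertebra and one column per vertebral category, which is the form needed by a sequence model that assigns one category label to each vertebra.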
[0052] In one possible implementation, the operational rules of the multilayer conditional random field are described as follows:
[0053] The objective is to obtain the vertebral classification results from the X-ray images;
[0054] The algorithm used is: Viterbi dynamic programming algorithm;
[0055] The X-ray vertebral classification results are considered as: the label sequence that maximizes the conditional probability distribution of the label sequence;
[0056] The expression for the conditional probability distribution is:
[0057] ,
[0058] In the formula,
[0059] Represents the conditional probability distribution;
[0060] Represents the normalization factor;
[0061] Represents the fused image feature sequence, each element of which represents the fused image feature of one vertebra;
[0062] Represents the label sequence, each element of which represents the category label of one vertebra;
[0063] Represents the image-text feature similarity score of a given vertebra when its vertebral category is the corresponding category label;
[0064] Represents the transition score, designed based on prior knowledge of vertebral anatomy, for a smooth transition from the vertebral category of one vertebra to the vertebral category of the next vertebra.
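The expression itself is not reproduced in this text. A standard linear-chain conditional random field that is consistent with the quantities defined above would take the following form. This is an assumed reconstruction for illustration, not the disclosed formula: the symbols $X = (x_1, \dots, x_n)$, $Y = (y_1, \dots, y_n)$, $s(x_i, y_i)$, $T(y_{i-1}, y_i)$, and $Z(X)$ are introduced here, and the linear-chain structure is an assumption.

```latex
P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} s(x_i, y_i) + \sum_{i=2}^{n} T(y_{i-1}, y_i) \right),
\qquad
Z(X) = \sum_{Y'} \exp\left( \sum_{i=1}^{n} s(x_i, y'_i) + \sum_{i=2}^{n} T(y'_{i-1}, y'_i) \right)

Y^{*} = \arg\max_{Y} P(Y \mid X)
```

Because $Z(X)$ does not depend on $Y$, maximizing $P(Y \mid X)$ amounts to maximizing the sum of similarity scores and transition scores, which is the quantity a Viterbi dynamic programming algorithm maximizes.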
[0065] The vertebral classification device with the above structure and functions introduces prior anatomical knowledge into the classification process, avoiding unreasonable predictions caused by independent classification. At the same time, it explicitly ensures the consistency of the order of vertebrae and category labels, effectively capturing the deep two-way synergistic relationship between image and text information, and significantly improving classification accuracy and robustness under non-standard imaging conditions (such as blurred vertebral bodies or missing vertebral segments).
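The effect of decoding with an anatomical prior can be illustrated with a minimal Viterbi sketch. It assumes a linear-chain model in which the score of a label sequence is the sum of per-vertebra similarity scores and pairwise transition scores. The transition prior used here (only a step to the next category in anatomical order is permitted) and the example scores are hypothetical and chosen for illustration.

```python
NEG_INF = float("-inf")


def viterbi(emissions, transitions):
    """Return the label sequence with the highest total score.

    emissions[i][c]:   similarity score of vertebra i for category c.
    transitions[a][b]: score for category a at one vertebra followed by b at the next.
    """
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []
    for i in range(1, n):
        new_score, pointers = [], []
        for b in range(k):
            best_a = max(range(k), key=lambda a: score[a] + transitions[a][b])
            new_score.append(score[best_a] + transitions[best_a][b] + emissions[i][b])
            pointers.append(best_a)
        score = new_score
        back.append(pointers)
    path = [max(range(k), key=lambda c: score[c])]
    for pointers in reversed(back):  # follow back-pointers from the last vertebra
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def anatomical_transitions(k):
    """Hypothetical prior: each vertebra must be followed by the next category in order."""
    return [[0.0 if b == a + 1 else NEG_INF for b in range(k)] for a in range(k)]


categories = ["T11", "T12", "L1", "L2"]
emissions = [
    [2.0, 1.0, 0.1, 0.0],
    [0.5, 1.2, 1.0, 0.1],
    [0.2, 1.4, 1.3, 0.3],  # taken alone, this vertebra would be labeled T12 again
]
independent = [max(range(len(row)), key=lambda c: row[c]) for row in emissions]
decoded = viterbi(emissions, anatomical_transitions(len(categories)))
decoded_names = [categories[c] for c in decoded]
```

In this example, classifying each vertebra independently assigns T12 to two consecutive vertebrae, whereas the decoded sequence is T11, T12, L1, illustrating how the transition term avoids anatomically unreasonable predictions.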
[0066] Another technical solution of the present invention is to provide a method for classifying vertebrae on X-ray films based on segmentation-guided coding, the method comprising the following steps:
[0067] S1: Using the constructed spinal X-ray image-vertebra category dataset, the model parameters of the segmentation coding device are optimized by solving for the optimal value of the region-specific cross-entropy loss function;
[0068] S2: Acquire the spinal X-ray image of the individual being tested through the image acquisition device, and perform target detection and key point detection of the vertebrae on the spinal X-ray image through the prompt information acquisition device to obtain prompt information that displays prompt boxes and prompt points;
[0069] S3: The spinal X-ray image is subjected to feature extraction by the parameter-optimized segmentation and coding device to obtain image feature representation. At the same time, the spinal X-ray image is segmented into vertebral regions according to the prompt information by the parameter-optimized segmentation and coding device, generating shape information and spatial position information of each vertebra, and encoding them as position embedding features.
[0070] S4: The image feature representation and the location embedding feature are fused using an attention-weighted feature fusion method through a parameter-optimized segmentation coding device to obtain a fused image feature sequence;
[0071] S5: Calculate the image-text feature similarity score between the fused image feature sequence and the vertebral category list using a vertebral classification device, and substitute the image-text feature similarity score into the operation rules of a multilayer conditional random field to obtain the X-ray vertebral classification result.
[0072] The method disclosed in this invention first optimizes the model parameters of the segmentation and coding device. Then, it acquires spinal X-ray images and uses a prompt information collector to obtain prompt information displaying prompt boxes and points. The segmentation and coding device extracts features from the spinal X-ray images, and simultaneously segments the vertebral regions based on the prompt information, generating shape and spatial location information for each vertebra, which is then encoded as location embedding features. This segmentation-guided vertebral coding stage, by generating preliminary vertebral segmentation location embedding features and combining them with the image features obtained from feature extraction, ensures image enhancement during the fusion process. Furthermore, the segmentation and coding device performs attention-weighted feature fusion of the image feature representation and location embedding features to obtain a fused image feature sequence. This method of fusing location and shape information with semantic representation promotes image-text semantic alignment and image enhancement, providing high-quality candidate regions and rich location features for subsequent classification tasks. Finally, a vertebral classification device calculates the image-text feature similarity score between the fused image feature sequence and the vertebral category list to distinguish subtle differences. The image-text feature similarity score is then substituted into the operation rules of a multilevel conditional random field to obtain the X-ray vertebral classification result. This region-aware vertebral classification stage employs conditional random field-based computational rules to optimize spatial sequence alignment, remodeling the vertebral classification task as a structured prediction problem. 
It then models the joint distribution of vertebral sequences, decodes the optimal label sequence, and achieves semantic alignment between images and text. This ensures that the predicted vertebral classification results satisfy spatial continuity and anatomical rationality, effectively capturing the deep bidirectional synergistic relationship between image and text information, resulting in vertebral classification results with good clinical interpretability and reliability.
Attached Figure Description
[0073] Figure 1 is a schematic diagram of the overall structure of the X-ray vertebral classification system based on segmentation guided coding disclosed in Embodiment 1 or 2 of this application;
[0074] Figure 2 is a schematic diagram of the network structure of the X-ray vertebral classification system based on segmentation guided coding disclosed in Embodiment 1 of this application;
[0075] Figure 3 is a schematic diagram of the process for optimizing model parameters of a segmentation coding device as disclosed in Embodiment 1 of this application;
[0076] Figure 4 is a schematic diagram of the operation flow of the fusion module disclosed in Embodiment 1 or 2 of this application;
[0077] Figure 5 is a schematic diagram of the method flow disclosed in Embodiment 1 or 2 of this application;
[0078] Figure 6 is a schematic diagram of the network structure of the X-ray vertebral classification system based on segmentation guided coding disclosed in Embodiment 2 of this application;
[0079] Figure 7 is a schematic diagram of the parameter optimization process for the X-ray vertebral classification system based on segmentation guided coding disclosed in Embodiment 2 of this application.
Detailed Implementation
[0080] First, those skilled in the art should understand that these embodiments are merely used to explain the technical principles of the embodiments of this application and are not intended to limit the scope of protection of the embodiments of this application. Those skilled in the art can make adjustments as needed to adapt to specific application scenarios.
[0081] In the embodiments of this application, unless otherwise explicitly specified and limited, communication or communication connection between the first feature and the second feature means that there is information transmission between the first feature and the second feature. This information transmission can be unidirectional or bidirectional, and the communication connection can be achieved through electrical connection of wires, wireless communication, communication through electromagnetic media (such as semiconductors), communication through channels, etc. Furthermore, unless otherwise specified, the base of the logarithmic function used is 2.
[0082] In the embodiments of this application, unless otherwise explicitly specified and limited, a first feature being "above," "below," "in front of," or "behind" a second feature can mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Furthermore, the first feature being "above," "on top of," or "over" the second feature can mean that the first feature is directly above or diagonally above the second feature, or simply that the first feature is at a higher horizontal level than the second feature. The first feature being "below," "beneath," or "under" the second feature can mean that the first feature is directly below or diagonally below the second feature, or simply that the first feature is at a lower horizontal level than the second feature. The first feature being "before" or "in front of" the second feature can mean that the first feature is directly in front of or diagonally in front of the second feature, or simply that the first feature precedes the second feature in sequence. The first feature being "after" or "behind" the second feature can mean that the first feature is directly behind or diagonally behind the second feature, or simply that the first feature follows the second feature in sequence.
[0083] The technical solution of this application will be further described in detail below using two embodiments, in conjunction with the accompanying drawings and specific embodiments.
[0084] Embodiment 1: Referring to Figures 1-5, this embodiment discloses an X-ray vertebral classification system based on segmentation-guided coding. Figure 1 is a schematic diagram of the overall structure of the X-ray vertebral classification system, and Figure 2 is a schematic diagram of its network structure. As shown in Figure 1, the X-ray vertebral classification system includes an image acquisition unit, a prompt information acquisition unit, a segmentation and coding device, and a vertebral classification device. The prompt information acquisition unit communicates with the image acquisition unit, the segmentation and coding device communicates with both the image acquisition unit and the prompt information acquisition unit, and the vertebral classification device communicates with the segmentation and coding device.
[0085] In this X-ray vertebral classification system, an image acquisition unit is used to acquire X-ray images of the spine of the individual being tested; a prompt information acquisition unit is used to perform target detection and keypoint detection on the spinal X-ray images to obtain prompt information displaying prompt boxes and prompt points. In this embodiment, the prompt information acquisition unit includes a target detection network and a keypoint detection network connected in parallel or in series. The target detection network is used to perform target detection on the spinal X-ray images to obtain prompt box information. The keypoint detection network is used to perform keypoint detection on the spinal X-ray images to obtain prompt point information. In this embodiment, the target detection network and the keypoint detection network are connected in parallel. The target detection network uses the YOLOv8 target detection network, and the keypoint detection network uses the spinal keypoint detection network.
[0086] Referring to Figure 2, in this X-ray vertebral classification system, the segmentation and coding device is used to extract features from spinal X-ray images to obtain image feature representations. Simultaneously, based on the prompt information, the device segments the spinal X-ray images into vertebral regions, generating shape and spatial location information for each vertebra, which is then encoded as location embedding features. Subsequently, the image feature representations and location embedding features are fused using an attention-weighted approach to obtain a fused image feature sequence. As shown in Figure 2, in this embodiment, the segmentation and encoding device includes a segmentation model, an image semantic encoding device, and a fusion module. The segmentation model communicates with both the image acquisition device and the prompt information acquisition device. The image semantic encoding device communicates with the image acquisition device. The fusion module communicates with both the segmentation model and the image semantic encoding device.
[0087] Referring to Figure 2, in the segmentation and coding device, the Segment Anything Model (SAM) is used as the segmentation model to segment vertebral regions in spinal X-ray images based on the prompt information, generating shape and spatial location information for each vertebra. This shape and spatial location information is then encoded into location embedding features, which are represented as vectors, i.e., vertebral segmentation location embedding vectors. The SAM includes a prompt encoder, an image encoder, and a mask decoder. The prompt encoder communicates with the prompt information acquisition unit, the image encoder communicates with the image acquisition unit, and the mask decoder communicates with the prompt encoder, the image encoder, and the fusion module. The SAM generates preliminary vertebral segmentation location embedding vectors. Specifically, the image encoder performs feature segmentation encoding on the spinal X-ray image, while the prompt encoder performs feature segmentation encoding on the prompt information. The encoded results are then decoded by the mask decoder, completing vertebral region segmentation and yielding the vertebral segmentation location embedding vectors. This provides high-quality candidate regions and rich location features for subsequent classification tasks.
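What "shape and spatial location information" of a segmented vertebra might contain can be illustrated with a minimal sketch that derives a small geometric descriptor from a binary mask. This is a hand-crafted illustration, not the embedding produced by the mask decoder of the segmentation model: the function name, the choice of centroid, bounding box, and area, and the normalization by image size are all assumptions.

```python
def mask_to_location_features(mask):
    """Return [cx, cy, x_min, y_min, x_max, y_max, area], normalized by image size.

    mask is a list of rows; non-zero entries mark pixels of one vertebra.
    """
    height, width = len(mask), len(mask[0])
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no foreground pixels")
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    area = len(pixels)
    return [
        sum(xs) / area / width,    # centroid x
        sum(ys) / area / height,   # centroid y
        min(xs) / width,           # bounding box left edge
        min(ys) / height,          # bounding box top edge
        (max(xs) + 1) / width,     # bounding box right edge (exclusive)
        (max(ys) + 1) / height,    # bounding box bottom edge (exclusive)
        area / (width * height),   # fraction of the image covered
    ]


# A 4 x 4 image containing a 2 x 2 vertebra mask (made-up example).
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
location_features = mask_to_location_features(example_mask)
```

A learned embedding would carry far richer information, but even this descriptor shows how a per-vertebra vector can encode both where a vertebra lies and how large it is.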
[0088] Referring to Figure 2, in the segmentation and coding device, the image semantic coding unit is used to extract features from the spinal X-ray image to obtain an image feature representation with at least three levels of features. In this embodiment, the image semantic coding unit is a contrastive language-image pre-training image encoder obtained through medical image pre-training, referred to as the MedCLP image encoder for short. It is an image encoder based on a contrastive language-image pre-training model, obtained after pre-training on a medical image dataset, and it has at least three outputs so as to output features at multiple levels. This image encoder extracts features from the input image, i.e., the spinal X-ray image, to obtain an image feature representation with medical semantic discriminative properties, i.e., MedCLP image features. In this embodiment, the MedCLP image features have three levels of features, denoted as feature layer 1, feature layer 2, and feature layer 3, as shown in Figure 4.
[0089] Referring to Figure 2 and Figure 4, in the segmentation and coding device, the fusion module performs attention-weighted feature fusion of the image feature representation and the location embedding features, and integrates the fusion results at multiple scales to obtain a fused image feature sequence composed of the fused image features of each vertebra. As shown in Figure 4, in this embodiment the fusion module is configured to operate according to the following steps:
A1: Perform multi-head attention fusion between the location embedding features and each feature layer of the image feature representation, obtaining the fusion result of the location embedding features with feature layer 1, with feature layer 2, and with feature layer 3;
A2: Perform a feature pyramid pooling operation on each fusion result to obtain the corresponding pooling operation result;
A3: Perform channel concatenation on all pooling operation results obtained in step A2 to obtain the concatenation result;
A4: Perform multi-head attention fusion between the location embedding features and the concatenation result to obtain the fused image feature sequence.
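Steps A1 to A4 can be illustrated with a small NumPy sketch. It is only a sketch under stated assumptions, not the embodiment's implementation: the attention weights are random and untrained, the location embeddings act as queries over each feature layer, and "feature pyramid pooling" is assumed to mean average-pooling each fused vector into 1, 2 and 4 channel bins. The names `multi_head_attention`, `pyramid_pool` and `fuse`, and all dimensions, are hypothetical.

```python
import numpy as np

def multi_head_attention(query, context, num_heads, d_model, rng):
    """Cross-attention: rows of `query` attend over rows of `context` (random, untrained weights)."""
    d_head = d_model // num_heads
    w_q = rng.standard_normal((query.shape[1], d_model)) / np.sqrt(query.shape[1])
    w_k = rng.standard_normal((context.shape[1], d_model)) / np.sqrt(context.shape[1])
    w_v = rng.standard_normal((context.shape[1], d_model)) / np.sqrt(context.shape[1])
    # Project, then split the model dimension into heads: (heads, rows, d_head).
    q = (query @ w_q).reshape(len(query), num_heads, d_head).transpose(1, 0, 2)
    k = (context @ w_k).reshape(len(context), num_heads, d_head).transpose(1, 0, 2)
    v = (context @ w_v).reshape(len(context), num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v  # (heads, query rows, d_head)
    return out.transpose(1, 0, 2).reshape(len(query), d_model)

def pyramid_pool(features, bins=(1, 2, 4)):
    """Assumed pyramid pooling: average each row's channels into 1, 2 and 4 bins, then concatenate."""
    pooled = [
        np.stack([chunk.mean(axis=1) for chunk in np.array_split(features, b, axis=1)], axis=1)
        for b in bins
    ]
    return np.concatenate(pooled, axis=1)

def fuse(position_emb, feature_levels, num_heads=4, d_model=32, seed=0):
    rng = np.random.default_rng(seed)
    # A1: fuse the location embeddings with each feature layer.
    fused = [multi_head_attention(position_emb, level, num_heads, d_model, rng)
             for level in feature_levels]
    # A2: pyramid pooling of each fusion result.
    pooled = [pyramid_pool(f) for f in fused]
    # A3: channel concatenation of all pooling results.
    concat = np.concatenate(pooled, axis=1)
    # A4: fuse the location embeddings with the concatenation result.
    return multi_head_attention(position_emb, concat, num_heads, d_model, rng)

rng = np.random.default_rng(1)
position_emb = rng.standard_normal((5, 16))                    # 5 vertebrae, 16-dim embeddings
levels = [rng.standard_normal((n, 24)) for n in (64, 16, 4)]   # three feature layers
fused_sequence = fuse(position_emb, levels)
print(fused_sequence.shape)  # (5, 32)
```

The output has one row per vertebra, which matches the description of the fused image feature sequence as being composed of the fused image features of each vertebra.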
[0090] Referring to Figure 2 and Figure 3, in this X-ray vertebral classification system, the vertebral classification device calculates the image-text feature similarity score between the fused image feature sequence and the vertebral category list. The image-text feature similarity score is then substituted into the operation rules of a multilevel conditional random field to obtain the X-ray vertebral classification result. As shown in Figure 2, in this embodiment the vertebral classification device includes a text semantic encoding device, a similarity calculation device, and a vertebral body classification device. The similarity calculation device communicates with both the text semantic encoding device and the fusion module, while the vertebral body classification device communicates with the similarity calculation device.
[0091] Referring to Figure 2 and Figure 3, in the vertebral classification device, the text semantic encoding device receives or generates a category list and extracts features from it to obtain the text feature of each vertebral category. As shown in Figure 2 and Figure 3, the text semantic encoding device used in this embodiment is the text encoder of a language-image contrastive pre-trained model that has been pre-trained on a medical text dataset, referred to as the MedCLP text encoder.
[0092] In fact, the combination of image semantic encoding devices and text semantic encoding devices constitutes the Medical Contrastive Learning from Image and Paired Text (MedCLIP) model. However, although the MedCLIP model has good expressive ability in image-text feature alignment, it essentially treats vertebrae as independent classification units, ignoring the inherent sequence constraints of the spine as an anatomical structure. This independent classification paradigm easily leads to anatomically unreasonable prediction results, such as label skipping or sequence discontinuity, thus seriously affecting the clinical reliability of the model. Therefore, the vertebral classification device in this embodiment includes a similarity calculation device and a vertebral body classification device in addition to the text semantic encoding device, to overcome the aforementioned technical defect of anatomically unreasonable prediction results.
[0093] Specifically, in this embodiment, the similarity calculation device is used to calculate the image-text feature similarity score between the fused image feature sequence and all text features. The similarity score calculation formula used is as follows:
[0094] $$s_{i,c} = \alpha \left( f_i \cdot t_c \right),$$
[0095] In the formula, $s_{i,c}$ represents the image-text feature similarity score between the $i$-th vertebra in the spinal X-ray image and vertebral category $c$; $f_i$ represents the fused image feature of the $i$-th vertebra in the spinal X-ray image; $t_c$ represents the text feature of vertebral category $c$; $\cdot$ represents the dot product of vectors; and $\alpha$ represents the scaling factor.
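As a minimal sketch of this score, assuming the scaling factor multiplies the dot product (the source formula may instead divide by a temperature), the computation for all vertebra-category pairs could look as follows. The function name `similarity_scores` and the toy vectors are hypothetical.

```python
def similarity_scores(fused_features, text_features, scale=1.0):
    """Return scores[i][c] = scale * dot(fused_features[i], text_features[c])."""
    return [
        [scale * sum(f * t for f, t in zip(feature, text)) for text in text_features]
        for feature in fused_features
    ]

# Two vertebrae and three hypothetical category embeddings (2-dimensional for readability).
fused = [[1.0, 0.0], [0.6, 0.8]]
texts = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
scores = similarity_scores(fused, texts, scale=10.0)

# The highest score in each row is the best-matching category for that vertebra.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best)  # [0, 2]
```

Taking the per-row maximum, as in the last two lines, is exactly the independent classification that the conditional random field described next is meant to replace.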
[0096] The vertebral body classification device substitutes the image-text feature similarity score into the operation rules of a multilevel conditional random field (CRF) to obtain the X-ray vertebral classification result. The operation rules of the CRF are as follows: (a) the goal is to obtain the X-ray vertebral classification result; (b) the algorithm used is the Viterbi dynamic programming algorithm; (c) the X-ray vertebral classification result is taken to be the label sequence that maximizes the conditional probability distribution of the label sequence; (d) the expression for the conditional probability distribution is:
[0097] $$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \psi_e(y_i, x_i) + \sum_{i=2}^{N} \psi_t(y_{i-1}, y_i) \right),$$
[0098] In the formula, $P(Y \mid X)$ represents the conditional probability distribution; $N$ represents the number of vertebrae; $Z(X)$ is the normalization factor; $X = (x_1, \dots, x_N)$ represents the fused image feature sequence, where $x_i$ represents the fused image feature of the $i$-th vertebra; $Y = (y_1, \dots, y_N)$ represents the label sequence, where $y_i$ represents the category label of the $i$-th vertebra; $\psi_e$ represents the emission potential function, which in this embodiment is set equal to the image-text feature similarity score obtained when the vertebral category of the $i$-th vertebra is $y_i$; and $\psi_t$ represents the transition potential function, a first-order function designed on the basis of prior knowledge of vertebral anatomy, which gives the smooth transition amount from vertebral category $y_{i-1}$ to vertebral category $y_i$.
[0099] In the calculation, the objective function is the negative log-likelihood loss function, which is to take the negative logarithm of the conditional probability distribution:
[0100] $$\mathcal{L}_{\mathrm{NLL}} = -\log P(Y \mid X),$$
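The negative log-likelihood can be sketched in plain Python, assuming a standard first-order (linear-chain) conditional random field in which the emission scores are the image-text similarity scores and the transition scores form a category-by-category matrix. The normalization factor is computed with the forward algorithm in log space. The function names and toy numbers are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def sequence_score(emissions, transitions, labels):
    """Unnormalized log-score of one label sequence: emission plus transition terms."""
    score = emissions[0][labels[0]]
    for i in range(1, len(labels)):
        score += transitions[labels[i - 1]][labels[i]] + emissions[i][labels[i]]
    return score

def log_partition(emissions, transitions):
    """Log of the normalization factor, via the forward algorithm."""
    alpha = list(emissions[0])
    for i in range(1, len(emissions)):
        alpha = [
            emissions[i][c]
            + log_sum_exp([alpha[p] + transitions[p][c] for p in range(len(alpha))])
            for c in range(len(alpha))
        ]
    return log_sum_exp(alpha)

def crf_nll(emissions, transitions, labels):
    """Negative log of the conditional probability of `labels`."""
    return log_partition(emissions, transitions) - sequence_score(emissions, transitions, labels)

# emissions[i][c]: similarity score of vertebra i with category c (toy values).
emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
# transitions[p][c]: transition score from category p to category c (toy values).
transitions = [[0.1, -0.3], [-0.2, 0.4]]
print(crf_nll(emissions, transitions, [0, 1, 1]))
```

Because the forward algorithm sums over all label sequences implicitly, the probabilities it implies add up to one without enumerating every sequence.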
[0101] During the inference phase, the Viterbi dynamic programming algorithm is used to solve for the optimal label sequence $Y^{*}$:
[0102] $$Y^{*} = \arg\max_{Y} P(Y \mid X),$$
[0103] The optimal label sequence obtained in this way is the X-ray vertebral classification result. The Viterbi dynamic programming algorithm is existing technology, and those skilled in the art can learn about it by consulting relevant books on optimization methods; therefore, it will not be elaborated upon here.
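For readers who want a concrete picture, the following is a minimal Viterbi sketch in plain Python. The transition matrix used here (no penalty for moving to the next category in anatomical order, a fixed penalty otherwise) is a hypothetical stand-in for the anatomically designed transition potential; the embodiment's actual values are not given in the text.

```python
def viterbi(emissions, transitions):
    """Return the label sequence with the highest total emission + transition score."""
    num_categories = len(emissions[0])
    best = list(emissions[0])  # best[c]: best score of any path ending in category c
    back = []                  # back[i][c]: best previous category for c at step i + 1
    for i in range(1, len(emissions)):
        new_best, pointers = [], []
        for c in range(num_categories):
            prev = max(range(num_categories), key=lambda p: best[p] + transitions[p][c])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][c] + emissions[i][c])
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final category.
    path = [max(range(num_categories), key=lambda c: best[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Three vertebrae, three categories in anatomical order (toy similarity scores).
emissions = [[3.0, 1.0, 0.0], [0.0, 1.0, 1.5], [0.0, 1.0, 3.0]]
# Hypothetical prior: moving to the next category costs nothing, anything else costs 5.
transitions = [[0.0 if c == p + 1 else -5.0 for c in range(3)] for p in range(3)]

independent = [max(range(3), key=row.__getitem__) for row in emissions]
print(independent)                      # [0, 2, 2]
print(viterbi(emissions, transitions))  # [0, 1, 2]
```

In this toy case, per-vertebra classification skips a label and repeats another, while the decoded sequence is contiguous, which is the kind of anatomically unreasonable prediction the transition potential is intended to suppress.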
[0104] When optimizing the model parameters of the vertebral body classification device, all other parameters can be frozen, and only the transition matrix and scale parameters corresponding to the smooth transition potential function are optimized, on the basis of prior knowledge of vertebral anatomy, so that the prediction results conform to anatomical principles. The objective function remains the negative log-likelihood loss function. The design of the vertebral body classification device in this embodiment effectively introduces anatomical priors into the model optimization process, avoiding the unreasonable predictions caused by independent classification while explicitly ensuring the consistency of the vertebral label order. This significantly improves classification accuracy and robustness under non-standard imaging conditions (such as limited field of view, blurred vertebral bodies, or missing vertebral segments).
[0105] Referring to Figure 3 and Figure 5, the method of using the segmentation-guided-coding-based X-ray vertebral classification system of this embodiment, namely a method for classifying X-ray vertebrae based on segmentation-guided coding, is further disclosed below. Figure 5 is the overall flowchart of the method, which includes the following steps:
[0106] S1: Using the constructed spinal X-ray image-vertebra category dataset, the model parameters of the segmentation coding device are optimized by solving for the optimal value of the region-specific cross-entropy loss function.
[0107] Spinal X-ray images exhibit significant structural differences depending on the field of view (e.g., cervical, thoracic, lumbar vertebrae). These structural differences often lead to a decline in the discriminative ability of a uniform classification model. Traditional loss functions typically treat all regions equally, ignoring the anatomical differences contained in different imaging fields, thereby weakening the model's generalization ability in multi-field scenarios.
[0108] To address the decline in the discriminative ability of a unified classification model caused by changes in field of view, to make full use of the known anatomical information in spinal images, and to guide the model to focus on the important features of different image sources, this embodiment designs a region-specific cross-entropy loss function. This loss function uses the weighting coefficients predicted by the region classifier to dynamically weight the losses of different regions, thereby achieving region-aware supervised optimization.
[0109] Referring to Figure 3, the process for optimizing the model parameters of the segmentation and coding device is specifically as follows:
[0110] First, a large number of spinal X-ray images were collected from the database, and the vertebrae on each spinal X-ray image were classified and their category labels were obtained to construct a spinal X-ray image-vertebra category dataset.
[0111] Secondly, feature extraction is performed on each spinal X-ray image using the image semantic coding device in the segmentation coding device to obtain an image feature representation with at least three levels of features. An image-level region classifier (abbreviated as "classifier" in Figure 3) then performs global anatomical context reasoning on the image feature representation of each spinal X-ray image and dynamically generates the region weighting coefficients.
[0112] Subsequently, the prompt information collector performs target detection and key point detection on each spinal X-ray image to obtain prompt information that displays prompt boxes and prompt points.
[0113] Subsequently, the segmentation model in the segmentation coding device segments the vertebral regions of each spinal X-ray image according to the prompt information, generating shape and spatial location information of each vertebra, and encoding the obtained shape and spatial location information into location embedding features. At the same time, the image semantic coding device in the segmentation coding device extracts features from each spinal X-ray image to obtain image feature representations with at least three levels of features.
[0114] Subsequently, the image feature representation and location embedding feature of each spinal X-ray image are fused using an attention-weighted method through the fusion module in the segmentation coding device, and the fusion result is integrated into features at multiple scales to obtain a fused image feature sequence composed of the fused image features of each vertebra.
[0115] Subsequently, a category list is generated by the text semantic encoding device in the vertebral classification device, and feature extraction is performed on the category list to obtain the text features of each vertebral category in the category list.
[0116] Subsequently, the similarity calculation device in the vertebral classification device calculates the image-text feature similarity score between the fused image feature sequence of each spinal X-ray image and all text features.
[0117] Finally, the image-text feature similarity scores and region weighting coefficients of all spinal X-ray images are substituted into the region-specific cross-entropy loss function. The model parameters of the segmentation coding device are solved by the optimization model algorithm when the region-specific cross-entropy loss function reaches its optimal value. That is, the model parameters of the segmentation model, the image semantic coding device and the fusion module are optimized to complete the model parameter optimization.
[0118] The formula for calculating the region-specific cross-entropy loss function is as follows:
[0119] ,
[0120] ,
[0121] ,
[0122] In the formula, the symbols represent, in order: the value of the region-specific cross-entropy loss function; the number of spinal X-ray images used during parameter optimization; the number of vertebrae; the number of vertebrae; the region weighting coefficient, obtained by the region classifier, of the $n$-th vertebra in the $m$-th spinal X-ray image; the probability that the $n$-th vertebra in the $m$-th spinal X-ray image belongs to the thoracic vertebrae, i.e., the probability that the $n$-th vertebra is any one of the first to twelfth thoracic vertebrae; the probability that the $n$-th vertebra in the $m$-th spinal X-ray image belongs to the lumbar vertebrae, i.e., the probability that the $n$-th vertebra is any lumbar vertebra or the sacrum; the true label indicating whether the $n$-th vertebra in the $m$-th spinal X-ray image belongs to a given vertebral category; the image-text feature similarity score between the $n$-th vertebra in the $m$-th spinal X-ray image and a given vertebral category; the fused image feature of the $n$-th vertebra in the $m$-th spinal X-ray image; and the text feature of a given vertebral category.
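The general shape of such a loss can be sketched as a weighted softmax cross-entropy. This is only an illustration under stated assumptions: the region weighting coefficients are taken as given inputs (how they are derived from the thoracic and lumbar probabilities is not reproduced here), the similarity scores are treated as logits, and the loss is averaged over images. The function name and argument layout are hypothetical.

```python
import math

def region_weighted_cross_entropy(scores, labels, weights):
    """Weighted softmax cross-entropy.

    scores[m][n][c]: similarity score of vertebra n in image m with category c.
    labels[m][n]:    index of the true category of vertebra n in image m.
    weights[m][n]:   region weighting coefficient of vertebra n in image m.
    """
    total = 0.0
    for image_scores, image_labels, image_weights in zip(scores, labels, weights):
        for row, label, weight in zip(image_scores, image_labels, image_weights):
            peak = max(row)
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in row))
            total += weight * (log_norm - row[label])  # -weight * log softmax(row)[label]
    return total / len(scores)  # assumed normalization: mean over images

# One image with two vertebrae and three categories (toy values).
scores = [[[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]]
labels = [[0, 1]]
weights = [[1.0, 0.5]]
print(region_weighted_cross_entropy(scores, labels, weights))
```

With all weights equal to one, this reduces to an ordinary cross-entropy, which makes the effect of the region weighting easy to isolate in experiments.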
[0123] S2: Acquire spinal X-ray images of the individual being tested through an image acquisition device, and perform target detection and key point detection of the vertebrae on the spinal X-ray images through a prompt information acquisition device to obtain prompt information that displays prompt boxes and prompt points.
[0124] S3: The segmentation and coding device with optimized parameters extracts features from the spinal X-ray image to obtain image feature representation. At the same time, the segmentation and coding device with optimized parameters performs vertebral region segmentation on the spinal X-ray image according to the prompt information, generates shape information and spatial location information of each vertebra, and encodes them as location embedding features.
[0125] S4: The image feature representation and the location embedding feature are fused using an attention-weighted method through a parameter-optimized segmentation coding device to obtain a fused image feature sequence.
[0126] S5: Calculate the image-text feature similarity score between the fused image feature sequence and the vertebral category list using the vertebral classification device, and substitute the image-text feature similarity score into the operation rules of the multilevel conditional random field to obtain the X-ray vertebral classification result.
[0127] To verify the impact of each functional module on the performance of the segmentation-guided-coding-based X-ray vertebral classification system in this embodiment, an ablation experiment was conducted on the thoracic spine dataset. The contributions of the region-specific cross-entropy loss function (RSCE loss) and of the vertebral body classification device based on the multilevel conditional random field (CRF) to overall system performance were compared and analyzed. The results were measured using the recognition accuracy (Identification Rate, IDR) and instance recognition accuracy (IRA) metrics and are shown in Table 1.
[0128] Table 1: Ablation Experiment Results
[0129]
[0130] The recognition accuracy (IDR) metric measures the model's overall accuracy in identifying vertebral body categories. Specifically, it is the proportion of vertebral body samples in the test dataset that are correctly identified as their target vertebral body category. This metric primarily reflects the model's accuracy in vertebral body detection and category discrimination, demonstrating its ability to distinguish between different vertebral body structures. The instance recognition accuracy (IRA) metric measures the model's overall consistency and completeness in recognizing multiple vertebral body instances within a single image. Specifically, it is the percentage of images in which all vertebral body instances are correctly identified and accurately labeled. This metric considers not only the correctness of individual vertebral body recognition but also reflects the model's instance-level recognition stability and spatial consistency when multiple vertebrae coexist.
[0131] By simultaneously introducing recognition accuracy and instance recognition accuracy metrics, the model performance can be comprehensively evaluated from two aspects: single vertebral body recognition accuracy and overall consistency of multi-vertebral body recognition. This is beneficial for objectively reflecting the actual application effect of the segmentation-guided coding-based X-ray vertebral classification system in the automatic recognition task of spinal images in this embodiment.
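The two metrics can be made concrete with a short sketch. It assumes that predictions have already been matched one-to-one, in order, with the ground-truth vertebrae of each image; the function name and the toy labels are hypothetical.

```python
def idr_and_ira(predictions, ground_truth):
    """Return (IDR, IRA) for per-image lists of predicted and true vertebra labels."""
    correct = total = perfect_images = 0
    for predicted, true in zip(predictions, ground_truth):
        hits = sum(p == t for p, t in zip(predicted, true))
        correct += hits
        total += len(true)
        # An image counts for IRA only if every vertebra in it is labeled correctly.
        perfect_images += int(len(predicted) == len(true) and hits == len(true))
    return correct / total, perfect_images / len(ground_truth)

predictions = [["L1", "L2", "L3"], ["T11", "T12", "L2"]]
ground_truth = [["L1", "L2", "L3"], ["T11", "T12", "L1"]]
idr, ira = idr_and_ira(predictions, ground_truth)
print(idr, ira)  # 5 of 6 vertebrae correct; 1 of 2 images fully correct
```

A single wrong vertebra lowers IDR only slightly but removes the whole image from IRA, which is why IRA is the more sensitive indicator of sequence-level consistency.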
[0132] As can be seen from Table 1, for the basic model, which uses only the fusion module and introduces neither the region-specific cross-entropy loss function (RSCE loss) nor the conditional random field of the vertebral body classification device, the IDR is 63.2% and the IRA is 21.8%, indicating low overall recognition performance. When the RSCE loss is added on top of the fusion module, the IDR increases to 82.3% and the IRA increases to 45.2%, indicating that this loss function effectively optimizes the training process and enhances category discrimination.
[0133] Furthermore, in the complete scheme that simultaneously incorporates a fusion module, RSCE loss, and a conditional random field, the IDR is improved to 87.3%, and the IRA is improved to 78.4%. The results show that each module works synergistically within the overall framework. The fusion module enhances feature representation capabilities, RSCE loss improves training stability, and the conditional random field utilizes structural constraints to optimize prediction results, thereby significantly improving the accuracy and consistency of thoracic vertebrae recognition. Therefore, the segmentation-guided coding-based X-ray vertebrae classification system in this embodiment achieves significantly better technical performance than the basic model in the thoracic vertebrae image recognition task.
[0134] To verify the overall performance advantage of the segmentation-guided coding-based X-ray vertebral classification system in this embodiment for lumbar spine image recognition, this embodiment selects several existing object detection and vertebral body recognition models for comparison and conducts comparative experiments on the same lumbar spine dataset. The comparison models used include the DETR model based on Transformer, the fifth version of the YOLO object detection network (YOLOv5) and the eighth version of the YOLO object detection network (YOLOv8) based on a single-stage detection framework, and the VertFound model designed for vertebral body localization tasks. The experiments uniformly use the same dataset and evaluation metrics, with recognition accuracy (IDR) and instance recognition accuracy (IRA) as performance metrics. The experimental results are shown in Table 2.
[0135] Table 2: Comparative Experiment Results
[0136]
[0137] Experimental results show that the DETR model achieved an IDR of 63.2% and an IRA of 21.8% on the lumbar spine dataset, indicating limited ability to finely distinguish vertebral structures. The YOLOv5 and YOLOv8 models achieved IDRs of 82.3% and 81.4%, respectively, but their IRAs were only 45.2% and 41.5%, showing shortcomings in distinguishing multiple vertebral instances and overall consistency. The VertFound model achieved an IDR of 75.4% and an IRA of 63.3%, demonstrating its advantage in vertebral structure modeling, but its overall recognition accuracy still has room for improvement.
[0138] In comparison, the segmentation-guided coding-based X-ray vertebral classification system in this embodiment achieves an IDR of 87.3% and an IRA of 78.4% on the lumbar spine dataset, significantly outperforming the aforementioned comparative models in both evaluation metrics. Experimental results demonstrate that the segmentation-guided coding-based X-ray vertebral classification system in this embodiment can effectively improve the stability and consistency of vertebral instance-level recognition while ensuring accurate vertebral body localization, showcasing the comprehensive performance advantages of this invention in the field of automatic lumbar spine image recognition.
[0139] Embodiment 2: Referring to Figure 1, Figure 4, Figure 5, Figure 6 and Figure 7, this embodiment discloses an X-ray vertebral classification system based on segmentation-guided coding. Figure 1 is a schematic diagram of the overall structure of the X-ray vertebral classification system. Figure 6 is a schematic diagram of the network structure of the X-ray vertebral classification system. As in the technical solution of Embodiment 1, the X-ray vertebral classification system includes an image acquisition unit, a prompt information acquisition unit, a segmentation and coding device, and a vertebral classification device.
[0140] The technical solution of this embodiment differs from that of Embodiment 1 as follows. Referring to Figure 6, in this embodiment the segmentation coding device includes, in addition to the Segment Anything Model (SAM), the image semantic coding device, and the fusion module, a classifier and an optimizer. The classifier communicates with the image semantic coding device, and the optimizer communicates with both the classifier and the similarity calculation device in the vertebral classification device. The classifier executes an image-level classification algorithm to perform global anatomical context reasoning on the image feature representation and dynamically generate region weighting coefficients. The optimizer substitutes the region weighting coefficients and the image-text feature similarity score into the region-specific cross-entropy loss function and, by solving for the optimal value of that loss function, optimizes the model parameters of the SAM, the image semantic coding device, and the fusion module.
[0141] Referring to Figure 7, the technical solution in this embodiment is the same as that in Embodiment 1 in terms of model parameter optimization. Since the optimizer is a model structure designed on the basis of the region-specific cross-entropy loss function, it can, in cooperation with the classifier, complete the parameter optimization process of Embodiment 1. Therefore, the technical solution in this embodiment has the same technical effect as that in Embodiment 1.
[0142] In the description of this application, the references to terms such as "an embodiment," "some embodiments," "in this embodiment," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0143] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A vertebral classification system based on segmentation-guided coding for X-ray films, characterized in that it comprises: an image acquisition device, used to acquire a spinal X-ray image of the individual being tested; a prompt information collector, which communicates with the image acquisition device and is used to perform target detection and key point detection of the vertebrae in the spinal X-ray image to obtain prompt information that displays prompt boxes and prompt points; a segmentation and coding device, which communicates with both the image acquisition device and the prompt information collector and is used to extract features from the spinal X-ray image to obtain an image feature representation, to segment the vertebral regions of the spinal X-ray image according to the prompt information, generating shape information and spatial location information of each vertebra and encoding them as location embedding features, and then to perform attention-weighted feature fusion of the image feature representation and the location embedding features to obtain a fused image feature sequence; and a vertebral classification device, which communicates with the segmentation and coding device and is used to calculate the image-text feature similarity score between the fused image feature sequence and the vertebral category list, and to substitute the image-text feature similarity score into the operation rules of a multilevel conditional random field to obtain the X-ray vertebral classification result.
2. The X-ray vertebral classification system based on segmentation-guided coding according to claim 1, characterized in that the segmentation and coding device includes: a segmentation model, used to segment the vertebral regions of the spinal X-ray image according to the prompt information, generate shape information and spatial location information of each vertebra, and encode the obtained shape information and spatial location information into location embedding features; an image semantic coding device, used to extract features from the spinal X-ray image to obtain an image feature representation with at least three levels of features; and a fusion module, which communicates with both the segmentation model and the image semantic coding device and is used to perform attention-weighted feature fusion of the image feature representation and the location embedding features, and to integrate the fusion results at multiple scales to obtain a fused image feature sequence composed of the fused image features of each vertebra.
3. The X-ray vertebral classification system based on segmentation-guided coding according to claim 2, characterized in that the segmentation model includes a prompt encoder, an image encoder, and a mask decoder; the mask decoder communicates with the prompt encoder, the image encoder, and the fusion module; the image encoder performs feature segmentation encoding on the spinal X-ray image; the prompt encoder performs feature segmentation encoding on the prompt information; and the mask decoder decodes the encoding result obtained by the image encoder on the basis of the encoding result obtained by the prompt encoder to obtain the location embedding features.
4. The X-ray vertebral classification system based on segmentation-guided coding according to claim 3, characterized in that the image semantic coding device is a language-image contrastive pre-trained image encoder obtained through image pre-training.
5. The X-ray vertebral classification system based on segmentation-guided coding according to any one of claims 2-4, characterized in that the fusion module is configured to operate as follows: A1: perform multi-head attention fusion between the location embedding features and each level of the image feature representation to obtain the fusion result of the location embedding features with each level of features; A2: perform a feature pyramid pooling operation on each of the fusion results to obtain the corresponding pooling operation results; A3: perform channel concatenation on all the pooling operation results obtained in step A2 to obtain the concatenation result; A4: perform multi-head attention fusion between the location embedding features and the concatenation result to obtain the fused image feature sequence.
6. The X-ray vertebral classification system based on segmentation-guided coding according to claim 5, characterized in that the segmentation and coding device further includes: a classifier, which communicates with the image semantic coding device and is used to execute an image-level classification algorithm to perform global anatomical context reasoning on the image feature representation and dynamically generate region weighting coefficients; and an optimizer, which communicates with both the classifier and the vertebral classification device and is used to substitute the region weighting coefficients and the image-text feature similarity scores into the region-specific cross-entropy loss function, and to optimize the model parameters of the segmentation model, the image semantic coding device, and the fusion module by solving for the optimal value of the region-specific cross-entropy loss function.
7. The X-ray vertebral classification system based on segmentation-guided coding according to claim 6, characterized in that the formula for calculating the region-specific cross-entropy loss function is as follows: , , , In the formula, the symbols represent, in order: the value of the region-specific cross-entropy loss function; the number of spinal X-ray images used during parameter optimization; the number of vertebrae; the number of vertebrae; the region weighting coefficients generated by the classifier; the probability that a given vertebra in a given spinal X-ray image belongs to the thoracic vertebrae; the probability that a given vertebra in a given spinal X-ray image is any lumbar vertebra or the sacrum; the true label indicating whether a given vertebra in a given spinal X-ray image belongs to a given vertebral category; the image-text feature similarity score between a given vertebra in a given spinal X-ray image and a given vertebral category; the fused image feature of a given vertebra in a given spinal X-ray image; the text feature of a given vertebral category; the dot product of vectors; and the scaling factor.
8. The X-ray vertebral classification system based on segmentation-guided coding according to claim 5, characterized in that the vertebral classification device includes: a text semantic encoding device, used to receive or generate a category list and perform feature extraction on the category list to obtain the text feature of each vertebral category in the category list; a similarity calculation device, which communicates with both the text semantic encoding device and the fusion module and is used to calculate the image-text feature similarity score between the fused image feature sequence and all the text features; and a vertebral body classification device, which communicates with the similarity calculation device and is used to substitute the image-text feature similarity scores into the operation rules of a multilevel conditional random field to obtain the X-ray vertebral classification result.
9. A method for classifying vertebrae on X-ray films based on segmentation-guided coding, characterized in that the method uses the X-ray vertebral classification system based on segmentation-guided coding according to any one of claims 1-8 and includes the following steps: S1: using the constructed spinal X-ray image-vertebra category dataset, optimize the model parameters of the segmentation coding device by solving for the optimal value of the region-specific cross-entropy loss function; S2: acquire the spinal X-ray image of the individual being tested through the image acquisition device, and perform target detection and key point detection of the vertebrae on the spinal X-ray image through the prompt information collector to obtain prompt information that displays prompt boxes and prompt points; S3: extract features from the spinal X-ray image through the parameter-optimized segmentation and coding device to obtain an image feature representation, and at the same time segment the vertebral regions of the spinal X-ray image according to the prompt information through the parameter-optimized segmentation and coding device, generating shape information and spatial location information of each vertebra and encoding them as location embedding features; S4: perform attention-weighted feature fusion of the image feature representation and the location embedding features through the parameter-optimized segmentation coding device to obtain a fused image feature sequence; S5: calculate the image-text feature similarity score between the fused image feature sequence and the vertebral category list using the vertebral classification device, and substitute the image-text feature similarity score into the operation rules of a multilevel conditional random field to obtain the X-ray vertebral classification result.