A system, method, and electronic device for differentiating between a child's dentigerous cyst and periapical cyst

By combining text-guided few-shot learning with multimodal fusion, the problem of ROI annotation dependence and data requirements in the diagnosis of odontogenic cysts in children by existing AI models is solved, and high-precision, interpretable cyst classification is achieved, which can meet the diagnostic needs of different medical centers.

CN122245708A · Pending · Publication Date: 2026-06-19 · BEIJING CHILDRENS HOSPITAL AFFILIATED TO CAPITAL MEDICAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING CHILDRENS HOSPITAL AFFILIATED TO CAPITAL MEDICAL UNIV
Filing Date: 2026-03-13
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing AI models suffer from several problems in differentiating pediatric odontogenic cysts, including high ROI labeling dependence, high data requirements, lack of interpretability, difficulty in handling data imbalance, and poor cross-center generalization ability, which limits their application, especially in the field of pediatric oral pathology.

Method used

We employ a text-guided few-shot learning method, using axial two-dimensional CBCT images and radiological feature description text as input. Combined with interpretable AI technology, we generate multimodal class prototypes through multimodal fusion and few-shot learning classification, thereby achieving high-precision diagnosis.

Benefits of technology

No ROI annotation is required, which reduces data requirements, improves diagnostic accuracy and model interpretability, enhances adaptability and stability in different medical centers, and improves the classification accuracy of pediatric odontogenic cysts.



Abstract

This invention belongs to the field of artificial intelligence-assisted clinical diagnosis, and specifically relates to a system, method, and electronic device for distinguishing between dentigerous cysts and periapical cysts in children. The technical solution combines medical images with disease text information in a multimodal model and optimizes the model with few-shot learning. Specifically, it includes: acquiring and preprocessing cone-beam computed tomography (CBCT) images of pediatric patients; acquiring radiographic descriptions of dentigerous cysts and periapical cysts and encoding them as text feature vectors; extracting visual features from the images; fusing the visual features and text feature vectors to generate multimodal class prototypes corresponding to the two types of cysts; and using a prototype network to compute the distance between the visual features of the sample to be classified and the multimodal class prototypes, assigning the sample to the category of the nearest prototype. Through its multimodal architecture integrating images and text and its use of few-shot learning, the invention can achieve high-precision classification without manual annotation when training samples are limited and the data are imbalanced, simplifying clinical procedures and improving diagnostic efficiency and robustness.


Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence-assisted diagnostic technology, specifically relating to a system, method, and electronic device for distinguishing between dentigerous cysts and periapical cysts in children.

Background Technology

[0002] Odontogenic cysts are common lesions of the oral and maxillofacial region. In children, dentigerous cysts (DCs) and periapical cysts (PCs) are the two most common types. DCs are developmental cysts that form around the crown of unerupted teeth; PCs are inflammatory cysts, usually caused by pulp necrosis and periapical infection. Children's jaws are still growing and developing. Dentigerous cysts may obstruct the normal eruption of permanent teeth, leading to malocclusion, while infection from periapical cysts may affect the underlying permanent tooth germ and impair its normal development. Early and accurate identification is crucial for protecting the normal development of the dentition in children.

[0003] Clinically, these two types of cysts are treated with completely different approaches: Treatment of periapical cysts generally revolves around the "pathogenic tooth," with root canal treatment being the first choice. After eliminating the infection, the cyst often heals spontaneously. The goal is to preserve the natural tooth. Treatment of dentigerous cysts revolves around the "cyst and affected tooth." Surgical removal of the cyst is necessary, and depending on the tooth's position and development, a decision is made whether to assist eruption or extract the impacted tooth. The treatment goal is to remove the lesion and manage the dentition.

[0004] If a dentigerous cyst is misdiagnosed as a periapical cyst and only root canal treatment is performed, the cyst will continue to grow, destroying the jawbone, leading to displacement of adjacent teeth, root resorption, and even pathological fractures. Conversely, if a periapical cyst is misdiagnosed as a dentigerous cyst and unnecessary surgery is performed, it may damage adjacent teeth and important structures (such as the inferior alveolar nerve) and sacrifice teeth that could have been saved through treatment.

[0005] In recent years, artificial intelligence (AI) models have become an important tool for assisting diagnosis. The accuracy of existing AI methods in the diagnosis of odontogenic cysts varies, and most of them have the following problems:
(1) Reliance on region of interest (ROI) annotation: Existing AI models require an ROI covering the lesion area as input for modeling. This greatly limits their practical application, because accurate ROI annotation requires experienced doctors, is labor-intensive, and is prone to errors.
(2) Need for a large amount of labeled data: Traditional supervised learning methods require a large amount of labeled training data, but in pediatric oral pathology, obtaining large-scale labeled datasets is often impractical because expert annotation is both expensive and time-consuming.
(3) Lack of interpretability: Existing AI models are usually "black box" models. Doctors cannot understand the basis of the model's decisions, which limits trust in and acceptance of AI in clinical practice.
(4) Inability to handle data imbalance: The distribution of DC and PC in clinical data is often imbalanced. Existing models have difficulty handling class imbalance, which reduces their ability to identify the minority class.
(5) Poor generalization across centers: Different medical centers have different CBCT equipment configurations and scanning parameters, making it difficult for existing models to generalize well across centers.
(6) Transformer architectures are poorly suited: Although Transformer models perform well in many image domains, the low signal-to-noise ratio, numerous artifacts, and complex anatomical structures of CBCT images make it difficult for the attention mechanism to focus on meaningful features.

Summary of the Invention

[0006] To address the aforementioned issues, this invention proposes an automatic diagnostic method and system for pediatric odontogenic cysts based on text-guided few-shot learning using CBCT images. The method eliminates the need for ROI annotation, using only axial two-dimensional CBCT images and text guidance as input. Combined with interpretable AI technology, it achieves high-precision cyst classification and diagnosis.

[0007] A system for distinguishing between dentigerous cysts and periapical cysts in children according to a specific embodiment of the present invention includes:
a data acquisition module, used to acquire CBCT images of pediatric patients;
an image preprocessing module, used to standardize the CBCT images and construct multi-channel input images containing three-dimensional spatial context;
a text encoding module, used to obtain radiographic feature description texts of dentigerous cysts and periapical cysts and encode them into feature vectors;
a visual encoding module, used to extract visual features from the preprocessed CBCT images;
a multimodal fusion module, used to fuse the visual features and text feature vectors to generate two multimodal class prototypes corresponding to dentigerous cysts and periapical cysts, respectively;
a few-shot learning classification module, which uses a prototype network to calculate the Euclidean distance between the visual features of the sample to be classified and the two multimodal class prototypes, and classifies the sample into the category corresponding to the nearest prototype; and
an output module, used to output the obtained classification results.

[0008] In this invention, the three-dimensional spatial context refers to the information that reflects the morphological continuity of the lesion in three-dimensional space, introduced by stacking continuous slices passing through the center of the lesion; compared with a single slice, this contextual information helps the model learn the three-dimensional features of the lesion.

[0009] Optionally, the system further includes an enhanced inference module for enhancing the classification of samples during the inference phase, including performing multiple geometric transformations on each sample and combining the prediction results after multiple transformations to determine the classification.

[0010] This approach can improve the model's robustness to different poses and positional inputs, thereby enhancing the stability and accuracy of the final classification results.

[0011] Optionally, the system further includes an interpretability module for generating a visual interpretation of the classification results. This module includes: a saliency calculation unit for determining the contribution of each region in the input image to the classification results based on the output gradient information; and a visualization unit for generating a saliency heatmap based on the contribution and fusing the heatmap with the CBCT image for display.

[0012] This module visualizes the model's decision-making process, solving the "black box" problem and allowing doctors to intuitively understand which regions of the image the model bases its judgments on, thereby enhancing their trust in the diagnostic results.

[0013] Optionally, the system further includes a clustering analysis module for reducing the dimensionality of a high-dimensional feature space and visualizing it, wherein the high-dimensional feature space is composed of the features of the samples to be classified in the few-shot learning classification module.

[0014] By reducing high-dimensional features to two- or three-dimensional space for visualization, we can intuitively examine whether the features learned by the model have good intra-class aggregation and inter-class separability, thereby verifying the model's discriminative ability.
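The notions of intra-class aggregation and inter-class separability can also be checked numerically. The following is a minimal sketch, not the method of this invention: it replaces the visual inspection of a reduced-dimension plot with a simple ratio of inter-centroid distance to mean intra-class spread. The function names and the toy two-dimensional features are hypothetical.

```python
import math
from statistics import mean

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [mean(component) for component in zip(*vectors)]

def separability_ratio(features_by_class):
    """Smallest inter-centroid distance divided by the mean intra-class spread.

    Larger values suggest tighter clusters that sit further apart.
    """
    centroids = {label: centroid(vecs) for label, vecs in features_by_class.items()}
    # Mean distance of each sample to the centroid of its own class.
    intra = mean(
        math.dist(vec, centroids[label])
        for label, vecs in features_by_class.items()
        for vec in vecs
    )
    labels = list(centroids)
    # Smallest distance between any two class centroids.
    inter = min(
        math.dist(centroids[a], centroids[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    )
    return inter / intra

# Toy 2-D features (hypothetical values, chosen only for illustration).
toy = {
    "DC": [[0.0, 0.0], [0.0, 2.0]],
    "PC": [[10.0, 0.0], [10.0, 2.0]],
}
ratio = separability_ratio(toy)
```

A ratio well above 1 indicates that the classes are separated by more than their own spread. The measure is undefined when every sample coincides with its class centroid.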

[0015] Optionally, the image preprocessing module includes: a resolution normalization unit, used to normalize the CBCT image to 0.25 mm/pixel to eliminate the influence of differences in scanning parameters between devices; an image cropping unit, used to crop the image to 224×224 pixels to meet the input size requirements of the subsequent neural network model; and a channel stacking unit, used to generate a three-channel image. The channel stacking unit takes a slice passing through the central region of the lesion as the reference, acquires the reference slice and its two adjacent consecutive slices, and stacks the three slices into a three-channel image. This incorporates local three-dimensional spatial context into the two-dimensional input, which helps the model understand the lesion structure more comprehensively.
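The three preprocessing steps can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the implementation of this invention: nearest-neighbour resampling is assumed, the crop is assumed to be centered on the image (with zero padding when the image is smaller than 224×224), and the "two adjacent slices" are assumed to be one on each side of the reference slice. All function names are hypothetical.

```python
def resample_nearest(image, spacing_mm, target_mm=0.25):
    """Nearest-neighbour resample of a 2D slice (list of rows) to target_mm per pixel."""
    scale = spacing_mm / target_mm
    rows, cols = len(image), len(image[0])
    new_rows, new_cols = max(1, round(rows * scale)), max(1, round(cols * scale))
    return [
        [image[min(rows - 1, int(r / scale))][min(cols - 1, int(c / scale))]
         for c in range(new_cols)]
        for r in range(new_rows)
    ]

def center_crop(image, size=224, fill=0):
    """Crop (or pad with `fill`) around the image centre to size x size."""
    rows, cols = len(image), len(image[0])
    top, left = (rows - size) // 2, (cols - size) // 2
    return [
        [image[r][c] if 0 <= r < rows and 0 <= c < cols else fill
         for c in range(left, left + size)]
        for r in range(top, top + size)
    ]

def build_three_channel(volume, reference_index, spacing_mm):
    """Stack the reference slice and its two neighbours into a 3-channel image."""
    if not 1 <= reference_index <= len(volume) - 2:
        raise ValueError("reference slice needs a neighbour on each side")
    return [
        center_crop(resample_nearest(volume[i], spacing_mm))
        for i in (reference_index - 1, reference_index, reference_index + 1)
    ]

# Toy volume: 5 slices of 100 x 120 pixels at 0.5 mm/pixel,
# each slice filled with its own index so the channels are easy to tell apart.
volume = [[[z] * 120 for _ in range(100)] for z in range(5)]
channels = build_three_channel(volume, reference_index=2, spacing_mm=0.5)
```

In the toy example, resampling from 0.5 mm/pixel to 0.25 mm/pixel doubles each slice to 200×240 pixels before the crop, so the 224×224 output is padded vertically and cropped horizontally.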

[0016] Optionally, the text encoding module includes: a text acquisition unit, used to collect text describing the radiographic features of dentigerous cysts and periapical cysts; a text prompt generation unit, used to generate multiple class-level text prompts for each category, wherein each text prompt describes the imaging features of the corresponding lesion; and a text encoder, used to encode the text prompts into 256-dimensional feature vectors. Introducing standardized textual prior knowledge guides the model to learn more discriminative features when image data is limited.

[0017] Preferably, the imaging features include the location, shape, or density pattern of the corresponding lesion.

[0018] Optionally, the visual encoding module includes: a visual encoder; and a feature projection layer comprising a fully connected layer that receives the 2048-dimensional features extracted by the visual encoder and converts them into 256-dimensional features.

[0019] The high-dimensional visual features are reduced in dimensionality by using a feature projection layer to match the dimension of the text feature vector, which facilitates subsequent feature fusion.

[0020] Optionally, the multimodal fusion module includes: The feature concatenation unit is used to connect the image feature vector and the text feature vector to generate a joint feature vector, wherein the dimension of the joint feature vector is equal to the sum of the dimensions of the image feature vector and the dimensions of the text feature vector.

[0021] Preferably, the image feature vector is 256-dimensional, the text feature vector is 256-dimensional, and the joint feature vector is 512-dimensional.

[0022] The multimodal fusion module also includes a fusion network, which consists of two fully connected layers for mapping 512-dimensional input features to 128-dimensional output, with the 128-dimensional output serving as a multimodal class prototype.

[0023] In this invention, the multimodal class prototype refers to a typical feature vector representing a specific category, obtained by fusing image and text modal information. In some embodiments, the first layer of the two fully connected layers maps the 512-dimensional features to 256 dimensions and performs ReLU activation, while the second layer maps the 256-dimensional features to 128 dimensions to obtain the final multimodal class prototype.
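The concatenation and the two fully connected layers described above can be sketched as follows. This is only an illustration of the stated dimensions (256 + 256 → 512 → 256 → 128): the weights are random placeholders rather than trained parameters, the input vectors are random stand-ins for encoder outputs, and all names are hypothetical.

```python
import random

def linear(vector, weights, bias):
    """Fully connected layer: `weights` has one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def relu(vector):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in vector]

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (placeholder for trained parameters)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse(image_vec, text_vec, layer1, layer2):
    """Concatenate 256-d image and text vectors, then map 512 -> 256 -> 128."""
    joint = image_vec + text_vec           # list concatenation: 256 + 256 = 512
    hidden = relu(linear(joint, *layer1))  # first layer: 512 -> 256 with ReLU
    return linear(hidden, *layer2)         # second layer: 256 -> 128

rng = random.Random(0)
layer1 = init_layer(rng, 512, 256)
layer2 = init_layer(rng, 256, 128)
image_vec = [rng.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for projected visual features
text_vec = [rng.gauss(0.0, 1.0) for _ in range(256)]   # stand-in for encoded text features
prototype = fuse(image_vec, text_vec, layer1, layer2)
```

No activation is applied after the second layer in this sketch, because the description specifies ReLU only for the first layer.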

[0024] The few-shot learning classification module uses a prototype network for few-shot learning classification, including: A meta-learning unit is used to construct multiple meta-tasks. Each meta-task contains a support set and a query set. The support set contains multiple categories, and each category contains a first number of samples. The query set contains multiple categories that are the same as the support set, and each category contains a second number of samples. A distance metric unit is used to calculate the Euclidean distance between the query sample features and the prototypes of various types, wherein the prototypes are determined based on the sample features of the corresponding categories in the support set; A classification decision unit is used to classify samples in the query set into the class prototype with the smallest distance based on the distance.
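The construction of a meta-task can be sketched as follows. This is a minimal illustration, assuming that the support and query sets of one meta-task are disjoint and drawn at random; the support size of 15 per class follows the figure given in the beneficial effects, while the query size of 5 and the sample identifiers are hypothetical.

```python
import random

def sample_episode(samples_by_class, n_support, n_query, rng):
    """Draw one meta-task: disjoint support and query sets over the same classes."""
    support, query = {}, {}
    for label, samples in samples_by_class.items():
        if len(samples) < n_support + n_query:
            raise ValueError(f"class {label!r} has too few samples")
        picked = rng.sample(samples, n_support + n_query)
        support[label] = picked[:n_support]   # first number of samples per class
        query[label] = picked[n_support:]     # second number of samples per class
    return support, query

# Hypothetical sample identifiers; the class sizes are deliberately imbalanced.
pool = {
    "DC": [f"dc_{i:03d}" for i in range(40)],
    "PC": [f"pc_{i:03d}" for i in range(25)],
}
rng = random.Random(42)
support, query = sample_episode(pool, n_support=15, n_query=5, rng=rng)
```

Because every meta-task draws the same number of samples from each class, the classes are balanced within a task even when the underlying pool is not.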

[0025] This scheme defines in detail how to achieve effective fusion of multimodal information through feature concatenation and neural networks, and how to use prototype networks to achieve efficient classification in small sample scenarios.

[0026] This invention also provides a method for distinguishing between dentigerous cysts and periapical cysts in children, comprising the following steps:
S1: Data acquisition: acquiring CBCT images of pediatric patients;
S2: Image preprocessing: standardizing the CBCT images and constructing a multi-channel input image containing three-dimensional spatial context;
S3: Text encoding: obtaining radiographic feature description text of dentigerous cysts and periapical cysts, and encoding the text into feature vectors;
S4: Visual feature extraction: extracting visual features from the preprocessed CBCT images;
S5: Multimodal fusion: fusing the visual features and the text feature vectors to generate two multimodal class prototypes corresponding to dentigerous cysts and periapical cysts, respectively;
S6: Few-shot learning classification: using a prototype network to calculate the Euclidean distance between the visual features of the sample to be classified and the two multimodal class prototypes, and classifying the sample into the category corresponding to the nearest prototype;
S7: Output: outputting the classification results.

[0027] The present invention also provides an electronic device, comprising: a processor and a memory; the memory for storing a computer program; the processor for executing the computer program stored in the memory to cause the electronic device to perform the method for distinguishing between dentigerous cysts and periapical cysts in children as described in any of the preceding claims.

[0028] The beneficial effects of this invention are: (1) No ROI labeling required, simplifying clinical application and deployment process. This invention directly uses axial 2D CBCT images as input, eliminating the need for manual annotation of regions of interest. It integrates the visual encoder, text encoder, fusion network, and classification module into a single unit, achieving joint optimization of multimodal features and classification targets, thus avoiding complex preprocessing and post-processing steps. This combined design not only simplifies clinical workflows and reduces physician workload but also facilitates rapid system deployment and maintenance, significantly improving the system's acceptability and ease of deployment in real-world clinical environments.

[0029] (2) Few-shot learning reduces data dependence. This invention employs a few-shot learning paradigm, requiring only 15 labeled samples per class to achieve effective learning, significantly reducing dependence on large-scale datasets. At the same time, radiological description text is incorporated into the model as prior knowledge. Together, these two measures improve classification accuracy from 84.90% for the image-only baseline to 87.47% in the data-scarce field of pediatric oral pathology, achieving both low data dependence and high classification accuracy.

[0030] (3) Text-guided enhancement to improve classification accuracy. This invention deeply fuses image and text features to generate multimodal class prototypes, while employing Grad-CAM++ interpretable AI technology to generate classification decision heatmaps. Multimodal fusion improves classification accuracy, while the interpretable heatmaps visually display the lesion regions the model focuses on and exhibit good consistency with actual lesion segmentation. Together, these two techniques ensure the model's discriminative performance and enhance doctors' understanding of and trust in the AI system's decision-making process.

[0031] (4) Test-time augmentation and few-shot learning enhance the robustness and generalization stability of the model. This invention employs a geometric flipping strategy for test-time augmentation, combined with a few-shot learning paradigm, to make the model more robust during the inference phase. Experiments show that horizontal and vertical flipping further improves the classification accuracy from 87.47% to 88.96%, while few-shot learning ensures the model's basic performance with limited data. The combination of these two approaches reduces the model's sensitivity to changes in the input image, improving its stability in practical applications.

[0032] (5) Cluster analysis and interpretable AI provide dual verification of the model's discriminative ability. This invention employs t-SNE to reduce high-dimensional features to a 2D space, intuitively displaying the distribution and separation of DC and PC samples. At the same time, Grad-CAM++ heatmaps are used to visualize the basis of each decision. Cluster analysis verifies the model's discriminative ability at the feature space level, while the heatmap explains the classification criteria at the image space level. The two methods reinforce each other, demonstrating both the model's effectiveness and its interpretability.

[0033] (6) Enhanced cross-center validation and testing to ensure clinical generalization ability. This invention performs cross-center validation on data from two different medical centers, while employing test-time augmentation strategies to improve model robustness. Cross-center validation demonstrates the model's adaptability under different device configurations and scanning parameters, while test-time augmentation further enhances the model's stability in single inferences. The combination of these two approaches ensures that the model maintains reliable classification performance in diverse real-world clinical scenarios.

[0034] (7) Few-shot learning handles class imbalance and adapts to the distribution of real clinical data. The few-shot learning paradigm of this invention is naturally suited to scenarios with scarce data, while its prototype network-based mechanism can effectively handle class imbalance. Even with an uneven distribution of DC and PC samples, the model can still maintain stable classification performance, avoiding the bias of traditional methods towards the majority class. This combined design better matches the imbalanced sample distribution found in real clinical settings.

[0035] (8) High-precision diagnosis and explainable AI assist clinical decision-making and teaching. This invention achieved a classification accuracy of 88.96% on data from 457 pediatric patients, while visualizing the decision-making process using Grad-CAM++ heatmaps. High-precision diagnosis provides a reliable supplementary opinion for physicians, and the interpretable heatmaps help them understand the logic behind the model's judgments. The heatmaps can also be used in medical teaching, helping young doctors learn the typical characteristics of the two types of cysts in CBCT images. The combination of these two approaches not only improves diagnostic efficiency but also promotes the transfer of clinical knowledge and experience.

Attached Figure Description

[0036] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0037] Figure 1 is a schematic diagram of the system structure of the present invention.

[0038] Figure 2 shows the ROC and PR curves for embodiments of the present invention. Left: ROC curve for a balanced query set (AUC = 0.888); right: PR curve for an unbalanced query set.

[0039] Figure 3 shows the Grad-CAM++ visualization of a dentigerous cyst case of this invention. Left: 2D CBCT image; right: Grad-CAM++ heatmap overlay.

[0040] Figure 4 shows the Grad-CAM++ visualization of a periapical cyst case of this invention. Left: 2D CBCT image; right: Grad-CAM++ heatmap overlay.

[0041] Figure 5 shows the t-SNE 2D mapping of this invention. The upper panel represents the balanced query set and the lower panel the unbalanced query set; orange dots represent DC patients, and blue dots represent PC patients.

Detailed Implementation

[0042] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be described in detail below. Obviously, the described embodiments are merely some embodiments of this invention, and not all embodiments. Based on the embodiments of this invention, all other implementation methods obtained by those skilled in the art without creative effort are within the scope of protection of this invention.

[0043] It should be noted that in the description of this invention, the terms "comprising" and "having," and any variations thereof, are intended to be non-exclusive inclusion. Furthermore, the terms "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance. Unless otherwise specified, "a plurality of" in the embodiments of this invention means two or more.

[0044] Before providing a further detailed description of the embodiments of the present invention, some of the nouns and terms involved in the embodiments of the present invention will be explained, and the nouns and terms involved in the embodiments of the present invention shall be interpreted as follows.

[0045] (1) CBCT image: refers to cone beam computed tomography (CBCT) image, which is a medical image that can provide three-dimensional high-resolution images of the oral and maxillofacial region. Its data is essentially three-dimensional volume data composed of a series of two-dimensional axial slices.

[0046] (2) Multimodal class prototype: refers to a feature vector that represents a specific lesion category (e.g., dentigerous cyst or periapical cyst). This feature vector is generated by fusing visual features from the image modality and semantic features from the text modality. It can be regarded as the center point or average representation of the category in the multimodal feature space.

[0047] (3) Prototype Network: refers to a few-shot learning classification model based on metric learning. Its core working principle is to calculate a prototype (i.e. the mean of all sample features under that category) in the feature space for each category, and to classify the sample to be classified into the category corresponding to the nearest prototype by calculating the distance between the features of the sample to be classified and the prototypes of each category (such as Euclidean distance).
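The working principle in definition (3) can be sketched in a few lines. This is a minimal illustration with toy two-dimensional features; the function names and values are hypothetical, and real features would come from the visual encoder.

```python
import math

def class_prototype(support_features):
    """Prototype = component-wise mean of the support features of one class."""
    n = len(support_features)
    return [sum(column) / n for column in zip(*support_features)]

def classify(query_feature, prototypes):
    """Return the label of the nearest prototype and all Euclidean distances."""
    distances = {label: math.dist(query_feature, proto) for label, proto in prototypes.items()}
    return min(distances, key=distances.get), distances

# Toy support features (hypothetical values).
support = {
    "DC": [[1.0, 0.0], [3.0, 0.0]],
    "PC": [[0.0, 4.0], [0.0, 6.0]],
}
prototypes = {label: class_prototype(feats) for label, feats in support.items()}
label, distances = classify([2.0, 1.0], prototypes)
```

Here the DC prototype is [2.0, 0.0] and the PC prototype is [0.0, 5.0], so the query at [2.0, 1.0] lies at distance 1 from DC and is assigned to that class.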

[0048] (4) Saliency heatmap: This refers to a visual image that is overlaid on the original medical image using color coding (e.g., from cool to warm tones) to mark the regions in the original image that contribute the most to the AI model's specific classification decision. Warm-toned regions usually indicate a higher contribution, thus providing an intuitive visual explanation for the model's decision.

[0049] Example 1: A system for differentiating between dentigerous cysts and periapical cysts in children. As shown in Figure 1, the system of the present invention includes a data acquisition module, an image preprocessing module, a text encoding module, a visual encoding module, a multimodal fusion module, a few-shot learning classification module, and an output module. These modules work together to form a complete technical chain from raw data input to interpretable result output.

[0050] Data acquisition module: This module is used to acquire CBCT images of pediatric patients. In practical applications, it can be an interface connecting to a hospital image archiving and communication system (PACS), or a software interface that allows users to manually upload CBCT image files (such as DICOM format files) to acquire axial two-dimensional slice image data of pediatric patients from the CBCT device. This module provides the raw data source for all subsequent analyses.

[0051] Image preprocessing module: Connected to the data acquisition module, this module standardizes the acquired raw CBCT images and constructs a multi-channel input image containing a 3D spatial context. Because different medical institutions may use different CBCT equipment models and have varying scanning parameters, the resolution, size, and grayscale range of the images differ. Directly processing this heterogeneous data would severely impact the model's performance and generalization ability. Therefore, the standardization process performed by the image preprocessing module is a crucial step in ensuring model stability. The processing results from this module are then transmitted to the visual encoding module.

[0052] Text encoding module: This module is used to acquire radiographic descriptive text about dentigerous cysts and periapical cysts, and encode it into feature vectors with semantic information. These textual descriptions (collected from medical textbooks and professional medical websites) essentially digitize the prior knowledge and diagnostic experience of human physicians, providing the model with descriptive textual information beyond images. The output of this module, the textual feature vectors, is then fed to the multimodal fusion module.

[0053] Visual encoding module: This module receives standardized images from the image preprocessing module and is responsible for extracting deep visual features from the images. These visual features capture key radiological information such as the shape, texture, density, and boundaries of the lesion area. The visual features extracted by this module are then transmitted to the multimodal fusion module.

[0054] Multimodal fusion module: This serves as a bridge connecting image and text information. The multimodal fusion module receives visual features from the visual encoding module and text feature vectors from the text encoding module, and fuses these two modalities. The purpose of fusion is to generate two multimodal class prototypes corresponding to dentigerous cysts and periapical cysts, respectively. Because these prototypes contain information on both "what they look like" (visual features) and "what they should look like" (textual description features), they are more comprehensive and accurate than single-modal feature representations.

[0055] Few-shot learning classification module: The module receives multimodal class prototypes generated by the multimodal fusion module and uses a prototype network for classification. When a sample to be classified (query sample) is input, the module first extracts its visual features through the visual encoding module. Then, based on a preset distance metric (e.g., Euclidean distance), it calculates the distance between the visual features of the sample to be classified and the two multimodal class prototypes: dentigerous cyst and periapical cyst. Finally, according to the nearest neighbor principle, the sample to be classified is assigned to the category corresponding to the nearest prototype. This metric-based classification method enables the model to learn effectively and classify accurately even with only a small number of labeled samples.

[0056] Output module: This is used to present the final diagnostic results to users (such as clinicians). The output includes not only the judgment of the cyst type (e.g., "dentigerous cyst" or "periapical cyst"), but also the confidence score of the classification and, more importantly, the visual explanation generated by the interpretability module, thus achieving a complete and reliable closed loop for auxiliary diagnosis.
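The description does not state how the confidence score is computed. One common choice in prototype networks is a softmax over negative squared Euclidean distances, sketched below as an assumption rather than as the method of this invention. The sketch also assumes that the query features and the prototypes lie in the same feature space; the distances used here are hypothetical.

```python
import math

def confidence_from_distances(distances):
    """Softmax over negative squared distances; closer prototypes get higher scores."""
    logits = {label: -d * d for label, d in distances.items()}
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical distances from one query sample to the two class prototypes.
scores = confidence_from_distances({"dentigerous cyst": 1.0, "periapical cyst": 2.0})
predicted = max(scores, key=scores.get)
```

The scores sum to 1, so the score of the predicted class can be reported directly as the confidence of the classification.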

[0057] In a preferred embodiment, to further improve the accuracy and robustness of the diagnosis, as shown in Figure 1, the system may also include an enhanced inference module. This module applies augmentation to the input samples to be classified during the inference (i.e., testing) phase.

[0058] Specifically, for each sample to be classified, the enhanced inference module performs multiple pre-defined geometric transformations, such as horizontal flipping, vertical flipping, and small-angle rotation. After each transformation, a prediction is made by the few-shot learning classification module. The final classification is then determined by combining the predictions from the multiple transformations (e.g., through voting or probability averaging). This strategy is equivalent to "observing" the lesion from multiple perspectives, which can reduce misjudgments caused by incidental image orientation or positional deviations, thereby improving the stability of the model in real clinical scenarios.
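A minimal sketch of this kind of test-time augmentation, using probability averaging over the original image and its two flips. The choice of views, the helper names, and `toy_predict_proba` (a stand-in for the few-shot classifier) are assumptions made for illustration.

```python
def hflip(img):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror a 2D image (list of rows) top to bottom."""
    return [row[:] for row in img[::-1]]

def tta_predict(img, predict_proba):
    """Average class probabilities over the original image and its flips."""
    views = [img, hflip(img), vflip(img)]
    probs = [predict_proba(view) for view in views]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return mean.index(max(mean)), mean

def toy_predict_proba(img):
    """Hypothetical stand-in classifier: compares left and right column sums."""
    left = sum(row[0] for row in img)
    right = sum(row[-1] for row in img)
    p0 = left / (left + right)
    return [p0, 1.0 - p0]

img = [[3, 0, 1], [3, 0, 1]]
pred, mean = tta_predict(img, toy_predict_proba)
print(pred, mean)
```

Majority voting over the per-view predictions would be an alternative way to combine the results, as the text notes.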

[0059] In another preferred embodiment, in order to address the "black box" problem of model decision-making, as shown in Figure 1, the system also includes an interpretability module. This module generates visual interpretations of the classification results, enabling doctors to understand the basis for the model's judgments. The interpretability module may include a saliency calculation unit and a visualization unit. The saliency calculation unit, based on the gradient information output by the few-shot learning classification module, uses a class activation mapping algorithm such as Grad-CAM++ to determine the contribution of each region in the input image to the final classification result. The visualization unit generates a color-coded saliency heatmap 405 based on the calculated contributions and then overlays this heatmap on, or displays it alongside, the original CBCT image.

[0060] The highlighted areas (usually warm-toned) in the heatmap precisely cover the lesion area, visually telling doctors, "The model mainly identified it as a dentigerous cyst because it saw the features of this area." This visual basis for decision-making greatly enhances clinicians' trust in AI diagnostic results.

[0061] Furthermore, in order to verify the effectiveness of the features learned by the model from the macroscopic perspective of data distribution, as shown in Figure 1, the system may also include a clustering analysis module. This module performs dimensionality reduction and visualization on the features of the samples to be classified, which lie in a high-dimensional feature space after being processed by the few-shot learning classification module. For example, algorithms such as t-SNE (t-distributed stochastic neighbor embedding) or PCA (principal component analysis) can be used to map the high-dimensional features to a two-dimensional or three-dimensional space for easy visualization.

[0062] After processing by the clustering analysis module, a scatter plot is generated in which sample points of the same category group together. For example, the sample points of dentigerous cysts (DCs) form one cluster, while the sample points of periapical cysts (PCs) form another. A clear boundary between these two clusters indicates that the multimodal features learned by this system have strong discriminative power and can distinguish between the two types of lesions.
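As a dependency-free sketch of the dimensionality-reduction step, the following projects feature vectors onto their first two principal components (PCA, one of the two algorithms named above) using power iteration. t-SNE would normally be run through a library and is not shown. The function names and the toy 4-dimensional features are assumptions; the system described here works with 128-dimensional features.

```python
import math

def _matvec(mat, v):
    return [sum(m * x for m, x in zip(row, v)) for row in mat]

def _top_eigen(mat, iters=500):
    """Power iteration: dominant eigenvalue/eigenvector of a symmetric PSD matrix."""
    d = len(mat)
    v = [float(i + 1) for i in range(d)]  # fixed, non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = _matvec(mat, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in any direction
            return 0.0, v
        v = [x / norm for x in w]
    eigval = sum(a * b for a, b in zip(v, _matvec(mat, v)))
    return eigval, v

def pca_2d(features):
    """Return 2D coordinates of each feature vector along the top two components."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - mean[j] for j in range(d)] for f in features]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        eigval, vec = _top_eigen(cov)
        components.append(vec)
        # Deflate so the next pass finds the following component.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]

# Hypothetical toy features: three "DC" samples followed by three "PC" samples.
dc = [[0.0, 0.1, 0.0, 0.2], [0.2, 0.0, 0.1, 0.0], [0.1, 0.2, 0.0, 0.1]]
pc = [[5.0, 5.1, 0.1, 0.0], [5.2, 4.9, 0.0, 0.1], [4.9, 5.0, 0.2, 0.0]]
points = pca_2d(dc + pc)
for label, (x, y) in zip(["DC"] * 3 + ["PC"] * 3, points):
    print(label, round(x, 3), round(y, 3))
```

Plotting the resulting coordinates as a scatter plot, colored by class, gives the kind of cluster view the paragraph describes.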

[0063] Furthermore, the methods of multimodal fusion and classification can also be varied. In another implementation, the multimodal fusion module can avoid simple feature concatenation and instead employ an attention-based fusion mechanism. This mechanism dynamically assigns different weights to visual and textual features based on the characteristics of the input samples, enabling the model to more intelligently focus on the more important information modalities in the current discrimination task. In addition, the few-shot learning classification module can also employ relation networks, besides using prototype networks.
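One possible reading of such an attention-based fusion is sketched below: each modality receives a scalar score, a softmax turns the two scores into weights, and the fused feature is the weighted sum. The scoring vectors `w_visual` and `w_text` stand in for learned parameters and, like the function names, are assumptions; the source does not specify the mechanism in this detail.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(visual, text, w_visual, w_text):
    """Weight each modality by a softmax over per-modality scores, then sum."""
    s_v = sum(a * b for a, b in zip(visual, w_visual))
    s_t = sum(a * b for a, b in zip(text, w_text))
    alpha_v, alpha_t = softmax([s_v, s_t])
    fused = [alpha_v * v + alpha_t * t for v, t in zip(visual, text)]
    return fused, (alpha_v, alpha_t)

# Hypothetical 2-dimensional features and scoring vectors.
fused, (alpha_v, alpha_t) = attention_fuse(
    visual=[1.0, 0.0], text=[0.0, 1.0], w_visual=[2.0, 0.0], w_text=[0.0, 1.0]
)
print(fused, alpha_v, alpha_t)
```

Because the weights depend on the input features, different samples can lean more on the visual or the textual modality.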

[0064] The technical solution of the present invention is not limited to the specific model combination described above. In another embodiment, the visual encoder in the visual encoding module can be replaced with other advanced convolutional neural network models, such as InceptionV3 or DenseNet121.

[0065] Similarly, the text encoder in the text encoding module can be replaced with other language models such as RoBERTa. Experimental results show that although the specific classification accuracy fluctuates after model replacement (e.g., 84.6% accuracy with InceptionV3 and 82.6% with DenseNet121), the system of this invention can still effectively complete the classification task, and the interpretability module can still generate meaningful decision heatmaps. This demonstrates that the technical solution of this invention has good model scalability and universality.

[0066] Similarly, there are multiple technical means to achieve interpretability. In another implementation, the interpretability module may not use the Grad-CAM++ algorithm, but instead employ other interpretability algorithms such as LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations). LIME approximates the local decision-making behavior of the original complex model by learning a simple, interpretable linear model in the neighborhood of the sample to be interpreted. SHAP, on the other hand, uses the Shapley value concept from cooperative game theory to calculate the contribution of each input feature (e.g., a region in an image) to the final prediction. Regardless of the algorithm used, the goal is to generate a visual interpretation that doctors can understand, thereby breaking the "black box" nature of AI models and enhancing human-machine trust in clinical applications. This shows that the innovation of this invention lies in integrating the function of "interpretability analysis" itself, rather than being limited to a specific implementation algorithm.

[0067] The following provides a more specific implementation scheme. The system of the present invention includes the following modules: (1) Data acquisition module: Acquire axial two-dimensional slice image data of pediatric patients from CBCT equipment.

[0068] (2) Image preprocessing module: Image resolution normalization unit: Normalizes images acquired by different devices to 0.25 mm / pixel; Image cropping unit: Crops the image to 224×224 pixels; Channel stacking unit: Select 3 slices that span the central region of the lesion and stack them into 3 channels.

[0069] (3) Text encoding module: Text acquisition unit: collects textual descriptions of the radiological characteristics of DC and PC from medical textbooks; Text prompt generation unit: generates 9 class-level text prompts for each category; BERT encoder: employs the bert-base-chinese model to encode text into 256-dimensional features.

(4) Visual encoding module: ResNet50 encoder: employs a ResNet50 model pre-trained on ImageNet; Fully connected layer: projects 2048-dimensional features to 256 dimensions.

(5) Multimodal fusion module: Feature concatenation unit: concatenates image features and text features into a 512-dimensional vector; Fusion network: two-layer fully connected network (512→256→128).

(6) Few-shot learning classification module: Episodic training unit: 2-way 15-shot setup; Distance metric unit: based on Euclidean distance; Classification decision unit: based on the nearest-prototype principle.

(7) Test-time augmentation module: Geometric flip unit: horizontal flip + vertical flip; Multiple augmentation unit: the augmentation is applied independently three times.

(8) Explainable AI analysis module: Grad-CAM++ computational unit: computes gradient-weighted class activation maps; Heatmap generation unit: generates importance heatmaps; Visualization unit: overlays heatmaps onto the original image.

(9) Cluster analysis module: t-SNE dimensionality reduction unit: reduces 128-dimensional features to 2 dimensions; Visualization unit: generates a 2D map.

(10) Output module: outputs the results, including classification results, confidence scores, heatmaps, and cluster diagrams.

[0070] Example 2: Method for differentiating between dentigerous cysts and periapical cysts in children. The method flow of this invention corresponds one-to-one with the functions of the modules in the above system. The method may include the following steps: S1: data acquisition; S2: image preprocessing; S3: text encoding; S4: visual feature extraction; S5: multimodal fusion; S6: few-shot learning classification; S7: test-time augmentation; S8: interpretable AI analysis; S9: cluster analysis; S10: output results.

[0071] In step S1, CBCT images of the pediatric patient are acquired.

[0072] In step S2, the CBCT images are standardized.

[0073] In one specific embodiment, the image preprocessing module performs this step. This step can be further refined as follows: First, a resolution normalization unit uses an interpolation algorithm to normalize the voxel resolution of all CBCT images to 0.25 mm / pixel to eliminate device differences. Next, an image cropping unit crops or scales the image to 224×224 pixels to match the input size requirements of the subsequent visual encoder model. Finally, to introduce three-dimensional spatial context information into the two-dimensional image, a channel stacking unit uses an axial slice passing through the center of the lesion as a reference, acquires this reference slice and its two adjacent upper and lower consecutive slices, and places these three grayscale slices into the three channels of the RGB image, stacking them into a three-channel pseudo-color image. The advantage of this approach is that, without using a complex 3D model, the model can still perceive the local changes of the lesion in the Z-axis direction, improving information utilization.
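The cropping and channel-stacking steps can be sketched as follows. Resolution normalization by interpolation is omitted, and the function names, the center-crop policy, and the toy volume are assumptions made for illustration; the source specifies only the target size (224×224) and the use of the reference slice with its two neighbours.

```python
def center_crop(slice_2d, size):
    """Crop a square of side `size` from the middle of a 2D slice (list of rows)."""
    h, w = len(slice_2d), len(slice_2d[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in slice_2d[top:top + size]]

def stack_adjacent_slices(volume, center_index, size):
    """Build a size x size x 3 image from the reference slice and its two neighbours."""
    indices = [center_index - 1, center_index, center_index + 1]
    if indices[0] < 0 or indices[-1] >= len(volume):
        raise ValueError("reference slice needs a neighbour on each side")
    crops = [center_crop(volume[i], size) for i in indices]
    # Channel c of each pixel holds the value from slice indices[c].
    return [[[crops[c][y][x] for c in range(3)] for x in range(size)]
            for y in range(size)]

# Toy volume: 5 axial slices of 6x6 pixels, each filled with its own slice index.
volume = [[[float(z)] * 6 for _ in range(6)] for z in range(5)]
image = stack_adjacent_slices(volume, center_index=2, size=4)
print(len(image), len(image[0]), image[0][0])
```

Each pixel of the stacked image carries intensities from three neighbouring depths, which is how a 2D encoder sees some Z-axis context.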

[0074] In step S3, text encoding is performed.

[0075] This step can be further refined as follows: First, a text acquisition unit collects and organizes text descriptions of the typical radiographic features of dentigerous cysts and periapical cysts from authoritative medical textbooks, clinical guidelines, or professional hospital websites. Then, multiple class-level text prompts are generated for each category (DC and PC). For example, the prompts generated for dentigerous cysts could be "The cyst typically surrounds the crown of an unerupted tooth" or "The cyst has well-defined borders and is round or oval." Preferably, these text prompts describe key imaging features such as the location, shape, boundaries, and internal density of the lesion, or its relationship with adjacent structures. Finally, a text encoder, for example a pre-trained BERT (bert-base-chinese) model, encodes each of these text prompts into a 256-dimensional text feature vector.

[0076] In step S4, visual features are extracted.

[0077] The visual encoding module, which performs this step, includes a visual encoder and a feature projection layer. The visual encoder can employ a convolutional neural network pre-trained on a large image dataset (such as ImageNet), for example, the ResNet50 model. The ResNet50 model receives a pre-processed 224×224×3 three-channel image and outputs a 2048-dimensional visual feature. Since this dimension does not match the text feature dimension (256 dimensions), a feature projection layer (e.g., a fully connected layer) linearly transforms the 2048-dimensional feature into 256 dimensions for subsequent fusion.

[0078] In step S5, multimodal fusion is performed to generate class prototypes.

[0079] This process occurs during the model training or prototype building phase, using a small sample dataset called the "support set." For each sample in the support set, a feature concatenation unit concatenates the sample's 256-dimensional visual feature vector (from ResNet50 302) with the 256-dimensional text feature vector of its class (from BERT 305), generating a 512-dimensional joint feature vector. This 512-dimensional vector is then input into a fusion network 308, which can consist of two fully connected layers, such as a fully connected network with a 512-dimensional input, a 256-dimensional intermediate layer, and a 128-dimensional output. The 128-dimensional output of this network is defined as the multimodal feature of the sample. Finally, the multimodal features of all samples belonging to the same class in the support set are averaged, resulting in the final multimodal class prototype 309 for that class. For example, averaging the multimodal features of all dentigerous cyst samples yields the "dentigerous cyst multimodal class prototype."
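A toy-sized sketch of this prototype construction is given below: concatenate, pass through two fully connected layers, then average per class. The ReLU between the layers, the parameter values, and the reduced dimensions (4→2→2 instead of 512→256→128) are assumptions chosen so the arithmetic can be followed by hand.

```python
def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def fuse(visual, text, params):
    """Concatenate both modalities and map them through the two-layer fusion network."""
    joint = visual + text  # list concatenation = feature concatenation
    hidden = relu(linear(joint, params["w1"], params["b1"]))
    return linear(hidden, params["w2"], params["b2"])

def class_prototype(support_visuals, text_feature, params):
    """Average the fused features of all support samples of one class."""
    fused = [fuse(v, text_feature, params) for v in support_visuals]
    n = len(fused)
    return [sum(f[j] for f in fused) / n for j in range(len(fused[0]))]

# Hypothetical fixed parameters; in practice these would be learned.
params = {
    "w1": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[1.0, 0.0], [0.0, 1.0]], "b2": [0.0, 0.0],
}
support_visuals = [[1.0, 2.0], [3.0, 0.0]]  # two support samples of one class
text_feature = [0.5, 0.5]                   # text feature of that class
proto = class_prototype(support_visuals, text_feature, params)
print(proto)
```

With these toy values the two fused features are [1.5, 2.5] and [3.5, 0.5], so the prototype is their mean.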

[0080] In step S6, few-shot learning classification is performed.

[0081] This step employs a prototype network for classification, and its training process follows a meta-learning paradigm. Specifically, a meta-learning unit constructs multiple meta-tasks (also called episodes), each simulating a few-shot learning scenario and containing a support set (N-way K-shot, e.g., 2 classes with 15 samples per class) and a query set. The model's goal is to construct class prototypes from the support set and accurately classify the samples in the query set. During the inference phase, when a query sample to be classified is input, the system first extracts its 256-dimensional visual features through the visual encoding module. Then, a distance metric unit calculates the Euclidean distance between the features of the query sample and the two pre-calculated multimodal class prototypes 309 (the dentigerous cyst prototype and the periapical cyst prototype). Based on the two calculated distances, a classification decision unit assigns the query sample to the category of the class prototype with the smallest distance. For example, if the distance between the query sample and the dentigerous cyst prototype is less than its distance to the periapical cyst prototype, the classification result is "dentigerous cyst".
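Episode construction for a 2-way 15-shot task might look like the sketch below. The data layout (samples grouped by class and patient), the rule of drawing one sample per selected patient, and all names are assumptions; drawing support and query items from disjoint patients is one simple way to keep any patient from appearing in both sets.

```python
import random

def sample_episode(samples_by_class, k_shot, n_query, rng):
    """Draw one episode; support and query use disjoint patients within each class.

    samples_by_class: {label: {patient_id: [sample_id, ...]}} (hypothetical layout).
    """
    support, query = [], []
    for label, by_patient in samples_by_class.items():
        patients = sorted(by_patient)
        if len(patients) < k_shot + n_query:
            raise ValueError(f"not enough patients for class {label}")
        chosen = rng.sample(patients, k_shot + n_query)
        for i, pid in enumerate(chosen):
            item = (label, pid, rng.choice(by_patient[pid]))
            (support if i < k_shot else query).append(item)
    return support, query

# Hypothetical toy dataset: 40 patients per class, 2 slices per patient.
data = {
    label: {f"{label}-{i}": [f"{label}-{i}-slice{j}" for j in range(2)]
            for i in range(40)}
    for label in ("DC", "PC")
}
support, query = sample_episode(data, k_shot=15, n_query=15, rng=random.Random(0))
print(len(support), len(query))
```

Repeating this sampling many times yields the sequence of meta-tasks on which the prototype network is trained.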

[0082] In step S7, test-time augmentation is performed: the geometric flip unit applies horizontal and vertical flipping; the multiple augmentation unit applies the augmentation independently three times.

[0083] In step S8, interpretable AI analysis is performed: The gradient-weighted class activation mapping is computed by the Grad-CAM++ computation unit; the heatmap generation unit generates an importance heatmap; and the visualization unit overlays the heatmap onto the original image.

[0084] In step S9, cluster analysis is performed: the t-SNE dimensionality reduction unit reduces the 128-dimensional features to 2 dimensions; the visualization unit generates a 2D map.

[0085] In step S10, the output results include classification results, confidence scores, heatmaps, and cluster diagrams.

[0086] The output module performs this step to display the classification results to the user, and may optionally display the analysis results generated by the interpretability module and the clustering analysis module, such as CBCT images with overlaid saliency heatmaps and feature space cluster maps.

[0087] The key algorithms for the above steps are described below: (1) Image comparison algorithm In step S8, the Grad-CAM++ algorithm specifically performs the following operations: First, the feature maps of the final convolutional layer and the gradient of the model output relative to these feature maps are obtained. Based on the gradient information, the weight coefficients of each feature map are calculated to measure the importance of each feature map for predicting the target class. Then, the feature maps and their corresponding weights are weighted and combined to generate a preliminary class activation map. The ReLU function is applied to this activation map to retain only regions that positively influence the target class and suppress negative contributions. Next, the resulting low-resolution heatmap is upsampled to the size of the original input image to align it with the original image space. Finally, the upsampled heatmap is overlaid on the original CBCT image to highlight lesion areas that play a crucial role in the model's decision-making, thereby enabling a visual interpretation of the classification results.
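The weighting, ReLU, and upsampling steps above can be sketched on precomputed feature maps and gradients. The closed-form pixel weights used here follow the formulation commonly used in Grad-CAM++ implementations (which assumes the class score is passed through an exponential), and nearest-neighbour upsampling replaces whatever interpolation the actual system uses; both are assumptions, as are all names and toy values. The final overlay on the CBCT image is not shown.

```python
def grad_cam_pp(activations, gradients, scale):
    """Grad-CAM++-style heatmap from K feature maps and their gradients (H x W lists)."""
    h, w = len(activations[0]), len(activations[0][0])
    weights = []
    for a_map, g_map in zip(activations, gradients):
        a_sum = sum(sum(row) for row in a_map)
        weight = 0.0
        for i in range(h):
            for j in range(w):
                g = g_map[i][j]
                denom = 2 * g ** 2 + a_sum * g ** 3
                alpha = g ** 2 / denom if denom != 0 else 0.0  # pixel-wise coefficient
                weight += alpha * max(g, 0.0)                  # only positive gradients
        weights.append(weight)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = [[max(0.0, sum(wk * a[i][j] for wk, a in zip(weights, activations)))
            for j in range(w)] for i in range(h)]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]  # normalize to [0, 1]
    # Nearest-neighbour upsampling by an integer factor.
    return [[cam[i // scale][j // scale] for j in range(w * scale)]
            for i in range(h * scale)]

# Hypothetical 2x2 feature maps and gradients for two channels.
activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
gradients = [[[1.0, 0.0], [0.0, 0.0]], [[-1.0, 0.0], [0.0, -1.0]]]
heat = grad_cam_pp(activations, gradients, scale=2)
for row in heat:
    print(row)
```

In this toy case only the first channel has positive gradients, so the heatmap lights up the top-left region where that channel is active.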

[0088] (2) Prototype loss function In step S5, the prototype loss function is calculated as follows: First, the Euclidean distance between the visual features of the query sample and the multimodal class prototype of each category is calculated to obtain the distance metric; then, these distance values are converted into the probability distribution of the query sample belonging to each category through the Softmax function; based on the true category label, the negative log-likelihood loss of the probability distribution is calculated as the classification loss of the current query sample; finally, the loss gradient is backpropagated to the entire network through the backpropagation algorithm to update the model parameters, so that the features of samples of the same class are closer to their class prototypes in the feature space, while the features of samples of different classes are farther away from other class prototypes.
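The forward computation of this loss can be sketched as follows: a softmax over negative Euclidean distances followed by the negative log-likelihood of the true class. Backpropagation is left to a deep learning framework and is not shown. The function name and toy vectors are assumptions, and the plain (not squared) Euclidean distance follows the wording of the text.

```python
import math

def prototype_loss(query, prototypes, true_index):
    """Return (loss, class probabilities) for one query sample."""
    dists = [math.dist(query, p) for p in prototypes]
    logits = [-d for d in dists]          # closer prototype -> larger logit
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[true_index]), probs

# Hypothetical 2-dimensional prototypes for the two classes.
prototypes = [[0.0, 0.0], [2.0, 0.0]]
loss_mid, probs_mid = prototype_loss([1.0, 0.0], prototypes, true_index=0)
loss_near, probs_near = prototype_loss([0.5, 0.0], prototypes, true_index=0)
print(loss_mid, probs_mid)
print(loss_near, probs_near)
```

A query equidistant from both prototypes yields probabilities of 0.5 each; moving it toward the correct prototype lowers the loss, which is the behavior the training objective rewards.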

[0089] (3) Multimodal fusion algorithm In step S5, the multimodal fusion algorithm is specifically implemented as follows: In the vision branch, deep features are extracted from the support image using a pre-trained ResNet50, and then mapped to a 256-dimensional image feature vector through a fully connected layer. In the text branch, multiple pre-set text prompts for each category (such as sentences describing the location, shape, and density of lesions) are encoded into 768-dimensional features by a pre-trained BERT (bert-base-chinese), reduced to 256 dimensions through a fully connected layer, and multiple text features of the same category are aggregated (e.g., averaged) to obtain a 256-dimensional text prototype for that category. Subsequently, the image features and text prototypes are concatenated to form a 512-dimensional joint feature vector. This vector is then input into a fusion network (consisting of two fully connected layers, with dimensions transformed from 512→256→128), and after non-linear mapping, the final 128-dimensional multimodal class prototype is obtained. For the query sample to be classified, its visual features are also extracted using ResNet50, but when input into the fusion network, the corresponding text features are replaced with zero vectors. That is, only visual information is used to generate the feature representation of the query sample through the fusion network, and then the distance metric is used with each type of prototype to complete the classification.
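Two details of this algorithm, the aggregation of prompt features into a text prototype and the zero-vector substitution for query samples, can be sketched in a few lines. The function names and the 2-dimensional toy features are assumptions; the text uses 256-dimensional features on each branch.

```python
def mean_vector(vectors):
    """Element-wise mean, e.g. of the projected features of one class's text prompts."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def support_joint(visual, text_prototype):
    """Joint vector for a support sample: image feature followed by its class text prototype."""
    return visual + text_prototype

def query_joint(visual, text_dim):
    """Joint vector for a query sample: the text part is replaced by zeros."""
    return visual + [0.0] * text_dim

# Hypothetical projected features of two prompts for one class.
text_prototype = mean_vector([[1.0, 3.0], [3.0, 5.0]])
print(support_joint([0.2, 0.4], text_prototype))
print(query_joint([0.2, 0.4], text_dim=2))
```

Both joint vectors have the same length, so the same fusion network can process support and query samples, and the resulting features can be compared in one space.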

[0090] Example 3: Data were collected from pediatric patients at Peking University School of Stomatology and Beijing Children's Hospital. CBCT image data of a total of 457 pediatric patients (<18 years old) were collected, including 282 patients with DC and 175 patients with PC.

[0091] The method for distinguishing between dentigerous cysts and periapical cysts in children according to the present invention includes the following steps: S1: Dataset Construction: CBCT image data of the above 457 pediatric patients were used.

[0092] S2: Image preprocessing: CBCT images acquired by different devices were normalized to a resolution of 0.25 mm / pixel; the images were cropped to 224×224 pixels; three consecutive slices spanning the central region of the lesion were selected and stacked into a 3-channel image; no metal artifact denoising was performed, in order to test the model's robustness.

S3: Dataset partitioning: The dataset was randomly partitioned at the patient level to avoid data leakage; training set: 80% of patients (365 cases); test set: 20% of patients (92 cases).

S4: Text prompt generation: Radiological descriptions of DC and PC were collected from medical textbooks and professional medical websites; nine class-level text prompts were generated for each category (dentigerous cysts and periapical cysts); the text prompts described the location, shape, density pattern, and other features of the lesion; generation was assisted by Claude Sonnet 4.0 and verified by professional dentists.

S5: Model training: The episodic meta-learning paradigm was used to train for 2000 episodes. Each episode contained a support set of 15 samples per class (15×2, 30 samples in total) and a query set of 15-20 samples per class (30-40 samples in total). Patients were strictly separated so that no patient appeared in both the support set and the query set. The AdamW optimizer was used with a learning rate of 1×10⁻⁴ and a weight decay of 1×10⁻⁴. The prototype loss function was calculated based on the Euclidean distance between the query features and the class prototypes.

S6: Data augmentation (on the fly): Online data augmentation was applied, including random horizontal flipping (probability 0.5), random rotation (±10°), and color jitter (brightness and contrast adjustment of 0.2).
S7: Test-time augmentation: A geometric flipping strategy (horizontal flipping + vertical flipping) was applied to the query samples; the augmentation was applied independently three times; the multiple prediction results were integrated to improve stability.

S8: Explainable AI analysis: Heatmaps were generated using the Grad-CAM++ method; the gradient of the model output with respect to the feature maps of the final convolutional layer was calculated; a heatmap highlighting the regions that contribute most to the prediction was generated; the heatmap was visualized by overlaying it onto the original CBCT image.

S9: Cluster analysis: t-SNE was used to reduce the 128-dimensional features to 2 dimensions; 2D visualizations of DC and PC samples were generated; the degree of separation between the two types of samples in the feature space was evaluated.

S10: Output: The classification result (dentigerous cyst or periapical cyst), the confidence score, the Grad-CAM++ heatmap visualization, and the t-SNE clustering visualization were output.

Based on the above data, the model performance was further validated, with the following results:

1. Comparison of benchmark models: To select the optimal backbone network for visual feature extraction, the classification performance of three mainstream convolutional neural networks (ResNet50, InceptionV3, and DenseNet121) was compared under the same few-shot learning settings. ResNet50 achieved the highest average accuracy, 87.47% ± 0.02%, outperforming InceptionV3 (84.6% ± 0.07%) and DenseNet121 (82.6% ± 0.10%), and also exhibited the smallest standard deviation, indicating the most stable performance. Therefore, ResNet50 was used as the backbone network of the visual encoder in subsequent experiments.

[0093] 2. Verification of the text guidance effect: To verify the effect of textual prior knowledge on classification performance, a baseline model using only image features was compared with the multimodal model fused with text guidance. The image-only baseline achieved an accuracy of 84.90% ± 0.13%, while the text-guided fusion model reached 87.47% ± 0.02%, an improvement of 2.57 percentage points. This indicates that BERT-encoded radiological feature description text can supplement the limited visual information available with small sample sizes and enhance the discriminative ability of the class prototypes.

[0094] 3. Influence of support set size: The experiment tested the impact of three support set sizes (5-shot, 10-shot, and 15-shot) on classification accuracy in a 2-way configuration. As the number of support samples increased, the accuracy improved: 78.05% ± 1.36% for 5-shot, 85.25% ± 0.69% for 10-shot, and 87.47% ± 0.02% for 15-shot, while the standard deviation decreased. This indicates that more support samples help obtain a more stable and accurate class prototype; therefore, 15-shot was selected as the final configuration.

[0095] 4. Test-time augmentation performance: To improve the robustness of the model during the inference phase, different test-time augmentation (TTA) strategies were compared, including rotation (±5°), horizontal and vertical flipping, and multi-scale transformation. The baseline accuracy (without TTA) was 87.47% ± 0.02%. Horizontal and vertical flipping improved the accuracy to 88.96% ± 0.14%, making it the best strategy; with rotation-based augmentation the accuracy decreased slightly to 87.28% ± 0.08%, and with multi-scale augmentation it decreased to 85.68% ± 0.11%. This indicates that horizontal and vertical flipping improves generalization, while rotation and multi-scale transformations may introduce changes inconsistent with lesion characteristics, leading to performance degradation.

[0096] 5. Confusion matrix analysis: The model performance was evaluated on a balanced query set (20 samples per class, 40 in total), achieving an overall accuracy of 90.0%. The recall for dentigerous cysts (DC) was 95.0%, and the specificity was 85.0%; the recall for periapical cysts (PC) was 85.0%, and the specificity was 95.0%. This indicates that the model is more sensitive to DC, while its PC predictions are more specific. The area under the ROC curve (AUC) was 0.887, demonstrating that the model has good overall discriminative ability. The results are shown in Figure 2.
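The reported rates can be cross-checked with a small calculation. The counts below are not stated in the text; they are inferred from the reported percentages and the class size of 20 (95.0% of 20 DC samples = 19 correct, 85.0% of 20 PC samples = 17 correct). The function name is an assumption.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics for one class treated as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Inferred counts: 19/20 DC and 17/20 PC samples classified correctly.
dc = binary_metrics(tp=19, fn=1, fp=3, tn=17)  # DC as the positive class
pc = binary_metrics(tp=17, fn=3, fp=1, tn=19)  # PC as the positive class
print(dc)
print(pc)
```

The same counts reproduce the overall accuracy of 90.0% (36 of 40 correct), and the recall of one class equals the specificity of the other, as expected in a two-class problem.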

[0097] 6. Interpretable AI results: The Grad-CAM++ algorithm was used to generate classification decision heatmaps, which were then overlaid on the original CBCT images. As shown in Figures 3 and 4, the highlighted areas on the heatmaps of DC and PC cases are highly consistent with the clinically labeled lesion areas, demonstrating that the model can identify typical lesion areas in DC and PC. Even under low signal contrast or complex background interference, the heatmaps still focus on the lesions, supporting the reliability and interpretability of the model's decisions.

[0098] 7. Cluster analysis results: The high-dimensional features were reduced to 2 dimensions using the t-SNE algorithm, and scatter plots of DC and PC samples were drawn. As shown in Figure 5, the DC and PC cases exhibit a clear separation trend: DCs mainly cluster in the upper left region, while PCs are concentrated in the lower right region. In the balanced query set, 36 / 40 samples were correctly classified (90% accuracy), while in the imbalanced query set, 34 / 40 samples were correctly classified (85%). The visualization results further confirm that the multimodal class prototypes can distinguish between the two types of cysts, and that the feature space separation is consistent with the classification accuracy.

[0099] This invention effectively addresses the problem of insufficient training data in pediatrics due to the scarcity of cases by constructing a multimodal few-shot learning framework that integrates image and text information. It achieves high-precision classification of odontogenic cysts in children with only a small number of samples (e.g., an accuracy of 88.96% in a 2-way 15-shot setting). More importantly, by integrating an interpretability module, this invention can generate visual saliency heatmaps, intuitively displaying the key lesion regions that the model focuses on when making diagnostic decisions. This solves the "black box" problem of AI models in medical applications, greatly enhancing clinicians' trust and acceptance of AI diagnostic results. Furthermore, this invention eliminates the need for doctors to manually delineate regions of interest, simplifying clinical workflows and demonstrating strong practical value and promising prospects for wider application.

[0100] This invention also provides an electronic device, which may be a server, a personal computer (PC), a tablet computer, or a smartphone. The electronic device includes a processor and a memory. The memory stores a computer program. When the processor executes the computer program stored in the memory, it implements the method described in any of the above embodiments for distinguishing between dentigerous cysts and periapical cysts in children. For example, by executing the program, the processor controls the device to complete the entire process, from acquiring CBCT images, through multimodal feature extraction and fusion and few-shot classification, to finally outputting a diagnostic result with a saliency heatmap.

[0101] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A system for differentiating between a child's dentigerous cyst and periapical cyst, comprising: a data acquisition module configured to acquire a CBCT image of a child patient; an image preprocessing module configured to standardize the CBCT image and construct a multi-channel input image containing a three-dimensional spatial context; a text encoding module configured to acquire radiological feature description texts about odontogenic cysts and periapical cysts and encode the texts into feature vectors; a visual encoding module configured to extract visual features of the preprocessed CBCT image; a multi-modal fusion module configured to fuse the visual features and the text feature vectors to generate two multi-modal class prototypes corresponding to odontogenic cysts and periapical cysts, respectively; a small sample learning classification module configured to calculate distances between visual features of a sample to be classified and the two multi-modal class prototypes based on Euclidean distance and classify the sample to be classified into a class corresponding to the prototype with the closest distance by using a prototype network; an output module configured to output the obtained classification result.

2. The system of claim 1, wherein, The system further comprises an enhanced inference module configured to enhance the sample to be classified when testing in the inference stage, including performing multiple geometric transformations on each sample to be classified and determining the classification by synthesizing the prediction results after multiple transformations.

3. The system of claim 1, wherein, The system further comprises an explainability module configured to generate visual explanations of the classification results, including: a saliency calculation unit configured to determine the contribution of each region in the input image to the classification result based on the output gradient information; a visual presentation unit configured to generate a saliency heat map according to the contribution and display the heat map and the CBCT image.

4. The system of claim 1, wherein, The system further comprises a clustering analysis module configured to perform visual dimension reduction on a high-dimensional feature space constituted by features of the sample to be classified in the small sample learning classification module.

5. The system of claim 1, wherein, The image preprocessing module comprises: a resolution normalization unit configured to normalize the CBCT image to 0.25 mm / pixel; an image cropping unit configured to crop the image to 224x224 pixels; a channel stacking unit configured to generate a three-channel image, wherein, taking a slice passing through the center area of the lesion as a reference, the reference slice and its two adjacent continuous slices are acquired, and the three slices are stacked into a three-channel image.

6. The system of claim 1, wherein the text encoding module comprises:
a text collection unit configured to collect radiological feature description texts about dentigerous cysts and periapical cysts;
a text prompt generation unit configured to generate multiple class-level text prompts for each class, wherein each text prompt describes the imaging features of the corresponding lesion; preferably, the imaging features include the position, shape, or density pattern of the corresponding lesion; and
a text encoder configured to encode the text prompts into 256-dimensional feature vectors.
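
To illustrate the data flow of claim 6, the sketch below turns several class-level prompts into one 256-dimensional vector per class. The claim does not identify the text encoder or say how multiple prompts are combined; the hashed bag-of-words encoder is a hypothetical stand-in for a pretrained language model, averaging the prompt vectors is an assumption, and the prompt wording is illustrative rather than taken from the patent.

```python
import zlib
import numpy as np

TEXT_DIM = 256

def toy_text_encoder(prompt, dim=TEXT_DIM):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalised."""
    vec = np.zeros(dim)
    for token in prompt.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def encode_class_prompts(prompts_by_class, encoder=toy_text_encoder):
    """Average the encoded prompts of each class into one text feature vector."""
    return {name: np.mean([encoder(p) for p in prompts], axis=0)
            for name, prompts in prompts_by_class.items()}

# Illustrative prompts covering position, shape, and density pattern.
prompts_by_class = {
    "dentigerous cyst": [
        "well-defined radiolucency surrounding the crown of an unerupted tooth",
        "unilocular low-density lesion attached near the cementoenamel junction",
    ],
    "periapical cyst": [
        "round radiolucency at the root apex of a nonvital tooth",
        "well-circumscribed low-density lesion continuous with the root apex",
    ],
}
text_features = encode_class_prompts(prompts_by_class)
print({name: vec.shape for name, vec in text_features.items()})
```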

7. The system of claim 1, wherein the visual encoding module comprises:
a visual encoder; and
a feature projection layer comprising a fully connected layer configured to receive the 2048-dimensional features extracted by the visual encoder and convert them into 256-dimensional features.
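
The feature projection layer of claim 7 is a single fully connected layer, that is, an affine map from 2048 to 256 dimensions. A minimal sketch follows; the class name `ProjectionLayer`, the random initialisation, and the random input batch are hypothetical, and in practice the weights would be learned during training.

```python
import numpy as np

class ProjectionLayer:
    """Fully connected layer mapping 2048-d visual features to 256-d features."""

    def __init__(self, in_dim=2048, out_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained placeholder weights, scaled to keep output magnitudes moderate.
        self.weight = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, features):
        return np.asarray(features, dtype=float) @ self.weight + self.bias

encoder_output = np.random.default_rng(1).normal(size=(4, 2048))  # hypothetical batch
projected = ProjectionLayer()(encoder_output)
print(projected.shape)
```

Projecting to 256 dimensions matches the dimension of the text feature vectors, so the two modalities can be concatenated on an equal footing.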

8. The system of claim 1, wherein the multi-modal fusion module comprises:
a feature concatenation unit configured to concatenate an image feature vector and a text feature vector to generate a joint feature vector, wherein the dimension of the joint feature vector equals the sum of the dimension of the image feature vector and the dimension of the text feature vector; preferably, the image feature vector is 256-dimensional, the text feature vector is 256-dimensional, and the joint feature vector is 512-dimensional; and
a fusion network comprising two fully connected layers configured to map a 512-dimensional input feature to a 128-dimensional output, the 128-dimensional output serving as a multi-modal class prototype;
and/or the few-shot learning classification module, which is configured to perform few-shot learning classification using a prototype network, comprises:
a meta-learning unit configured to construct a plurality of meta-tasks, each meta-task comprising a support set and a query set, the support set comprising a plurality of classes with a first number of samples per class, and the query set comprising the same classes as the support set with a second number of samples per class;
a distance metric unit configured to calculate the Euclidean distance between a query sample feature and each class prototype, wherein each class prototype is determined based on the sample features of the corresponding class in the support set; and
a classification decision unit configured to classify each sample in the query set into the class whose prototype has the smallest distance to the sample.
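
The fusion network and meta-task construction of claim 8 can be sketched together. The claim fixes the 512-dimensional input and 128-dimensional output but not the hidden width or activation, so the 256-unit hidden layer and ReLU are assumptions. Forming the image part of the joint vector as the mean of the support features, and drawing disjoint support and query samples, are likewise assumptions. All names and the random features are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class FusionNetwork:
    """Two fully connected layers mapping 512-d joint features to 128-d prototypes."""

    def __init__(self, in_dim=512, hidden_dim=256, out_dim=128):  # hidden_dim is assumed
        self.w1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, joint):
        hidden = np.maximum(joint @ self.w1 + self.b1, 0.0)  # ReLU is assumed
        return hidden @ self.w2 + self.b2

def multimodal_prototype(support_image_features, text_feature, fusion):
    """Concatenate the mean 256-d support feature with the 256-d text feature, then fuse."""
    joint = np.concatenate([np.mean(support_image_features, axis=0), text_feature])
    return fusion(joint)

def sample_episode(features_by_class, n_support, n_query):
    """Draw disjoint support and query sets for every class (one meta-task)."""
    support, query = {}, {}
    for name, feats in features_by_class.items():
        order = rng.permutation(len(feats))
        support[name] = feats[order[:n_support]]
        query[name] = feats[order[n_support:n_support + n_query]]
    return support, query

classes = ["dentigerous cyst", "periapical cyst"]
features_by_class = {c: rng.normal(size=(10, 256)) for c in classes}  # hypothetical
text_features = {c: rng.normal(size=256) for c in classes}            # hypothetical
fusion = FusionNetwork()
support, query = sample_episode(features_by_class, n_support=5, n_query=3)
prototypes = {c: multimodal_prototype(support[c], text_features[c], fusion) for c in classes}
print({c: p.shape for c, p in prototypes.items()})
```

Here the first number of samples is 5 and the second is 3; training would repeat this sampling over many meta-tasks while updating the network weights.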

9. A method of differentiating between a child's dentigerous cyst and periapical cyst, comprising the following steps:
S1: data acquisition, acquiring a CBCT image of a child patient;
S2: image preprocessing, standardizing the CBCT image and constructing a multi-channel input image containing three-dimensional spatial context;
S3: text encoding, acquiring texts describing radiological features of dentigerous cysts and periapical cysts, and encoding the texts into text feature vectors;
S4: visual feature extraction, extracting visual features of the preprocessed CBCT image;
S5: multi-modal fusion, fusing the visual features and the text feature vectors to generate two multi-modal class prototypes corresponding to dentigerous cysts and periapical cysts, respectively;
S6: few-shot learning classification, using a prototype network to calculate Euclidean distances between the visual features of a sample to be classified and the two multi-modal class prototypes, and classifying the sample into the class corresponding to the prototype with the smallest distance;
S7: output, outputting the obtained classification result.

10. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the method of differentiating between a child's dentigerous cyst and periapical cyst according to claim 9.