A pediatric pneumonia pathogen classification method based on causal representation learning

By employing a causal representation learning-based classification method for pediatric pneumonia pathogens, and utilizing a spatial transformation and adversarial realignment visual Transformer framework, we have addressed the issues of scarce labeled data, cross-modal domain shift, and insufficient spatial robustness of the visual Transformer, achieving accurate, robust, and generalizable classification of pediatric pneumonia pathogens.

CN122244571A · Pending · Publication Date: 2026-06-19 · OCEAN UNIV OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
OCEAN UNIV OF CHINA
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing methods for classifying pediatric pneumonia pathogens suffer from several drawbacks: scarce labeled data, cross-modal domain shift, insufficient spatial robustness of the visual Transformer, and a mismatch between self-supervised pre-training and downstream tasks.

Method used

A pediatric pneumonia pathogen classification method based on causal representation learning is adopted. By constructing a spatial transformation and adversarial realignment visual Transformer framework, the method realizes the preprocessing of multimodal image data, cross-modal shared encoding, decoding and adversarial realignment, and performs three-stage causal representation learning to output classification results.

Benefits of technology

It improves the accuracy and robustness of pediatric pneumonia pathogen classification, reduces the dependence on high-quality labeled data, enhances cross-modal generalization ability and model stability, improves the decoupling ability of disease causal features and confounding factors, and achieves accurate classification under small sample conditions.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of pneumonia pathogen classification technology, and in particular to a method for classifying pediatric pneumonia pathogens based on causal representation learning. The method includes: data preprocessing based on acquired multimodal image data; spatial transformation and cross-modal shared encoding based on the preprocessed data; decoding and adversarial realignment based on the encoded data, including multimodal reconstruction decoding and target domain decoding, cross-modal loss reconstruction, and task coordination modulation based on TCM; and three-stage causal representation learning based on the aligned features to output the classification result. This invention explicitly separates disease-related causal features from confounding variables such as imaging equipment, scanning protocol, and patient position through a three-stage causal representation learning module, reducing the risk of the model learning spurious correlation features.

Description

Technical Field

[0001] This invention relates to the field of pneumonia pathogen classification technology, and in particular to a method for classifying pediatric pneumonia pathogens based on causal representation learning.

Background Technology

[0002] Currently, the gold standard for diagnosing pediatric pneumonia pathogens in clinical practice mainly relies on sputum culture, polymerase chain reaction (PCR), and serological testing. These methods typically suffer from long testing cycles, poor child compliance, high susceptibility to sample quality affecting positive rates, and high resource consumption, making them insufficient to meet the clinical needs for rapid classification and individualized treatment decisions in pediatric pneumonia. Chest CT and chest X-rays are commonly used first-line imaging methods for pediatric pneumonia, revealing lung pathological changes such as ground-glass opacities, consolidation, increased lung texture, and air bronchograms, providing objective imaging evidence for pathogen classification. However, image interpretation is highly dependent on the radiologist's experience; subtle differences in imaging between different pathogens can easily lead to subjective bias, missed diagnoses, misdiagnoses, and insufficient diagnostic efficiency.

[0003] With the development of artificial intelligence technology, convolutional neural networks, visual Transformers, and self-supervised learning methods have been applied to tasks such as pneumonia image classification, lesion identification, and severity assessment. Although existing methods have improved diagnostic efficiency to some extent, they still have many shortcomings in the scenario of accurate pathogen typing for pediatric pneumonia.

[0004] First, high-quality labeled data is scarce. Labeling of pediatric pneumonia pathogens typically relies on invasive examinations or complex laboratory tests, resulting in high labeling costs and a limited number of labeled samples available for model training. Purely supervised models are prone to overfitting and struggle to learn stable pathogen-specific features under small sample conditions. Second, the spatial robustness of the visual Transformer is insufficient. The visual Transformer relies on large-scale training data, has a weak spatial inductive bias, and is sensitive to changes in lesion location, morphology, and scale. However, pediatric pneumonia lesions are often diffusely distributed, variable in location, and highly variable in morphology, leading to insufficient model stability on real-world clinical data. Third, there is a significant cross-modal domain shift between CT and X-ray imaging. CT and X-ray differ considerably in imaging principles, spatial resolution, anatomical structure presentation, and artifact types. Without effective cross-modal feature alignment, models are prone to learning modality-specific artifacts rather than pathology-related features, resulting in insufficient cross-modal, cross-center, and cross-dataset generalization. Fourth, there is a mismatch between the self-supervised pre-training objectives and the downstream pathogen typing task. Conventional self-supervised reconstruction or contrastive learning typically focuses on global semantic or anatomical reconstruction, whereas pediatric pneumonia pathogen typing requires the identification of fine-grained texture, local distribution, lesion morphology, and multimodal consistency features. Directly transferred general pre-trained representations may therefore lack discriminative power. Fifth, disease-related causal features are difficult to decouple from confounding variables. Factors such as different equipment, scanning protocols, patient positions, age differences, and image quality can act as confounding variables. Existing models are prone to using spurious correlation features for classification, which reduces the reliability of clinical application.

[0005] Therefore, there is an urgent need for a classification method and system for pediatric pneumonia pathogens that makes full use of unlabeled multimodal image data, improves the spatial robustness of the visual Transformer, achieves cross-modal alignment between CT and X-ray, and separates disease characteristics from confounding factors through causal representation learning, so as to achieve accurate, robust, and generalizable intelligent typing of pediatric pneumonia pathogens under small sample conditions.

Summary of the Invention

[0006] To address the problems in existing pediatric pneumonia pathogen classification methods, such as scarce labeled data, cross-modal domain shift, insufficient spatial robustness of visual Transformers, mismatch between self-supervised pre-training and downstream fine-grained classification tasks, and difficulty in decoupling disease causal features from imaging confounding factors, this invention provides a pediatric pneumonia pathogen classification method based on causal representation learning. By constructing a classification framework centered on spatial transformation and adversarial realignment of a visual Transformer, accurate assisted classification of pediatric pneumonia pathogens is achieved.

[0007] Firstly, the present invention provides a method for classifying pediatric pneumonia pathogens based on causal representation learning, which adopts the following technical solution:

[0008] A pediatric pneumonia pathogen classification method based on causal representation learning includes:

[0009] Acquire multimodal image data;

[0010] Data preprocessing is performed based on the acquired multimodal image data;

[0011] Spatial transformation and cross-modal shared coding are performed based on the preprocessed data;

[0012] Decoding and adversarial realignment based on encoded data, including multimodal reconstruction decoding and target domain decoding, cross-modal loss reconstruction, and TCM-based task coordination modulation;

[0013] Based on aligned features, a three-stage causal representation learning is performed, and classification results are output.

[0014] Secondly, a pediatric pneumonia pathogen classification system based on causal representation learning includes:

[0015] The data acquisition module is configured to acquire multimodal image data;

[0016] The preprocessing module is configured to perform data preprocessing based on the acquired multimodal image data;

[0017] The encoding module is configured to perform spatial transformation and cross-modal shared encoding based on the preprocessed data;

[0018] The alignment module is configured to perform decoding and adversarial realignment based on the encoded data, including multimodal reconstruction decoding and target domain decoding, cross-modal loss reconstruction, and TCM-based task coordination modulation.

[0019] The classification module is configured to perform three-stage causal representation learning based on aligned features and output classification results.

[0020] Thirdly, the present invention provides a computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the aforementioned method for classifying pediatric pneumonia pathogens based on causal representation learning.

[0021] Fourthly, the present invention provides a terminal device, including a processor and a computer-readable storage medium, wherein the processor is used to implement various instructions, and the computer-readable storage medium is used to store multiple instructions, the instructions being adapted to be loaded by the processor to execute the aforementioned pediatric pneumonia pathogen classification method based on causal representation learning.

[0022] In summary, the present invention has the following beneficial technical effects:

[0023] Compared with the existing classification methods for pediatric pneumonia pathogens, which suffer from technical bottlenecks such as strong dependence on high-quality labeled data, poor cross-modal generalization, insufficient spatial robustness, unstable training process, and insufficient clinical interpretability, this invention has the following beneficial effects.

[0024] First, through spatial transformation consistency constraints and multimodal self-supervised pre-training, this invention enables the model to learn general and robust chest imaging features under unlabeled or minimally labeled conditions, significantly reducing reliance on invasive pathogen detection labels and large-scale manually labeled data. Ablation experiments show that, compared to the ViT baseline without pre-training, the full STAR-ViT model raises the average accuracy from 82.90% to 92.07% and the average AUC from 79.57% to 94.95%, effectively alleviating the problem of overfitting with small sample sizes.

[0025] Secondly, this invention achieves implicit alignment of CT and X-ray feature distributions through an adversarial cross-modal alignment module, solving the domain shift problem caused by differences in imaging principles, spatial resolution, and artifacts between different imaging modalities. The model can share pathological representations between the two mainstream clinical imaging modalities of CT and X-ray, and can complete cross-modal inference without retraining for a single modality, improving its adaptability in primary hospitals and multi-center application scenarios.

[0026] Furthermore, this invention dynamically adjusts the task loss weights for spatial consistency, multimodal reconstruction, and adversarial alignment through the TCM task coordination modulation module, thereby mitigating gradient conflicts in multi-task joint optimization and improving model training stability and convergence efficiency. Experimental results show that the average standard deviation of the accuracy of the complete model's 5-fold cross-validation is only ±0.81%, significantly lower than that of the ordinary ViT model, indicating that this scheme has stronger reproducibility and stability.

[0027] Finally, this invention explicitly separates disease-related causal features from confounding variables such as imaging equipment, scanning protocols, and patient positioning through a three-stage causal representation learning module, reducing the risk of the model learning spurious correlation features. Combined with the KAN classification head and focal loss, this invention can improve the classification performance of small sample classes under imbalanced conditions, providing reliable intelligent auxiliary support for pathogen typing and diagnosis of pediatric pneumonia, rational use of antibiotics, and prognostic assessment.

[0028] In summary, the STAR-ViT model proposed in this invention achieved an average classification accuracy of 92.07%, an average AUC of 94.95%, an average recall of 86.10%, and an average balanced accuracy of 89.99% on a self-built pediatric pneumonia pathogen CT dataset. In external validation on the publicly available ChestXRay2017 pediatric pneumonia dataset, it achieved an average classification accuracy of 97.86%, an average recall of 96.87%, an average specificity of 98.53%, and an average F1 score of 97.23%, demonstrating that this invention has significant advantages in classification accuracy, cross-modal generalization ability, and clinical reliability.

Attached Figure Description

[0029] Figure 1 is a schematic diagram of a pediatric pneumonia pathogen classification method based on causal representation learning, according to Embodiment 1 of the present invention;

[0030] Figure 2 is a detailed structural diagram of the ST-CMFAE shared encoder of Embodiment 1 of the present invention;

[0031] Figure 3 is a diagram of the causal representation learning framework based on continuous learning in Embodiment 1 of the present invention;

[0032] Figure 4 is a comparison chart of STAR-ViT from Embodiment 1 of the present invention with other models in the pediatric pneumonia pathogen classification task;

[0033] Figure 5 is a comparison diagram of the ablation experiment results of the core modules of Embodiment 1 of the present invention;

[0034] Figure 6 is a confusion matrix diagram of different models in the classification task of Embodiment 1 of the present invention;

[0035] Figure 7 shows the validation loss curves of different models in Embodiment 1 of the present invention.

Detailed Implementation

[0036] The present invention will be further described in detail below with reference to the accompanying drawings.

[0037] Terminology Explanation:

STAR-ViT: Spatial Transformation and Adversarial Realignment Visual Transformer, the core network architecture for pediatric pneumonia pathogen classification in this scheme.

ST-CMFAE: Spatial Transformation and Cross-Modal Feature Alignment Shared Encoder, which accepts both 3D CT and 2D X-ray inputs and extracts spatially robust, modality-invariant image features.

DSM: Dimension Selection Module, used to uniformly map 3D CT image blocks and 2D X-ray image blocks to token sequences of the same dimension.

R-decoder: Multimodal Reconstruction Decoder, used to reconstruct masked multimodal images.

TD-decoder: Target Domain Decoder, used to preserve the pathology-specific features of the CT target domain.

TCM: Task Coordination Modulation Module, used to dynamically adjust the multi-task loss weights based on task gradient similarity.

KAN: Kolmogorov-Arnold Network Classification Head, used to replace the traditional fully connected classification layer and output pathogen subtype probabilities.

WGAN-GP: Wasserstein Generative Adversarial Network Loss with Gradient Penalty, used for cross-modal feature alignment between CT and X-ray.

[0038] Embodiment 1

[0039] Referring to Figure 1, this embodiment presents a pediatric pneumonia pathogen classification method based on causal representation learning. Its overall technical framework includes six highly coupled core modules: (1) multimodal image data acquisition and standardization module; (2) spatial transformation and cross-modal shared coding module; (3) multimodal reconstruction, target domain preservation and adversarial realignment module; (4) task coordination modulation module; (5) three-stage causal representation learning and KAN classification module; and (6) system deployment module. The specific scheme is as follows.

[0040] (1) Multimodal image data acquisition and standardization module

[0041] This module forms the data foundation of the entire system. Its core task is to acquire multimodal pediatric chest imaging data and convert 3D CT and 2D X-ray images from different sources, devices, and resolutions into a unified, trainable, and aligned input representation. Let the training sample set be:

[0042] $\mathcal{D} = \{(x_i^{CT},\, x_i^{XR},\, y_i)\}_{i=1}^{N}$,

[0043] where $x_i^{CT}$ denotes the three-dimensional chest CT image of the $i$-th patient, $x_i^{XR}$ denotes the two-dimensional chest X-ray image that corresponds to it or is in the same domain, and $y_i$ denotes the pathogen classification label, which can include normal, mycoplasma pneumonia, bacterial pneumonia, and viral pneumonia.

[0044] 1) Training Data Acquisition. Training data includes a self-built pediatric pneumonia pathogen CT dataset, patient-matched chest X-ray images, publicly available unlabeled chest CT datasets, publicly available unlabeled chest X-ray datasets, and external validation datasets. The self-built pediatric pneumonia pathogen CT dataset includes 3D chest CT images of children aged 1-5 years with pneumonia, covering three pathogen subtypes: bacterial pneumonia, mycoplasma pneumonia, and viral pneumonia. It can be matched with 2D chest X-ray images of the same patients to form CT-X-ray multimodal paired data. The images were double-blindly labeled by pediatric radiologists, with annotations including lesion regions, pathogen subtype labels, and normal control labels. Unlabeled data includes multiple sets of publicly available chest CT and chest X-ray datasets, used for the first stage of multimodal self-supervised pre-training, enabling the model to learn general chest image representations without relying on pathogen labels. External validation data uses the publicly available ChestXRay2017 pediatric pneumonia dataset to verify cross-modal and cross-dataset generalization capabilities.

[0045] 2) Image size and spatial resolution standardization. For 3D chest CT images, data from different acquisition centers are resampled to an isotropic voxel resolution [value missing], and the volume data size is normalized to [size value missing]. For 2D chest X-ray images, the images are uniformly scaled to [size value missing]. This step eliminates interference caused by differences in acquisition equipment, scanning protocols, and spatial spacing during model training.

[0046] 3) Gray-level normalization and noise reduction. For chest CT images, dual-window normalization was used, with the lung window having a width of 1500 HU and a window level of -600 HU, and the mediastinal window having a width of 350 HU and a window level of 40 HU, to highlight the texture and anatomical structure of lung lesions. For chest X-ray images, histogram equalization was used to enhance the contrast of lesions.
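The dual-window normalization above can be sketched as follows. This is a minimal illustration using the window widths and levels stated in the text; the function names, the scaling to [0, 1], and the choice to stack the two windows as separate channels are assumptions, since the text does not say how the two windowed views are combined.

```python
import numpy as np

def window_normalize(hu, level, width):
    """Clip HU values to [level - width/2, level + width/2] and scale to [0, 1]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

def dual_window(hu):
    """Return lung-window and mediastinal-window views stacked on axis 0.

    Stacking as two channels is an assumption made for this sketch.
    """
    lung = window_normalize(hu, level=-600.0, width=1500.0)
    mediastinum = window_normalize(hu, level=40.0, width=350.0)
    return np.stack([lung, mediastinum], axis=0)

# A few representative HU values: air, aerated lung, soft tissue, bone
sample = np.array([-1000.0, -600.0, 40.0, 400.0])
channels = dual_window(sample)
```

Each window maps its own level to 0.5, so the lung window spreads out the low-density range where ground-glass opacities sit, while the mediastinal window resolves soft tissue.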

[0047] 4) Multimodal spatial registration. To address the positional and projection differences between CT and X-ray images, a rigid registration method based on mutual information is adopted to align the lung field regions of the X-ray images with the corresponding maximum intensity projection (MIP) images of the CT images, ensuring that anatomical structures such as lesion centers and lobar boundaries remain relatively consistent across the multimodal images.
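As a rough illustration of the similarity measure behind this registration step, the sketch below projects a CT volume to 2D and estimates mutual information from a joint intensity histogram. It only scores a candidate alignment; the rigid optimization loop itself is not shown. The function names, the bin count, and the projection axis are assumptions.

```python
import numpy as np

def max_intensity_projection(volume, axis=1):
    """Collapse a 3D volume to 2D by taking the maximum along one axis.

    Which axis gives a frontal view depends on the volume orientation;
    axis=1 is an arbitrary choice for this sketch.
    """
    return volume.max(axis=axis)

def mutual_information(a, b, bins=16):
    """Mutual information (in nats) estimated from a joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
ct = rng.random((32, 32, 32))             # synthetic stand-in for a CT volume
mip = max_intensity_projection(ct, axis=1)
mi_aligned = mutual_information(mip, mip)
mi_shifted = mutual_information(mip, np.roll(mip, 5, axis=0))
```

A registration routine would search over rotation and translation parameters for the pose that maximizes this score between the X-ray lung field and the CT projection.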

[0048] 5) Spatial transformation augmentation. Random rotation, translation, scaling, flipping, and affine transformations are applied to each preprocessed original image $x$ to generate a spatially perturbed paired image $\tilde{x}$, which is used for the subsequent spatial consistency constraint:

[0049] $\tilde{x} = \mathcal{T}_{R,t,s}(x)$,

[0050] where $R$ denotes random rotation, $t$ denotes random translation, and $s$ denotes random scaling. Spatial transformation samples can be generated online during training, including random flipping in the sagittal, coronal, and axial planes, as well as random rotation around the X, Y, and Z axes within a set angular range, to improve the model's robustness to changes in lesion location.

[0051] (2) Spatial transformation and cross-modal shared coding module

[0052] This module achieves unified encoding adaptation for 2D / 3D multimodal inputs through the collaborative design of three core units: dimension selection, multi-scale Transformer encoding, and spatial transformation consistency constraints. At the same time, it forces the model to learn the pathological semantic features of lesions rather than non-related information such as location and scale, providing robust, universal, and modally invariant basic feature support for subsequent cross-modal alignment and causal feature decoupling.

[0053] 1) Dimension Selection Module (DSM)

[0054] The dimension selection module is the core adaptation unit of ST-CMFAE to achieve multimodal input compatibility. Its core goal is to map 3D CT and 2D X-ray images, which have completely different dimensions, resolutions, and imaging principles, into token sequences with completely consistent dimensions. This ensures that the two types of modal data can be input into the same Transformer network for joint feature learning, thus solving the problem of multimodal input dimension mismatch at its root.

[0055] For 3D chest CT images, the module first normalizes the preprocessed CT data to a voxel size of 224×224×128, and then uniformly crops it into 14×14×8 non-overlapping 3D image blocks. The voxel size of each image block is fixed at 16×16×16 to ensure that the scale of the image blocks matches the scale of small lesions such as micronodules and ground-glass opacities in children's lungs, avoiding the loss of fine-grained pathological features during the segmentation process. Subsequently, feature encoding is completed through three consecutive layers of 3D convolution operations. The convolution kernel size is 3×3×3, the stride is 2, the padding is 1, and the activation function adopts GELU non-linear activation. Finally, a 512-dimensional feature vector is generated for each 3D image block, and the entire CT image is finally output as a 1568×512-dimensional token sequence.
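The token count in this paragraph can be checked with a short sketch: a 224×224×128 volume cut into non-overlapping 16×16×16 blocks gives 14×14×8 = 1568 blocks, and three stride-2 convolutions with kernel 3 and padding 1 shrink each 16-voxel side to 2. The helper names are illustrative, and how the remaining 2×2×2 map is reduced to a single 512-dimensional vector is not stated in the text, so that step is left out.

```python
import numpy as np

def conv_out(n, kernel=3, stride=2, padding=1):
    """Output length of one convolution along a single axis."""
    return (n + 2 * padding - kernel) // stride + 1

def patchify_3d(volume, p=16):
    """Split an (H, W, D) volume into non-overlapping p x p x p blocks."""
    h, w, d = volume.shape
    gh, gw, gd = h // p, w // p, d // p
    blocks = volume.reshape(gh, p, gw, p, gd, p).transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(gh * gw * gd, p, p, p)

# Index-valued volume so that block contents can be checked against slices
volume = np.arange(224 * 224 * 128, dtype=np.int32).reshape(224, 224, 128)
blocks = patchify_3d(volume)      # shape (1568, 16, 16, 16)

side = 16
for _ in range(3):                # three stacked stride-2 convolutions
    side = conv_out(side)         # 16 -> 8 -> 4 -> 2
```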

[0056] For 2D chest X-ray images, the module normalizes the X-ray image to a size of 224×224 pixels and divides it into 1568 16×16 two-dimensional image blocks with an overlap rate of 20%. This overlapping block design avoids the problem of broken lesion edge features caused by traditional non-overlapping block design, and is particularly suitable for feature extraction of marginal lesions such as thickened texture and patchy shadows along the bronchi in pediatric pneumonia. Subsequently, encoding is completed through three consecutive layers of 2D convolution operations. The convolution kernel size is 3×3, the stride is 2, and the padding is 1. The activation function is also GELU. Finally, a 512-dimensional feature vector is generated for each two-dimensional image block, and the entire X-ray image also outputs a 1568×512-dimensional token sequence.

[0057] After processing by the DSM module, regardless of whether the input is a 3D CT or a 2D X-ray image, it is converted into a token sequence with the same length and dimension. This achieves dimensionality unification of multimodal data in the feature space, laying the foundation for global context feature extraction and cross-modal joint learning in the subsequent Transformer layer.

[0058] 2) Multi-scale Transformer coding layer

[0059] The shared encoder adopts an 8-layer stacked Transformer coding structure. Each coding layer consists of layer normalization, multi-head self-attention mechanism, feedforward network (FFN), residual connection and secondary layer normalization in sequence. The core task is to simultaneously capture the local fine-grained texture features and global anatomical context dependence of pediatric lung images, and solve the core defects of traditional CNN models, such as limited receptive field and difficulty in capturing long-range associations of diffuse lesions.

[0060] The multi-head self-attention mechanism of the encoding layer is set to 8-head parallel attention with a head dimension of 64. It can simultaneously capture the correlation features between lesions and normal lung tissue, lesions and lung lobe anatomy, and multimodal images from different feature subspaces. Compared with single-head attention, it is more suitable for the clinical characteristics of pediatric pneumonia, which is characterized by multiple lesions, multiple scales, and diffuse distribution. During the attention calculation process, the module also incorporates prior constraints of lung anatomy, adaptively enhancing the feature attention weights of lung field regions and suppressing the weights of extra-lung background regions to avoid interference from irrelevant background features on the learning of pathological features.

[0061] The feedforward network employs a two-layer structure: a linear layer followed by GELU activation, then a dropout layer, and finally another linear layer. The hidden layer dimension is 2048, and the dropout rate is set to 0.1. This allows for non-linear transformation of the features output by the attention layer, further enhancing pathological features related to pneumonia pathogen typing and filtering out irrelevant interference features such as imaging artifacts and individual anatomical differences in patients. Each encoding layer is equipped with two residual connections, corresponding to the multi-head self-attention module and the feedforward network module, respectively. This effectively alleviates the gradient vanishing problem during the training of deep Transformer networks, improving the stability and convergence efficiency of model training.
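A single encoder layer with the stated sizes (8 heads of dimension 64, model width 512, feedforward hidden width 2048) might look like the sketch below. It is written in plain NumPy for readability and makes several assumptions: normalization is applied before each sub-block, learnable layer-norm parameters and attention biases are omitted, dropout is skipped as it would be at inference, and the anatomical prior on attention weights described above is not modeled. All names are illustrative.

```python
import numpy as np

D_MODEL, N_HEADS, D_HEAD, D_FFN = 512, 8, 64, 2048

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv, wo):
    n = x.shape[0]
    # Project, then split the 512 channels into 8 heads of 64
    q, k, v = [(x @ w).reshape(n, N_HEADS, D_HEAD).transpose(1, 0, 2)
               for w in (wq, wk, wv)]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D_HEAD))  # (heads, n, n)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, D_MODEL)
    return merged @ wo

def encoder_layer(x, p):
    # First residual connection: around multi-head self-attention
    x = x + self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"])
    # Second residual connection: around linear -> GELU -> (dropout) -> linear
    hidden = gelu(layer_norm(x) @ p["w1"] + p["b1"])   # 512 -> 2048
    return x + hidden @ p["w2"] + p["b2"]              # 2048 -> 512

def init_params(rng):
    def w(*shape):
        return rng.normal(0.0, 0.02, size=shape)
    return {"wq": w(D_MODEL, D_MODEL), "wk": w(D_MODEL, D_MODEL),
            "wv": w(D_MODEL, D_MODEL), "wo": w(D_MODEL, D_MODEL),
            "w1": w(D_MODEL, D_FFN), "b1": np.zeros(D_FFN),
            "w2": w(D_FFN, D_MODEL), "b2": np.zeros(D_MODEL)}

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, D_MODEL))   # 16 tokens here; the text uses 1568
out = encoder_layer(tokens, init_params(rng))
```

The layer preserves the token shape, which is what allows eight such layers to be stacked and the 1568×512 sequence to pass through unchanged in size.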

[0062] After feature extraction through 8 Transformer coding layers, the module finally outputs a high-dimensional shared feature of 1568×512 dimensions. This feature integrates the local lesion texture of the image, the global anatomical context, and shared pathological information between multiple modalities. It can be simultaneously input into the subsequent reconstruction decoder, domain discriminator, and causal representation learning module to achieve joint optimization of multiple tasks.

[0063] 3) Spatial transformation consistency constraint branch

[0064] This branch is a core innovative unit that addresses the lack of spatial robustness in Transformers. Its core objective is to force the encoder to learn pathological features that are insensitive to spatial perturbations and have translation / rotation / scaling invariance, rather than relying on the location, shape, and scale information of lesions for feature encoding. This perfectly adapts to the clinical scenario where pediatric pneumonia lesions have variable locations and large individual differences in morphology.

[0065] During training, the module applies online spatial transformation augmentation to the input original images to generate spatially perturbed paired images. The spatial transformation operations include: random horizontal / vertical flipping in the sagittal, coronal, and axial planes; random rotation within ±30° around the X, Y, and Z axes; random translation within ±10%; random scaling within 0.8-1.2 times; and random affine transformation. The original image and the spatially transformed paired image are simultaneously input into the shared encoder for feature extraction, yielding features of the original image and of the spatially transformed image, respectively.
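One way to draw a random transform within these ranges is sketched below. It samples rotation angles, a translation, a scale, and per-axis flips, and packs them into a 4×4 homogeneous matrix. The flip probability of 0.5, isotropic scaling, translation measured as a fraction of each axis length, and the composition order are all assumptions; resampling the volume with the matrix is left to an image library.

```python
import numpy as np

def rotation_matrix(angles_deg):
    """Compose rotations about the X, Y, and Z axes (applied in that order)."""
    ax, ay, az = np.deg2rad(angles_deg)
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az), np.cos(az), 0],
                   [0, 0, 1]])
    return rz @ ry @ rx

def sample_transform(rng, shape):
    """Draw one random spatial transform within the ranges stated in the text."""
    angles = rng.uniform(-30.0, 30.0, size=3)                   # degrees about X, Y, Z
    shift = rng.uniform(-0.10, 0.10, size=3) * np.array(shape)  # within 10% of each axis
    scale = rng.uniform(0.8, 1.2)                               # isotropic (assumed)
    flips = rng.random(3) < 0.5                                 # per-axis flip (assumed p=0.5)
    matrix = np.eye(4)
    flip_diag = np.diag(np.where(flips, -1.0, 1.0))
    matrix[:3, :3] = scale * (rotation_matrix(angles) @ flip_diag)
    matrix[:3, 3] = shift
    return matrix, {"angles": angles, "shift": shift, "scale": scale, "flips": flips}

matrix, info = sample_transform(np.random.default_rng(0), shape=(224, 224, 128))
```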

[0066] The module uses cosine similarity loss to ensure that the features of the original image and the features of the spatially transformed image remain highly consistent in the feature space. The loss function is defined as follows:

[0067] $\mathcal{L}_{sc} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{f_i \cdot \tilde{f}_i}{\lVert f_i \rVert \, \lVert \tilde{f}_i \rVert}\right)$,

[0068] where $f_i$ is the encoded feature of the $i$-th original image, $\tilde{f}_i$ is the encoded feature of the corresponding spatially transformed image, and $N$ is the number of samples in the batch. The core optimization goal of this loss function is to ensure that the model can stably extract the same pathological semantic features regardless of changes in the location, shape, and scale of lesions in the lungs. This addresses the sensitivity of traditional ViT to spatial perturbations and improves the model's generalization ability across different clinical data sources.
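Assuming the loss is the batch mean of one minus cosine similarity, a common form for this kind of consistency constraint, it can be computed as in the sketch below. The function name and the flattening of each sample's token features into one vector are assumptions.

```python
import numpy as np

def cosine_consistency_loss(f, f_t, eps=1e-8):
    """Mean over the batch of (1 - cosine similarity) between paired features.

    f, f_t: arrays of shape (N, ...) holding features of the original and
    the spatially transformed images; trailing axes are flattened per sample.
    """
    f = f.reshape(f.shape[0], -1)
    f_t = f_t.reshape(f_t.shape[0], -1)
    dot = np.sum(f * f_t, axis=1)
    norms = np.linalg.norm(f, axis=1) * np.linalg.norm(f_t, axis=1) + eps
    return float(np.mean(1.0 - dot / norms))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 512))      # small stand-in for (N, tokens, 512)
loss_same = cosine_consistency_loss(feats, feats)
loss_scaled = cosine_consistency_loss(feats, 3.0 * feats)
loss_opposite = cosine_consistency_loss(feats, -feats)
```

The loss is zero when the two feature sets point in the same direction, regardless of magnitude, and reaches 2 when they point in opposite directions.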

[0069] (3) Multimodal reconstruction, target domain preservation and adversarial realignment module

[0070] This module employs a collaborative design of three sub-units: a multimodal reconstruction decoder, a target domain decoder, and a domain discriminator. It drives the model to learn shared pathological structural features across multiple modalities through a self-supervised mask reconstruction task, preserves the pathological specificity of CT images through target domain decoding, and achieves implicit alignment of CT and X-ray features based on WGAN-GP adversarial game. Ultimately, the encoder learns high-quality features that possess both cross-modal generalization and target domain discriminability, making full use of massive amounts of unlabeled multimodal image data to alleviate the core challenge of scarce labeled data for pediatric pneumonia.

[0071] 1) Multimodal reconstruction decoder (R-decoder)

[0072] The multimodal reconstruction decoder is the core unit driving self-supervised feature learning. Its core task is to force the encoder to learn the essential correlation features between lung anatomy and lesion texture through masked image reconstruction, rather than relying on pixel-level surface information, thus achieving general feature learning without labeled data.

[0073] The decoder employs a transposed convolution + residual block structure that is completely symmetrical to the ST-CMFAE shared encoder. It consists of four consecutive transposed convolutional layers, each with residual connections and batch normalization operations. The activation function is GELU, and the parameters of each transposed convolutional layer perfectly match those of the corresponding convolutional layer in the encoder, ensuring that the compressed feature information during encoding can be fully recovered during decoding. The decoder's input is the multimodal general features output from the ST-CMFAE shared encoder, and the final output is a reconstructed image with the exact same size as the input image. The number of output channels precisely matches the number of input modalities, enabling simultaneous reconstruction of 3D CT and 2D X-ray images.

[0074] During training, the module employs a random block masking strategy to preprocess the input image. The mask block size is fixed at 16×16, consistent with the image block size of the DSM module, and the mask ratio is set to 40%, randomly occluding a portion of the input image. Only the unoccluded image regions are input to the encoder for feature extraction, and then the decoder reconstructs the complete original image, including the masked regions, from the compressed features. This masking reconstruction strategy forces the model to learn the spatial relationships and semantic dependencies between different anatomical regions of the lung, lesions, and normal tissues, rather than simple pixel mapping. This allows the features learned by the encoder to better meet the fine-grained feature requirements of downstream pathogen typing tasks.
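The random block masking step described above can be illustrated with a minimal sketch. The 16×16 block size and the 40% mask ratio come from the text; the 224×224 image size, the function names, and the rounding of the masked-block count are assumptions made only for illustration.

```python
import random

def random_block_mask(height, width, block=16, ratio=0.4, seed=None):
    """Return a grid of 0/1 flags, one per block; 1 means the block is masked."""
    if height % block or width % block:
        raise ValueError("image size must be a multiple of the block size")
    rows, cols = height // block, width // block
    n_blocks = rows * cols
    n_masked = round(n_blocks * ratio)  # assumption: round to the nearest block count
    rng = random.Random(seed)
    masked = set(rng.sample(range(n_blocks), n_masked))
    return [[1 if r * cols + c in masked else 0 for c in range(cols)]
            for r in range(rows)]

def visible_blocks(mask):
    """List (row, col) indices of unmasked blocks, i.e. those fed to the encoder."""
    return [(r, c) for r, row in enumerate(mask)
            for c, flag in enumerate(row) if flag == 0]

# Hypothetical 224x224 input: a 14x14 grid of 16x16 blocks.
mask = random_block_mask(224, 224, seed=0)
encoder_input_blocks = visible_blocks(mask)
```

In this sketch only the blocks returned by `visible_blocks` would be passed to the encoder, while the decoder would be asked to reconstruct all blocks, including the masked ones.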

[0075] 2) Target Domain Decoder (TD-decoder)

[0076] The target domain decoder is the core unit for preventing the feature degradation caused by cross-modal over-alignment. Its core objective is to fully preserve the pathology-specific features of the CT target domain during cross-modal feature alignment, so that fine-grained CT image features that are highly correlated with the classification of pediatric pneumonia pathogens are not sacrificed for the sake of modal alignment.

[0077] The TD-decoder shares the same 3D branch structure as the R-decoder, employing a symmetrical structure of 4 layers of transposed convolutions plus residual blocks. However, it only receives encoder-specific feature inputs corresponding to CT images and outputs only the reconstructed 3D CT image, without participating in the reconstruction of X-ray images. This decoder is trained in parallel with the multimodal reconstruction decoder, with the optimization objective being to minimize the pixel error between the reconstructed image and the original CT image. However, its core function is to provide the encoder with an independent CT feature optimization objective, forcing the encoder to retain the unique, fine-grained pathological features in CT images that are related to pathogen classification, such as the internal density of ground-glass opacities, the boundary features of consolidation, and the fine structure of air bronchograms. These features cannot be clearly presented in 2D X-ray images but are the core radiological basis for distinguishing bacterial, viral, and mycoplasma pneumonia.

[0078] By using the parallel constraints of the TD-decoder, the model can achieve cross-modal alignment between CT and X-ray while avoiding the degradation of key discriminative features in the target domain CT. This perfectly balances the contradiction between cross-modal generalization and target domain discriminability, which is one of the core innovations that distinguishes this approach from existing general cross-modal alignment methods.

[0079] 3) Reconstruction Loss. Both multimodal reconstruction and target domain reconstruction use mean squared error loss:

[0080] $L_{rec} = \frac{1}{N}\sum_{i=1}^{N}\left\| x_i - \hat{x}_i \right\|_2^2$,

[0081] where $x_i$ is the original input image, $\hat{x}_i$ is the reconstructed image output by the decoder, and $N$ is the number of samples or pixels.

[0082] 4) Domain Discriminator. The domain discriminator is a multilayer perceptron with three hidden layers of 1024, 2048, and 512 neurons, respectively. It receives the CT features and the X-ray features output by the shared encoder, and its goal is to predict the modality to which a feature belongs. The optimization objective of the domain discriminator is to distinguish CT and X-ray features as accurately as possible; the optimization objective of the shared encoder is to generate modality-invariant features whose source the domain discriminator cannot easily distinguish.

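[0083] 5) Cross-modal adversarial loss based on WGAN-GP. To address the mode collapse and gradient vanishing problems that easily occur in traditional GAN adversarial training, this scheme uses the Wasserstein generative adversarial network loss with gradient penalty (WGAN-GP) to construct the adversarial optimization objective. Compared with the traditional binary cross-entropy loss, it trains more stably in small-sample medical imaging scenarios, is less prone to gradient vanishing, and achieves more stable cross-modal feature alignment. In the standard WGAN-GP form, the discriminator loss can be expressed as:

[0084] $L_{D} = \mathbb{E}\left[D(f_{X})\right] - \mathbb{E}\left[D(f_{CT})\right] + \lambda_{gp}\,\mathbb{E}\left[\left(\left\|\nabla_{\hat{f}} D(\hat{f})\right\|_2 - 1\right)^2\right]$,

[0085] The encoder adversarial loss can be expressed as:

[0086] $L_{adv} = \mathbb{E}\left[D(f_{CT})\right] - \mathbb{E}\left[D(f_{X})\right]$,

[0087] where $D$ is the domain discriminator, $f_{CT}$ and $f_{X}$ are the CT and X-ray features output by the shared encoder, $\lambda_{gp}$ is the gradient penalty weight, and $\hat{f}$ is the bimodal interpolated feature, defined as:

[0088] $\hat{f} = \epsilon f_{CT} + (1-\epsilon) f_{X}, \quad \epsilon \sim U(0,1)$,

[0089] The overall optimization goal, with $E$ denoting the shared encoder, is:

[0090] $\min_{E}\max_{D}\; \mathbb{E}\left[D(f_{CT})\right] - \mathbb{E}\left[D(f_{X})\right] - \lambda_{gp}\,\mathbb{E}\left[\left(\left\|\nabla_{\hat{f}} D(\hat{f})\right\|_2 - 1\right)^2\right]$,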

[0091] Through the aforementioned minimax game, the model can eliminate the modal domain gap caused by the difference between CT and X-ray imaging principles, and extract more stable modality-invariant pathological features.
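As a hedged illustration of the adversarial objective, the sketch below computes a WGAN-GP style discriminator loss and encoder loss on paired CT and X-ray feature vectors. Several points are assumptions rather than details from the text: the critic is a toy single-unit `tanh` function (standing in for the MLP discriminator) so that its input gradient has a closed form, the sign convention treats the critic as maximizing the score gap between CT and X-ray features, and `lambda_gp=10.0` is only a commonly used default.

```python
import math
import random

def critic(w, b, f):
    """Toy critic D(f) = tanh(w . f + b); a stand-in for the MLP discriminator."""
    return math.tanh(sum(wi * fi for wi, fi in zip(w, f)) + b)

def critic_grad_norm(w, b, f):
    """L2 norm of dD/df, available in closed form for the toy critic."""
    d = critic(w, b, f)
    return abs(1.0 - d * d) * math.sqrt(sum(wi * wi for wi in w))

def wgan_gp_losses(w, b, ct_feats, xray_feats, lambda_gp=10.0, seed=0):
    """Return (discriminator loss, encoder loss) for paired feature lists."""
    rng = random.Random(seed)
    n = len(ct_feats)
    mean_ct = sum(critic(w, b, f) for f in ct_feats) / n
    mean_x = sum(critic(w, b, f) for f in xray_feats) / n
    penalty = 0.0
    for f_ct, f_x in zip(ct_feats, xray_feats):
        eps = rng.random()  # interpolation coefficient in [0, 1)
        f_hat = [eps * a + (1.0 - eps) * c for a, c in zip(f_ct, f_x)]
        penalty += (critic_grad_norm(w, b, f_hat) - 1.0) ** 2
    penalty /= n
    gap = mean_ct - mean_x                  # critic's estimate of the domain gap
    loss_d = -gap + lambda_gp * penalty     # minimized by the discriminator
    loss_enc = gap                          # minimized by the shared encoder
    return loss_d, loss_enc

# Hypothetical 2-dimensional features for two CT / X-ray pairs.
ct = [[1.0, 0.5], [0.8, 0.2]]
xr = [[-0.5, 0.1], [-0.2, -0.4]]
loss_d, loss_enc = wgan_gp_losses([0.6, 0.8], 0.0, ct, xr)
```

In a real implementation the gradient of the discriminator with respect to the interpolated features would be obtained by automatic differentiation rather than in closed form.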

[0092] (4) Task Coordination and Modulation Module (TCM)

[0093] In the first stage of self-supervised pre-training, the spatial consistency constraint, multimodal reconstruction, and cross-modal adversarial alignment tasks are optimized simultaneously, which can easily lead to gradient direction conflicts. To address this, this invention constructs a Task Coordination Modulation (TCM) module, which dynamically updates the loss weight of each task based on the cosine similarity between the task gradients. The first stage uses unlabeled CT and X-ray image data, requires no pathogen labels, and jointly trains the ST-CMFAE shared encoder, multimodal reconstruction decoder, target domain decoder, and domain discriminator. The training batch size can be set to 96, the optimizer is Adam, and the learning rate decays from its initial value under a cosine annealing strategy. During training, an alternating scheme can be used in which the domain discriminator is trained for two steps and the encoder is then trained for one step, to avoid mode collapse. After training, the optimal pre-trained model is saved, and the shared encoder weights are frozen.
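The alternating update scheme and the cosine annealing decay mentioned above can be sketched as follows. The 2:1 ratio of discriminator steps to encoder steps follows the text; the initial learning rate of `1e-4`, the minimum learning rate of zero, and the function names are hypothetical values chosen for illustration.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init, lr_min=0.0):
    """Learning rate at a given step under a cosine annealing decay."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

def alternating_schedule(total_steps, d_steps=2, e_steps=1):
    """Yield which sub-network is updated at each optimization step."""
    cycle = ["discriminator"] * d_steps + ["encoder"] * e_steps
    for step in range(total_steps):
        yield cycle[step % len(cycle)]

# Hypothetical run of 6 steps with an assumed initial learning rate of 1e-4.
schedule = list(alternating_schedule(6))
lrs = [cosine_annealing_lr(s, 6, lr_init=1e-4) for s in range(7)]
```

Each entry of `schedule` tells the training loop whether to apply the discriminator loss or the encoder losses at that step.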

[0094] 1) Task gradient acquisition. Assume there are $K$ tasks in total, with task losses $L_1, \dots, L_K$. The gradient of the $i$-th task loss with respect to the model parameters $\theta$ is:

[0095] $g_i = \nabla_{\theta} L_i$,

[0096] 2) Gradient similarity calculation. The cosine similarity between the gradients of any two tasks is:

[0097] $s_{ij} = \frac{g_i \cdot g_j}{\left\|g_i\right\|_2 \left\|g_j\right\|_2}$,

[0098] When $s_{ij} < 0$, task $i$ and task $j$ conflict in optimization direction.

[0099] 3) Dynamic weight adjustment. TCM dynamically updates task weights based on gradient similarity:

[0100] ,

[0101] where the quantities in the formula are the Sigmoid function, the temperature parameter, the original weight of each task, and the updated weight of that task. This mechanism enhances the gradient contribution of tasks with consistent optimization directions and suppresses the negative impact of conflicting tasks.
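A minimal sketch of gradient-similarity-based weight modulation is given below. The cosine similarity between task gradients and the use of a Sigmoid function with a temperature parameter follow the text. The specific update rule (scaling each weight by the Sigmoid of its mean similarity to the other tasks, then renormalizing so the weights keep their original sum) is a hypothetical form chosen for illustration and is not taken from the source formula.

```python
import math

def cosine_similarity(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm = math.sqrt(sum(a * a for a in g_a)) * math.sqrt(sum(b * b for b in g_b))
    return dot / norm if norm > 0 else 0.0

def tcm_update(weights, grads, tau=1.0):
    """Hypothetical TCM-style update of task weights (requires >= 2 tasks)."""
    k = len(grads)
    gated = []
    for i in range(k):
        sims = [cosine_similarity(grads[i], grads[j]) for j in range(k) if j != i]
        mean_sim = sum(sims) / len(sims)
        gate = 1.0 / (1.0 + math.exp(-mean_sim / tau))  # Sigmoid with temperature
        gated.append(weights[i] * gate)
    scale = sum(weights) / sum(gated)  # assumption: keep the total weight unchanged
    return [w * scale for w in gated]

# Hypothetical gradients: tasks 0 and 1 roughly agree, task 2 points the other way.
grads = [[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]]
new_weights = tcm_update([1.0, 1.0, 1.0], grads)
```

In this example the conflicting task receives the smallest weight, which is the qualitative behavior the text attributes to TCM.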

[0102] 4) Total loss of the first stage of pre-training. After TCM dynamic modulation, the total loss of the first stage of pre-training can be expressed as:

[0103] $L_{pre} = w_{sc} L_{sc} + w_{rec} L_{rec} + w_{adv} L_{adv}$,

[0104] where $w_{sc}$, $w_{rec}$, and $w_{adv}$ are the TCM-modulated weights of the spatial consistency loss $L_{sc}$, the reconstruction loss $L_{rec}$, and the encoder adversarial loss $L_{adv}$, respectively.

[0105] (5) Three-stage causal representation learning and KAN classification module

[0106] This module further decouples disease-related causal features from imaging confounding variables on the basis of the pre-trained features, preventing the model from learning spurious correlations with factors such as equipment, body position, and scanning protocol. It employs a three-stage progressive causal representation learning process and completes pathogen classification with the KAN classification head.

[0107] 1) Phase One: Causal Decoupling Phase. This phase is the foundation of causal representation learning. The core objective is to decouple the general features obtained from first-stage pre-training into two categories: structurally stable variables and task-discriminative variables. Structurally stable variables correspond to general features that are invariant to lung anatomy and modality, and are highly correlated with confounding factors such as imaging equipment and individual patient differences. Task-discriminative variables correspond to causal features directly related to pneumonia pathogen typing and are the core basis for classification decisions. Through this decoupling, the model can initially separate disease causal features from confounding variables, avoiding interference from non-pathological factors in subsequent classification. This phase employs a frozen training strategy: the ST-CMFAE shared encoder weights obtained from first-stage pre-training are completely frozen, and a new auxiliary encoder B is introduced. Encoder B uses a Transformer structure identical to that of the shared encoder; only the parameters of encoder B are trained, while the parameters of the shared encoder remain fixed throughout. The core task of encoder B is to separate the structurally stable variables from the general features output by the fixed shared encoder, achieving initial decoupling of causal variables. The optimization objective of this phase is a weighted sum of three loss functions, with the total loss formula as follows:

[0108] ,

[0109] where the two weighting coefficients balance the loss terms, and the three terms are the distillation loss, the contrastive loss, and the structural consistency loss. The detailed design and physical meaning of each loss function are as follows:

[0110] The core function of knowledge distillation is to ensure that encoder B fully inherits the structurally stable features learned by the shared encoder, avoiding catastrophic forgetting and preserving the modal invariance and spatial robustness of the features. The loss function uses mean squared error loss, calculated as follows:

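[0111] $L_{kd} = \frac{1}{N}\sum_{i=1}^{N}\left\| F_i^{S} - F_i^{B} \right\|_2^2$,

[0112] where $F_i^{S}$ denotes the features output by the frozen shared encoder, $F_i^{B}$ denotes the features output by encoder B, and $N$ is the number of samples. Minimizing this loss forces the feature space of encoder B to align with that of the shared encoder, thus fully preserving the general structural features learned during the pre-training stage.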

[0113] The core function of the contrastive loss is to drive encoder B to extract task-specific discriminative features related to pathogen typing, achieving separation of discriminative and structural features. This scheme employs a patient-level contrastive learning strategy, treating CT and X-ray multimodal images of the same patient as positive sample pairs and images from different patients as negative sample pairs. Feature distance is optimized using the InfoNCE loss, calculated as follows:

[0114] $L_{con} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(z_i^{CT}, z_i^{X})/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(z_i^{CT}, z_j^{X})/\tau\right)}$,

[0115] where $z_i^{CT}$ and $z_i^{X}$ are the features of the different modalities of the $i$-th patient, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, $\tau$ is the temperature parameter, set to 0.07 by default, and $N$ is the number of patients in a batch. By minimizing this loss, the model brings the multimodal features of the same patient closer together in the feature space and pushes the features of different patients further apart, thereby extracting task-discriminative features that are highly correlated with the patient's pathological state and separating them from general structural features.
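The patient-level contrastive strategy can be illustrated with a small InfoNCE sketch. The positive and negative pair definition, the cosine similarity, and the temperature of 0.07 follow the text; the one-directional form (CT features as queries, X-ray features as keys), the toy 2-dimensional features, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def info_nce(ct_feats, xray_feats, tau=0.07):
    """Mean InfoNCE loss; index i in both lists belongs to the same patient."""
    n = len(ct_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(ct_feats[i], xray_feats[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Hypothetical features for two patients.
ct = [[1.0, 0.0], [0.0, 1.0]]
xr_aligned = [[0.9, 0.1], [0.1, 0.9]]
xr_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce(ct, xr_aligned)
loss_swapped = info_nce(ct, xr_swapped)
```

When the CT and X-ray features of the same patient are close, the loss is near zero; when the pairing is swapped, the loss becomes large, which is the gradient signal that pulls positive pairs together.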

[0116] The core function of the structural consistency loss is to further enhance encoder B's learning of modality-invariant structural features, encourage the model to filter out modality-specific artifacts, and improve the cross-modal generalization ability of the features. The loss function is calculated as follows:

[0117] ,

[0118] where the two feature terms are the features extracted from the CT and X-ray images of the same patient using a shared encoder. By minimizing this loss, the model is further constrained to learn modality-invariant structural representations, thus enhancing the extraction of structurally stable variables.

[0119] 2) Phase Two: Causal Reinforcement Phase. This phase is the core of causal representation learning. The core objective is to further enhance the causality and robustness of the task-specific variables based on the decoupling in Phase One. This addresses the problem that the discriminative features extracted in Phase One are still bound to the source domain distribution and lack cross-domain generalization ability. It also prevents causal drift during task transfer, ensuring that the features learned by the model are causally related only to the pathological state of the pneumonia pathogen and are unrelated to confounding factors such as equipment, center, and individual patient differences.

[0120] This phase continues the progressive freeze-training strategy: the weights of encoder B, trained in Phase 1, are completely frozen, and a new auxiliary encoder C is introduced. Encoder C uses the same Transformer structure as encoder B, and only the parameters of encoder C are trained, while the parameters of encoder B remain fixed throughout. The core task of encoder C is to further enhance the causal features related to pathogen typing based on the discriminative features extracted by encoder B, filter out the remaining non-causal confounding features, and improve the cross-domain robustness and discriminativeness of the features.

[0121] ,

[0122] where the weighting coefficients balance the two loss terms of this phase: the causal-preserving distillation loss and the task-enhanced classification loss. The detailed design and physical meaning of each loss function are as follows:

[0123] The core function of the causal-preserving distillation loss is to ensure, through knowledge distillation, that encoder C fully inherits the causal structure learned by encoder B. This prevents causal drift during task transfer and ensures that the causal relationships obtained from decoupling in the preceding phase are not destroyed. The loss function uses mean squared error loss, and the calculation formula is as follows:

[0124] $L_{cd} = \frac{1}{N}\sum_{i=1}^{N}\left\| F_i^{B} - F_i^{C} \right\|_2^2$,

[0125] where $F_i^{B}$ denotes the features output by the frozen encoder B, $F_i^{C}$ denotes the features output by encoder C, and $N$ is the number of samples. Minimizing this loss forces encoder C to remain consistent with the feature space of encoder B, thus fully preserving the learned causal structure and avoiding catastrophic forgetting.

[0126] The core function of the task-enhanced classification loss is to further strengthen encoder C's ability to distinguish different pathogen subtypes by using pathogen typing labels, so that the causal features learned by the model are deeply bound to the pathogen typing task. The loss function uses multi-class cross-entropy loss, calculated as follows:

[0127] $L_{cls} = -\frac{1}{N}\sum_{k=1}^{N}\log h_{y_k}\!\left(F_k^{C}\right)$,

[0128] where $y_k$ is the true pathogen category label of the $k$-th sample, $F_k^{C}$ is the feature output by encoder C, $h_{y_k}(\cdot)$ is the forward propagation function of the auxiliary classification head, which outputs the predicted probability of the corresponding category, and $N$ is the number of samples. By minimizing this loss, encoder C is driven to further enhance the causal features associated with different pathogen subtypes, improving the discriminative power and robustness of the features.

[0129] 3) Phase Three: Task-Specific Causal Fine-Tuning. This phase is the final implementation of causal representation learning. The core objective is to further apply the general causal representations learned in the previous phases to the target task of pediatric pneumonia pathogen classification, solve the final adaptation problem between general causal representations and specific classification tasks, alleviate classification bias caused by class imbalance in the dataset, and achieve balanced and high-performance classification of different pathogen subtypes.

[0130] This phase continues the progressive freeze training strategy: the weights of encoder C, trained in Phase 2, are completely frozen. A brand-new final encoder D and a KAN-based classification head are introduced. Only the parameters of encoder D and the KAN classification head are trained, while the parameters of encoder C remain fixed throughout. The core task of encoder D is to distill prior causal knowledge from encoder C and extract task-specific causal features that are fully adapted to the pediatric pneumonia pathogen typing. The core task of the KAN classification head is to complete the final pathogen multi-classification based on the refined causal features and output the probability results of the corresponding categories.
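The progressive freeze-training strategy across the three phases can be summarized as a small configuration sketch. The frozen and trainable modules and the loss terms of each phase follow the text; the dictionary layout, the identifier names such as `encoder_B` and `kan_head`, and the helper function are hypothetical.

```python
# Each phase distills from a frozen teacher into newly introduced trainable modules.
PHASES = [
    {"phase": 1, "frozen_teacher": "shared_encoder", "trainable": ["encoder_B"],
     "losses": ["distillation", "contrastive", "structural_consistency"]},
    {"phase": 2, "frozen_teacher": "encoder_B", "trainable": ["encoder_C"],
     "losses": ["causal_distillation", "classification_cross_entropy"]},
    {"phase": 3, "frozen_teacher": "encoder_C", "trainable": ["encoder_D", "kan_head"],
     "losses": ["task_distillation", "focal"]},
]

ALL_MODULES = ["shared_encoder", "encoder_B", "encoder_C", "encoder_D", "kan_head"]

def trainable_flags(phase):
    """Map every module name to True if it is updated in the given phase."""
    stage = next(s for s in PHASES if s["phase"] == phase)
    return {name: name in stage["trainable"] for name in ALL_MODULES}

flags_phase3 = trainable_flags(3)
```

A training script could use these flags to decide which parameters receive gradient updates in each phase.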

[0131] The optimization objective in this stage is composed of a weighted sum of two loss functions, and the total loss formula is:

[0132] ,

[0133] where the weighting coefficients balance the two loss terms of this phase: the task knowledge distillation loss and the focal loss based on the KAN classification head. The detailed design and physical meaning of each part are as follows:

[0134] The core function of the task knowledge distillation loss is to ensure that encoder D fully inherits all the causal prior knowledge learned by encoder C, guaranteeing that learned causal features are not lost during task fine-tuning, while further refining the task-specific features most relevant to the classification task. The loss function uses mean squared error loss, calculated as follows:

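[0135] $L_{tkd} = \frac{1}{N}\sum_{i=1}^{N}\left\| F_i^{C} - F_i^{D} \right\|_2^2$,

[0136] where $F_i^{C}$ denotes the causal features output by the frozen encoder C, $F_i^{D}$ denotes the task-specific refined features output by encoder D, and $N$ is the number of samples.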

[0137] Focal loss based on the KAN classification head. This scheme uses a Kolmogorov-Arnold Network (KAN) instead of the traditional fully connected MLP classification head. The core reason is that KAN replaces the fixed activation functions of the traditional MLP with learnable activation functions, which gives it a stronger ability to fit fine-grained features than the traditional MLP. It is especially suitable for fine-grained classification tasks such as pediatric pneumonia pathogen typing, and it fits small-sample categories better. The KAN classification head receives 512-dimensional refined features from encoder D and outputs a 4-channel probability map, corresponding to the four categories of normal lungs, bacterial pneumonia, mycoplasma pneumonia, and viral pneumonia.

[0138] To address the inherent class imbalance problem in the pediatric pneumonia dataset, this approach employs focal loss as the classification optimization objective. This loss automatically reduces the loss weight of easily classified samples, strengthens the learning of difficult-to-classify and small-sample classes, and alleviates the classification bias caused by class imbalance. The focal loss is calculated as:

[0139] $L_{focal} = -\sum_{c} y_c \left(1 - p_c\right)^{\gamma} \log p_c$,

[0140] where $y_c$ is the true category label (one-hot), $p_c$ is the predicted class probability output by the KAN classification head, and $\gamma$ is the focusing parameter, set to 2.0 by default, which controls the degree of emphasis on difficult-to-classify samples.
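A minimal sketch of the focal loss for a single sample is shown below, using the default focusing parameter of 2.0 from the text. The function name, the clamping constant `eps`, and the example probabilities are hypothetical; no class-balancing factor is included because the text does not mention one.

```python
import math

def focal_loss(probs, label, gamma=2.0, eps=1e-12):
    """Focal loss for one sample given class probabilities and the true class index."""
    p_t = max(probs[label], eps)  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical predictions over the four classes, true class index 1.
easy = focal_loss([0.05, 0.90, 0.03, 0.02], label=1)
hard = focal_loss([0.40, 0.30, 0.20, 0.10], label=1)
plain_ce = focal_loss([0.40, 0.30, 0.20, 0.10], label=1, gamma=0.0)
```

With `gamma=0.0` the expression reduces to the ordinary cross-entropy, and with `gamma=2.0` the well-classified sample contributes far less than the poorly classified one.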

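[0141] 4) Classification probability output. Let the logit output of the KAN classification head be $z = (z_1, z_2, z_3, z_4)$. The probability of each class is obtained by Softmax:

[0142] $p_c = \frac{\exp(z_c)}{\sum_{j=1}^{4}\exp(z_j)}$,

[0143] The final category label is:

[0144] $\hat{y} = \arg\max_{c}\, p_c$,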

[0145] When the highest classification confidence level is lower than the preset threshold (e.g., 0.85), the system outputs a "pending review" prompt, reminding the physician to conduct further manual review.
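The classification output and the review prompt can be sketched as follows. The Softmax, the selection of the most probable class, the four categories, and the example threshold of 0.85 follow the text; the class ordering, the function names, the returned dictionary layout, and the example logits are assumptions.

```python
import math

CLASSES = ["normal", "bacterial pneumonia", "mycoplasma pneumonia", "viral pneumonia"]

def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, threshold=0.85):
    """Return the predicted label, its confidence, and a pending-review flag."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": CLASSES[idx],
        "confidence": probs[idx],
        "pending_review": probs[idx] < threshold,
    }

# Hypothetical logits: one confident case and one ambiguous case.
confident = classify([0.1, 5.0, 0.2, 0.3])
ambiguous = classify([1.0, 1.2, 0.9, 1.1])
```

In the ambiguous case the highest probability falls below the threshold, so the result is flagged for manual review by a physician.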

[0146] (6) System Deployment Module

[0147] System Composition. This invention also provides a pediatric pneumonia pathogen classification system based on causal representation learning, including a data acquisition module, a preprocessing module, a model building module, a model training module, an inference module, and a report output module. The data acquisition module is used to acquire CT and X-ray multimodal images; the preprocessing module is used to perform resampling, normalization, denoising, registration, and spatial transformation augmentation; the model building module is used to construct the STAR-ViT network; the model training module is used to optimize model parameters based on TCM and a three-stage causal representation learning strategy; the inference module is used to output multi-channel probability maps of the images under test; and the report output module is used to generate pathogen classification labels, confidence scores, suspected lesion areas of interest, and prompts for review.

[0148] Experimental verification

[0149] To validate the effectiveness of STAR-ViT, three types of experiments were designed: a comparison experiment with mainstream models, a core module ablation experiment, and external generalization validation on the ChestXRay2017 dataset. The core dataset was a self-built pediatric pneumonia pathogen CT dataset, employing a 5-fold cross-validation strategy; additionally, cross-modal and cross-dataset generalization ability was validated on the publicly available ChestXRay2017 pediatric pneumonia dataset. Evaluation metrics included accuracy (ACC), area under the receiver operating characteristic curve (AUC), recall (REC), precision (PRE), specificity (SPE), F1 score (F1), and balanced accuracy (BACC).
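For reference, the threshold-based evaluation metrics listed above can be computed per class in a one-vs-rest manner, as in the sketch below. The metric definitions are the standard ones; treating balanced accuracy as the mean of recall and specificity for the one-vs-rest case, the function name, and the toy labels are assumptions, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Compute ACC, REC, PRE, SPE, F1, and BACC for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero

    rec = ratio(tp, tp + fn)
    pre = ratio(tp, tp + fp)
    spe = ratio(tn, tn + fp)
    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "REC": rec,
        "PRE": pre,
        "SPE": spe,
        "F1": ratio(2 * pre * rec, pre + rec),
        "BACC": (rec + spe) / 2,
    }

# Toy labels for illustration only; these are not results from the experiments.
y_true = ["bact", "bact", "bact", "viral", "viral", "normal", "normal", "normal"]
y_pred = ["bact", "bact", "viral", "bact", "viral", "normal", "normal", "normal"]
metrics = one_vs_rest_metrics(y_true, y_pred, positive="bact")
```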

[0150] The experimental environment was uniformly set up as follows: 8 NVIDIA RTX 4090 GPUs, Intel Xeon Platinum 8375C CPUs, and 256GB of RAM; the software environment consisted of TensorFlow 2.15, CUDA 12.2, cuDNN 8.9, and Python 3.10. All comparison models used the same data preprocessing, data augmentation strategies, data partitioning methods, and optimizer settings to ensure fairness in the comparison.

[0151] To verify the performance of the STAR-ViT model proposed in this invention in the classification task of pediatric pneumonia pathogens, STAR-ViT is compared under the same conditions with three categories of mainstream models in the current field of medical image classification:

[0152] Classic CNN models: InceptionV3, ResNet50, DenseNet121, EfficientNetB0;

[0153] Transformer models: ViT, Swin Transformer;

[0154] Mainstream self-supervised (SSL) methods: PCRL, MoCo v3, UniMiSS+.

[0155] All comparative models were trained and tested on a self-built pediatric pneumonia pathogen CT (PPP-CT) dataset, employing identical data preprocessing, data augmentation, training optimization strategies, and data partitioning methods to ensure fairness in the comparison. Evaluation metrics included accuracy, area under the receiver operating characteristic curve (AUC), recall, specificity, precision, F1 score, and balanced accuracy, covering the three subtypes (bacterial pneumonia, mycoplasma pneumonia, and viral pneumonia) as well as the overall average performance. All results are expressed as mean ± standard deviation after 5-fold cross-validation. Detailed comparison results are shown in Table 1.
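The 5-fold protocol and the "mean ± standard deviation" reporting can be sketched as below. The number of folds follows the text; the shuffling seed, the use of the sample standard deviation (`statistics.stdev`), the two-decimal formatting, and the function names are assumptions, since the text does not specify them.

```python
import random
import statistics

def k_fold_indices(n_samples, k=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs for k-fold validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal test folds
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

def summarize(fold_scores):
    """Format per-fold scores (in percent) as 'mean±std' with two decimals."""
    mean = statistics.mean(fold_scores)
    std = statistics.stdev(fold_scores)  # assumption: sample standard deviation
    return f"{mean:.2f}±{std:.2f}"

splits = k_fold_indices(23)
```

Each metric would be computed once per fold on the held-out indices and then passed to `summarize` to obtain the reported value.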

[0156] Table 1. Comparison of classification performance of different methods (mean ± standard deviation, %)

[0157]

[0158] Results analysis:

[0159] (1) Overall performance is superior and significantly surpasses all mainstream models.

[0160] The STAR-ViT method of this invention achieves best performance across all average evaluation metrics: average accuracy of 92.07±0.81%, average AUC of 94.95±0.73%, average recall of 86.10±1.50%, average specificity of 93.89±0.80%, average precision of 73.17±1.81%, average F1 score of 77.39±1.81%, and average balanced accuracy of 89.99±0.87%. Compared to the best-performing comparative method UniMiSS+, STAR-ViT improves average accuracy by 4.91 percentage points, average AUC by 3.34 percentage points, and average balanced accuracy by 4.90 percentage points, comprehensively validating the superior performance of the model of this invention.

[0161] (2) Targeted improvements over each model category, addressing the core defects of existing models.

[0162] Compared to classic CNN models, STAR-ViT achieves an average accuracy improvement of 9.12 percentage points over the best-performing DenseNet121 (82.95±2.31%), outperforming it across all metrics. This addresses the core shortcomings of traditional CNNs, such as a limited receptive field and difficulty in capturing long-range contextual dependencies in diffuse lung lesions.

[0163] Compared to the Transformer baseline models, STAR-ViT achieves an average accuracy improvement of 8.37 percentage points over the best-performing Swin Transformer (83.70±3.77%), addressing traditional ViT's lack of spatial inductive bias and poor robustness to lesion spatial variability.

[0164] Compared to mainstream self-supervised methods, STAR-ViT comprehensively surpasses PCRL, MoCo v3, and UniMiSS+, demonstrating that the self-supervised tasks designed in this invention, such as spatial transformation consistency, adversarial cross-modal alignment, and causal representation learning, are more suitable for fine-grained classification scenarios of pediatric pneumonia pathogens, and solve the problem of mismatch between pre-training objectives and downstream classification tasks in traditional self-supervised methods.

[0165] (3) Balanced high performance across all subtypes, effectively mitigating the problem of class imbalance.

[0166] STAR-ViT outperformed the comparison methods across all indicators for the three pneumonia subtypes:

[0167] For bacterial pneumonia with the smallest sample size: STAR-ViT achieved an accuracy of 89.96±0.83%, a recall of 88.91±0.93%, and an F1 score of 93.09±0.60%, which are 2.79, 7.78, and 4.94 percentage points higher than UniMiSS+, respectively. This significantly improves the classification performance for small sample classes and effectively alleviates the classification bias caused by class imbalance in the pediatric pneumonia dataset.

[0168] For mycoplasma pneumonia, which has the largest sample size: STAR-ViT achieved an accuracy of 93.07±0.85%, an AUC of 95.27±0.72%, and an F1 score of 82.16±2.26%, maintaining extremely high classification accuracy;

[0169] For viral pneumonia: STAR-ViT also leads in all indicators, achieving balanced and high-performance classification of the three subtypes.

[0170] (4) The model exhibits excellent stability, strong generalization and reproducibility.

[0171] The standard deviations of all evaluation metrics of STAR-ViT are at extremely low levels. For example, the standard deviation of the average accuracy is only 0.81%, which is much lower than that of comparative models such as ResNet50 (5.24%) and ViT (4.12%). This proves that the Task Coordination Modulation (TCM) module of this invention effectively alleviates gradient conflicts in multi-task optimization, greatly improves the stability, generalization and reproducibility of model training, and is more suitable for actual clinical application scenarios.

[0172] Using a pure ViT model as the baseline, a stepwise ablation experiment based on the controlled variable method was conducted to verify the effectiveness and contribution of each core module of the present invention. The experimental results are shown in Table 2. M1 is the pure ViT baseline model, M2 adds CT-specific pre-training to M1, M3 adds X-ray multimodal pre-training to M2, M4 adds the spatial transformation module to M3, M5 adds the cross-modal adversarial alignment module to M4, and the complete STAR-ViT is the final model with all modules integrated.

[0173] Table 2. Ablation Experiment Results of Core Modules (Mean ± Standard Deviation, %)

[0174]

[0175] Results analysis:

[0176] The ablation experiment results show that each core module designed in this invention has a significant positive gain in the classification performance of pediatric pneumonia pathogens, and the complete STAR-ViT model has the best overall performance. Compared with the pure ViT baseline model, the complete model improves the average accuracy by 9.17 percentage points and the average AUC by 15.38 percentage points, achieving a significant leap in overall performance.

[0177] After introducing multimodal self-supervised pre-training from CT and X-ray, the model can fully utilize unlabeled data to learn common pathological features, effectively alleviating the overfitting problem caused by the scarcity of labeled data. The addition of spatial transformation consistency constraints enhances the model's robustness to differences in the spatial location and morphology of lesions, significantly improving lesion detection capabilities. The cross-modal adversarial alignment module effectively reduces the modal domain differences between CT and X-ray, enabling the model to learn more discriminative pathological features, making it the component with the most significant single-module gain. The causal representation learning module can separate core disease features from confounding interference, especially improving the classification accuracy of bacterial pneumonia with a small sample size.

[0178] Furthermore, as the modules are gradually stacked, the standard deviation of various model metrics continues to decrease, and the training stability and consistency of results are significantly enhanced. Experiments fully demonstrate that the synergistic cooperation of the modules in this invention can effectively solve problems such as scarce annotations, modality shift, and insufficient spatial robustness, making the model more suitable for the practical application needs of clinical pneumonia pathogen classification.

[0179] External validation was performed on the publicly available ChestXRay2017 pediatric pneumonia dataset to verify the cross-modal and cross-dataset generalization ability of STAR-ViT. This dataset contains 5856 chest X-ray images of children aged 1-5 years and covers a three-class classification task (normal, bacterial pneumonia, and viral pneumonia), which closely matches the core task of this invention. The comparison results are shown in Table 3.

[0180] Table 3. External validation results of the ChestXRay2017 dataset (mean ± standard deviation, %)

[0181]

[0182] Figure 7 shows the validation loss curves for different models, where Model A represents the model without spatial transformation and cross-modal alignment, Model B represents the model with spatial transformation and cross-modal alignment, and Model C represents the complete STAR-ViT model.

[0183] External validation results show that STAR-ViT achieves an average classification accuracy of 97.86% on the ChestXRay2017 dataset. Compared to the best-performing comparison method UniMiSS+, this represents an improvement of 0.21 percentage points in average accuracy and 0.66 percentage points in average F1 score, achieving 100% classification specificity for the normal category. Experimental results demonstrate that STAR-ViT learns general features related to pneumonia pathology, rather than dataset-specific artifacts, exhibiting strong cross-modal and cross-dataset generalization capabilities, making it adaptable to the diagnostic needs of different clinical scenarios.

[0184] Example 2

[0185] This embodiment provides a classification system for pediatric pneumonia pathogens based on causal representation learning.

[0186] A computer-readable storage medium stores a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the aforementioned method for classifying pediatric pneumonia pathogens based on causal representation learning.

[0187] A terminal device includes a processor and a computer-readable storage medium; the processor is configured to execute instructions, and the computer-readable storage medium is configured to store a plurality of instructions adapted to be loaded by the processor to execute the aforementioned method for classifying pediatric pneumonia pathogens based on causal representation learning.

[0188] The above are all preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Therefore, all equivalent changes made in accordance with the structure, shape and principle of the present invention should be covered within the scope of protection of the present invention.

Claims

1. A method for classifying pediatric pneumonia pathogens based on causal representation learning, characterized in that it comprises: acquiring multimodal image data; performing data preprocessing on the acquired multimodal image data; performing spatial transformation and cross-modal shared coding on the preprocessed data; performing decoding and adversarial realignment on the encoded data, including multimodal reconstruction decoding and target domain decoding, cross-modal loss reconstruction, and TCM-based task coordination modulation; and performing three-stage causal representation learning on the aligned features and outputting classification results.

2. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 1, characterized in that the data preprocessing based on the acquired multimodal image data comprises converting the 3D CT and 2D X-ray images of the acquired pediatric chest multimodal image data, which come from different sources, devices, and resolutions, into a unified, trainable, and aligned input representation. A training sample set is defined [formula missing], whose elements are, for the i-th patient, the three-dimensional chest CT image, the corresponding or same-domain two-dimensional chest X-ray image, and the pathogen classification label. Image size and spatial resolution are then standardized: for 3D chest CT images, data from different acquisition centers are resampled to an isotropic voxel resolution [value missing] and the volume size is normalized to [size value missing]; for 2D chest X-ray images, the images are uniformly scaled to [size value missing]. For chest CT images, dual-window normalization is applied using both a lung window and a mediastinal window, the lung window having a width of 1500 HU and a level of -600 HU and the mediastinal window having a width of 350 HU and a level of 40 HU, to highlight the texture of lung lesions and the anatomical structure. For chest X-ray images, histogram equalization is used to enhance lesion contrast. To address differences in patient position and projection between CT and X-ray images, a rigid registration method based on mutual information aligns the lung field regions of the X-ray images with the corresponding maximum intensity projection images of the CT images, ensuring relative consistency of anatomical structures such as lesion centers and lobar boundaries across the multimodal images. Finally, random rotation, translation, scaling, flipping, and affine transformations are applied to the preprocessed original images to generate spatially perturbed paired images for the subsequent spatial consistency constraints [formula missing], in which the transformation terms denote random rotation, random translation, and random scaling.
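
By way of non-limiting illustration, the dual-window normalization recited in claim 2 can be sketched as follows. The window widths and levels are those stated in the claim; the linear mapping to [0, 1], the two-channel output, and the function names `apply_window` and `dual_window` are assumptions introduced here for illustration only.

```python
# Sketch of dual-window CT normalization (assumed linear mapping to [0, 1]).

LUNG_WINDOW = {"level": -600.0, "width": 1500.0}        # values from claim 2
MEDIASTINAL_WINDOW = {"level": 40.0, "width": 350.0}    # values from claim 2


def apply_window(hu, level, width):
    """Clip a Hounsfield value to the window and rescale it to [0, 1]."""
    low = level - width / 2.0
    high = level + width / 2.0
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


def dual_window(hu_values):
    """Return one (lung, mediastinal) pair per input HU value.

    Treating the two windows as two channels is an assumption; the claim
    only states that both windows are used.
    """
    return [
        (apply_window(v, **LUNG_WINDOW), apply_window(v, **MEDIASTINAL_WINDOW))
        for v in hu_values
    ]
```

For example, a voxel at -600 HU sits at the center of the lung window and below the lower bound of the mediastinal window, so it maps to the pair (0.5, 0.0).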

3. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 2, characterized in that the spatial transformation and cross-modal shared coding based on the preprocessed data comprises achieving unified coding adaptation of 2D/3D multimodal inputs through dimension selection, multi-scale Transformer coding, and spatial transformation consistency constraints, while forcing the model to learn the pathological semantic features of lesions. In dimension selection, for 3D chest CT images, the preprocessed CT data are first normalized to a size of 224×224×128 voxels and uniformly cropped into non-overlapping 3D image blocks, so that the block scale matches the scale of small lung nodules and ground-glass opacities in children; feature encoding is then completed through three consecutive 3D convolution operations, generating a 512-dimensional feature vector for each 3D image block, so that the entire CT image yields a 1568×512-dimensional token sequence. For 2D chest X-ray images, the X-ray image normalized to 224×224 pixels is divided into 1568 two-dimensional image blocks of 16×16 pixels with a 20% overlap rate; then, through 3... successive 2D convolution operations, encoding is completed, generating a 512-dimensional feature vector for each 2D image block, so that the entire X-ray image also yields a 1568×512-dimensional token sequence. The multi-scale Transformer encoding layer comprises a stack of 8 Transformer encoding layers, each consisting, in sequence, of layer normalization, a multi-head self-attention mechanism, a feedforward network (FFN), residual connections, and a second layer normalization. This simultaneously captures the local fine-grained texture features and the global anatomical context dependencies of pediatric lung images, addressing core shortcomings of traditional CNN models such as a limited receptive field and difficulty in capturing long-range associations of diffuse lesions. The feedforward network adopts a two-layer structure of linear layer → GELU activation → Dropout layer → linear layer, applying a nonlinear transformation to the attention output to further enhance the pathological discriminative features relevant to pneumonia pathogen classification. Each encoding layer has two residual connections, one around the multi-head self-attention module and one around the feedforward network module, to alleviate vanishing gradients when training deep Transformer networks. The module extracts features layer by layer through the Transformer encoding layers and finally outputs 1568×512-dimensional shared features.
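
By way of non-limiting illustration, the token count for the 3D branch of claim 3 can be checked with simple arithmetic. The claim does not state the 3D block size; the 16×16×16 block used below is an assumption, chosen because it is one size that tiles a 224×224×128 volume into exactly 1568 non-overlapping blocks. The function name `grid_tokens` is likewise hypothetical.

```python
def grid_tokens(volume_shape, patch_shape):
    """Count the non-overlapping patches that tile a volume exactly."""
    count = 1
    for size, patch in zip(volume_shape, patch_shape):
        if size % patch != 0:
            raise ValueError(f"patch size {patch} does not divide {size}")
        count *= size // patch
    return count


# Assumed 16x16x16 blocks: (224 / 16) * (224 / 16) * (128 / 16) = 14 * 14 * 8
ct_tokens = grid_tokens((224, 224, 128), (16, 16, 16))
```

Under that assumption each CT volume yields 14 × 14 × 8 = 1568 tokens, each of which the claim maps to a 512-dimensional feature vector.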

4. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 3, characterized in that the spatial transformation and cross-modal shared coding based on the preprocessed data further comprises spatial transformation consistency constraints, specifically: online spatial transformation augmentation is applied to the input original image to generate a spatially perturbed paired image, the spatial transformation operations including random horizontal/vertical flipping in the sagittal, coronal, and axial planes; random rotation around the X, Y, and Z axes within ±30°; random translation within ±10%; random scaling by a factor of 0.8 to 1.2; and random affine transformation. The original image and the spatially transformed paired image are input synchronously into the shared encoder for feature extraction, yielding original-image features and transformed-image features. A cosine similarity loss constrains the original-image features and the transformed-image features to remain highly consistent in the feature space. The loss function is defined as [formula missing], in which the terms denote the encoded features of the original image and the encoded features of the spatially transformed image, and N is the number of samples in the batch.
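
By way of non-limiting illustration, a cosine similarity consistency loss of the kind recited in claim 4 can be sketched as follows. The claim text does not reproduce the formula; the form used here, one minus the cosine similarity averaged over the N samples of a batch, is an assumption, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def consistency_loss(original_feats, transformed_feats):
    """Assumed form: mean over the batch of (1 - cosine similarity)."""
    n = len(original_feats)
    return sum(
        1.0 - cosine_similarity(f, g)
        for f, g in zip(original_feats, transformed_feats)
    ) / n
```

With this form the loss approaches 0 when the features of an image and its transformed copy point in the same direction, and grows toward 2 as they become opposed.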

5. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 4, characterized in that the multimodal reconstruction decoding and target domain decoding comprise a multimodal reconstruction decoder with a transposed convolution + residual block structure symmetrical to the ST-CMFAE shared encoder, containing 4 consecutive transposed convolutional layers. Each transposed convolutional layer has a residual connection and batch normalization, with GELU as the activation function. The parameters of each transposed convolutional layer fully match those of the corresponding convolutional layer of the encoder, so that the feature information compressed during encoding is fully recovered during decoding. The decoder takes as input the multimodal general features output by the ST-CMFAE shared encoder and outputs a reconstructed image of the same size as the input image. During training, a random block masking strategy preprocesses the input image: the mask block size equals the image block size of the DSM module, part of the input image is randomly occluded, and only the unoccluded image area is input to the encoder for feature extraction; the decoder then recovers the complete original image, including the masked areas, from the compressed features. During cross-modal feature alignment, a target domain decoder (TD-decoder) preserves the pathology-specific features of the CT target domain, so that fine-grained CT image features highly correlated with pediatric pneumonia pathogen classification are not lost for the sake of modal alignment. The target domain decoder adopts the same symmetrical structure of 4 layers of transposed convolution + residual blocks, but it receives only the encoder feature inputs specific to CT images, outputs only the reconstructed 3D CT image, and does not participate in the reconstruction of X-ray images. Through the parallel constraint of the TD-decoder, cross-modal alignment between CT and X-ray is achieved while the key discriminative features of the target domain CT are prevented from degrading, balancing cross-modal generalization against target domain discriminability.

6. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 5, characterized in that the cross-modal loss reconstruction comprises multimodal reconstruction and target domain reconstruction, both of which use a mean squared error loss [formula missing], in which the terms denote the original input image, the reconstructed image output by the decoder, and the number of samples or pixels. The domain discriminator is a multilayer perceptron with 3 hidden layers of 1024, 2048, and 512 neurons, respectively; it receives the CT features and the X-ray features output by the shared encoder and predicts the modality to which each feature belongs. To avoid the mode collapse and vanishing gradients to which traditional GAN adversarial training is prone, the adversarial optimization objective is constructed with the Wasserstein generative adversarial network loss with gradient penalty (WGAN-GP). The discriminator loss, the encoder adversarial loss, the bimodal interpolation features, and the overall optimization objective are each given by a formula [formulas missing], in which the terms denote the domain discriminator, the gradient penalty weight coefficient, the feature vector of the chest X-ray image, the Euclidean norm, and the gradient vector of the discriminator output with respect to the interpolated features. Through the minimax game, the modality domain gap caused by the different imaging principles of CT and X-ray is eliminated, and stable modality-invariant pathological features are extracted.
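
By way of non-limiting illustration, the losses named in claim 6 may be written in their commonly used textbook forms. The claim text does not reproduce the formulas, so everything below is an assumption: the symbols $x_i$, $\hat{x}_i$, $f_{CT}$, $f_{XR}$, $\hat{f}$, $\lambda$, and $\epsilon$ are introduced here, and treating CT features as the "real" side of the critic (CT being the target domain) is one possible sign convention rather than the one necessarily used in the application.

```latex
% Mean squared error reconstruction loss over N samples (or pixels)
\mathcal{L}_{\mathrm{rec}} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert x_i - \hat{x}_i \right\rVert_2^2

% Interpolated features between the two modalities
\hat{f} = \epsilon\, f_{CT} + (1-\epsilon)\, f_{XR}, \qquad \epsilon \sim U[0,1]

% WGAN-GP discriminator (critic) loss with gradient penalty weight \lambda
\mathcal{L}_{D} = \mathbb{E}\!\left[D(f_{XR})\right] - \mathbb{E}\!\left[D(f_{CT})\right]
  + \lambda\, \mathbb{E}\!\left[\left(\left\lVert \nabla_{\hat{f}} D(\hat{f}) \right\rVert_2 - 1\right)^2\right]

% Encoder adversarial loss: the encoder tries to make X-ray features score like CT features
\mathcal{L}_{\mathrm{adv}} = -\,\mathbb{E}\!\left[D(f_{XR})\right]
```

Under these assumed forms the discriminator minimizes $\mathcal{L}_{D}$ while the shared encoder minimizes $\mathcal{L}_{\mathrm{adv}}$, which is the minimax game referred to in the claim.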

7. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 6, characterized in that the TCM-based task coordination modulation comprises constructing a task coordination modulation (TCM) module. In the first stage, unlabeled CT and X-ray image data are used to jointly train the ST-CMFAE shared encoder, the multimodal reconstruction decoder, the target domain decoder, and the domain discriminator, the training objectives including spatial consistency constraints, multimodal reconstruction, and cross-modal adversarial alignment. First, task gradients are acquired: assuming a shared..., each task has a loss [formula missing] and a corresponding gradient [formula missing]. Then gradient similarity is computed as the cosine similarity between the gradients of any two tasks [formula missing], in which the terms denote the gradient vector of the i-th task, the transpose of a gradient vector, and the Euclidean norm of a gradient vector; when [condition missing], the optimization directions of the two tasks conflict. Finally, dynamic weight adjustment is performed: TCM dynamically updates the task weights according to the gradient similarity [formula missing], in which the terms denote the Sigmoid function, the temperature parameter, the original weight of the task, and the updated weight of the task. After TCM dynamic modulation, the total loss of the first-stage pre-training is expressed as [formula missing], whose three weights are the TCM-modulated weights of the spatial consistency loss, the reconstruction loss, and the encoder adversarial loss.
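
By way of non-limiting illustration, the gradient similarity and weight adjustment steps of claim 7 can be sketched as follows. The cosine similarity between task gradients follows the claim's description. The weight update rule is not reproduced in the claim text; the rule below (original weight multiplied by the Sigmoid of the mean similarity to the other tasks, divided by a temperature) is a hypothetical stand-in built only from the ingredients the claim names, and treating a negative cosine as the conflict condition is likewise an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity g_i^T g_j / (||g_i|| ||g_j||) of two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def conflicts(g_i, g_j):
    """Assumed conflict test: the two gradients point in opposing directions."""
    return cosine(g_i, g_j) < 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tcm_update(weights, grads, tau=1.0):
    """Hypothetical update: w_i' = w_i * sigmoid(mean_j cos(g_i, g_j) / tau)."""
    updated = []
    for i, w in enumerate(weights):
        sims = [cosine(grads[i], g) for j, g in enumerate(grads) if j != i]
        mean_sim = sum(sims) / len(sims)
        updated.append(w * sigmoid(mean_sim / tau))
    return updated
```

Under this stand-in rule, a task whose gradient opposes the others has its weight reduced more strongly than a task whose gradient agrees with at least some of them.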

8. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 7, characterized in that the three-stage causal representation learning based on the aligned features and the output of classification results comprise a three-stage progressive causal representation learning process and pathogen classification using a KAN classification head. Stage one is the causal decoupling stage, which uses a freeze-training strategy: the weights of the ST-CMFAE shared encoder obtained from the first-stage pre-training are completely frozen, and a new auxiliary encoder B, with a Transformer structure identical to that of the shared encoder, is introduced; only the parameters of encoder B are trained, while the parameters of the shared encoder remain fixed throughout. Encoder B separates structurally stable variables and task-discriminative variables from the general features output by the fixed shared encoder, achieving an initial decoupling of causal variables. The optimization objective is a weighted sum of three loss functions [formula missing], with two weighting coefficients. The first term is a distillation loss: knowledge distillation ensures that encoder B fully inherits the structurally stable features learned by the shared encoder [formula missing], in which the terms denote the features output by the frozen shared encoder and the features output by encoder B. The second term is a contrastive loss, which drives encoder B to extract task-specific features related to pathogen typing, separating discriminative features from structural features [formula missing], in which the terms denote the different-modality images of the i-th patient, the cosine similarity function, and the temperature parameter. The third term is a structural consistency loss, which strengthens encoder B's learning of modality-invariant structural features [formula missing], in which the terms denote the features extracted by the shared encoder from the CT and X-ray images of the same patient.
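
By way of non-limiting illustration, a contrastive loss built from cosine similarity and a temperature parameter, as named in claim 8, can be sketched in the widely used InfoNCE form. The claim text does not reproduce the formula, so the InfoNCE form, the choice of the same patient's CT and X-ray features as the positive pair, and the function names are all assumptions.

```python
import math


def cos_sim(u, v):
    """Cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(ct_feats, xr_feats, tau=0.1):
    """Mean InfoNCE loss; CT feature i is assumed to match X-ray feature i."""
    n = len(ct_feats)
    total = 0.0
    for i in range(n):
        logits = [cos_sim(ct_feats[i], xr_feats[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the maximum for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / n
```

In this form the loss is small when each patient's CT and X-ray features are more similar to each other than to the features of other patients in the batch.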

9. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 8, characterized in that the three-stage progressive causal representation learning process further comprises a second stage, the causal reinforcement stage, whose goal is to enhance the causality and robustness of the task-discriminative variables on the basis of the decoupling achieved in the first stage. This stage continues the progressive freeze-training strategy: the weights of encoder B trained in the first stage are completely frozen, and a new auxiliary encoder C, with the same Transformer structure as encoder B, is introduced; only the parameters of encoder C are trained, while the parameters of encoder B remain fixed throughout. Using the discriminative features extracted by encoder B, encoder C strengthens the causal features related to pathogen typing and filters out any remaining non-causal confounding features. The objective is expressed as [formula missing]. The first term is a causality-preserving distillation loss: through knowledge distillation, encoder C fully inherits the causal structure learned by encoder B [formula missing], in which the terms denote the features output by the frozen encoder B and the features output by encoder C. The second term is a task-enhancing classification loss, which uses the labeled pathogen types to strengthen encoder C's ability to distinguish different pathogen subtypes [formula missing], in which the terms denote the true pathogen category label of the k-th sample, the features output by encoder C, and the forward propagation function of the auxiliary classification head.

10. The method for classifying pediatric pneumonia pathogens based on causal representation learning according to claim 9, characterized in that the three-stage progressive causal representation learning process further comprises a third stage, task-specific causal fine-tuning, which applies the general causal representations learned in the previous stages to the target task of pediatric pneumonia pathogen classification, achieving balanced, high-performance classification of the different pathogen subtypes. Based on the weights of encoder C trained in stage two, a new final encoder D and a KAN-based classification head are introduced; only the parameters of encoder D and the KAN classification head are trained, while the parameters of encoder C remain fixed throughout. Encoder D distills prior causal knowledge from encoder C to extract task-specific causal features fully adapted to pediatric pneumonia pathogen classification. The KAN classification head performs the final multi-class classification of pathogens from the refined causal features and outputs the probability of each category. The optimization objective of this stage is a weighted sum of two loss functions [formula missing], with weighting coefficients. The first is a task knowledge distillation loss, by which encoder D fully inherits the causal prior knowledge learned by encoder C [formula missing], in which the terms denote the causal features output by the frozen encoder C and the task-specific refined features output by encoder D. The second is a focal loss based on the KAN classification head, in which KAN replaces the traditional MLP fully connected classification head. To address the class imbalance inherent in pediatric pneumonia datasets, focal loss (FocalLoss) is used as the classification optimization objective; by reducing the loss weight of easily classified samples, it mitigates the classification bias caused by class imbalance [formula missing], in which the terms denote the true category label, the predicted class probability output by the KAN classification head, and the focusing parameter. The logits output by the KAN classification head [formula missing] are converted into class probabilities by Softmax [formula missing], in which the terms denote the logit of the c-th class and the logit of the j-th class output by the KAN classification head, and the natural exponential function.
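
By way of non-limiting illustration, the Softmax and focal loss steps of claim 10 can be sketched as follows. The Softmax is the standard one described in the claim. The focal loss formula is not reproduced in the claim text; the standard single-sample form without a class-balancing factor is assumed here, and the default focusing parameter `gamma=2.0` is an illustrative value, not one stated in the application.

```python
import math


def softmax(logits):
    """Convert logits to class probabilities: exp(z_c) / sum_j exp(z_j)."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Assumed form: -(1 - p_t)^gamma * log(p_t) for the true class t."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0` this reduces to ordinary cross-entropy; with a larger focusing parameter, samples that are already classified confidently contribute much less to the loss than hard samples.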