Method for constructing pathogenic bacteria dynamic recognition network based on time sequence images and clinical data

By constructing a deep learning dynamic network, the problems of multi-class feature separation and rigid resource allocation in multimodal data fusion were solved, which improved the accuracy of multi-label recognition and the adaptive allocation of resources, thereby enhancing the robustness and efficiency of pathogen identification.

CN122314352A · Pending · Publication Date: 2026-06-30 · THE FIRST AFFILIATED HOSPITAL OF BENGBU MEDICAL COLLEGE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
THE FIRST AFFILIATED HOSPITAL OF BENGBU MEDICAL COLLEGE
Filing Date
2026-04-26
Publication Date
2026-06-30


Abstract

This invention discloses a method for constructing a dynamic pathogen identification network based on time-series images and clinical data, belonging to the field of artificial intelligence technology. The method uses a dual-channel deep encoder composed of a 3D convolutional neural network and a recurrent neural network. After confidence-gated fusion, a fused feature vector is extracted. A multi-head cross-attention network with an inter-prototype contrast modulation mechanism is introduced to decompose the fused feature vector into a sparse superposition of category-specific components, achieving deep category attribution analysis of the mixed signal. A second network jointly analyzes the decomposition residuals and dimensional competition features. When the residual significance score exceeds a first preset threshold, a feedback signal is injected into the first network for a second round of conditional refinement decomposition, forming a deep learning dynamic network of decomposition-evaluation-guidance-re-decomposition. This invention can adaptively identify three scenarios: single infection, mixed infection, and weak signal infection.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, specifically to a method for constructing a dynamic identification network for pathogens based on time-series images and clinical data.

Background Technology

[0002] When performing multimodal data fusion analysis, especially when it is necessary to combine data from different sources, such as spatially structured data and time-evolving serialized indicator data, and when conducting joint analysis to identify multiple potential targets or states, the following limitations exist:

[0003] First, existing methods mostly adopt the "overall fusion-single-decision" paradigm, which implicitly assumes that each sample corresponds to only a single dominant category. However, in practical applications, scenarios where multiple categories coexist are common. When signals from multiple categories are superimposed in the fused features, existing methods struggle to effectively separate the feature contributions of each category, leading to the underidentification of coexisting categories.

[0004] Second, when there are significant differences in signal strength among different categories, the features of weaker categories are easily overwhelmed by stronger signals. Existing methods lack a comprehensive mechanism for in-depth attribution and residual analysis of fused features, resulting in insufficient ability to identify weak signal categories.

Summary of the Invention

[0005] To address the problems existing in the prior art, this invention provides a method for constructing a dynamic identification network for pathogens based on temporal images and clinical data.

[0006] To achieve the above objectives, the technical solution of the present invention is as follows:

[0007] In a first aspect, this application discloses a method for constructing a dynamic identification network for pathogens based on temporal images and clinical data, comprising the following steps:

[0008] Acquire clinical data and 3D images, preprocess the clinical data, and obtain a standardized time series matrix;

[0009] Construct a dual-channel deep encoder that takes the 3D image and the standardized time series matrix as input and outputs a fused feature vector. The encoder includes a pre-trained 3D convolutional neural network, a single-layer unidirectional gated recurrent unit, and confidence gating;

[0010] Construct a first network, which is a multi-head cross-attention network that introduces a predefined category prototype matrix. For each category prototype vector in the category prototype matrix, the contrast difference vector between it and its nearest competing prototype is calculated and used to modulate the attention mechanism when attention scores are computed from the fused feature vector. A component matrix is generated from these scores, the main hypothesis category is extracted from the component matrix, and the decomposition residual and dimensional competition features are calculated;

[0011] Construct a second network based on a multilayer perceptron. Taking the decomposition residual and the dimensional competition features as input, this network generates a residual significance score, a dimension guidance vector, and a category affinity vector, and then makes a determination:

[0012] When the residual significance score is lower than the first preset threshold, the main hypothesis category is output;

[0013] When the residual significance score is not lower than the first preset threshold, the decomposition residual is used as the fused feature vector, and the dimension guidance vector and category affinity vector are injected as feedback signals into the attention score calculation to generate a second component matrix, which is then judged as follows:

[0014] (1) When the second component matrix contains a category row whose energy exceeds the second preset threshold, a dual-category mixed infection result is output;

[0015] (2) When no category row in the second component matrix has energy exceeding the second preset threshold, a corrected category is output based on a comprehensive comparison of the evidence for the two highest-energy rows of the component matrix;

[0016] The dual-channel deep encoder, the first network, and the second network are trained end-to-end using a multi-task collaborative loss until convergence.

[0017] In a second aspect, this application discloses an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method described above.

[0018] In a third aspect, this application discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the aforementioned method for constructing a dynamic identification network for pathogens based on time-series images and clinical data.

[0019] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0020] 1. Multi-label recognition capability: Through additivity feature decomposition and cascaded decision mechanism, it can explicitly model and identify multiple categories coexisting in multimodal data, effectively avoiding missed identification in multi-label scenarios;

[0021] 2. Weak signal perception capability: Through residual analysis and feedback-guided refinement decomposition, the network can focus on signal dimensions that were not fully explained in the initial decomposition, thereby improving the accuracy of weak signal category identification.

[0022] 3. Fine-grained discriminative ability: Through the inter-prototype contrast modulation mechanism, the network automatically focuses on the discriminative dimension that is most capable of distinguishing similar categories during feature decomposition, thereby improving the robustness of fine-grained classification;

[0023] 4. Dynamic computational efficiency: A cascaded triggering mechanism is adopted. For samples that are easy to judge, inference reduces to a single forward pass. Refinement analysis is triggered only for samples with signal competition or uncertainty, which realizes adaptive allocation of computing resources.

Description of the Drawings

[0024] The disclosure of this invention is illustrated with reference to the accompanying drawings. It should be understood that the drawings are for illustrative purposes only and are not intended to limit the scope of protection of this invention. In the drawings, the same reference numerals are used to refer to the same parts. Wherein:

[0025] Figure 1 is a flowchart of the overall process of the method of the present invention;

[0026] Figure 2 is a schematic diagram of the dual-channel deep encoder structure;

[0027] Figure 3 is a schematic diagram of the first network structure;

[0028] Figure 4 is a schematic diagram of the second network structure.

Detailed Implementation

[0029] It is readily understood that, based on the technical solution of this invention, those skilled in the art can propose various interchangeable structural methods and implementations without altering the essential spirit of the invention. Therefore, the following detailed embodiments and accompanying drawings are merely illustrative examples of the technical solution of this invention and should not be considered as the entirety of the invention or as limitations or restrictions on the technical solution of this invention.

[0030] In existing technologies, current fusion methods typically involve simple concatenation or weighted summation of image and temporal features, with weights often fixed or implicitly learned through attention mechanisms. When the data quality of a particular modality is low (e.g., image segmentation failure, extremely sparse temporal sampling), low-quality features still participate in the decision-making process with the same weight, becoming a source of noise. The lack of an explicit confidence gating mechanism prevents the dynamic adjustment of each modality's contribution to the fusion process based on its reliability, leading to a significant performance degradation when some modalities fail.

[0031] Prototype-based matching or metric learning methods typically calculate the global similarity between query features and prototypes across all categories, assigning equal importance to all feature dimensions. However, in pathogen identification, different categories often exhibit only subtle differences in a few key biomarkers or image texture dimensions. Existing methods lack mechanisms to automatically identify and focus on these "discriminative dimensions," resulting in limited accuracy in distinguishing easily confused category pairs (such as different species within the same genus), and the model is prone to misclassification between highly similar categories.

[0032] Existing models use the same computational graph for calculations regardless of the simplicity or complexity of the input samples, failing to achieve adaptive allocation of computational resources. For samples with clear and easily identifiable features, this "one-size-fits-all" approach leads to unnecessary computational waste; while for samples with complex features and competition, the accuracy may be affected by insufficient depth of a single analysis. Furthermore, there is a lack of a cascading mechanism to dynamically trigger refinement analyses based on the inherent complexity of the samples.

[0033] To address the aforementioned issues, this application proposes a deep learning dynamic network to tackle the challenges of distinguishing mixed infections and the rigidity of single models in handling uncertainty in clinical pathogen identification. The core concept is to treat the network not as a static classifier, but as a dynamic network with hypothesis-testing-correction capabilities. First, a master hypothesis with quantified uncertainty (residual) is generated through prototype-contrast attention modulation. Then, an independent saliency evaluation network determines whether this uncertainty is sufficient to undermine the master hypothesis. If so, a feedback-injected reanalysis loop is initiated, guiding the network to focus on key dimensions and candidate categories. Finally, based on the reanalysis results, an adaptive decision is made among "confirming the master hypothesis," "determining mixed infection," and "outputting a corrected category." This approach achieves a paradigm shift from "single-stage classification" to "iterative reasoning," improving the robustness of identification and the interpretability of decisions in complex infection scenarios.

[0034] After introducing the basic concept of the present invention, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0035] As shown in Figure 1, a method for constructing a dynamic pathogen identification network based on time-series images and clinical data includes:

[0036] S1. Acquire clinical data and 3D images, preprocess the clinical data, and obtain a standardized time series matrix;

[0037] Specifically, qualified 3D images and clinical data are obtained from publicly available datasets and hospital information systems through extraction, integration, de-identification, and transfer. The time nodes of the original irregularly sampled multivariate clinical time series are then aligned to obtain the standardized time series matrix X.

[0038] Based on the clinical application scenario, T key time points are predefined, denoted as an ordered set $\tau = \{\tau_1, \tau_2, \ldots, \tau_T\}$, where T is the total number of key time points and $\tau_t$ (t=1,...,T) is the time value of the t-th time node. In infection monitoring scenarios measured in days, a typical choice is $\tau = \{1, 3, 7, 14, 21\}$ days, corresponding to T=5. The number of key time nodes T and the value of each node are predetermined by the application scenario and remain the same during training and inference.

[0039] Let the original irregularly sampled time-series observation matrix be $X^{\mathrm{raw}}$, where $X^{\mathrm{raw}}[a,k]$ is the measured value of the k-th clinical indicator at the a-th sampling time, k∈{1,...,K} is the index of the clinical indicator, and K is the total number of clinical indicators.

[0040] The raw observation matrix $X^{\mathrm{raw}}$ is mapped onto the aforementioned fixed time nodes by linear interpolation. For the time node $\tau_t$, the interpolated value of the k-th indicator is:

[0041] $X[t,k] = X^{\mathrm{raw}}[a,k] + \dfrac{\tau_t - t_a}{t_b - t_a}\,\bigl(X^{\mathrm{raw}}[b,k] - X^{\mathrm{raw}}[a,k]\bigr)$;

[0042] where $t_a$ and $t_b$ are the adjacent sampling times immediately before and after $\tau_t$ in the original sampling sequence, satisfying $t_a \le \tau_t \le t_b$; a and b are the time-step indices of $t_a$ and $t_b$ in the original sampling sequence; and $X^{\mathrm{raw}}[a,k]$ and $X^{\mathrm{raw}}[b,k]$ are the measured values of the k-th indicator at the corresponding times.
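The alignment step can be illustrated with a minimal sketch in plain Python. The function name `interpolate_to_nodes` and its arguments are illustrative and not part of the application. The handling of time nodes that fall outside the sampled range (holding the first or last observation) is an assumption, since the text only describes interpolation between two neighbouring samples.

```python
from bisect import bisect_left


def interpolate_to_nodes(sample_times, raw, nodes):
    """Linearly interpolate irregular observations onto fixed time nodes.

    sample_times: strictly ascending sampling times, one per row of `raw`.
    raw: list of rows; raw[a][k] is indicator k measured at sample_times[a].
    nodes: the predefined key time points.
    Returns one row of K interpolated values per time node.
    """
    aligned = []
    for node in nodes:
        # Index of the first sampling time that is >= node.
        b = bisect_left(sample_times, node)
        if b == 0:
            # Assumption: hold the first observation for nodes before it.
            aligned.append(list(raw[0]))
            continue
        if b == len(sample_times):
            # Assumption: hold the last observation for nodes after it.
            aligned.append(list(raw[-1]))
            continue
        a = b - 1
        t_a, t_b = sample_times[a], sample_times[b]
        w = (node - t_a) / (t_b - t_a)
        aligned.append([x_a + w * (x_b - x_a) for x_a, x_b in zip(raw[a], raw[b])])
    return aligned


# Two indicators sampled on days 0, 2, 6 and 20, aligned to the typical nodes.
X = interpolate_to_nodes(
    [0, 2, 6, 20],
    [[10.0, 1.0], [20.0, 2.0], [40.0, 4.0], [12.0, 18.0]],
    [1, 3, 7, 14, 21],
)
```

Each row of `X` corresponds to one key time node, so the result has the fixed shape T×K regardless of how many raw samples were available.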

[0043] When the number of available time-series observations varies significantly among different samples, the following adaptive strategy is adopted according to the level of the actual number of available time nodes T:

[0044] When T≥4, the complete calculation process is executed and all modules work normally;

[0045] When T=2 or T=3, the gated recurrent unit in step S2 is unrolled for the actual T steps. The temporal encoding vector $f_{\mathrm{seq}}$ carries limited but still usable information, and the logic of subsequent modules remains unchanged;

[0046] When T=1, the gated recurrent unit degenerates into a linear mapping of the input vector at a single time step, and the amount of information is extremely limited. In this case $f_{\mathrm{seq}}$ is set to the zero vector (of dimension $d_h$; see step S2.2 for the definition), so that the fused feature vector F output by the confidence-gated fusion in step S2 is driven entirely by image features. The network completes recognition through the image encoding path, and the confidence gating mechanism adapts to this degraded mode automatically during training.

[0047] S2. Construct a dual-channel deep encoder that takes the 3D image and the standardized time series matrix as input and outputs a fused feature vector. The encoder includes a pre-trained 3D convolutional neural network, a single-layer unidirectional gated recurrent unit, and confidence gating.

[0048] Specifically, as shown in Figure 2, this step constructs a three-dimensional convolutional neural network path and a recurrent neural network path, and outputs a fused feature vector F after confidence-gated fusion.

[0049] S2.1 Image Coding Path

[0050] The input is a 3D image tensor I with shape $N_x \times N_y \times N_z$, where $N_x$, $N_y$, and $N_z$ are the numbers of voxels along the three spatial dimensions of the image.

[0051] Pre-training of a 3D fully convolutional segmentation network (based on the 3D U-Net architecture):

[0052] Training objective: To learn to segment the target lesion region and key sub-regions from 3D images;

[0053] Training data: Independent 3D image datasets;

[0054] Training method: Use a combination of Dice loss and cross-entropy loss to train independently on supervised segmentation tasks until convergence;

[0055] Purpose of use: After training, the weights are fixed, and the model is used to generate target masks and calculate segmentation confidence in the main model.

[0056] The pre-trained 3D fully convolutional segmentation network (based on the 3D U-Net architecture) segments the input image I and outputs a target mask $M_{\mathrm{tgt}}$ (same shape as I, elements 0 or 1, obtained by binarizing the output of the segmentation network's final Sigmoid layer with a threshold of 0.5) and a key sub-region mask $M_{\mathrm{key}}$ (same shape as I, elements 0 or 1, obtained by binarizing the corresponding output channel of the segmentation network), where Sigmoid is the activation function.

[0057] The segmentation confidence scalar c is the mean of the probability map $S$ output by the final Sigmoid layer of the segmentation network, taken over all target voxels covered by the target mask $M_{\mathrm{tgt}}$:

[0058] $c = \mathrm{mean}\bigl(\{\, S(x,y,z) \mid M_{\mathrm{tgt}}(x,y,z) = 1 \,\}\bigr)$;

[0059] where $S(x,y,z)$ is the value of the segmentation probability map at voxel coordinates (x,y,z), mean({·}) is the arithmetic mean of the elements of the set, and c∈(0,1) reflects the overall reliability of the segmentation result.
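A minimal sketch of the confidence calculation, using nested Python lists in place of a 3D tensor. The function name `segmentation_confidence` is illustrative. Two details are assumptions rather than statements from the text: a voxel is treated as foreground when its probability is strictly greater than the threshold, and an empty mask yields c = 0.0.

```python
def segmentation_confidence(prob_map, threshold=0.5):
    """Binarize a probability map and average the probabilities it covers.

    prob_map: nested lists indexed [x][y][z] holding Sigmoid outputs.
    Returns (mask, c), where mask has the same nesting with 0/1 entries and
    c is the mean probability over voxels whose mask entry is 1.
    """
    mask = [
        [[1 if p > threshold else 0 for p in row] for row in plane]
        for plane in prob_map
    ]
    covered = [p for plane in prob_map for row in plane for p in row if p > threshold]
    # Assumption: with no foreground voxel there is nothing to trust, so c = 0.0.
    c = sum(covered) / len(covered) if covered else 0.0
    return mask, c


prob_map = [[[0.9, 0.2], [0.7, 0.4]], [[0.1, 0.8], [0.3, 0.6]]]
mask, c = segmentation_confidence(prob_map)
```

In this toy volume four voxels exceed the threshold, and c is the average of their four probabilities.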

[0060] 3D residual convolutional neural network (based on 3D ResNet architecture) pre-training:

[0061] Training objective: To learn a general representation of 3D visual features;

[0062] Training method: Supervised classification pre-training or self-supervised pre-training on large-scale 3D image datasets;

[0063] Purpose of use: After training, fix the weights and use it as a feature extractor in the main model.

[0064] The pre-trained 3D residual convolutional neural network (based on the 3D ResNet architecture) extracts multi-level image features. Feature maps are taken from the shallow layers (1st-2nd residual blocks, capturing texture and edge details), the middle layers (3rd-4th residual blocks, capturing local morphological structure), and the deep layers (5th residual block through the global average pooling layer, capturing overall semantic information). The feature maps from each level are compressed to a fixed spatial size of 2×2×2 by adaptive 3D average pooling, flattened, and concatenated. The result is passed through a fully connected network (output dimension $d_{\mathrm{img}}$=512) for fusion and dimensionality reduction, yielding the image feature vector $f_{\mathrm{img}}$ of dimension $d_{\mathrm{img}}$=512.

[0065] S2.2 Timing Coding Path

[0066] A single-layer unidirectional gated recurrent unit (GRU, hidden state dimension $d_h$=64) encodes the standardized time series matrix X in time-step order:

[0067] $h_t = \mathrm{GRU}\bigl(X[t,:],\, h_{t-1};\, \theta_{\mathrm{GRU}}\bigr)$, $t = 1, \ldots, T$;

[0068] where X[t,:] is the t-th row of the standardized time series matrix X, that is, the vector of observed values (dimension K) of the K clinical indicators at the t-th time node; $h_{t-1}$ is the hidden state vector from the previous time step; $h_0$ is the $d_h$=64-dimensional zero vector used as the initial state; and $\theta_{\mathrm{GRU}}$ denotes all learnable parameters of the GRU. The hidden state at the last time step is taken as the temporal encoding vector: $f_{\mathrm{seq}} = h_T$, with dimension $d_h$=64. When T=1, according to the data density adaptive strategy in step S1, $f_{\mathrm{seq}} = 0$ (the $d_h$=64-dimensional zero vector).

[0069] S2.3 Confidence-Gated Fusion

[0070] As shown in Figure 2, the image feature vector $f_{\mathrm{img}}$ is gated (scaled) by the segmentation confidence scalar c, concatenated with the temporal encoding vector $f_{\mathrm{seq}}$, and compressed into the fused feature vector F by a two-layer fully connected network:

[0071] $z = \mathrm{Concat}\bigl(c \cdot f_{\mathrm{img}},\, f_{\mathrm{seq}}\bigr)$, $F = \mathrm{MLP}_{\mathrm{fuse}}\bigl(z;\, \theta_{\mathrm{fuse}}\bigr)$;

[0072] where $c \cdot f_{\mathrm{img}}$ denotes multiplication of the vector $f_{\mathrm{img}}$ by the scalar c (each component of $f_{\mathrm{img}}$ is multiplied by c); Concat(·,·) denotes vector concatenation, giving a concatenated vector of dimension $d_{\mathrm{img}} + d_h$ = 512+64 = 576; $\mathrm{MLP}_{\mathrm{fuse}}$ is a two-layer fully connected network with the structure 576 → 256 (ReLU activation) → $d_F$ (no activation function), where $d_F$=128 is the dimension of the fused feature vector; and $\theta_{\mathrm{fuse}}$ denotes all learnable parameters of $\mathrm{MLP}_{\mathrm{fuse}}$.

[0073] Technical effect of confidence gating: when image segmentation is reliable (c close to 1), $f_{\mathrm{img}}$ participates in fusion normally and F encodes both image and temporal information. When image segmentation is unreliable (c close to 0), $f_{\mathrm{img}}$ is strongly suppressed and F degenerates automatically into a representation dominated by temporal information, which keeps low-quality image features from interfering with subsequent decomposition and judgment. This mechanism complements the strategy of zeroing $f_{\mathrm{seq}}$ at T=1: when time-series data is insufficient, the network relies on the image; when the image is unreliable, it relies on time-series data; and when both are available, they are jointly encoded.
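The gating and fusion step can be sketched in plain Python as follows. The names `linear`, `gated_fusion`, and `init_params` are illustrative, and the random weights produced by `init_params` stand in for trained parameters purely for demonstration. The layer sizes 576 → 256 → 128 and the ReLU on the hidden layer follow the description above.

```python
import random


def linear(weights, bias, x):
    """Affine map; `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gated_fusion(f_img, f_seq, c, params):
    """Scale image features by c, concatenate with f_seq, apply a two-layer MLP."""
    z = [c * v for v in f_img] + list(f_seq)
    hidden = [max(0.0, v) for v in linear(params["W1"], params["b1"], z)]  # ReLU
    return linear(params["W2"], params["b2"], hidden)  # no output activation


def init_params(d_in, d_hidden, d_out, seed=0):
    """Hypothetical random initialisation, used only to make the demo runnable."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(d_hidden, d_in), "b1": [0.0] * d_hidden,
        "W2": matrix(d_out, d_hidden), "b2": [0.0] * d_out,
    }


params = init_params(512 + 64, 256, 128)
f_seq = [0.5] * 64
F = gated_fusion([1.0] * 512, f_seq, 0.9, params)
```

With c = 0 the image block of the concatenated vector is all zeros, so two different images produce the same F; this is the degradation to a time-series-driven representation described above.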

[0074] S3. Construct a first network, which is a multi-head cross-attention network that introduces a predefined category prototype matrix. For each category prototype vector in the category prototype matrix, the contrast difference vector between it and its nearest competing prototype is calculated and used to modulate the attention mechanism when attention scores are computed from the fused feature vector. A component matrix is generated from these scores, the main hypothesis category is extracted from the component matrix, and the decomposition residual and dimensional competition features are calculated.

[0075] As shown in Figure 3, this step decomposes the fused feature vector F into a sparse superposition of category-specific components, one per pathogenic bacteria category. Its core is an inter-prototype contrast modulation mechanism, which lets the attention calculation focus automatically on the discriminative dimensions that distinguish similar categories.

[0076] S3.1 Category Prototype Matrix

[0077] Define a globally learnable category prototype matrix P with shape $S \times d_p$, where $S$ is the total number of pathogenic bacteria categories (a known hyperparameter) and $d_p$=64 is the dimension of the prototype space. The s-th row vector p(s) of P (s=1,...,S) is the category prototype vector of the s-th pathogenic bacteria category, with dimension $d_p$=64. The category prototype matrix P is learned end-to-end through backpropagation during training and is fixed during inference.

[0078] S3.2 Feature Segmentation

[0079] The fused feature vector F is divided evenly into $H$ segments, where $H$=8 is the number of heads in multi-head attention and each segment has dimension $d_k = d_F / H$ = 128/8 = 16, giving $H$ query segments:

[0080] $q_h = F[(h-1)\,d_k + 1 : h\,d_k]$, $h = 1, \ldots, H$;

[0081] where F[a:b] denotes the subvector formed by the a-th through b-th components of F, and $q_h$ is the h-th query segment, with dimension $d_k$=16.
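As a small illustration of the segmentation, the following sketch splits a 128-dimensional vector into 8 query segments of 16 components each. The function name `split_into_heads` is illustrative; Python slices are 0-based and half-open, so segment h (counting from 0) covers positions h·16 to h·16+15.

```python
def split_into_heads(F, num_heads=8):
    """Split F into `num_heads` contiguous query segments of equal length."""
    seg_len, remainder = divmod(len(F), num_heads)
    if remainder:
        raise ValueError("len(F) must be divisible by num_heads")
    return [F[h * seg_len:(h + 1) * seg_len] for h in range(num_heads)]


segments = split_into_heads(list(range(128)), num_heads=8)
```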

[0082] S3.3 Prototype Projection

[0083] For each category prototype vector p(s), the key vector and value vector are generated using the following linear projection matrices:

[0084] Key projection (globally shared): $\kappa(s) = W_K\, p(s)$;

[0085] where $W_K \in \mathbb{R}^{d_k \times d_p}$ is the globally shared key projection matrix (learnable parameter) and κ(s) is the key vector of the category prototype vector p(s), with dimension $d_k$=16.

[0086] Value projection (independent per head):

[0087] $v_h(s) = W_V^{(h)}\, p(s)$, $h = 1, \ldots, H$;

[0088] where $W_V^{(h)} \in \mathbb{R}^{d_k \times d_p}$ is the value projection matrix of the h-th head (learnable parameters), with independent parameters for each head, and $v_h(s)$ is the value vector of the category prototype vector p(s) on the h-th head, with dimension $d_k$=16, matching the dimension of the query segment $q_h$.

[0089] S3.4 Prototype Inter-contrast Modulation Mechanism

[0090] For each category prototype vector p(s), determine its nearest competing category prototype vector p(s*) in the category prototype matrix P:

[0091] $s^{*} = \arg\max_{j \neq s} \cos\bigl(p(s), p(j)\bigr)$, $\cos(u, v) = \dfrac{u^{\top} v}{\lVert u \rVert_2\, \lVert v \rVert_2}$;

[0092] where cos(·,·) is the cosine similarity, $s^{*}$ is the index of the other category whose prototype has the highest cosine similarity to p(s), and ‖·‖2 is the L2 norm (Euclidean norm) of a vector. When several categories have the same cosine similarity to p(s), the smallest category index is selected.
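The nearest-competitor search, including the tie-breaking rule, can be sketched as below. The function names are illustrative, indices are 0-based, and the prototypes are assumed to be non-zero vectors so that the cosine similarity is defined.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_competitors(prototypes):
    """For each row s, return the index of the most similar other row.

    Ties are resolved in favour of the smaller index, because a later
    candidate replaces the current best only if it is strictly more similar.
    """
    result = []
    for s, p_s in enumerate(prototypes):
        best_j, best_sim = None, -math.inf
        for j, p_j in enumerate(prototypes):
            if j == s:
                continue
            sim = cosine(p_s, p_j)
            if sim > best_sim:
                best_j, best_sim = j, sim
        result.append(best_j)
    return result


neighbours = nearest_competitors([[1, 0], [0, 1], [1, 0.1], [-1, 0]])
```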

[0093] To ensure training stability, an exponential moving average copy $\bar{P}$ of the category prototype matrix is maintained, with the following update rule:

[0094] $\bar{P} \leftarrow \beta\, \bar{P} + (1 - \beta)\, P$;

[0095] where $\beta$=0.999 is the decay coefficient and P is the category prototype matrix at the current training step. The nearest-neighbor calculation is performed on $\bar{P}$ (rather than on the current P), and the nearest-neighbor relationship is recalculated once every $E$ training epochs (default $E$=5) and kept fixed between two recalculations, so that individual parameter updates do not cause abrupt changes in the nearest-neighbor relationship.

[0096] Calculate the contrast vector: $\delta(s) = p(s) - p(s^{*})$, where δ(s) is a $d_p$=64-dimensional vector reflecting the feature direction in which the category prototype vector p(s) differs from its nearest competing category prototype vector p(s*).
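A minimal sketch of the bookkeeping around the prototype matrix: the moving-average update, the refresh schedule for the nearest-neighbor relationship, and the contrast vectors. All function names are illustrative. Counting epochs from 0 and refreshing at epochs 0, 5, 10, ... is an assumption; the text only states that the relationship is recalculated every 5 epochs.

```python
def ema_update(ema, current, beta=0.999):
    """Return beta * ema + (1 - beta) * current, element by element."""
    return [
        [beta * e + (1.0 - beta) * p for e, p in zip(e_row, p_row)]
        for e_row, p_row in zip(ema, current)
    ]


def should_refresh(epoch, every=5):
    """Assumption: epochs are counted from 0 and epoch 0 triggers a refresh."""
    return epoch % every == 0


def contrast_vectors(prototypes, neighbours):
    """delta(s) = p(s) - p(neighbours[s]) for every category s."""
    return [
        [a - b for a, b in zip(prototypes[s], prototypes[j])]
        for s, j in enumerate(neighbours)
    ]


smoothed = ema_update([[1.0, 0.0]], [[0.0, 1.0]], beta=0.9)
deltas = contrast_vectors([[1, 0], [0, 1]], [1, 0])
```

With the default β=0.999, a single training step moves the averaged copy by only 0.1% of the gap to the current prototypes, which is what keeps the neighbor assignment stable.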

[0097] A dimension modulation mask is generated by linearly projecting δ(s) and applying a Sigmoid activation:

[0098] $m(s) = \mathrm{Sigmoid}\bigl(W_m\, \delta(s) + b_m\bigr)$;

[0099] where $W_m \in \mathbb{R}^{d_k \times d_p}$ is the contrast modulation projection matrix (learnable parameters), $b_m$ is the bias vector (learnable parameters), and m(s) is the dimension modulation mask, with dimension $d_k$=16.

[0100] When the query segment $q_h$ computes its attention score for the key of the category prototype vector p(s), the key vector is modulated by m(s):

[0101] $a(h,s) = \dfrac{q_h^{\top}\,\bigl(\kappa(s) \odot m(s)\bigr)}{\sqrt{d_k}}$;

[0102] where $q_h$ is the h-th query segment ($d_k$=16-dimensional), κ(s) is the key vector of the category prototype vector p(s) ($d_k$=16-dimensional), ⊙ denotes element-wise multiplication, κ(s)⊙m(s) multiplies each component of the key vector by the corresponding component of the modulation mask, $q_h^{\top}$·(κ(s)⊙m(s)) is the inner product of two $d_k$-dimensional vectors (a scalar), and $\sqrt{d_k}$=4 is the scaling factor that keeps the numerical range of the attention scores from growing with the dimension.

[0103] Technical effect of inter-prototype contrast modulation: the d-th component of m(s) reflects how strongly the category prototype vector p(s) differs from its nearest competing category prototype vector p(s*) along the d-th projection dimension, where d is the feature dimension index, d=1,...,$d_k$. For dimensions with large differences, m(s)[d] approaches 1 after the Sigmoid and the attention score contribution is preserved; these are the key discriminative dimensions that distinguish s from s*. For dimensions with small differences, m(s)[d] is close to 0 and the attention score contribution is suppressed; these dimensions lack discriminative power. This mechanism enables the first network to improve decomposition accuracy specifically among the most easily confused pathogenic bacteria categories, and the values of the modulation mask m(s) are learned automatically through backpropagation during training, without relying on prior human knowledge.
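The mask generation and the modulated score can be sketched in plain Python. The names `modulation_mask` and `modulated_score` are illustrative, and the small hand-written weights exist only to make the behaviour visible; the actual projection is learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modulation_mask(delta, W_m, b_m):
    """m = Sigmoid(W_m @ delta + b_m); W_m holds one row per mask component."""
    return [
        sigmoid(sum(w * d for w, d in zip(row, delta)) + b)
        for row, b in zip(W_m, b_m)
    ]


def modulated_score(q_h, kappa, mask):
    """Scaled inner product of q_h with the element-wise masked key vector."""
    scale = math.sqrt(len(q_h))
    return sum(q * k * m for q, k, m in zip(q_h, kappa, mask)) / scale


# Toy projection: identity weights, zero bias, two mask components.
mask = modulation_mask([2.0, -2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

# Four-dimensional toy query and key; a 0/1 mask drops two of the dimensions.
full = modulated_score([1.0] * 4, [2.0] * 4, [1.0, 1.0, 1.0, 1.0])
half = modulated_score([1.0] * 4, [2.0] * 4, [1.0, 0.0, 1.0, 0.0])
```

Masking out half of the key dimensions halves the score in this toy case, which shows how a mask component near 0 removes a dimension's contribution to the attention score.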

[0104] S3.5 Attention Calculation and Component Matrix Generation

[0105] For each query segment $q_h$ (h=1,...,H), calculate its attention weights over the category prototype vectors of all $S$ categories:

[0106] $\alpha(h,s) = \dfrac{\exp\bigl(a(h,s)\bigr)}{\sum_{j=1}^{S} \exp\bigl(a(h,j)\bigr)}$;

[0107] where a(h,s) is the modulated attention score from step S3.4, the denominator is the Softmax normalization term summed over all $S$ categories with j as the category index, and α(h,s) is the attention weight of the h-th query segment on the category prototype vector p(s).

[0108] For each head h and each category s, the value vector is scaled by the attention weight to obtain the component contribution of category s on the h-th head:

[0109] $u_h(s) = \alpha(h,s) \cdot v_h(s)$;

[0110] where $v_h(s)$ is the value vector of the category prototype vector p(s) on the h-th head ($d_k$=16-dimensional), α(h,s) is the scalar attention weight, and $u_h(s)$ is a $d_k$=16-dimensional vector.

[0111] The $H$ component contributions are concatenated segment by segment to restore a $d_F$-dimensional vector:

[0112] $C[s,:] = \mathrm{Concat}\bigl(u_1(s), u_2(s), \ldots, u_H(s)\bigr)$;

[0113] where C[s,:] is a $d_F = H \cdot d_k$ = 8×16 = 128-dimensional vector representing the contribution of pathogenic bacteria category s to each dimension of the fused feature vector F. The component vectors of all $S$ categories are stacked row by row to form the component matrix C, with shape $S \times d_F$.
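The normalisation and assembly of step S3.5 can be sketched as follows, taking precomputed attention scores and value vectors as input. The function names and the nested-list layout (`scores[h][s]`, `values[h][s]`) are illustrative choices, and the max-subtraction inside `softmax` is a standard numerical-stability device that does not change the result.

```python
import math


def softmax(scores):
    """Softmax over a list of scores, shifted by the maximum for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def component_matrix(scores, values):
    """Assemble the component matrix from per-head scores and value vectors.

    scores[h][s]: attention score of head h for category s.
    values[h][s]: value vector of category s on head h.
    Returns one row per category; each row concatenates the weighted value
    vectors of all heads in head order.
    """
    num_heads = len(scores)
    num_classes = len(scores[0])
    alpha = [softmax(scores[h]) for h in range(num_heads)]
    matrix = []
    for s in range(num_classes):
        row = []
        for h in range(num_heads):
            row.extend(alpha[h][s] * v for v in values[h][s])
        matrix.append(row)
    return matrix


# Two heads, two categories, two components per head.
scores = [[0.0, 0.0], [math.log(3.0), 0.0]]
values = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
C = component_matrix(scores, values)
```

Here the first head weights both categories equally (0.5 each) and the second head weights them 0.75 and 0.25, so each row of `C` mixes the two heads' value vectors with different strengths.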

[0114] S3.6 Master Hypothesis Generation and Competitive Feature Extraction

[0115] Calculate the energy (L2 norm) of each category component: $e(s) = \lVert C[s,:] \rVert_2$, where C[s,:] is the s-th row of the component matrix C (a $d_F$=128-dimensional vector) and e(s) is a scalar reflecting the contribution strength of the category-s component in the fused feature vector F.

[0116] The main hypothesis category is the category with the highest energy: $s_1 = \arg\max_{s} e(s)$. In the extreme case where several categories have exactly equal energy, the smallest category index is selected.

[0117] The competing category is the category with the second-highest energy: $s_2 = \arg\max_{s \neq s_1} e(s)$. Ties are likewise resolved by taking the smallest category index.

[0118] Calculate the decomposition residual: $R = F - C[s_1,:]$, where F and $C[s_1,:]$ are both $d_F$=128-dimensional vectors. R is a $d_F$=128-dimensional vector representing the residual signal in the fused feature vector F that the component of the main hypothesis category $s_1$ cannot explain.

[0119] Calculate the dimensional competition features. To keep the gradient continuous, a smooth approximation is used in place of the element-wise absolute value:

[0120] $G[d] = \sqrt{\bigl(C[s_1,d] - C[s_2,d]\bigr)^2 + \epsilon}$, $d = 1, \ldots, d_F$;

[0121] where $C[s_1,d]$ and $C[s_2,d]$ are the d-th components of rows $s_1$ and $s_2$ of the component matrix C, and $\epsilon = 1\times10^{-8}$ is a numerical stability constant that keeps the quantity under the square root positive. G is a $d_F$=128-dimensional vector whose d-th component reflects the magnitude of the difference between the main hypothesis category $s_1$ and the competing category $s_2$ in their decomposition of F along the d-th dimension.
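Step S3.6 can be sketched in a few lines of plain Python. The function name `analyse_components` and the returned tuple layout are illustrative, and indices are 0-based. Sorting by descending energy and then by index implements the rule that ties go to the smaller category number.

```python
import math


def analyse_components(F, C, eps=1e-8):
    """Return (s1, s2, residual, competition, energies) for a component matrix.

    F: fused feature vector.
    C: component matrix with one row per category, each the length of F.
    """
    energies = [math.sqrt(sum(v * v for v in row)) for row in C]
    # Highest energy first; equal energies are ordered by smaller index.
    order = sorted(range(len(C)), key=lambda s: (-energies[s], s))
    s1, s2 = order[0], order[1]
    residual = [f - c for f, c in zip(F, C[s1])]
    # Smooth stand-in for |C[s1][d] - C[s2][d]| that stays differentiable at 0.
    competition = [math.sqrt((a - b) ** 2 + eps) for a, b in zip(C[s1], C[s2])]
    return s1, s2, residual, competition, energies


F = [1.0, 2.0, 3.0]
C = [[0.0, 1.0, 0.0], [1.0, 2.0, 2.0], [3.0, 0.0, 0.0]]
s1, s2, residual, competition, energies = analyse_components(F, C)
```

In this toy input, rows 1 and 2 both have energy 3, so the tie rule makes row 1 the main hypothesis and row 2 the competing category.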

[0122] First-round output of step S3: the main hypothesis category $s_1$; the decomposition residual R ($d_F$=128 dimensions); the dimensional competition features G ($d_F$=128 dimensions); and the component matrix C (shape $S \times d_F$, retained for the hypothesis correction path in step S5).

[0123] In backpropagation, the selection of $s_1$ is approximated with a straight-through estimator: forward propagation uses the hard selection result of argmax, while backpropagation passes the gradient back along the soft-selection path $\mathrm{softmax}(e / \tau_s)$, where $\tau_s$ is a temperature parameter that is linearly annealed from 1.0 to 0.1.

[0124] S4. Construct a second network based on a multilayer perceptron. This network generates residual significance scores, dimension-guided vectors, and class affinity vectors by competing features between the input decomposition residuals and dimensions, and then makes a determination.

[0125] When the residual significance score is lower than the first preset threshold, the main hypothesis category is output.

[0126] When the residual significance score is not lower than the first preset threshold, the decomposed residual is used as the fusion feature vector, and the dimension-guided vector and category affinity vector are injected as feedback signals into the attention score calculation process to generate the second component matrix and determine:

[0127] (1) When there are category rows in the second component matrix whose energy exceeds the second preset threshold, output the result of dual-category mixed infection;

[0128] (2) When no category row in the second component matrix has energy exceeding the second preset threshold, the corrected category is output based on a comprehensive evidence comparison of the two largest-energy rows in the component matrix;

[0129] As shown in Figure 4, this step jointly analyzes the decomposition residual and the competition features and outputs a structured feedback signal. When the residual is significant, the feedback is injected into S3 to perform a conditional second round of refined decomposition, forming a deep learning closed loop of decomposition-evaluation-guidance-re-decomposition.

[0130] S4.1 Joint Encoding of Residuals and Competitive Features

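[0131] $z = [R; D]$, where $R$ (the decomposition residual) and $D$ (the dimensional competition features) are both 128-dimensional vectors; after concatenation, $z$ has dimension 2 × 128 = 256.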

[0132] The shared backbone of the second network is a two-layer fully connected network:

[0133] , ;

[0134] Here the first layer of the shared backbone has a learnable weight matrix and a learnable bias vector and outputs an intermediate hidden representation of dimension 128; the second layer likewise has a learnable weight matrix and bias vector and outputs a 64-dimensional residual abstract representation vector. After the shared backbone, the network splits into three independent output heads.

[0135] S4.2 Residual Significance Head

[0136] After a fully connected layer and sigmoid activation, the residual significance score is output: $\rho_1 = \sigma(w_\rho^{T} h_2 + b_\rho)$, where $h_2$ is the 64-dimensional output of the shared backbone, $w_\rho$ is a weight vector (learnable parameter), $b_\rho$ is a scalar bias (learnable parameter), and $w_\rho^{T} h_2$ is the inner product of two 64-dimensional vectors (the result is a scalar). $\rho_1$ is a scalar with a value range of (0,1). A value close to 0 indicates that there is no structured signal in the residual; a value close to 1 indicates that the residual contains significant signal components that have not yet been explained.

[0137] S4.3 Dimension Guidance Head

[0138] After a fully connected layer and sigmoid activation, the dimension-guided vector is output: $g = \sigma(W_g h_2 + b_g)$, where $h_2$ is the 64-dimensional output of the shared backbone, $W_g$ is the weight matrix (learnable parameter), and $b_g$ is the bias vector (learnable parameter). $g$ is a 128-dimensional vector with each component in the range (0,1). A value of $g[d]$ close to 1 indicates that the $d$-th dimension of the residual contains meaningful signal, and a value close to 0 indicates that the dimension is mainly noise.

[0139] S4.4 Category Affinity Head

[0140] Through a fully connected layer and Softmax activation, the category affinity vector is output: $a = \mathrm{softmax}(W_a h_2 + b_a)$, where $h_2$ is the 64-dimensional output of the shared backbone, $W_a$ is the weight matrix (learnable parameter), and $b_a$ is the bias vector (learnable parameter). $a$ has one component per pathogen category; its components sum to 1 and each lies in the range (0,1). $a[s]$ represents the second network's prior probability estimate that the residual signal belongs to pathogenic bacteria category $s$.

[0141] The three output heads carry signals of different kinds that complement one another. The residual significance score answers the question "Do we need to continue decomposing?" (scalar trigger judgment). The dimension-guided vector answers the question "In which dimensions are the residual signals concentrated?" (dimensional guidance). The category affinity vector answers the question "Which categories are the residual signals most likely to belong to?" (category-level guidance). Together, these three outputs constitute the structured feedback from the second network to the first network.
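The shared backbone and three heads described in S4.1 to S4.4 can be sketched as a plain NumPy forward pass. The layer sizes (256 to 128 to 64, then a scalar head, a 128-dimensional head, and a per-category head) follow the text. The ReLU hidden activation, the random initialization, and all parameter names are assumptions made for illustration, since the source does not reproduce them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def init_params(rng, n_classes, d=128):
    """Randomly initialised parameters (hypothetical names and scale)."""
    shapes = {"W1": (128, 2 * d), "b1": (128,), "W2": (64, 128), "b2": (64,),
              "w_rho": (64,), "b_rho": (), "W_g": (d, 64), "b_g": (d,),
              "W_a": (n_classes, 64), "b_a": (n_classes,)}
    params = {}
    for name, shape in shapes.items():
        if name[0] in "Ww":
            params[name] = rng.normal(0.0, 0.05, size=shape)
        else:
            params[name] = np.zeros(shape)
    return params

def second_network(R, D, p):
    z = np.concatenate([R, D])                        # 256-dim joint encoding
    h1 = np.maximum(0.0, p["W1"] @ z + p["b1"])       # ReLU is an assumption
    h2 = np.maximum(0.0, p["W2"] @ h1 + p["b2"])      # 64-dim representation
    rho = float(sigmoid(p["w_rho"] @ h2 + p["b_rho"]))  # significance score
    g = sigmoid(p["W_g"] @ h2 + p["b_g"])             # dimension-guided vector
    a = softmax(p["W_a"] @ h2 + p["b_a"])             # category affinity vector
    return rho, g, a

rng = np.random.default_rng(0)
params = init_params(rng, n_classes=5)
rho, g, a = second_network(rng.normal(size=128), rng.normal(size=128), params)
```

The three heads share `h2`, so a gradient on any one of them also shapes the representation used by the other two.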

[0142] S4.5 Cascade Trigger Detection

[0143] The first preset threshold is selected from the range [0.3, 0.7]; its final value is determined from validation-set performance after training is completed.

[0144] When the residual significance score ρ1 is not lower than the first preset threshold, the residual is determined to be significant and the second round of refinement decomposition is triggered (step S4.6). When ρ1 is lower than the threshold, the residual is deemed insignificant and the method proceeds directly to step S5 to output a single-class result. During the training phase, to ensure that all learnable parameters of the second-round decomposition path receive gradient updates, the second round of refinement decomposition in step S4.6 is always executed, regardless of how ρ1 compares with the threshold. The conditional triggering logic takes effect only during the inference phase.
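The trigger rule in S4.5 reduces to a small predicate. The sketch below uses a hypothetical function name and simply encodes the two statements in the text: the threshold lies in [0.3, 0.7], and the second round always runs during training.

```python
def run_second_round(rho1, threshold, training):
    """Decide whether the second-round refinement decomposition runs.

    rho1: residual significance score in (0, 1).
    threshold: first preset threshold, chosen from [0.3, 0.7].
    training: True during training, when the second round is always executed.
    """
    if not 0.3 <= threshold <= 0.7:
        raise ValueError("first preset threshold must lie in [0.3, 0.7]")
    return training or rho1 >= threshold
```

For example, `run_second_round(0.2, 0.5, training=False)` skips the second round, while the same score during training still executes it so that the second-round parameters receive gradients.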

[0145] S4.6 Feedback Injection and Second Round Refinement Decomposition of the Second Network

[0146] As shown in Figure 4, when the cascade triggering condition is met, the feedback signal output by the second network is injected into the first network, and a second round of conditional decomposition is performed on the residual. The second-round network structure shares all learnable parameters with the first round; however, its input and attention calculation accept the following modulations.

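[0147] Second-round input: the residual $R$ is evenly divided into 8 segments, resulting in the second-round query segments:

[0148] $q_2^{(h)} = \mathrm{LN}(R^{(h)})$, $h = 1, \dots, 8$, where $R^{(h)}$ is the $h$-th segment of $R$ and $\mathrm{LN}(\cdot)$ is the layer normalization operation;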

[0149] Dimensional modulation: the dimension-guided vector is segmented in the same way as the query segments, and when the second-round attention score of segment h is calculated, the corresponding segment of the dimension-guided vector is used as an additive bias:

[0150] ;

[0151] where the learnable scaling factor is a scalar initialized to 0.1. Segments with higher guidance values receive additional attention score gain, guiding the first network to focus on the dimensional regions of the residual where the second network perceives concentrated signal. The guidance segment is the h-th subvector obtained by segmenting the dimension-guided vector in the same way as the query segments.

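[0152] Category-level modulation: before Softmax normalization is applied to the second-round attention scores, the logarithm of the category affinity vector is added as a bias:

[0153] $\alpha_2(h,s) = \dfrac{\exp\big(e_2(h,s) + \lambda_a \log(a[s] + \epsilon)\big)}{\sum_{j} \exp\big(e_2(h,j) + \lambda_a \log(a[j] + \epsilon)\big)}$;

[0154] where $e_2(h,s)$ is the second-round attention score after dimensional modulation, $a$ is the category affinity vector, $\lambda_a$ is a learnable scaling factor (scalar, initialized to 1.0), $\epsilon = 1\times10^{-8}$ is a numerical stability constant that prevents logarithmic underflow, and $j$ is the category index. Categories with higher affinity receive additional attention weight gain; categories whose affinity is close to zero are effectively masked.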
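The category-level modulation can be sketched as follows. The example covers only the logarithmic affinity bias and the Softmax over categories; the dimensional bias is omitted because its exact formula is not reproduced in the text. The function name and the toy inputs are illustrative assumptions.

```python
import numpy as np

def category_modulated_weights(scores, affinity, lam=1.0, eps=1e-8):
    """Second-round attention weights with a log-affinity additive bias.

    scores: shape (n_heads, n_categories), pre-softmax attention scores.
    affinity: shape (n_categories,), category affinity vector summing to 1.
    """
    biased = scores + lam * np.log(affinity + eps)      # broadcast over heads
    biased = biased - biased.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(biased)
    return w / w.sum(axis=1, keepdims=True)              # softmax over categories

# With identical scores, the weights simply follow the affinity vector.
scores = np.zeros((8, 3))
affinity = np.array([0.7, 0.3, 0.0])
alpha2 = category_modulated_weights(scores, affinity)
```

A category with zero affinity receives a bias of about log(1e-8), roughly -18.4, which drives its attention weight to nearly zero without producing an infinite value.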

[0155] The second component matrix is generated in exactly the same way as in step S3.5, with the second-round quantities substituted for the first-round ones and α(h,s) replaced by α2(h,s). The output is the second component matrix, which has the same shape as the first-round component matrix.

[0156] Output of step S4: the residual significance score (scalar) and the second component matrix (in the inference phase, generated only when cascading is triggered).

[0157] S4.7 Final Decision Output

[0158] As shown in Figure 4, one of three determination results is output adaptively, based on the value of the residual significance score and the results of the two rounds of decomposition.

[0159] The energy threshold is defined as the second preset threshold (strictly positive) and is determined from validation-set performance after training.

[0160] Determination Path 1: Single Consistent Pathogen Category

[0161] Triggering condition: the residual significance score is lower than the first preset threshold;

[0162] The output pathogenic bacteria category is the master hypothesis category s1, and its confidence level is calculated as:

[0163] ;

[0164] Determination Path Two: Mixed Infection with Two Categories

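[0165] Triggering conditions: the residual significance score is not lower than the first preset threshold, and the second component matrix contains a category row whose energy exceeds the second preset threshold.

[0166] $E_2(s) = \lVert C_2[s,:] \rVert_2$, where $E_2(s)$ is the energy (L2 norm) of the category-$s$ row of the second component matrix $C_2$ and reflects the contribution intensity of category $s$ in the second round of decomposition;

[0167] Constraint: the energy of the selected second-round category is not lower than the second preset threshold. The output is the mixed infection result consisting of the master hypothesis category and the selected second-round category.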

[0168] Determination Path Three: Hypothesis Correction

[0169] Triggering conditions: the residual significance score is not lower than the first preset threshold, but no category row in the second component matrix reaches the second preset threshold;

[0170] Return to the first-round component matrix from step S3, take the category corresponding to the row with the second-largest energy as the candidate category s2, and calculate the comprehensive evidence of s1 and s2 separately:

[0171] ;

[0172] where the terms are the component energy of category s in the first-round component matrix and the component energy of category s in the second component matrix, and the two weighting coefficients are learnable positive coefficients (kept positive through a Softplus constraint).

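[0173] Final corrected category:

[0174] $\hat{s} = \arg\max_{s \in \{s_1, s_2\}} \mathrm{Ev}(s)$, where $\mathrm{Ev}(s)$ is the comprehensive evidence of category $s$;

[0175] In the event of a tie, the category with the smaller index is selected.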
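The three determination paths can be summarized in one small function. This is a sketch under stated assumptions: the second-round search excludes the master hypothesis row, in line with the application example that speaks of a non-master-hypothesis category; the comprehensive evidence values are supplied by the caller because the evidence formula is not reproduced in the text; and all names are hypothetical.

```python
import numpy as np

def final_decision(rho1, s1, s2, energy2, evidence, tau_rho, tau_energy):
    """Return (result_type, categories) for the three determination paths.

    energy2: second-round row energies, one per category.
    evidence: caller-supplied comprehensive evidence per category (assumed).
    """
    if rho1 < tau_rho:                       # Path 1: single consistent category
        return "single", (s1,)
    masked = np.array(energy2, dtype=float)
    masked[s1] = -np.inf                     # assumed: ignore the master row
    s_star = int(np.argmax(masked))
    if masked[s_star] >= tau_energy:         # Path 2: dual-category mixed infection
        return "mixed", (s1, s_star)
    candidates = sorted([s1, s2])            # Path 3: hypothesis correction
    best = max(candidates, key=lambda s: evidence[s])  # first max: smaller index on ties
    return "corrected", (best,)
```

Sorting the two candidates before calling `max` implements the tie rule, because `max` returns the first maximal element it encounters.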

[0176] S5. Use multi-task collaborative loss to train the dual-channel deep encoder, the first network, and the second network end-to-end until convergence.

[0177] S5.1 Acquisition and Augmentation of Multi-Label Training Samples

[0178] In the training data of this invention, single-label samples (single pathogenic infection) come from confirmed infection cases diagnosed by microbial culture; each sample carries one pathogenic bacteria category label y, y ∈ {1, ..., S}, where S is the number of pathogenic bacteria categories. Multi-label samples (mixed infection) carry two pathogenic bacteria category labels y1 and y2, y1 ≠ y2, and come from mixed infection cases diagnosed by microbial culture.

[0179] When the number of multi-label samples is limited, the following data augmentation strategy is adopted: two single-label samples of different classes are randomly selected from the training set, their fused feature vectors $F_a$ and $F_b$ are obtained, and a pseudo-multi-label training sample is synthesized by linear interpolation with a random mixing ratio μ:

[0180] $F_{mix} = \mu F_a + (1 - \mu) F_b$;

[0181] μ is randomly sampled from a uniform distribution on the interval (0.3, 0.7). The label of the synthesized sample is the pair of labels of the two source samples. The synthesis operation is performed in each training batch at a preset ratio of 0.3. The loss gradient of these feature-level mixed samples is used to update the parameters of the first network and the second network, while the encoder's ability to encode mixed infections is trained by the end-to-end gradient of real multi-label samples; the two paths are complementary.
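The feature-level synthesis step can be sketched in a few lines. The function name is hypothetical; the interval (0.3, 0.7) and the requirement that the two source samples belong to different classes come from the text.

```python
import numpy as np

def synthesize_mixed_sample(F_a, y_a, F_b, y_b, rng, low=0.3, high=0.7):
    """Interpolate two single-label fused vectors into a pseudo-multi-label sample."""
    if y_a == y_b:
        raise ValueError("the two source samples must come from different classes")
    mu = rng.uniform(low, high)              # random mixing ratio
    F_mix = mu * F_a + (1.0 - mu) * F_b      # linear interpolation in feature space
    return F_mix, (y_a, y_b), mu

rng = np.random.default_rng(0)
F_a, F_b = np.ones(128), np.zeros(128)
F_mix, labels, mu = synthesize_mixed_sample(F_a, 2, F_b, 5, rng)
```

Keeping μ away from 0 and 1 ensures both source categories contribute a noticeable share of the synthesized vector.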

[0182] S5.2 Loss Function Definition

[0183] This invention defines three main loss functions and one auxiliary regularization term.

[0184] The first loss is the classification loss. For a single-label sample (true label y):

[0185] the loss is defined in terms of the energy corresponding to the true category y and the energy of each category j;

[0186] For multi-label samples (true labels y1 and y2), the loss is defined using the Sigmoid function σ(·) applied over the category index j.

[0187] The second loss is the reconstruction loss. For single-label samples: $L_2 = \lVert F - \sum_{s} C[s,:] \rVert_2^2$, that is, the squared L2 reconstruction error between the fused feature vector $F$ and the sum of all category rows of the component matrix $C$. For multi-label samples, the reconstruction constraint is stronger: the sum of the components of only the two true categories should be approximately equal to $F$.

[0188] A global sparse regularization term is also defined. It encourages the component rows of non-true categories to approach the zero vector, thus promoting sparsity in the decomposition.
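The two reconstruction cases can be written as one function. This is a sketch with a hypothetical name and 0-based category indices; it uses the squared L2 error described in the text and switches to the two true rows when labels are supplied.

```python
import numpy as np

def reconstruction_loss(F, C, true_labels=None):
    """Squared L2 reconstruction error.

    Single-label case (true_labels=None): F is compared with the sum of all rows.
    Multi-label case: F is compared with the sum of the two true-category rows.
    """
    rows = C if true_labels is None else C[list(true_labels)]
    return float(np.sum((F - rows.sum(axis=0)) ** 2))

C_demo = np.array([[3.0, 0.0, 4.0],
                   [1.0, 2.0, 2.0],
                   [0.0, 0.1, 0.0]])
F_demo = C_demo[0] + C_demo[1]
loss_all = reconstruction_loss(F_demo, C_demo)
loss_pair = reconstruction_loss(F_demo, C_demo, true_labels=(0, 1))
```

In this toy case the pair-wise loss is zero because `F_demo` is exactly the sum of the two true rows, while the all-rows loss is penalized by the third row.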

[0189] The third loss is the cascaded decision loss, which consists of three sub-items: the residual significance regression loss, the category affinity guidance loss, and the second-round classification loss.

[0190] The three sub-items are computed from, respectively, the residual significance score and its supervision target value, the category affinity vector and its target category distribution, and the second-round category energies. The second-round classification loss is calculated in the same way as the first loss, with the energy of each category in the second component matrix C2 substituted for the first-round energy.

[0191] The supervision target for residual significance is constructed according to the following rules: for a single-label sample, the target is 0 if s1 = y and 1 if s1 ≠ y; for multi-label samples, the target is 1; for hypothesis-perturbed samples, the target is 1. The target category distribution is constructed according to the following rules: for a single-label sample, if s1 = y, the target is the uniform distribution (each component equals 1 divided by the number of categories); if s1 ≠ y, the true category receives 0.9 and the remaining categories together receive 0.1; for multi-label samples, the two true categories receive 0.45 each and the remaining categories together receive 0.1.
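The target construction rules can be made concrete with a short sketch. It assumes that the 0.1 reserved for the remaining categories is split evenly among them, which the text does not specify, and it uses 0-based category indices and hypothetical function names.

```python
import numpy as np

def significance_target(labels, s1, perturbed=False):
    """Supervision target for the residual significance score."""
    if perturbed or len(labels) == 2:
        return 1.0
    return 0.0 if s1 == labels[0] else 1.0

def affinity_target(labels, s1, n_classes):
    """Target category distribution for the category affinity head."""
    if len(labels) == 1 and s1 == labels[0]:
        return np.full(n_classes, 1.0 / n_classes)     # uniform distribution
    # 0.9 in total for the true categories; the rest share 0.1 (even split assumed).
    target = np.full(n_classes, 0.1 / (n_classes - len(labels)))
    target[list(labels)] = 0.9 / len(labels)
    return target

t_wrong = affinity_target((2,), s1=0, n_classes=5)     # wrong master hypothesis
t_multi = affinity_target((1, 3), s1=1, n_classes=5)   # mixed infection sample
```

With two true labels the rule yields 0.45 for each of them, matching the values stated in the text.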

[0192] S5.3 Hypothesis Perturbation Strategy

[0193] During training, with probability 0.2 the master hypothesis category is randomly replaced by an incorrect category sampled uniformly from the set of non-true categories {s | s ≠ y, s ∈ {1, ..., S}}, where S is the number of categories. The perturbed residual is then computed from the component row of the incorrect category and fed into the second network. The hypothesis perturbation strategy does not change the labels or content of the training data; it only injects incorrect intermediate hypotheses during forward propagation. In the latter half of training, the perturbation probability decreases linearly from 0.2 to 0.05 but does not drop to zero. For hypothesis-perturbed samples, the cascaded decision loss computes only the residual significance sub-item (target value 1) and omits the category affinity guidance and second-round classification sub-items, so that the anomalous distribution of perturbed residuals does not interfere with the training of category affinity guidance and second-round classification. The learning of category-level guidance signals is driven entirely by real single-label and multi-label samples.
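A minimal sketch of the perturbation schedule and sampling follows. It assumes the probability stays at 0.2 during the first half of training, which the text implies but does not state, and it uses 0-based category indices and hypothetical function names.

```python
import numpy as np

def perturbation_probability(progress, p_start=0.2, p_end=0.05):
    """Constant in the first half (assumed), then linear decay to p_end."""
    if progress <= 0.5:
        return p_start
    frac = (min(progress, 1.0) - 0.5) / 0.5
    return p_start + (p_end - p_start) * frac

def maybe_perturb(s1, y, n_classes, progress, rng):
    """Possibly replace the master hypothesis s1 with a uniformly sampled wrong class."""
    if rng.random() < perturbation_probability(progress):
        wrong = [s for s in range(n_classes) if s != y]
        return int(rng.choice(wrong)), True
    return s1, False

rng = np.random.default_rng(0)
draws = [maybe_perturb(3, 3, 6, 0.0, rng) for _ in range(2000)]
perturbed_classes = [s for s, flag in draws if flag]
```

Only the intermediate hypothesis changes; the sample's label `y` is left untouched, which matches the statement that the training data itself is not modified.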

[0194] S5.4 Dynamic Task Weights Based on Homoscedastic Uncertainty

[0195] Learnable log-variance parameters $\eta_1$, $\eta_2$, $\eta_3$ (all initialized to 0) are introduced for the three main losses. The total loss function is:

[0196] $L_{total} = \sum_{i=1}^{3} \left[ \dfrac{L_i}{2\exp(\eta_i)} + \dfrac{\eta_i}{2} \right] + \lambda_{sp} L_{sp}$;

[0197] where $L_1$, $L_2$, $L_3$ are the three main losses (classification, reconstruction, and cascaded decision); $\exp(\eta_i)$ is the variance estimate of the observation noise of the $i$-th task, with $\exp(\cdot)$ the natural exponential function; $1/(2\exp(\eta_i))$ is the effective loss weight; and $\eta_i/2$ is the regularization term. $L_{sp}$ is the sum of the sparse regularization term over all multi-label samples within a batch, and its hyperparameter $\lambda_{sp}$ is fixed at 0.01. For single-label samples, the sparse term is not included. The three parameters $\{\eta_1, \eta_2, \eta_3\}$ are updated jointly with all other learnable parameters of the network in the same optimization step.
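The weighting scheme described above can be checked with a few lines of arithmetic. The sketch assumes the total loss is the sum of the three weighted main losses, their log-variance regularizers, and the sparse term scaled by 0.01, as the surrounding definitions indicate; the function and argument names are illustrative.

```python
import math

def total_loss(main_losses, log_vars, sparse_sum, sparse_weight=0.01):
    """Uncertainty-weighted sum of the main losses plus the sparse regulariser.

    main_losses: (L1, L2, L3); log_vars: matching log-variance parameters.
    sparse_sum: sparse term summed over the multi-label samples of the batch.
    """
    total = sparse_weight * sparse_sum
    for loss, log_var in zip(main_losses, log_vars):
        total += loss / (2.0 * math.exp(log_var))   # effective weight 1/(2*exp)
        total += log_var / 2.0                      # keeps variances from growing freely
    return total

# At initialisation all log-variances are 0, so every main loss has weight 0.5.
value = total_loss((1.0, 2.0, 3.0), (0.0, 0.0, 0.0), sparse_sum=10.0)
```

Raising a task's log-variance lowers its effective weight but pays a linear penalty, so the optimizer cannot ignore a task for free.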

[0198] S5.5 Optimization Settings

[0199] The Adam optimizer is used with an initial learning rate of 1×10⁻⁴, β1 = 0.9, and β2 = 0.999. Cosine annealing learning rate scheduling is used with a minimum learning rate of 1×10⁻⁶. The training batch size is B = 16. The class prototype matrix P, the inter-prototype contrast projection matrix and bias, the key projection matrix, the per-head value projection matrices, the second network weights and biases, the learnable scaling factors, the homoscedastic uncertainty parameters, and the remaining learnable parameters are all learned automatically through backpropagation during end-to-end training. The soft-selection temperature is linearly annealed from 1.0 to 0.1 according to training progress and does not participate in backpropagation. The cascade trigger threshold and the energy threshold do not participate in backpropagation; their values are determined from validation-set performance after training. Training termination condition: the preset maximum number of training rounds is reached, or the overall recognition accuracy on the validation set shows no improvement for 10 consecutive rounds.
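The learning rate schedule and the termination rule can be sketched without any deep learning framework. The total step count and the exact form of the patience check are assumptions; the text fixes only the two learning rate endpoints and the patience of 10 rounds. Function names are hypothetical.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    """Cosine annealing from lr_max at step 0 to lr_min at total_steps."""
    p = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * p))

def should_stop(val_acc_history, epoch, max_epochs, patience=10):
    """Stop at max_epochs or after `patience` rounds without improvement."""
    if epoch >= max_epochs:
        return True
    if len(val_acc_history) <= patience:
        return False
    best_before = max(val_acc_history[:-patience])
    return max(val_acc_history[-patience:]) <= best_before
```

For instance, a history that peaked at 0.5 and then stayed at 0.4 for ten rounds makes `should_stop` return `True`, while a history shorter than the patience window never triggers early stopping.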

[0200] S5.6 Gradient Collaboration Mechanism

[0201] In this invention, the gradients of the three main losses form multiple cooperative paths in the network. The gradient of the classification loss flows from the energy ranking through the attention weights of the first network to the projection matrix of the inter-prototype contrast modulation and the class prototype matrix P, and returns through the query segments to the fused feature vector F, thereby affecting the encoding network. The gradient of the reconstruction loss flows from the reconstruction error to each row of the component matrix, and in turn affects the per-head value projection parameters and the class prototype matrix P. The gradient of the cascaded decision loss flows from the second-round classification accuracy through the second-round attention weights of the first network, the dimension-guided vector, the category affinity vector, the shared backbone of the second network, and the decomposition residual (the fused feature vector F minus the master hypothesis row of the component matrix), so it affects both the encoding quality of F and the quality of the first-round decomposition in the first network. These gradient paths intersect at the class prototype matrix P, forming a multi-objective collaborative optimization.

[0202] In the diagnosis and treatment of critically ill patients with infections, rapid and accurate identification of pathogens (including single infections, mixed infections, and low-load pathogens) is crucial for guiding antibiotic use. Current microbial culture methods take 24-72 hours, while patient chest CT images and laboratory data such as complete blood counts and inflammatory markers are available immediately. There is an urgent clinical need for an artificial intelligence system that can integrate imaging morphological features with dynamic host immune response data to provide reliable pathogen inference before culture results are available. In particular, it is essential to address the challenges of identifying mixed infections caused by common pathogens such as Staphylococcus aureus, Klebsiella pneumoniae, and Pseudomonas aeruginosa, as well as detecting pathogens that produce only weak signals.

[0203] Adaptation and implementation of this application:

[0204] 1. Data Input:

[0205] 1.1 Spatial Structured Data: High-resolution CT scan images of the patient's chest, used as three-dimensional volume data input (I).

[0206] 1.2 Multi-source heterogeneous time series data: Laboratory test indicators for several consecutive days after the patient's admission, including white blood cell count, C-reactive protein, procalcitonin, neutrophil percentage, etc., were time-aligned and interpolated to form a standardized time series matrix (X).

[0207] 2. Network processing flow:

[0208] 2.1 Dual-channel depth encoder:

[0209] 2.1.1 Image Encoding Path: A pre-trained 3D ResNet is used to extract spatial semantic features such as texture, solid range, and ground-glass opacity of lung infection lesions from CT images, forming image feature vectors. Simultaneously, a 3D U-Net branch is used to calculate the confidence score for lung region segmentation.

[0210] 2.1.2 Temporal Coding Path: The GRU network is used to encode the temporal test indicators, capture the dynamic evolution pattern of the inflammatory indicators, and form a temporal coding vector.

[0211] Image features are gated using image segmentation confidence and fused with temporal coding vectors to obtain a fused feature vector (F) that comprehensively reflects "lesion morphology + host response".

[0212] 2.2 First Network:

[0213] Define a category prototype matrix (P), where each row represents the prototype vector of a pathogen category (such as Staphylococcus aureus, Pneumocystis pneumoniae, etc.) in the feature space.

[0214] The fused feature vector F is compared with the prototypes of each category to perform attention calculation. Key mechanism: When calculating the Staphylococcus aureus prototype attention, the network automatically focuses on key feature dimensions (e.g., certain image texture or inflammatory marker combination patterns) that can distinguish Staphylococcus aureus from its most similar competing species (such as Staphylococcus epidermidis).

[0215] Output a component matrix, where each row represents the "contribution" of the corresponding pathogen to the current infection feature. The initial decomposition produces the master hypothesis pathogen, the decomposition residual (R), and dimensional competition features.

[0216] 2.3 Second Network:

[0217] Analyze the significance of the residuals R. If the residuals are not significant, directly output the master hypothesis that the pathogen is a single infection.

[0218] If the residuals are significant (indicating that the initial decomposition failed to fully explain all features), two types of feedback signals will be generated: (1) Dimension-guided vector: indicating which feature dimensions the residual signals are mainly concentrated in (e.g., possibly a certain CT texture feature); (2) Category affinity vector: indicating which other pathogens the residual signals are most likely to belong to.

[0219] The feedback signal is injected into the first network to perform a second round of conditional refinement decomposition on the residual R, generating the second component matrix.

[0220] 3. Output and Decision:

[0221] Path A (Single Infection): If the residual is not significant, directly output the master hypothesis pathogen (e.g., "Klebsiella pneumoniae").

[0222] Path B (Mixed Infection): If the residual is significant and the component energy of a non-master hypothesis category (such as "Pseudomonas aeruginosa") in the second round of decomposition exceeds the threshold, then the mixed infection result (such as "Klebsiella pneumoniae + Pseudomonas aeruginosa mixed infection") is output.

[0223] Path C (Hypothesis Correction / Weak Signal Detection): If the residual is significant but no strong energy category appears in the second round, the results may be corrected or the presence of a low-load pathogen signal may be indicated by combining the comprehensive evidence of the primary and secondary hypotheses in the first round of decomposition.

[0224] Addressing the issue of missed detection of mixed infections: Through additive decomposition and cascaded determination, the model can identify multiple pathogen signals present in images and indicators, avoiding the problem of traditional single-label classification models forcibly classifying mixed infections as a single dominant species.

[0225] Improving the detection rate of weak signals: For pathogens with low load and atypical imaging manifestations, the weak signals are retained in the residual after the initial decomposition. Guided by the feedback of the second network, they are selectively enhanced and identified in the second round of decomposition.

[0226] Differentiating similar pathogens: Through a prototype-to-prototype contrast modulation mechanism, the model can automatically learn and focus on differentiating pathogen pairs with similar imaging features or inflammatory response patterns (such as different types of Candida), thus improving fine-grained identification capabilities.

[0227] In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps in the above embodiments.

[0228] In one embodiment, a computer-readable storage medium is provided storing a computer program that, when executed by a processor, implements the steps described above.

[0229] The technical scope of this invention is not limited to the content described above. Those skilled in the art can make various modifications and variations to the above embodiments without departing from the technical concept of this invention, and all such modifications and variations should fall within the protection scope of this invention.

Claims

1. A method for constructing a pathogenic bacteria dynamic recognition network based on time-series images and clinical data, characterized in that it includes the following steps: acquiring clinical data and 3D images, and preprocessing the clinical data to obtain a standardized time series matrix; constructing a dual-channel deep encoder that takes a 3D image and the standardized time series matrix as input and outputs a fused feature vector, the encoder including a pre-trained 3D convolutional neural network, a single-layer unidirectional gated recurrent unit, and confidence gating; constructing a first network, which is a multi-head cross-attention network that introduces a predefined category prototype matrix, wherein, based on the fused feature vector, the contrast difference vector between each category prototype vector in the category prototype matrix and its nearest competing prototype is calculated and used to modulate the attention mechanism when the attention score is calculated, a component matrix is generated on this basis, the master hypothesis category is extracted from the component matrix, and the decomposition residual and dimensional competition features are calculated; constructing a second network based on a multilayer perceptron, which takes the decomposition residual and the dimensional competition features as input, generates a residual significance score, a dimension-guided vector, and a category affinity vector, and then makes a determination: when the residual significance score is lower than the first preset threshold, the master hypothesis category is output; when the residual significance score is not lower than the first preset threshold, the decomposition residual is used as the fused feature vector, and the dimension-guided vector and category affinity vector are injected as feedback signals into the attention score calculation process to generate the second component matrix and determine: (1) when the second component matrix contains a category row whose energy exceeds the second preset threshold, a dual-category mixed infection result is output; (2) when no category row in the second component matrix has energy exceeding the second preset threshold, the corrected category is output based on a comprehensive evidence comparison of the two largest-energy rows in the component matrix; and training the dual-channel deep encoder, the first network, and the second network end-to-end using a multi-task collaborative loss until convergence.

2. The method according to claim 1, characterized in that the steps for modulating the attention mechanism specifically include: after the fused feature vector is divided into multiple query segments, key vectors and value vectors are generated by projecting each category prototype vector in the category prototype matrix; for each category prototype vector, the nearest-neighbor competing prototype with the highest cosine similarity in the category prototype matrix is determined; the difference between the category prototype vector and the nearest-neighbor competing prototype is calculated to generate a contrast difference vector, which is projected through a linear projection matrix and activated by a Sigmoid function to generate a dimension modulation mask; the key vector is then modulated element-wise using the dimension modulation mask, and the attention score is calculated. The formula for calculating the attention score is as follows: $e(h,s) = \dfrac{(q^{(h)})^{T} \, (\kappa(s) \odot m(s))}{\sqrt{d_h}}$, where $q^{(h)}$ is the $h$-th query segment, $\kappa(s)$ is the key vector of the category prototype vector $p(s)$, $\odot$ is element-wise multiplication, $m(s)$ is the dimension modulation mask, $\sqrt{d_h}$ is the scaling factor ($d_h$ being the dimension of each query segment), and $T$ is the transpose operator.

3. The method according to claim 2, characterized in that, The specific steps for extracting the master hypothesis categories from the component matrix and calculating the decomposition residuals and dimensional competing features include: The energy of each category in the component matrix is ​​calculated using the L2 norm. The category corresponding to the largest energy is selected as the master hypothesis category. The difference between the fused feature vector and the row vector of the master hypothesis category is calculated as the decomposition residual. The dimension competition feature is obtained by calculating the smooth approximation of the dimensional difference between the row with the largest energy and the row with the second largest energy in the component matrix.

4. The method according to claim 1, characterized in that, The specific steps for outputting the fused feature vector include: The three-dimensional convolutional neural network consists of a three-dimensional fully convolutional segmentation network and a three-dimensional residual convolutional neural network. The pre-trained three-dimensional fully convolutional segmentation network extracts target subject masks and key sub-region masks from the three-dimensional image, and calculates a segmentation confidence scalar based on segmentation quality. The pre-trained three-dimensional residual convolutional neural network extracts feature maps from the three-dimensional image and fuses them to reduce the dimensionality into image feature vectors. A single-layer unidirectional gated recurrent unit extracts a temporal coding vector from the standardized temporal matrix. The image feature vector is then gated and scaled using the confidence scalar and fused with the temporal coding vector. Finally, it is compressed into a fused feature vector through a multi-layer fully connected network.

5. The method according to claim 1, characterized in that the dimension-guided vector and category affinity vector are injected as feedback signals into the attention score calculation process through the following specific steps: the dimension-guided vector is segmented and then applied as an additive bias to the attention score, where the terms of the calculation are a learnable scaling factor, the h-th query segment of the second round, the inner product of two vectors of equal dimension (resulting in a scalar), and the h-th subvector obtained by segmenting the dimension-guided vector in the same way as the query segments; the attention score modulated in this way is then further modulated by the category affinity vector in the form of a logarithmic additive bias.

6. The method according to claim 1, characterized in that the steps of the comprehensive evidence comparison specifically include: the category corresponding to the second-largest-energy row in the component matrix is taken as the candidate category; the comprehensive evidence of the master hypothesis category and of the candidate category is calculated respectively; and the category with the larger comprehensive evidence is output as the corrected category. The comprehensive evidence of a category s is calculated from the component energy of category s in the component matrix, the row vector of category s in the component matrix, the fused feature vector F, and the component energy of category s in the second component matrix, weighted by learnable positive coefficients (kept positive through a Softplus constraint).

7. The method according to claim 1, characterized in that, The multi-task collaborative loss includes three main losses: classification loss, reconstruction loss, and cascaded decision loss. By introducing a learnable log-variance parameter for each main loss, the effective weights of the three main losses are dynamically and adaptively adjusted using a homoscedasticity uncertainty mechanism.

8. The method according to claim 7, characterized in that the method employs a hypothesis perturbation strategy during training, specifically including: randomly replacing the master hypothesis category with an incorrect category with a preset probability during forward propagation, and then calculating the perturbed residual; the preset probability decreases linearly with the training rounds but does not drop to zero, so that the network is continuously exposed to master-hypothesis error scenarios throughout the entire training process.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the method as described in any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method as claimed in any one of claims 1 to 8.