Method and device for establishing a multi-modal information diagnosis and treatment system for digestive system diseases

The multimodal information diagnosis and treatment system has solved the problem of multimodal data fusion in digestive system diseases, realized integrated management and individualized treatment throughout the entire disease course, and improved the efficiency of diagnosis and treatment and data sharing capabilities for digestive system diseases.

CN120998466B · Active · Publication Date: 2026-06-23 · XIEHE HOSPITAL ATTACHED TO TONGJI MEDICAL COLLEGE HUAZHONG SCI & TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIEHE HOSPITAL ATTACHED TO TONGJI MEDICAL COLLEGE HUAZHONG SCI & TECH UNIV
Filing Date
2025-08-11
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies for the diagnosis and treatment of digestive system diseases suffer from strong heterogeneity of multimodal data, lack of efficient fusion and integration strategies, and incomplete or missing modalities, leading to incomplete analysis and making it difficult to achieve dynamic and global control over the progression of digestive system diseases.

Method used

The multimodal information diagnosis and treatment system collects multimodal information, extracts feature vectors and embeds labels, concatenates them and maps them to a unified dimension, fuses the data with a multi-head attention mechanism and a cross-attention mechanism, and combines a medical knowledge graph with a health intelligence platform to give clinicians a basis for personalized decision-making.

Benefits of technology

It enables integrated management of the entire course of digestive system diseases, improves the effectiveness of clinical intelligent auxiliary diagnosis and individualized treatment, supports large-scale data sharing and intelligent application throughout the process, and promotes the widespread application of smart health platforms.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a method for establishing a multimodal information diagnosis and treatment system for digestive system diseases, comprising: S1, collecting multimodal information, then labeling and preprocessing it; S2, extracting feature vectors from the multimodal information and embedding labels according to the labeled information; S3, concatenating the feature vectors and the labels and mapping them to a unified dimension to obtain enhanced feature vectors; S4, fusing the enhanced feature vectors to form a multimodal feature matrix, performing linear mapping on the multimodal feature matrix, obtaining a global fusion vector after weighted aggregation, and generating a fusion vector sequence; S5, enhancing the temporal information of the global fusion vector sequence and the spatial information on the spatial correlation of site-specific features, and interactively fusing the two to obtain spatiotemporal joint features; S6, classifying and predicting the disease stage or specific pathological type, and outputting a diagnosis result; S7, semantically associating the diagnosis result with a medical knowledge graph, sharing data to an online health intelligence platform, and providing clinicians with a basis for personalized decision-making.

Description

Technical Field

[0001] This invention relates to the field of medical diagnostic systems, and in particular to a method and apparatus for establishing a diagnostic and treatment system for multimodal information of digestive system diseases.

Background Technology

[0002] Digestive system tumors have high incidence and mortality rates. Due to their complex pathogenesis and high heterogeneity across different stages and locations, effective management currently relies on precision medicine. In recent years, with the development of artificial intelligence technology, deep learning and multimodal modeling methods have been gradually introduced into the field of medical diagnosis and treatment. These methods enable the systematic integration and intelligent analysis of various types of medical data, including clinical information, medical images, pathological slides, and multi-omics data, demonstrating their potential for automated analysis and decision support.

[0003] However, the application of multimodal data fusion in the diagnosis and treatment of digestive system diseases still faces many challenges: 1. Multimodal data is highly heterogeneous, lacking efficient fusion and integration strategies; 2. Multimodal data is often missing or incomplete in actual collection, and existing technologies lack effective reconstruction and compensation mechanisms for modal missing data, making it difficult to ensure the integrity and robustness of clinical multimodal data analysis; 3. Existing methods often ignore the "spatiotemporal specificity" of digestive system diseases at different developmental stages and anatomical locations, making it difficult to achieve dynamic and global control over the progression of digestive system diseases.

[0004] Therefore, there is an urgent need for a method and device for establishing a multimodal information diagnostic and treatment system for digestive system diseases to overcome the above-mentioned deficiencies.

Summary of the Invention

[0005] This invention proposes a method and apparatus for establishing a multimodal information diagnosis and treatment system for digestive system diseases. It can accurately acquire different stages (healthy, inflammatory, precancerous lesions, early cancer, and advanced cancer) and precise locations of digestive system diseases on a health intelligence platform of a medical diagnostic system, giving full play to the advantages of the health intelligence platform in early screening, accurate diagnosis and personalized treatment.

[0006] To solve the above-mentioned technical problems, the technical solution of the present invention is as follows: a method for establishing a multimodal information diagnosis and treatment system for digestive system diseases, comprising the following steps:

[0007] S1. Collect basic clinical information, endoscopic images, radiological images, pathological slide images, and multi-omics information from multiple modalities, and then label and preprocess them.

[0008] S2. Extract the feature vectors of the preprocessed multimodal information and embed labels into the multimodal information according to the labeled information;

[0009] S3. Concatenate the feature vector of the extracted multimodal information with the embedded labels of the multimodal information, and map the concatenated high-dimensional vector to a unified dimension to obtain an enhanced feature vector containing modality-specific information, anatomical location information and disease stage information.

[0010] S4. The enhanced feature vectors of all the concatenated multimodal information are fused to form a multimodal feature matrix. The multimodal feature matrix is linearly mapped by projection matrices to obtain the query, key, and value of the multi-head attention mechanism. A mask matrix is constructed to calculate the weights of the multi-head attention mechanism. After weighted aggregation, a global fusion vector is obtained. Global fusion vectors belonging to the same anatomical location and disease stage are grouped to generate a fusion vector sequence.

[0011] S5. Based on the global fusion vector sequence, inject information about the disease stage to enhance the temporal information of the global fusion vector sequence. Through spatial coding, utilize the anatomical topological relationship and pathological co-occurrence relationship of the digestive tract to strengthen the spatial information of the spatial correlation of the site-specific features. Through the cross-attention mechanism, the temporal information and spatial information of the disease stage are deeply interacted and fused to obtain spatiotemporal joint features.

[0012] S6. Based on the spatiotemporal joint fusion features, the disease stage or specific pathological type of the patient is classified and predicted through a multi-task learning framework, and the diagnostic results are output by co-optimizing multiple losses.

[0013] S7. Construct a medical knowledge graph from medical knowledge, drug information, and multimodal information, semantically link diagnostic results with the medical knowledge graph, and share data to an online health intelligence platform for medical big data. This will facilitate personalized decision-making for clinicians based on individual patient characteristics and disease progression, and assist in clinical decision-making.

[0014] Preferably, the basic clinical information mentioned in step S1 is the basic clinical information of the patient's digestive system disease stage, wherein the disease stage includes healthy, inflammatory, precancerous lesions, early cancer and advanced cancer, and the multi-omics information includes metagenomics, metabolomics, transcriptomics, proteomics and peripheral blood whole genome information.

[0015] Preferably, the concatenation in step S3 proceeds as follows: the feature vectors of the multimodal information extracted in step S2 are concatenated with the labels embedded in the multimodal information, and a linear transformation function maps the concatenated high-dimensional vector to a unified dimension to obtain an enhanced feature vector $\tilde{h}_m$ containing modality-specific information, anatomical location information, and disease stage information:

[0016] $\tilde{h}_m = \mathrm{Linear}\left([h_m; e_s; e_t]\right)$ (10)

[0017] where, in formula (10), $h_m$ is the original feature vector extracted from the $m$-th modality, $e_s$ is the embedding vector of the anatomical site, and $e_t$ is the embedding vector of the disease stage.
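The concatenate-then-project operation of step S3 can be illustrated with a minimal sketch in plain Python. The function names `linear` and `enhance`, the toy dimensions, and the weight values are hypothetical and chosen only for illustration; the application does not specify the form of the linear transformation beyond mapping to a unified dimension.

```python
def linear(vec, weight, bias):
    """Return W @ vec + b, where weight is given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def enhance(h_m, e_s, e_t, weight, bias):
    """Concatenate [h_m; e_s; e_t] and project to the unified dimension."""
    concatenated = list(h_m) + list(e_s) + list(e_t)
    return linear(concatenated, weight, bias)


# Toy example (illustrative values): a 3-dim modality feature, 2-dim site and
# stage embeddings, projected to a unified dimension d = 2.
h_m = [0.5, -1.0, 2.0]
e_s = [1.0, 0.0]
e_t = [0.0, 1.0]
weight = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.5]
enhanced = enhance(h_m, e_s, e_t, weight, bias)
```

In a trained system the weight matrix and bias would be learned parameters, and every modality would be projected to the same dimension d so that the enhanced vectors can be stacked into one matrix.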

[0018] Preferably, step S4 further includes step S41: the enhanced feature vectors $\tilde{h}_1, \dots, \tilde{h}_M$ of all multimodal information concatenated in step S3 are fused to form a matrix $H$:

[0019] $H = [\tilde{h}_1; \tilde{h}_2; \dots; \tilde{h}_M]$ (11),

[0020] where, in formula (11), $H$ is the multimodal feature matrix obtained by concatenation and fusion, with dimensions $M \times d$; $M$ is the number of modalities, and $d$ is the feature dimension;

[0021] The projection matrices $W^Q$, $W^K$, $W^V$ linearly map the concatenated matrix $H$ to obtain the $Q$, $K$, and $V$ required by the multi-head attention mechanism:

[0022] $Q = H W^Q,\quad K = H W^K,\quad V = H W^V$ (12),

[0023] where, in formula (12), $Q$, $K$, and $V$ are obtained from the multimodal feature matrix $H$; $W^Q$, $W^K$, $W^V$ are trainable mapping matrices used to generate the query $Q$, key $K$, and value $V$, respectively;

[0024] A mask matrix $M_{\text{mask}}$ is constructed to mask invalid information in the multi-head attention mechanism; the attention weights calculated after masking are recorded as $A_j$:

[0025] $A_j = \mathrm{softmax}\left(\frac{Q_j K_j^{\top}}{\sqrt{d_k}} + M_{\text{mask}}\right)$ (13)

[0026] where, in formula (13), $M_{\text{mask}}$ is the mask matrix, in which the rows and columns corresponding to missing modal information are assigned a minimum value and all other entries are 0; $j$ is the attention head index of the multi-head attention mechanism ($j = 1, \dots, h$, where $h$ is the number of heads; $h = 8$ here); $W_j^Q$, $W_j^K$, $W_j^V$ are the trainable projection matrices of attention head $j$; $Q_j$, $K_j$, $V_j$ are the query, key, and value submatrices of the $j$-th head, each of dimension $d_k$; $Q_j K_j^{\top}$ calculates the similarity between modal information (the larger the value, the stronger the correlation between the two modalities); $\sqrt{d_k}$ is the scaling factor that prevents the inner product from becoming too large, which would cause the softmax gradient to vanish; softmax normalizes along the rows so that the attention weights in each row sum to 1.

[0027] The outputs of the $h$ attention heads are aggregated by weighting to obtain the global fusion vector:

[0028] $z_i = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$ (14)

[0029] where, in formula (14), $z_i$ is the global fusion vector of sample $i$, which fuses the features of the various modalities together with the embedded anatomical location and disease stage; $\mathrm{head}_j$ is the output vector of the $j$-th head; $\mathrm{Concat}(\cdot)$ indicates that all head outputs are concatenated along the feature dimension; $W^O$ represents the learnable weight matrix;
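How the mask suppresses missing modalities can be seen in a small single-head sketch of scaled dot-product attention. This is an illustration under stated assumptions rather than the claimed implementation: it uses one head instead of eight, skips the per-head projections, represents the "minimum value" by the hypothetical constant `NEG_INF`, and masks only the key columns of missing modalities (not the query rows).

```python
import math

NEG_INF = -1e9  # stands in for the "minimum value" written into the mask


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V, available):
    """Scaled dot-product attention over modalities.

    Q, K, V: lists of M row vectors (one per modality).
    available: list of M booleans; False marks a missing modality.
    Returns (outputs, weights).
    """
    d_k = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        scores = []
        for k, ok in zip(K, available):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            scores.append(score + (0.0 if ok else NEG_INF))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[c] for wi, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


# Three modalities, the second one missing (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = masked_attention(X, X, X, available=[True, False, True])
```

Each row of `attn` still sums to 1, but the column of the missing modality receives zero weight, so its placeholder features do not contribute to the aggregated output.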

[0030] Global fusion vectors $z_i$ belonging to the same combination of anatomical location and disease stage are aggregated to generate the fusion vector sequence of the corresponding anatomical location and disease stage. The calculation formula is as follows:

[0031] $Z_{s,t} = \{\, z_i \mid s_i = s,\ t_i = t \,\}$ (15)

[0032] where, in formula (15), $z_i$ is the global fusion vector, $i$ is the sample index, $s_i$ and $t_i$ are its anatomical location and disease stage labels, and $Z_{s,t}$ is the set of fusion sample vectors for anatomical site $s$ and disease stage $t$.
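The grouping of global fusion vectors by anatomical site and disease stage amounts to a keyed partition of the samples. A minimal sketch, assuming each sample is supplied as a hypothetical `(vector, site_label, stage_label)` tuple:

```python
from collections import defaultdict


def group_by_site_stage(samples):
    """Collect fusion vectors that share the same (site, stage) label pair.

    samples: iterable of (vector, site_label, stage_label) tuples.
    Returns a dict mapping (site, stage) to the list of vectors in that group.
    """
    groups = defaultdict(list)
    for vector, site, stage in samples:
        groups[(site, stage)].append(vector)
    return dict(groups)


# Illustrative data: two samples at site 3 / stage 2, one at site 5 / stage 4.
samples = [([0.1, 0.2], 3, 2), ([0.3, 0.4], 3, 2), ([0.5, 0.6], 5, 4)]
groups = group_by_site_stage(samples)
```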

[0033] Preferably, step S4 further includes step S42: a regularization loss is introduced to minimize the cosine similarity between the heads of the multi-head attention, giving the multi-head attention diversity regularization loss $\mathcal{L}_{\text{div}}$:

[0034] $\mathcal{L}_{\text{div}} = \sum_{j \neq k} \frac{\langle w_j, w_k \rangle}{\lVert w_j \rVert_2\, \lVert w_k \rVert_2}$ (16)

[0035] where, in formula (16), $w_j$ is the flattened vector of the weight matrix of the $j$-th attention head; $\langle w_j, w_k \rangle$ is the inner product of the weight vectors of the $j$-th and $k$-th heads, which measures the similarity between the two vectors; $\lVert \cdot \rVert_2$ is the L2 norm, used for normalization when calculating cosine similarity.
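The inter-head cosine-similarity penalty can be sketched as follows. The summation over all ordered pairs of distinct heads, without a normalizing constant, is an assumption made for this sketch; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: inner product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def head_diversity_penalty(head_weights):
    """Sum of cosine similarities over all ordered pairs of distinct heads.

    head_weights: list of flattened per-head weight vectors.
    """
    h = len(head_weights)
    total = 0.0
    for j in range(h):
        for k in range(h):
            if j != k:
                total += cosine(head_weights[j], head_weights[k])
    return total


orthogonal_heads = head_diversity_penalty([[1.0, 0.0], [0.0, 1.0]])
redundant_heads = head_diversity_penalty([[1.0, 0.0], [2.0, 0.0]])
```

Orthogonal head weights incur no penalty, while heads that point in the same direction are penalized, which pushes the heads toward attending to different aspects of the multimodal features.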

[0036] The intermediate fusion vector $\tilde{z}_i$ after multi-head attention concatenation and fusion is obtained by using a residual connection and LayerNorm:

[0037] $\tilde{z}_i = \mathrm{LayerNorm}\left(z_i + \mathrm{Pool}(Q)\right)$ (17)

[0038] where, in formula (17), $\tilde{z}_i$ is the intermediate vector after residual normalization; $z_i$ is the global fusion vector of formula (14); $Q$ is the query of the multi-head attention mechanism; $\mathrm{Pool}(Q)$ pools $Q$ along the modal dimension and is used for the residual connection.

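[0039] The final fused vector $z_i^{\text{out}}$ is then obtained through a feedforward network:

[0040] $z_i^{\text{out}} = \mathrm{LayerNorm}\left(\tilde{z}_i + \mathrm{FFN}(\tilde{z}_i)\right)$ (18)

[0041] where, in formula (18), $\tilde{z}_i$ is the intermediate vector of formula (17); $z_i^{\text{out}}$ is the final output fusion vector, which is fed into the subsequent missing-modality reconstruction or downstream task modules; LayerNorm (layer normalization) normalizes the input vector to stabilize training.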

[0042] For each available modality $m$, the modal decoder network $D_m$ reconstructs, from the final output fusion vector $z_i^{\text{out}}$, the feature $\hat{x}_{i,m}$ of dimension $d$:

[0043] $\hat{x}_{i,m} = D_m\left(z_i^{\text{out}}\right)$ (19)

[0044] The mean squared error loss is used to assess the quality of the basic reconstruction, and the basic reconstruction mean squared error loss $\mathcal{L}_{\text{MSE}}$ is calculated by summing over all samples and modalities:

[0045] $\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} \mathbb{1}[\delta_{i,m} = 1]\, \lVert x_{i,m} - \hat{x}_{i,m} \rVert_2^2$ (20)

[0046] where, in formula (20), $N$ is the total number of samples and $i$ denotes the $i$-th sample; $M$ is the total number of modalities and $m$ denotes the $m$-th modality; $x_{i,m}$ is the true feature of the $m$-th modality, of dimension $d$; $\delta_{i,m}$ is the modality indicator and $\mathbb{1}[\cdot]$ is the indicator function, where $\delta_{i,m} = 1$ indicates that the $m$-th modality is available (a modality whose information is missing is unavailable); $\lVert \cdot \rVert_2$ is the L2 norm, which measures the basic reconstruction error.
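The availability-masked reconstruction error can be sketched in a few lines. Dividing the accumulated squared error by the number of samples is an assumption of this sketch, as are the function name and the nested-list data layout.

```python
def masked_reconstruction_mse(true_feats, recon_feats, available):
    """Squared reconstruction error over available modalities, averaged over samples.

    true_feats[i][m], recon_feats[i][m]: d-dimensional lists for sample i, modality m.
    available[i][m]: True if modality m was observed for sample i.
    """
    n_samples = len(true_feats)
    total = 0.0
    for x_i, xhat_i, avail_i in zip(true_feats, recon_feats, available):
        for x, xhat, ok in zip(x_i, xhat_i, avail_i):
            if ok:  # missing modalities contribute nothing to the loss
                total += sum((a - b) ** 2 for a, b in zip(x, xhat))
    return total / n_samples


# One sample, two modalities; the second modality is missing (illustrative values).
true_feats = [[[1.0, 2.0], [0.0, 0.0]]]
recon_feats = [[[1.0, 0.0], [5.0, 5.0]]]
loss = masked_reconstruction_mse(true_feats, recon_feats, [[True, False]])
```

The large error on the second modality is ignored because no ground truth exists for it; only the observed modality drives the loss.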

[0047] Preferably, the decoder network additionally outputs a variance estimate $\hat{\sigma}_{i,m}^2$ of the reconstructed features, which quantifies the reconstruction uncertainty; based on it, the uncertainty-weighted loss over all samples and modalities is calculated:

[0048] (21),

[0049] where, in formula (21), $N$ is the total number of samples and $i$ denotes the $i$-th sample; $M$ is the total number of modalities and $m$ denotes the $m$-th modality; $\hat{\sigma}_{i,m}^2$ is the variance estimate of the reconstructed feature of the $m$-th modality of the $i$-th sample;

[0050] For the set of reconstructed features of all available samples and the set of true features, the maximum mean discrepancy (MMD) is used to measure distribution consistency, and a Gaussian kernel function is used to calculate the distribution alignment loss over all modal information:

[0051] (22),

[0052] where

[0053] (23)

[0054] where $k(\cdot, \cdot)$ is the Gaussian kernel function and $\sigma$ is the width of the Gaussian kernel; $n_m$ is the number of available samples for the $m$-th modality; $i$ and $j$ are sample indices.

[0055] For a missing target modality $a$, the true features of an observed modality $b$ are used for reconstruction through the cross-modal mapping function $f_{b \to a}$, and the cross-modal distillation loss is calculated:

[0056] (24),

[0057] where $f_{b \to a}$ is the mapping function from the observed modality $b$ to the missing target modality $a$; the observed modality $b$ is selected based on its relevance to the missing modality $a$; $\lVert \cdot \rVert_1$ is the L1 norm;

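[0058] The weighted sum of the above mean squared error loss $\mathcal{L}_{\text{MSE}}$, uncertainty-weighted loss $\mathcal{L}_{\text{unc}}$, distribution alignment loss $\mathcal{L}_{\text{MMD}}$, and cross-modal distillation loss $\mathcal{L}_{\text{distill}}$ yields the overall reconstruction loss $\mathcal{L}_{\text{rec}}$. The calculation formula is as follows:

[0059] $\mathcal{L}_{\text{rec}} = \lambda_1 \mathcal{L}_{\text{MSE}} + \lambda_2 \mathcal{L}_{\text{unc}} + \lambda_3 \mathcal{L}_{\text{MMD}} + \lambda_4 \mathcal{L}_{\text{distill}}$ (25),

[0060] where, in formula (25), $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weighting coefficients of the mean squared error loss, the uncertainty-weighted loss, the distribution alignment loss, and the cross-modal distillation loss, respectively, used to balance the contributions of the different losses.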

[0061] Preferably, step S5 further includes step S51: based on the global fusion vector sequence described in step S4, temporal information is injected into each stage according to the disease progression stage:

[0062] For any two disease stage labels $t_a$ and $t_b$, the difference between stages is defined as:

[0063] $\Delta t = t_a - t_b$ (26)

[0064] Sine and cosine functions are used to generate the temporal position encoding vector $PE(\Delta t)$:

[0065] (27)

[0066] where, in formula (26), $t_a$ and $t_b$ are disease stage labels (integers from 0 to 4), and $\Delta t$ is the difference between the label indices of the two disease stages, which measures the relative distance between different disease stages; in formula (27), $\omega_k$ is the frequency constant ($k = 1, \dots, 16$), which controls the period of the sine and cosine functions so that differences between disease stages are encoded distinctly; $PE(\Delta t)$ is the stage position encoding vector, which maps discrete stage differences to continuous features using sine and cosine functions, providing sufficient discriminative power for the differences between stages and supplying stage-wise temporal information to the subsequent Transformer model.
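A sine-cosine encoding of the stage difference can be sketched as below. The text fixes only that 16 frequency constants are used; the geometric frequency schedule `base ** ((k - 1) / num_freqs)`, borrowed from the standard Transformer positional encoding, the signed difference, and the interleaved sine/cosine layout are all assumptions of this sketch.

```python
import math


def stage_difference(t_a, t_b):
    """Difference between two stage labels (integers 0..4)."""
    return t_a - t_b


def stage_position_encoding(delta, num_freqs=16, base=10000.0):
    """Map a discrete stage difference to a 2 * num_freqs continuous vector.

    The frequency schedule below is an assumed choice, not taken from the source.
    """
    encoding = []
    for k in range(1, num_freqs + 1):
        omega = base ** ((k - 1) / num_freqs)
        encoding.append(math.sin(delta / omega))
        encoding.append(math.cos(delta / omega))
    return encoding


# Encoding of the gap between early cancer (t=3) and inflammatory (t=1).
enc = stage_position_encoding(stage_difference(3, 1))
zero_enc = stage_position_encoding(stage_difference(2, 2))
```

With 16 frequencies the encoding has 32 components; a zero stage difference maps to alternating 0 (sine) and 1 (cosine) entries, and larger differences move smoothly away from that point.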

[0067] The input to the Transformer model is the fusion vector sequence $Z$ and the positional encoding $PE$. By using the Transformer model to capture the dependencies between disease stages, the temporal features $H^{\text{time}}$ of the evolution across disease stages are obtained:

[0068] $H^{\text{time}} = \mathrm{Transformer}\left(Z, PE\right)$ (28)

[0069] where, in formula (28), $\mathrm{Transformer}(\cdot)$ denotes a Transformer model with a 4-layer encoder, 8 attention heads per layer, and a hidden dimension of 256.

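[0070] The L1 norm is used to limit drastic fluctuations of the features between adjacent stages, and the temporal smoothing loss $\mathcal{L}_{\text{smooth}}$ of the temporal features $H^{\text{time}}$ is calculated:

[0071] $\mathcal{L}_{\text{smooth}} = \sum_{t} \lVert H^{\text{time}}_{t+1} - H^{\text{time}}_{t} \rVert_1$ (29);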
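The temporal smoothing penalty can be sketched as the summed L1 distance between the features of adjacent stages. The function name and the list-of-vectors layout are hypothetical, and the unnormalized sum is an assumption.

```python
def temporal_smoothing_loss(stage_features):
    """Sum of L1 distances between temporal features of adjacent stages.

    stage_features: list of feature vectors ordered by disease stage.
    """
    return sum(
        sum(abs(a - b) for a, b in zip(prev, nxt))
        for prev, nxt in zip(stage_features, stage_features[1:])
    )


# Three consecutive stages with 2-dim features (illustrative values).
smooth = temporal_smoothing_loss([[0.0, 0.0], [1.0, -1.0], [1.0, 1.0]])
```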

[0072] Step S52: Construct an association map by utilizing the anatomical topological relationships and pathological co-occurrence relationships of the digestive tract through spatial coding to enhance the spatial correlation of site-specific features;

[0073] The association graph $G$ is constructed using the following formula (30):

[0074] $G = (\mathcal{V}, \mathcal{E})$ (30)

[0075] where, in formula (30), $\mathcal{V}$ is the set of nodes, in which each node corresponds to the label of one anatomical location of the digestive tract; $\mathcal{E}$ is the edge set, constructed from anatomical adjacency or pathological co-occurrence relationships to characterize the spatial relationships between sites;

[0076] Spatial augmentation feature H′ is generated using a graph attention network (GAT):

[0077] (31),

[0078] where, in formula (31), $A$ is the adjacency matrix of the association graph $G$, built from the edge set of $G$;

[0079] Site-specific features are enhanced through an attention mechanism:

[0080] (32),

[0081] where, in formula (32), the embedding vector of the anatomical site is used to generate a global vector; the site-conditional projection matrix is learnable; the attention mechanism operates on a key-value matrix; and the output is the specific feature of each anatomical site;

[0082] Step S53: the temporal features $H^{\text{time}}$ and the spatial augmentation features $H'$ are input into a cross-attention mechanism, which learns the mutual attention weights between the temporal and spatial aspects and performs interactive fusion to obtain the spatiotemporal joint features $F$:

[0083] $F = \mathrm{CrossAttn}\left(Q = H^{\text{time}},\ K = H',\ V = H'\right)$ (33),

[0084] where, in formula (33), $\mathrm{CrossAttn}(\cdot)$ denotes a 2-layer cross-attention network with 8 attention heads per layer and a hidden dimension of 256; the temporal features $H^{\text{time}}$ serve as the query, and the spatial augmentation features $H'$ serve as the keys and values, capturing the correlation between disease stages and locations;

[0085] After the spatiotemporal joint features are obtained through fusion, a contrastive learning loss is applied, which maximizes the similarity between positive sample pairs and minimizes the similarity between the anchor and the negative samples. The specific calculation formula (34) is as follows:

[0086] (34),

[0087] where, in formula (34), the positive sample pair consists of two spatiotemporal feature representations, typically different views of the same disease site or the same disease stage; the negative sample set contains the samples whose spatiotemporal features do not belong to the same disease site or stage as the anchor; $\mathrm{sim}(\cdot, \cdot)$ is the similarity measurement function; $\tau$ is the temperature coefficient used to adjust the smoothness of the contrastive learning loss.
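A contrastive loss of this kind is commonly written in the InfoNCE form, sketched below. Using cosine similarity as the similarity function, including the positive pair in the denominator, and the default temperature of 0.1 are assumptions of this sketch rather than details given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss for one anchor, one positive, and a list of negatives."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


# The positive shares the anchor's direction; the negative is orthogonal.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Roles swapped: the "positive" is orthogonal and the negative is aligned.
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the positive pair is more similar than the negatives and grows when a negative sample is closer to the anchor than the positive; a smaller temperature sharpens this contrast.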

[0088] Preferably, step S6 further includes step S61: based on the spatiotemporal joint fusion features $F_i$, a multi-task learning framework classifies and predicts the patient's disease stage or specific pathological type and outputs the predicted probability $p_i$:

[0089] $p_i = \mathrm{softmax}\left(\mathrm{MLP}(F_i)\right)$ (35),

[0090] Based on the predicted probability $p_i$, the cross-entropy loss $\mathcal{L}_{\text{CE}}$ is calculated:

[0091] $\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c \in C} y_{i,c} \log p_{i,c}$ (36)

[0092] where, in formulas (35) and (36), $F_i$ is the spatiotemporal joint fusion feature of the $i$-th sample; MLP is the multilayer perceptron of the multi-task learning framework; $N$ is the total number of samples; $C$ is the set of categories covering all digestive tract sites and disease stages; $y_{i,c}$ indicates whether the true label of the $i$-th sample belongs to category $c$; $p_{i,c}$ is the predicted probability that the $i$-th sample belongs to category $c$;

[0093] Step S62: based on the spatiotemporal joint fusion features $F_i$, the multi-task learning framework obtains the risk score $r_i$ using a multilayer perceptron (MLP):

[0094] $r_i = \mathrm{MLP}(F_i)$ (37)

[0095] Based on the predicted risk score $r_i$, the Cox partial likelihood loss $\mathcal{L}_{\text{Cox}}$ is calculated using the Cox proportional hazards model:

[0096] $\mathcal{L}_{\text{Cox}} = -\sum_{i \in E} \left( r_i - \log \sum_{j \in R_i} \exp(r_j) \right)$ (38),

[0097] where, in formulas (37) and (38), $r_i$ is the predicted risk score of the $i$-th sample, generated by the multilayer perceptron (MLP) of the multi-task learning framework from the spatiotemporal joint fusion feature $F_i$ of that sample; $E$ is the set of indices of patients for whom the event occurred; the risk set $R_i$ is the set of indices of patients who were still at risk when the event of patient $i$ occurred; $\mathcal{L}_{\text{Cox}}$ is the Cox partial likelihood loss;
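The Cox partial likelihood loss can be sketched directly from its definition. The argument names `times` and `events` are hypothetical, ties are handled in the simple Breslow style (every patient with a follow-up time at or after the event time is in the risk set), and the loss is left unnormalized; these choices are assumptions of the sketch.

```python
import math


def cox_partial_likelihood_loss(risk_scores, times, events):
    """Negative Cox partial log-likelihood.

    risk_scores[i]: predicted risk score of patient i.
    times[i]: follow-up time of patient i.
    events[i]: True if the event was observed, False if censored.
    """
    loss = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            continue  # censored patients only appear inside risk sets
        risk_set = [r for r, t in zip(risk_scores, times) if t >= t_i]
        loss -= risk_scores[i] - math.log(sum(math.exp(r) for r in risk_set))
    return loss


times = [1.0, 2.0, 3.0]
events = [True, True, False]
flat = cox_partial_likelihood_loss([0.0, 0.0, 0.0], times, events)
concordant = cox_partial_likelihood_loss([2.0, 1.0, 0.0], times, events)
discordant = cox_partial_likelihood_loss([0.0, 1.0, 2.0], times, events)
```

Assigning higher risk scores to patients whose events occur earlier lowers the loss, which is the ranking behavior the survival branch of the multi-task framework is trained toward.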

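[0098] Step S63: the cross-entropy loss $\mathcal{L}_{\text{CE}}$, the Cox partial likelihood loss $\mathcal{L}_{\text{Cox}}$, the overall reconstruction loss $\mathcal{L}_{\text{rec}}$, the temporal smoothing loss $\mathcal{L}_{\text{smooth}}$, the contrastive learning loss $\mathcal{L}_{\text{con}}$, and the multi-head attention diversity regularization loss $\mathcal{L}_{\text{div}}$ are weighted and jointly optimized. The specific calculation formula is as follows:

[0099] $\mathcal{L}_{\text{total}} = \lambda_{\text{CE}} \mathcal{L}_{\text{CE}} + \lambda_{\text{Cox}} \mathcal{L}_{\text{Cox}} + \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{smooth}} \mathcal{L}_{\text{smooth}} + \lambda_{\text{con}} \mathcal{L}_{\text{con}} + \lambda_{\text{div}} \mathcal{L}_{\text{div}}$ (39)

[0100] where, in formula (39), $\lambda_{\text{CE}}$, $\lambda_{\text{Cox}}$, $\lambda_{\text{rec}}$, $\lambda_{\text{smooth}}$, $\lambda_{\text{con}}$, $\lambda_{\text{div}}$ are the weighting coefficients of the respective losses.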

[0101] An electronic device includes a memory and a processor, wherein the memory stores a computer program that can run on the processor, and the processor executes the computer program to implement the steps of the method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases.

[0102] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method for establishing a diagnostic and treatment system for multimodal information on digestive system diseases.

[0103] Compared with the prior art, the beneficial effects of the present invention are: 1) It can realize the support of large-scale data sharing, full-cycle data security, and full-process intelligent application, providing high-quality clinical intelligent assistance for patients with digestive system tumors, thereby realizing integrated management of the entire course of tumor diseases and giving full play to the advantages of the health intelligence platform in early screening, accurate diagnosis and individualized treatment.

[0104] 2) The health intelligence platform can also enhance the awareness of medical staff and patients about the platform through various forms such as academic exchanges, case sharing and patient education, encourage their active participation, form a positive doctor-patient interaction, and accelerate the widespread application and popularization of the health intelligence platform in clinical practice.

[0105] 3) In addition, multi-modal data such as multi-omics, pathological slides, images, and clinical information can be dynamically integrated on the health intelligence platform to form a dynamic and interactive panoramic display at the molecular-cellular-organ-system level. This helps to reveal biological networks, disease evolution mechanisms, and therapeutic targets, providing a powerful data foundation and intelligent analysis tools for personalized diagnosis and treatment and target discovery.

Attached Figure Description

[0106] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0107] Figure 1 is a flowchart illustrating the method for establishing a multimodal information diagnostic and treatment system for digestive system diseases according to the present invention.

Detailed Implementation

[0108] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0109] Referring to Figure 1, the first aspect of this invention provides a method for establishing a diagnostic and treatment system for multimodal information on digestive system diseases, comprising the following steps:

[0110] S1. Collect basic clinical information, endoscopic images, radiographic images, pathological slide images, and multi-omics information from patients, and then label and preprocess them.

[0111] Furthermore, the basic clinical information mentioned in step S1 is the patient's basic clinical information (e.g., age, sex, complete blood count, biochemical indicators, family medical history) at a given stage of digestive system disease (healthy, inflammatory, precancerous lesions, early-stage cancer, advanced-stage cancer); endoscopic images include gastroscopy, colonoscopy, and capsule endoscopy images; radiological images include computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography-computed tomography (PET-CT) images; pathological slide images include microscopic images of pathological slides derived from the patient's digestive tract tissue; and multi-omics information includes, but is not limited to, metagenomics, metabolomics, transcriptomics, proteomics, and peripheral blood whole genome information.

[0112] The labeling in step S1 includes disease stage labeling and anatomical site labeling. Disease stage labeling proceeds as follows: based on clinical guidelines and using pathological diagnosis as the gold standard, the disease stages are labeled as healthy (t=0), inflammatory (t=1), precancerous lesions (t=2), early cancer (t=3), and advanced cancer (t=4). Anatomical site labeling proceeds as follows: labels s∈{1,2,…,L} are assigned to the anatomical sites of the digestive tract. The label assignment follows the international anatomical nomenclature standard Terminologia Anatomica (TA), and the sites include, but are not limited to, the cervical esophagus, thoracic esophagus, abdominal esophagus, cardia, gastric fundus, gastric body, gastric antrum, pylorus, duodenum, jejunum, ileum, cecum, appendix, ascending colon, transverse colon, descending colon, sigmoid colon, and rectum.

[0113] The preprocessing of the basic clinical information, endoscopic images, imaging images, and pathological slide images mentioned in step S1 includes noise reduction, standardization, and normalization. The preprocessing of the multi-omics information mentioned in step S1 includes feature filtering, batch effect correction, standardization, and normalization.

[0114] More specifically, feature filtering: low-expression features are removed by threshold filtering, while features that are differentially expressed in more than 20% of the samples are retained. Here, low-expression features are features within a given omics layer (such as the expression level of the RAR protein in proteomics) whose expression level is extremely low, zero, or close to the background value in the vast majority of samples. Such features mostly result from sequencing noise or technical errors; they contribute very little to downstream analysis and instead add noise.
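The threshold filter can be sketched as a per-feature count over the samples. This sketch interprets "retained" as "expressed above a background threshold in more than 20% of samples", which is an assumption; the function name, the `threshold` parameter, and the row-per-sample layout are hypothetical.

```python
def filter_low_expression(matrix, threshold, min_fraction=0.2):
    """Return the column indices of features kept after low-expression filtering.

    matrix: list of sample rows, each a list of feature values.
    threshold: background level above which a feature counts as expressed.
    min_fraction: a feature is kept if it is expressed in MORE than this
                  fraction of samples.
    """
    n_samples = len(matrix)
    kept = []
    for j in range(len(matrix[0])):
        expressed = sum(1 for row in matrix if row[j] > threshold)
        if expressed / n_samples > min_fraction:
            kept.append(j)
    return kept


# Five samples, two features: feature 0 is expressed in 1/5 of samples,
# feature 1 in 2/5 (illustrative values).
matrix = [[5.0, 5.0], [0.0, 5.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
kept = filter_low_expression(matrix, threshold=1.0)
```

Feature 0 sits exactly at the 20% boundary and is dropped because the criterion is strict, while feature 1 is retained.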

[0115] Batch effect correction: the ComBat algorithm is applied to eliminate experimental batch effects by estimating the feature-specific offset and scaling factor caused by batch differences. The correction formula is as follows:

[0116] (1),

[0117] where, in formula (1), $X \in \mathbb{R}^{n \times p}$ is the original representation matrix of the multi-omics information, that is, an $n \times p$ matrix; $n$ is the sample size, specifically the number of patient samples in the cohort before batch correction; $p$ is the number of features, that is, the number of dimensions of the multimodal or multi-omics information corresponding to each sample; the left-hand side is the corrected value of a given feature of a given sample in a given batch; the remaining terms are the sample offset, the batch effect term, and the standardization factor;

[0118] Standardization: each feature column of the corrected matrix $X^{*}$ is standardized using the following formula (2) to obtain the standardized expression value $x^{\text{std}}_{ij}$:

[0119] $x^{\text{std}}_{ij} = \frac{x^{*}_{ij} - \mu_j}{\sigma_j}$ (2),

[0120] where, in formula (2), $x^{*}_{ij}$ is the corrected value of the $j$-th feature of the $i$-th sample, $\mu_j$ is the mean of the $j$-th column, and $\sigma_j$ is the standard deviation of the $j$-th column.

[0121] Normalization:

[0122] For the standardized expression matrix, min-max normalization is used to scale each feature to the interval [0, 1]. The normalization formula is as follows:

[0123] $x^{\text{norm}}_{ij} = \frac{x^{\text{std}}_{ij} - \min_j}{\max_j - \min_j}$ (3),

[0124] where, in formula (3), $x^{\text{norm}}_{ij}$ is the normalized value of the $j$-th feature of the $i$-th sample; $x^{\text{std}}_{ij}$ is the standardized expression value; $\min_j$ and $\max_j$ are the minimum and maximum values of the $j$-th feature over all samples after standardization.
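Column-wise standardization followed by min-max normalization can be sketched in plain Python. Using the population standard deviation and mapping constant columns to 0.0 are assumptions of this sketch; the function names are hypothetical.

```python
import math


def zscore_columns(matrix):
    """Standardize each feature column to zero mean and unit variance."""
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - mu) ** 2 for x in col) / n)
            for col, mu in zip(columns, means)]
    return [[(x - mu) / sd if sd > 0 else 0.0
             for x, mu, sd in zip(row, means, stds)]
            for row in matrix]


def minmax_columns(matrix):
    """Scale each feature column to the interval [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, lows, highs)]
            for row in matrix]


# Three samples, two features on very different scales (illustrative values).
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = minmax_columns(zscore_columns(raw))
```

After both steps the two features occupy the same [0, 1] range regardless of their original units, which keeps any single omics feature from dominating the downstream feature extraction.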

[0125] In specific implementation, normalization processing can also be achieved by selecting common methods in the field, such as quantile normalization, according to the data distribution. This invention does not limit this to any particular method.

[0126] S2. Extract the feature vectors of the preprocessed multimodal information and embed labels into the multimodal information according to the labeled information.

[0127] Furthermore, extracting the feature vectors of the multimodal information in step S2 includes:

[0128] Feature vectors are extracted from the preprocessed radiological images: a 3D-ResNet50 model is used to capture three-dimensional spatial information. The image tensor $X_{\text{rad}}$ of the preprocessed radiological image is input into the 3D-ResNet50 model, which generates the image feature vector $h_{\text{rad}}$:

[0129] $h_{\text{rad}} = \text{3D-ResNet50}\left(X_{\text{rad}}\right)$ (4),

[0130] where, in formula (4), $X_{\text{rad}}$ is the preprocessed 3D image data (such as CT, MRI, PET-CT). The 3D-ResNet50 model extracts spatial features through 3D convolutions and residual connections, generating a $d$-dimensional feature vector $h_{\text{rad}}$ that characterizes the texture, shape, and spatial distribution of the digestive tract organs;

[0131] Extracting feature vectors from the endoscopic images: a 2D-ResNet50 model combined with attention pooling is applied to each spatial location $k$ of the preprocessed endoscopic image. First, the unnormalized attention score $w^{\top} h_k$ is calculated,

[0132] then softmax normalization is performed to ensure $\sum_{k} a_k = 1$ and to obtain the attention coefficient $a_k$ of spatial location $k$:

[0133] $a_k = \dfrac{\exp(w^{\top} h_k)}{\sum_{j} \exp(w^{\top} h_j)}$ (5),

[0134] where, in formula (5) above, $h_k$ is the two-dimensional feature vector extracted from the endoscopic image at location $k$; $w$ is a learnable attention weight vector used to evaluate the importance of the location; and the denominator is the sum of the exponentiated scores over all locations $j$;

[0135] The contribution of each region to the global features is thus calculated using the softmax function;

[0136] The features at all locations are summed, weighted by the attention coefficients $a_k$, to generate the global feature vector $f_{\mathrm{endo}}$ of the endoscopic image:

[0137] $f_{\mathrm{endo}} = \sum_{k} a_k h_k$ (6),

[0138] where, in formula (6) above, $f_{\mathrm{endo}}$ is the global feature vector aggregated by the attention coefficients, which represents the key lesions and tissue features of the endoscopic image. In this way, a global feature vector that highlights the key lesion areas is extracted from the preprocessed endoscopic image.
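
The attention pooling described above can be sketched as follows. The per-location features and the attention weight vector are supplied by hand here; in the described system they would come from the 2D-ResNet50 backbone and from training. The function name is hypothetical.

```python
import math

def attention_pool(features, w):
    """Softmax-weighted sum of per-location feature vectors.

    features: list of K feature vectors, one per spatial location.
    w: attention weight vector (learned in practice, fixed here).
    Returns (attention coefficients, pooled global feature vector).
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    coeffs = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(a * h[c] for a, h in zip(coeffs, features)) for c in range(dim)]
    return coeffs, pooled

# Three locations with 2-dimensional features; w favors the first coordinate.
coeffs, pooled = attention_pool([[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [1.0, 0.0])
```

Locations whose features align with the weight vector receive larger coefficients and therefore dominate the pooled vector.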

[0139] Extracting feature vectors from the pathological slide images: a clustering-constrained attention multiple instance learning model (CLAM) is used to generate the feature vector $f_{\mathrm{path}} = \mathrm{CLAM}(\mathrm{patches})$ from the preprocessed pathological slide images, where patches denotes the collection of image patches cropped and selected from the high-resolution pathological sections;

[0140] Extracting features from the basic clinical information: an embedding layer combined with a multilayer perceptron (MLP) is used to generate the feature vector $f_{\mathrm{clin}} = \mathrm{MLP}(\mathrm{Embed}(c))$ from the preprocessed basic clinical information, where $c$ is the set of structured clinical variables (age, sex, complete blood count, biochemical indicators, family medical history, etc.) and $\mathrm{Embed}(c)$ is the continuous vector obtained after embedding the categorical variables.

[0141] Furthermore, the specific process for extracting the feature vector of the multi-omics information in step S2 is as follows: the preprocessed omics information vector $x_{\mathrm{omics}}$ is input into a variational autoencoder (VAE) to extract the low-dimensional feature vector $f_{\mathrm{omics}}$,

[0142] $f_{\mathrm{omics}} = \mathrm{VAE}(x_{\mathrm{omics}}), \quad z \sim q(z \mid x_{\mathrm{omics}}) = \mathcal{N}(\mu, \sigma^{2} I)$ (7),

[0143] where, in formula (7) above, the VAE maps $x_{\mathrm{omics}}$ to the low-dimensional feature vector $f_{\mathrm{omics}}$, achieving dimensional compression and unification; $z$ is the latent representation of the VAE, generated from the posterior probability distribution $q(z \mid x_{\mathrm{omics}})$, which is parameterized by the mean vector $\mu$ and the standard deviation vector $\sigma$; the standard deviation vector is combined with the identity matrix $I$ to construct the diagonal covariance matrix.

[0144] The aforementioned VAE can effectively reduce the dimensionality of omics data, retain core biological characteristics, and provide high-quality, low-dimensional feature vectors for multimodal fusion.

[0145] Furthermore, the specific process of embedding labels in step S2 is as follows. Embedding of disease stage labels: in accordance with clinical guidelines and pathological diagnostic criteria, five labels $t$ are established for the different stages of gastrointestinal disease, corresponding respectively to the healthy, inflammatory, precancerous lesion, early-stage cancer, and advanced-stage cancer stages of the disease process. The embedding function $\mathrm{Embed}_{\mathrm{stage}}$ generates the stage embedding vector $e_t$:

[0146] $e_t = \mathrm{Embed}_{\mathrm{stage}}(t)$ (8),

[0147] where, in formula (8) above, $t$ is the disease stage label, representing the different stages of the disease; the embedding function $\mathrm{Embed}_{\mathrm{stage}}$ is a learnable embedding layer that maps the label $t$ to the $d$-dimensional continuous stage embedding vector $e_t$ (that is, $e_t \in \mathbb{R}^{d}$, where $\mathbb{R}^{d}$ denotes the set of all $d$-dimensional real vectors); the vector $e_t$ encodes the characteristics of the disease stage, which facilitates capturing disease progression trends in subsequent analysis;

[0148] Embedding of anatomical site labels: a label $s \in \{1, \dots, S\}$ is defined for each region of the digestive tract (according to the Terminologia Anatomica standard), where $S$ is the total number of digestive tract sites, and the label is mapped by the embedding function $\mathrm{Embed}_{\mathrm{site}}$ to the site embedding vector $e_s$:

[0149] $e_s = \mathrm{Embed}_{\mathrm{site}}(s)$ (9),

[0150] where, in formula (9) above, $s$ is the anatomical site label, representing a specific anatomical region of the digestive tract; the embedding function $\mathrm{Embed}_{\mathrm{site}}$ is a learnable embedding layer that maps the label $s$ to the $d$-dimensional continuous site embedding vector $e_s$ (that is, $e_s \in \mathbb{R}^{d}$, where $\mathbb{R}^{d}$ denotes the set of all $d$-dimensional real vectors); the vector $e_s$ encodes the anatomical characteristics of the site (such as spatial location and tissue structure) to facilitate subsequent multimodal feature fusion; $d$ is the dimension of the embedding vector, which is determined according to the complexity of the modal features and the needs of the subsequent fusion model. Typical values are 128 or 256, balancing computational efficiency and feature representation ability.

[0151] S3. Concatenate the extracted multimodal information feature vectors with the embedded labels of the multimodal information, and map the concatenated high-dimensional vector to a unified dimension d to obtain an enhanced feature vector containing modality-specific information, anatomical location information, and disease stage information.

[0152] Furthermore, the specific concatenation process described in step S3 is as follows:

[0153] The feature vectors of the multimodal information extracted in step S2 (the low-dimensional multi-omics feature vector $f_{\mathrm{omics}}$, the radiographic image feature vector $f_{\mathrm{rad}}$, the endoscopic image feature vector $f_{\mathrm{endo}}$, the pathological slide image feature vector $f_{\mathrm{path}}$, and the clinical information feature vector $f_{\mathrm{clin}}$) are each concatenated with the embedded labels of the multimodal information. A linear transformation function $\mathrm{Linear}$ then maps the concatenated high-dimensional vector to the unified dimension $d$, yielding the enhanced feature vector $\tilde{f}_m$ containing modality-specific information, anatomical location information, and disease stage information,

[0154] $\tilde{f}_m = \mathrm{Linear}([f_m ; e_s ; e_t])$ (10)

[0155] where, in formula (10) above, $f_m$ is the original feature vector extracted from the $m$-th modality, $e_s$ is the anatomical site embedding vector, and $e_t$ is the disease stage embedding vector.

[0156] By splicing and linear mapping, this invention deeply integrates multimodal information features with anatomical location and disease stage information to generate a unified dimension feature representation, providing feature input for subsequent data fusion and disease diagnosis models.
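
A minimal sketch of the label embedding, concatenation, and linear mapping described in steps S2 and S3 follows. The embedding tables and the projection are randomly initialized stand-ins for learned parameters, and the helper names (`make_table`, `make_linear`, `enhance`), the toy dimension, and the number of sites are assumptions for illustration only.

```python
import random

def make_table(num_labels, dim, rng):
    """Embedding table: one vector per label (random here, learned in practice)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_labels)]

def make_linear(in_dim, out_dim, rng):
    """Return a function x -> W x + b with a randomly initialized W."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    def apply(x):
        assert len(x) == in_dim
        return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]
    return apply

def enhance(feature, site, stage, site_table, stage_table, linear):
    """Concatenate [feature ; site embedding ; stage embedding], then project to d."""
    return linear(feature + site_table[site] + stage_table[stage])

rng = random.Random(0)
d = 8                                # toy unified dimension (the text suggests 128 or 256)
site_table = make_table(6, d, rng)   # 6 sites is an arbitrary toy value
stage_table = make_table(5, d, rng)  # 5 stages: healthy through advanced cancer
raw = [rng.gauss(0.0, 1.0) for _ in range(12)]  # a modality feature of length 12
linear = make_linear(12 + 2 * d, d, rng)
enhanced = enhance(raw, 2, 3, site_table, stage_table, linear)
```

Because every modality is projected to the same dimension d, the enhanced vectors can later be stacked row by row into a single feature matrix.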

[0157] Multimodal information often differs significantly in information scale and statistical distribution (images are high-dimensional spatial textures, omics data are high-dimensional sparse vectors, and clinical information is structured tables), and it contains complementary, redundant, and even contradictory information. A multi-head attention fusion mechanism is therefore employed. This mechanism automatically assigns weights to each modality / patch based on task relevance, avoiding the information dilution caused by "average fusion". It allows dependencies and complementarities between modalities to be discovered at a fine-grained level while suppressing the influence of noisy modalities.

[0158] S4. Fuse the enhanced feature vectors of all the multimodal information concatenated in step S3 to form a multimodal feature matrix. Perform a linear mapping on the fused multimodal feature matrix using projection matrices to obtain the query, key, and value of the multi-head attention mechanism. Construct a mask matrix and calculate the weights of the multi-head attention mechanism. After weighted aggregation, obtain a global fusion vector. Group the global fusion vectors by their combination of anatomical location and disease stage, and aggregate each group to generate a fusion vector sequence.

[0159] Furthermore, step S4 also includes step S41: the enhanced feature vectors $\tilde{f}_m$ of all the multimodal information concatenated in step S3 are fused to form the matrix $X$:

[0160] $X = [\tilde{f}_1 ; \tilde{f}_2 ; \dots ; \tilde{f}_M] \in \mathbb{R}^{M \times d}$ (11),

[0161] where, in formula (11) above, $X$ is the multimodal feature matrix obtained by concatenation and fusion, with dimensions $M \times d$; $M$ is the number of modalities, and $d$ is the feature dimension;

[0162] The concatenated matrix $X$ is linearly mapped by the projection matrices $W^{Q}$, $W^{K}$, and $W^{V}$ to obtain the query $Q$, key $K$, and value $V$ required by the multi-head attention mechanism:

[0163] $Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V}$ (12),

[0164] where, in formula (12) above, $Q$, $K$, and $V$ are respectively the query, key, and value obtained from the multimodal matrix $X$; $W^{Q}$, $W^{K}$, and $W^{V}$ are trainable mapping matrices used to generate the query, key, and value, respectively;

[0165] A mask matrix $M_{\mathrm{mask}}$ is constructed to mask invalid information in the multi-head attention mechanism; the attention weights calculated after masking are denoted $A_i$:

[0166] $A_i = \mathrm{softmax}\left(\dfrac{Q_i K_i^{\top}}{\sqrt{d_k}} + M_{\mathrm{mask}}\right), \quad Q_i = Q W_i^{Q}, \; K_i = K W_i^{K}, \; V_i = V W_i^{V}$ (13)

[0167] where, in formula (13) above, $M_{\mathrm{mask}}$ is the mask matrix, in which the rows and columns corresponding to missing modal information are assigned the minimum value and all other entries are 0; $i$ is the attention head index of the multi-head attention mechanism ($i = 1, \dots, h$, where $h$ is the number of heads; $h = 8$ in this embodiment); $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the trainable projection matrices of attention head $i$; $Q_i$, $K_i$, and $V_i$ are the query, key, and value submatrices of the $i$-th head; $Q_i K_i^{\top}$ calculates the similarity between modal information (the larger the value, the stronger the correlation between the two modalities); $\sqrt{d_k}$ is the scaling factor, which prevents the inner product from becoming so large that the softmax gradient vanishes; softmax is normalized along the rows so that the attention weights in each row sum to 1.

[0168] The outputs of the $h$ attention heads are aggregated by weighting to obtain the global fusion vector $g$:

[0169] $g = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}$ (14)

[0170] where, in formula (14) above, $g$ is the global fusion vector of a sample, obtained by fusing the features of the various modalities together with the anatomical location and disease stage embeddings; $\mathrm{head}_i$ is the output vector of the $i$-th head; $\mathrm{Concat}$ indicates that all head outputs are concatenated along the feature dimension; and $W^{O}$ is a learnable weight matrix.
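
The masked attention weights can be illustrated with a single-head sketch. It is a simplification under stated assumptions: only the key columns of missing modalities are masked (rows of missing modalities would be ignored downstream), the constant `NEG` stands in for the "minimum value", and the query, key, and value are passed in directly rather than produced by trained projection matrices.

```python
import math

NEG = -1e9  # stands in for the "minimum value" written into masked positions

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(q, k, v, available):
    """Single-head scaled dot-product attention over M modality rows.

    q, k, v: M x d_k nested lists; available: M booleans.
    Key columns of missing modalities are masked so they receive zero weight.
    """
    scale = math.sqrt(len(q[0]))
    weights, out = [], []
    for qi in q:
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / scale + (0.0 if ok else NEG)
            for kj, ok in zip(k, available)
        ]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return weights, out

# Three modalities, the second one missing.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = masked_attention(x, x, x, [True, False, True])
```

Each row of `weights` still sums to 1, but the column belonging to the missing modality contributes nothing to the fused output.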

[0171] According to the corresponding anatomical location label $s$ and disease stage label $t$, the global fusion vectors belonging to the same anatomical location and disease stage are aggregated by pooling (such as average pooling or max pooling) to generate the fusion vector sequence $F_{s,t}$ of the corresponding anatomical location and disease stage. The calculation formula is as follows:

[0172] $F_{s,t} = \mathrm{Pool}\left(\{\, g_n \mid s_n = s,\ t_n = t \,\}\right)$ (15)

[0173] where, in formula (15) above, $g_n$ is the global fusion vector of the sample with index $n$; $s_n$ and $t_n$ are the anatomical location label and disease stage label of that sample; $\mathrm{Pool}$ is the pooling function, which can be average pooling, max pooling, or another statistical method; and $F_{s,t}$ is the fused sample vector set for anatomical site $s$ and disease stage $t$.
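
The grouping and pooling step can be sketched as follows, using average pooling as one of the options the text allows. The function name and the label values are hypothetical.

```python
from collections import defaultdict

def pool_by_site_stage(vectors, sites, stages):
    """Average the fusion vectors that share the same (site, stage) pair."""
    groups = defaultdict(list)
    for vec, s, t in zip(vectors, sites, stages):
        groups[(s, t)].append(vec)
    return {
        key: [sum(col) / len(members) for col in zip(*members)]
        for key, members in groups.items()
    }

vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
pooled = pool_by_site_stage(vectors, ["antrum", "antrum", "esophagus"], [3, 3, 1])
```

Replacing the mean with `max(col)` in the comprehension would give max pooling instead.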

[0174] Furthermore, step S4 also includes step S42: a regularization loss is introduced to minimize the cosine similarity between the heads of the multi-head attention, giving the multi-head attention diversity regularization loss $\mathcal{L}_{\mathrm{div}}$:

[0175] $\mathcal{L}_{\mathrm{div}} = \sum_{i \neq j} \dfrac{\langle w_i, w_j \rangle}{\lVert w_i \rVert_2 \, \lVert w_j \rVert_2}$ (16), which prevents the attention regions of different attention heads from overlapping;

[0176] where, in formula (16) above, $w_i$ is the flattened vector of the weight matrix of the $i$-th attention head; $\langle w_i, w_j \rangle$ is the inner product of the weight vectors of the $i$-th head and the $j$-th head, used to measure the similarity between the two vectors; and $\lVert \cdot \rVert_2$ is the L2 norm, used for normalization when calculating the cosine similarity.
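
A minimal sketch of a cosine-similarity diversity penalty between attention heads is given below. Averaging over distinct head pairs is an assumption made for illustration; the text only states that the inter-head cosine similarity is minimized.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def head_diversity_loss(head_weights):
    """Mean pairwise cosine similarity over distinct pairs of flattened head weights."""
    h = len(head_weights)
    pairs = [(i, j) for i in range(h) for j in range(i + 1, h)]
    return sum(cosine(head_weights[i], head_weights[j]) for i, j in pairs) / len(pairs)

orthogonal_heads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant_heads = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
```

Heads with orthogonal weights incur no penalty, while heads whose weights point in the same direction incur the maximum penalty of 1.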

[0177] The intermediate fusion vector after multi-head attention concatenation and fusion is obtained by using a residual connection and LayerNorm,

[0178]

[0179] where, in formula (17) above, the left-hand side is the intermediate vector after the residual connection and normalization; the query of the multi-head attention mechanism is generated by the formula above; and the query is pooled along the modality dimension (for example by mean or maximum) so that it can form the residual with the global fusion vector,

[0180] The intermediate vector is then processed by a feedforward network (FFN) to obtain the final fused vector,

[0181] (18)

[0182] where, in formula (18) above, the left-hand side is the final output fusion vector, which is fed into the subsequent missing-modality reconstruction or downstream task modules; Layer Normalization normalizes the input vector to stabilize training; and the FFN (feedforward network) is typically a two-layer linear transformation with ReLU activation, used to increase the nonlinear expressive power of the model.

[0183] Because the actual clinical data collection process is uncontrollable and often involves missing or poor-quality modalities, direct splicing would inevitably lead to indiscriminate mixing, noise amplification, and a large number of discarded samples. Basic reconstruction can maintain feature completeness in modality-missing scenarios.

[0184] For each available modality, a modality decoder network (typically a multi-layer neural network) reconstructs the features of that modality, of dimension $d$, from the final output fused vector:

[0185] (19)

[0186] The mean squared error (MSE) loss is used to assess the quality of the basic reconstruction, and the mean squared error loss of the basic reconstruction is calculated:

[0187] (20), which sums over all samples and modalities;

[0188] where, in formula (20) above, $N$ is the total number of samples and $i$ denotes the $i$-th sample; $M$ is the total number of modalities and $m$ denotes the $m$-th modality; the true features of the $m$-th modality have dimension $d$; the modality indicator, evaluated through an indicator function, marks whether the $m$-th modality of a sample is available, a modality whose information is missing being treated as unavailable; and the L2 norm is used to calculate the basic reconstruction error.
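
The masked reconstruction error can be sketched as follows. Dividing the summed squared L2 error by the number of samples N is an assumption, since the exact normalization of formula (20) is not recoverable from the text; the function name is hypothetical.

```python
def masked_reconstruction_mse(true_feats, recon_feats, available):
    """Sum squared L2 errors over available (sample, modality) pairs, divided by N.

    true_feats, recon_feats: N x M x d nested lists; available: N x M booleans.
    """
    n = len(true_feats)
    total = 0.0
    for i in range(n):
        for m, ok in enumerate(available[i]):
            if not ok:
                continue  # missing modality: no target, so no penalty
            total += sum((t - r) ** 2
                         for t, r in zip(true_feats[i][m], recon_feats[i][m]))
    return total / n

true_feats = [[[1.0, 2.0], [0.0, 0.0]], [[3.0, 4.0], [5.0, 5.0]]]
recon_feats = [[[1.0, 1.0], [9.0, 9.0]], [[3.0, 4.0], [4.0, 5.0]]]
available = [[True, False], [True, True]]
loss = masked_reconstruction_mse(true_feats, recon_feats, available)
```

The large error on the missing second modality of the first sample is ignored, so only the two unit errors on available modalities contribute.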

[0189] The decoder network additionally outputs a variance estimate of the reconstructed features, which quantifies the reconstruction uncertainty; based on it, the uncertainty-weighted loss over all samples and modalities is calculated,

[0190] (21),

[0191] where, in formula (21) above, $N$ is the total number of samples and $i$ denotes the $i$-th sample; $M$ is the total number of modalities and $m$ denotes the $m$-th modality; and the variance term is the variance estimate of the reconstructed features of the $m$-th modality of the $i$-th sample;

[0192] For the set of reconstructed features and the set of true features of all available samples, the maximum mean discrepancy (MMD) is used to measure distribution consistency, and a Gaussian kernel function is used to calculate the distribution alignment loss over all modal information:

[0193]

[0194] where,

[0195] (23),

[0196] where the kernel is a Gaussian kernel function whose width is set based on the variance, mean, or median of the Euclidean distances between the feature vectors of all sample pairs, and can be further optimized by grid search on a validation set in practical applications; the number of available samples is counted for the $m$-th modality; and $p$ is the sample index,
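
A minimal sketch of a Gaussian-kernel MMD between a set of reconstructed features and a set of true features follows. It uses the biased estimator of the squared MMD and the median pairwise distance as the default kernel width; both are illustrative choices among those the text permits, and the function names are hypothetical.

```python
import math
from statistics import median

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def median_bandwidth(points):
    """Median heuristic: median Euclidean distance over all distinct pairs."""
    dists = [math.sqrt(sq_dist(points[i], points[j]))
             for i in range(len(points)) for j in range(i + 1, len(points))]
    positive = [v for v in dists if v > 0]
    return median(positive) if positive else 1.0

def gaussian_mmd2(xs, ys, sigma=None):
    """Biased estimate of the squared MMD between two samples (Gaussian kernel)."""
    if sigma is None:
        sigma = median_bandwidth(xs + ys)
    def k(a, b):
        return math.exp(-sq_dist(a, b) / (2.0 * sigma ** 2))
    def mean_k(us, vs):
        return sum(k(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

true_set = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted_set = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
```

The estimate is zero when the two sets coincide and grows as the reconstructed distribution drifts away from the true one.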

[0197] For a missing target modality, the true features of the observed modality are passed through a cross-modal mapping function to reconstruct the missing features, and the cross-modal distillation loss is calculated:

[0198] (24),

[0199] By summing over all samples and missing modalities, the information loss caused by missing modalities is mitigated.

[0200] where the mapping function from the observed modality $b$ to the missing target modality $a$ comprises an input layer (dimension $d$, ReLU activation), a hidden layer (256 nodes, ReLU activation), and an output layer (dimension $d$, linear activation), and is pre-trained on a full-modality dataset by supervised learning; the observed modality is selected based on its relevance to the missing modality (for example, feature similarity or mutual information); and the L1 norm is used to measure the distillation error;

[0201] The mean squared error loss $\mathcal{L}_{\mathrm{MSE}}$, the uncertainty-weighted loss $\mathcal{L}_{\mathrm{unc}}$, the distribution alignment loss $\mathcal{L}_{\mathrm{MMD}}$, and the cross-modal distillation loss $\mathcal{L}_{\mathrm{distill}}$ above are combined by weighted summation to yield the overall reconstruction loss $\mathcal{L}_{\mathrm{rec}}$. The calculation formula is as follows:

[0202] $\mathcal{L}_{\mathrm{rec}} = \alpha_1 \mathcal{L}_{\mathrm{MSE}} + \alpha_2 \mathcal{L}_{\mathrm{unc}} + \alpha_3 \mathcal{L}_{\mathrm{MMD}} + \alpha_4 \mathcal{L}_{\mathrm{distill}}$ (25),

[0203] where, in formula (25) above, $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ are respectively the weighting coefficients of the mean squared error loss, the uncertainty-weighted loss, the distribution alignment loss, and the cross-modal distillation loss, used to balance the contributions of the different losses; their initial values are set to 1.0, 0.5, 0.2, and 0.3, respectively.

[0204] Step S4 above effectively integrates multimodal information through a multi-head attention mechanism, achieving complementarity and selection of features between different modalities, improving the expressive power and robustness of the fused features. By introducing multi-head attention diversity regularization, redundancy between attention heads is effectively avoided, improving the model's ability to decouple and distinguish multi-source heterogeneous features, enhancing the statistical consistency between reconstructed features and original features, and improving the ability to recover missing modalities.

[0205] S5. Based on the global fusion vector sequence described in step S4, inject information about the disease stage to enhance the temporal information of the sequence. Through spatial coding, use the anatomical topological relationships of the digestive tract (such as the adjacency of the esophagus and stomach) and pathological co-occurrence relationships (such as the association of a lesion at one site with lesions at other sites) to strengthen the spatial correlation of the site-specific features. Through the cross-attention mechanism, deeply fuse the temporal information of the disease stages with the spatial information of the anatomical sites to obtain spatiotemporal joint features, which unify temporal and spatial information and support subsequent diagnosis and prediction.

[0206] Furthermore, step S5 also includes step S51: based on the global fusion vector sequence described in step S4, temporal information is injected for each stage of disease progression (healthy t=0, inflammation t=1, precancerous lesions t=2, early cancer t=3, and advanced cancer t=4), as follows:

[0207] First, for any two disease stage labels, the difference between stages is defined as:

[0208] (26)

[0209] Then, sine and cosine functions are used to generate the temporal position encoding vector,

[0210] (27)

[0211] where, in formula (26) above, each disease stage label is an integer from 0 to 4, and the difference between the label indices of two disease stages measures the relative distance between the stages; in formula (27) above, the frequency constants (k=1,…,16) control the periods of the sine and cosine functions so that the differences between disease stages are encoded distinguishably; the stage position encoding vector maps discrete stage differences to continuous features using sine and cosine functions, giving sufficient discriminative power between stages and providing stage temporal information to the subsequent Transformer model;
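
The sine-cosine encoding of a stage difference can be sketched as follows. The geometric frequency schedule with base 10000 and the use of a signed difference are assumptions borrowed from common Transformer positional encodings; the text only specifies 16 frequency constants.

```python
import math

NUM_FREQS = 16  # k = 1..16 as in the text, giving a 32-dimensional code

def stage_difference_encoding(stage_a, stage_b, base=10000.0):
    """Encode the signed stage difference with sine/cosine pairs."""
    delta = stage_a - stage_b
    code = []
    for k in range(1, NUM_FREQS + 1):
        omega = base ** (-(k - 1) / NUM_FREQS)  # assumed geometric frequency schedule
        code.append(math.sin(omega * delta))
        code.append(math.cos(omega * delta))
    return code

# Encodings of the difference between each stage (0..4) and the healthy stage.
codes = [stage_difference_encoding(t, 0) for t in range(5)]
```

A zero difference yields the constant pattern (0, 1, 0, 1, ...), and each nonzero difference yields a distinct code.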

[0212] The input to the Transformer model is the global fusion vector sequence together with the positional encodings. By using the Transformer model to capture the dependencies between disease stages (such as the influence of the inflammation stage on precancerous lesions), the temporal features of the evolution across disease stages are obtained:

[0213] (28), which integrates the stage evolution characteristics from healthy to advanced cancer and helps capture the dynamic dependencies between disease stages.

[0214] where, in formula (28) above, the Transformer model has a 4-layer encoder with 8 attention heads per layer and a hidden dimension of 256, and its input is the global fusion vector sequence.

[0215] The L1 norm is used to limit drastic fluctuations in features between adjacent stages, and the temporal smoothing loss of the temporal features is calculated:

[0216] (29)

[0217] This ensures a smooth transition between the features of adjacent disease stages (such as inflammation to precancerous lesions), models the gradual evolution of the disease, and prevents the model from learning abrupt jumps or spurious stage associations.
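
A minimal sketch of an L1 temporal smoothing penalty over consecutive stage features follows. Summing the L1 distances without further normalization is an assumption; the function name is hypothetical.

```python
def temporal_smoothing_loss(stage_features):
    """Sum of L1 distances between the features of consecutive disease stages."""
    return sum(
        sum(abs(b - a) for a, b in zip(prev, nxt))
        for prev, nxt in zip(stage_features, stage_features[1:])
    )

# Features for three consecutive stages.
smooth_loss = temporal_smoothing_loss([[0.0, 0.0], [1.0, 1.0], [1.0, 3.0]])
```

A sequence whose features do not change between stages has zero loss, so the penalty grows only with stage-to-stage jumps.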

[0218] This step effectively models the dynamic evolution of disease stages through temporal encoding and attention mechanisms, and enhances the continuity and clinical relevance of features by incorporating temporal smoothing constraints. The parameters mentioned above (such as the number of Transformer layers, the number of attention heads, and the hidden dimension) can be flexibly adjusted according to actual task requirements and computational resources. The specific value can also be optimally searched through cross-validation.

[0219] Furthermore, step S5 also includes step S52: constructing an association map by using spatial coding to utilize the anatomical topological relationships (such as the esophagus and stomach being adjacent) and pathological co-occurrence relationships (such as the association between lesions in a certain part and other parts) of the digestive tract, thereby strengthening the spatial association of site-specific features.

[0220] The association graph $G$ is constructed using the following formula (30):

[0221] $G = (V, E)$ (30)

[0222] where, in formula (30) above, $V$ is the set of nodes, each node corresponding to one anatomical site of the digestive tract (for example the thoracic esophagus, gastric body, or duodenum); $E$ is the set of edges, which characterizes the spatial relationships between sites and is constructed based on anatomical adjacency (such as the esophagus and stomach) or pathological co-occurrence relationships (such as the association between gastric body lesions and duodenal lesions);

[0223] Spatial augmentation feature H′ is generated using a graph attention network (GAT):

[0224] (31),

[0225] where, in formula (31) above, the adjacency matrix of the association graph G is constructed from the edge set of G; the graph attention network (configured as a 2-layer network with 8 attention heads and a LeakyReLU activation function with slope 0.2; the parameters can be adjusted to the data scale) generates the spatial augmentation features H′ by learning the attention weights between sites, integrating the feature information of each site with that of its adjacent or related sites;

[0226] Site-specific features are enhanced through an attention mechanism:

[0227] (32),

[0228] where, in formula (32) above, the anatomical site embedding vector is used to generate a global vector; the site-conditional projection matrix is learnable and maps the anatomical site embedding vector to the attention space; the key and value matrices are those of the attention mechanism; the site-conditional attention weights are normalized by softmax to focus on the key features related to the site; and the output is the site-specific feature, which incorporates site context information (for example, gastric body features focus more on gastric-related pathological characteristics).

[0229] This step enhances site-specific features by constructing an association graph structure and using attention mechanisms, effectively capturing the spatial correlation of digestive tract sites and enhancing the anatomical specificity of features.

[0230] Furthermore, step S5 also includes step S53: the temporal features (containing information on disease stage evolution) and the spatial augmentation features H′ (containing site association information) are input to the cross-attention mechanism, which learns the mutual attention weights between temporal and spatial information and fuses them interactively to obtain the spatiotemporal joint features:

[0231] (33),

[0232] where, in formula (33) above, the cross-attention network has 2 layers with 8 attention heads per layer and a hidden dimension of 256 (the parameters can be adjusted to the data size); the temporal features serve as the query and the spatial augmentation features serve as the key and value (or, alternatively, the spatial augmentation features serve as the query and the temporal features as the key and value), capturing the correlation between disease stages and anatomical sites (such as the correlation between early-stage cancer and the gastric antrum).

[0233] After the spatiotemporal joint features are obtained through fusion, a contrastive learning loss is applied: by maximizing the similarity between positive sample pairs and minimizing the similarity between the anchor and the negative samples, the discriminability and consistency of the features are further enhanced. The specific calculation is given by formula (34):

[0234] (34),

[0235] where, in formula (34) above, the two spatiotemporal feature representations that form a positive sample pair usually come from different views of the same disease site or the same disease stage; the negative sample set contains the spatiotemporal features that do not belong to the same disease site or stage as the anchor; the similarity measure can be, for example, cosine similarity; and the temperature coefficient adjusts the smoothness of the contrastive learning loss.
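
A contrastive loss of this kind can be sketched in the widely used InfoNCE form, with cosine similarity and a temperature coefficient. The InfoNCE form itself, the default temperature of 0.1, and the function names are assumptions for illustration; the text does not give the exact expression.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log of the positive pair's share of the softmax mass."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)  # log-sum-exp shift for numerical stability
    log_denom = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_denom - pos

anchor = [1.0, 0.0]
aligned = contrastive_loss(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
confused = contrastive_loss(anchor, [0.0, 1.0], [[1.0, 0.0]])
```

The loss is close to zero when the anchor matches its positive and differs from the negatives, and large when a negative is more similar to the anchor than the positive is.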

[0236] This method uses spatiotemporal interaction and contrastive learning loss to deeply interact the temporal information of disease stages with the spatial information of anatomical sites, achieving spatiotemporal integration and generating a unified feature representation that integrates "when (stage)" and "where (spatial location)" information, which is beneficial to significantly improve the accuracy of disease diagnosis and prognosis.

[0237] S6. Based on the spatiotemporal joint fusion features obtained in step S5, use a multi-task learning framework to classify and predict the disease stage or specific pathological type of the patient, and improve stability and generalization ability by co-optimizing multiple losses, which is conducive to outputting the best diagnostic results.

[0238] Furthermore, step S6 also includes step S61: based on the spatiotemporal joint fusion features $u_i$, the multi-task learning framework classifies and predicts the patient's disease stage or specific pathological type and outputs the predicted probability $\hat{p}_i$:

[0239] $\hat{p}_i = \mathrm{MLP}(u_i)$ (35),

[0240] Based on the predicted probability $\hat{p}_i$, the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ is calculated:

[0241] $\mathcal{L}_{\mathrm{CE}} = -\dfrac{1}{N} \sum_{i=1}^{N} \sum_{c \in C} y_{i,c} \log \hat{p}_{i,c}$ (36), which is used to measure the classification error;

[0242] where, in formula (36) above, $u_i$ is the spatiotemporal joint fusion feature of the $i$-th sample; MLP is the multilayer perceptron of the multi-task learning framework, which can output the classification and stage categories in parallel; $N$ is the total number of samples; $C$ is the set of categories of digestive tract sites and disease stages (e.g., ascending colon, early cancer); $y_{i,c}$ indicates whether the true label of the $i$-th sample belongs to category $c$; and $\hat{p}_{i,c}$ is the predicted probability that the $i$-th sample belongs to category $c$.

[0243] Furthermore, step S6 also includes step S62: based on the spatiotemporal joint fusion features $u_i$, the risk score $r_i$ is obtained by a multilayer perceptron (MLP) of the multi-task learning framework:

[0244] $r_i = \mathrm{MLP}(u_i)$ (37)

[0245] Based on the predicted risk score $r_i$, the Cox partial likelihood loss $\mathcal{L}_{\mathrm{Cox}}$ is calculated using the Cox proportional hazards model:

[0246] $\mathcal{L}_{\mathrm{Cox}} = -\sum_{i \in \mathcal{E}} \left( r_i - \log \sum_{j \in \mathcal{R}_i} \exp(r_j) \right)$ (38),

[0247] where, in formula (38) above, $r_i$ is the risk score of the $i$-th sample, generated by the multilayer perceptron (MLP) of the multi-task learning framework from the spatiotemporal joint fusion feature $u_i$ of that sample; $\mathcal{E}$ is the index set of patients for whom an event (such as death or relapse) has occurred; and the risk set $\mathcal{R}_i$ is the set of indices of patients who were still at risk when the event of patient $i$ occurred. Optimizing the Cox partial likelihood loss improves the accuracy of risk ranking and yields survival and prognostic results that support physicians' survival analysis and prognostic decisions.
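
The Cox partial likelihood loss can be sketched as follows. The sketch assumes that the risk set of a patient consists of all patients whose follow-up time is at least that patient's event time, and it does not apply any special handling of tied event times; the function name and argument layout are hypothetical.

```python
import math

def cox_partial_likelihood_loss(risk_scores, times, events):
    """Negative Cox partial log-likelihood (no special handling of tied times).

    risk_scores: predicted risk per patient; times: follow-up times;
    events: 1 if the event (e.g. death or relapse) was observed, 0 if censored.
    """
    loss = 0.0
    for i, (t_i, had_event) in enumerate(zip(times, events)):
        if not had_event:
            continue  # censored patients only appear inside risk sets
        at_risk = [risk_scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        top = max(at_risk)  # log-sum-exp shift for numerical stability
        log_sum = top + math.log(sum(math.exp(r - top) for r in at_risk))
        loss -= risk_scores[i] - log_sum
    return loss

times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
concordant = cox_partial_likelihood_loss([2.0, 1.0, 0.0], times, events)
discordant = cox_partial_likelihood_loss([0.0, 1.0, 2.0], times, events)
```

Assigning higher risk scores to patients whose events occur earlier lowers the loss, which is the ranking behavior the training objective rewards.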

[0248] This step accurately estimates patient risk using the Cox proportional hazards model, supporting long-term prognostic prediction and personalized treatment decisions.

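[0249] Furthermore, step S6 also includes step S63: through weight adjustment, the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$, the Cox partial likelihood loss $\mathcal{L}_{\mathrm{Cox}}$, the comprehensive reconstruction loss $\mathcal{L}_{\mathrm{rec}}$, the temporal smoothing loss $\mathcal{L}_{\mathrm{smooth}}$, the contrastive learning loss $\mathcal{L}_{\mathrm{con}}$, and the multi-head attention diversity regularization loss $\mathcal{L}_{\mathrm{div}}$ are collaboratively optimized. The specific calculation formula is as follows:

[0250] $\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{CE}} + \lambda_2 \mathcal{L}_{\mathrm{Cox}} + \lambda_3 \mathcal{L}_{\mathrm{rec}} + \lambda_4 \mathcal{L}_{\mathrm{smooth}} + \lambda_5 \mathcal{L}_{\mathrm{con}} + \lambda_6 \mathcal{L}_{\mathrm{div}}$ (39), which improves model stability and generalization ability;

[0251] where, in formula (39) above, $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, and $\lambda_6$ are the weighting coefficients of the respective losses.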

[0252] In a preferred embodiment of the present invention, the weights are adjusted for collaborative optimization and are initialized empirically using the following strategy: the leading weights are set to 1, with λ3=0.5, λ4, λ5=0.1, and λ6=0.3, ensuring that the main tasks are optimized first.

[0253] Through the aforementioned weight adjustment and synergistic optimization, the method of this invention can achieve the best diagnostic and prognostic prediction results in clinical scenarios with incomplete multimodalities and spatiotemporal heterogeneity.

[0254] S7. Construct a medical knowledge graph from medical knowledge, drug information, and multimodal information, semantically link diagnostic results with the medical knowledge graph, and share data to an online health intelligence platform for medical big data. This will facilitate personalized decision-making for clinicians based on individual patient characteristics and disease progression, and assist in clinical decision-making.

[0255] Personalized decision-making criteria include specific risk factors, preventive measures, and recommended early screening programs, which help doctors decide whether further examinations or preventive measures are needed.

[0256] The application of the health intelligence platform can support large-scale data sharing, full-cycle data security, and intelligent application throughout the entire process, providing high-quality clinical intelligent assistance for patients with digestive system tumors. This enables integrated management of the entire course of tumor diseases and fully leverages the advantages of the health intelligence platform in early screening, accurate diagnosis, and individualized treatment.

[0257] Furthermore, the health intelligence platform can enhance the awareness of medical staff and patients about the platform through various means such as academic exchanges, case sharing, and patient education, encourage their active participation, form a positive doctor-patient interaction, and accelerate the widespread application and popularization of the health intelligence platform in clinical practice.

[0258] Furthermore, the health intelligence platform can dynamically integrate multi-modal data such as multi-omics, pathological slides, images, and clinical information to form a dynamic and interactive panoramic display at the molecular-cellular-organ-system level. This helps to reveal biological networks, disease evolution mechanisms, and therapeutic targets, providing a powerful data foundation and intelligent analysis tools for personalized diagnosis and treatment and target discovery.

[0259] A third aspect of the present invention also provides an electronic device, including a memory and a processor, wherein the memory stores a computer program that can run on the processor, characterized in that the processor executes the computer program to implement the steps of the method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases.

[0260] A fourth aspect of the present invention also provides a computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method for establishing a diagnostic and treatment system for multimodal information on digestive system diseases.

[0261] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0262] In particular, according to some embodiments of this disclosure, the processes described above can be implemented as computer software programs. For example, some embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowchart. In such embodiments, the computer program can be downloaded and installed from a network via a communication device, or installed from a storage device, or installed from a ROM. When the computer program is executed by a processing device, it performs the functions defined above in the methods of some embodiments of this disclosure.

[0263] It should be noted that the computer-readable medium described in some embodiments of this disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In some embodiments of this disclosure, the computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device.

[0264] In some embodiments of this disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. This propagated data signal may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0265] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can interconnect with digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0266] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device. The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to perform steps S1 to S7 of the method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases described above.

[0267] Computer program code for performing operations of some embodiments of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0268] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases, characterized in that it includes the following steps:

S1. Collect basic clinical information, endoscopic images, radiological images, pathological slide images, and multi-omics information from multiple modalities, and then label and preprocess them;

S2. Extract the feature vectors of the preprocessed multimodal information and embed labels into the multimodal information according to the labeled information;

S3. Concatenate the extracted feature vectors of the multimodal information with the embedded labels of the multimodal information, and map the concatenated high-dimensional vector to a unified dimension to obtain an enhanced feature vector containing modality-specific information, anatomical location information, and disease stage information;

S4. Fuse the enhanced feature vectors of all spliced multimodal information to form a multimodal feature matrix; perform linear mapping on the fused multimodal feature matrix through the projection matrices to obtain the query, key, and value of the multi-head attention mechanism; construct a mask matrix to calculate the weights of the multi-head attention mechanism; after weighted aggregation, obtain the global fusion vector; group the global fusion vectors that belong to the same combination of anatomical location and disease stage to generate a fusion vector sequence;

S5. Based on the fusion vector sequence of global fusion vectors, inject information about the disease stage to enhance the temporal information of the sequence; through spatial encoding, use the anatomical topological relationships and pathological co-occurrence relationships of the digestive tract to strengthen the spatial information on the spatial correlation of site-specific features; through the cross-attention mechanism, deeply interact and fuse the temporal information of the disease stage with the spatial information to obtain spatiotemporal joint features;

S6. Based on the spatiotemporal joint fusion features, use a multi-task learning framework to classify and predict the disease stage or specific pathological type of the patient, and output diagnostic results by co-optimizing multiple losses;

S7. Construct a medical knowledge graph from medical knowledge, drug information, and multimodal information; semantically link the diagnostic results to the medical knowledge graph; and share the data with an online health intelligence platform for medical big data, so as to give clinicians a personalized decision-making basis grounded in individual patient characteristics and disease progression and to assist clinical decision-making.

2. The method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in claim 1, characterized in that the disease stages described in step S1 include healthy, inflammatory, precancerous lesions, early-stage cancer, and advanced-stage cancer; and the multi-omics information includes metagenomics, metabolomics, transcriptomics, proteomics, and peripheral blood whole-genome information.

3. The method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in claim 1, characterized in that the specific concatenation process described in step S3 is as follows: the feature vectors of the multimodal information extracted in step S2 are concatenated with the embedded labels of the multimodal information, and a linear transformation function maps the concatenated high-dimensional vectors to a unified dimension to obtain enhanced feature vectors containing modality-specific information, anatomical location information, and disease stage information:

$$\tilde{x}_m = \mathrm{Linear}\big([\,x_m \,\|\, e_s \,\|\, e_t\,]\big) \quad (10)$$

where, in formula (10), $x_m$ is the original feature vector extracted from the m-th modality, $e_s$ is the embedding vector of the anatomical site, $e_t$ is the embedding vector of the disease stage, $\|$ denotes concatenation, $\mathrm{Linear}(\cdot)$ is the linear transformation function, and $\tilde{x}_m$ is the enhanced feature vector.

4. The method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in claim 3, characterized in that step S4 also includes step S41: the enhanced feature vectors $\tilde{x}_m$ of all multimodal information concatenated in step S3 are fused to form a matrix $X$:

$$X = [\tilde{x}_1;\ \tilde{x}_2;\ \dots;\ \tilde{x}_M] \in \mathbb{R}^{M \times d} \quad (11)$$

where, in formula (11), $X$ is the multimodal feature matrix obtained by splicing and fusion, with dimensions $M \times d$; $M$ is the number of modalities and $d$ is the feature dimension;

the directly spliced matrix $X$ is linearly mapped through the projection matrices $W^Q$, $W^K$, $W^V$ to obtain the parameters $Q$, $K$, $V$ required by the multi-head attention mechanism:

$$Q = XW^Q,\quad K = XW^K,\quad V = XW^V \quad (12)$$

where, in formula (12), $Q$, $K$, $V$ are obtained from the multimodal matrix $X$; $W^Q$, $W^K$, $W^V$ are trainable mapping matrices used to generate the query $Q$, the key $K$, and the value $V$;

a mask matrix $\mathrm{Mask}$ is constructed to mask invalid information in the multi-head attention mechanism, and the attention weights calculated after masking are recorded as $A_j$:

$$A_j = \mathrm{softmax}\left(\frac{Q_j K_j^{\top}}{\sqrt{d_k}} + \mathrm{Mask}\right) \quad (13)$$

where, in formula (13), $\mathrm{Mask}$ is the mask matrix, in which the rows and columns corresponding to missing modal information are assigned minimum values and all other entries are 0; $j$ is the attention head index of the multi-head attention mechanism; $W_j^Q$, $W_j^K$, $W_j^V$ are the trainable projection matrices of attention head $j$; $Q_j$, $K_j$, $V_j$ are the query, key, and value submatrices of the j-th head; $Q_j K_j^{\top}$ calculates the similarity of information between modalities; $\sqrt{d_k}$ is the scaling factor, used to prevent the inner product from becoming too large, which would cause the softmax gradient to vanish; softmax is normalized along the rows so that the attention weights in each row sum to 1;

the global fusion vector is obtained by weighted aggregation of the attention heads:

$$\mathrm{head}_j = A_j V_j,\qquad z_i = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\,W^O \quad (14)$$

where, in formula (14), $z_i$ is the global fusion vector of sample $i$, which fuses the features of the various modalities together with the anatomical location and disease stage embeddings; $\mathrm{head}_j$ is the output of the j-th head; $\mathrm{Concat}(\cdot)$ indicates that the outputs of all $H$ heads are concatenated along the feature dimension; $W^O$ is a learnable weight matrix;

the global fusion vectors $z_i$ belonging to the same combination of anatomical location and disease stage are aggregated to generate the fusion vector sequence of the corresponding anatomical locations and disease stages, calculated as follows:

$$Z_{s,t} = \{\, z_i \mid s_i = s,\ t_i = t \,\} \quad (15)$$

where, in formula (15), $z_i$ is the global fusion vector; $i$ is the sample index; $(s_i, t_i)$ are the anatomical location and disease stage labels of sample $i$; and $Z_{s,t}$ is the set of fused sample vectors for anatomical site $s$ and disease stage $t$.
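
By way of non-limiting illustration, the masked attention of claim 4 can be sketched for a single head in plain Python. This is a minimal sketch under stated assumptions: the projections are taken as identity (so Q = K = V = X), the "minimum value" is represented by the finite constant `NEG`, and the names `build_mask`, `softmax`, and `masked_attention` are hypothetical helpers introduced here, not part of the claimed method.

```python
import math

NEG = -1e9  # assumption: stands in for the "minimum value" of masked entries


def build_mask(available):
    """mask[a][b] = NEG if modality a or modality b is missing, else 0."""
    M = len(available)
    return [[0.0 if (available[a] and available[b]) else NEG for b in range(M)]
            for a in range(M)]


def softmax(row):
    """Numerically stable softmax over one row."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V, mask):
    """Single head: softmax(Q K^T / sqrt(d_k) + mask) V with row-wise softmax."""
    d_k = len(K[0])
    weights = []
    for a, q in enumerate(Q):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) + mask[a][b]
                  for b, k in enumerate(K)]
        weights.append(softmax(scores))
    out = [[sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
           for row in weights]
    return weights, out


# Three modalities with 2-dimensional features; the third modality is missing.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
available = [True, True, False]
weights, out = masked_attention(X, X, X, build_mask(available))
```

For the two available modalities, the weight placed on the missing third modality underflows to zero, so its placeholder features do not leak into the aggregated output. The row of a missing modality is still normalized and would be ignored in any later pooling.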

5. The method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in claim 4, characterized in that step S4 also includes step S42: a regularization loss is introduced to minimize the cosine similarity between the heads of the multi-head attention, giving the multi-head attention diversity regularization loss $\mathcal{L}_{div}$:

$$\mathcal{L}_{div} = \sum_{j \neq k} \frac{\langle w_j, w_k \rangle}{\|w_j\|_2\,\|w_k\|_2} \quad (16)$$

where, in formula (16), $w_j$ is the flattened vector of the weight matrix of the j-th attention head; $\langle w_j, w_k \rangle$ is the inner product of the weight vectors of the j-th head and the k-th head, used to measure the similarity between the two vectors; $\|\cdot\|_2$ is the L2 norm, used for normalization when calculating the cosine similarity;

the intermediate fusion vector after multi-head attention concatenation and fusion is obtained by using a residual connection and LayerNorm:

$$z'_i = \mathrm{LayerNorm}\big(z_i + \mathrm{Pool}(Q)\big) \quad (17)$$

where, in formula (17), $z'_i$ is the intermediate vector after residual normalization; $Q$ is the query of the multi-head attention mechanism; $\mathrm{Pool}(Q)$ pools $Q$ along the modality dimension and is used to form the residual with $z_i$;

the final fused vector is then obtained through a feedforward network:

$$z^{out}_i = \mathrm{LayerNorm}\big(z'_i + \mathrm{FFN}(z'_i)\big) \quad (18)$$

where, in formula (18), $z^{out}_i$ is the final output fusion vector, which is fed into the subsequent missing-modality reconstruction or downstream task modules; $\mathrm{LayerNorm}$ is the layer normalization operation, which normalizes the input vector to stabilize training;

for each available modality $m$, the modal decoder network $D_m$ reconstructs the features $\hat{x}_{i,m}$, of dimension $d$, from the final output fusion vector $z^{out}_i$:

$$\hat{x}_{i,m} = D_m\big(z^{out}_i\big) \quad (19)$$

the mean squared error loss is used to assess the quality of the basic reconstruction, and the mean squared error loss of the basic reconstruction $\mathcal{L}_{mse}$ is calculated by summing over all samples and modalities:

$$\mathcal{L}_{mse} = \frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{M} \mathbb{1}[\delta_{i,m} = 1]\;\big\|\hat{x}_{i,m} - x_{i,m}\big\|_2^2 \quad (20)$$

where, in formula (20), $N$ is the total number of samples; $i$ denotes the i-th sample; $M$ is the total number of modalities; $m$ denotes the m-th modality; $x_{i,m}$ is the true feature of the m-th modality, with dimension $d$; $\delta_{i,m}$ is the modality indicator; $\mathbb{1}[\cdot]$ is the indicator function; $\delta_{i,m} = 1$ indicates that the m-th modality is available, and a modality whose information is missing is unavailable; $\|\cdot\|_2$ is the L2 norm, used to calculate the basic reconstruction error.
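
By way of non-limiting illustration, the head-diversity regularizer and the availability-masked reconstruction error of claim 5 can be sketched as follows. The normalizations (averaging over ordered head pairs, averaging over available sample-modality pairs) and the function names `cosine`, `head_diversity_loss`, and `masked_mse` are assumptions made for this sketch; the claim does not fix them.

```python
import math


def cosine(u, v):
    """Cosine similarity: inner product normalized by the two L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def head_diversity_loss(head_weights):
    """Mean pairwise cosine similarity between flattened per-head weight vectors."""
    H = len(head_weights)
    pairs = [(j, k) for j in range(H) for k in range(H) if j != k]
    return sum(cosine(head_weights[j], head_weights[k]) for j, k in pairs) / len(pairs)


def masked_mse(recon, truth, available):
    """Squared L2 error over available (sample, modality) pairs, averaged over them."""
    total, count = 0.0, 0
    for i in range(len(truth)):
        for m in range(len(truth[i])):
            if available[i][m]:  # missing modalities contribute nothing
                total += sum((r - t) ** 2 for r, t in zip(recon[i][m], truth[i][m]))
                count += 1
    return total / max(count, 1)


# Orthogonal heads are maximally diverse; parallel heads are redundant.
orthogonal = head_diversity_loss([[1.0, 0.0], [0.0, 1.0]])
identical = head_diversity_loss([[1.0, 0.0], [2.0, 0.0]])

# One sample, two modalities; the second modality is missing and is skipped.
truth = [[[1.0, 2.0], [0.0, 0.0]]]
recon = [[[1.0, 0.0], [5.0, 5.0]]]
mse = masked_mse(recon, truth, [[True, False]])
```

Minimizing the first quantity pushes heads toward dissimilar weight directions, while the indicator in the second keeps placeholder values of missing modalities out of the reconstruction error.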

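6. The method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in claim 5, characterized in that the decoder network $D_m$ additionally outputs a variance estimate $\hat{\sigma}^2_{i,m}$ of the reconstructed features, and the uncertainty-weighted loss $\mathcal{L}_{unc}$ over all samples and modalities is calculated from this quantified reconstruction uncertainty according to formula (21), where, in formula (21), $N$ is the total number of samples; $i$ denotes the i-th sample; $M$ is the total number of modalities; $m$ denotes the m-th modality; and $\hat{\sigma}^2_{i,m}$ is the variance estimate of the reconstructed features of the m-th modality of the i-th sample;

for the set of reconstructed features $\{\hat{x}_{i,m}\}$ of all available samples and the set of true features $\{x_{i,m}\}$, the maximum mean discrepancy (MMD) is used to measure distribution consistency, and a Gaussian kernel function is used to calculate the distribution alignment loss $\mathcal{L}_{mmd}$ over all modal information:

$$\mathcal{L}_{mmd} = \sum_{m=1}^{M}\left[\frac{1}{n_m^2}\sum_{i,j} k(\hat{x}_{i,m}, \hat{x}_{j,m}) + \frac{1}{n_m^2}\sum_{i,j} k(x_{i,m}, x_{j,m}) - \frac{2}{n_m^2}\sum_{i,j} k(\hat{x}_{i,m}, x_{j,m})\right] \quad (22)$$

in which

$$k(u, v) = \exp\left(-\frac{\|u - v\|_2^2}{2\sigma^2}\right) \quad (23)$$

where $k(\cdot,\cdot)$ is the Gaussian kernel function; $\sigma$ is the width of the Gaussian kernel; $n_m$ is the number of available samples for the m-th modality; and $i$, $j$ are sample indices;

for a missing target modality $a$, the true features $x_b$ of the observed modalities $b$ are used, through the cross-modal mapping function $f_{b \to a}$, in the reconstruction calculation to obtain the cross-modal distillation loss $\mathcal{L}_{distill}$ according to formula (24), where $f_{b \to a}$ is the mapping function from the observed modality $b$ to the missing target modality $a$; the observed modalities $b$ are selected according to their relevance to the missing modality $a$; and $\|\cdot\|_1$ is the L1 norm;

the above mean squared error loss $\mathcal{L}_{mse}$, uncertainty-weighted loss $\mathcal{L}_{unc}$, distribution alignment loss $\mathcal{L}_{mmd}$, and cross-modal distillation loss $\mathcal{L}_{distill}$ are summed with weights to give the overall reconstruction loss $\mathcal{L}_{rec}$, calculated as follows:

$$\mathcal{L}_{rec} = \lambda_1 \mathcal{L}_{mse} + \lambda_2 \mathcal{L}_{unc} + \lambda_3 \mathcal{L}_{mmd} + \lambda_4 \mathcal{L}_{distill} \quad (25)$$

where, in formula (25), $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weighting coefficients of the mean squared error loss, the uncertainty-weighted loss, the distribution alignment loss, and the cross-modal distillation loss, respectively, used to balance the contributions of the different losses.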
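
By way of non-limiting illustration, two of the loss terms of claim 6 can be sketched in plain Python. The MMD estimate below is the standard biased estimator with a Gaussian kernel. The uncertainty-weighted term uses the common heteroscedastic Gaussian form (squared error divided by twice the variance, plus half the log variance), which is an assumption of this sketch because formula (21) is not legible in the source. The function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel with width sigma on two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between two sets."""
    def mean_k(A, B):
        return sum(gaussian_kernel(a, b, sigma) for a in A for b in B) / (len(A) * len(B))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def uncertainty_weighted(sq_err, var):
    """Assumed form: squared error scaled by predicted variance, plus a log penalty."""
    return sq_err / (2.0 * var) + 0.5 * math.log(var)


# True features versus two candidate reconstructions of the same modality.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]

same = mmd2(real, real)      # identical distributions
apart = mmd2(real, shifted)  # clearly separated distributions
```

A reconstruction whose feature distribution matches the true one drives the MMD term to zero, and the log-variance penalty in the uncertainty term stops the decoder from lowering its loss by predicting an arbitrarily large variance.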

7. The method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in claim 6, characterized in that step S5 further includes step S51: based on the fusion vector sequence described in step S4, temporal information is injected into each stage according to the disease progression stage; for any two disease stage labels $t_a$ and $t_b$, the difference between stages is defined as:

$$\Delta t = t_a - t_b \quad (26)$$

and sine and cosine functions are used to generate the temporal position encoding vector $PE(\Delta t)$:

$$PE(\Delta t)_{2k} = \sin\left(\frac{\Delta t}{\omega^{2k/d}}\right),\qquad PE(\Delta t)_{2k+1} = \cos\left(\frac{\Delta t}{\omega^{2k/d}}\right) \quad (27)$$

where, in formula (26), $t_a$ and $t_b$ are disease stage labels and $\Delta t$ is the difference between the label indices of the two disease stages, used to measure the relative distance between different disease stages; in formula (27), $\omega$ is a frequency constant that controls the period of the sine and cosine functions to ensure that the differences between different disease stages are encoded distinguishably; $PE(\Delta t)$ is the stage position encoding vector, which maps discrete stage differences to continuous features using sine and cosine functions, providing sufficient discriminative power for differences between stages and providing stage temporal information for the subsequent Transformer model;

the input to the Transformer model is the fusion vector sequence $Z$ and the positional encoding; by using the Transformer model to capture the dependencies between disease stages, the temporal features $H_{time}$ of the evolution across disease stages are obtained:

$$H_{time} = \mathrm{Transformer}(Z + PE) \quad (28)$$

where, in formula (28), $\mathrm{Transformer}(\cdot)$ represents a Transformer model with a 4-layer encoder, 8 attention heads per layer, and a hidden dimension of 256;

by limiting drastic fluctuations in features between adjacent stages using the L1 norm, the temporal smoothing loss $\mathcal{L}_{smooth}$ of the temporal features $H_{time}$ is calculated:

$$\mathcal{L}_{smooth} = \sum_{t=1}^{T-1} \big\| H_{time}^{(t+1)} - H_{time}^{(t)} \big\|_1 \quad (29)$$

where $T$ is the number of disease stages and $H_{time}^{(t)}$ is the temporal feature of stage $t$;

Step S52: through spatial encoding, an association graph is constructed using the anatomical topological relationships and pathological co-occurrence relationships of the digestive tract to enhance the spatial correlation of site-specific features; the association graph $G$ is constructed using the following formula (30):

$$G = (\mathcal{V}, \mathcal{E}) \quad (30)$$

where, in formula (30), $\mathcal{V}$ is the node set, in which each node corresponds to an anatomical location label of the digestive tract; $\mathcal{E}$ is the edge set, in which spatial relationships between sites are constructed based on anatomical adjacency or pathological co-occurrence relationships;

the spatially enhanced feature $H'$ is generated using a graph attention network (GAT):

$$H' = \mathrm{GAT}(H, A) \quad (31)$$

where, in formula (31), $H$ denotes the site-specific features of the nodes and $A$ is the adjacency matrix of the association graph $G$, built from the edge set $\mathcal{E}$ of $G$;

the site-specific features are enhanced through an attention mechanism according to formula (32), where, in formula (32), $e_s$ is the embedding vector of the anatomical site, used to generate the global vector; $W_s$ is the learnable site-conditional projection matrix; $K$ and $V$ are the key and value matrices of the attention mechanism; and $h_s$ is the output site-specific feature of each site;

Step S53: the temporal features $H_{time}$ and the spatially enhanced feature $H'$ are input into a cross-attention mechanism, which learns the mutual attention weights between the temporal and spatial aspects, and the spatiotemporal joint features $F_{ST}$ are obtained through interactive fusion:

$$F_{ST} = \mathrm{CrossAttn}(H_{time}, H', H') \quad (33)$$

where, in formula (33), $\mathrm{CrossAttn}(\cdot)$ represents a 2-layer cross-attention network with 8 attention heads per layer and a hidden dimension of 256; the temporal features $H_{time}$ serve as the query, and the spatially enhanced feature $H'$ serves as both the key and the value, capturing the correlation between disease stage and location; $F_{ST}$ is the spatiotemporal joint feature obtained through fusion;

then the contrastive learning loss $\mathcal{L}_{con}$ is used to maximize the similarity between positive sample pairs and minimize the similarity between the anchor and the negative samples, with the specific calculation given by formula (34):

$$\mathcal{L}_{con} = -\log \frac{\exp\big(\mathrm{sim}(f_i, f_i^{+})/\tau\big)}{\exp\big(\mathrm{sim}(f_i, f_i^{+})/\tau\big) + \sum_{f^{-} \in \mathcal{N}_i} \exp\big(\mathrm{sim}(f_i, f^{-})/\tau\big)} \quad (34)$$

where, in formula (34), $f_i$ and $f_i^{+}$ are the two spatiotemporal feature representations of a positive sample pair, usually from different perspectives of the same disease site or the same disease stage; $\mathcal{N}_i$ is the negative sample set, whose spatiotemporal features do not belong to the same disease site or stage as the anchor; $\mathrm{sim}(\cdot,\cdot)$ is a similarity measurement function; and $\tau$ is a temperature coefficient used to adjust the smoothness of the contrastive learning loss.
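
By way of non-limiting illustration, the stage-difference encoding and the contrastive loss of claim 7 can be sketched in plain Python. The stage list follows claim 2. The base value 10000 for the frequency constant, the encoding dimension, the use of cosine similarity as the similarity function, and the names `stage_encoding` and `info_nce` are assumptions of this sketch, not values taken from the claims.

```python
import math

# Disease stages in progression order, as listed in claim 2.
STAGES = ["healthy", "inflammatory", "precancerous lesion",
          "early-stage cancer", "advanced-stage cancer"]


def stage_encoding(delta, dim=8, base=10000.0):
    """Sine/cosine encoding of a stage difference (assumed Transformer-style)."""
    enc = []
    for k in range(dim // 2):
        angle = delta / (base ** (2 * k / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive loss: pull the positive toward the anchor, push negatives away."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


# Stage difference between "healthy" and "early-stage cancer" is 3 index steps.
delta = STAGES.index("early-stage cancer") - STAGES.index("healthy")
pe = stage_encoding(delta)

# A dissimilar negative gives a small loss; a negative equal to the anchor does not.
easy = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard = info_nce([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]])
```

A zero stage difference encodes to alternating 0 and 1 values, and larger differences rotate each sine/cosine pair at a different rate, which is what makes different stage gaps distinguishable to a downstream model.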

8. The method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in claim 1, characterized in that step S6 also includes step S61: based on the spatiotemporal joint fusion features $F_i$, a multi-task learning framework is used to classify and predict the patient's disease stage or specific pathological type and to output the predicted probability $p_i$:

$$p_i = \mathrm{softmax}\big(\mathrm{MLP}(F_i)\big) \quad (35)$$

and the cross-entropy loss $\mathcal{L}_{ce}$ is calculated from the predicted probability $p_i$:

$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c \in C} y_{i,c}\,\log p_{i,c} \quad (36)$$

where, in formula (36), $F_i$ is the spatiotemporal joint fusion feature of the i-th sample; MLP is the multilayer perceptron structure of the multi-task learning framework; $N$ is the total number of samples; $C$ is the set of categories covering all digestive tract sites and disease stages; $y_{i,c}$ indicates whether the true label of the i-th sample belongs to category $c$; and $p_{i,c}$ is the predicted probability that the i-th sample belongs to category $c$;

Step S62: based on the spatiotemporal joint fusion features $F_i$, the risk score $r_i$ is obtained using a multilayer perceptron (MLP) of the multi-task learning framework:

$$r_i = \mathrm{MLP}(F_i) \quad (37)$$

and, based on the predicted risk score $r_i$, the Cox partial likelihood loss $\mathcal{L}_{cox}$ is calculated using the Cox proportional hazards model:

$$\mathcal{L}_{cox} = -\sum_{i \in E}\left(r_i - \log \sum_{j \in R(t_i)} \exp(r_j)\right) \quad (38)$$

where, in formula (38), $r_i$ is the risk score of the i-th sample, generated by the multilayer perceptron (MLP) of the multi-task learning framework from the spatiotemporal joint fusion feature $F_i$ of that sample; $E$ is the set of indices of patients for whom the event occurred; $R(t_i)$ is the risk set, that is, the set of indices of patients who were still at risk when the event of patient $i$ occurred; and $\mathcal{L}_{cox}$ is the Cox partial likelihood loss;

Step S63: the cross-entropy loss $\mathcal{L}_{ce}$, the Cox partial likelihood loss $\mathcal{L}_{cox}$, the overall reconstruction loss $\mathcal{L}_{rec}$, the temporal smoothing loss $\mathcal{L}_{smooth}$, the contrastive learning loss $\mathcal{L}_{con}$, and the multi-head attention diversity regularization loss $\mathcal{L}_{div}$ are weighted and co-optimized, with the specific calculation as follows:

$$\mathcal{L}_{total} = \lambda_{ce}\mathcal{L}_{ce} + \lambda_{cox}\mathcal{L}_{cox} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{smooth}\mathcal{L}_{smooth} + \lambda_{con}\mathcal{L}_{con} + \lambda_{div}\mathcal{L}_{div} \quad (39)$$

where, in formula (39), $\lambda_{ce}$, $\lambda_{cox}$, $\lambda_{rec}$, $\lambda_{smooth}$, $\lambda_{con}$, $\lambda_{div}$ are the weighting coefficients of the respective losses.
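
By way of non-limiting illustration, the Cox partial likelihood loss and the weighted co-optimization of claim 8 can be sketched in plain Python. The sketch ignores tie handling, averages over the number of events, and uses hypothetical names (`cox_partial_likelihood_loss`, `total_loss`), example follow-up times, and example weights; all of these are assumptions made for illustration.

```python
import math


def cox_partial_likelihood_loss(risk, time, event):
    """Negative log partial likelihood, averaged over observed events (no tie correction)."""
    loss, n_events = 0.0, 0
    for i in range(len(risk)):
        if not event[i]:
            continue  # censored patients only appear inside risk sets
        # Risk set: patients whose follow-up time is at least that of patient i.
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        loss -= risk[i] - math.log(sum(math.exp(r) for r in at_risk))
        n_events += 1
    return loss / max(n_events, 1)


def total_loss(losses, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in losses.items())


# Two patients with events at times 1 and 2 (hypothetical data).
time, event = [1.0, 2.0], [1, 1]
neutral = cox_partial_likelihood_loss([0.0, 0.0], time, event)
concordant = cox_partial_likelihood_loss([2.0, 0.0], time, event)  # earlier event, higher risk
discordant = cox_partial_likelihood_loss([0.0, 2.0], time, event)  # earlier event, lower risk

combined = total_loss({"ce": 0.5, "cox": concordant},
                      {"ce": 1.0, "cox": 0.5})
```

Risk scores that rank the earlier event higher give a lower loss than the reversed ranking, which is the ordering behaviour the Cox term is meant to reward during joint training.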

9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in any one of claims 1-8.

10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method for establishing a diagnostic and treatment system for multimodal information of digestive system diseases as described in any one of claims 1-8.