A diagnostic model for inflammatory bowel disease subtypes and a method for constructing the same
By combining peripheral blood mononuclear cells and RNA editing-specific molecular indicators with machine learning, the invasiveness and signal confounding issues in the early diagnosis of inflammatory bowel disease have been resolved, enabling accurate differentiation and highly interpretable diagnosis between Crohn's disease and ulcerative colitis.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA
- Filing Date: 2026-03-05
- Publication Date: 2026-06-19
AI Technical Summary
Existing early diagnostic techniques for inflammatory bowel disease suffer from high invasiveness, mixed signals in tissue samples, and insufficient accuracy in subtype identification, making it difficult to achieve early screening and accurate classification.
Peripheral blood mononuclear cells (PBMCs) were used as the test sample. A multi-step data filtering process was constructed using RNA editing event-specific molecular indicators and machine learning algorithms to screen out population-conserved and biologically significant features, and to establish a non-invasive intelligent diagnostic model.
It enables accurate differentiation between Crohn's disease and ulcerative colitis, reduces examination discomfort, improves the feasibility of early screening and disease monitoring, and provides a highly interpretable and accurate diagnostic tool.
Abstract
Description
Technical Field
[0001] This invention belongs to the field of biomedical detection technology, specifically relating to a diagnostic model for a subtype of inflammatory bowel disease and its construction method.
Background Technology
[0002] Currently, early diagnosis and accurate classification of inflammatory bowel disease (IBD) face dual technical bottlenecks. First, traditional endoscopy, the gold standard, is highly invasive, making it difficult to use for early screening, and its diagnosis of atypical lesions is highly subjective. Second, while RNA-seq based on intestinal tissue samples, the mainstream technology aimed at finding objective molecular markers, avoids subjectivity, it relies on invasive biopsies, and the "bulk" data it measures is a mixture of signals from various cells. This makes the differentially expressed genes easily affected by confounding factors such as the composition of the sample cells and the different tissue sites sampled, making it difficult to distinguish the essential and robust characteristics of ulcerative colitis (UC) from Crohn's disease (CD), thus limiting its clinical translational potential.
[0003] To fundamentally address both the issues of "invasiveness" and "signal confounding," this invention innovatively shifts the detection sample from intestinal tissue to peripheral blood mononuclear cells (PBMCs). Obtaining PBMCs offers the advantages of being minimally invasive and reproducible, and as a homogeneous population of immune cells, they can more clearly reflect the systemic immune dysregulation state of IBD. Furthermore, the level of RNA editing in the PBMC transcriptome is quantified, an indicator that can reveal disease characteristics at the post-transcriptional regulatory level.
[0004] However, the new system faces technical challenges inherent in RNA editing events: (1) prevalent technical noise and low-frequency editing sites; and (2) numerous, non-shared, individual-specific editing sites in population studies. These factors collectively lead to excessively high dimensionality and redundancy in the initial data, which, if used directly for modeling, will obscure the true biological signals.
[0005] To address the aforementioned technical challenges, this invention establishes a multi-step early data filtering process. First, the raw data is cleaned by setting basic quality control standards: ① sites with a coverage of less than 10 are excluded to ensure statistical power; ② low-confidence sites with an edit frequency of less than 0.05 are filtered out to reduce technical noise and random errors. Based on this, to screen for features more prevalent within disease subgroups, only edit sites detected in at least 75% of samples within each group (e.g., CD group, UC group, healthy control group), i.e., sites with population "conservatism," are retained.
[0006] This screening process aims to combine the basic reliability of the data (coverage and frequency) with the population representativeness (conservatism) of the features, thereby effectively reducing data dimensionality and complexity in the early stages of analysis and initially focusing on RNA editing events that are more likely to reflect disease-related biological processes. This step provides a more stable and interpretable feature input foundation for subsequent construction of diagnostic models using machine learning methods such as random forests.
Summary of the Invention
[0007] Technical problems to be solved: In view of the above-mentioned technical problems, the purpose of this invention is to provide a diagnostic model for inflammatory bowel disease subtypes and its construction method. By integrating non-invasive samples, RNA editing event-specific molecular indicators and machine learning algorithms, the core problems of high invasiveness, mixed tissue sample signals, difficulty in extracting high-dimensional data features, and insufficient subtype identification accuracy in existing IBD diagnostic technologies are solved. Finally, an intelligent diagnostic model based on peripheral blood mononuclear cells is constructed, which can accurately differentiate between Crohn's disease and ulcerative colitis.
[0008] Specifically, the technical problems to be solved by this invention and their solutions include:
[0009] 1. Addressing the problem of high invasiveness and difficulty in using existing diagnostic technologies for early screening: By using readily available peripheral blood mononuclear cells (PBMCs) as the test sample, replacing the invasive colonoscopy biopsy and tissue sampling relied upon by the current gold standard, the pain and risks of examination for patients are greatly reduced, making large-scale early screening and disease monitoring possible, and improving the compliance and feasibility of clinical application.
[0010] 2. Overcoming the shortcomings of tissue-based molecular typing techniques, such as mixed signals and insufficient specificity: By analyzing a more homogeneous cell population in PBMCs, the molecular signal mixture caused by complex cell types and differences in sampling sites in intestinal tissue samples is avoided, thus revealing the systemic immune characteristics of IBD more clearly.
[0011] 3. Establish a new strategy for robustly extracting disease features from high-dimensional, high-noise RNA editing data: To address the common problems in RNA editing site detection, such as technical noise, inefficient editing events, and inter-individual heterogeneity, we provide an effective multi-step data filtering and feature screening process to identify disease-related editing features with population conservation and biological significance from the source.
[0012] 4. Achieve accurate and objective differentiation of IBD subtypes (Crohn's disease and ulcerative colitis): Ultimately, by integrating the above-mentioned non-invasive samples, specific molecular indicators (RNA editing), and machine learning algorithms (random forest), an intelligent diagnostic model can be constructed that can automatically differentiate between CD and UC based on minimally invasive blood samples, providing a brand-new auxiliary diagnostic tool for clinical practice.
[0013] Technical solution: A method for constructing a diagnostic model for a subtype of inflammatory bowel disease, comprising the following steps:
[0014] S1. Data Acquisition and Preprocessing: PBMC transcriptome sequencing data were acquired from Crohn's disease patients, ulcerative colitis patients, and healthy controls. FastQC was used for preliminary quality assessment of the sequencing data, and MultiQC was used to summarize the results to understand the overall data quality. Trimmomatic software was used to filter the sequencing data, removing adapter sequences and low-quality bases to obtain high-quality sequencing reads. STAR alignment software was used to construct an index based on the human reference genome sequence GRCh38.p13 and the gene annotation file GENCODE v42. The pruned high-quality reads were aligned to the human reference genome to generate sorted BAM files.
[0015] S2. Systematic identification and annotation of RNA editing events and transcript quantification: Using the RNA editing identification software REDItools, BAM files were input to systematically identify A-to-I type RNA editing events. For each identified candidate editing site, its editing frequency was calculated, generating a sample × editing frequency matrix. The ANNOVAR tool was used, with the dbSNP database as a reference, to filter out known genomic polymorphic sites. The RSEM software was used to accurately quantify the transcript BAM files of the same batch of samples, obtaining the expression abundance estimate of each gene, generating a sample × gene expression matrix.
[0016] S3. High-dimensional feature screening for diagnosis: First, perform basic quality control filtering to delete editing sites that meet any of the following low-confidence conditions: (1) the sequencing coverage of the site is less than 10; (2) the editing frequency of the site is <0.05. Then perform population conservation filtering on the sites that passed basic quality control, retaining RNA editing sites with an intragroup occurrence rate of at least 75% in the healthy control, Crohn's disease, and ulcerative colitis samples;
[0017] S4. Subtype-specific differential feature screening: For the sites filtered through population conservation in S3, nonparametric statistical tests were performed to analyze whether there were significant differences in RNA editing frequency between the ulcerative colitis group (UC) and the healthy control group (HC), and between the Crohn's disease group (CD) and the healthy control group (HC). Sites that reached a significant level in their respective comparisons were retained to form an initial set. Through a non-intersection operation, sites that were also significant in CD vs HC were removed from the UC-related sites, resulting in UC-specific differential sets and CD-specific differential sets. The two differential sets were merged to form a candidate subtype-specific site pool. All sites were subjected to statistical tests for the UC vs CD group to obtain the UC vs CD differential site set. The intersection of this differential site set with the candidate subtype-specific site pool was taken to obtain the sites that constitute the high-specificity diagnostic feature set.
[0018] S5. In order to reduce the dimensionality of the data, the Boruta feature selection algorithm was used to reduce the dimensionality of the high-specificity diagnostic feature set to obtain truly valuable features that can distinguish the three subtypes. The features were then sorted according to their importance, and the top 20 features were used for the final machine learning feature gene set.
[0019] S6. Using the machine learning feature set obtained in S5, a diagnostic model for the inflammatory bowel disease subtype is constructed through the random forest algorithm.
[0020] Preferably, in step S6, the modeling parameter ntree=200 is used to construct the diagnostic model using random forest.
[0021] The diagnostic model for the inflammatory bowel disease subtype constructed using the above method.
[0022] Beneficial effects:
[0023] 1. This invention can accurately distinguish between healthy individuals and patients with UC and CD from PBMC transcriptome sequencing data through RNA editing.
[0024] 2. This invention is based on PBMC sequencing data, which can be obtained from samples collected via routine venous blood sampling. This establishes the minimally invasive nature of the method and overcomes the invasiveness bottleneck of existing gold standard technologies. By utilizing data from PBMCs, a specific immune cell population, the aim is to directly capture IBD-related systemic immune response signals, avoiding the problem of blurred molecular characteristics caused by the high heterogeneity of cell types in intestinal tissue samples.
[0025] 3. This invention delves into the innovative molecular regulatory level, shifting from analyzing traditional differences in gene expression to focusing on posttranscriptional RNA editing. This indicator more directly reflects the fine-tuning and homeostatic changes in gene function, potentially revealing disease characteristics that are more stable and biologically significant than mRNA abundance. Simultaneously, it constructs a machine learning-resolvable feature space, transforming each sample into a feature vector composed of hundreds to thousands of RNA editing frequency values, providing a structured data foundation for subsequent intelligent diagnosis based on statistical regularity and pattern recognition.
[0026] 4. In response to the high noise and high individual heterogeneity of RNA editing data, this invention designs a three-step progressive feature screening process to extract stable and generalizable diagnostic features. This fundamentally ensures that the extracted features have high population representativeness and clinical reproducibility, which is the cornerstone for the strong generalization ability of subsequent diagnostic models.
[0027] 5. This invention uses a subtype-specific differential feature screening strategy to identify highly specific features. (1) Ensuring the subtype-specificity of features: The first step uses an "independent comparison, no intersection" approach to initially screen editing sites that may only undergo specific changes in a certain subtype, excluding sites that change in both diseases, and ensuring that there are differences in the HC group; (2) Enhancing the discriminative power of features: The second step requires that these "candidate specific sites" also show differences in direct comparison between UC and CD. This is a rigorous verification step to ensure that they have strong disease subtype differentiation power; (3) Constructing a highly specific feature set: The final features screened through this process are very likely to be key molecular events that drive or characterize the unique pathological process of Crohn's disease or ulcerative colitis, rather than general inflammation-related changes. This provides the optimal feature foundation for constructing a highly interpretable model that can reveal the intrinsic biological differences of diseases, rather than just statistical classification.
[0028] 6. This invention is based on a three-step progressive screening strategy for subtype-specific RNA editing sites. Through precise statistical design, it ensures that the features finally selected are not only related to the disease state (distinguished from healthy controls), but also specific biomarkers that can strongly distinguish Crohn's disease from ulcerative colitis, rather than common inflammation-related changes shared by both. This lays a characteristic foundation for achieving high-precision subtype identification.
Attached Figure Description
[0029] Figure 1 A flowchart illustrating the construction of a diagnostic model for inflammatory bowel disease subtypes;
[0030] Figure 2 The distribution and number of RNA editing events in different subgroups of inflammatory bowel disease;
[0031] Figure 3 A comparison of global RNA editing levels among different subgroups of inflammatory bowel disease;
[0032] Figure 4 This is a comparison chart of principal component analysis of samples based on different molecular characteristics;
[0033] Figure 5 Schematic diagram of subtype-specific RNA editing site screening strategy
[0034] Figure 6 The top 20 features selected for modeling by the Boruta algorithm;
[0035] Figure 7 The receiver operating characteristic curve of the random forest diagnostic model on the independent test set;
[0036] Figure 8 The confusion matrix shows the correspondence between the model's predicted categories and the actual clinical diagnostic categories.
Detailed Implementation
[0037] The present invention will be further described below with reference to embodiments. These embodiments are illustrative of the present invention, but the present invention is not limited to these embodiments:
[0038] Example 1
[0039] This embodiment describes data acquisition and preprocessing, and the specific method includes the following steps:
[0040] S1. PBMC transcriptome sequencing data were obtained from the GEO public database. Raw fastq files of PBMCs containing 60 patients with Crohn's disease (CD), 15 patients with ulcerative colitis (UC), and 12 healthy controls (HC) were obtained from the European Nucleotide Archive database. The raw sequencing data were initially assessed for quality using the FastQC tool, and the results were summarized using MultiQC to understand the overall quality of the data.
[0041] S2. Sequencing adapters and low-quality sequence trimming: Trimmomatic software is used to filter the raw data, removing adapter sequences and low-quality bases to obtain high-quality "clean" sequencing reads;
[0042] S3. Genome Index Construction: Using STAR alignment software, an index was constructed based on the human reference genome sequence (GRCh38.p13) and gene annotation files (GENCODE v42);
[0043] S4. Sequence alignment: Align the pruned high-quality reads to the human reference genome to generate a sorted BAM file.
[0044] Example 2
[0045] This embodiment describes the systematic identification and annotation of RNA editing events and transcript quantification. The specific method includes the following steps:
[0046] S1. Using the RNA editing identification software REDItools, input the preprocessed BAM file to systematically identify A-to-I (adenosine to inosine) type RNA editing events; for each identified candidate editing site, calculate its editing frequency using the formula: Editing frequency = (number of sequencing reads supporting the edited bases) / (total number of sequencing reads covering the site); this step generates an initial high-dimensional feature set for each sample, where each feature corresponds to an RNA editing frequency value for a specific genomic coordinate;
[0047] S2. To eliminate interference from single nucleotide polymorphisms, the ANNOVAR tool was used, with the dbSNP database as a reference, to filter out known genomic polymorphic sites: First, an ANNOVAR format file of a known SNP database was prepared; gene and site region annotations and filtering were performed on the editing sites of each sample to generate an editing frequency matrix constructed as sample (row) × gene (column); RSEM software was used to accurately quantify the transcript BAM files of the same batch of samples to obtain the expression abundance estimate of each gene, generating an expression matrix constructed as sample (row) × gene (column).
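The editing frequency formula in step S1 above can be illustrated with a minimal Python sketch. The function name `editing_frequency`, the site identifiers, and the read counts are hypothetical and chosen only for illustration; they are not output of REDItools.

```python
from typing import Optional


def editing_frequency(edited_reads: int, total_reads: int) -> Optional[float]:
    """Editing frequency = reads supporting the edited base / reads covering the site."""
    if total_reads <= 0:
        return None  # site not covered, so the frequency is undefined
    if not 0 <= edited_reads <= total_reads:
        raise ValueError("edited_reads must lie between 0 and total_reads")
    return edited_reads / total_reads


# Hypothetical counts for one sample: site -> (edited reads, total reads)
site_counts = {
    "chr1:1000": (6, 40),
    "chr2:2500": (0, 25),
    "chr3:3100": (0, 0),
}

# One row of the sample x editing-site frequency matrix
frequencies = {site: editing_frequency(*counts) for site, counts in site_counts.items()}
print(frequencies)
```

Returning `None` for uncovered sites, rather than 0, keeps "not detected" distinct from "detected but unedited"; this is a design assumption of the sketch, not a rule stated in the description.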
[0048] As shown in Figure 2, the numbers of RNA editing sites detected in the healthy control group (HC), ulcerative colitis group (UC), and Crohn's disease group (CD) are displayed, together with the distribution of these sites across genomic functional regions (such as 3' UTRs, exons, and introns) in the form of a bar chart, which intuitively reflects the between-group differences in the number of editing events and their regional bias.
[0049] Example 3
[0050] This embodiment describes high-dimensional feature screening for diagnostic purposes, and the specific method includes the following steps:
[0051] S1. Basic quality control filtering: Delete all editing sites that meet any of the following low-confidence conditions: (1) the sequencing coverage of the site is less than 10; (2) the editing frequency of the site is less than 0.05;
[0052] S2. Population conservation filtering: Among the sites that have passed basic quality control, further population conservation screening is performed to retain only RNA editing sites that are detected in at least 75% of the samples within each subgroup (HC, CD, and UC).
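The two filtering steps of this embodiment can be sketched in Python as follows. The data layout (a dictionary of per-sample site records), the function names, and the toy values are assumptions made for illustration. The sketch also assumes that a site must reach the 75% detection rate in every group to be retained, which is one reading of the conservation criterion.

```python
from collections import defaultdict

MIN_COVERAGE = 10          # sites with coverage below 10 are removed
MIN_FREQUENCY = 0.05       # sites with editing frequency below 0.05 are removed
MIN_GROUP_FRACTION = 0.75  # required detection rate within each group


def passes_basic_qc(coverage, frequency):
    """A site is kept only if it violates neither low-confidence condition."""
    return coverage >= MIN_COVERAGE and frequency >= MIN_FREQUENCY


def conserved_sites(samples, groups):
    """samples: {sample_id: {site: (coverage, frequency)}}, groups: {sample_id: group}."""
    group_sizes = defaultdict(int)
    detected = defaultdict(lambda: defaultdict(int))  # site -> group -> sample count
    for sample_id, sites in samples.items():
        group = groups[sample_id]
        group_sizes[group] += 1
        for site, (coverage, frequency) in sites.items():
            if passes_basic_qc(coverage, frequency):
                detected[site][group] += 1
    return {
        site
        for site, counts in detected.items()
        if all(counts.get(g, 0) / n >= MIN_GROUP_FRACTION for g, n in group_sizes.items())
    }


# Hypothetical toy cohort: four samples per group
groups = {f"{g}{i}": g for g in ("HC", "CD", "UC") for i in range(4)}
samples = {sid: {"siteA": (30, 0.20), "siteC": (8, 0.30)} for sid in groups}
for sid in ("HC0", "HC1", "CD0", "UC0"):
    samples[sid]["siteB"] = (25, 0.10)  # detected in only 2 of 4 HC samples
samples["UC3"]["siteA"] = (30, 0.01)    # fails QC in one UC sample (3 of 4 remain)

kept = conserved_sites(samples, groups)
print(kept)
```

In this toy cohort only `siteA` survives: `siteC` fails the coverage threshold everywhere, and `siteB` is detected in too few samples per group.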
[0053] Example 4
[0054] This embodiment focuses on global feature analysis and preliminary verification. The specific method includes the following steps:
[0055] S1. Overall Editing Level Analysis: To investigate the overall trend of RNA editing in IBD, the average editing frequency of all detected RNA editing sites in each sample was calculated to represent the "global editing level" of that sample. Wilson's test was then used to analyze the differences in global editing levels among the three groups: healthy control (HC), Crohn's disease (CD), and ulcerative colitis (UC) (see Figure 3);
[0056] S2. Principal Component Analysis Visualization and Comparison: To objectively assess and compare the potential discriminative power of the two types of molecular features on the samples, principal component analysis was performed on matrices A (RNA editing frequency matrix) and B (gene expression matrix) generated in Example 2, respectively, and scatter plots of the sample distribution were drawn based on their first and second principal components (see Figure 4).
[0057] As shown in Figure 3, the global RNA editing level in the healthy control group (HC) was significantly higher than that in the CD and UC groups (p < 0.05), while no significant difference was found between the CD and UC groups in this overall indicator. This finding suggests that the overall inhibition of RNA editing may be a common feature of IBD and a key mechanism promoting disease progression.
[0058] As shown in Figure 4, in the PCA plot based on RNA editing frequency, the healthy control group (HC), Crohn's disease group (CD), and ulcerative colitis group (UC) exhibit clearer clustering and separation in two-dimensional space; in contrast, in the PCA plot based on gene expression levels, the distributions of the three groups show significant overlap. This visual comparison indicates that the RNA editing data contains richer discriminative information related to disease subtype classification in its overall data structure. This preliminary analysis provides supporting data and a feasibility basis for the subsequent screening of specific features from RNA editing data and the construction of machine learning models.
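The "global editing level" defined in step S1 of this embodiment is a per-sample mean and can be sketched in a few lines of Python. The sample identifiers and frequency values are hypothetical; the statistical test and the principal component analysis are not reproduced here.

```python
from statistics import mean


def global_editing_level(site_frequencies):
    """Mean editing frequency over all detected editing sites of one sample."""
    return mean(site_frequencies.values())


# Hypothetical toy data: sample -> {site: editing frequency}
samples = {
    "HC1": {"s1": 0.30, "s2": 0.50},
    "CD1": {"s1": 0.10, "s2": 0.20, "s3": 0.30},
    "UC1": {"s1": 0.20},
}

levels = {sample_id: global_editing_level(freqs) for sample_id, freqs in samples.items()}
print(levels)
```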
[0059] Example 5
[0060] This embodiment describes a screening method based on subtype-specific differential features, which includes the following steps:
[0061] S1. Independent difference analysis and preliminary specific set acquisition: For the loci that passed the population conservation filter, nonparametric statistical tests (such as the Mann-Whitney U test) were performed to analyze whether there were significant differences (p < 0.05) in RNA editing frequency between the ulcerative colitis group (UC) vs. healthy control group (HC) and the Crohn's disease group (CD) vs. healthy control group (HC). Loci that reached a significant level (p < 0.05) in their respective comparisons were retained to form an initial set. By performing a "non-intersection" operation, loci that were also significant in CD vs. HC were removed from the UC-related loci, resulting in the UC-specific difference set (significant in UC vs. HC, but not significant in CD vs. HC) and the CD-specific difference set (significant in CD vs. HC, but not significant in UC vs. HC).
[0062] S2. Subtype Differential Validation: The two differential sets mentioned above are merged to form a "candidate subtype specific site pool";
[0063] S3. Intersection Screening: Perform statistical tests on all loci against the UC group and the CD group to obtain the UC vs CD differential loci set. Take the intersection of this differential loci set with the aforementioned "candidate subtype specific loci pool" to obtain the loci that constitute the "specific diagnostic feature set".
[0064] Figure 5 illustrates the subtype-specific RNA editing site screening strategy.
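The set operations of steps S1 to S3 in this embodiment can be sketched in Python as follows. The sketch assumes that the p-values of the three pairwise comparisons (UC vs HC, CD vs HC, UC vs CD) have already been computed by a nonparametric test; the function names and the toy p-values are hypothetical.

```python
ALPHA = 0.05  # significance threshold used in this embodiment


def significant(p_values, alpha=ALPHA):
    """Sites whose p-value is below the significance threshold."""
    return {site for site, p in p_values.items() if p < alpha}


def specific_feature_set(p_uc_hc, p_cd_hc, p_uc_cd, alpha=ALPHA):
    sig_uc = significant(p_uc_hc, alpha)
    sig_cd = significant(p_cd_hc, alpha)
    uc_specific = sig_uc - sig_cd              # significant in UC vs HC only
    cd_specific = sig_cd - sig_uc              # significant in CD vs HC only
    candidate_pool = uc_specific | cd_specific  # candidate subtype-specific site pool
    # Keep only candidates that also differ in the direct UC vs CD comparison
    return candidate_pool & significant(p_uc_cd, alpha)


# Hypothetical p-values per editing site
p_uc_hc = {"s1": 0.01, "s2": 0.02, "s3": 0.40, "s4": 0.30}
p_cd_hc = {"s1": 0.03, "s2": 0.50, "s3": 0.01, "s4": 0.60}
p_uc_cd = {"s1": 0.01, "s2": 0.04, "s3": 0.20, "s4": 0.01}

features = specific_feature_set(p_uc_hc, p_cd_hc, p_uc_cd)
print(features)
```

In the toy data, `s1` is dropped because it changes in both diseases, `s3` because it does not separate UC from CD, and `s4` because it does not differ from healthy controls, leaving `s2`.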
[0065] Example 6
[0066] This embodiment uses the Boruta algorithm to reduce feature dimensionality and select core editing features that can distinguish the three subtypes. Specifically, the dataset is first divided into a training set and a test set according to a certain ratio (e.g., 7:3), and the Boruta algorithm is run on the training set for feature selection. After multiple rounds of iterative comparison, the algorithm finally selects the top 20 features as the core editing features for distinguishing the three subtypes (see Figure 6).
[0067] This embodiment employs the Boruta algorithm for core feature selection, aiming to reduce data dimensionality and identify the RNA editing sites most relevant to the three subtype classifications. The Boruta algorithm is a feature selection method based on random forests. Its core principle is to construct a benchmark by creating shadow features (i.e., randomly shuffled copies of the original features). By comparing the importance scores of these "shadow features" with those of the original features, only those original features that are statistically significantly better than random guessing are retained. This process not only eliminates a large number of irrelevant and redundant noise features but also provides more biologically interpretable input variables for subsequent classification model construction, helping to reveal the underlying RNA editing regulatory mechanisms driving the three subtype classifications.
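The shadow-feature principle described above can be illustrated with a simplified Python sketch. This is not the Boruta algorithm itself: the random forest importance is replaced by an absolute Pearson correlation as a stand-in score, the toy labels are binary rather than three-class, and no statistical test or iterative feature removal is performed. All names and values are hypothetical.

```python
import random


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; a stand-in for random forest importance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_hits(features, labels, n_rounds=50, seed=0):
    """Count, per feature, the rounds in which it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        shadow_scores = []
        for values in features.values():
            shuffled = values[:]   # shadow feature: shuffled copy of the original
            rng.shuffle(shuffled)
            shadow_scores.append(abs_pearson(shuffled, labels))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs_pearson(values, labels) > threshold:
                hits[name] += 1
    return hits


# Hypothetical toy data: 10 samples per class
labels = [0] * 10 + [1] * 10
features = {
    "informative_site": [0.1] * 10 + [0.9] * 10,  # tracks the label
    "noise_site": [0.2, 0.8] * 10,                # unrelated to the label
}
hits = shadow_hits(features, labels)
print(hits)
```

A feature that beats the best shadow feature in most rounds would be a candidate for retention, while one that rarely does so behaves like random noise.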
[0068] Example 7
[0069] This embodiment describes the construction, training, and validation of a machine learning diagnostic model. The specific method includes the following steps:
[0070] S. Construction of a Random Forest Model Based on Key Features from Example 6: The top 20 differential editing sites obtained in Example 6 are used, with the same training and test sets as in Example 6; using the randomForest package, with these features in the training set as input and the three-class labels as output, 5-fold cross-validation is used to optimize hyperparameters and select the model for the random forest classifier; by setting the number of decision trees (e.g., ntree=200) and calculating feature importance, a stable ensemble model is trained. This step transforms biological features into a diagnostic function that can make probabilistic predictions;
[0071] S3. Comprehensive Evaluation of Model Performance: The trained model is applied to the independent test set for a quantitative performance evaluation, including multi-class ROC analysis and AUC: a one-vs-rest ("one-to-many") strategy is used to calculate the receiver operating characteristic (ROC) curves for the three classes HC, UC, and CD, and the corresponding areas under the curve (AUC) are obtained. The macro-average AUC is reported as the core indicator of the model's overall discriminative ability (Figure 7), and a confusion matrix is used to compare the predicted classes with the true classes (Figure 8).
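The one-vs-rest AUC calculation and its macro-average can be sketched in Python as follows. The AUC is computed here through its rank interpretation (the probability that a positive sample receives a higher score than a negative one, with ties counted as one half). The function names and the toy probability vectors are hypothetical and are not results of the model.

```python
CLASSES = ("HC", "UC", "CD")


def binary_auc(scores, is_positive):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(probabilities, true_labels):
    """probabilities: rows of [P_HC, P_UC, P_CD]; true_labels: class name per row."""
    per_class = {}
    for k, cls in enumerate(CLASSES):
        scores = [row[k] for row in probabilities]
        per_class[cls] = binary_auc(scores, [label == cls for label in true_labels])
    return per_class, sum(per_class.values()) / len(per_class)


# Hypothetical predictions for six test samples
probabilities = [
    [0.80, 0.10, 0.10],
    [0.20, 0.60, 0.20],
    [0.10, 0.30, 0.60],
    [0.30, 0.20, 0.50],
    [0.40, 0.35, 0.25],
    [0.45, 0.38, 0.17],
]
true_labels = ["HC", "UC", "CD", "CD", "UC", "HC"]

per_class, macro_auc = macro_ovr_auc(probabilities, true_labels)
print(per_class, macro_auc)
```

In this toy set the UC class has one misordered pair (a UC sample scored 0.35 against an HC sample scored 0.38), so its AUC is 7/8 while the other two classes reach 1.0.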
[0072] Example 8
[0073] This embodiment demonstrates the diagnosis of inflammatory bowel disease subtypes using the diagnostic model constructed in Example 7. The specific method includes the following steps:
[0074] S1. Sample collection and pretreatment: Collect 2-3 mL of venous blood sample from the patient to be tested and place it in an EDTA anticoagulant tube. Within 4 h after collection, separate peripheral blood mononuclear cells (PBMCs) by density gradient centrifugation, or use whole blood directly for subsequent RNA extraction.
[0075] S2. RNA extraction and sequencing: Total RNA was extracted from the samples using the Trizol method or a commercial RNA extraction kit (such as the Qiagen RNeasyKit). The extracted RNA was then subjected to quality testing, and an RNA sequencing library was constructed and high-throughput sequencing was performed to obtain the transcriptome data of the samples.
[0076] S3. Feature extraction and quantification: Repeat the steps of Examples 1-2 to quantify RNA editing in the sequencing data, and extract the core RNA editing sites screened in Example 6;
[0077] S4. Model Loading and Probability Prediction: The feature vector obtained in S3 is input into the random forest diagnostic model constructed in Example 7. After receiving the input, the model outputs a three-element probability vector [P_HC, P_UC, P_CD], whose elements represent the probabilities that the sample belongs to healthy control (HC), ulcerative colitis (UC), and Crohn's disease (CD), respectively, and satisfy P_HC + P_UC + P_CD = 1;
[0078] S5. Subtype Determination and Diagnostic Results: Based on the probability vector output from S4, the patient's inflammatory bowel disease subtype is determined according to the following interpretation rules:
[0079] (1) Crohn's disease (CD) diagnosis: If the P_CD value in the probability vector is the largest and the P_CD is significantly higher than the other two categories (usually referring to P_CD being the largest of the three), then the patient is diagnosed with Crohn's disease.
[0080] Example: The output result is [0.12, 0.19, 0.69]. Since P_CD=0.69 is the largest, it is judged as CD.
[0081] (2) Diagnosis of ulcerative colitis (UC): If the P_UC value in the probability vector is the largest and the P_UC is significantly higher than the other two categories (usually referring to P_UC being the largest of the three), then the patient is diagnosed with ulcerative colitis.
[0082] Example: The output result is [0.10, 0.81, 0.09]. P_UC=0.81 is the largest, so it is judged as UC;
[0083] (3) Healthy control (HC) or non-IBD judgment: If the P_HC value is the largest in the probability vector and the P_HC is significantly higher than the other two categories (usually referring to P_HC being the largest among the three), then the patient is judged to be a healthy control or without obvious IBD characteristics;
[0084] Example: The output result is [0.88, 0.06, 0.06]. P_HC=0.88 is the maximum, so it is judged as HC.
[0085] S6. Diagnostic Results Output: Outputs the final diagnostic report, clearly indicating the predicted probability of the patient's subtype classification (HC / UC / CD).
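The interpretation rule of step S5 (assign the class with the largest probability) can be sketched in Python as follows, using the three example outputs given above. The function name `classify` and the tolerance value are assumptions. The description does not specify how ties are resolved; in this sketch the first class in the order HC, UC, CD wins a tie.

```python
CLASSES = ("HC", "UC", "CD")  # order of the probability vector [P_HC, P_UC, P_CD]


def classify(probabilities, tolerance=1e-6):
    """Return the class with the largest probability and that probability."""
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability per class")
    if abs(sum(probabilities) - 1.0) > tolerance:
        raise ValueError("probabilities must sum to 1")
    best = max(range(len(CLASSES)), key=lambda k: probabilities[k])
    return CLASSES[best], probabilities[best]


# Example outputs taken from the description
results = [
    classify([0.12, 0.19, 0.69]),
    classify([0.10, 0.81, 0.09]),
    classify([0.88, 0.06, 0.06]),
]
for label, probability in results:
    print(f"Predicted class: {label} (probability {probability:.2f})")
```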
[0086] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Any person skilled in the art can make many possible variations and modifications to the technical solutions of the present invention, or modify them into equivalent embodiments, without departing from the spirit and technical essence of the present invention. Therefore, any simple modifications, equivalent substitutions, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention, without departing from the content of the technical solutions of the present invention, shall still fall within the scope of protection of the present invention.
Claims
1. A method for constructing a diagnostic model for a subtype of inflammatory bowel disease, characterized in that, Includes the following steps:
S1. Data Acquisition and Preprocessing: PBMC transcriptome sequencing data were acquired from Crohn's disease patients, ulcerative colitis patients, and healthy controls. FastQC was used for preliminary quality assessment of the sequencing data, and MultiQC was used to summarize the results to understand the overall data quality. Trimmomatic software was used to filter the sequencing data, removing adapter sequences and low-quality bases to obtain high-quality sequencing reads. STAR alignment software was used to construct an index based on the human reference genome sequence GRCh38.p13 and the gene annotation file GENCODE v42. The pruned high-quality reads were aligned to the human reference genome to generate sorted BAM files.
S2. Systematic identification and annotation of RNA editing events and transcript quantification: Using the RNA editing identification software REDItools, BAM files were input to systematically identify A-to-I type RNA editing events, and the editing frequency was calculated for each identified candidate editing site; Using the ANNOVAR tool, with the dbSNP database as a reference, known genomic polymorphic sites were filtered out; Using RSEM software, the transcript BAM files of the same batch of samples were accurately quantified to obtain the expression abundance estimate of each gene, and the expression matrix of sample × gene was generated, namely matrix A and matrix B. Matrix A is the RNA editing frequency matrix, and matrix B is the gene expression matrix.
S3. High-dimensional feature screening for diagnosis: First, perform basic quality control filtering to delete editing sites that meet any of the following low confidence conditions: (1) the sequencing coverage of the site is less than 10, (2) the editing frequency of the site is <0.05; perform population conservation filtering through the sites of basic quality control to retain RNA editing sites with an intragroup occurrence rate of 75% in healthy controls, Crohn's disease and ulcerative colitis samples;
S4. Subtype-specific differential feature screening: For the sites filtered through population conservation in S3, nonparametric statistical tests were performed to analyze whether there were significant differences in RNA editing frequency between the ulcerative colitis group (UC) and the healthy control group (HC), and between the Crohn's disease group (CD) and the healthy control group (HC). Sites that reached a significant level in their respective comparisons were retained to form an initial set. Through a non-intersection operation, sites that were also significant in CD vs HC were removed from the UC-related sites, resulting in UC-specific differential sets and CD-specific differential sets. The two differential sets were merged to form a candidate subtype-specific site pool. All loci were subjected to statistical tests of UC vs CD groups to obtain the UC vs CD differential loci set. The intersection of this differential loci set with the candidate subtype specific loci pool was then obtained to form the high-specificity diagnostic feature set.
S5. In order to reduce the dimensionality of the data, the Boruta feature selection algorithm was used to reduce the dimensionality of the high-specificity diagnostic feature set to obtain truly valuable features that can distinguish the three subtypes. The features were then sorted according to their importance, and the top 20 features were used for the final machine learning feature gene set.
S6. Using the machine learning feature set obtained in S5, a diagnostic model for the inflammatory bowel disease subtype is constructed through the random forest algorithm.
2. The method for constructing a diagnostic model for a subtype of inflammatory bowel disease according to claim 1, characterized in that: In step S6, the modeling parameter ntree=200 is used to construct the diagnostic model using random forest.
3. A diagnostic model for a subtype of inflammatory bowel disease constructed according to the construction method described in claim 1 or 2.