A method for predicting comorbidities using semantic profiling.

The method addresses the limitation of existing comorbidity prediction by using over-representation analysis and Pearson correlation coefficients to identify shared biological mechanisms, enabling accurate comorbidity prediction even when the two diseases share no common genes.

JP7880100B2 · Active · Publication Date: 2026-06-25 · GIL MEDICAL CENT +1

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
GIL MEDICAL CENT
Filing Date
2024-10-18
Publication Date
2026-06-25


Abstract

[Problem] To provide a prediction method and system that can determine with high accuracy whether a first disease and a second disease are comorbid, even if they have no genes in common. [Solution] A method for predicting the likelihood of comorbidity includes the steps of: (a1) calculating k P values using a gene list of a first disease and k functional gene sets; (a2) calculating k P values using a gene list of a second disease and the k functional gene sets; (b1) calculating a first semantic profile using the k P values calculated in step (a1); (b2) calculating a second semantic profile using the k P values calculated in step (a2); and (c) calculating a Pearson correlation coefficient using the first semantic profile and the second semantic profile.

Description

[Technical Field]

[0001] The present invention relates to a method for predicting comorbidities, and more specifically, to a method for predicting the likelihood of comorbidity between multiple diseases through semantic profiling.

[Background Art]

[0002] The information provided herein is merely background information relating to the present invention and does not constitute prior art.

[0003] A comorbidity refers to another disease that appears in conjunction with a primary disease. While similar to the concept of complications, it differs in that the other disease is not related to the primary disease.

[0004] To indicate the degree of comorbidity, the Jaccard index method or the overlap coefficient (OC) method is used. These methods use genes known to be highly associated with specific diseases to indicate the degree of comorbidity between two diseases, and predict the likelihood of comorbidity in diseases based on genes associated with the onset of the two diseases.

[0005] Specifically, the index value is determined based on the number of disease genes (overlapping genes) that are common to two diseases. This leads to the problem that if no common disease genes are present, the likelihood of comorbidity cannot be calculated.
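
The limitation described above can be illustrated with a minimal sketch. The gene names and the helper functions `jaccard_index` and `overlap_coefficient` are hypothetical and written only for illustration; they follow the standard set-based definitions of the two indices, not any implementation given in this document.

```python
def jaccard_index(genes_a: set, genes_b: set) -> float:
    """|A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0


def overlap_coefficient(genes_a: set, genes_b: set) -> float:
    """|A ∩ B| / min(|A|, |B|); returns 0.0 when either set is empty."""
    smaller = min(len(genes_a), len(genes_b))
    return len(genes_a & genes_b) / smaller if smaller else 0.0


# Hypothetical disease gene lists with no gene in common.
disease_a = {"A1", "A2", "A3"}
disease_b = {"B1", "B2", "B3", "B4"}

ji = jaccard_index(disease_a, disease_b)
oc = overlap_coefficient(disease_a, disease_b)
print(ji, oc)  # both indices are 0.0 because the intersection is empty
```

Both indices depend only on the size of the intersection, so any pair of diseases without overlapping genes receives the same score of zero regardless of how similar their underlying biology is.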

[0006] However, even without a common disease gene, comorbidity can occur between two diseases if they share a biological mechanism in their development. The methods described above cannot account for comorbidities for which no common disease gene has been found.

[Summary of the Invention]

[Problems to Be Solved by the Invention]

[0007] Therefore, the present invention aims to provide a prediction method and system that can determine with high accuracy whether disease A and disease B are comorbid, even when they share no common gene.

[0008] The problems that this invention aims to solve are not limited to those mentioned above, and other problems not mentioned will be clearly understood by a person of ordinary skill in the art from the following description.

[Means for Solving the Problems]

[0009] According to one embodiment of the present invention, there is provided a method for predicting the likelihood of comorbidity of a first disease and a second disease, the method comprising the steps of: (a1) the over-representation analysis modules 10, 10' calculating k P values using a gene list of the first disease and k functional gene sets; (a2) the over-representation analysis modules 10, 10' calculating k P values using a gene list of the second disease and the k functional gene sets; (b1) the semantic profile calculation modules 12, 12' calculating a first semantic profile (P_A) using the k P values calculated in step (a1); (b2) the semantic profile calculation modules 12, 12' calculating a second semantic profile (P_B) using the k P values calculated in step (a2); and (c) calculating the Pearson correlation coefficient using the first semantic profile (P_A) and the second semantic profile (P_B).

[0010] Furthermore, in steps (a1) and (a2) according to one embodiment of the present invention, the P value is preferably calculated using formula 1.

[0011] Furthermore, in step (b1) according to one embodiment of the present invention, the first semantic profile (P_A) is preferably a k-dimensional vector obtained by substituting the k P values calculated in step (a1) into an exponential function, and in step (b2), the second semantic profile (P_B) is preferably a k-dimensional vector obtained by substituting the k P values calculated in step (a2) into the exponential function.

[0012] Furthermore, in step (c) according to one embodiment of the present invention, the Pearson correlation coefficient is preferably calculated by formula 2, where n in formula 2 is the number of functional gene sets.

[0013] Furthermore, a prediction method according to another embodiment of the present invention further includes a step (a0) performed prior to step (a1), in which the core subset extraction module 16' selects some genes from a first disease gene list and some genes from a second disease gene list.

[0014] Furthermore, in prediction methods according to other embodiments of the present invention, the core subset extraction module 16' selects some genes using Mixture Model Regression.

[Effects of the Invention]

[0015] As described above, according to one embodiment and other embodiments of the present invention, there is an advantage in that whether disease A and disease B are comorbid can be determined with high accuracy, even without a common gene.

[Brief Description of the Drawings]

[0016]
[Figure 1] A block diagram of a prediction system according to one embodiment of the present invention.
[Figure 2] A diagram showing the flow of data processed by a prediction system according to one embodiment of the present invention.
[Figure 3] A block diagram of a prediction system according to another embodiment of the present invention.
[Figure 4] A flowchart of a prediction method using a prediction system according to one embodiment or other embodiments of the present invention.

Embodiments for Carrying Out the Invention

[0017] Hereinafter, some embodiments of the present invention will be described in detail with reference to exemplary drawings. When assigning reference numerals to the components of each drawing, the same reference numerals are used for the same components as much as possible even if they are different drawings. Also, in describing the present invention, when it is determined that a detailed description of related known configurations or functions will obscure the gist of the present invention, the detailed description thereof will be omitted.

[0018] In describing the components of the embodiments of the present invention, designations such as first, second, i), ii), a), b), etc. are used. These designations serve only to distinguish one component from another, and the essence, order, sequence, etc. of the components are not limited by them. In this specification, when a part "includes" or "comprises" a certain component, it does not mean excluding other components unless otherwise specified, but means that other components may be further included.

[0019] In the present invention, the plurality of genes included in the disease gene list means a set of genes involved in the disease.

[0020] Description of the comorbidity prediction system

FIG. 1 is a block diagram of a prediction system according to an embodiment of the present invention. FIG. 2 is a diagram showing the flow of data processed by the prediction system according to an embodiment of the present invention.

[0021] Referring to Figures 1 and 2, a prediction system 1 according to one embodiment of the present invention will be described. The prediction system 1 is configured to calculate the Pearson correlation coefficient and predict the presence or absence of comorbidity. To this end, the prediction system 1 according to one embodiment of the present invention includes all or part of the over-representation analysis module 10, the semantic profiling module 12, and the Pearson correlation coefficient calculation module 14.

[0022] The over-representation analysis module 10 calculates p-values in conjunction with multiple databases 2 and 3. Specifically, the over-representation analysis module 10 is configured to calculate p-values using the gene lists for disease A and disease B stored in the gene list database 2, and the functional gene sets stored in the functional gene set database 3.

[0023] Here, the gene list for disease A may include genes A1, B2, A3, A4, etc. Similarly, the gene list for disease B may include genes B1, B2, B3, A4, etc. Note that the names A1, B1, etc., are used here to indicate that each gene is distinct.

[0024] Furthermore, a functional gene set includes a gene ontology (GO) or a gene pathway. A gene ontology is a type of database consortium, a structured model for studying gene function, where individual genes are associated with biological processes, molecular functions, and cellular components. A gene pathway refers to a biological database that represents the dynamic relationships or interactions between biological elements such as proteins, genes, and cells in a network format.

[0025] The over-representation analysis module 10 calculates the p-value using the over-representation analysis method.

[0026] The over-representation analysis method calculates the p-value using Table 1 and Formula 1.

[0027] [Table 1]

[0028] [Math 1]

[0029] In Table 1, a plus sign indicates that the gene is included in the disease list or functional gene set, while a minus sign indicates that the gene is not included. For example, 'a' shows the number of human genes that are included in the gene list for disease A and are also included in the gene ontology. 'b' shows the number of human genes that are not included in the gene list for disease A but are included in the gene ontology. 'c' shows the number of human genes that are included in the gene list for disease A but are not included in the gene ontology. 'd' shows the number of human genes that are neither included in the gene list for disease A nor in the gene ontology.

[0030] Here, n = a + b + c + d, where n is the total number of human genes.

[0031] Using a, b, d, and n, the p-value is calculated using formula 1.
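
Formula 1 is not reproduced in this text. As a hedged sketch only, the block below assumes that the p-value is the one-sided hypergeometric tail probability commonly used in over-representation analysis, computed from the Table 1 counts a, b, c, and d. The function name `ora_p_value` and the example counts are hypothetical, and the patent's actual Formula 1 may differ in form.

```python
from math import comb


def ora_p_value(a: int, b: int, c: int, d: int) -> float:
    """One-sided hypergeometric p-value P(X >= a) for the 2x2 table.

    a: genes in the disease list and in the functional gene set
    b: genes not in the disease list but in the functional gene set
    c: genes in the disease list but not in the functional gene set
    d: genes in neither
    """
    n = a + b + c + d      # total number of genes
    set_size = a + b       # size of the functional gene set
    list_size = a + c      # size of the disease gene list
    max_overlap = min(set_size, list_size)
    # Count the ways to draw list_size genes with at least `a` from the set.
    tail = sum(
        comb(set_size, i) * comb(n - set_size, list_size - i)
        for i in range(a, max_overlap + 1)
    )
    return tail / comb(n, list_size)


# Hypothetical counts: 4 of 5 disease genes fall in a 10-gene set, n = 100.
print(ora_p_value(4, 6, 1, 89))
# No overlap at all gives the largest possible p-value.
print(ora_p_value(0, 10, 5, 85))
```

A small p-value means the disease gene list overlaps the functional gene set more than chance would suggest, while a list with no overlap receives a p-value of 1.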

[0032] The over-representation analysis module 10 according to one embodiment of the present invention compares a gene list for one disease with multiple functional gene sets.

[0033] For example, if p-values are calculated using a gene list for disease A and a gene set containing gene ontology 1 to gene ontology 1000, 1000 p-values for disease A will be calculated. If the calculation process for disease A is repeated for disease B, 1000 p-values for disease B will be calculated.

[0034] The 1000 p-values calculated for disease A and the 1000 p-values calculated for disease B are sent to the semantic profiling module 12.

[0035] The semantic profiling module 12 receives 1000 p-values for disease A and 1000 p-values for disease B, and uses them to calculate a semantic profile.

[0036] Here, the semantic profile refers to the result obtained by representing 1000 p-values for disease A and 1000 p-values for disease B as vectors and substituting them into an exponential function.

[0037] The semantic profile is expressed as follows. Note that the numbers shown here are for illustrative purposes only and will vary depending on the type of disease and on the number and type of functional gene sets.
Exp(P-vector A) = (1.0, 2.7, 1.8, 1.1, …)
Exp(P-vector B) = (1.3, 2.1, 2.6, 2.7, …)

[0038] The Pearson correlation coefficient calculation module 14 is configured to calculate the Pearson correlation coefficient (PCC) of the semantic profile for disease A (hereinafter referred to as P_A) and the semantic profile for disease B (hereinafter referred to as P_B). The formula for calculating the PCC using P_A and P_B is shown in Equation 2.

[0039] [Math 2]

[0040] In Equation 2, A and B represent the two diseases, and n represents the number of functional gene sets used in calculating the p-values. For example, as mentioned earlier, if 1000 gene ontologies are used as functional gene sets, then n is 1000.
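
Equation 2 is not reproduced in this text. The following is a minimal sketch, assuming it is the standard Pearson correlation coefficient taken over the n components of the two semantic profiles. The function name `pearson` is hypothetical, and the input vectors are only the first four illustrative components quoted in paragraph [0037], not real data.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Standard Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)  # number of functional gene sets
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# First four illustrative components of the semantic profiles in [0037].
profile_a = [1.0, 2.7, 1.8, 1.1]
profile_b = [1.3, 2.1, 2.6, 2.7]

pcc = pearson(profile_a, profile_b)
print(round(pcc, 3))
```

The sketch raises an error for a constant profile because the coefficient is undefined when either vector has zero variance; how the patented system handles that case is not stated in the source.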

[0041] The PCC is a numerical representation of the degree of similarity between the semantic profiles of the gene groups for two diseases. A positive PCC value indicates a higher likelihood of comorbidity, while other values indicate a lower likelihood of comorbidity.

[0042] According to the prediction system 1 according to an embodiment of the present invention, regardless of the number of genes common to disease A and disease B, when disease A and disease B share similar biological mechanisms, the semantic profiles of the two diseases (P_A, P_B) have a relatively high PCC. Therefore, the prediction system 1 according to an embodiment of the present invention has the advantage of being able to determine whether disease A and disease B are comorbid even without common genes.
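
The behavior described in this paragraph can be sketched end to end on toy data. Everything here is an assumption made for illustration: the 100-gene universe, the four functional gene sets, the two disease gene lists, and the use of a hypergeometric tail probability in place of Formula 1, which is not reproduced in this text.

```python
from math import comb, exp, sqrt


def ora_p_value(disease_genes: set, gene_set: set, universe_size: int) -> float:
    """Hypergeometric P(X >= overlap); an assumed stand-in for Formula 1."""
    overlap = len(disease_genes & gene_set)
    k, m = len(gene_set), len(disease_genes)
    tail = sum(
        comb(k, i) * comb(universe_size - k, m - i)
        for i in range(overlap, min(k, m) + 1)
    )
    return tail / comb(universe_size, m)


def semantic_profile(disease_genes: set, gene_sets: list, universe_size: int) -> list:
    """Exponential of each p-value, one component per functional gene set."""
    return [exp(ora_p_value(disease_genes, s, universe_size)) for s in gene_sets]


def pearson(x: list, y: list) -> float:
    """Standard Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Toy universe of 100 genes (g0..g99) and four 10-gene functional gene sets.
UNIVERSE_SIZE = 100
gene_sets = [{f"g{i}" for i in range(start, start + 10)} for start in (0, 10, 20, 30)]

# Two hypothetical diseases: both are enriched in the first gene set,
# yet their gene lists do not overlap.
disease_a = {"g0", "g1", "g2", "g3", "g10"}
disease_b = {"g5", "g6", "g7", "g8", "g20"}

shared = disease_a & disease_b
profile_a = semantic_profile(disease_a, gene_sets, UNIVERSE_SIZE)
profile_b = semantic_profile(disease_b, gene_sets, UNIVERSE_SIZE)
pcc = pearson(profile_a, profile_b)
print(len(shared), pcc > 0)
```

In this toy case the two lists have no gene in common, so an overlap-based index would be zero, but both lists are enriched in the same functional gene set and the resulting profiles are positively correlated.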

[0043] FIG. 3 is a block diagram of a prediction system according to another embodiment of the present invention.

[0044] As shown in FIG. 3, the prediction system 1' according to another embodiment of the present invention further includes a core subset extraction module 16' in addition to the over-representation analysis module 10', the semantic profile calculation module 12', and the Pearson correlation coefficient calculation module 14'.

[0045] The core subset extraction module 16' is configured to extract a core subset. Specifically, when two disease gene lists show similarity to each other, instead of using the entire gene list of each disease, the P-value can be calculated using only a part of the genes in the list. Therefore, the core subset extraction module 16' extracts only a part of the genes in the list.

[0046] When extracting the core subset, it is preferable to use the Mixture Model Regression method.

[0047] FIG. 4 is a flowchart of a prediction method using the prediction system according to an embodiment or another embodiment of the present invention.

[0048] As shown in FIG. 4, the prediction system 1 according to an embodiment of the present invention can predict comorbidity in the following order.

[0049] The over-representation analysis modules 10 and 10' perform over-representation analysis on disease A and disease B (S410).

[0050] Here, the over-representation analysis modules 10 and 10' work in conjunction with multiple databases 2 and 3 to calculate p-values.

[0051] Specifically, the p-value for a given disease can be calculated using the gene list for each disease stored in the gene list database 2 and at least one functional gene set stored in the functional gene set database 3. The over-representation analysis module 10 calculates the p-value using the over-representation analysis method. The p-value is calculated using Table 1 and Formula 1.

[0052] The number of p-values is determined by the number of functional gene sets. That is, if over-representation analysis is performed on the gene list of disease A against n functional gene sets, n p-values are obtained. If the same process is performed on the gene list of disease B, n p-values for disease B are obtained.

[0053] The semantic profiling module 12 calculates a semantic profile using the n p-values ​​for disease A and the n p-values ​​for disease B calculated in step S410 (S420).

[0054] Specifically, the semantic profiling module 12 substitutes the n P values calculated for disease A into an exponential function and represents the result as an n-dimensional vector (P_A). Likewise, the n P values calculated for disease B are substituted into the exponential function and represented as an n-dimensional vector (P_B). These two vectors are called the semantic profiles according to the present invention.

[0055] The Pearson correlation coefficient calculation module 14 calculates the Pearson correlation coefficient using the semantic profiles P_A and P_B (S430).

[0056] Here, the Pearson correlation coefficient is calculated using Equation 2. According to the present invention, by applying the value of the Pearson correlation coefficient to logistic regression analysis, a model is constructed that predicts whether or not two diseases have a comorbidity relationship, and the constructed model is applied to the determination of comorbidities in clinical practice. In order to train the prediction model, conventionally known comorbidity relationships between two diseases are used as class labels.
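
The logistic regression step can be sketched as follows. This is a minimal illustration, assuming a single-feature model fitted by plain batch gradient descent; the source does not specify the fitting procedure, and the training PCC values, labels, learning rate, and function names are all hypothetical.

```python
from math import exp


def sigmoid(z: float) -> float:
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(pcc_values, labels, lr=0.5, epochs=5000):
    """Fit P(comorbid | pcc) = sigmoid(w * pcc + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(pcc_values)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(pcc_values, labels):
            err = sigmoid(w * x + b) - y   # gradient of the log-loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical training pairs: PCC of a disease pair and whether the pair
# is a known comorbidity (1) or not (0). These are not data from the patent.
train_pcc = [-0.4, -0.1, 0.0, 0.2, 0.1, 0.3, 0.5, 0.7, 0.9]
train_label = [0, 0, 0, 0, 1, 1, 1, 1, 1]

w, b = fit_logistic(train_pcc, train_label)
prob_high = sigmoid(w * 0.8 + b)    # predicted probability for PCC = 0.8
prob_low = sigmoid(w * -0.2 + b)    # predicted probability for PCC = -0.2
print(prob_high > 0.5, prob_low > 0.5)
```

Known comorbidity relationships act as the class labels, so the fitted model turns a PCC value for a new disease pair into a probability of comorbidity.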

[0057] Steps S410 to S430 are common steps in the prediction method using prediction systems 1 and 1' according to one embodiment of the present invention and other embodiments.

[0058] A prediction method using prediction system 1' according to another embodiment of the present invention further includes a core subset extraction step (S400) performed prior to step S410.

[0059] The core subset extraction module 16' extracts some genes from the gene list of disease A and some genes from the gene list of disease B. Note that the number of genes extracted from each list must be the same.

[0060] A preferred method for extracting some genes from each list is mixture model regression. Using mixture model regression, a list of arbitrarily extracted genes is generated, and it is checked whether a higher Pearson correlation coefficient can be obtained compared to when the entire gene list is used.

[0061] If the above method finds an extracted gene list with a higher Pearson correlation coefficient than the entire gene list, that extracted gene list is considered to be a set of genes more closely related to the development of comorbidity in the two diseases. Such a gene list is determined to be a core gene set.
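
The acceptance check described above can be sketched as a simple search loop. Note the substitution: the source names mixture model regression but does not detail it, so this sketch uses plain random sampling instead. The function `search_core_subset`, the scoring callback `score_fn` (a stand-in for steps S410 to S430), the toy scoring rule, and the gene names are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def search_core_subset(
    genes_a: Sequence[str],
    genes_b: Sequence[str],
    subset_size: int,
    score_fn: Callable[[List[str], List[str]], float],
    trials: int = 200,
    seed: int = 0,
) -> Tuple[List[str], List[str], float, float]:
    """Return (best_a, best_b, best_score, full_score).

    Equal-sized subsets are drawn from both lists; a candidate is kept only
    if it scores higher than the full lists and any earlier candidate.
    """
    rng = random.Random(seed)
    full_score = score_fn(list(genes_a), list(genes_b))
    best_a, best_b, best_score = list(genes_a), list(genes_b), full_score
    for _ in range(trials):
        cand_a = rng.sample(list(genes_a), subset_size)
        cand_b = rng.sample(list(genes_b), subset_size)
        score = score_fn(cand_a, cand_b)
        if score > best_score:
            best_a, best_b, best_score = cand_a, cand_b, score
    return best_a, best_b, best_score, full_score


def toy_score(sub_a: List[str], sub_b: List[str]) -> float:
    """Hypothetical stand-in for the PCC: share of genes tagged 'core'."""
    picked = sub_a + sub_b
    return sum(g.startswith("core") for g in picked) / len(picked)


genes_a = ["coreA1", "coreA2", "coreA3", "noiseA1", "noiseA2", "noiseA3"]
genes_b = ["coreB1", "coreB2", "coreB3", "noiseB1", "noiseB2", "noiseB3"]

best_a, best_b, best_score, full_score = search_core_subset(
    genes_a, genes_b, subset_size=3, score_fn=toy_score
)
print(best_score >= full_score, len(best_a) == len(best_b))
```

In practice `score_fn` would run over-representation analysis, semantic profiling, and the Pearson correlation on the candidate subsets; it is passed in as a callback here so that the search logic stays independent of how the score is computed.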

[0062] Performance of the comorbidity prediction system

Table 2 shows the accuracy of comorbidity prediction results using prediction systems 1 and 1' according to one embodiment and another embodiment of the present invention (where accuracy is between 0 and 1), and the accuracy of comorbidity prediction results using conventional methods.

[0063] [Table 2]

[0064] In Table 2, "n" represents the number of disease pairs. "RR thres" indicates the relative risk threshold, "JI" shows the result calculated using the Jaccard index method, "OC" shows the result calculated using the overlap coefficient method, and "Sab" shows the result calculated using the separation measure proposed by Menche et al. In other words, JI, OC, and Sab show the results using conventional methods. Here, the relative risk is the result of calculating the degree of comorbidity between diseases based on data from one million people published by the Board of Review and Assessment. That is, Table 2 compares how well each method, using only gene sets, can predict the degree of comorbidity that is obtained from actual clinical data using relative risk.

[0065] "GOBP" shows the results obtained by the prediction system 1 according to one embodiment of the present invention using a prediction method that utilizes Gene Ontology biological process (GOBP) terms as functional gene sets; "GOMF" shows the results obtained by the prediction system 1 using a prediction method that utilizes Gene Ontology molecular function (GOMF) terms as functional gene sets; and "Reactome" shows the results obtained by the prediction system 1 using a prediction method that utilizes the Reactome gene sets as functional gene sets. In other words, GOBP, GOMF, and Reactome show the results obtained by the method using the prediction system 1 according to one embodiment of the present invention.

[0066] "LR" indicates the results obtained using the prediction method with prediction system 1' according to another embodiment of the present invention.

[0067] As shown in Table 2, the results for GOBP, GOMF, Reactome, and LR are consistently higher than those of the conventional methods, indicating superior prediction accuracy.

[0068] The above description merely illustrates the technical concept of the present invention, and any person with ordinary skill in the art to which the present invention pertains could make various modifications and alterations without departing from the essential characteristics of the present invention. Therefore, this embodiment illustrates the technical concept of the present invention and does not limit it. The scope of protection of the present invention should be interpreted according to the claims, and all technical concepts within the scope equivalent thereto should be interpreted as being included in the present invention.

[Explanation of Symbols]

[0069]
1, 1': Comorbidity prediction system
2: Gene list database
3: Functional gene set database
10, 10': Over-representation analysis module
12, 12': Semantic profile calculation module
14, 14': Pearson correlation coefficient calculation module
16': Core subset extraction module

Claims

1. A method for predicting the likelihood of comorbidity of a first disease and a second disease, the method comprising the steps of:
(a1) an over-representation analysis module (10, 10') calculating k P values using a gene list of the first disease and k functional gene sets;
(a2) the over-representation analysis module (10, 10') calculating k P values using a gene list of the second disease and the k functional gene sets;
(b1) a semantic profile calculation module (12, 12') calculating a first semantic profile (P_A) using the k P values calculated in step (a1);
(b2) the semantic profile calculation module (12, 12') calculating a second semantic profile (P_B) using the k P values calculated in step (a2); and
(c) a Pearson correlation coefficient calculation module (14) calculating a Pearson correlation coefficient using the first semantic profile (P_A) and the second semantic profile (P_B),
wherein, in steps (a1) and (a2), the P value is calculated using Formula 1,
in step (b1), the first semantic profile (P_A) is a k-dimensional vector obtained by substituting the k P values calculated in step (a1) into an exponential function, and
in step (b2), the second semantic profile (P_B) is a k-dimensional vector obtained by substituting the k P values calculated in step (a2) into an exponential function.
[Math 1]
a: The number of human genes that are included in the disease gene list and are also included in the functional gene set.
b: The number of human genes that are not included in the disease gene list but are included in the functional gene set.
c: The number of human genes that are included in the disease gene list but not in the functional gene set.
d: The number of human genes that are included in neither the disease gene list nor the functional gene set.
n: The total number of human genes.

2. The method according to claim 1, wherein, in step (c), the Pearson correlation coefficient is calculated using Equation 2, where n is the number of functional gene sets. [Math 2]

3. The method according to claim 1 or 2, further comprising, before step (a1), a step (a0) in which a core subset extraction module (16') selects some genes from the first disease gene list and some genes from the second disease gene list.

4. The method according to claim 3, wherein the core subset extraction module (16') selects the genes using Mixture Model Regression.