A multi-task hepatitis B drug screening method and system based on knowledge graph assistance

By using knowledge graph-based temporal connections and hyperedge definitions, the problem of multi-target interactions in drug combination evaluation is solved, enabling accurate assessment of drug sensitivity and combination potential, and improving the accuracy of drug screening and cross-task generalization ability.

CN120452598B · Active · Publication Date: 2026-06-26 · SHANGRAO SHAJIANG HIGH TECH BIOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGRAO SHAJIANG HIGH TECH BIOLOGY CO LTD
Filing Date
2025-04-28
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to express the complex interaction mechanisms between drug combinations and multi-target proteins. In addition, entity vectors that do not incorporate temporal association strength and neighborhood interaction patterns limit the cross-task generalization ability of drug sensitivity prediction and combination effect assessment.

Method used

Based on a knowledge graph, the method establishes temporal connections, identifies sets of multiple entities that act together, defines hyperedge connections, expands the ways nodes can be connected, calculates entity vector representations, and assesses drug sensitivity and combination potential. By combining biological pathway and protein interaction data, it quantifies the matching degree and temporal association characteristics between drugs and viral variants.

Benefits of technology

It enhances the accuracy of drug combination assessment, reduces assessment bias caused by outdated historical data, identifies highly similar and highly rated competing drug pairs, and improves the precision of drug combination potential assessment.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN120452598B_ABST

Abstract

The present application relates to the technical field of knowledge graphs, and in particular to a multi-task hepatitis B drug screening method and system based on knowledge graph assistance. The method comprises the following step: based on hepatitis B virus genotype sequence data recorded over time, patient drug use records and drug sensitivity, temporal connections between entities are established to generate a temporal hepatitis B knowledge graph structure. In the present application, multi-dimensional dynamic time-series data such as virus genotype sequence data, patient drug use records and drug sensitivity are integrated, interaction events between entities are time-stamped, and dynamic characteristics such as virus variation trajectories and drug efficacy changes are embedded in the graph node attributes, so that the knowledge representation reflects the time dependence found in real scenarios. Based on biological pathway annotation information and protein interaction data, hyperedges connecting multi-entity sets are defined, which overcomes the limitation that traditional knowledge graphs support only binary relationships, explicitly models the synergistic mechanisms of drug combinations and multiple targets, and avoids misjudging combination effects because interaction relationships were oversimplified.

Description

Technical Field

[0001] This invention relates to the field of knowledge graph technology, and in particular to a multi-task hepatitis B drug screening method and system based on knowledge graph assistance.

Background Technology

[0002] Knowledge graphs are a technology that models entities and their relationships using graph structures. They use semantic networks to perform structured representation and reasoning on heterogeneous data from multiple sources, and their core functions include entity extraction, relationship mining, graph embedding, and dynamic temporal modeling.

[0003] Current technologies rely on binary relation edges to describe entity interactions and therefore struggle to express the collaborative or competitive effects of multiple entities, such as the complex interaction mechanisms between drug combinations and multi-target proteins. Such simplified modeling can easily miss key combination effect signals. Entity vector representations are typically optimized for a single task objective and fail to incorporate dynamic contextual information such as temporal correlation strength and neighborhood interaction patterns. This limits how well the features generalize across tasks; for example, drug sensitivity prediction and combination effect assessment each require a separately trained model. Therefore, improvements are needed.

Summary of the Invention

[0004] The purpose of this invention is to overcome the shortcomings of existing technologies and to propose a knowledge graph-assisted multi-task hepatitis B drug screening method and system.

[0005] To achieve the above objectives, the present invention adopts the following technical solution: a knowledge graph-assisted multi-task hepatitis B drug screening method, comprising the following steps:

[0006] Based on hepatitis B virus genotype sequence data recorded over time, patient medication records, and drug sensitivity, temporal connections between entities are established to generate a temporal hepatitis B knowledge graph structure.

[0007] Based on the aforementioned time-series hepatitis B knowledge graph structure, and combined with the input biological pathway annotation information, protein interaction data, and known drug combination effect records, the set of multiple entities acting together in the time-series hepatitis B knowledge graph structure is identified, a hyperedge connecting the multiple entity set is defined, the multi-entity interaction set and the hyperedge definition are obtained, and based on the multi-entity interaction set and the hyperedge definition, the connection mode of the nodes in the time-series hepatitis B knowledge graph structure is expanded to establish a hepatitis B knowledge base that integrates hyperedge interactions.

[0008] Based on the hepatitis B knowledge base with fusion hyperedge interaction, the initial vector expression of each hepatitis B virus entity, drug entity, and gene entity is calculated to obtain the initial entity vector representation. Based on the initial entity vector representation, the vector expression is updated and adjusted by aggregating the neighborhood node information and the connection information of different types of edges in the graph, and a multi-task hepatitis B entity feature vector is established.

[0009] Based on the multi-task hepatitis B entity feature vector, the feature vector corresponding to the target hepatitis B virus variant and the feature vector corresponding to the candidate drug are selected, and the drug sensitivity score between the two is estimated to obtain a list of viral strain drug sensitivity scores. Based on the list of viral strain drug sensitivity scores and the multi-task hepatitis B entity feature vector corresponding to the candidate drug combination, the antagonistic effect of the drug combination is evaluated, and the potential evaluation result of the hepatitis B drug combination is obtained.

[0010] Preferably, the steps for obtaining the temporal hepatitis B knowledge graph structure are as follows:

[0011] Integrate hepatitis B virus genotype sequence data, patient medication records, and drug sensitivity data; extract timestamp information from each data point; convert the timestamp information into numerical time labels in a unified time format; and generate a timestamp-associated dataset.

[0012] Based on the timestamp-associated dataset, the interval between time tags is calculated, and continuous time periods are divided according to the interval and a preset time window threshold. In each time period, an association edge is established between the hepatitis B virus genotype entity and the drug entity. The weight value of the time association edge is calculated, and a weighted time association edge set is generated.

[0013] Based on the weighted set of time-related edges, the edges with weight values greater than a preset edge weight threshold are connected to the corresponding entities to form a time-series hepatitis B knowledge graph structure.

[0014] Preferably, the steps for obtaining the multi-entity interaction set and the hyperedge definition are as follows:

[0015] By integrating entity nodes, biological pathway annotation information, protein-protein interaction data, and drug combination effect records in the aforementioned time-series hepatitis B knowledge graph structure, gene regulatory pathway identifiers in the biological pathway annotation information and drug target gene set identifiers in the drug combination effect records are extracted to generate a multi-source dataset.

[0016] Based on the multi-source dataset, the intensity of the combined effect of the gene entity set and the drug target gene set within the same time window is calculated.

[0017] Based on the strength of the interaction, gene sets and drug target gene sets with interaction strength greater than a preset threshold are selected, and the hyperedges of the connection sets are defined to generate multi-entity interaction sets and hyperedge definitions.

[0018] Preferably, the steps for acquiring the hepatitis B knowledge base integrating hyper-edge interaction are as follows:

[0019] Extract the node connection relationships between the multi-entity interaction set and the hyperedge definition and the temporal hepatitis B knowledge graph structure; traverse each hyperedge in the hyperedge definition; establish bidirectional connection relationships between the hyperedge and all nodes in the corresponding entity set; and generate a hyperedge connection relationship table.

[0020] Based on the hyperedge connection table, it is detected whether there are isolated nodes or redundant connections in the hyperedge connection. If there are isolated nodes, missing connections are supplemented according to protein interaction data. If there are redundant connections, duplicate hyperedge connections are merged according to a preset redundancy threshold to generate a verified hyperedge connection table.

[0021] Based on the verified hyperedge connection table, the hyperedge connection relationships are mapped and replaced with the original node connection methods in the temporal hepatitis B knowledge graph structure, and the connection types and attributes between nodes are updated to form a hepatitis B knowledge base that integrates hyperedge interactions.
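The table construction and checks in the steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of Jaccard overlap as the redundancy measure, and the sample identifiers are assumptions, since the text only specifies "a preset redundancy threshold". Supplementing isolated nodes from protein interaction data is left out; the sketch only detects them.

```python
def jaccard(a, b):
    """Overlap of two member sets (assumed redundancy measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def merge_redundant(hyperedges, threshold):
    """Merge hyperedges whose member sets overlap at or above `threshold`."""
    merged = {}
    for edge_id, members in hyperedges.items():
        members = frozenset(members)
        for kept_id, kept_members in merged.items():
            if jaccard(members, kept_members) >= threshold:
                merged[kept_id] = kept_members | members
                break
        else:
            merged[edge_id] = members
    return merged


def build_connection_table(hyperedges):
    """Node -> hyperedge ids; with `hyperedges` itself this gives both directions."""
    node_to_edges = {}
    for edge_id, members in hyperedges.items():
        for node in members:
            node_to_edges.setdefault(node, set()).add(edge_id)
    return node_to_edges


def isolated_nodes(all_nodes, node_to_edges):
    """Nodes of the graph that no hyperedge touches."""
    return set(all_nodes) - set(node_to_edges)


# Hypothetical hyperedges over placeholder node ids.
hyperedges = {"h1": {"a", "b", "c"}, "h2": {"a", "b", "c"}, "h3": {"d", "e"}}
verified = merge_redundant(hyperedges, threshold=0.8)
table = build_connection_table(verified)
orphans = isolated_nodes({"a", "b", "c", "d", "e", "f"}, table)
```

Keeping the hyperedge-to-members mapping alongside the node-to-hyperedges table gives the bidirectional lookup the step describes without duplicating edges in the graph itself.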

[0022] Preferably, the step of obtaining the initial entity vector representation is as follows:

[0023] Traverse the hepatitis B virus entity, drug entity, and gene entity nodes in the hepatitis B knowledge base that integrates hyperedge interactions, extract the interaction event timestamps recorded in the time association edges of each node, count the total number of times each node is associated with the multi-entity interaction set in the hyperedge information, record the number of times the interaction event timestamp sequence is associated with the hyperedge, and generate a node time-series interaction log.

[0024] Based on the node time-series interaction log, with the first interaction timestamp as the benchmark, the time difference between each subsequent timestamp and the benchmark is calculated to generate a time difference sequence. The frequency of each node appearing simultaneously with other entities in the hyperedge information is counted, and the ratio of the combination frequency to the length of the time difference sequence is calculated to generate a set of multi-entity co-occurrence intensity factors.

[0025] Based on the set of multi-entity co-occurrence intensity factors, the time difference sequence is divided into windows with a period of 30 days. The mean and variance of the time difference within each window are calculated. The mean, variance, and co-occurrence intensity factors are concatenated to output the initial entity vector representation.
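A minimal sketch of how such an initial vector could be assembled from one node's interaction log. The fixed number of windows, the zero-filling of empty windows, the use of population variance, and all names are assumptions made so that every entity gets a vector of the same length; the text specifies only the 30-day window, the per-window mean and variance, and the co-occurrence ratio.

```python
import statistics

SECONDS_PER_DAY = 86_400
WINDOW_SECONDS = 30 * SECONDS_PER_DAY  # 30-day window from the text


def initial_entity_vector(timestamps, cooccurrence_count, n_windows):
    """Concatenate per-window (mean, variance) of time differences with the
    co-occurrence intensity factor."""
    ordered = sorted(timestamps)
    # Differences between each later timestamp and the first (benchmark) one.
    diffs = [t - ordered[0] for t in ordered[1:]]
    factor = cooccurrence_count / len(diffs) if diffs else 0.0

    vector = []
    for w in range(n_windows):
        low, high = w * WINDOW_SECONDS, (w + 1) * WINDOW_SECONDS
        in_window = [d for d in diffs if low <= d < high]
        if in_window:
            vector += [statistics.mean(in_window), statistics.pvariance(in_window)]
        else:
            vector += [0.0, 0.0]  # assumption: empty windows are zero-filled
    vector.append(factor)
    return vector


# Hypothetical log: events at day 0, 10, 20 and 40; six hyperedge co-occurrences.
log = [d * SECONDS_PER_DAY for d in (0, 10, 20, 40)]
vec = initial_entity_vector(log, cooccurrence_count=6, n_windows=2)
```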

[0026] Preferably, the steps for obtaining the multi-task hepatitis B entity feature vector are as follows:

[0027] Traverse each entity node in the initial entity vector representation, extract the set of directly connected neighboring nodes in the hepatitis B knowledge base with fused hyperedge interaction, record the entity type, edge type and connection count of the neighboring nodes, and generate a set of neighboring node information.

[0028] Based on the neighborhood node information set, the proportion of connection times of different edge types in the neighborhood of each entity node is counted, and the proportion is multiplied by the preset priority coefficient of the entity type of the neighboring node to generate a set of edge type weight factors.

[0029] Based on the set of edge type weight factors, the initial entity vector representation is concatenated with the entity vectors in the neighboring node information set according to the weight factors to generate a multi-task hepatitis B entity feature vector.
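The aggregation step can be illustrated with a short sketch. The text does not say exactly how vectors are "concatenated according to the weight factors", so this example assumes one plausible reading: neighbor vectors are summed with their weight factors and the result is appended to the entity's own vector. The function name, the default priority of 1.0, and the sample data are hypothetical.

```python
def multitask_feature(node, vectors, neighbors, priority):
    """Append a weighted sum of neighbor vectors to the node's own vector.

    neighbors: list of (neighbor_id, entity_type, edge_type, connection_count)
    priority:  preset priority coefficient per neighbor entity type
    """
    own = list(vectors[node])
    aggregated = [0.0] * len(own)
    total = sum(count for _, _, _, count in neighbors)
    if total:
        # Share of connections carried by each edge type in this neighborhood.
        per_type = {}
        for _, _, edge_type, count in neighbors:
            per_type[edge_type] = per_type.get(edge_type, 0) + count
        for neighbor_id, entity_type, edge_type, _ in neighbors:
            weight = (per_type[edge_type] / total) * priority.get(entity_type, 1.0)
            for i, value in enumerate(vectors[neighbor_id]):
                aggregated[i] += weight * value
    return own + aggregated


# Hypothetical two-dimensional vectors for a virus, a drug and a gene node.
vectors = {"virus_1": [1.0, 0.0], "drug_1": [0.0, 2.0], "gene_1": [4.0, 0.0]}
neighbors = [("drug_1", "drug", "temporal", 3), ("gene_1", "gene", "hyperedge", 1)]
feature = multitask_feature("virus_1", vectors, neighbors, {"drug": 1.0, "gene": 0.5})
```

This sketch assumes all initial vectors share one length; vectors of differing length would need padding before aggregation.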

[0030] Preferably, the steps for obtaining the virus strain drug susceptibility score list are as follows:

[0031] Traverse the target hepatitis B virus variant and candidate drug entity in the multi-task hepatitis B entity feature vector, extract the virus variant feature vector, which includes the expression intensity of gene mutation sites and the temporal correlation intensity, and the drug feature vector, which includes the target effect intensity and metabolic half-life, to generate a set of virus-drug feature vector pairs.

[0032] Based on the set of virus-drug feature vector pairs, the sensitivity scores of virus and drug feature vectors are calculated.

[0033] Based on the sensitivity scores, a list of drug sensitivity scores for virus strains is generated by sorting the sensitivity scores from high to low.
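The scoring formula is not given at this point in the text; elsewhere the description says the score combines cosine similarity with a time decay factor. The sketch below assumes the simplest such combination, cosine similarity multiplied by an exponential decay in the age of the data. The decay form, the `decay_rate` parameter, and the sample vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sensitivity_scores(variant_vector, drugs, decay_rate):
    """Return (drug_id, score) pairs sorted from high to low.

    drugs: {drug_id: (feature_vector, data_age_in_days)}
    """
    scored = [
        (drug_id, cosine(variant_vector, vec) * math.exp(-decay_rate * age))
        for drug_id, (vec, age) in drugs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vectors: drug_a matches the variant best but its data is old.
variant = [1.0, 0.0]
drugs = {"drug_a": ([1.0, 0.0], 100.0), "drug_b": ([1.0, 1.0], 0.0)}
fresh_only = sensitivity_scores(variant, drugs, decay_rate=0.0)
with_decay = sensitivity_scores(variant, drugs, decay_rate=0.01)
```

With no decay, `drug_a` ranks first on similarity alone; with decay enabled, its older data pushes it below `drug_b`, which is the effect the description attributes to the time decay factor.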

[0034] Preferably, the steps for obtaining the potential assessment results of the hepatitis B drug combination are as follows:

[0035] Traverse each combination in the candidate drug combination list, extract the identifiers of all drug entities within the combination, obtain the sensitivity scores of each drug to the target virus variant from the virus strain drug sensitivity score list, and simultaneously extract the feature vectors of drug entities from the multi-task hepatitis B entity feature vector to generate a drug combination feature-score dataset.

[0036] Based on the drug combination feature-score dataset, the cosine similarity between the feature vectors of each pair of drugs in the combination is calculated. If the cosine similarity between any two drugs in the combination is higher than the preset mechanism overlap threshold and the sensitivity scores of the two drugs are higher than the preset single drug effective threshold, then it is determined that the drug pair has a target competitive antagonistic effect, and an antagonistic effect label set is generated.

[0037] Based on the antagonistic effect label set, combinations containing at least one antagonistic drug pair are removed, and the remaining combinations are sorted to generate a potential assessment result for hepatitis B drug combinations.
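The dual-threshold rule above can be sketched directly. The pair test follows the text (similarity above the mechanism overlap threshold and both single-drug scores above the single drug effective threshold); the ranking key for the surviving combinations is not specified, so the mean single-drug score used here is an assumption, as are the function names and the sample data.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def antagonistic_pairs(combo, vectors, scores, overlap_threshold, single_threshold):
    """Drug pairs flagged as target-competitive under the dual constraint."""
    return [
        (a, b)
        for a, b in combinations(combo, 2)
        if cosine(vectors[a], vectors[b]) > overlap_threshold
        and scores[a] > single_threshold
        and scores[b] > single_threshold
    ]


def rank_combinations(combos, vectors, scores, overlap_threshold, single_threshold):
    """Drop combinations with any antagonistic pair, then sort the rest."""
    kept = [
        c for c in combos
        if not antagonistic_pairs(c, vectors, scores, overlap_threshold, single_threshold)
    ]
    # Assumption: rank by mean single-drug sensitivity score.
    return sorted(kept, key=lambda c: sum(scores[d] for d in c) / len(c), reverse=True)


# Hypothetical data: drug_a and drug_b have nearly identical feature vectors.
vectors = {"drug_a": [1.0, 0.0], "drug_b": [0.99, 0.1], "drug_c": [0.0, 1.0]}
scores = {"drug_a": 0.9, "drug_b": 0.8, "drug_c": 0.7}
combos = [("drug_a", "drug_b"), ("drug_a", "drug_c"), ("drug_b", "drug_c")]
ranked = rank_combinations(combos, vectors, scores, 0.95, 0.6)
```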

[0038] This invention provides a hepatitis B drug screening system, comprising:

[0039] The temporal graphing module establishes temporal connections between entities based on hepatitis B virus genotype sequence data recorded over time, patient medication records, and drug sensitivity, generating a temporal hepatitis B knowledge graph structure.

[0040] The hyperedge fusion module, based on the time-series hepatitis B knowledge graph structure, and combining the input biological pathway annotation information, protein interaction data, and known drug combination effect records, identifies sets of multiple entities acting together in the time-series hepatitis B knowledge graph structure, defines hyperedges connecting these sets of entities, obtains the multi-entity interaction sets and hyperedge definitions, and expands the connection methods of nodes in the time-series hepatitis B knowledge graph structure based on these multi-entity interaction sets and hyperedge definitions, thereby establishing a hepatitis B knowledge base that integrates hyperedge interactions.

[0041] The representation learning module, based on the hepatitis B knowledge base with fused hyperedge interaction, calculates the initial vector expression of each hepatitis B virus entity, drug entity, and gene entity to obtain the initial entity vector representation. Based on the initial entity vector representation, it updates and adjusts the vector expression by aggregating the neighborhood node information and the connection information of different types of edges in the graph, and establishes a multi-task hepatitis B entity feature vector.

[0042] The effect prediction module, based on the multi-task hepatitis B entity feature vector, selects the feature vector corresponding to the target hepatitis B virus variant and the feature vector corresponding to the candidate drug, estimates the drug sensitivity score between the two, obtains a list of viral strain drug sensitivity scores, and evaluates the antagonistic effect of the drug combination based on the list of viral strain drug sensitivity scores and the multi-task hepatitis B entity feature vector corresponding to the candidate drug combination, and obtains the potential evaluation result of the hepatitis B drug combination.

[0043] Compared with the prior art, the advantages and positive effects of the present invention are as follows:

[0044] This invention integrates multi-dimensional dynamic time-series data, including viral genotype sequence data, patient drug usage records, and drug sensitivity data. Inter-entity interaction events are time-stamped, and dynamic features such as viral mutation trajectories and changes in drug efficacy are embedded into graph node attributes, enabling the knowledge representation to reflect the time dependence of real-world scenarios. Based on biological pathway annotation information and protein-protein interaction data, hyperedges are defined to connect sets of multiple entities, overcoming the limitation that traditional knowledge graphs support only binary relationships. This explicitly models the synergistic mechanisms of drug combinations and multiple targets, avoiding misjudgments of combined effects caused by oversimplified interaction relationships. The initial entity vector representation incorporates temporal association features such as relative time intervals within a time window and hyperedge co-occurrence strength factors. Vector representations are dynamically updated through neighborhood node information aggregation and edge type weight adjustment, enhancing the feature distinguishability of entities in different task scenarios. Virus strain drug sensitivity scoring combines cosine similarity and time decay factors to quantify both the matching degree between drugs and viral variants on key targets and the timeliness of the data, reducing evaluation bias caused by outdated historical data. Antagonistic effect assessment uses the dual constraints of a mechanism overlap threshold and a single-drug effectiveness threshold to identify competing drug pairs with high similarity and high scores.

Attached Figure Description

[0045] Figure 1 is a schematic diagram of the steps of the present invention.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0047] Referring to Figure 1, this invention provides a technical solution: a knowledge graph-assisted multi-task hepatitis B drug screening method, comprising the following steps:

[0048] Based on hepatitis B virus genotype sequence data recorded over time, patient medication records, and drug sensitivity, temporal connections between entities are established to generate a temporal hepatitis B knowledge graph structure.

[0049] Based on the temporal hepatitis B knowledge graph structure, combined with the input biological pathway annotation information, protein interaction data and known drug combination effect records, we identify the set of multiple entities that interact in the temporal hepatitis B knowledge graph structure, define the hyperedge connecting the multi-entity set, obtain the multi-entity interaction set and the definition of the hyperedge, and based on the multi-entity interaction set and the definition of the hyperedge, expand the connection mode of the nodes in the temporal hepatitis B knowledge graph structure, and establish a hepatitis B knowledge base that integrates hyperedge interactions.

[0050] Based on the hepatitis B knowledge base with fusion hyperedge interaction, the initial vector expression of each hepatitis B virus entity, drug entity, and gene entity is calculated to obtain the initial entity vector representation. Based on the initial entity vector representation, the vector expression is updated and adjusted by aggregating the neighborhood node information and the connection information of different types of edges in the graph, and a multi-task hepatitis B entity feature vector is established.

[0051] Based on the multi-task hepatitis B entity feature vector, the feature vector corresponding to the target hepatitis B virus variant and the feature vector corresponding to the candidate drug are selected, and the drug sensitivity score between the two is estimated to obtain a list of viral strain drug sensitivity scores. Based on the list of viral strain drug sensitivity scores and the multi-task hepatitis B entity feature vector corresponding to the candidate drug combination, the antagonistic effect of the drug combination is evaluated, and the potential assessment results of the hepatitis B drug combination are obtained.

[0052] The steps for obtaining the temporal hepatitis B knowledge graph structure are as follows:

[0053] Integrate hepatitis B virus genotype sequence data, patient medication records, and drug sensitivity data; extract timestamp information from each data point; convert the timestamp information into numerical time labels in a unified time format; and generate a timestamp-associated dataset.

[0054] Based on a timestamp-associated dataset, the interval between timestamps is calculated. Continuous time periods are then divided according to the interval and a preset time window threshold. Within each time period, an association edge is established between hepatitis B virus genotype entities and drug entities, using the following formula:

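[0055] $w_{rs} = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(\left|t_r - t_s\right| - \Delta t_0\right)^2}{2\sigma^2}\right)$;

[0056] The weight value $w_{rs}$ of each time-related edge is calculated, where $t_r$ is the time tag of entity r, $t_s$ is the time tag of entity s, $\Delta t_0$ is the reference value for the time interval, and $\sigma$ is the time deviation tolerance coefficient, and a weighted set of time-related edges is generated;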

[0057] Based on a weighted set of time-related edges, edges with weight values greater than a preset edge weight threshold are connected to their corresponding entities to form a time-series hepatitis B knowledge graph structure.

[0058] Specifically, this involves integrating three kinds of data drawn from clinical electronic medical record systems, gene sequencing platforms, and drug susceptibility testing laboratories: hepatitis B virus (HBV) genotype sequence data (e.g., nucleotide sequences in FASTA format); detailed patient medication records (including drug name, dosage, and start and end dates of use); and in vitro or clinical drug susceptibility test results (e.g., phenotypic analysis values or genotypic analysis of drug resistance sites). First, the timestamp information associated with each recorded event is accurately extracted from the various data sources, covering the gene sequencing date, the first and last drug use dates, the sensitivity test execution date, and so on. These diverse timestamp formats (such as 'YYYY-MM-DD' and 'MM/DD/YYYY HH:MM') are uniformly converted into standard numerical timestamps, specifically Unix timestamps in Coordinated Universal Time (UTC), that is, the total number of seconds elapsed from 00:00:00 UTC on January 1, 1970 to the time of the event. For example, the date '2023-10-26 14:30:00', read as local time at UTC+8, is converted into the value 1698301800, ensuring the consistency and computability of the time data. Then, the original data is associated with the converted numerical timestamps to construct a structured dataset, in which each record contains the original entity information (such as virus sequence ID, drug ID, patient ID) and its corresponding standardized numerical timestamp, generating a timestamp-associated dataset.
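A minimal sketch of the timestamp normalization described above, using only the Python standard library. Two things are assumptions: the list of accepted input layouts, and the source time zone of UTC+8, which is the zone under which '2023-10-26 14:30:00' maps to the quoted value 1698301800 (read as UTC, the same string would give 1698330600). Record fields and identifiers are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: source records carry local clock times at UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))

# Assumption: input layouts to try, in order.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y %H:%M")


def to_unix_seconds(raw, tz=LOCAL_TZ):
    """Parse a timestamp string and return whole seconds since the Unix epoch."""
    for fmt in FORMATS:
        try:
            naive = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        return int(naive.replace(tzinfo=tz).timestamp())
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


# Hypothetical records from two sources with different timestamp layouts.
records = [
    {"entity_id": "seq-001", "raw_time": "2023-10-26 14:30:00"},
    {"entity_id": "drug-001", "raw_time": "10/26/2023 14:30"},
]
dataset = [{**r, "unix_time": to_unix_seconds(r["raw_time"])} for r in records]
```

Attaching an explicit time zone before calling `timestamp()` avoids depending on the time zone of the machine that runs the conversion.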

[0059] The advantage of this formula is that it uses a Gaussian kernel function to quantify the strength of the temporal association between the time points $t_r$ and $t_s$ of two events. It considers not only the absolute value of the time interval, $|t_r - t_s|$, but also how close this interval is to a typical expected time interval $\Delta t_0$, while taking into account the acceptable range of variation $\sigma$ in time intervals. This approach assigns higher weights to event pairs that more closely follow expected patterns in time (e.g., the common time span for drug sensitivity testing after viral mutation), while assigning lower weights to event pairs with excessively short or long time intervals. This allows for a more accurate capture of the dynamic and time-dependent relationship between hepatitis B virus mutations, drug use, and changes in sensitivity, providing a quantitative basis for constructing a temporal knowledge graph that reflects real biomedical processes.

[0060] $t_r$, the timestamp representing entity r, is the time point of the event associated with entity r (e.g., the determination time of a hepatitis B virus genotype sequence). This parameter is numerical, in seconds, and is obtained directly from the timestamp-associated dataset in the previous step. For example, if the timestamp associated with the hepatitis B virus genotype G1896A sequencing event is found to be '2023-05-10 09:00:00', converting it to a Unix timestamp yields the value of $t_r$.

[0061] $t_s$, the timestamp representing entity s, is the time point of the event associated with entity s (e.g., the time of lamivudine drug sensitivity testing for the aforementioned viral genotype). This parameter is also numerical, in seconds, and is extracted directly from the timestamp-associated dataset in the previous step. For example, if the timestamp associated with the lamivudine drug sensitivity testing event is found in the timestamp-associated dataset to be '2023-08-15 11:30:00', converting it to a Unix timestamp yields the value of $t_s$.

[0062] $\Delta t_0$ is the baseline value for the time interval, representing the expected or most typical time interval between entity r and entity s. This parameter is numerical, in seconds, and its setting should be based on the understanding of hepatitis B clinical diagnosis and treatment and virological research, reflecting the standard cycle or average interval of relevant events. Acquisition method: statistically analyze the time intervals of a large number of relevant event pairs (such as viral sequencing and subsequent drug susceptibility testing) in the timestamp-associated dataset, and calculate the mean or median of these intervals. For example, analyze 1000 pairs of time interval data between "viral genotype sequencing" and "the immediately following first relevant drug susceptibility test," and calculate the average interval duration. If the calculated average interval is 95 days, then $\Delta t_0 = 95 \times 86400 = 8208000$ seconds.

[0063] $\sigma$, the time deviation tolerance coefficient, represents the acceptable level, or common fluctuation range, of the deviation of the time interval between entity r and entity s from the reference value $\Delta t_0$. This parameter is numerical, in seconds, and reflects the dispersion of the time interval between events. Acquisition method: while calculating $\Delta t_0$, also calculate the standard deviation of the same time interval data. For example, if the standard deviation calculated from the time interval data of the above 1000 event pairs is 25 days, then $\sigma = 25 \times 86400 = 2160000$ seconds.

[0064] Preset time window threshold: this threshold is used to initially filter potentially related event pairs, and calculations are performed only between event pairs with a time interval less than this threshold. For example, a threshold of 180 days can be set. The setting is based on clinical experience or data analysis to determine a timeframe long enough to include the vast majority of meaningful associated events, while avoiding unnecessary calculations on event pairs that are far apart in time. For example, the time interval distribution of all event pairs in the dataset can be analyzed and a value covering 95% of the intervals selected; if 95% of the intervals are less than 170 days, a threshold of 180 days (15552000 seconds) can be set.
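The three parameters described above can be estimated from observed intervals with the standard library alone. This is a sketch under stated assumptions: the function name is hypothetical, the population standard deviation is used (the text does not say which estimator is intended), and the window is taken as the smallest observed interval that covers the requested share of the data, which a practitioner might then round up (for example from 170 to 180 days).

```python
import math
import statistics

SECONDS_PER_DAY = 86_400


def estimate_interval_parameters(intervals_days, coverage=0.95):
    """Return (reference interval, tolerance, window threshold), all in days."""
    reference = statistics.mean(intervals_days)
    tolerance = statistics.pstdev(intervals_days)  # assumption: population std
    ordered = sorted(intervals_days)
    # Index of the smallest observation that covers `coverage` of the data.
    index = max(0, math.ceil(coverage * len(ordered)) - 1)
    return reference, tolerance, ordered[index]


# Hypothetical intervals chosen so that the mean is 95 days and the spread 25 days.
reference, tolerance, window = estimate_interval_parameters([70, 120])
reference_seconds = reference * SECONDS_PER_DAY
tolerance_seconds = tolerance * SECONDS_PER_DAY
```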

[0065] Substituting the parameters into the formula yields the weight value $w_{rs}$. Given the typical time interval and the deviation tolerance, this value represents the temporal correlation strength between the viral genotype sequencing event occurring at time point $t_r$ and the drug susceptibility test event occurring at time point $t_s$.
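A worked example of a Gaussian-kernel edge weight of the kind described above. The sketch assumes the weight is a normalized Gaussian density in the deviation of the absolute interval from the reference interval, and that the parameters are expressed in days: with a tolerance of 25 days the peak value is about 0.016, which matches the weight range quoted in the filtering discussion, whereas the same density with parameters in seconds would give far smaller numbers. The function name and the choice of days are assumptions.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 86_400


def temporal_edge_weight(t_r, t_s, reference, tolerance):
    """Gaussian-density weight of the gap |t_r - t_s| around `reference`.

    All four arguments must share one time unit (days here, an assumption).
    """
    deviation = abs(t_r - t_s) - reference
    norm = 1.0 / (tolerance * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-(deviation ** 2) / (2.0 * tolerance ** 2))


# Event times from the text: sequencing, then lamivudine sensitivity testing.
sequencing = datetime(2023, 5, 10, 9, 0)
testing = datetime(2023, 8, 15, 11, 30)
gap_days = (testing - sequencing).total_seconds() / SECONDS_PER_DAY

weight = temporal_edge_weight(0.0, gap_days, reference=95.0, tolerance=25.0)
peak = temporal_edge_weight(0.0, 95.0, reference=95.0, tolerance=25.0)
```

Here the gap is roughly 97 days, close to the 95-day reference, so the weight lands just under the peak value.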

[0066] The weighted time-related edge set calculated in the previous step includes all entity pairs (such as hepatitis B virus genotype entities and drug entities) within the preset time window (e.g., 180 days) and their corresponding time-related weights. Edge filtering begins by setting a preset edge weight threshold. This threshold aims to distinguish between strong and weak associations, retaining connections with significant temporal significance while filtering out weakly correlated connections or those potentially caused by noise. The threshold can be determined by analyzing the distribution characteristics of all calculated edge weight values, for example by plotting a histogram or cumulative distribution function of the weights and observing the inflection points of the distribution, or by setting a value based on domain knowledge that retains, for example, the top 70% most strongly correlated edges. For instance, suppose the calculated weight values are mainly distributed between 0 and 0.016, and analysis reveals that edges with weights below 0.005 correspond to time intervals that deviate greatly from the baseline interval, indicating weak correlation, while edges with weights above 0.01 correspond to time intervals very close to the baseline interval, indicating strong correlation. A compromise value can then be chosen, such as a preset edge weight threshold of 0.008. This threshold can be calculated by taking the 30th percentile of all calculated weights (keeping the top 70%) or set based on expert experience. Then, each edge record in the weighted time-related edge set is traversed, and its weight value is compared with the preset edge weight threshold of 0.008. If the weight value is greater than 0.008, the edge is considered significant, and the edge and the two entities it connects (e.g., the viral genotype entity G1896A and the drug entity lamivudine) are included in the final graph structure; otherwise, the edge is discarded and no connection is made. Finally, all association edges that pass the threshold and their corresponding entity nodes are collected into a set of nodes (representing hepatitis B virus genotypes, drugs, and possibly patients) and weighted edges (representing event relationships that meet the temporal association strength threshold), which constitutes the networked temporal hepatitis B knowledge graph structure.
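The filtering step can be sketched in a few lines. The data layout (edges as `(entity_r, entity_s, weight)` tuples), the function name, and the sample weights are assumptions for illustration; the threshold of 0.008 and the entity names G1896A and lamivudine come from the example in the text.

```python
def build_temporal_graph(weighted_edges, threshold):
    """Keep edges whose weight exceeds `threshold`, plus the nodes they connect."""
    nodes, edges = set(), []
    for entity_r, entity_s, weight in weighted_edges:
        if weight > threshold:
            edges.append((entity_r, entity_s, weight))
            nodes.update((entity_r, entity_s))
    return nodes, edges


# Hypothetical weighted edge set; "drug_x" is a placeholder drug entity.
weighted_edges = [
    ("G1896A", "lamivudine", 0.0159),
    ("G1896A", "drug_x", 0.004),
]
nodes, edges = build_temporal_graph(weighted_edges, threshold=0.008)
```

Entities that appear only on discarded edges never enter the node set, which matches the rule that a rejected edge creates no connection.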

[0067] The steps to obtain the multi-entity interaction set and the definition of the hyperedge are as follows:

[0068] Integrate entity nodes, biological pathway annotation information, protein interaction data and drug combination effect records from the temporal hepatitis B knowledge graph structure, extract gene regulatory pathway identifiers from the biological pathway annotation information and drug target gene set identifiers from the drug combination effect records, and generate multi-source datasets.

[0069] Based on multi-source datasets, the combined effect strength of the gene entity set and the drug target gene set within the same time window is calculated using the following formula:

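[0070] $S_{ik} = \dfrac{|G_i \cap T_k|}{|G_i \cup T_k|} \times \log_2\left(1 + \dfrac{\bar{f}_{ik}}{F_{\max}}\right)$;

[0071] where $S_{ik}$ is the combined effect strength; $G_i$ is the set of genes annotated in the i-th biological pathway; $T_k$ is the set of target genes of the k-th drug combination in the drug combination effect records; $f_{mn}$ is the interaction frequency between genes $m$ and $n$ in the protein-protein interaction data, and $\bar{f}_{ik}$ is the mean of $f_{mn}$ over the distinct gene pairs in $G_i \cap T_k$; $F_{\max}$ is the frequency normalization constant, taken as the maximum value of $\bar{f}_{ik}$;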

[0072] Based on the strength of interaction, gene sets and drug target gene sets with interaction strength greater than a preset threshold are selected, and hyperedges connecting these sets are defined to generate multi-entity interaction sets and hyperedge definitions.

[0073] Specifically, this process integrates various entity nodes (such as hepatitis B virus genotypes, drugs, and genes) from the previously constructed temporal hepatitis B knowledge graph structure, combined with biological pathway annotation information obtained from external public databases (e.g., downloading pathway definitions related to liver diseases and viral infections from the KEGG database, such as hsa05161: Hepatitis B pathway, which contains a series of gene identifiers), protein-protein interaction (PPI) data, and drug combination effect records. Based on this, key identifier information is systematically extracted. Specifically, unique identifiers for each pathway and its associated gene set (represented by standard gene symbols or EntrezID lists) are extracted from the biological pathway annotation information. At the same time, for each drug or drug combination, the target gene set identifiers with clear effects are extracted from the drug combination effect records. These data from different sources (temporal knowledge graph entities, pathway gene sets, PPI data, and drug target gene sets) are standardized and mapped, and a unified identifier system is established (e.g., using EntrezGeneID). This constructs a multidimensional, heterogeneous dataset containing entity information, pathway information, PPI information, and drug target information, generating a multi-source dataset.

[0074] The advantage of this formula is that it combines two dimensions to assess the strength of the functional association between a biological pathway (composed of a set of genes and represented by $G_i$) and the targets of a drug or drug combination (composed of a gene set and represented by $T_k$). The first part, the Jaccard index $|G_i \cap T_k| / |G_i \cup T_k|$, directly quantifies the degree of overlap between pathway genes and drug target genes; the higher the overlap, the more direct the potential functional association. The second part incorporates protein-protein interaction information: it takes the interaction strength between the genes in the region where the pathway overlaps with the drug targets ($G_i \cap T_k$), e.g., the average interaction frequency, and measures the activity level of this internal interaction through logarithmic transformation with smoothing and normalization by the maximum frequency $F_{\max}$. Multiplying the two parts gives the combined effect strength $S_{ik}$, which reflects not only direct overlap of gene members but also the degree of functional collaboration among the overlapping genes, thus providing a more comprehensive assessment of the potential biological interaction strength between pathways and drug target sets.

[0075] $G_i$ represents the set of genes contained in the i-th biological pathway annotation. It is a set of gene identifiers obtained by querying a standard biological pathway database. For example, querying the KEGG database retrieves the list of genes related to the "Hepatitis B" pathway (hsa05161). Example: $G_i$ contains 5 genes.

[0076] $T_k$ represents the set of target genes on which the k-th drug or drug combination has a clearly defined effect in the drug combination effect records. It is also a set of gene identifiers. Acquisition methods include querying drug databases to obtain known drug targets, or extracting target gene information related to the efficacy of a specific combination from drug combination effect research literature and databases. For example, for the drug combination entecavir + lamivudine, the set of host or viral genes it primarily affects is determined based on literature and database information. Example: $T_k$ contains 5 genes.

[0077] $f_{mn}$ represents the interaction frequency between genes $m$ and $n$ in the protein-protein interaction data. First, determine the intersection $G_i \cap T_k$. Then, query the protein-protein interaction database for the interaction frequency of every distinct gene pair within the intersection (for example, based on the amount of experimental evidence), and calculate the average of these frequencies, $\bar{f}_{ik}$. Example: the intersection of $G_i$ and $T_k$ is {'GeneC', 'GeneX'}. A query of the BioGRID database yields 8 interaction frequency records between 'GeneC' and 'GeneX'. Since the intersection contains only two genes and therefore a single pair ('GeneC', 'GeneX'), $\bar{f}_{ik} = 8$.

[0078] The frequency normalization constant $F_{\max}$ is defined as the maximum value, observed over the entire dataset, of the average interaction frequency of gene pairs within the intersection. Acquisition method: for every pathway and drug target gene set pair ($G_i$, $T_k$) under consideration, calculate the average interaction frequency $\bar{f}_{ik}$ of the gene pairs in their intersection and take the maximum among them. This requires traversing all relevant database entries and repeating the calculation above for each pair. Example: after analyzing all pathways and drug combinations, the largest average interaction frequency of intersection gene pairs is found to be 150, so $F_{\max} = 150$.

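[0079] Calculation process: using the example pathway gene set $G_i$ and drug target gene set $T_k$ above, calculate the intersection $G_i \cap T_k$:

[0080] $G_i \cap T_k = \{\text{GeneC}, \text{GeneX}\}$;

[0081] Calculate the size of the intersection $|G_i \cap T_k|$:

[0082] $|G_i \cap T_k| = 2$;

[0083] Calculate the union $G_i \cup T_k$:

[0084] $G_i \cup T_k$ = all distinct genes appearing in $G_i$ or $T_k$;

[0085] Calculate the size of the union $|G_i \cup T_k|$:

[0086] $|G_i \cup T_k| = 5 + 5 - 2 = 8$;

[0087] Calculation of the Jaccard index:

[0088] $|G_i \cap T_k| / |G_i \cup T_k| = 2 / 8 = 0.25$;

[0089] Obtain the average interaction frequency $\bar{f}_{ik}$ of gene pairs within the intersection (based on the example retrieval process):

[0090] $\bar{f}_{ik} = 8$;

[0091] Obtain the frequency normalization constant $F_{\max}$:

[0092] $F_{\max} = 150$;

[0093] Calculate the logarithmic part:

[0094] $\bar{f}_{ik} / F_{\max} = 8 / 150 \approx 0.0533$;

[0095] $1 + 0.0533 = 1.0533$;

[0096] $\log_2(1.0533) \approx 0.075$;

[0097] Calculate the final interaction strength $S_{ik}$:

[0098] $S_{ik} = 0.25 \times 0.075 \approx 0.019$;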

[0099] The result indicates that the calculated combined effect strength $S_{ik}$ is a quantitative assessment of the functional association between the gene set of a biological pathway (e.g., the Hepatitis B pathway) and the target gene set of a drug combination. This value combines the degree of overlap between the pathway and target gene sets (a Jaccard index of 0.25) with the average interaction strength within the overlapping genes (contributing a factor of approximately 0.075). The higher the value, the stronger the association between the pathway and the set of drug targets, and the greater the potential functional synergy or impact. In the next step, this $S_{ik}$ value is compared with a preset threshold to determine whether the pathway and drug target combination constitutes a multi-entity interaction set of interest. For example, if the value is greater than the threshold, the two sets are considered to have a significant common effect, and a hyperedge can be defined to connect them.
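The worked example can be checked with a short script. This is a sketch under stated assumptions: the logarithm is taken as base 2 with a "1 +" smoothing term, which is the form that reproduces the factor of about 0.075 quoted in the text; the gene names other than 'GeneC' and 'GeneX' are hypothetical placeholders, and all function names are illustrative.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pair_frequency(genes, ppi_freq):
    """Average interaction frequency over all distinct gene pairs in `genes`."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0
    return sum(ppi_freq.get(frozenset(p), 0) for p in pairs) / len(pairs)


def interaction_strength(pathway_genes, target_genes, ppi_freq, f_max):
    """Jaccard overlap multiplied by a smoothed, normalized log of the mean PPI frequency."""
    shared = pathway_genes & target_genes
    mean_freq = mean_pair_frequency(shared, ppi_freq)
    # Assumption: base-2 logarithm, chosen because it matches the example figures.
    return jaccard(pathway_genes, target_genes) * math.log2(1 + mean_freq / f_max)


# Two 5-gene sets sharing GeneC and GeneX; the other names are placeholders.
pathway = {"GeneA", "GeneB", "GeneC", "GeneD", "GeneX"}
targets = {"GeneC", "GeneX", "GeneP", "GeneQ", "GeneR"}
ppi = {frozenset(("GeneC", "GeneX")): 8}

overlap = jaccard(pathway, targets)
strength = interaction_strength(pathway, targets, ppi, f_max=150)
```

Storing pair frequencies under `frozenset` keys makes the lookup independent of gene order, which suits an undirected interaction record.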

[0100] The previous step yields a series of combined effect strength values $S_{ik}$, each corresponding to a pairing of a gene entity set (from biological pathway $G_i$) and a drug target gene set (from drug combination $T_k$). The filtering and definition steps are performed on these pairings. First, a preset threshold is set to determine whether the interaction strength is significant. The threshold is determined from a statistical analysis of the distribution of $S_{ik}$ values or from considerations of biological significance, such as computing the distribution of all $S_{ik}$ values and selecting the top 10% (i.e., the 90th percentile) as the threshold, or setting an empirical threshold based on known strong interaction cases. The specific calculation process is as follows: collect all calculated $S_{ik}$ values, sort them, and find the value corresponding to the 90th percentile. If that value is 0.045, the preset threshold can be set to 0.05. Then, iterate through all set pairs ($G_i$, $T_k$) for which a value was calculated and compare each $S_{ik}$ value with the preset threshold of 0.05. If $S_{ik}$ is greater than the threshold, the gene entity set $G_i$ and the drug target gene set $T_k$ are considered to have a sufficiently strong mutual influence, and the pair is selected as the basis for a multi-entity interaction. For each selected pair of sets, a hyperedge is defined. This hyperedge explicitly represents the interaction in which the two sets participate as a whole. The objects connected by the hyperedge are all the gene entity nodes that constitute the two sets (i.e., all gene nodes in $G_i \cup T_k$); alternatively, the hyperedge can connect the node representing pathway $i$ and the node representing drug combination $k$, if these are defined as independent entities in the knowledge graph. Finally, all set pairs that pass the threshold filtering and their defined hyperedge information are collected and recorded to form a list of multi-entity interaction sets with corresponding hyperedge definitions, thus generating the multi-entity interaction sets and hyperedge definitions.
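The threshold filtering and hyperedge definition can be sketched as below. The record layout, the `HE_n` identifier scheme, the pathway and combination labels, and the example strengths are assumptions made for illustration; only the threshold of 0.05 and the "greater than" comparison come from the text.

```python
def define_hyperedges(scored_pairs, threshold=0.05):
    """Create one hyperedge per (pathway, drug combination) pair above the threshold.

    scored_pairs: iterable of
        (pathway_id, combo_id, strength, pathway_genes, target_genes)
    """
    hyperedges = []
    for pathway_id, combo_id, strength, pathway_genes, target_genes in scored_pairs:
        if strength > threshold:
            hyperedges.append({
                "hyperedge_id": f"HE_{len(hyperedges) + 1}",  # hypothetical ID scheme
                "pathway": pathway_id,
                "drug_combination": combo_id,
                "strength": strength,
                # The hyperedge connects every gene node of both sets.
                "members": sorted(set(pathway_genes) | set(target_genes)),
            })
    return hyperedges


# Illustrative scores: the first pair falls below 0.05, the second passes.
SCORED = [
    ("hsa05161", "entecavir+lamivudine", 0.019, {"GeneC", "GeneX"}, {"GeneX", "GeneP"}),
    ("pathway_2", "combo_2", 0.12, {"GeneA", "GeneB"}, {"GeneB", "GeneQ"}),
]

hyperedges = define_hyperedges(SCORED)
```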

[0101] The steps for acquiring a hepatitis B knowledge base integrating hyper-edge interaction are as follows:

[0102] Extract the node connection relationships in the multi-entity interaction set, the definition of hyperedge, and the temporal hepatitis B knowledge graph structure. Traverse each hyperedge in the definition of hyperedge and establish bidirectional connection relationships between the hyperedge and all nodes in the corresponding entity set to generate a hyperedge connection relationship table.

[0103] Based on the hyperedge connection table, it is detected whether there are isolated nodes or redundant connections in the hyperedge connection. If there are isolated nodes, the missing connections are supplemented according to the protein interaction data. If there are redundant connections, duplicate hyperedge connections are merged according to the preset redundancy threshold to generate a verified hyperedge connection table.

[0104] Based on the validated hyperedge connection table, the hyperedge connection relationships are mapped and replaced with the original node connection methods in the temporal hepatitis B knowledge graph structure, and the connection types and attributes between nodes are updated to form a hepatitis B knowledge base that integrates hyperedge interactions.

[0105] Specifically, the information about existing nodes and edges is extracted from the "multi-entity interaction set and hyperedge definition" and the "temporal hepatitis B knowledge graph structure" generated in the previous steps. Each defined hyperedge and its associated entity set (containing gene entities from a specific biological pathway $G_i$ and from a specific drug target set $T_k$) is read from the "multi-entity interaction set and hyperedge definition". The "temporal hepatitis B knowledge graph structure" is accessed to obtain the identifiers of all entity nodes (including genes, drugs, viral variants, etc.) and their existing connections. Then, for each hyperedge definition record, a new node representing the hyperedge itself is first created in the knowledge graph data structure (e.g., it is assigned a unique hyperedge ID and its type is marked as 'Hyperedge'). Next, the node identifier of each entity member in the entity set associated with the hyperedge (e.g., $G_i \cup T_k$) is looked up. For each entity node, a connection is established from the entity node to the hyperedge node, and a connection is established from the hyperedge node back to the entity node, forming a bidirectional connection. These two connections can be assigned specific type labels, such as 'participatesInHyperedge' and 'hasParticipant'. The connection relationships between the newly established hyperedges and entities are recorded, including the hyperedge ID, entity ID, and connection type. All hyperedge connection information is then summarized to generate a structured hyperedge connection relationship table.
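A minimal sketch of the connection table construction follows. The row layout `(source, target, connection_type)` and the sample hyperedge are assumptions; the two connection type labels are the ones named in the text.

```python
def build_connection_table(hyperedges):
    """Emit two directed rows per (hyperedge, member) pair, one in each direction."""
    rows = []
    for he in hyperedges:
        he_id = he["hyperedge_id"]
        for entity in he["members"]:
            rows.append((entity, he_id, "participatesInHyperedge"))
            rows.append((he_id, entity, "hasParticipant"))
    return rows


# Hypothetical hyperedge definition with three gene members.
SAMPLE_HYPEREDGES = [{"hyperedge_id": "HE_1", "members": ["GeneC", "GeneP", "GeneX"]}]

table = build_connection_table(SAMPLE_HYPEREDGES)
```

Representing the hyperedge as its own node with typed edges lets an ordinary binary graph store hold multi-entity relationships without a dedicated hypergraph structure.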

[0106] Based on the "hyperedge connection table" generated in the previous step, the connection relationships are verified and refined. First, isolated node detection is performed. Each hyperedge in the table and its connected set of entity nodes are traversed to check whether there is an entity node that is associated with other nodes only through the hyperedge but lacks the expected direct connection within the entity set connected by that hyperedge (i.e., with the other entities participating in the hyperedge). For example, a gene node belongs to a hyperedge set and, at the protein interaction level, is expected to have a direct association with other genes in the same set, but no corresponding connection exists. If such potential isolated nodes are detected, protein-protein interaction (PPI) data (e.g., from STRING database version 11.5 or the latest version of the BioGRID database) is used to supplement the possibly missing connections. Specifically, the PPI data is queried for known protein interactions between the isolated node and the other nodes within the same hyperedge set, and an interaction confidence threshold is set. This threshold is based on the score or evidence strength provided by the PPI database, for example the combined score from the STRING database with a threshold of 0.7. The threshold is determined by analyzing the score distribution of known, highly correlated hepatitis B-related protein interactions and selecting a value that filters out low-confidence connections (e.g., scores below 0.7) while retaining medium- to high-confidence connections. Only when the queried PPI score is greater than 0.7 is a standard binary edge representing that PPI added to the knowledge graph to connect the two entity nodes. Next, redundant connection detection is performed, which focuses on identifying repeated hyperedges that connect similar or identical entity sets. For any two hyperedges $h_a$ and $h_b$, the similarity between their connected entity sets $E_a$ and $E_b$ is measured with the Jaccard index $J(E_a, E_b) = |E_a \cap E_b| / |E_a \cup E_b|$, and the preset redundancy threshold is set to 0.9. This threshold is based on the assumption that when the overlap between the node sets connected by two hyperedges exceeds 90%, they represent highly similar macroscopic interaction patterns and can be merged. The threshold is determined by analyzing the Jaccard index distribution of all hyperedge pairs in the data and selecting a critical point that separates high overlap from general overlap (e.g., selecting the 95th percentile of the distribution; if it is 0.88, it is rounded up to 0.9). If $J(E_a, E_b)$ is greater than 0.9, hyperedge $h_b$ is marked as redundant and its connections are merged into $h_a$ (the entity nodes originally connected to $h_b$ are now connected to $h_a$), after which $h_b$ is removed. After the isolated node supplementation and the redundant hyperedge merging, the final set of connection relationships is obtained, which generates the verified hyperedge connection relationship table.
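The two verification passes can be sketched as follows. This is an illustration under assumptions: PPI scores are supplied as a plain dictionary rather than fetched from STRING or BioGRID, the supplementation pass simply checks every member pair that lacks a direct edge (a simplification of the isolated node test), and all names and sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def supplement_ppi_edges(members, existing_edges, ppi_scores, min_score=0.7):
    """Propose binary PPI edges between hyperedge members that lack a direct edge."""
    new_edges = []
    for a, b in combinations(sorted(members), 2):
        pair = frozenset((a, b))
        if pair not in existing_edges and ppi_scores.get(pair, 0.0) > min_score:
            new_edges.append((a, b, "ppi_interaction"))
    return new_edges


def merge_redundant(hyperedges, redundancy_threshold=0.9):
    """Merge a hyperedge into an earlier one when their member sets overlap above the threshold.

    hyperedges: dict mapping hyperedge ID -> set of entity IDs.
    """
    merged = {}
    for he_id, members in hyperedges.items():
        for kept_members in merged.values():
            if jaccard(members, kept_members) > redundancy_threshold:
                kept_members |= members  # re-attach the redundant hyperedge's entities
                break
        else:
            merged[he_id] = set(members)
    return merged


new_edges = supplement_ppi_edges(
    {"A", "B", "C"},
    existing_edges={frozenset(("A", "B"))},
    ppi_scores={frozenset(("A", "C")): 0.85, frozenset(("B", "C")): 0.4},
)
merged = merge_redundant({
    "HE_1": {f"g{i}" for i in range(20)},
    "HE_2": {f"g{i}" for i in range(21)},  # Jaccard with HE_1 is 20/21
    "HE_3": {"x1", "x2"},
})
```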

[0107] Based on the "Verified Hyperedge Connection Table" generated in the previous step, these verified and refined hyperedge connections are integrated into the original "Time-Series Hepatitis B Knowledge Graph Structure." Mapping replacement and update operations are then performed. Specifically, each record in the "Verified Hyperedge Connection Table," which describes the connection between a hyperedge node and its participating entity nodes, or supplementary binary PPI connections between entities, is traversed. This connection information is added to the graph representation of the "Time-Series Hepatitis B Knowledge Graph Structure" (e.g., updating the adjacency matrix, adjacency list, or edge list data structure). For hyperedge connections, they do not directly replace existing binary edges based on time association or other relationships; instead, they are added as a new connection type. That is, the connection methods between nodes are expanded. Existing binary edges (e.g., edges representing time association) and newly added hyperedge connections (representing participation in multi-entity interactions) can coexist. This change is reflected by updating the connection types and attributes between nodes. For example, for a gene node, its connection list may now contain temporal edges pointing to other genes, temporal edges pointing to drugs, and 'participatesInHyperedge' type edges pointing to one or more hyperedge nodes. At the same time, the attributes of the edges may also need to be updated. For example, an attribute can be added to the nodes participating in the hyperedge to record the list of hyperedge IDs to which they belong, or an attribute can be added to the hyperedge node itself to describe the biological significance represented by the hyperedge (such as associated pathways and drug combinations). 
In this way, the information on the collective behavior of multiple entities represented by the hyperedge is seamlessly integrated into the knowledge graph, rather than simply replacing the original connections. The resulting network structure is a hepatitis B knowledge base that integrates hyperedge interaction information, has more diverse connection methods, and better reflects the interaction characteristics of complex biological systems.

[0108] The steps to obtain the initial entity vector representation are as follows:

[0109] Traverse the hepatitis B virus entity, drug entity, and gene entity nodes in the hepatitis B knowledge base that integrates hyperedge interactions, extract the interaction event timestamps recorded in the time association edges of each node, count the total number of times each node is associated with the multi-entity interaction set in the hyperedge information, record the number of times the interaction event timestamp sequence is associated with the hyperedge, and generate a node time-series interaction log.

[0110] Based on the node time-series interaction log, with the first interaction timestamp as the benchmark, the time difference between each subsequent timestamp and the benchmark is calculated to generate a time difference sequence. The frequency of each node appearing simultaneously with other entities in the hyperedge information is counted, and the ratio of the combination frequency to the length of the time difference sequence is calculated to generate a set of multi-entity co-occurrence intensity factors.

[0111] Based on the set of co-occurrence intensity factors for multiple entities, the time difference sequence is divided into windows with a period of 30 days. The mean and variance of the time difference within each window are calculated, and the three sets of values ​​of mean, variance and co-occurrence intensity factors are concatenated to output the initial entity vector representation.

[0112] Specifically, the system accesses all target entity nodes stored in the "Hepatitis B Knowledge Base with Integrated Hyperedge Interactions," including hepatitis B virus entities (such as specific genotype or variant identifiers), drug entities (such as entecavir and lamivudine), and gene entities (such as human or viral gene identifiers). For each entity node, its associated edge information is examined in detail, and the time-related edges that record the timestamps of interaction events are selected (these edges are usually established in the early stages of knowledge graph construction based on the time information in the original data). The timestamp values (e.g., in Unix timestamp format) attached to these edges are extracted. When multiple timestamps are associated with the same node, they are arranged in chronological order to form a timestamp list. At the same time, the participation of the node in hyperedge interactions is checked. By searching for 'hasParticipant' type connections pointing to the node or 'participatesInHyperedge' type connections originating from the node, the total number of different hyperedges associated with the node is counted. The ordered interaction event timestamp sequence of each node and the total number of hyperedges it has participated in are recorded in a structured manner to generate a corresponding node time-series interaction log for each hepatitis B virus entity, drug entity, and gene entity node in the knowledge base.

[0113] Based on the "node time-series interaction log" generated for each entity in the previous step, the time-series features and co-occurrence features are calculated. First, for the interaction event timestamp sequence of each node, $(t_0, t_1, \ldots, t_n)$, the first timestamp $t_0$ in the sequence is selected. This serves as the time reference point, i.e., the first recorded interaction of the entity. The time difference of every subsequent timestamp $t_j$ ($j$ from 1 to $n$) relative to the reference point $t_0$ is then calculated as $\Delta t_j = t_j - t_0$. These differences (in seconds or converted to days) constitute an ordered time difference sequence $(\Delta t_1, \ldots, \Delta t_n)$, which reflects the relative temporal distribution of the entity's interaction events. Next, the total number of times the node participated in hyperedge interactions (denoted as $N_h$) is extracted from the corresponding "node time-series interaction log". This number represents the frequency with which the node forms a multi-entity interaction set with other entities in the knowledge base. Then, the "multi-entity co-occurrence intensity factor" of the node is calculated by dividing the total number of hyperedge associations $N_h$ by the length $n$ of the time difference sequence (that is, the total number of subsequent interaction events); if $n = 0$, the factor is 0 or can be set as needed. That is, factor $= N_h / n$. This ratio quantifies the average intensity of a node's participation in complex interactions per interaction event. The "multi-entity co-occurrence intensity factors" calculated for all hepatitis B virus entity, drug entity, and gene entity nodes in the knowledge base are aggregated to generate the set of multi-entity co-occurrence intensity factors.
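The time difference sequence and the co-occurrence intensity factor can be sketched as below. Function names and the sample timestamps are illustrative assumptions; returning 0 for an empty sequence follows the default mentioned in the text.

```python
SECONDS_PER_DAY = 86_400


def time_differences(timestamps):
    """Differences (in seconds) of each later timestamp from the earliest one."""
    ordered = sorted(timestamps)
    if not ordered:
        return []
    base = ordered[0]
    return [t - base for t in ordered[1:]]


def cooccurrence_factor(hyperedge_count, diffs):
    """Hyperedge participation count divided by the number of subsequent events."""
    return hyperedge_count / len(diffs) if diffs else 0.0


# Hypothetical node log: three interaction events, three hyperedges.
t0 = 1_700_000_000
log_timestamps = [t0 + 45 * SECONDS_PER_DAY, t0, t0 + 10 * SECONDS_PER_DAY]

diffs = time_differences(log_timestamps)
factor = cooccurrence_factor(3, diffs)
```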

[0114] Based on the previously calculated "multi-entity co-occurrence intensity factor" (from the "multi-entity co-occurrence intensity factor set") and the corresponding "time difference sequence" for each node, an initial vector representation for each entity is constructed. First, each node's "time difference sequence" $(\Delta t_1, \ldots, \Delta t_n)$ is divided into time windows using a fixed 30-day period as the window cycle (for example, the first window contains interactions whose $\Delta t$ value lies within 0 to 2,592,000 seconds, the second window contains interactions within 2,592,000 to 5,184,000 seconds, and so on, covering the entire time span). For each defined time window, the arithmetic mean and variance of all time differences falling within that window are calculated. If no timestamp falls within a certain window, the mean and variance can be set to 0 or processed using interpolation methods. In this way, each node obtains a sequence composed of the means of all windows (e.g., $\mu_1, \ldots, \mu_W$) and a sequence composed of the variances of all windows (e.g., $\sigma_1^2, \ldots, \sigma_W^2$). Then, these three sets of numerical information are concatenated: the mean sequence of all windows, the variance sequence of all windows, and the "multi-entity co-occurrence intensity factor" of the node (as a scalar feature) are joined to form a long vector. For example, if the time span covers $W$ windows, the generated vector structure is $[\mu_1, \ldots, \mu_W, \sigma_1^2, \ldots, \sigma_W^2, c]$, where $c$ is the co-occurrence intensity factor, and the vector dimension is $2W + 1$. The vectors generated for all hepatitis B virus entities, drug entities, and gene entities are aggregated and output as the initial entity vector representation.
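The windowing and concatenation can be sketched as follows. The text does not say whether population or sample variance is meant, nor how the window boundaries are closed; this sketch assumes population variance, half-open windows, and zeros for empty windows. The function name and sample values are illustrative.

```python
from statistics import fmean, pvariance

WINDOW_SECONDS = 30 * 86_400  # 30-day window expressed in seconds


def initial_entity_vector(diffs, factor, window_seconds=WINDOW_SECONDS):
    """Concatenate per-window means, per-window variances, and the scalar factor."""
    n_windows = int(max(diffs) // window_seconds) + 1 if diffs else 1
    buckets = [[] for _ in range(n_windows)]
    for d in diffs:
        buckets[int(d // window_seconds)].append(d)
    # Empty windows contribute 0 for both statistics (one option the text allows).
    means = [fmean(b) if b else 0.0 for b in buckets]
    variances = [float(pvariance(b)) if b else 0.0 for b in buckets]
    return means + variances + [factor]


# Time differences at 10, 20, and 45 days, giving two windows.
vector = initial_entity_vector([864_000, 1_728_000, 3_888_000], factor=1.5)
```

Here the first window holds the 10-day and 20-day differences and the second holds the 45-day difference, so the vector has $2 \times 2 + 1 = 5$ components.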

[0115] The steps for obtaining the feature vector of a multi-task hepatitis B entity are as follows:

[0116] Traverse each entity node in the initial entity vector representation, extract the set of directly connected neighboring nodes in the hepatitis B knowledge base with integrated hyperedge interaction, record the entity type, edge type and connection count of the neighboring nodes, and generate a set of neighboring node information.

[0117] Based on the neighborhood node information set, the proportion of connection times of different edge types in the neighborhood of each entity node is counted, and the proportion value is multiplied by the preset priority coefficient of the entity type of the neighboring node to generate a set of edge type weight factors.

[0118] Based on the set of edge type weight factors, the initial entity vector representation is concatenated with the entity vectors in the neighboring node information set according to the weight factors to generate a multi-task hepatitis B entity feature vector.

[0119] Specifically, for each entity node (including hepatitis B virus entities, drug entities, and gene entities) with an "initial entity vector representation" in the "Hepatitis B Knowledge Base with Integrated Hyperedge Interactions," a neighborhood information extraction operation is performed. For the currently processed central entity node, the graph structure data of the knowledge base is queried to obtain the set of all one-hop neighboring nodes directly connected to the central node. For each neighboring node in the set, its key information is recorded in detail, including the unique identifier of the neighboring node, the entity type of the node (e.g., distinguishing between 'Gene', 'Drug', 'VirusVariant', 'Hyperedge', etc.), and the types of the edges connecting the central node to the neighboring node (e.g., 'temporal_association' represents temporal association, 'ppi_interaction' represents protein-protein interaction, 'participatesInHyperedge' represents entity participation in a hyperedge, 'hasParticipant' represents a hyperedge containing an entity, etc.), as well as the number of connections of each specific edge type between the two nodes (counted if parallel edges are allowed; otherwise, the count is 1). The collected identifiers, entity types, edge types, and connection counts of all neighboring nodes are then organized and stored for each central entity node, generating the neighborhood node information set for that node.

[0120] Based on the "neighborhood node information set" generated for each central entity node in the previous step, the weight factors used for subsequent neighborhood information aggregation are calculated. First, a statistical analysis is performed on the "neighborhood node information set" of the current central node to calculate the total number of neighborhood connections (i.e., the sum of all neighborhood connection counts). Then, for each different edge type appearing in the neighborhood (such as 'temporal_association', 'ppi_interaction', 'participatesInHyperedge', etc.), the total number of connections of that edge type is counted, and the proportion of each edge type's connection count to the total number of neighborhood connections is calculated. Next, a preset priority coefficient for the entity type of the neighboring node is introduced. The coefficients are set according to the importance of different entity types in the hepatitis B drug screening task. For example, 'Drug' type nodes, which are directly related to the drug screening target, have the highest priority and are assigned a coefficient of 1.2; 'VirusVariant' type nodes are second highest and are assigned a coefficient of 1.1; 'Gene' type nodes are assigned a coefficient of 1.0; while the abstract 'Hyperedge' type nodes have a relatively low priority and are assigned a coefficient of 0.9. These coefficient values are initially set based on domain knowledge and can be fine-tuned during model training using validation set results, for example through grid search or gradient-based optimization to maximize the performance of downstream tasks. The edge type weight factor is then calculated in an interpretable manner: for each edge type $e$ of the central node, all neighboring nodes connected by edges of type $e$ are found and the average $\bar{c}_e$ of their priority coefficients is calculated. The proportion $p_e$ of that edge type is then multiplied by the average priority coefficient $\bar{c}_e$ of its corresponding neighboring nodes to obtain the weight factor for that edge type, i.e., $w_e = p_e \times \bar{c}_e$. This calculation is performed for all edge types in the neighborhood of the central node, and the weight factors of all edge types are collected to generate the set of edge type weight factors for the central node.
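The weight factor calculation can be sketched as below. The priority coefficients are the example values from the text; the record layout and sample neighborhood are assumptions, and the average priority coefficient is taken per neighboring node rather than per connection, which is one reading of the description.

```python
from collections import defaultdict

# Example priority coefficients from the text.
PRIORITY = {"Drug": 1.2, "VirusVariant": 1.1, "Gene": 1.0, "Hyperedge": 0.9}


def edge_type_weights(neighborhood, priority=PRIORITY):
    """Weight per edge type: connection share times mean neighbor priority.

    neighborhood: list of (neighbor_id, entity_type, edge_type, connection_count)
    """
    total = sum(row[3] for row in neighborhood)
    if total == 0:
        return {}
    counts = defaultdict(int)
    coeffs = defaultdict(list)
    for _, entity_type, edge_type, count in neighborhood:
        counts[edge_type] += count
        coeffs[edge_type].append(priority[entity_type])
    return {
        e: (counts[e] / total) * (sum(coeffs[e]) / len(coeffs[e]))
        for e in counts
    }


# Hypothetical neighborhood of one central node.
NEIGHBORHOOD = [
    ("lamivudine", "Drug", "temporal_association", 2),
    ("GeneX", "Gene", "temporal_association", 1),
    ("GeneC", "Gene", "ppi_interaction", 1),
    ("HE_1", "Hyperedge", "participatesInHyperedge", 1),
]

weights = edge_type_weights(NEIGHBORHOOD)
```

For this neighborhood, 'temporal_association' accounts for 3 of 5 connections and its neighbors average a coefficient of 1.1, giving a weight of about 0.66.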

[0121] Using the "edge type weight factor set" generated for each central node in the previous step, together with the "initial entity vector representation" of the central node and of its neighboring nodes, neighborhood information is aggregated by weight and concatenated to form the final feature vector. The specific operation is as follows. For a central entity node $v$, its own "initial entity vector representation" $\mathbf{h}_v$ is obtained. Its "neighborhood node information set" is then accessed to obtain the "initial entity vector representation" $\mathbf{h}_u$ of every neighboring node $u$. Meanwhile, for the connection between $v$ and $u$, the edge type $e$ is determined and the corresponding weight factor $w_e$ is looked up in the "edge type weight factor set" (here $w_e$ is the weight factor calculated for edge type $e$). A weighted aggregate representation of the neighborhood information is then calculated: the initial entity vector $\mathbf{h}_u$ of each neighboring node is multiplied by the weight factor $w_e$ of the corresponding edge type, and the weighted vectors of all neighboring nodes are summed to obtain the aggregated neighborhood vector $\mathbf{a}_v = \sum_u w_e \mathbf{h}_u$. Finally, the concatenation operation is performed: the central node's own initial entity vector $\mathbf{h}_v$ and its aggregated neighborhood vector $\mathbf{a}_v$ are joined along the feature dimension to create a longer, more information-rich vector $\mathbf{z}_v = [\mathbf{h}_v, \mathbf{a}_v]$. This concatenated vector $\mathbf{z}_v$ integrates the node's own temporal and hyperedge participation characteristics (from $\mathbf{h}_v$) and the structural and type information of its neighborhood (from the weighted aggregation $\mathbf{a}_v$). This process is repeated for all target entity nodes in the knowledge base to generate the multi-task hepatitis B entity feature vectors, which serve as the feature representation of each entity node optimized for multi-task learning.
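The weighted aggregation and concatenation can be sketched as follows. The sketch assumes that all initial entity vectors have already been brought to the same length (for example by padding to a common number of windows), which the text does not specify; the function name, edge type labels, and numbers are illustrative.

```python
def multitask_feature_vector(center_vec, neighbors, weights):
    """Concatenate a node's own vector with the weighted sum of its neighbors' vectors.

    neighbors: list of (neighbor_vector, edge_type); vectors share center_vec's length.
    weights:   dict mapping edge_type -> weight factor.
    """
    aggregated = [0.0] * len(center_vec)
    for vec, edge_type in neighbors:
        w = weights.get(edge_type, 0.0)  # unknown edge types contribute nothing
        for i, x in enumerate(vec):
            aggregated[i] += w * x
    return list(center_vec) + aggregated


features = multitask_feature_vector(
    [1.0, 2.0],
    neighbors=[([1.0, 0.0], "temporal_association"), ([0.0, 2.0], "ppi_interaction")],
    weights={"temporal_association": 0.5, "ppi_interaction": 0.25},
)
```

The output is twice as long as the input vector: the first half is the node's own representation and the second half is the neighborhood aggregate.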

[0122] The steps to obtain the list of drug susceptibility scores for viral strains are as follows:

[0123] Traverse the target hepatitis B virus variant and candidate drug entity in the multi-task hepatitis B entity feature vector, extract the virus variant feature vector, which includes the expression intensity of gene mutation sites and the temporal correlation intensity, and compare it with the drug feature vector, which includes the target effect intensity and metabolic half-life, to generate a set of virus-drug feature vector pairs.

[0124] Based on the set of virus-drug feature vector pairs, the sensitivity score of virus and drug feature vectors is calculated using the following formula:

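[0125] $S = \dfrac{\mathbf{F}_v \cdot \mathbf{F}_d}{\lVert \mathbf{F}_v \rVert \, \lVert \mathbf{F}_d \rVert} \times \left(1 - \dfrac{\Delta t}{T_{\max}}\right)$;

[0126] where $S$ is the sensitivity score; $\mathbf{F}_v \cdot \mathbf{F}_d$ is the dot product of the feature vectors of the virus and the drug, which measures the positive correlation between the two along key feature dimensions; $\lVert \mathbf{F}_v \rVert \, \lVert \mathbf{F}_d \rVert$ is the product of the magnitudes of the vectors, which normalizes the dot product so that the quotient is a cosine similarity in the range [-1, 1]; $\Delta t$ is the interval, in months, between the recording time of the viral variant genotype and the time of the drug trial; and $T_{\max}$ is the maximum time interval in the dataset;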

[0127] Based on the sensitivity score, a list of drug sensitivity scores for virus strains is generated by sorting the sensitivity scores from high to low.

[0128] Specifically, the process iterates through the set of "multi-task hepatitis B entity feature vectors" generated in previous steps, selecting specific target hepatitis B virus variant entities (e.g., selecting specific drug-resistant variants such as L180M+M204V based on user input or a preset list) and a set of candidate drug entities (e.g., including currently clinically commonly used or under investigation drugs such as entecavir, tenofovir, and lamivudine). For each selected target hepatitis B virus variant entity, its corresponding "multi-task hepatitis B entity feature vector" is extracted. This vector, generated in previous steps, contains encoding information about the characteristics of the viral variant, such as the expression intensity of gene mutation sites and the strength of time association, among other features. For each candidate drug entity, its corresponding "multi-task hepatitis B entity feature vector" is also extracted. This vector encodes the characteristics of the drug, such as the target efficacy and metabolic half-life. The extracted target hepatitis B virus variant feature vector is paired with the feature vector of each candidate drug. For example, if there is a target virus variant V1 and three candidate drugs D1, D2, and D3, then three pairs are formed: (Vector(V1), Vector(D1)), (Vector(V1), Vector(D2)), and (Vector(V1), Vector(D3)). All these pairs are collected to generate a set of virus-drug feature vector pairs.
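The pairing step can be illustrated with a short Python sketch. The dictionary layout, the function name `build_pairs`, and the two-dimensional toy vectors are assumptions made for illustration; the description does not prescribe a storage format.

```python
def build_pairs(feature_vectors, variant_ids, drug_ids):
    """Pair every selected variant vector with every candidate drug vector.

    Returns a list of (variant_id, drug_id, variant_vector, drug_vector) tuples.
    """
    return [
        (v, d, feature_vectors[v], feature_vectors[d])
        for v in variant_ids
        for d in drug_ids
    ]


# Hypothetical feature vectors keyed by entity identifier.
features = {
    "V1": [0.9, 0.2],
    "D1": [0.8, 0.3],
    "D2": [0.1, 0.9],
    "D3": [0.5, 0.5],
}
pairs = build_pairs(features, ["V1"], ["D1", "D2", "D3"])
print(len(pairs))  # 3
```

One variant and three candidate drugs give three pairs, matching the V1 and D1 to D3 example in the text.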

[0129] The advantage of this formula lies in its innovative combination of the semantic similarity between entity feature vectors learned from the knowledge graph with the time difference between events, in order to predict the sensitivity of a virus to a drug. The first part is the cosine similarity. It takes the virus feature vector and the drug feature vector learned in the previous steps and assesses their potential correlation through directional consistency in high-dimensional space. These vectors encode complex biomedical information (such as the effects of viral mutations, drug mechanisms of action, and temporal patterns). The closer the vector directions (the higher the cosine similarity), the higher the potential drug sensitivity. The second part is the time decay factor, which adds a time dimension. It depends on the time interval between the recording time of the viral genotype data and the drug evaluation time (such as the time of drug sensitivity testing or the time of drug administration). This factor lowers the calculated sensitivity score as the time interval grows (that is, as the viral data gets older), reflecting the fact that the virus may develop drug resistance over time and making the prediction results more timely and clinically relevant.
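The two-part score described above can be sketched as follows. The closed form of the decay factor is an assumption here: a linear decay, 1 - Δt/T_max, is used because it stays non-negative whenever Δt does not exceed T_max and gives about 0.90 for an interval of 6.08 months against a maximum of 62 months, which agrees with the worked example in this description. Function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # raises ZeroDivisionError for a zero vector


def time_decay(delta_t_months, t_max_months):
    """ASSUMED linear decay: 1 at delta_t = 0, falling to 0 at delta_t = t_max."""
    return 1.0 - delta_t_months / t_max_months


def sensitivity_score(virus_vec, drug_vec, delta_t_months, t_max_months):
    """Cosine similarity scaled by the time decay factor."""
    return cosine_similarity(virus_vec, drug_vec) * time_decay(delta_t_months, t_max_months)


# Hypothetical two-dimensional vectors; real feature vectors would be longer.
score = sensitivity_score([0.9, 0.2], [0.8, 0.3], 6.08, 62.0)
print(round(time_decay(6.08, 62.0), 2))  # 0.9
```

Keeping the similarity and the decay in separate functions makes it easy to swap in a different decay form if the linear assumption does not match the intended design.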

[0130] $\mathbf{F}_v$ represents the multi-task hepatitis B entity feature vector of the target hepatitis B virus variant $v$. It is a numerical vector calculated in the previous main step, whose dimensions and values reflect the comprehensive characteristics of this viral variant. Acquisition method: it is extracted directly from the set of "multi-task hepatitis B entity feature vectors" generated in the previous step, using the identifier of viral variant $v$. For example, for virus variant V1, its feature vector is... The vector dimension is determined by the aforementioned vector construction method.

[0131] $\mathbf{F}_d$ represents the multi-task hepatitis B entity feature vector of the candidate drug $d$. It is also a numerical vector calculated in the previous main step, and it reflects the comprehensive characteristics of drug $d$. Acquisition method: it is extracted directly from the set of "multi-task hepatitis B entity feature vectors", using the identifier of drug $d$. For example, for drug D1, its feature vector is... Its dimension is the same as that of the virus variant feature vector.

[0132] $\Delta t$ represents the time interval, in months, between the genotype recording time of virus variant $v$ and the trial or evaluation time of drug $d$. Acquisition method: look up, in the knowledge base or metadata, the most relevant timestamp associated with the virus variant entity (e.g., the sequencing date) and the reference timestamp associated with the evaluation of the drug (e.g., the drug susceptibility testing date or the start date of clinical use). Calculate the time difference between the two and convert it to months. Calculation example: virus variant V1 was sequenced on January 15, 2024, and the sensitivity test for drug D1 was conducted on July 20, 2024, so the time interval is approximately six months and a few days. The number of days is divided by the average number of days in a month (e.g., 30.44) to obtain the number of months. For example, if the interval is 185 days, then $\Delta t = 185 / 30.44 \approx 6.08$ months.

[0133] $T_{\max}$ represents the maximum time interval, in months, between the virus record time and the drug evaluation time observed in the dataset. Acquisition method: analyze all historical data used to build the knowledge base, or all relevant virus-drug timestamp pairs in the current evaluation dataset, identify the maximum value of $\Delta t$, and convert it to months. For example, if analysis of a dataset containing thousands of records finds that the largest time interval is 5 years and 2 months, then $T_{\max} = 62$ months.
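The interval and its dataset-wide maximum can be computed with the Python standard library. This sketch follows the text in dividing a day count by 30.44; the function names are hypothetical, and the evaluation date is assumed to fall on or after the record date.

```python
from datetime import date

AVG_DAYS_PER_MONTH = 30.44  # average month length used in the text's example


def interval_months(record_date, evaluation_date):
    """Time between genotype recording and drug evaluation, in months."""
    return (evaluation_date - record_date).days / AVG_DAYS_PER_MONTH


def max_interval_months(timestamp_pairs):
    """Largest interval over all (record_date, evaluation_date) pairs."""
    return max(interval_months(r, e) for r, e in timestamp_pairs)


# Dates from the text's example: sequencing on 2024-01-15, test on 2024-07-20.
delta_t = interval_months(date(2024, 1, 15), date(2024, 7, 20))
print(round(delta_t, 2))                    # 6.14 (187 days)
print(round(185 / AVG_DAYS_PER_MONTH, 2))   # 6.08 (the 185-day example)
```

The two example dates are 187 days apart, slightly more than the 185-day figure used in the subsequent arithmetic, which is why the two printed values differ.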

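[0134] Calculation process, using the example parameters $\Delta t \approx 6.08$ months and $T_{\max} = 62$ months;

[0135] Calculate the dot product $\mathbf{F}_v \cdot \mathbf{F}_d$:

[0136] $\mathbf{F}_v \cdot \mathbf{F}_d = \sum_i F_{v,i} F_{d,i}$;

[0137] $\mathbf{F}_v \cdot \mathbf{F}_d = \ldots$;

[0138] Calculate the vector magnitude $\lVert \mathbf{F}_v \rVert$:

[0139] $\lVert \mathbf{F}_v \rVert = \sqrt{\sum_i F_{v,i}^2}$;

[0140] $\lVert \mathbf{F}_v \rVert = \ldots$;

[0141] Calculate the vector magnitude $\lVert \mathbf{F}_d \rVert$:

[0142] $\lVert \mathbf{F}_d \rVert = \sqrt{\sum_i F_{d,i}^2}$;

[0143] $\lVert \mathbf{F}_d \rVert = \ldots$;

[0144] Calculate the product of the magnitudes $\lVert \mathbf{F}_v \rVert \, \lVert \mathbf{F}_d \rVert$:

[0145] $\lVert \mathbf{F}_v \rVert \, \lVert \mathbf{F}_d \rVert = \ldots$;

[0146] The cosine similarity calculation part:

[0147] $\dfrac{\mathbf{F}_v \cdot \mathbf{F}_d}{\lVert \mathbf{F}_v \rVert \, \lVert \mathbf{F}_d \rVert} \approx 0.92$;

[0148] Calculation of the time decay factor:

[0149] $\dfrac{\Delta t}{T_{\max}} = \dfrac{6.08}{62} \approx 0.098$;

[0150] $1 - \dfrac{\Delta t}{T_{\max}} \approx 1 - 0.098 \approx 0.90$;

[0151] Calculate the final sensitivity score $S$:

[0152] $S \approx 0.92 \times 0.90 \approx 0.83$;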

[0153] The result indicates that the calculated sensitivity score is a quantitative prediction of the sensitivity of the target hepatitis B virus variant V1 to candidate drug D1. A score close to 1 indicates that, given the strong similarity between the feature vectors learned by the model (cosine similarity of approximately 0.92) and the relatively short time interval (decay factor of approximately 0.90), the viral variant is predicted to be highly sensitive to the drug. In theory the score can range from -1 to 1, because the cosine similarity lies in [-1, 1] and the time decay factor is non-negative by design. A higher score indicates a better inhibitory effect of the drug on the virus. In the next step, this score is used to rank all candidate drugs.

[0154] The sensitivity scores between all target hepatitis B virus variants and each candidate drug, calculated in the previous step, are then organized and sorted. Specifically, all calculated scores are collected, and each score is associated with a specific viral variant and a specific candidate drug. These (virus variant identifier, drug identifier, sensitivity score) triples are collected into a list, which is sorted with a standard sorting algorithm (such as quicksort or mergesort) in descending order of sensitivity score, so that the highest-scoring virus-drug pair is at the top and the lowest-scoring pair is at the bottom. The sorted list shows the order of predicted sensitivity of each candidate drug against the target hepatitis B virus variant, and constitutes the list of viral strain drug sensitivity scores.
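The ranking step amounts to a descending sort on the third element of each triple. The sketch below uses Python's built-in `sorted` rather than a hand-written quicksort or mergesort; the identifiers and score values are illustrative only.

```python
# (virus variant identifier, drug identifier, sensitivity score) triples.
# The scores are made-up illustrative values.
triples = [
    ("V1", "D1", 0.83),
    ("V1", "D2", 0.41),
    ("V1", "D3", 0.67),
]

# Sort by score, highest first; sorted() is stable, so ties keep input order.
ranked = sorted(triples, key=lambda t: t[2], reverse=True)

for variant_id, drug_id, s in ranked:
    print(variant_id, drug_id, s)
```

The printed order is D1, D3, D2, that is, from the highest predicted sensitivity to the lowest.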

[0155] The steps for obtaining the results of the hepatitis B drug combination potential assessment are as follows:

[0156] Traverse each combination in the candidate drug combination list, extract the identifiers of all drug entities within the combination, obtain the sensitivity scores of each drug to the target virus variant from the virus strain drug sensitivity score list, and simultaneously extract the feature vectors of drug entities from the multi-task hepatitis B entity feature vector to generate a drug combination feature-score dataset.

[0157] Based on the drug combination feature-score dataset, the cosine similarity between the feature vectors of each pair of drugs in the combination is calculated. If the cosine similarity between any two drugs in the combination is higher than the preset mechanism overlap threshold and the sensitivity scores of the two drugs are higher than the preset single drug effective threshold, then it is determined that the drug pair has a target competitive antagonistic effect, and an antagonistic effect label set is generated.

[0158] Based on the antagonistic effect marker set, combinations containing at least one antagonistic drug pair are removed, and the remaining combinations are sorted to generate the potential assessment results of hepatitis B drug combination.

[0159] Specifically, the process iterates over a user-provided or predefined "candidate drug combination list", which contains the drug combinations whose potential is to be evaluated, such as [(entecavir, adefovir), (tenofovir, lamivudine), (entecavir, telbivudine)]. For each drug combination in the list (e.g., the combination (entecavir, adefovir)), the unique identifiers of all individual drug entities in the combination (i.e., the entecavir ID and the adefovir ID) are first identified and extracted. These drug identifiers are then used to query the "virus strain drug sensitivity score list" generated in the previous step to find the sensitivity scores of the individual drugs (entecavir and adefovir) against the current target hepatitis B virus variant. Meanwhile, using the same drug identifiers, the complete feature vectors Vector(entecavir) and Vector(adefovir) are retrieved from the set of "multi-task hepatitis B entity feature vectors". The identifier of each candidate drug combination, the sensitivity score of each drug against the target virus, and the feature vector of each drug are combined into one record per combination. The records of all combinations are collected to generate the drug combination feature-score dataset.

[0160] Based on the "drug combination feature-score dataset" generated in the previous step, the antagonistic effect of each drug combination is evaluated. The specific operation is as follows. For each drug combination record in the dataset, all possible drug pairs within the combination are first determined (a binary combination such as (A, B) has only one pair, (A, B); a ternary combination (A, B, C) has three pairs: (A, B), (A, C), and (B, C)). Then, for each drug pair (e.g., drug A and drug B), the following judgment process is performed. The first step is to calculate the cosine similarity between the feature vector Vector(A) of drug A and the feature vector Vector(B) of drug B by dividing the vector dot product by the product of the vector magnitudes, which gives a value between -1 and 1; this step determines how similar the two drugs are in the feature space. The second step is to compare the calculated cosine similarity with a "preset mechanism overlap threshold", which is used to determine whether the two drugs have a high degree of mechanism or target similarity. The threshold can be set by analyzing the distribution of feature vector similarity for known antagonistic (especially competitively antagonistic) and non-antagonistic drug pairs and choosing a value that separates the two cases well. For example, it can be set to 0.8, based on the observation that competitively antagonistic drug pairs labeled in literature reports or databases (such as DrugComb) usually have a vector similarity above 0.8, while synergistic or unrelated drug pairs have low similarity. The third step is to check the sensitivity scores of the two drugs (A and B) against the target viral variant and determine whether both exceed a "preset single-drug effective threshold". This threshold ensures that antagonism is considered only when both drugs have a certain therapeutic effect on their own. It can be set with reference to the overall distribution of sensitivity scores or to clinical judgment criteria. For example, if the score range is -1 to 1, the effective threshold can be set to 0.5, which represents a prediction of moderate or higher sensitivity; this value is determined by analyzing the distribution of effective drug scores in historical data (for example, if the 60th percentile of the score distribution is 0.48, the threshold is set to 0.5). The fourth step is to determine the antagonistic effect. A drug pair is determined to have a potential target competitive antagonistic effect only if it meets both conditions at once, namely "the cosine similarity is higher than the preset mechanism overlap threshold of 0.8" and "the sensitivity scores of both drugs are higher than the preset single-drug effective threshold of 0.5". The result (e.g., marking the drug pair as 'antagonistic') is recorded. After all drug pairs in a combination have been judged, the information on all drug pairs marked as 'antagonistic' is collected to generate the antagonistic effect label set.
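The four-step judgment above can be sketched as a pairwise check. The thresholds 0.8 and 0.5 are the example values from the text; the function names, the dictionary layout, and the toy vectors and scores are hypothetical.

```python
import math
from itertools import combinations

MECHANISM_OVERLAP_THRESHOLD = 0.8      # example value from the text
SINGLE_DRUG_EFFECTIVE_THRESHOLD = 0.5  # example value from the text


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def antagonistic_pairs(combo, vectors, scores):
    """Return the drug pairs in `combo` flagged as competitively antagonistic."""
    flagged = set()
    for a, b in combinations(combo, 2):  # every unordered pair in the combination
        similar = cosine_similarity(vectors[a], vectors[b]) > MECHANISM_OVERLAP_THRESHOLD
        both_effective = (scores[a] > SINGLE_DRUG_EFFECTIVE_THRESHOLD
                          and scores[b] > SINGLE_DRUG_EFFECTIVE_THRESHOLD)
        if similar and both_effective:
            flagged.add(frozenset((a, b)))
    return flagged


# Toy data: A and B point in nearly the same direction, C is orthogonal to A.
vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
scores = {"A": 0.7, "B": 0.6, "C": 0.9}
labels = antagonistic_pairs(("A", "B", "C"), vectors, scores)
print(labels == {frozenset(("A", "B"))})  # True
```

Storing each pair as a `frozenset` makes the label independent of the order in which the two drugs are listed.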

[0161] Based on the "antagonistic effect marker set" generated in the previous step, the original "candidate drug combination list" is screened and sorted. First, each drug combination in the "candidate drug combination list" is traversed to check whether any drug pair in the combination is marked as antagonistic in the "antagonistic effect marker set". If a combination contains at least one drug pair marked as antagonistic, the combination is removed from the candidate list. The remaining combinations are those predicted to carry no significant risk of target competitive antagonism. These remaining combinations are then sorted. The sorting can be based on the overall sensitivity of the drugs in the combination against the target viral variant, for example by calculating the average or sum of the sensitivity scores of all drugs in each remaining combination and sorting in descending order of this composite score. The higher the composite score, the greater the predicted overall potential of the combination. The list of drug combinations that remains after removing antagonistic risks and sorting is the final evaluation output, that is, the hepatitis B drug combination potential evaluation results.
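The screening and sorting step can be sketched as follows. The text allows either the average or the sum as the composite score; this sketch uses the average. The function name, the drug labels, and the scores are hypothetical, and the antagonistic label set is supplied directly as input.

```python
from itertools import combinations


def rank_combinations(combos, antagonistic, scores):
    """Drop combinations containing a flagged pair, then rank by mean score."""
    kept = []
    for combo in combos:
        pairs = {frozenset(p) for p in combinations(combo, 2)}
        if pairs & antagonistic:
            continue  # at least one drug pair is marked antagonistic
        mean_score = sum(scores[d] for d in combo) / len(combo)
        kept.append((combo, mean_score))
    # Highest composite score first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy inputs (illustrative values only).
combos = [("A", "B"), ("A", "C"), ("B", "C")]
antagonistic = {frozenset(("A", "B"))}
scores = {"A": 0.7, "B": 0.6, "C": 0.9}
ranked = rank_combinations(combos, antagonistic, scores)
print([combo for combo, _ in ranked])  # [('A', 'C'), ('B', 'C')]
```

Switching to the sum only requires removing the division by `len(combo)`; for combinations of equal size the two choices give the same order.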

[0162] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any other way. Any person skilled in the art may make changes or modifications to the above-disclosed technical content to create equivalent embodiments that can be applied to other fields. However, any simple modifications, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the protection scope of the present invention.

Claims

1. A knowledge graph-assisted multi-task hepatitis B drug screening method, characterized in that it includes the following steps: Based on hepatitis B virus genotype sequence data recorded over time, patient medication records, and drug sensitivity data, temporal connections between entities are established to generate a temporal hepatitis B knowledge graph structure. Based on the aforementioned time-series hepatitis B knowledge graph structure, and combined with the input biological pathway annotation information, protein interaction data, and known drug combination effect records, the sets of multiple entities acting together in the time-series hepatitis B knowledge graph structure are identified, a hyperedge connecting each such set of entities is defined, the multi-entity interaction set and the hyperedge definition are obtained, and based on the multi-entity interaction set and the hyperedge definition, the connection mode of the nodes in the time-series hepatitis B knowledge graph structure is expanded to establish a hepatitis B knowledge base that integrates hyperedge interactions. Based on the hepatitis B knowledge base that integrates hyperedge interactions, the initial vector expression of each hepatitis B virus entity, drug entity, and gene entity is calculated to obtain the initial entity vector representation. Based on the initial entity vector representation, the vector expression is updated and adjusted by aggregating the neighborhood node information and the connection information of different types of edges in the graph, and a multi-task hepatitis B entity feature vector is established. Based on the multi-task hepatitis B entity feature vector, the feature vector corresponding to the target hepatitis B virus variant and the feature vector corresponding to the candidate drug are selected, and the drug sensitivity score between the two is estimated to obtain a list of viral strain drug sensitivity scores.
Based on the list of viral strain drug sensitivity scores and the multi-task hepatitis B entity feature vector corresponding to the candidate drug combination, the antagonistic effect of the drug combination is evaluated, and the potential evaluation result of the hepatitis B drug combination is obtained. The steps for obtaining the temporal hepatitis B knowledge graph structure are as follows: Integrate hepatitis B virus genotype sequence data, patient medication records, and drug sensitivity data; extract timestamp information from each data point; convert the timestamp information into numerical time labels in a unified time format; and generate a timestamp-associated dataset. Based on the timestamp-associated dataset, the interval between timestamps is calculated. Continuous time periods are then divided according to the interval and a preset time window threshold. Within each time period, association edges are established between hepatitis B virus genotype entities and drug entities, and the weight value of each time-related edge is calculated by a formula whose terms are the time label of entity r, the time label of entity s, the reference value for the time interval, and the time deviation tolerance coefficient, thereby generating a weighted set of time-related edges. Based on the weighted time-related edge set, the related edges with weight values greater than the preset edge weight threshold are connected to the corresponding entities to form a time-series hepatitis B knowledge graph structure.
The steps for obtaining the multi-entity interaction set and the hyperedge definition are as follows: By integrating the entity nodes in the aforementioned time-series hepatitis B knowledge graph structure with biological pathway annotation information, protein-protein interaction data, and drug combination effect records, the gene regulatory pathway identifiers in the biological pathway annotation information and the drug target gene set identifiers in the drug combination effect records are extracted to generate a multi-source dataset. Based on the multi-source dataset, the combined effect strength of the gene entity set and the drug target gene set within the same time window is calculated by a formula whose terms are: the combined effect strength; the set of genes annotated in the i-th biological pathway; the set of target genes acted on by a drug combination effect in the drug combination effect records; the interaction frequency between genes m and n in the protein-protein interaction data; and a frequency normalization constant, taken as the maximum value of the interaction frequency. Based on the combined effect strength, the gene sets and drug target gene sets whose combined effect strength is greater than a preset threshold are selected, and a hyperedge connecting the sets is defined to generate a multi-entity interaction set and a hyperedge definition. The steps for obtaining the virus strain drug susceptibility score list are as follows: Traverse the target hepatitis B virus variant entities and candidate drug entities in the multi-task hepatitis B entity feature vectors, and extract the virus variant feature vector, which includes the expression intensity of gene mutation sites and the temporal correlation intensity, and the drug feature vector, which includes the target effect intensity and metabolic half-life, to generate a set of virus-drug feature vector pairs.
Based on the set of virus-drug feature vector pairs, the sensitivity score of the virus and drug feature vectors is calculated using the following formula: $S = \dfrac{\mathbf{F}_v \cdot \mathbf{F}_d}{\lVert \mathbf{F}_v \rVert \, \lVert \mathbf{F}_d \rVert} \times \left(1 - \dfrac{\Delta t}{T_{\max}}\right)$; where $S$ is the sensitivity score; $\mathbf{F}_v \cdot \mathbf{F}_d$ is the dot product of the feature vectors of the virus and the drug, which measures the positive correlation between the two along key feature dimensions; $\lVert \mathbf{F}_v \rVert \, \lVert \mathbf{F}_d \rVert$ is the product of the magnitudes of the vectors, which normalizes the dot product so that the quotient is a cosine similarity in the range [-1, 1]; $\Delta t$ is the interval, in months, between the recording time of the viral variant genotype and the time of the drug trial; and $T_{\max}$ is the maximum time interval in the dataset. Based on the sensitivity scores, a list of drug sensitivity scores for virus strains is generated by sorting the sensitivity scores from high to low.

2. The knowledge graph-assisted multi-task hepatitis B drug screening method according to claim 1, characterized in that, The steps for obtaining the hepatitis B knowledge base integrating hyper-edge interaction are as follows: Extract the node connection relationships between the multi-entity interaction set and the hyperedge definition and the temporal hepatitis B knowledge graph structure; traverse each hyperedge in the hyperedge definition; establish bidirectional connection relationships between the hyperedge and all nodes in the corresponding entity set; and generate a hyperedge connection relationship table. Based on the hyperedge connection table, it is detected whether there are isolated nodes or redundant connections in the hyperedge connection. If there are isolated nodes, missing connections are supplemented according to protein interaction data. If there are redundant connections, duplicate hyperedge connections are merged according to a preset redundancy threshold to generate a verified hyperedge connection table. Based on the verified hyperedge connection table, the hyperedge connection relationships are mapped and replaced with the original node connection methods in the temporal hepatitis B knowledge graph structure, and the connection types and attributes between nodes are updated to form a hepatitis B knowledge base that integrates hyperedge interactions.

3. The knowledge graph-assisted multi-task hepatitis B drug screening method according to claim 1, characterized in that, The steps for obtaining the initial entity vector representation are as follows: Traverse the hepatitis B virus entity, drug entity, and gene entity nodes in the hepatitis B knowledge base that integrates hyperedge interactions, extract the interaction event timestamps recorded in the time association edges of each node, count the total number of times each node is associated with the multi-entity interaction set in the hyperedge information, record the number of times the interaction event timestamp sequence is associated with the hyperedge, and generate a node time-series interaction log. Based on the node time-series interaction log, with the first interaction timestamp as the benchmark, the time difference between each subsequent timestamp and the benchmark is calculated to generate a time difference sequence. The frequency of each node appearing simultaneously with other entities in the hyperedge information is counted, and the ratio of the combination frequency to the length of the time difference sequence is calculated to generate a set of multi-entity co-occurrence intensity factors. Based on the set of multi-entity co-occurrence intensity factors, the time difference sequence is divided into windows with a period of 30 days. The mean and variance of the time difference within each window are calculated. The mean, variance, and co-occurrence intensity factors are concatenated to output the initial entity vector representation.

4. The knowledge graph-assisted multi-task hepatitis B drug screening method according to claim 1, characterized in that, The steps for obtaining the multi-task hepatitis B entity feature vector are as follows: Traverse each entity node in the initial entity vector representation, extract the set of directly connected neighboring nodes in the hepatitis B knowledge base with fused hyperedge interaction, record the entity type, edge type and connection count of the neighboring nodes, and generate a set of neighboring node information. Based on the neighborhood node information set, the proportion of connection times of different edge types in the neighborhood of each entity node is counted, and the proportion is multiplied by the preset priority coefficient of the entity type of the neighboring node to generate a set of edge type weight factors. Based on the set of edge type weight factors, the initial entity vector representation is concatenated with the entity vectors in the neighboring node information set according to the weight factors to generate a multi-task hepatitis B entity feature vector.

5. The knowledge graph-assisted multi-task hepatitis B drug screening method according to claim 1, characterized in that, The steps for obtaining the results of the hepatitis B drug combination potential assessment are as follows: Traverse each combination in the candidate drug combination list, extract the identifiers of all drug entities within the combination, obtain the sensitivity scores of each drug to the target virus variant from the virus strain drug sensitivity score list, and simultaneously extract the feature vectors of drug entities from the multi-task hepatitis B entity feature vector to generate a drug combination feature-score dataset. Based on the drug combination feature-score dataset, the cosine similarity between the feature vectors of each pair of drugs in the combination is calculated. If the cosine similarity between any two drugs in the combination is higher than the preset mechanism overlap threshold and the sensitivity scores of the two drugs are higher than the preset single drug effective threshold, then it is determined that the drug pair has a target competitive antagonistic effect, and an antagonistic effect label set is generated. Based on the set of antagonistic effect markers, combinations containing at least one antagonistic drug pair are removed, and the remaining combinations are sorted to generate a potential assessment result for hepatitis B drug combinations.

6. A hepatitis B drug screening system based on the knowledge graph-assisted multi-task hepatitis B drug screening method according to any one of claims 1-5, characterized in that it includes: A temporal graphing module, which establishes temporal connections between entities based on hepatitis B virus genotype sequence data recorded over time, patient medication records, and drug sensitivity data, generating a temporal hepatitis B knowledge graph structure. A hyperedge fusion module, which, based on the time-series hepatitis B knowledge graph structure and combining the input biological pathway annotation information, protein interaction data, and known drug combination effect records, identifies sets of multiple entities acting together in the time-series hepatitis B knowledge graph structure, defines hyperedges connecting these sets of entities, obtains the multi-entity interaction sets and hyperedge definitions, and expands the connection methods of nodes in the time-series hepatitis B knowledge graph structure based on these multi-entity interaction sets and hyperedge definitions, thereby establishing a hepatitis B knowledge base that integrates hyperedge interactions. A representation learning module, which, based on the hepatitis B knowledge base that integrates hyperedge interactions, calculates the initial vector expression of each hepatitis B virus entity, drug entity, and gene entity to obtain the initial entity vector representation, and, based on the initial entity vector representation, updates and adjusts the vector expression by aggregating the neighborhood node information and the connection information of different types of edges in the graph to establish a multi-task hepatitis B entity feature vector.
An effect prediction module, which, based on the multi-task hepatitis B entity feature vector, selects the feature vector corresponding to the target hepatitis B virus variant and the feature vector corresponding to the candidate drug, estimates the drug sensitivity score between the two, obtains a list of viral strain drug sensitivity scores, evaluates the antagonistic effect of the drug combination based on the list of viral strain drug sensitivity scores and the multi-task hepatitis B entity feature vector corresponding to the candidate drug combination, and obtains the potential evaluation result of the hepatitis B drug combination.