A knowledge graph construction method based on clinical test big data

By screening and calculating the central interval and coefficient of variation from clinical data, and constructing and optimizing the connection edges of the knowledge graph, the problem of inaccurate screening of disease-specific test indicators in existing technologies is solved, and high-precision screening of disease-specific test indicators and diagnostic guidance are achieved.

CN122290984A · Pending · Publication Date: 2026-06-26 · WENZHOU MEDICAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: WENZHOU MEDICAL UNIV
Filing Date: 2026-02-03
Publication Date: 2026-06-26

IPC: G16H50/30; G16H50/70; G06F16/36; G06N5/022

AI Technical Summary

Technical Problem

Existing knowledge graph construction methods struggle to accurately identify disease-specific test indicators when faced with complex clinical scenarios, leading to a decline in diagnostic guidance value, an inability to effectively distinguish between different diseases, and the illusion of a "universal indicator."

Method used

By obtaining test data from clinical data on the target disease group, other disease groups, and healthy control groups, calculating the central tendency and dispersion, screening out a preliminary set of distinguishing items, calculating the coefficient of variation and stability value, constructing an initial knowledge graph, and optimizing the distinguishing strength of the connecting edges through multi-cohort validation, a disease-specific test indicator knowledge graph is finally formed.

Benefits of technology

It significantly improves the screening accuracy and clinical reliability of disease-specific test indicators, providing a scientific basis for precision diagnosis.

Abstract

This application provides a method for constructing a knowledge graph based on big data from clinical laboratory testing, belonging to the field of next-generation information technology. The method includes: collecting measurement results of the target disease group, other disease groups, and a healthy control group on the same test item; determining the central tendency and dispersion of each group on the same test item; determining a preliminary set of distinguishing items based on the central tendency and dispersion of each group; marking connecting edges with distinguishing strength attribute values higher than a threshold as high-distinguishing edges by traversing the distinguishing strength attribute values of all connecting edges in the initial knowledge graph, thereby obtaining an optimized weighted knowledge graph; recalculating the distinguishing strength attribute values of high-distinguishing edges in the optimized weighted knowledge graph; and determining the final disease-specific test indicator knowledge graph based on the evaluation results of the exponential stability of the same connecting edge across different validation queues.

Description

Technical Field

[0001] This invention relates to the field of information technology, and in particular to a method for constructing a knowledge graph based on big data from clinical laboratory testing.

Background Technology

[0002] In the healthcare field, disease and laboratory indicator knowledge graphs based on clinical laboratory data have become important tools for assisted diagnosis and precision medicine. The core objective of these knowledge graphs is to provide clinicians with scientific evidence on "which indicators best point to a specific disease" by quantifying the correlation strength between disease nodes and laboratory indicator nodes. However, current mainstream construction methods generally suffer from a key flaw when facing complex clinical scenarios: they struggle to accurately screen out truly disease-specific laboratory indicators, leading to a significant decrease in the diagnostic guidance value of the knowledge graph. Traditional methods typically use one of two approaches to determine the correlation strength between indicators and diseases. The first is based on the degree of abnormality within a single disease group: if an indicator significantly deviates from the normal reference range in patients with a particular disease, it is assigned a high weight. The second is based on the degree of variability within a single disease group: if an indicator's numerical distribution is very stable in patients with that disease (i.e., it has a low coefficient of variation (CV)), it is considered "highly reliable" and is also assigned a high weight. While both approaches seem reasonable, they overlook a crucial principle in clinical statistics: specificity must be determined by "between-group differences," not "within-group characteristics." Extensive clinical practice has shown that many laboratory indicators exhibit similar abnormal patterns and stability across multiple diseases, resulting in extremely low practical diagnostic value.
Constructing edges solely based on the characteristics of a single population artificially creates numerous "false positive strong associations," causing the same indicator node to have high-weighted edges with multiple disease nodes in the knowledge graph, severely weakening the discriminative power between diseases. Typical examples include white blood cell count (WBC), which shows a mild to moderate stable elevation (with intra-group CV < 25%) in various chronic inflammatory diseases such as type 2 diabetes, coronary heart disease, chronic kidney disease, and rheumatoid arthritis, but cannot distinguish between any two of these diseases. Serum creatinine (Cr) and blood urea nitrogen (BUN) show stable elevations in renal insufficiency caused by various etiologies, including chronic kidney disease, hypertensive nephropathy, diabetic nephropathy, and obstructive nephropathy, but cannot indicate the specific cause. Carcinoembryonic antigen (CEA) shows a stable mild elevation in colorectal cancer, lung cancer, pancreatic cancer, breast cancer, as well as in heavy smokers and benign bowel diseases, and has almost no diagnostic value when used alone. Alanine aminotransferase (ALT) can be consistently elevated in viral hepatitis, alcoholic liver disease, fatty liver, and drug-induced liver injury, but its high value alone cannot determine the cause. These indicators share a common profile: small intra-group variability and a clear deviation from normal values, but small inter-group differences. Using traditional methods, these indicators would be "claimed" by multiple disease nodes, leading to severe averaging of edge weights in the knowledge graph and ultimately creating the illusion of a "universal indicator," making it difficult for doctors to make accurate differential diagnoses.
Current knowledge graph construction processes almost completely ignore statistical comparisons against multiple disease control groups, failing to use the inter-group effect size of "target disease group vs. other common confounding disease groups" as a core weighting basis, and thus failing to truly capture the disease specificity of the indicators. This is equivalent to looking only at sensitivity and not at specificity when evaluating diagnostic tests. The resulting knowledge graph can only answer "which diseases may this abnormal indicator be related to", but cannot answer the most pressing clinical question: "Among many abnormal indicators, which one can best distinguish this disease from other diseases?"

Summary of the Invention

[0003] This invention provides a method for constructing a knowledge graph based on big data from clinical laboratory testing, mainly including: obtaining laboratory test data for the target disease group, other disease groups, and the healthy control group from clinical data, and determining the central interval and dispersion of each group for each test item; selecting a preliminary set of distinguishing items based on the central interval and the dispersion; calculating the coefficient of variation and stability value of each test item in the preliminary set of distinguishing items to determine the high-stability item set; constructing an initial knowledge graph based on the high-stability item set and its associated data, the initial knowledge graph including disease master nodes, test item sub-nodes, and connecting edges; traversing the connecting edges of the initial knowledge graph and marking those whose discrimination strength attribute value is higher than the threshold as high-discrimination edges, thus obtaining an optimized weighted knowledge graph; and evaluating the stability of the discrimination strength attribute values of the high-discrimination edges on the validation cohorts, updating the graph structure, and obtaining the final disease-specific test indicator knowledge graph.

[0004] Furthermore, the step of obtaining test data for the target disease group, other disease groups, and healthy control group from clinical data, and determining the central tendency and dispersion of each group in terms of test items, includes: Historical test data of the target disease group were extracted from clinical test information. Samples of other disease groups were screened by diagnostic codes. Data of the healthy control group were stratified by age group. The original values of the test items in the three sets of data were aligned and outliers were removed to obtain a standardized dataset. For the standardized dataset, the interquartile range of each data set is calculated as the central interval, and the standard deviation is calculated as the dispersion. Based on the central interval and dispersion, calculate the difference between the median of the central interval of the target disease group and other disease groups. If the difference is greater than the sum of the standard deviations of the two groups, it is determined that there is a significant difference between the groups, and the test items that meet the difference conditions are recorded.

[0005] Furthermore, the step of filtering out the initial set of distinguishable items based on the central interval and the dispersion amplitude includes: Obtain the upper and lower bounds of the central interval for the target disease group, and record the interval boundary values for the corresponding test items for other disease groups; The interval difference value is obtained by calculating the straight-line distance between the center points of the two intervals and taking the square root of the sum of the squares of the upper and lower bound differences. If the interval difference value is compared with a preset threshold, and the interval difference value exceeds the threshold, then the test item is marked as a candidate distinguishing item. Extract the dispersion of the candidate distinguishing items in each group. If the dispersion of the target disease group is smaller than that of other disease groups, record the name of the test item and the interval difference value. Summarize the items that meet the conditions to obtain a preliminary set of distinguishing items.

[0006] Furthermore, the step of calculating the coefficient of variation and stability value of each test item through the preliminary distinguishing item set to determine the high-stability item set includes: For the aforementioned preliminary set of distinguishing items, the ratio of the standard deviation to the mean of the target disease group is calculated to obtain the coefficient of variation; Calculate the coefficient of variation for the corresponding items in other disease groups, and obtain the stability ratio by dividing the coefficient of variation of other disease groups by the coefficient of variation of the target disease group; If the stability ratio is greater than the preset stability threshold, the item is marked as a high-stability item. Based on the high-stability items, relevant literature is retrieved from the literature database, diagnostic basis texts and clinical application instructions are extracted, structured association data is constructed, and a set of high-stability items is formed.

[0007] Furthermore, the step of constructing an initial knowledge graph based on the highly stable set of items and its associated data includes: Extract the target disease name as the main disease node identifier, extract the test item name as the child node identifier, and establish a unique node index through coding; For each pair of disease and test item combinations, the intragroup stability value and interval difference value of the test item in the target disease group are obtained. The discrimination strength value is calculated by multiplying the intragroup stability value and interval difference value of the test item in the target disease group. If the discrimination strength value is greater than the preset threshold, a connection edge is created between the disease master node and the test item sub-node, and the connection edge carries the discrimination strength value as the edge weight. An initial knowledge graph is formed by combining the connecting edges and nodes using a structured storage method.
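
The edge-creation rule in [0007] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `Edge` class, the `build_initial_graph` function, the tuple layout of the input, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    disease: str
    test_item: str
    strength: float  # discrimination strength, used as the edge weight


def build_initial_graph(candidates, threshold):
    """Create nodes and weighted edges from candidate disease/test-item pairs.

    candidates: iterable of (disease, test_item, stability, interval_diff).
    An edge is created only when stability * interval_diff exceeds threshold.
    """
    nodes = set()
    edges = []
    for disease, item, stability, interval_diff in candidates:
        strength = stability * interval_diff
        if strength > threshold:
            nodes.add(("disease", disease))
            nodes.add(("test_item", item))
            edges.append(Edge(disease, item, strength))
    return nodes, edges


# Invented values, for illustration only.
nodes, edges = build_initial_graph(
    [("T2DM", "HbA1c", 2.0, 1.5), ("T2DM", "WBC", 1.0, 0.2)],
    threshold=1.0,
)
```

With these invented inputs only the HbA1c pair clears the threshold, so the WBC pair produces neither an edge nor a test-item node.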

[0008] Furthermore, the step of traversing the connection edges of the initial knowledge graph and marking connection edges with a discrimination strength attribute value higher than a threshold as high discrimination edges to obtain an optimized weighted knowledge graph includes: Traverse all the connection edges in the initial knowledge graph and extract the distinguishing strength attribute value of each edge; The average distinguishing strength value is obtained by summing the distinguishing strength attribute values and dividing by the total number of connected edges. Based on the average discrimination strength, the discrimination strength attribute value of each connecting edge is compared one by one. If the attribute value of a connecting edge is greater than the average discrimination strength, it is marked as a high discrimination edge, and the optimized weight knowledge graph is obtained.
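
As a rough sketch of the traversal in [0008], the mean discrimination strength over all edges can serve as the cut-off for marking high-discrimination edges. The function name `mark_high_discrimination`, the dictionary layout, and the sample values are assumptions made for illustration.

```python
def mark_high_discrimination(edge_strengths):
    """Flag edges whose strength is strictly above the mean strength.

    edge_strengths: dict mapping (disease, test_item) -> discrimination strength.
    Returns a dict mapping the same keys to True/False.
    """
    if not edge_strengths:
        return {}
    mean_strength = sum(edge_strengths.values()) / len(edge_strengths)
    return {edge: s > mean_strength for edge, s in edge_strengths.items()}


# Invented strengths; the mean here is 2.0.
flags = mark_high_discrimination(
    {("D", "A"): 3.0, ("D", "B"): 1.0, ("D", "C"): 2.0}
)
```

Because the comparison is "greater than the average", an edge whose strength equals the mean (item C above) is not marked.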

[0009] Furthermore, the step of performing a stability evaluation on the discrimination strength attribute value of the highly discriminative edge based on the verification queue and updating the graph structure includes: The clinical data were grouped by time series using cross-validation, and multiple validation cohorts were formed by random sampling. For the verification queue, the intragroup stability value and interval difference value of the test item corresponding to the high-discrimination edge are calculated in each queue. The discrimination strength attribute value is obtained by multiplying the intragroup stability value and interval difference value of the test item corresponding to the high-discrimination edge. Based on the distinguishing strength attribute value, the coefficient of variation of the same connecting edge between different queues is calculated. If the coefficient of variation is less than the preset stability threshold, the edge is determined to have cross-queue stability. Based on the stability determination results, high-discrimination edges that meet the conditions are retained, edges that do not meet the conditions are removed, and the graph structure is updated.
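
The cross-cohort check in [0009] might look like the sketch below, which computes the coefficient of variation of one edge's discrimination strength across validation cohorts and keeps only edges below a stability threshold. The use of the population standard deviation, the function names, the threshold of 0.1, and the sample strengths are all assumptions; the text does not fix them.

```python
from statistics import mean, pstdev


def cross_cohort_cv(strengths):
    """Coefficient of variation of one edge's strength across cohorts."""
    m = mean(strengths)
    if m == 0:
        raise ValueError("mean strength is zero; CV is undefined")
    return pstdev(strengths) / m


def filter_stable_edges(strengths_by_edge, cv_threshold):
    """Keep edges whose cross-cohort CV is below cv_threshold."""
    return {
        edge: values
        for edge, values in strengths_by_edge.items()
        if cross_cohort_cv(values) < cv_threshold
    }


# Invented per-cohort strengths for two edges.
stable = filter_stable_edges(
    {
        ("T2DM", "HbA1c"): [3.0, 3.1, 2.9],
        ("T2DM", "WBC"): [1.0, 3.0, 5.0],
    },
    cv_threshold=0.1,
)
```

Here the first edge varies by under 3% across cohorts and is retained, while the second varies by more than 50% and is removed.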

[0010] Furthermore, obtaining the final disease-specific test indicator knowledge graph includes: Based on the updated graph structure, the topological relationships of the disease master node, test item sub-nodes, and high-discrimination edges are preserved. For each high-discrimination edge, its discrimination strength attribute value and cross-queue stability determination result are recorded; By using the topological relationships and attribute values, a structured knowledge base containing the association between diseases and test items is formed; Based on the structured knowledge base, the diagnostic basis text and clinical application instructions for each test item are extracted, a mapping relationship with the disease master node is established, and the final disease-specific test indicator knowledge graph is generated.

[0011] The technical solutions provided by the embodiments of the present invention may include the following beneficial effects: This invention discloses a knowledge graph construction method based on big data from clinical laboratory testing. Addressing the business scenario of accurately identifying specific laboratory indicators for target diseases from multiple sets of clinical data while ensuring their stability and clinical applicability, this method systematically solves the challenges of scientific rigor and reliability in indicator selection by integrating statistical analysis of measurement data, literature evidence extraction, and multi-cohort validation. First, this invention calculates the central tendency and dispersion of each data set to select a preliminary set of distinguishing items. Then, it determines a set of highly stable items by combining the coefficient of variation and inter-group differences. Diagnostic evidence is extracted from medical literature and clinical guidelines to construct a knowledge graph centered on diseases and laboratory indicators. The distinguishing strength of connection edges is optimized, and finally, multi-cohort cross-validation ensures the stability of the indicators. The technical effect of this invention is that it significantly improves the selection accuracy and clinical reliability of disease-specific laboratory indicators, providing a scientific basis for accurate diagnosis.

Attached Figure Description

[0012] Figure 1 is a flowchart of a knowledge graph construction method based on big data from clinical laboratory testing, according to the present invention.

[0013] Figure 2 is a schematic diagram of a knowledge graph construction method based on big data from clinical laboratory testing, according to the present invention.

[0014] Figure 3 is another schematic diagram of a knowledge graph construction method based on big data from clinical laboratory testing, according to the present invention.

Detailed Implementation

[0015] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0016] As shown in Figures 1-3, this embodiment of a knowledge graph construction method based on clinical laboratory big data may specifically include: S101. Collect measurement results of the target disease group, other disease groups and healthy control group on the same test item, and determine the central interval and dispersion of each group on the same test item.

[0017] Historical laboratory test data of patients in the target disease group were extracted from the clinical laboratory information system. Samples from other disease groups were screened using ICD-10 diagnostic codes. Data from healthy individuals undergoing physical examinations were stratified by age group. The raw values of laboratory tests for the three groups were aligned, and outliers exceeding the instrument's detection limits were removed to obtain a standardized multi-group control dataset. For this standardized multi-group control dataset, the 25th percentile (Q1) and 75th percentile (Q3) of each group were calculated, and [Q1, Q3] was taken as the central interval. The square root of the variance of each group was calculated as the standard deviation to define the magnitude of dispersion, thus obtaining the interval boundary values and dispersion measures for each disease group. Based on these interval boundary values and dispersion measures, an independent samples t-test was performed on the same laboratory test item between the target disease group and other disease groups. If the p-value was less than 0.05, a significant difference between groups was determined, and the laboratory tests that met the difference criteria were recorded, obtaining the central interval and magnitude of dispersion for each group on the same laboratory test item.

[0018] For example, in one implementation, the data interface of the hospital's clinical laboratory information system is used to batch extract the laboratory records of patients diagnosed with type 2 diabetes in the past three years. Simultaneously, target disease groups are screened based on the ICD-10 diagnostic code E11, and data from patients with other chronic diseases such as coronary heart disease and chronic kidney disease are extracted as control disease groups according to the principle of concurrent visits. The data collection process first establishes a unique patient identifier mapping table, uniformly converting outpatient and inpatient numbers into a patient master index to ensure accurate association of laboratory data from multiple visits by the same patient. For the healthy control group, equal numbers of healthy individuals are selected from the physical examination center database according to three age groups: 20-40 years, 40-60 years, and over 60 years. The screening criteria for these healthy individuals include no abnormal examination results, no hospitalization history in the past year, and no use of chronic disease medications. Data alignment is achieved through time window matching, pairing laboratory test items for the three groups within the same quarter. For items affected by food intake, such as blood glucose and blood lipids, only fasting values are retained. Outlier removal uses the instrument range determination method. When the test value exceeds the upper limit of the equipment detection or falls below the lower limit of the detection, the data point is marked as invalid and removed from the dataset.
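
The instrument-range outlier rule described above reduces to a simple bounds filter. The sketch below assumes that values exactly at a detection limit are kept; the function name `remove_out_of_range` and the sample limits are hypothetical.

```python
def remove_out_of_range(values, lower_limit, upper_limit):
    """Drop values below the lower or above the upper detection limit."""
    return [v for v in values if lower_limit <= v <= upper_limit]


# Invented readings with assumed detection limits of 0.5 and 50.0.
cleaned = remove_out_of_range([0.1, 2.0, 35.0, 50.1], 0.5, 50.0)
```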

[0019] In one possible implementation, the interquartile range (IQR) is calculated by sorting each data set by value and determining the specific values of the 25th percentile (Q1) and the 75th percentile (Q3). The difference between these two values is the IQR. The central tendency is defined as the range of values [Q1, Q3], reflecting the distribution area of the core data in that set. The standard deviation, as a measure of dispersion, is obtained by calculating the sum of squared deviations of each data point from the group mean, dividing by the sample size, and then taking the square root. A smaller standard deviation indicates that the test item exhibits a stable numerical distribution characteristic within a specific disease group.
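
A minimal sketch of this calculation using Python's standard library is shown below. It assumes the "inclusive" percentile convention (the text does not say which convention is used) and the population standard deviation, which matches the "dividing by the sample size" wording. The function name is hypothetical.

```python
from statistics import pstdev, quantiles


def central_interval_and_dispersion(values):
    """Return ([Q1, Q3], population standard deviation) for one group."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return (q1, q3), pstdev(values)


# Invented sample: the integers 1..9.
interval, sd = central_interval_and_dispersion(list(range(1, 10)))
iqr = interval[1] - interval[0]
```

For this sample the central interval is [3.0, 7.0], so the IQR is 4.0. Other percentile conventions can give slightly different boundaries on small samples.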

[0020] For example, the determination of differences between groups is made by comparing the median of the central interval. The median M1 of the central interval of the target disease group and the median M2 of the central interval of other disease groups are calculated. The absolute value of the difference between the two is compared with the sum of the standard deviations of the two groups. When |M1-M2| is greater than σ1+σ2, it is determined that there is a statistically significant difference between the two groups for this test item. σ1 and σ2 are the standard deviation values of the two groups, respectively.
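
The criterion |M1-M2| > σ1+σ2 can be sketched as below. The text does not say exactly how the "median of the central interval" is obtained; this sketch assumes it is the midpoint of [Q1, Q3], which is an interpretation rather than something the source confirms. The function name and the standard deviation values are invented.

```python
def significant_difference(interval_1, sd_1, interval_2, sd_2):
    """Apply |M1 - M2| > sd_1 + sd_2, with M taken as the interval midpoint."""
    m1 = (interval_1[0] + interval_1[1]) / 2  # assumed reading of "median"
    m2 = (interval_2[0] + interval_2[1]) / 2
    return abs(m1 - m2) > sd_1 + sd_2


# Intervals in ng/ml; the standard deviations are invented for illustration.
clearly_separated = significant_difference((0.8, 1.5), 0.4, (1.8, 3.2), 0.6)
too_dispersed = significant_difference((0.8, 1.5), 0.7, (1.8, 3.2), 0.7)
```

With midpoints of 1.15 and 2.5 the gap is 1.35, which exceeds a combined dispersion of 1.0 but not one of 1.4.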

[0021] Preferably, for the serum C-peptide test, the concentration range for the type 2 diabetes patient group is [0.8, 1.5] ng / ml, while that for the healthy control group is [1.8, 3.2] ng / ml. The concentration ranges of the two groups do not overlap and the distance between them is obvious. This test is identified as a disease-specific indicator with high discriminative power.

[0022] S102. Determine the preliminary set of distinguishable items based on the central interval and dispersion range of each group.

[0023] Obtain the upper and lower bounds of the target disease group's interval. Record the interval boundary values for the corresponding test items in other disease groups. Calculate the straight-line distance between the two intervals in two-dimensional space, that is, calculate the interval difference between different disease groups by taking the square root of the sum of the squares of the upper and lower bound differences. Specifically, treat the upper and lower bounds of each interval as a two-dimensional coordinate point. For example, the upper bound of the target group is U1 and the lower bound is L1, while the upper bound of the other group is U2 and the lower bound is L2. Then the interval difference value D is calculated as D = sqrt((U1-U2)^2 + (L1-L2)^2), where sqrt(x) denotes the square root. D measures the overall difference between two intervals in terms of numerical range. The interval difference is compared to a preset distance threshold. When the difference exceeds the threshold, the test item is marked as a candidate distinguishing item. Simultaneously, the dispersion magnitude of the candidate distinguishing item in each group is extracted, and it is determined whether the dispersion magnitude of the target disease group is smaller than that of other disease groups, thus obtaining a dual screening result that satisfies both the distance and dispersion conditions. For the test items in the dual screening result, their names and corresponding interval difference values are recorded as distinguishing strength indicators. All items that satisfy both the conditions of an interval difference greater than the threshold and a smaller dispersion magnitude in the target group are summarized to obtain a preliminary distinguishing item set.

[0024] For example, in one implementation, the Euclidean distance is calculated based on the geometric characteristics of the cluster intervals. The cluster intervals of each disease group are regarded as line segments in a two-dimensional coordinate system, where the upper bound is used as the ordinate and the lower bound is used as the abscissa. The interval differences are quantified by calculating the average of the differences between the upper and lower bounds of the two intervals.

[0025] Specifically, for serum cystatin C, a renal function indicator, the central interval for the chronic kidney disease group is [1.5, 2.8] mg/L, and for the diabetic nephropathy group it is [1.2, 2.3] mg/L. The interval difference is calculated by taking the square root of the sum of the squares of the upper bound difference (2.8 - 2.3 = 0.5) and the lower bound difference (1.5 - 1.2 = 0.3), that is, sqrt(0.25 + 0.09), resulting in an interval difference of approximately 0.58. The preset distance threshold is set based on the biological coefficient of variation of the test item, typically taking 20% of the normal reference interval range as the criterion.
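
The cystatin C figures can be checked with a short sketch. The stated result of about 0.58 is reproduced only when each bound difference is squared before summing. The function name `interval_difference` and the (lower, upper) tuple layout are assumptions.

```python
import math


def interval_difference(target, other):
    """Euclidean distance between two intervals given as (lower, upper)."""
    lower_diff = target[0] - other[0]
    upper_diff = target[1] - other[1]
    return math.hypot(upper_diff, lower_diff)


# Chronic kidney disease [1.5, 2.8] vs diabetic nephropathy [1.2, 2.3], in mg/L.
d = interval_difference((1.5, 2.8), (1.2, 2.3))
print(round(d, 2))  # 0.58
```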

[0026] It should be noted that the dual screening mechanism is designed with the actual needs of clinical diagnosis in mind. Relying solely on interval differences may lead to some highly variable indicators being misclassified as specific indicators, while the comparison of dispersion ensures the stability of data within the target disease group. When the dispersion of the target disease group is smaller than that of other disease groups, it indicates that the test item exhibits a more concentrated distribution characteristic in the target disease, resulting in higher diagnostic consistency.

[0027] In one possible implementation, for the thyroid function test group, the interval difference value of free thyroxine FT4 in patients with hyperthyroidism reached 8.5 pmol / L, which far exceeded the threshold of 3.0 pmol / L, and the dispersion of this group was only 2.1, which was significantly smaller than the 3.8 of other thyroid disease groups. Therefore, it was included as a candidate differentiator.
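
The dual screening condition can be expressed as a single predicate, sketched here with the FT4 figures from this paragraph. The function name and argument order are hypothetical.

```python
def passes_dual_screening(interval_diff, threshold, sd_target, sd_other):
    """True when the interval difference exceeds the threshold AND the
    target group's dispersion is smaller than the other group's."""
    return interval_diff > threshold and sd_target < sd_other


# FT4 in hyperthyroidism: difference 8.5 vs threshold 3.0, dispersion 2.1 vs 3.8.
ft4_included = passes_dual_screening(8.5, 3.0, 2.1, 3.8)
```

Either condition failing on its own is enough to exclude an item, which is what keeps highly variable indicators out of the preliminary set.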

[0028] Preferably, the distinguishing strength index is constructed by recording the name of each test item that meets the conditions and its corresponding interval difference value, forming a structured data record. Each record contains five fields: test item code, item name, interval difference value, target group dispersion, and control group dispersion.

[0029] For example, strict quality control is applied when constructing the preliminary distinguishing item set. For boundary cases, such as items whose interval difference value is exactly equal to the threshold, the dispersion amplitude ratio is introduced as a supplementary judgment criterion. Such an item is included in the set only when the ratio of the target group's dispersion amplitude to the control group's is less than 0.7, thereby improving the accuracy of screening.

[0030] S103. Based on the preliminary differentiation item set, determine the coefficient of variation of each test item in the target disease group and other disease groups. Combine the intragroup stability value of the target disease group relative to the control group to determine the high stability item set, and obtain the high stability item set and its associated diagnostic basis data.

[0031] For each test item in the preliminary distinguishing item set, the coefficient of variation (CV) is calculated as the ratio of its standard deviation to the mean in the target disease group. Similarly, the CVs for corresponding items in other disease groups are calculated. The stability ratio is obtained by dividing the CV of the other disease groups by the CV of the target disease group. If the stability ratio is greater than a preset stability threshold, the item is marked as a high-stability item. Search terms are constructed based on the names of the high-stability items. Relevant literature is retrieved from medical literature databases, and sentences containing co-occurrences of the test item name and the target disease name in the literature abstracts are extracted as diagnostic basis description text. The total number of times the test item appears in literature related to the target disease is counted. Keyword matching and regular expression recognition are performed on the diagnostic basis description text to extract numerical information related to the diagnostic threshold range. The chapters of the target disease diagnostic criteria are retrieved from the clinical diagnostic guidelines database, and the clinical application instructions and reference intervals corresponding to the high-stability items are matched. Structured association data is constructed using the total number of occurrences, diagnostic threshold ranges, and clinical application instructions. An index is created according to the test item codes, and a high-stability item set and its associated diagnostic basis data are summarized.

[0032] For example, in one implementation, the coefficient of variation (CV) is calculated on statistical principles: it assesses data stability by relating the dispersion of a test item's values to their central tendency. For each test item, the input is the list of test values for that item across all patients in the target disease group. The mean μ is calculated first, followed by the standard deviation σ. The output is CV = (σ / μ) × 100, the dispersion expressed as a percentage of the mean (the worked examples in this description quote the same quantity as the plain ratio σ / μ). Because the coefficient is dimensionless, the stability of test items measured in different units becomes comparable.
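The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function names, the sample values, and the threshold of 1.5 are assumptions made for the example, and the use of the sample standard deviation (n - 1) is also an assumption, since the text does not say which form is intended.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return sigma / mu as a plain ratio (multiply by 100 for a percentage)."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / mu  # sample standard deviation (assumption)


def stability_ratio(target_values, other_values):
    """CV of the other disease group divided by CV of the target group."""
    return coefficient_of_variation(other_values) / coefficient_of_variation(target_values)


# Made-up test values for one item in two groups (illustration only).
target_group = [8.1, 8.4, 8.6, 8.9, 8.5]
other_group = [4.0, 6.5, 8.8, 5.2, 6.5]
STABILITY_THRESHOLD = 1.5  # hypothetical preset threshold

ratio = stability_ratio(target_group, other_group)
is_high_stability = ratio > STABILITY_THRESHOLD
print(f"stability ratio = {ratio:.2f}, high stability: {is_high_stability}")
```

A ratio above the threshold means the item varies less, relative to its mean, in the target group than in the comparison group.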

[0033] Specifically, for glycated hemoglobin (HbA1c), a diabetes-specific indicator, the mean value in the type 2 diabetes patient group was 8.5%, with a standard deviation of 1.2% and a coefficient of variation of 0.14; in the chronic kidney disease group, the mean value was 6.2%, with a standard deviation of 2.1% and a coefficient of variation of 0.34. Dividing 0.34 by 0.14 gives a stability ratio of 2.43. With a preset stability threshold of 1.5, this item was marked as highly stable, indicating that HbA1c values are more concentrated and stable in type 2 diabetes patients than in other disease groups. The medical literature retrieval strategy combines MeSH vocabulary with free-text terms. The search terms have three levels: core terms are the standard names of the test items; extended terms include common abbreviations, synonyms, and related test method names for the item; and limiting terms are the standardized diagnostic names of the target disease. Test item terms and disease terms are joined with the Boolean operator AND, while OR joins the different expressions of the same concept, yielding a complete search expression. The literature search scope was limited to clinical research, treatment guidelines, and expert consensus documents published in the last ten years. After initial screening by title and abstract, the full text was scanned in depth to extract text fragments in which the test item name and the target disease name co-occur in the same or adjacent sentences.
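The three-level search expression can be illustrated as follows. The helper names and the term lists are hypothetical, and the output uses generic Boolean syntax; a real literature database may require its own field tags or MeSH qualifiers, which are not shown here.

```python
def or_group(terms):
    """Join alternative expressions of one concept with OR, quoting each term."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_search_expression(core_term, extended_terms, disease_terms):
    """Combine the test item group and the disease group with AND."""
    item_group = or_group([core_term, *extended_terms])
    disease_group = or_group(disease_terms)
    return f"{item_group} AND {disease_group}"


expression = build_search_expression(
    core_term="glycated hemoglobin",              # core term: standard item name
    extended_terms=["HbA1c", "hemoglobin A1c"],   # extended terms: abbreviations, synonyms
    disease_terms=["type 2 diabetes mellitus", "T2DM"],  # limiting terms
)
print(expression)
# ("glycated hemoglobin" OR "HbA1c" OR "hemoglobin A1c") AND ("type 2 diabetes mellitus" OR "T2DM")
```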

[0034] For example, the extraction of diagnostic basis description text employs syntactic analysis to identify the subject-verb-object structure in sentences. When the test item is the subject or object, and the sentence predicate contains diagnostic-related verbs such as "diagnosis," "suggestion," "support," and "differentiation," the sentence is marked as a candidate diagnostic basis sentence. Each candidate sentence is scored for semantic relevance along three dimensions: the word distance between the test item and the disease name, the presence of a quantitative description, and the inclusion of a clinically significant explanation. Sentences whose combined score exceeds a preset threshold are formally included in the diagnostic basis description text set. Frequency statistics do not simply count documents; they also weight each occurrence by the impact factor and citation frequency of the document, assigning higher weight to appearances in high-quality journals and highly cited literature.

[0035] In one possible implementation, keyword matching employs a multi-pattern string matching method to construct a diagnostic threshold recognition template. The template includes numerical prefix patterns such as "greater than," "less than," and "between"; numerical body patterns covering integers, decimals, percentages, and scientific notation; and unit suffix patterns covering the various expressions of both the International System of Units (SI) and traditional unit systems. The regular expression design accounts for the diversity of numerical expressions in medical literature, such as "ALT>40U/L," "blood glucose between 7.0-11.1mmol/L," and "creatinine elevated above 120μmol/L." These are matched one by one against a set of pre-compiled regular expressions to extract standardized numerical range information.
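A simplified sketch of such a recognition template is shown below, using Python's standard `re` module. The pattern names and the output dictionary layout are assumptions for illustration. The sketch covers only plain decimals with range and one-sided prefixes; percentages as values and scientific notation are left out.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# Unit: letters (plus % or a micro sign), optionally followed by "/ letters".
UNIT = r"[A-Za-z%\u00b5\u03bc]+(?:\s*/\s*[A-Za-z]+)?"

# Pre-compiled patterns: an interval such as "7.0-11.1mmol/L" and a
# one-sided threshold such as ">40U/L" or "above 120 umol/L".
RANGE_RE = re.compile(rf"(?P<low>{NUMBER})\s*-\s*(?P<high>{NUMBER})\s*(?P<unit>{UNIT})")
ONE_SIDED_RE = re.compile(
    rf"(?P<op>>=|<=|>|<|greater than|less than|above|below)\s*"
    rf"(?P<value>{NUMBER})\s*(?P<unit>{UNIT})",
    re.IGNORECASE,
)
WORD_TO_OP = {"greater than": ">", "above": ">", "less than": "<", "below": "<"}


def normalize_unit(unit):
    """Remove stray whitespace so that 'U / L' and 'U/L' compare equal."""
    return re.sub(r"\s+", "", unit)


def extract_threshold(text):
    """Return a dict describing the first threshold found in text, or None."""
    m = RANGE_RE.search(text)
    if m:
        return {"kind": "range", "low": float(m["low"]), "high": float(m["high"]),
                "unit": normalize_unit(m["unit"])}
    m = ONE_SIDED_RE.search(text)
    if m:
        op = WORD_TO_OP.get(m["op"].lower(), m["op"])
        return {"kind": "one_sided", "op": op, "value": float(m["value"]),
                "unit": normalize_unit(m["unit"])}
    return None


for sample in ["ALT>40U/L",
               "blood glucose between 7.0-11.1 mmol/L",
               "creatinine elevated above 120 μmol / L"]:
    print(sample, "->", extract_threshold(sample))
```

Trying the range pattern before the one-sided pattern keeps an interval from being misread as a single bound.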

[0036] Preferably, the retrieval and positioning of clinical diagnostic guidelines employs a combination of chapter title identification and content keyword positioning. By identifying chapter titles such as "diagnostic criteria," "laboratory tests," and "auxiliary examination indicators," relevant content areas are quickly located, and specific entries are found through precise matching of test item names. The structured associated data is organized using a three-layer architecture: the top layer is the test item index layer, using standardized test item codes as the primary key; the middle layer is the diagnostic evidence layer, containing four core fields: literature source, diagnostic basis text, frequency weighted value, and diagnostic threshold range; the bottom layer is the clinical application layer, storing practical information extracted from the guidelines, such as recommendation level, applicable population, and testing timing. Furthermore, data association integrates multi-source information through test item codes, with each highly stable item corresponding to a complete chain of evidence, from statistical characteristics to literature support to clinical guideline recommendations, forming a comprehensive diagnostic evidence system.

[0037] For example, regarding thyroid peroxidase antibody (TPOAb), an autoimmune thyroid disease marker, the associated data showed a stability ratio of 3.2 in patients with Hashimoto's thyroiditis, a weighted literature frequency of 486, a diagnostic threshold of >34 IU/ml, and a clinical guideline recommendation of Level A evidence. Therefore, this item was identified as a core diagnostic indicator for Hashimoto's thyroiditis.

[0038] In one embodiment, the formation of the high-stability item set may also include a data integrity verification step, which requires each item to have three types of information: statistical evidence, literature evidence, and guideline support. Items lacking any of these three types of information are temporarily kept in the candidate set and formally included once the missing data has been supplemented. This rigorous screening mechanism ensures the reliability and clinical applicability of each diagnostic association in the knowledge graph.

[0039] Extract the descriptive text of the basis for the use of the test item in the diagnosis of the target disease from the relevant literature of each test item in the high stability test item set. Count the total number of times the test item appears in the literature related to the target disease as the literature frequency. Associate the extracted diagnostic basis descriptive text, literature frequency, and diagnostic criteria with the corresponding test item.

[0040] Based on the names of the test items in the high-stability item set and the diagnostic names of the target diseases, a search expression composed of keywords, synonyms, and abbreviations is constructed. Clinical research literature published in the past decade is retrieved in batches from medical literature databases. A preliminary screening by title and abstract yields a set of relevant full-text articles. For this set, syntactic analysis is used to identify sentence segments containing both test item and disease names. Complete sentences containing diagnostic-related verbs are extracted as diagnostic basis description text. The frequency of each test item in the literature set is calculated. The diagnostic criteria chapters for the target disease are retrieved from the clinical diagnostic guidelines database. Relevant paragraphs involving the test items are located, and their numerical ranges and unit information are extracted as diagnostic threshold ranges. At the same time, the clinical application instructions for the corresponding test items are obtained. Based on the diagnostic basis description text, literature frequency, diagnostic threshold range, and clinical application instructions, a mapping is established according to the test item codes, forming structured records that are associated with the corresponding test items.

[0041] Specifically, the acquisition of the full-text collection of literature involves a three-stage screening mechanism. The initial screening stage uses title keyword matching to exclude obviously irrelevant literature; the second screening stage scans the abstract content to confirm that the literature involves research on the correlation between the target disease and laboratory tests; the final screening stage obtains the full text to verify that the literature type is clinical research, systematic review, or treatment guidelines. Each article is assigned a quality score, with higher weighting given to journal articles with an impact factor greater than 5, classic articles with more than 50 citations, and guidelines published by national academic societies, ensuring that subsequent analyses are based on high-quality evidence sources.

[0042] It should be noted that the implementation of the syntactic analysis method involves multiple natural language processing stages. Sentences are broken down into lexical units through word segmentation, part-of-speech tagging identifies grammatical components such as nouns, verbs, and adjectives, and dependency parsing determines the modification relationships between words. When the test item name is identified as a noun phrase and has a direct or indirect dependency relationship with diagnostic-related verbs such as "diagnosis," "suggestion," "support," "differentiation," and "prediction" in the sentence, the sentence is marked as a candidate diagnostic basis. Co-occurrence judgment uses a sliding window mechanism with a window size of 50 words. When the test item and disease name appear in the same window and there are no negative words or adversative conjunctions in between, it is considered a valid co-occurrence. For a sentence like "Elevated NT-proBNP levels are an important basis for the diagnosis of heart failure, and a value greater than 125 pg/ml has high diagnostic value for heart failure," NT-proBNP is identified as the test item, heart failure as the target disease, and "diagnosis" as the related verb. The entire sentence is extracted as the diagnostic basis description text.
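The co-occurrence judgment can be approximated with the sketch below. It assumes English text split into words by a simple regular expression, and a short, illustrative list of negative words and adversative conjunctions. Rather than literally sliding a window, it checks whether a pair of mentions fits inside a 50-word span, which gives the same decision. Dependency parsing is not covered.

```python
import re

WINDOW = 50  # window size in words, as stated above
# Illustrative list only; a real system would use a curated lexicon.
NEGATIONS = {"not", "no", "without", "but", "however", "although"}


def tokenize(text):
    """Lowercase word tokens; hyphens are kept so 'NT-proBNP' stays whole."""
    return re.findall(r"[\w-]+", text.lower())


def find_phrase(tokens, phrase):
    """Return the start indices at which the token list `phrase` occurs."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def valid_cooccurrence(text, item, disease):
    """True if item and disease share a window with no negation in between."""
    tokens, item_t, dis_t = tokenize(text), tokenize(item), tokenize(disease)
    for i in find_phrase(tokens, item_t):
        for j in find_phrase(tokens, dis_t):
            start = min(i, j)
            end = max(i + len(item_t), j + len(dis_t))
            if end - start > WINDOW:
                continue  # the two mentions do not fit in one window
            first_end = i + len(item_t) if i <= j else j + len(dis_t)
            between = tokens[first_end:max(i, j)]
            if not NEGATIONS & set(between):
                return True
    return False


positive = ("Elevated NT-proBNP levels are an important basis "
            "for the diagnosis of heart failure")
negative = "NT-proBNP was not elevated in patients without heart failure"
print(valid_cooccurrence(positive, "NT-proBNP", "heart failure"))  # True
print(valid_cooccurrence(negative, "NT-proBNP", "heart failure"))  # False
```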

[0043] For example, the frequency of document occurrences is statistically analyzed using a weighted counting method. Items appearing in the title are assigned a weight of 3, those in the abstract a weight of 2, and those in the main text a weight of 1. The importance of the location of occurrence is also considered: the weight of items appearing in the results or conclusion sections is increased by 50%, the weight in the discussion section remains unchanged, and the weight in the introduction or background section is decreased by 30%. By summing the weighted values for each location, the overall occurrence score of the item in a single document is obtained. Then, the scores of all relevant documents are summed to obtain the document occurrence frequency index.
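The weighted counting scheme can be sketched as follows. The data layout (a list of `(location, section)` pairs per document) and the names are assumptions. The sketch also assumes that the section adjustment applies only to mentions in the main text, because the paragraph does not say how the two sets of weights combine for title and abstract mentions.

```python
LOCATION_WEIGHT = {"title": 3.0, "abstract": 2.0, "body": 1.0}
# +50% for results/conclusion, unchanged for discussion, -30% for introduction/background.
SECTION_FACTOR = {
    "results": 1.5, "conclusion": 1.5,
    "discussion": 1.0,
    "introduction": 0.7, "background": 0.7,
}


def document_score(mentions):
    """Sum the weighted mentions of one item in one document.

    mentions: list of (location, section) pairs; section is None for
    title and abstract mentions (assumption, see lead-in).
    """
    total = 0.0
    for location, section in mentions:
        factor = SECTION_FACTOR.get(section, 1.0)
        total += LOCATION_WEIGHT[location] * factor
    return total


def frequency_index(documents):
    """Sum the per-document scores over all relevant documents."""
    return sum(document_score(mentions) for mentions in documents)


# Two hypothetical documents mentioning the same test item.
doc_a = [("title", None), ("abstract", None), ("body", "results"), ("body", "introduction")]
doc_b = [("body", "discussion")]

print(f"{document_score(doc_a):.1f}")            # 3 + 2 + 1.5 + 0.7 = 7.2
print(f"{frequency_index([doc_a, doc_b]):.1f}")  # 7.2 + 1.0 = 8.2
```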

[0044] In one possible implementation, the chapter location of clinical diagnostic guidelines employs a hierarchical search strategy. First, chapters containing keywords such as "diagnosis," "examination," and "laboratory indicators" in their primary headings are identified. Then, within these chapters, secondary headings are searched to locate specific subsections such as "diagnostic criteria" and "auxiliary examinations."

[0045] Preferably, the extraction of diagnostic threshold ranges requires processing various numerical expression formats. For interval expressions such as "2.5-5.0 mmol/L", the upper and lower bound values and units are extracted; for unilateral thresholds such as ">40 U/L" or "<2.0 mg/dL", comparison symbols and critical values are identified; for graded thresholds such as "mild elevation 1.5-2 times the upper limit of normal, moderate elevation 2-5 times, severe elevation >5 times", the numerical ranges for each grade are extracted separately. All extracted values are uniformly converted to the International System of Units (SI) for easy subsequent standardization. Clinical application instructions typically contain practical information such as testing timing, sample requirements, and influencing factors. Testing timing requirements are extracted by identifying time markers such as "fasting", "morning", and "2 hours postprandial"; sample type terms such as "serum", "plasma", and "whole blood" are identified to determine specimen requirements; and statements such as "affected by..." and "...can lead to false positives" are identified to extract explanations of interfering factors. Furthermore, the structured records are generated using a multi-field data model, with core fields including nine dimensions: test item code, Chinese name of the item, English abbreviation of the item, set of text describing the diagnostic basis, total frequency of occurrence in literature, range of diagnostic threshold values, threshold unit, clinical application instructions, and data update timestamp.

[0046] For example, in the structured record of B-type natriuretic peptide (BNP), a marker of cardiac function, the item code is "LABBNP 001", the diagnostic basis description text set contains 15 statements from high-quality literature, the weighted literature frequency is 892, the heart failure diagnostic threshold is ">100 pg/ml", and the clinical application instructions state that "the influence of confounding factors such as renal insufficiency and advanced age should be excluded".
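One way to picture the nine-field record is as a small data class indexed by test item code. The class and field names below are hypothetical. The 15 literature statements are not reproduced in the text, so the list is left empty rather than invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DiagnosticRecord:
    """Nine core fields of a structured record (field names are illustrative)."""
    item_code: str
    name_zh: str                 # Chinese name of the item
    abbreviation: str            # English abbreviation of the item
    basis_texts: list            # diagnostic basis description texts
    literature_frequency: float  # total (weighted) frequency in the literature
    threshold_range: str
    threshold_unit: str
    application_notes: str       # clinical application instructions
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


bnp = DiagnosticRecord(
    item_code="LABBNP 001",
    name_zh="B型利钠肽",
    abbreviation="BNP",
    basis_texts=[],  # the 15 source statements are not given in the text
    literature_frequency=892,
    threshold_range=">100",
    threshold_unit="pg/ml",
    application_notes=("the influence of confounding factors such as renal "
                       "insufficiency and advanced age should be excluded"),
)

# Index the records by test item code, as the description suggests.
records_by_code = {bnp.item_code: bnp}
print(records_by_code["LABBNP 001"].threshold_range,
      records_by_code["LABBNP 001"].threshold_unit)
```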

[0047] S104. Based on the set of highly stable items and their associated diagnostic data, construct a knowledge graph framework with disease names as the main nodes and test item names as the sub-nodes. Establish connection edges between the disease main node and the test item sub-nodes to obtain the initial knowledge graph.

[0048] Based on the set of highly stable items and their associated diagnostic data, the diagnosis name of each disease is extracted as the master node identifier, and the corresponding test item name is extracted as the child node identifier. A unique node index is established through disease codes and test item codes, forming a basic dataset containing disease master nodes and test item child nodes. For each pair of disease and test item combinations in the basic dataset, the intra-group stability value of the test item in the target disease group is obtained, and the interval difference value of the item between different disease groups is extracted. The discrimination strength value is calculated by multiplying the intra-group stability value of the test item in the target disease group and the interval difference value of the item between different disease groups. Based on the comparison result of the discrimination strength value and a preset minimum threshold, a connection edge is created between the disease master node and the test item child node with a discrimination strength value greater than the threshold. The connection edge carries the discrimination strength value as the edge weight attribute value. Through the combination relationship between the connection edge and the node, the node attributes, edge attributes, and topological relationships are stored in a structured storage method to form an initial knowledge graph containing disease master nodes, test item child nodes, and their weighted connection edges.

[0049] For example, in one implementation, the knowledge graph's node system adopts a two-level architecture. The disease master node serves as the first-level node, storing core attributes such as the complete diagnostic name of the disease, ICD-10 code, disease classification, and characteristics of the affected population. The test item sub-node serves as the second-level node, containing basic information such as the test item's standard name, laboratory code, testing method, reference range, and sample type. The node uniqueness index is constructed by combining the disease's ICD code prefix with the test item's laboratory code suffix, forming a globally unique node identifier to ensure that no node duplication or confusion occurs during graph expansion.

[0050] Specifically, the calculation of the discrimination strength reflects the dual requirements of disease specificity: the test indicator must be stable within the target disease group and significantly different from other disease groups. The intragroup stability value reflects the consistency of the test item within a specific disease patient population; a larger value indicates greater stability of the indicator in that population. The interval difference value quantifies the degree of separation between different disease groups; a larger value indicates stronger discriminating ability of the indicator. The discrimination strength obtained by multiplying these two values takes both stability and difference into account, avoiding the bias that relying on a single measure might cause.

[0051] For example, if a certain test item has large differences between groups but also large variations within groups, or is very stable within groups but has no differences between groups, the discrimination strength will be low in both cases. Only items that simultaneously meet the requirements of stability within groups and significant differences between groups will obtain a high discrimination strength value.
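The effect of multiplying the two factors can be seen with a few made-up numbers. The values and the cut-off of 1.5 are purely illustrative assumptions, not figures from the description.

```python
def discrimination_strength(intra_group_stability, interval_difference):
    """Product of the intragroup stability value and the interval difference value."""
    return intra_group_stability * interval_difference


MIN_STRENGTH = 1.5  # hypothetical preset minimum threshold

# (intragroup stability, interval difference) for three hypothetical items.
cases = {
    "stable and well separated": (0.9, 5.0),
    "separated but unstable": (0.1, 5.0),
    "stable but not separated": (0.9, 0.2),
}

results = {}
for label, (stability, difference) in cases.items():
    strength = discrimination_strength(stability, difference)
    results[label] = strength > MIN_STRENGTH
    print(f"{label}: strength = {strength:.2f}, passes: {results[label]}")
```

Only the first item clears the threshold, because a weak value in either factor pulls the product down.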

[0052] It should be noted that the determination of the preset minimum threshold is based on both statistical significance and clinical practicality. Through retrospective analysis of historical diagnostic data, the precision and recall of the knowledge graph at different thresholds were calculated, receiver operating characteristic (ROC) curves were plotted, and the threshold corresponding to the largest area under the curve was selected as the initial setting. In practical applications, this threshold can also be dynamically adjusted according to the characteristics of specific disease areas. For rare diseases, the threshold can be appropriately lowered to retain more potentially relevant indicators, while for common diseases, the threshold can be appropriately raised to highlight core diagnostic indicators.

[0053] For example, the process of creating connection edges not only records the association between diseases and test items, but also quantifies the association strength through edge weight attribute values. For the connection edge between the type 2 diabetes master node and the glycated hemoglobin sub-node, the discrimination strength value reaches 4.8, far exceeding the minimum threshold of 1.5, so a strong connection is established; for the white blood cell count from the routine blood test, the discrimination strength value is only 0.9, below the threshold, so no connection edge is established. This threshold-based selective connection mechanism ensures that the knowledge graph retains only associations with practical diagnostic value, avoiding interference from irrelevant or weakly relevant information.

[0054] In one possible implementation, the knowledge graph is stored using a hybrid structure combining adjacency lists and attribute tables. The adjacency list records the connections between nodes, and each disease master node maintains a list of pointers to related test item child nodes; the attribute table stores detailed attribute information of nodes and edges, including basic node information, edge weights, data sources, update times, and other metadata.
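A minimal in-memory version of this hybrid structure might look like the sketch below. The dictionary layout, function names, and laboratory codes are hypothetical ("E11" is the ICD-10 category for type 2 diabetes mellitus). The strengths 4.8 and 0.9 and the threshold 1.5 are the figures from the example given earlier.

```python
MIN_STRENGTH = 1.5  # minimum threshold from the example above

adjacency = {}        # disease node id -> list of connected test item node ids
node_attributes = {}  # attribute table for nodes
edge_attributes = {}  # attribute table for edges, keyed by (disease id, item id)


def add_node(node_id, **attrs):
    """Store a node's descriptive attributes in the node attribute table."""
    node_attributes[node_id] = attrs


def add_edge_if_strong(disease_id, item_id, strength):
    """Create a weighted edge only when the strength exceeds the threshold."""
    if strength <= MIN_STRENGTH:
        return False
    adjacency.setdefault(disease_id, []).append(item_id)
    edge_attributes[(disease_id, item_id)] = {"discrimination_strength": strength}
    return True


add_node("E11", kind="disease", name="type 2 diabetes")
add_node("LAB-HBA1C", kind="test_item", name="glycated hemoglobin")   # hypothetical code
add_node("LAB-WBC", kind="test_item", name="white blood cell count")  # hypothetical code

add_edge_if_strong("E11", "LAB-HBA1C", 4.8)  # above threshold: edge created
add_edge_if_strong("E11", "LAB-WBC", 0.9)    # below threshold: no edge

print(adjacency)        # {'E11': ['LAB-HBA1C']}
print(edge_attributes)  # {('E11', 'LAB-HBA1C'): {'discrimination_strength': 4.8}}
```

Keeping attributes in separate tables lets the adjacency list stay small while metadata such as data source or update time is added per node or edge.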

[0055] Preferably, the topological relationships are organized using a directed graph structure, where an edge from the disease master node to the test item child node represents the semantic relationship "the disease can be diagnosed through this test item". In addition to carrying a discrimination strength value as the main weight, each edge can also be supplemented with other attributes such as diagnostic sensitivity, specificity, positive predictive value and other auxiliary indicators to form a multi-dimensional association description.

[0056] Understandably, after the initial knowledge graph is constructed, structural verification and optimization are still required. This involves identifying isolated nodes or weakly connected subgraphs by calculating graph connectivity metrics; discovering overly connected "hub" test items by analyzing the node degree distribution; and identifying clustering patterns between diseases and test items using community detection algorithms. Furthermore, the knowledge graph supports an incremental update mechanism: when a new high-stability test item is identified or the discrimination strength value of an existing item changes, only the affected nodes and edges need to be updated, without reconstructing the entire graph structure.

[0057] For example, when constructing a knowledge graph of cardiovascular diseases, acute myocardial infarction is used as the main node, and connection edges of varying strengths are established with multiple sub-nodes of laboratory tests, such as cardiac troponin I, creatine kinase isoenzymes, and myoglobin. The connection edge weight for cardiac troponin I is the highest, reaching 5.2, reflecting its status as the gold standard for myocardial injury diagnosis. Myoglobin, although appearing earlier, has lower specificity, resulting in a connection edge weight of only 2.1. This differentiated edge weighting allows clinicians to quickly identify the most diagnostically valuable tests when using the knowledge graph, improving the accuracy and efficiency of diagnostic decisions.

[0058] S105. By traversing the discrimination strength attribute values of all connecting edges in the initial knowledge graph, connecting edges with discrimination strength attribute values higher than the threshold are marked as high discrimination edges, thus obtaining the optimized weighted knowledge graph.

[0059] Traverse all edges in the initial knowledge graph, extract the discrimination strength attribute value of each edge, sum the discrimination strength attribute values and divide by the total number of edges to obtain the average discrimination strength of all edges. Based on the average discrimination strength, compare the discrimination strength attribute value of each edge one by one. If the attribute value of an edge is greater than the average discrimination strength, it is marked as a high discrimination edge. After marking all edges, the optimized weighted knowledge graph is obtained.

[0060] For example, in one implementation, the knowledge graph is traversed using a depth-first search method. The input is the knowledge graph and the starting node, i.e., the first disease master node, and the output is a temporary array containing the discrimination strength attribute values of all edges. Starting from the first disease master node, all its connection edges are visited in sequence, and the discrimination strength attribute value of each edge is recorded in the temporary array. Once that node is finished, the process moves to the next disease master node, until all edges of all nodes have been traversed.

[0061] In one implementation, all connection edges in the initial knowledge graph are traversed, and the discrimination strength attribute value D of each edge is extracted. All D values are summed to obtain a total S, which is then divided by the total number of edges N to give the average A = S / N. For example, if there are 5 edges with D values of 2, 3, 4, 5, and 6, then S = 20 and A = 4. The D of each edge is compared with A one by one; if D > A, the edge is marked as a high-discrimination edge (for example, D = 5 > 4). After marking is completed, the optimized weighted knowledge graph is formed.
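The five-edge example above can be reproduced directly. The edge identifiers and the name of the Boolean flag are assumptions for illustration.

```python
# Five edges with the D values from the example above; ids are hypothetical.
edges = [{"id": f"e{i}", "D": d} for i, d in enumerate([2, 3, 4, 5, 6], start=1)]

S = sum(edge["D"] for edge in edges)  # total discrimination strength
A = S / len(edges)                    # average A = S / N

# Mark each edge with a Boolean flag instead of deleting the weaker ones,
# so that the full edge set stays available for other filtering strategies.
for edge in edges:
    edge["high_discrimination"] = edge["D"] > A

high_edges = [edge["id"] for edge in edges if edge["high_discrimination"]]
print(S, A, high_edges)  # 20 4.0 ['e4', 'e5']
```

Note that the edge with D = 4 equals the average and is therefore not marked, since the comparison is strict.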

[0062] Specifically, the average discrimination strength is calculated by summing the discrimination strength values of all connection edges and dividing by the total number of edges. For a knowledge graph with 500 edges, if the sum is 1250, the average is 2.5, which serves as the dividing line between high-discrimination edges and the remaining edges. Using the average as a screening criterion is based on the statistical principle of normal distribution. In the field of medical testing, the association strength between most test items and diseases exhibits a normal distribution, and the average can effectively distinguish items with significant diagnostic value.

[0063] Preferably, the marking process adds a Boolean identifier field to the original edge attributes, marking high-discrimination edges as true and other edges as false. This marking method retains the information of all edges, making it easier to adjust the filtering strategy according to different application scenarios.

[0064] For example, in emergency rapid diagnosis scenarios, only the test items corresponding to high-discrimination edges need to be called, reducing the number of test items and improving diagnostic efficiency.

[0065] S106. Recalculate the discrimination strength attribute values of the high-discrimination edges in the optimized weighted knowledge graph, and determine the final disease-specific test index knowledge graph based on the stability evaluation results for the same connection edge across different validation cohorts.

[0066] Cross-validation was used to group clinical data according to patient visit time series, with each group containing case data within the same time span. Multiple non-overlapping validation cohorts were formed through random sampling, maintaining the same disease type proportions as the original dataset in each cohort. For each validation cohort, the intra-group stability value and interval difference value of the test item corresponding to each high-discrimination edge in the optimized weighted knowledge graph were calculated independently. The discrimination strength attribute value for that cohort was obtained by multiplying these two values, and the set of discrimination strength values across cohorts was recorded. From this set, the ratio of the standard deviation to the mean of the discrimination strength attribute values of the same connection edge across different cohorts was calculated as the coefficient of variation. If the coefficient of variation was less than a preset stability threshold, the edge was determined to have cross-cohort stability. Based on these determinations, high-discrimination edges that met the stability requirement were retained, while edges that did not were removed. The graph topology and edge attribute information were updated to obtain the final disease-specific test indicator knowledge graph validated by multiple cohorts.

[0067] For example, in one implementation, the cross-validation cohort division employs a time-series stratification strategy, dividing the three-year clinical data into 12 quarterly periods. Each validation cohort consists of three randomly selected non-contiguous time periods, forming four non-overlapping validation cohorts. This combination of non-contiguous time periods avoids the impact of seasonal fluctuations in disease incidence on the validation results, while ensuring that each cohort contains case data from different periods, thus enhancing the representativeness of the validation.
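One way to obtain such a split is rejection sampling: shuffle the 12 quarters, cut them into four groups of three, and retry until no group contains two adjacent quarters. This technique is an assumption chosen for brevity; the description does not specify how the random selection is carried out. The sketch works at the level of time periods only and does not perform the stratified sampling of cases.

```python
import random


def split_periods(n_periods=12, n_cohorts=4, seed=0, max_tries=10_000):
    """Randomly partition period indices into cohorts of non-contiguous periods."""
    rng = random.Random(seed)
    size = n_periods // n_cohorts
    periods = list(range(n_periods))
    for _ in range(max_tries):
        rng.shuffle(periods)
        cohorts = [sorted(periods[k * size:(k + 1) * size]) for k in range(n_cohorts)]
        # Accept only if no cohort holds two adjacent quarters.
        if all(b - a > 1 for cohort in cohorts for a, b in zip(cohort, cohort[1:])):
            return cohorts
    raise RuntimeError("no valid split found within max_tries")


cohorts = split_periods()
for number, cohort in enumerate(cohorts, start=1):
    print(f"validation cohort {number}: quarters {cohort}")
```

A deterministic alternative would assign quarters k, k + 4, and k + 8 to cohort k, which always satisfies the constraint but is not random.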

[0068] Specifically, maintaining the proportion of disease types within the cohorts was achieved through stratified sampling. For the original dataset, where type 2 diabetes accounted for 30% of total cases, coronary heart disease for 25%, chronic kidney disease for 20%, and other diseases for 25%, each validation cohort strictly adhered to this proportion for case allocation. When the number of cases for a particular disease type was insufficient within a specific time period, cases of the same type from adjacent time periods were used to supplement the data, ensuring that each cohort contained at least 1000 valid cases.

[0069] It should be noted that the process of recalculating the discrimination strength attribute values in each validation cohort involved independent statistical analysis. For the serum cystatin C test, in the first validation cohort, the intragroup stability value for the chronic kidney disease group was 0.82, the interval difference from the healthy control group was 3.2, and the discrimination strength was 2.62; in the second validation cohort, the corresponding values were 0.79, 3.4, and 2.69; in the third validation cohort, they were 0.85, 3.1, and 2.64; and in the fourth validation cohort, they were 0.80, 3.3, and 2.64. This multi-cohort independent calculation revealed the differences in the performance of the test indicator across patient groups, providing a data basis for the subsequent stability assessment.

[0070] For example, the coefficient of variation reflects the dispersion of the discrimination strength values across different cohorts. For the four discrimination strength values of serum cystatin C (2.62, 2.69, 2.64, 2.64), the mean was first calculated to be 2.65, and the standard deviation was calculated to be 0.03. The coefficient of variation was then 0.03 / 2.65 = 0.011, which is much smaller than the preset stability threshold of 0.15, indicating that this test has good cross-cohort stability. Conversely, some inflammatory markers, such as C-reactive protein, had discrimination strength values of 1.8, 3.2, 1.5, and 2.9 in the four cohorts, with a coefficient of variation as high as 0.35, exceeding the threshold. This indicates that the diagnostic value of this indicator is greatly affected by patient population characteristics and that it is not suitable as a stable diagnostic criterion.
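The two worked examples above can be checked with the standard library. Using the sample standard deviation (`statistics.stdev`) is an assumption; it reproduces the 0.35 quoted for C-reactive protein, whereas the population form would give about 0.30.

```python
from statistics import mean, stdev

STABILITY_THRESHOLD = 0.15  # preset stability threshold from the example


def cross_cohort_cv(strengths):
    """Standard deviation divided by mean of one edge's values across cohorts."""
    return stdev(strengths) / mean(strengths)


cohort_values = {
    "serum cystatin C": [2.62, 2.69, 2.64, 2.64],
    "C-reactive protein": [1.8, 3.2, 1.5, 2.9],
}

for item, values in cohort_values.items():
    cv = cross_cohort_cv(values)
    print(f"{item}: CV = {cv:.3f}, stable: {cv < STABILITY_THRESHOLD}")
# serum cystatin C: CV = 0.011, stable: True
# C-reactive protein: CV = 0.352, stable: False
```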

[0071] In one possible implementation, the predetermined stability threshold is determined based on statistical analysis of historical validation data. By collecting data on gold standard tests known for their high diagnostic value, their coefficient of variation distribution in multi-cohort validation is calculated, and the 75th percentile is used as the upper limit of the threshold. The threshold can be adaptively adjusted for different types of diseases; the threshold setting is relatively lenient for acute diseases and more stringent for chronic diseases.

[0072] Preferably, a soft-decision mechanism is used for the retention and removal of highly discriminative edges. When the coefficient of variation is slightly higher than the threshold but does not exceed 1.2 times the threshold, the edge is marked as "conditionally retained" and represented by a dashed line in the graph, indicating that a judgment should be made based on the specific circumstances in clinical use. Only edges with a coefficient of variation significantly exceeding 1.2 times the threshold are completely removed.
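The soft-decision rule amounts to a three-way classification. In the sketch below the function name and labels are illustrative, and treating a coefficient of variation exactly equal to the threshold as "conditionally retained" is an assumption, since the text only describes values below the threshold and values slightly above it.

```python
def classify_edge(cv, threshold=0.15, tolerance=1.2):
    """Classify a high-discrimination edge by its cross-cohort CV."""
    if cv < threshold:
        return "retained"
    if cv <= tolerance * threshold:
        return "conditionally retained"  # drawn as a dashed line in the graph
    return "removed"


# Hypothetical CV values; 0.15 * 1.2 = 0.18 is the upper bound of the soft zone.
for cv in (0.011, 0.17, 0.35):
    print(cv, "->", classify_edge(cv))
# 0.011 -> retained
# 0.17 -> conditionally retained
# 0.35 -> removed
```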

[0073] Understandably, updating the graph topology involves not only adding and deleting edges, but also recalculating node degrees and handling isolated nodes. When all connection edges of a test item sub-node are removed, that node becomes an isolated node and is deleted from the final graph. At the same time, if a disease master node has fewer than three connected test items, an alert is issued, indicating the need to find new diagnostic biomarkers. Furthermore, updating edge attribute information includes adding validation metadata: the number of cohorts in which the edge passed validation, the discrimination strength value in each cohort, the coefficient of variation, and other statistical information. This metadata provides a reliability reference for clinical applications.
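The topology update can be sketched on a plain adjacency dictionary. The graph contents, node labels, and function name below are hypothetical and serve only to show the pruning and alert logic.

```python
def prune_graph(adjacency, removed_edges, min_tests=3):
    """Drop removed edges, then report isolated test items and diseases to flag.

    adjacency: disease -> list of connected test items.
    removed_edges: iterable of (disease, test item) pairs that failed validation.
    """
    removed = set(removed_edges)
    updated = {
        disease: [t for t in tests if (disease, t) not in removed]
        for disease, tests in adjacency.items()
    }
    all_items = {t for tests in adjacency.values() for t in tests}
    connected = {t for tests in updated.values() for t in tests}
    isolated = all_items - connected  # test item nodes with no remaining edges
    alerts = [d for d, tests in updated.items() if len(tests) < min_tests]
    return updated, isolated, alerts


# Hypothetical graph before the update.
graph = {
    "CKD": ["serum creatinine", "blood urea nitrogen", "cystatin C", "CRP"],
    "T2DM": ["HbA1c", "fasting glucose", "CRP"],
}
failed = [("CKD", "CRP"), ("T2DM", "CRP")]

updated, isolated, alerts = prune_graph(graph, failed)
print(updated)
print("isolated nodes to delete:", isolated)             # {'CRP'}
print("diseases with fewer than 3 test items:", alerts)  # ['T2DM']
```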

[0074] For example, when constructing a knowledge graph for chronic kidney disease (CKD) diagnosis, the initial graph contained 25 connections between laboratory tests and CKD. After four-cohort cross-validation, 18 edges with a coefficient of variation less than 0.15 were retained, 4 edges were conditionally retained, and 3 edges were removed. The resulting disease-specific laboratory test knowledge graph is more concise and reliable. The connections between core indicators such as serum creatinine, blood urea nitrogen, and cystatin C have the highest weights and strongest stability, providing clear guidance on test priority for clinical diagnosis.

[0075] In one embodiment, the final knowledge graph also supports visualization, with edge thickness proportional to discriminative strength and edge color intensity proportional to stability, enabling doctors to intuitively identify the most reliable diagnostic tests. This rigorously validated knowledge graph significantly improves the accuracy of disease diagnosis based on test indicators, reduces unnecessary tests, and achieves the goal of precision medicine.

[0076] Obviously, those skilled in the art can make various modifications and variations to the embodiments of this application without departing from the spirit and scope of the embodiments of this application. Therefore, if these modifications and variations to the embodiments of this application fall within the scope of the claims of this application and their equivalents, this application also intends to include these modifications and variations.

Claims

1. A method for constructing a knowledge graph based on big data from clinical laboratory testing, characterized in that it comprises: obtaining laboratory test data from clinical data for a target disease group, other disease groups, and a healthy control group, and determining the central tendency and dispersion of each group for the laboratory tests; selecting a preliminary set of distinguishing items based on the central interval and the dispersion amplitude; calculating the coefficient of variation and stability value of each test item in the preliminary set of distinguishing items to determine a high-stability item set; constructing an initial knowledge graph based on the high-stability item set and its associated data, the initial knowledge graph including disease master nodes, test item sub-nodes, and connecting edges; traversing the connecting edges of the initial knowledge graph and marking connecting edges whose discrimination strength attribute value is higher than a threshold as high-discrimination edges, thereby obtaining an optimized weighted knowledge graph; and evaluating the stability of the discrimination strength attribute values of the high-discrimination edges based on validation cohorts, updating the graph structure, and obtaining the final disease-specific test indicator knowledge graph.

2. The knowledge graph construction method based on clinical laboratory big data as described in claim 1, characterized in that obtaining laboratory test data from clinical data for the target disease group, other disease groups, and healthy control group, and determining the central tendency and dispersion of each group for the laboratory tests, includes: extracting historical test data of the target disease group from clinical test information, screening samples of the other disease groups by diagnostic code, and stratifying the data of the healthy control group by age group; aligning the original values of the test items across the three sets of data and removing outliers to obtain a standardized dataset; for the standardized dataset, calculating the interquartile range of each group as the central interval and the standard deviation as the dispersion; and, based on the central interval and dispersion, calculating the difference between the medians of the central intervals of the target disease group and the other disease groups, wherein if the difference is greater than the sum of the standard deviations of the two groups, a significant difference between the groups is determined and the test items meeting this condition are recorded.
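
By way of illustration only, the central interval, dispersion, and between-group comparison of claim 2 may be sketched as follows. The function names are hypothetical. The sketch assumes the sample standard deviation, the default quartile method of Python's `statistics` module, and an absolute difference between medians; the claim does not fix any of these choices.

```python
import statistics


def summarize(values):
    """Return the central interval (Q1, Q3), median, and standard deviation."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return {
        "central_interval": (q1, q3),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD: an assumption of this sketch
    }


def significantly_different(target_values, other_values):
    """True when the median gap exceeds the sum of the two groups' SDs."""
    target, other = summarize(target_values), summarize(other_values)
    gap = abs(target["median"] - other["median"])  # absolute value is assumed
    return gap > target["sd"] + other["sd"]
```

Outlier removal and alignment of raw values are assumed to have been carried out before these functions are called.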

3. The knowledge graph construction method based on clinical laboratory big data as described in claim 1, characterized in that selecting the preliminary set of distinguishing items based on the central interval and the dispersion amplitude includes: obtaining the upper and lower bounds of the central interval for the target disease group, and recording the interval bounds of the corresponding test items for the other disease groups; obtaining the interval difference value by calculating the straight-line distance between the center points of the two intervals, taken as the square root of the sum of the squares of the upper-bound difference and the lower-bound difference; comparing the interval difference value with a preset threshold and, if the interval difference value exceeds the threshold, marking the test item as a candidate distinguishing item; extracting the dispersion of the candidate distinguishing items in each group and, if the dispersion of the target disease group is smaller than that of the other disease groups, recording the name of the test item and the interval difference value; and summarizing the items that meet these conditions to obtain the preliminary set of distinguishing items.
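
One possible reading of the interval difference value of claim 3 is sketched below. It interprets the value as the Euclidean distance between the two (lower bound, upper bound) pairs; this interpretation, the function names, and the dictionary keys are assumptions made for illustration.

```python
import math


def interval_difference(target_interval, other_interval):
    """Square root of the summed squares of lower- and upper-bound differences."""
    (t_lo, t_hi), (o_lo, o_hi) = target_interval, other_interval
    return math.hypot(t_lo - o_lo, t_hi - o_hi)


def preliminary_items(items, threshold):
    """Select test items whose interval difference exceeds the threshold and
    whose dispersion is smaller in the target disease group.

    items: {name: {"target_interval", "other_interval", "target_sd", "other_sd"}}
    Returns {name: interval_difference} for the selected items.
    """
    selected = {}
    for name, stats in items.items():
        diff = interval_difference(stats["target_interval"], stats["other_interval"])
        if diff > threshold and stats["target_sd"] < stats["other_sd"]:
            selected[name] = diff
    return selected
```

For instance, intervals (3, 4) and (0, 0) give an interval difference value of 5, since the square root of 3² + 4² is 5.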

4. The knowledge graph construction method based on clinical laboratory big data as described in claim 1, characterized in that calculating the coefficient of variation and stability value of each test item in the preliminary set of distinguishing items and determining the high-stability item set includes: for each item in the preliminary set of distinguishing items, calculating the ratio of the standard deviation to the mean in the target disease group to obtain the coefficient of variation; calculating the coefficient of variation of the corresponding item in the other disease groups, and dividing the coefficient of variation of the other disease groups by that of the target disease group to obtain the stability ratio; if the stability ratio is greater than a preset stability threshold, marking the item as a high-stability item; and, for the high-stability items, retrieving relevant literature from a literature database, extracting diagnostic basis texts and clinical application instructions, constructing structured association data, and forming the high-stability item set.
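
The coefficient of variation and stability ratio of claim 4 may be illustrated as follows. The function names are hypothetical, and the use of the sample standard deviation is an assumption. The sketch does not guard against a zero mean or a zero coefficient of variation in the target group, either of which would raise a division error.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def stability_ratio(target_values, other_values):
    """CV of the other disease groups divided by CV of the target disease group."""
    return coefficient_of_variation(other_values) / coefficient_of_variation(target_values)


def is_high_stability(target_values, other_values, stability_threshold):
    """True when the stability ratio exceeds the preset stability threshold."""
    return stability_ratio(target_values, other_values) > stability_threshold
```

A ratio above 1 means the item varies less within the target disease group than within the other disease groups, which is why a larger ratio is treated as more stable.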

5. The knowledge graph construction method based on clinical laboratory big data as described in claim 1, characterized in that constructing the initial knowledge graph based on the high-stability item set and its associated data includes: extracting the target disease name as the disease master node identifier and the test item name as the sub-node identifier, and establishing a unique node index through coding; for each combination of disease and test item, obtaining the intragroup stability value and the interval difference value of the test item in the target disease group; calculating the discrimination strength value by multiplying the intragroup stability value by the interval difference value; if the discrimination strength value is greater than a preset threshold, creating a connecting edge between the disease master node and the test item sub-node, the connecting edge carrying the discrimination strength value as its edge weight; and combining the connecting edges and nodes using a structured storage method to form the initial knowledge graph.
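
A minimal sketch of the graph construction of claim 5 is given below, using plain dictionaries as the structured storage. The function name, the node identifier scheme (`D0`, `T0`, `T1`, ...), and the storage layout are assumptions for illustration; the intragroup stability value and interval difference value are taken as already computed inputs.

```python
def build_initial_graph(disease, items, strength_threshold):
    """Build nodes and weighted edges for one disease.

    items: {test_name: (intragroup_stability, interval_difference)}
    An edge is created only when stability * interval_difference
    exceeds strength_threshold.
    """
    nodes = {disease: {"id": "D0", "type": "disease"}}
    edges = []
    for index, (test, (stability, interval_diff)) in enumerate(sorted(items.items())):
        strength = stability * interval_diff  # discrimination strength value
        if strength > strength_threshold:
            nodes[test] = {"id": f"T{index}", "type": "test_item"}
            edges.append({"source": disease, "target": test, "weight": strength})
    return {"nodes": nodes, "edges": edges}
```

In this sketch a test item that does not reach the threshold receives neither an edge nor a node, so the initial graph contains no isolated sub-nodes.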

6. The knowledge graph construction method based on clinical laboratory big data as described in claim 1, characterized in that traversing the connecting edges of the initial knowledge graph and marking connecting edges whose discrimination strength attribute value is higher than a threshold as high-discrimination edges, thereby obtaining the optimized weighted knowledge graph, includes: traversing all connecting edges in the initial knowledge graph and extracting the discrimination strength attribute value of each edge; obtaining the average discrimination strength by summing the discrimination strength attribute values and dividing by the total number of connecting edges; and comparing the discrimination strength attribute value of each connecting edge with the average discrimination strength one by one, wherein a connecting edge whose attribute value is greater than the average discrimination strength is marked as a high-discrimination edge, thereby obtaining the optimized weighted knowledge graph.
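
The mean-based marking of claim 6 may be sketched as follows. The edge representation (a list of dictionaries with a `weight` key) and the `high_discrimination` flag are assumptions of this illustration.

```python
def mark_high_discrimination(edges):
    """Flag every edge whose weight is above the mean weight of all edges.

    edges: list of dicts, each with a numeric "weight" (discrimination strength).
    Adds a boolean "high_discrimination" key in place and returns the mean.
    """
    if not edges:
        return 0.0
    mean_strength = sum(edge["weight"] for edge in edges) / len(edges)
    for edge in edges:
        edge["high_discrimination"] = edge["weight"] > mean_strength
    return mean_strength
```

Because the threshold is the mean of the edges themselves, at least one edge falls at or below it whenever the weights are not all equal, so this step always narrows a non-uniform graph.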

7. The knowledge graph construction method based on clinical laboratory big data as described in claim 1, characterized in that evaluating the stability of the discrimination strength attribute values of the high-discrimination edges based on the validation cohorts and updating the graph structure includes: grouping the clinical data by time series using cross-validation, and forming multiple validation cohorts by random sampling; for each validation cohort, calculating the intragroup stability value and interval difference value of the test item corresponding to each high-discrimination edge, and multiplying the two to obtain the discrimination strength attribute value in that cohort; based on the discrimination strength attribute values, calculating the coefficient of variation of the same connecting edge across the different cohorts, wherein if the coefficient of variation is less than a preset stability threshold, the edge is determined to have cross-cohort stability; and, based on the stability determination results, retaining the high-discrimination edges that meet the condition, removing the edges that do not, and updating the graph structure.
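
The cross-cohort stability check of claim 7 may be illustrated as follows. The sketch forms cohorts by a seeded random partition only and omits the time-series grouping; that simplification, the function names, and the use of the sample standard deviation are assumptions.

```python
import random
import statistics


def make_cohorts(records, k, seed=0):
    """Randomly partition records into k validation cohorts (seeded for repeatability)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_cohort_cv(strengths):
    """Coefficient of variation of one edge's discrimination strength across cohorts."""
    return statistics.stdev(strengths) / statistics.mean(strengths)


def stable_edges(per_cohort_strengths, cv_threshold):
    """Map each edge to True when its cross-cohort CV is below the threshold.

    per_cohort_strengths: {edge: [strength in cohort 1, cohort 2, ...]}
    """
    return {
        edge: cross_cohort_cv(values) < cv_threshold
        for edge, values in per_cohort_strengths.items()
    }
```

Edges mapped to `False` would then be removed (or conditionally retained, where a soft-decision band is used) when the graph structure is updated.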

8. The knowledge graph construction method based on clinical laboratory big data as described in claim 1, characterized in that obtaining the final disease-specific test indicator knowledge graph includes: based on the updated graph structure, preserving the topological relationships among the disease master nodes, test item sub-nodes, and high-discrimination edges; for each high-discrimination edge, recording its discrimination strength attribute value and its cross-cohort stability determination result; forming, from the topological relationships and attribute values, a structured knowledge base containing the associations between diseases and test items; and, based on the structured knowledge base, extracting the diagnostic basis text and clinical application instructions for each test item, establishing a mapping relationship with the disease master node, and generating the final disease-specific test indicator knowledge graph.