An environment health knowledge graph system and method based on a large language model

An environmental health knowledge graph system based on a large language model addresses the problems of dispersed knowledge, rapid updates, and complex relationships in environmental health knowledge management. It enables efficient and accurate knowledge acquisition and personalized decision support, thereby improving the scientific rigor and reliability of environmental health risk assessment and management.

CN122245834A · Pending · Publication Date: 2026-06-19 · HUBEI PROVINCIAL ACADEMY OF ECO-ENVIRONMENTAL SCIENCES (PROVINCIAL ECOLOGICAL ENVIRONMENT ENGINEERING ASSESSMENT CENTER)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUBEI PROVINCIAL ACADEMY OF ECO-ENVIRONMENTAL SCIENCES(PROVINCIAL ECOLOGICAL ENVIRONMENT ENGINEERING ASSESSMENT CENTER)
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for environmental health knowledge management suffer from problems such as knowledge fragmentation, rapid updates, complex relationships, difficulties in interdisciplinary integration, insufficient knowledge representation capabilities, limited reasoning capabilities, and difficulties in dynamic updates, resulting in insufficient scientific rigor and reliability in risk assessment and decision support.

Method used

An environmental health knowledge graph system based on a large language model is adopted, including a knowledge mining module, a construction module, a reasoning engine module, and a dynamic update module. Through domain-adaptive pre-training and multi-dimensional quality assessment, a multi-level knowledge graph is constructed, integrating rule-based reasoning, case-based reasoning, and statistical reasoning to achieve complex reasoning and personalized decision support.

Benefits of technology

It improves the efficiency and accuracy of environmental health knowledge acquisition, supports personalized risk assessment and decision support, ensures the timeliness and consistency of knowledge, and enhances the scientific nature and reliability of environmental health management.

✦ Generated by Eureka AI based on patent content.

Figure CN122245834A_ABST

Abstract

This invention discloses an environmental health knowledge graph system and method based on a large language model, belonging to the fields of artificial intelligence and environmental health technology. In the knowledge graph construction and reasoning process, this invention introduces quality assessment results and evidence level information, mapping them as weight parameters in the graph structure to support risk assessment and decision analysis in the environmental health field. The system includes: a knowledge mining module, which automatically extracts high-quality knowledge triples from environmental health literature based on a domain-adaptive large language model; a knowledge graph construction module, which constructs a structured knowledge graph through ontology model design and multi-level entity linking strategies; a knowledge reasoning engine module, which uses a hybrid strategy combining rule-based, case-based, and statistical reasoning for intelligent reasoning; a knowledge application service module, which provides risk assessment, causal explanation, intelligent question answering, and decision support services; and a knowledge base dynamic update and maintenance module, which enables continuous knowledge updates and quality maintenance. This invention achieves fully automated processing from literature to knowledge to application, improving knowledge management and decision support capabilities in the environmental health field.

Description

Technical Field

[0001] This invention relates to the interdisciplinary field of environmental health and artificial intelligence, specifically to an environmental health knowledge graph system and method based on a large language model.

Background Technology

[0002] Environmental health, as an interdisciplinary field, studies the impacts and mechanisms of environmental factors on human health, involving multiple areas such as environmental science, toxicology, epidemiology, and public health. With the acceleration of industrialization and urbanization, environmental pollution has become increasingly prominent, and health risks caused by environmental factors have become a major public health concern worldwide. According to the World Health Organization (WHO, Preventing disease through healthy environments, 2016 report), an estimated 13 million people worldwide die annually from preventable environmental factors, accounting for 24% of all deaths globally. Therefore, accurately assessing and effectively managing environmental health risks is of great significance for protecting public health and promoting sustainable development.

[0003] Environmental health knowledge is highly specialized, complex, and dynamic, and its management faces numerous challenges:

1. Knowledge is fragmented and rapidly updated. Environmental health knowledge is scattered across a vast amount of scientific literature, research reports, monitoring data, and professional databases, primarily in unstructured text format. Statistics show that the number of environmental health papers published by Chinese scholars has surged from approximately 2,000 in 2010 to approximately 20,000 after 2020, increasing their global share from 10% to 30% (Tong Zhu et al., 2024). Traditional manual knowledge extraction methods struggle to cope with such a massive volume of knowledge updates, resulting in a significant lag in knowledge acquisition. In particular, cutting-edge knowledge such as emerging pollutants and newly discovered toxicological mechanisms is difficult to incorporate into the environmental health management system in a timely manner, affecting the scientific rigor of risk assessments and decision-making.

2. Knowledge relationships are complex and uncertain. Environmental health issues typically involve complex causal relationships spanning multiple factors, pathways, and endpoints. For example, PM2.5 can affect multiple systems, including the cardiovascular, respiratory, and immune systems, through various mechanisms such as oxidative stress, inflammatory responses, and autonomic nervous system regulation. Traditional knowledge representation methods struggle to express such complex network relationships and cannot effectively handle the conditionality, timeliness, and uncertainty of knowledge. Furthermore, different studies may reach inconsistent or even contradictory conclusions regarding the association between the same environmental exposure and health effects, and effective mechanisms for handling such knowledge conflicts are lacking.

3. Interdisciplinary knowledge is difficult to integrate. Environmental health research involves multiple disciplines such as environmental science, toxicology, epidemiology, and clinical medicine, each forming a relatively independent knowledge system. For example, research on the environmental behavior of pollutants and research on their health effects often belong to different disciplines and lack organic integration; the exposure-effect relationship of the same pollutant may differ between studies, making it difficult to form a unified knowledge framework. These disciplinary barriers leave environmental health knowledge fragmented and hinder its comprehensive application.

[0004] Existing environmental health knowledge management technologies have the following limitations:

1. Shortcomings of traditional knowledge base technology. Existing environmental health knowledge bases mostly take the form of documents, tables, or simple databases, and have the following limitations:
(1) Limited knowledge representation ability: They struggle to express mechanistic knowledge that includes intermediate links, such as "pollutants cause health effects through specific mechanisms", and cannot represent the conditionality, timeliness, and uncertainty of knowledge.
(2) Weak knowledge association ability: They cannot express complex relationships between pieces of knowledge and struggle to support the organic integration of interdisciplinary knowledge.
(3) Difficulty in updating and maintenance: They rely mainly on manual updates, which are inefficient and cannot keep up with the rapid growth and dynamic change of knowledge.
(4) Insufficient reasoning ability: Most are static knowledge bases that cannot perform complex reasoning over knowledge and struggle to answer questions that require cross-domain knowledge derivation.

2. Limitations of large language models in professional knowledge domains. Although large language models (LLMs) have made breakthrough progress in natural language processing, their application in professional fields such as environmental health still faces challenges:
(1) Insufficient domain knowledge: General large language models lack understanding of environmental health terminology, conceptual relationships, and domain rules, making it difficult to process professional literature accurately.
(2) Limited accuracy of knowledge extraction: When dealing with complex professional relationships (such as dose-response relationships and mechanism pathways), extraction results are prone to errors or omissions.
(3) Limited reasoning ability: The models lack the ability to understand and apply domain logic rules, making it difficult to conduct complex reasoning that conforms to professional logic.
(4) Timeliness issues: Training data has a cutoff date, so the latest research results are unavailable to the model, and effective knowledge update mechanisms are lacking.
(5) Difficulty in ensuring knowledge quality: There is no reliability assessment mechanism for extracted knowledge, making it difficult to distinguish high-quality from low-quality evidence, and no ability to detect and handle knowledge conflicts.

3. Insufficient application of knowledge graph technology in the field of environmental health.

[0005] While knowledge graph technology provides an effective method for the structured representation of knowledge in complex domains, its application in the field of environmental health still has shortcomings: (1) Insufficient domain adaptability: General knowledge graph construction methods struggle to handle the professional characteristics and complex relationships of the environmental health field. (2) Difficulty in knowledge acquisition: The specialized and dispersed nature of environmental health knowledge makes knowledge acquisition costly and large-scale knowledge graphs difficult to construct. (3) Limited reasoning ability: Existing knowledge graph reasoning is mostly based on simple path search, which cannot meet the complex reasoning needs of the environmental health field. (4) Difficulty in dynamic updating: Effective knowledge updating mechanisms are lacking, making it difficult to keep knowledge graphs current. (5) Insufficient integration of standard knowledge: The authoritative knowledge in national environmental health standards has not been effectively integrated, which limits the authority and practicality of knowledge graphs.

[0006] Therefore, environmental health risk assessment and decision support have the following technical requirements: 1. Scientific requirements of risk assessment. Environmental health risk assessment requires the integration of environmental monitoring data, health data, and professional knowledge, but existing methods have the following problems: (1) Insufficient knowledge guidance: Purely data-driven risk assessment may yield statistical correlations but lacks mechanistic explanations; purely knowledge-driven reasoning may lack empirical support under specific environmental conditions. (2) Insufficient handling of uncertainty: It is difficult to scientifically represent and handle the uncertainty in risk assessment, which affects the reliability of decision-making. (3) Limited personalization capabilities: It is difficult to conduct personalized risk assessments based on the characteristics of different populations and environmental conditions.

[0007] 2. The need for intelligent decision support

[0008] Environmental health management decisions require comprehensive consideration of scientific, economic, and social factors. However, existing decision support systems have the following limitations: (1) Insufficient knowledge integration capability: They struggle to integrate multi-source heterogeneous knowledge to provide comprehensive decision support. (2) Limited reasoning capability: They lack intelligent reasoning ability for complex decision problems and struggle to provide scientifically grounded decision suggestions. (3) Insufficient adaptability: They struggle to provide personalized decision support for different decision scenarios and user needs.

[0009] 3. Knowledge quality assurance needs. Environmental health knowledge comes from diverse sources and varies in quality, requiring a systematic quality assessment mechanism: (1) Evidence grading needs: The strength of evidence provided by studies with different research designs and methodological qualities varies significantly, requiring the establishment of a scientific evidence grading system. (2) Conflict resolution needs: Different studies may reach contradictory conclusions on the same environment-health relationship, requiring a systematic conflict detection and resolution mechanism. (3) Causality determination needs: Statistical association is not equivalent to causation, requiring a systematic assessment of causality based on professional standards (such as the Bradford Hill criteria).

[0010] With the rapid development of artificial intelligence, its application in the field of environmental health shows the following trends: (1) From general models to domain-specific models: developing specialized AI models to meet the needs of specific domains; (2) From single-modal to multi-modal fusion: integrating multiple information sources such as text, images, and data to provide more comprehensive analysis capabilities; (3) From static models to dynamic updates: establishing continuous learning mechanisms so that models keep pace with the development of the field; (4) From black-box models to interpretable models: improving model interpretability to meet the credibility requirements of professional fields.

[0011] Based on the above analysis, the field of environmental health knowledge management urgently needs the following technological innovations: (1) Automated knowledge acquisition technology: It can automatically and accurately extract structured knowledge from massive professional documents and evaluate and screen the quality of the extracted knowledge in multiple dimensions.

[0012] (2) Specialized knowledge representation method: a knowledge representation framework that can express complex relationships in the field of environmental health and effectively integrate authoritative knowledge in national standards.

[0013] (3) Intelligent knowledge reasoning technology: capable of complex reasoning and problem solving that conforms to professional logic, including the discovery, verification and strength quantification of causal relationships, as well as the systematic processing of reasoning uncertainty.

[0014] (4) Personalized knowledge application technology: capable of providing customized knowledge services according to different user needs.

[0015] (5) Dynamic knowledge update mechanism: It can keep the knowledge base updated in sync with the development of the domain and ensure the logical consistency of knowledge during the update process.

[0016] In summary, existing technologies have significant shortcomings in environmental health knowledge management, risk assessment, and decision support, necessitating a comprehensive solution capable of automatic knowledge acquisition, intelligent reasoning and analysis, and dynamic updating and maintenance. This invention presents an innovative solution to address these technological challenges.

Summary of the Invention

[0017] The purpose of this invention is to provide an environmental health knowledge graph system and method based on a large language model, in order to solve the problem that the existing technologies mentioned in the background art have obvious deficiencies in environmental health knowledge management, risk assessment and decision support, and lack a comprehensive solution that can automatically acquire knowledge, systematically evaluate knowledge quality, intelligently reason and analyze, and dynamically update and maintain it.

[0018] To achieve the above objectives, the present invention provides the following technical solution: An environmental health knowledge graph system based on a large language model, which includes: a knowledge mining module driven by a large language model, an environmental health knowledge graph construction module, a knowledge reasoning engine module, a knowledge application service module, and a knowledge base dynamic update and maintenance module.

The knowledge mining module uses a large language model for the environmental health domain that has undergone domain-adaptive pre-training and knowledge extraction instruction fine-tuning. It employs a two-stage extraction strategy that combines high-precision targeted extraction with broad-coverage general extraction to automatically extract structured knowledge triples in the field of environmental health from scientific literature. The module then filters these triples through a multi-dimensional quality assessment that includes five dimensions: extraction credibility, content consistency, source reliability, research quality, and evidence level.

The knowledge graph construction module is based on a domain ontology model that includes multiple core entity categories such as environmental pollutants, environmental media, exposure pathways, biomarkers, health effects, mechanisms of action, and intervention measures, as well as multiple standard relation types. It adopts a four-level entity linking strategy of precise matching, fuzzy matching, semantic matching, and context matching to organize the extracted knowledge into a multi-level knowledge graph.

The knowledge reasoning engine module integrates three strategies: rule-based reasoning, case-based reasoning, and statistical reasoning. Through the strategy fusion unit, it dynamically selects the combination of reasoning strategies and corresponding weights based on query characteristics, and performs complex reasoning and question answering over the knowledge graph.

The knowledge application service module provides environmental health risk assessment, causal mechanism explanation, intelligent literature review and evidence synthesis, intelligent environmental health question answering, and multi-criteria decision support services based on the knowledge graph and reasoning results.

The knowledge base dynamic update and maintenance module ensures the timeliness and consistency of knowledge through knowledge change monitoring, logical consistency and statistical consistency verification, knowledge conflict detection and resolution, incremental update, and version control mechanisms.

[0019] Furthermore, the knowledge mining module driven by the large language model includes: a document intelligent acquisition and preprocessing submodule, a large language model submodule for the environmental health domain, a knowledge triple extraction submodule, and a knowledge quality assessment submodule. The environmental health domain large language model submodule is built on a Transformer-based large language model through domain-adaptive pre-training and knowledge extraction instruction fine-tuning. The domain-adaptive pre-training uses a professional corpus containing literature from environmental science, toxicology, epidemiology, and public health to continue pre-training the base model. Pre-training tasks include masked language modeling, document-level prediction, and knowledge relationship prediction. The knowledge extraction instruction fine-tuning combines supervised fine-tuning with reinforcement learning from human feedback (RLHF). The supervised fine-tuning stage uses prompt templates designed around a chain of thought that decomposes the extraction steps, while the RLHF stage uses the proximal policy optimization (PPO) algorithm. Fine-tuning uses low-rank adaptation (LoRA) to update only the query and value matrices of the attention layers.
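For orientation, the low-rank update applied to the query and value matrices can be written in the standard LoRA form. This is the commonly published formulation, not a formula taken from this patent; the symbols $W_q$, $W_v$, $A$, $B$, $r$, and $\alpha$ are assumptions introduced here for illustration.

```latex
W_q' = W_q + \frac{\alpha}{r}\, B_q A_q, \qquad
W_v' = W_v + \frac{\alpha}{r}\, B_v A_v, \qquad
B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Under this formulation the pre-trained weights $W_q$ and $W_v$ stay frozen and only the small factors $A$ and $B$ are trained, which is why restricting the update to two matrices per attention layer keeps the number of trainable parameters low.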

[0020] The knowledge triplet extraction submodule uses a two-stage strategy based on a large language model to extract five types of core knowledge: the first stage is high-precision targeted extraction, which uses professional prompt templates to target knowledge of pollutant characteristics, exposure-effect relationships, biomarker relationships, mechanism-pathway relationships, and intervention effect relationships; the second stage is broad-coverage general extraction, which uses open prompts to capture knowledge not covered in the first stage.

[0021] Furthermore, the knowledge quality assessment submodule includes a multi-dimensional quality assessment unit, an evidence level assessment unit, and a knowledge conflict detection and processing unit. The multi-dimensional quality assessment unit evaluates the quality of each knowledge triple along five dimensions: extraction credibility, content consistency, source reliability, research quality, and evidence level, and performs screening based on a comprehensive quality score:

$$Q = \sum_{i=1}^{5} w_i \, q_i$$

[0022] where $q_i$ is the score of the $i$-th dimension and $w_1$ to $w_5$ are the weight coefficients for each dimension, satisfying $\sum_{i=1}^{5} w_i = 1$. The evidence level assessment unit uses a modified GRADE method to classify evidence into six levels, from I to VI. The knowledge conflict detection and processing unit identifies four types of conflicts: direct contradictions, numerical inconsistencies, conditional conflicts, and temporal evolution conflicts.
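The scoring and screening step can be illustrated with a minimal sketch. It assumes the comprehensive score is a normalized weighted sum of the five dimension scores, each in [0, 1]. The weight values, the 0.6 threshold, and all identifiers below are hypothetical and are not stated in the source.

```python
# Hypothetical dimension names mirroring the five dimensions named in the text.
DIMENSIONS = (
    "extraction_credibility",
    "content_consistency",
    "source_reliability",
    "research_quality",
    "evidence_level",
)

# Illustrative weights only; the source does not state the actual values.
DEFAULT_WEIGHTS = {
    "extraction_credibility": 0.25,
    "content_consistency": 0.20,
    "source_reliability": 0.20,
    "research_quality": 0.20,
    "evidence_level": 0.15,
}


def quality_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


def screen(scored_triples, threshold=0.6):
    """Keep the triples whose composite score meets the (assumed) threshold."""
    return [t for t, s in scored_triples if quality_score(s) >= threshold]
```

A triple scoring 0.9 on every dimension passes the screen, while one scoring 0.3 on every dimension is dropped.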

[0023] Furthermore, the environmental health knowledge graph construction module includes: an ontology model design submodule, a graph construction and fusion submodule, a knowledge graph representation and storage submodule, and a knowledge graph visualization and exploration submodule. The ontology model design submodule combines top-down and bottom-up approaches to construct the environmental health domain ontology, which includes the following core entity categories: environmental pollutants, environmental media, exposure pathways, biomarkers, health effects, mechanisms of action, and intervention measures. The ontology model also defines a variety of standard relation types, which are organized in a three-layer structure. The top layer is divided into four categories: causal relations, association relations, compositional relations, and functional relations.

[0024] The graph construction and fusion submodule employs a four-level entity linking strategy—precise matching, fuzzy matching, semantic matching, and contextual matching—to link entities in knowledge triples to standard entities in the ontology model. It also includes a standard knowledge integration unit, which integrates national environmental health standard knowledge into the knowledge graph using an authority-first principle.
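The four-level linking cascade can be sketched as a sequence of increasingly permissive matchers, where each level is tried only if the previous one fails. The toy ontology, the 3-dimensional embeddings, the keyword sets, and both cutoff values are assumptions for illustration; a real system would use a trained encoder and the full domain ontology.

```python
import difflib
import math

# Hypothetical standard entities: name -> (toy embedding, context keywords).
ONTOLOGY = {
    "fine particulate matter": ((0.9, 0.1, 0.0), {"air", "particle", "inhalation"}),
    "lead": ((0.0, 0.9, 0.1), {"metal", "blood", "paint"}),
    "benzene": ((0.1, 0.0, 0.9), {"solvent", "leukemia", "voc"}),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_entity(mention, embedding=None, context=frozenset(),
                fuzzy_cutoff=0.85, semantic_cutoff=0.8):
    """Return (standard_entity, level), or (None, None) if no level matches."""
    key = mention.strip().lower()
    # Level 1: precise (exact) string match.
    if key in ONTOLOGY:
        return key, "precise"
    # Level 2: fuzzy match on the surface form.
    close = difflib.get_close_matches(key, list(ONTOLOGY), n=1, cutoff=fuzzy_cutoff)
    if close:
        return close[0], "fuzzy"
    # Level 3: semantic match on embedding similarity.
    if embedding is not None:
        best = max(ONTOLOGY, key=lambda name: cosine(embedding, ONTOLOGY[name][0]))
        if cosine(embedding, ONTOLOGY[best][0]) >= semantic_cutoff:
            return best, "semantic"
    # Level 4: context match on keyword overlap with the surrounding text.
    if context:
        best = max(ONTOLOGY, key=lambda name: len(context & ONTOLOGY[name][1]))
        if context & ONTOLOGY[best][1]:
            return best, "context"
    return None, None
```

Ordering the levels from strictest to loosest means a cheap exact match short-circuits the more expensive and more error-prone semantic and contextual comparisons.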

[0025] Furthermore, the knowledge reasoning engine module includes: a multimodal knowledge representation submodule, a hybrid reasoning strategy submodule, a causal chain discovery and verification submodule, and an uncertainty reasoning and representation submodule. The multimodal knowledge representation submodule integrates three knowledge graph embedding algorithms (TransE, RotatE, and ComplEx) to map entities and relations to a low-dimensional vector space. The hybrid reasoning strategy submodule integrates rule-based reasoning, case-based reasoning, and statistical reasoning methods. The strategy fusion unit dynamically selects the combination of reasoning strategies and corresponding weights based on query characteristics. The causal chain discovery and verification submodule uses an improved bidirectional breadth-first search algorithm to search for causal paths and performs quantitative evaluation of causal relationships based on the Bradford Hill criteria. The uncertainty reasoning and representation submodule integrates multi-source evidence based on Dempster-Shafer theory and propagates uncertainty in the reasoning process through Monte Carlo simulation and interval analysis.
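The multi-source evidence fusion step can be illustrated with Dempster's rule of combination, the core operation of Dempster-Shafer theory. This is a minimal sketch of the textbook rule. The hypothesis labels and mass values in the usage note are made up, and the source does not specify how mass functions are derived from studies.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass; masses sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalize by the non-conflicting mass.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}
```

As a usage example, take the frame {causal, not_causal}. Study A assigns 0.6 to {causal} and 0.4 to the whole frame. Study B assigns 0.7 to {causal}, 0.1 to {not_causal}, and 0.2 to the whole frame. The conflicting mass is 0.06, and the combined belief in {causal} is 0.82 / 0.94.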

[0026] Furthermore, the knowledge application service module includes: a knowledge-assisted risk assessment submodule, a knowledge-driven causal explanation submodule, an intelligent literature review and evidence synthesis submodule, an environmental health intelligent question-and-answer submodule, and a knowledge-driven decision support submodule.

[0027] The knowledge-assisted risk assessment submodule extracts exposure parameters and dose-response relationships from the knowledge graph to achieve comprehensive risk assessment of the three major exposure pathways: respiratory exposure, ingestion exposure, and skin contact. The knowledge-driven causal explanation submodule retrieves causal chains from the knowledge graph to generate multi-level mechanism explanations. The knowledge-driven decision support submodule uses a multi-criteria decision analysis framework to comprehensively evaluate and rank candidate intervention programs.
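A multi-pathway assessment of this kind can be sketched with the screening-level hazard quotient approach, in which each pathway's average daily dose is divided by a reference dose and the quotients are summed into a hazard index. This is one common formulation and an assumption here, since the source does not give its exposure equations. The generic `contact_rate` field folds in pathway-specific terms (for example skin surface area and absorption for dermal contact), and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pathway:
    name: str
    concentration: float   # contaminant amount per unit of medium (mg per unit)
    contact_rate: float    # units of medium contacted per day
    reference_dose: float  # pathway-specific reference dose, mg/(kg*day)


def average_daily_dose(p, body_weight_kg, exposure_days_per_year, exposure_years):
    """ADD = C * CR * EF * ED / (BW * AT), with AT = ED * 365 days (non-cancer)."""
    averaging_time_days = exposure_years * 365.0
    return (p.concentration * p.contact_rate * exposure_days_per_year * exposure_years
            / (body_weight_kg * averaging_time_days))


def hazard_index(pathways, body_weight_kg=70.0, exposure_days_per_year=350.0,
                 exposure_years=30.0):
    """Return (hazard index, per-pathway hazard quotients), with HQ = ADD / RfD."""
    quotients = {
        p.name: average_daily_dose(p, body_weight_kg, exposure_days_per_year,
                                   exposure_years) / p.reference_dose
        for p in pathways
    }
    return sum(quotients.values()), quotients
```

In such a design the knowledge graph would supply the reference doses and default exposure parameters, while body weight and exposure duration could be adjusted per population group to personalize the result.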

[0028] Furthermore, the knowledge base dynamic update and maintenance module includes: a knowledge change monitoring and acquisition submodule, a knowledge consistency verification submodule, an expert collaborative knowledge verification submodule, and a knowledge graph incremental update submodule.

[0029] The knowledge consistency verification submodule includes a logical consistency check unit and a knowledge conflict detection unit, which identify four types of conflicts: direct contradictions, numerical inconsistencies, conditional conflicts, and temporal evolution conflicts. The knowledge graph incremental update submodule defines six basic change operations, uses a transaction mechanism to ensure atomicity and consistency, and uses semantic version numbers to manage version history.
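The combination of transactional updates and semantic versioning can be sketched with a toy in-memory store. The six basic change operations are not enumerated in this passage, so the sketch uses only two hypothetical operations, `add` and `delete`. The class name, the version-bump policy, and the history format are likewise assumptions.

```python
class GraphStore:
    """Toy in-memory triple store with atomic batches and a semantic version."""

    def __init__(self):
        self.triples = set()
        self.version = (1, 0, 0)  # (major, minor, patch)
        self.history = []         # list of (version, snapshot) pairs

    def _bump(self, part):
        major, minor, patch = self.version
        if part == "major":
            return (major + 1, 0, 0)
        if part == "minor":
            return (major, minor + 1, 0)
        if part == "patch":
            return (major, minor, patch + 1)
        raise ValueError(f"unknown version part: {part}")

    def apply_batch(self, operations, bump="patch"):
        """Apply all operations or none; bump the version only on success."""
        new_version = self._bump(bump)   # validate before touching any state
        working = set(self.triples)      # stage changes on a copy
        for op, triple in operations:
            if op == "add":
                working.add(triple)
            elif op == "delete":
                if triple not in working:
                    raise KeyError(f"cannot delete missing triple: {triple}")
                working.remove(triple)
            else:
                raise ValueError(f"unknown operation: {op}")
        # Commit point: reached only if every operation succeeded.
        self.history.append((self.version, frozenset(self.triples)))
        self.triples = working
        self.version = new_version
        return self.version
```

Because changes are staged on a copy, a failure partway through a batch leaves both the stored triples and the version number untouched, which is the atomicity property the text describes.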

[0030] An environmental health knowledge graph method based on a large language model includes the following steps:

Step S1: Using a large language model for the environmental health domain that has undergone domain-adaptive pre-training and knowledge extraction instruction fine-tuning, adopt a two-stage strategy combining high-precision targeted extraction and broad-coverage general extraction to automatically extract structured knowledge triples from environmental health literature, and perform multi-dimensional quality assessment and screening.

Step S2: Construct a domain ontology model containing multiple core entities and various standard relationship types, and use a four-level entity linking strategy to organize the extracted knowledge into a multi-level knowledge graph for the environmental health domain.

Step S3: Map entities and relationships in the knowledge graph to a low-dimensional vector space, dynamically select a combination of reasoning strategies based on query features, and perform hybrid reasoning and causal discovery over the knowledge graph.

Step S4: Execute knowledge application services based on the knowledge graph and inference results, specifically including: conducting environmental health risk assessment based on multi-pathway exposure models and dose-response relationships; retrieving and constructing causal evidence chains to generate multi-level causal mechanism explanations; conducting intelligent literature review and evidence synthesis according to the PECO framework; providing intelligent environmental health question answering through intent recognition, knowledge retrieval, and inference loops; and generating decision support schemes using the analytic hierarchy process and multi-criteria decision analysis methods.

Step S5: Continuously acquire new knowledge through automatic document scanning and knowledge change detection. After consistency verification and conflict detection, integrate it into the knowledge graph through incremental updates to achieve continuous updating and consistency maintenance of the knowledge base.
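The analytic hierarchy process mentioned in step S4 can be illustrated with a minimal sketch. It approximates the AHP priority weights with the row geometric-mean method (a common alternative to the principal-eigenvector computation, used here for brevity) and then ranks candidate interventions by a weighted sum. The criteria, the pairwise judgments, and the option scores in the usage note are hypothetical.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights by the row geometric-mean method.

    pairwise[i][j] states how much more important criterion i is than j,
    so pairwise[j][i] should equal 1 / pairwise[i][j].
    """
    n = len(pairwise)
    if any(len(row) != n for row in pairwise):
        raise ValueError("pairwise comparison matrix must be square")
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def rank_options(options, weights):
    """Rank options by weighted sum of criterion scores (higher is better)."""
    scored = {name: sum(w * s for w, s in zip(weights, scores))
              for name, scores in options.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

For three criteria (health benefit, cost, feasibility) with the perfectly consistent judgments `[[1, 2, 6], [1/2, 1, 3], [1/6, 1/3, 1]]`, the weights come out as 0.6, 0.3, and 0.1. For inconsistent judgments, a full implementation would also compute a consistency ratio before trusting the weights.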

[0031] Furthermore, step S1 includes: intelligent document acquisition and preprocessing, construction of a large language model for the environmental health domain, extraction of knowledge triples, and knowledge quality assessment. The steps for constructing the large language model for the environmental health domain include: using a professional corpus from the environmental health domain to perform domain-adaptive continued pre-training on the base model, and then fine-tuning it with knowledge extraction instructions. The fine-tuning combines supervised fine-tuning with reinforcement learning from human feedback, and uses low-rank adaptation to update only the query and value matrices of the attention layers. The knowledge triple extraction step uses a two-stage strategy to extract five types of core knowledge: pollutant characteristic knowledge, exposure-effect relationships, biomarker relationships, mechanism-pathway relationships, and intervention effect relationships. The knowledge quality assessment step calculates a comprehensive quality score for each knowledge triple and performs evidence grading based on a modified GRADE method.

[0032] Furthermore, step S2 includes: ontology model design, graph construction and fusion, knowledge graph representation and storage, and knowledge graph visualization and exploration. The ontology model design step constructs an environmental health domain ontology that includes the following core entity categories: environmental pollutants, environmental media, exposure pathways, biomarkers, health effects, mechanisms of action, and intervention measures. The graph construction and fusion step employs a four-level entity linking strategy: precise matching, fuzzy matching, semantic matching, and contextual matching. It also includes standard knowledge integration, using an authority-first principle to integrate national environmental health standard knowledge into the knowledge graph.

[0033] Furthermore, step S3 includes: multimodal knowledge representation, hybrid reasoning strategies, causal chain discovery and verification, and uncertainty reasoning and representation. The hybrid reasoning strategy step includes: dynamically selecting a combination of reasoning strategies based on query characteristics, executing rule-based reasoning, case-based reasoning, and statistical reasoning in parallel, and then integrating the results through weighted fusion. The causal chain discovery and verification step uses an improved bidirectional breadth-first search algorithm to search for causal paths and quantifies causal strength based on the Bradford Hill criteria. The uncertainty reasoning step integrates multi-source evidence based on Dempster-Shafer theory and propagates uncertainty through Monte Carlo simulation and interval analysis.
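The causal path search can be sketched with a plain bidirectional breadth-first search over directed cause-effect edges. The source does not describe what its "improved" variant changes, so this is the textbook algorithm, with the smaller frontier expanded first. The function name and the example graph in the usage note are illustrative.

```python
from collections import deque


def causal_path(edges, source, target):
    """Shortest directed path from source to target via bidirectional BFS.

    edges is an iterable of (cause, effect) pairs. Returns a node list or None.
    """
    forward, backward = {}, {}
    for cause, effect in edges:
        forward.setdefault(cause, []).append(effect)
        backward.setdefault(effect, []).append(cause)
    if source == target:
        return [source]

    # The parent maps double as visited sets for each search direction.
    parents_f, parents_b = {source: None}, {target: None}
    frontier_f, frontier_b = deque([source]), deque([target])

    def expand(frontier, adjacency, parents, other_parents):
        # Expand one full BFS level; return a meeting node if the searches touch.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nxt in adjacency.get(node, []):
                if nxt not in parents:
                    parents[nxt] = node
                    if nxt in other_parents:
                        return nxt
                    frontier.append(nxt)
        return None

    while frontier_f and frontier_b:
        # Grow the smaller frontier first to keep the explored region small.
        if len(frontier_f) <= len(frontier_b):
            meet = expand(frontier_f, forward, parents_f, parents_b)
        else:
            meet = expand(frontier_b, backward, parents_b, parents_f)
        if meet is not None:
            path, node = [], meet
            while node is not None:      # walk back to the source
                path.append(node)
                node = parents_f[node]
            path.reverse()
            node = parents_b[meet]
            while node is not None:      # walk forward to the target
                path.append(node)
                node = parents_b[node]
            return path
    return None
```

Given edges for PM2.5 → oxidative stress → inflammation → cardiovascular disease and PM2.5 → autonomic imbalance → cardiovascular disease, the search returns the two-hop path through autonomic imbalance. A scoring stage based on the Bradford Hill criteria would then assess each returned path separately.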

[0034] Compared with the prior art, the beneficial effects of the present invention are: (1) Efficient and automated knowledge acquisition: Using a large language model of environmental health domain that has been pre-trained with domain adaptability and fine-tuned with knowledge extraction instructions, a two-stage extraction strategy is adopted to automatically extract environmental health knowledge from the literature, which effectively improves the efficiency of knowledge acquisition and solves the problem of time-consuming and labor-intensive traditional manual sorting.

[0035] (2) Systematic knowledge quality assurance: Through a multi-dimensional quality assessment system that includes five dimensions: extraction credibility, content consistency, source reliability, research quality and evidence level, a modified GRADE method is used to classify evidence, and contradictory knowledge is identified and resolved through a knowledge conflict detection and processing mechanism to ensure that the knowledge included in the knowledge graph is accurate and reliable.

[0036] (3) Construct a systematic knowledge system: Construct a knowledge graph that integrates multi-dimensional information such as pollutants, exposure, health effects, and mechanisms of action. Through a four-level entity linking strategy and standard knowledge integration, the fragmented knowledge is structured and integrated to support comprehensive knowledge association.

[0037] (4) Support for intelligent reasoning and causal explanation: Based on the knowledge graph, the system performs multiple types of reasoning, including rule-based reasoning, case-based reasoning, and statistical reasoning. Through causal chain discovery based on the Bradford Hill criteria and uncertainty reasoning based on Dempster-Shafer theory, it can answer complex professional questions and provide mechanism explanations, thereby enhancing the system's ability to understand and explain environment-health relationships.

[0038] (5) Improve the scientific nature and quality of decision-making: Provide knowledge support for risk assessment and intervention measures, conduct comprehensive evaluation and robustness analysis of candidate solutions through a multi-criteria decision analysis framework, improve the scientific nature of assessment and the quality of decision-making, and help to achieve precise environmental health management.

[0039] (6) Ensure the timeliness and authority of knowledge: Establish a dynamic update mechanism based on incremental updates and transaction control, integrate the latest scientific research results and national standards in a timely manner, and ensure the continuous reliability of the knowledge base through logical consistency verification and knowledge conflict detection.

[0040] (7) In the process of knowledge graph construction and reasoning, quality assessment results and evidence level information are introduced and mapped to weight parameters in the graph structure to support risk assessment and decision analysis in the field of environmental health. Through the synergistic effect of the above technical features, a high-quality, interpretable, and dynamically updatable intelligent knowledge graph system for the field of environmental health is realized.

Description of the Drawings

[0041] Figure 1 is an overall architecture diagram of the environmental health knowledge graph construction and intelligent reasoning system of the present invention; Figure 2 is a flowchart of the knowledge mining module driven by a large language model; Figure 3 is a structure diagram of the environmental health knowledge graph construction module; Figure 4 is an algorithm flowchart of the knowledge reasoning engine module; Figure 5 is a functional structure diagram of the knowledge application service module; Figure 6 is a structure diagram of the knowledge base dynamic update and maintenance module; Figure 7 is a schematic diagram of the coupling mechanism of knowledge quality, evidence level, and reasoning weight.

Detailed Implementation

[0042] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0043] Referring to Figure 1, the present invention provides a technical solution: an environmental health knowledge graph system and method based on a large language model. The system includes five core modules, each of which can operate independently and can also collaborate to form a complete knowledge processing flow. As shown in Figure 1, the overall workflow of the system is as follows: first, the knowledge mining module driven by the large language model automatically extracts structured knowledge in the field of environmental health from scientific literature; then, the environmental health knowledge graph construction module organizes the extracted knowledge into a multi-level knowledge graph; next, the knowledge reasoning engine module performs complex reasoning based on the knowledge graph; then, the knowledge application service module applies the knowledge and reasoning capabilities to environmental health management practice; finally, the knowledge base dynamic update and maintenance module ensures the continuous updating and consistency of the knowledge base.

[0044] Figures 2 to 6 show the internal structure and workflow of each module. The system setup process is further explained below.

1. System Overall Architecture

The system architecture adopts a layered design, comprising four layers: data layer, model layer, functional layer, and application layer. These layers interact through standardized interfaces, supporting flexible component replacement and horizontal system scaling. The system employs a microservice architecture, with each functional module decomposed into independent microservices that communicate via RESTful APIs and message queues. System deployment supports containerization and orchestration management, allowing for dynamic scaling up and down based on business load.

[0045] This system features a specially designed environmental health standards knowledge base as an authoritative component of the knowledge system. The knowledge base covers environmental health-related standards issued by the Ministry of Ecology and Environment, the National Health Commission, and other departments, including core standards such as the "General Guidelines for Ecological and Environmental Health Risk Assessment" (HJ 1111-2020), the "Basic Dataset for Exposure Parameter Survey" (HJ 968-2019), the "Technical Specification for Survey of Particulate Matter (PM2.5) Infiltration Coefficient in Civil Building Ambient Air" (HJ 949-2018), the "Technical Specification for Exposure Parameter Survey" (HJ 877-2017), and the "Environmental and Health Data Dictionary." Through automated parsing technology, the system transforms the standard content into structured knowledge, complementing scientific literature knowledge to construct a complete environmental health knowledge system.

[0046] 1.1 Data Layer Implementation

[0047] The data layer provides data storage and management services for the system, and includes the following components: 1.1.1 Document Repository: Implemented using a distributed document storage system, configured as a 3-node replica set to provide data redundancy and high availability. Storage employs a sharding strategy, partitioning by publication time and field, with each shard limited to under 50GB.

[0048] 1.1.2 Standards Library: A dedicated standards document management system is used to store and manage national standards, industry standards, and group standards in the field of environmental health.

[0049] 1.1.3 Knowledge Graph Database: Implemented using a graph database, deployed as a 3-node cluster configuration, supporting causal clustering and read replicas.

[0050] 1.1.4 Auxiliary databases: including relational databases and time-series databases.

[0051] 1.2 Model Layer Implementation

[0052] The model layer provides core algorithm support for the system and includes the following components: 1.2.1 Large Language Model: The system adopts a large language model based on the Transformer architecture, specifically implemented as a customized model with a Mixture-of-Experts (MoE) structure. Model specifications include: billions of parameters, 32 Transformer layers, 128 attention heads, an embedding dimension of 8,192, and a feedforward network dimension of 32,768. The model service architecture employs a distributed inference framework, including the following components: a model server (multi-GPU configuration, large-capacity system memory, high-speed interconnect), a load balancer (request distribution with a weighted round-robin strategy), model optimization (mixed-precision computation and inference acceleration), and a batching mechanism (dynamic batching, maximum batch size of 32, timeout of 100 ms). Model service performance metrics: average inference latency <500 ms (single sample), throughput >500 tokens/s, 99th-percentile tail latency <1.5 s.
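The dynamic batching policy described above (flush at 32 requests or after 100 ms, whichever comes first) can be sketched as a pure function over request arrival times. This is only an illustrative model of the policy; the function name and the millisecond arrival lists are hypothetical, and a real model server would apply the same rule to a live request queue.

```python
def form_batches(arrivals_ms, max_batch=32, timeout_ms=100):
    """Group sorted request arrival times (ms) into batches.

    A batch is closed when it holds max_batch requests or when timeout_ms
    has elapsed since its first request arrived.
    """
    batches, current, deadline = [], [], None
    for t in arrivals_ms:
        # The timeout of the open batch fired before this request arrived.
        if current and t >= deadline:
            batches.append(current)
            current = []
        if not current:
            deadline = t + timeout_ms  # timer starts with the first request
        current.append(t)
        if len(current) == max_batch:  # size limit reached, flush at once
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical workloads: a burst of 40 requests, then a sparse trickle.
burst = form_batches(list(range(40)))
sparse = form_batches([0, 50, 150])
print([len(b) for b in burst], sparse)
```

Under a burst the size limit dominates, while under sparse traffic the timeout bounds how long the first request in a batch can wait.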

[0053] The system also includes an environmental health standards text parsing submodule, specifically designed to process environmental health standards documents and extract structured standard knowledge. This submodule works in conjunction with other submodules to integrate standard knowledge with scientific literature knowledge, forming a complete environmental health knowledge system.

[0054] 1.2.2 Knowledge Graph Embedding Model: The system integrates three algorithms (TransE, RotatE, and ComplEx) for vector representation of entities and relations. The model configuration is as follows: embedding dimension 256 (TransE and RotatE), 128×2 complex dimension (ComplEx); training parameters: Adam optimizer, learning rate 1e-4, batch size 1024, negative sample ratio 5:1; regularization: L2 regularization coefficient 1e-5, embedding normalization constraint; loss function: adaptive margin ranking loss, initial margin γ=24, soft margin parameter α=0.5. Model training is performed on a multi-GPU cluster, with a training time of approximately 10-15 hours. The final Hits@10 metric on the link prediction task reaches over 0.85, and the MRR reaches over 0.4. The embedding update strategy combines full retraining (once a month) and incremental updates (once a week).
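For reference, the two link prediction metrics quoted above can be computed from the rank that the model assigns to the correct entity in each test query. The sketch below uses the standard definitions; the rank list is made-up illustrative data, not a result of the described system.

```python
def hits_at_k(ranks, k=10):
    """Fraction of test queries whose correct entity is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the correct entity over all test queries."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Hypothetical ranks (1 = best) of the true tail entity for four test triples.
example_ranks = [1, 2, 4, 20]
print(hits_at_k(example_ranks, k=10))       # 3 of the 4 ranks are <= 10
print(mean_reciprocal_rank(example_ranks))  # (1 + 1/2 + 1/4 + 1/20) / 4
```

Because MRR weights top ranks heavily, a model can reach a high Hits@10 while its MRR stays much lower, which is consistent with the pair of thresholds stated above.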

[0055] 1.2.3 Hybrid Inference Engine: A hybrid engine integrating symbolic and neural inference, comprising the following components: a symbolic inference engine (supporting forward and backward inference, using SWRL format for rule files, and containing approximately 300 rules in the environmental health domain), neural inference components (including graph neural network models and attention models), a Bayesian inference module (supporting variable elimination and MCMC sampling algorithms, with a maximum of 100 variables processed), and an ensemble strategy (using a weighted voting mechanism, with dynamically adjusted weights based on historical performance feedback). The inference engine is deployed on a dedicated server with ample memory and multi-core CPUs, achieving an average latency of <1s for complex inference and <100ms for simple inference.
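A minimal sketch of the weighted voting ensemble follows, assuming each reasoning component returns a confidence score in [0, 1] for the same conclusion. The component names, scores, weights, and the moving-average update rule are assumptions chosen for illustration; the source only states that weights are adjusted from historical performance feedback.

```python
def weighted_vote(scores, weights):
    """Fuse per-engine confidence scores into one weighted average."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


def update_weights(weights, recent_accuracy, rate=0.1):
    """Move each weight toward the engine's recent accuracy (assumed EMA rule)."""
    return {
        name: (1 - rate) * weight + rate * recent_accuracy.get(name, weight)
        for name, weight in weights.items()
    }


# Hypothetical scores for the claim "pollutant X increases the risk of effect Y".
scores = {"rule": 0.9, "case": 0.6, "statistical": 0.7}
weights = {"rule": 0.5, "case": 0.2, "statistical": 0.3}

fused_score = weighted_vote(scores, weights)
new_weights = update_weights(weights, {"rule": 1.0, "case": 0.4})
print(round(fused_score, 3), new_weights)
```

Normalizing by the sum of the weights keeps the fused score in [0, 1] even after the weights drift away from summing to one.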

[0056] 1.3 Functional Layer Implementation

[0057] The functional layer implements the core functions of the system and includes the following components: 1.3.1 Knowledge Extraction Function: Enables a complete workflow for extracting structured knowledge from environmental health literature. This module includes: a document preprocessing unit (supporting multi-format document conversion, chapter recognition, and table extraction), an entity recognition unit (based on a BiLSTM-CRF model and a pre-trained language model, achieving an F1 score of over 0.9), a relation extraction unit (combining distant supervision and large language model prompting, achieving an F1 score of over 0.8), an attribute extraction unit (a dedicated extractor for numerical parameters and conditional attributes, achieving an accuracy of over 90%), and a quality assessment unit (a multi-dimensional quality scoring system supporting evidence level assessment). This function is provided via a RESTful API, supporting both batch and incremental processing.

[0058] 1.3.2 Knowledge Graph Construction Function: This function implements the construction process from knowledge triples to a knowledge graph. This module includes: an ontology management unit (maintaining the ontology model for the environmental health domain, supporting ontology editing and version control), an entity linking unit (multi-strategy entity linking algorithms with an accuracy rate exceeding 90%), a relation mapping unit (relation standardization and attribute expansion with an accuracy rate exceeding 90%), a graph construction unit (incremental graph construction algorithms supporting efficient update and rollback operations), and a consistency checking unit (rule-based consistency verifier covering hundreds of consistency rules). This function is provided through a Java API and a Python client library, supporting transaction operations and batch imports.

[0059] 1.3.3 Knowledge Reasoning Function: This function enables complex reasoning based on the environmental health knowledge graph. This module includes: a path reasoning unit (a graph traversal-based pathfinding algorithm supporting multiple path scoring strategies), a rule reasoning unit (a description logic-based reasoning engine supporting OWL2 reasoning rules), a statistical reasoning unit (a probabilistic graphical model reasoning framework supporting Bayesian networks and Markov random fields), and a mechanism reasoning unit (a causal path discovery and verification engine supporting evaluation against the Bradford Hill criteria). This function provides query services through the GraphQL API, supporting complex query construction and result filtering.
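The graph traversal behind the path reasoning and mechanism reasoning units could, for example, be a bidirectional breadth-first search of the kind named in step S3. The sketch below is a plain textbook version on a directed graph, not the "improved" algorithm of the invention; the toy causal graph and its node names are hypothetical.

```python
from collections import deque


def bidirectional_bfs(graph, source, target):
    """Return a directed path with the fewest edges from source to target, or None.

    graph maps each node to a list of its successor nodes.
    """
    if source == target:
        return [source]
    reverse = {}
    for node, successors in graph.items():
        for successor in successors:
            reverse.setdefault(successor, []).append(node)

    parents_fwd, parents_bwd = {source: None}, {target: None}
    frontier_fwd, frontier_bwd = deque([source]), deque([target])

    def expand(frontier, adjacency, parents, other_parents):
        # Expand one full BFS level; return the meeting node if the searches touch.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbor in adjacency.get(node, []):
                if neighbor not in parents:
                    parents[neighbor] = node
                    if neighbor in other_parents:
                        return neighbor
                    frontier.append(neighbor)
        return None

    while frontier_fwd and frontier_bwd:
        # Always grow the smaller frontier to keep the explored region small.
        if len(frontier_fwd) <= len(frontier_bwd):
            meet = expand(frontier_fwd, graph, parents_fwd, parents_bwd)
        else:
            meet = expand(frontier_bwd, reverse, parents_bwd, parents_fwd)
        if meet is not None:
            path, node = [], meet
            while node is not None:          # walk back to the source
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:          # walk forward to the target
                path.append(node)
                node = parents_bwd[node]
            return path
    return None


# Hypothetical causal graph fragment.
causal_graph = {
    "PM2.5": ["oxidative stress", "lung deposition"],
    "oxidative stress": ["inflammation"],
    "lung deposition": ["inflammation"],
    "inflammation": ["atherosclerosis"],
    "atherosclerosis": ["cardiovascular disease"],
}
causal_path = bidirectional_bfs(causal_graph, "PM2.5", "cardiovascular disease")
print(" -> ".join(causal_path))
```

Each candidate path found this way would still need to be scored, for instance against the Bradford Hill criteria, before it is reported as a causal chain.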

[0060] 1.3.4 Knowledge Application Function: Applying environmental health knowledge to real-world scenarios. This module includes: a risk assessment unit (a knowledge-based environmental health risk assessment engine), a causal explanation unit (a generator for explaining the mechanisms of environment-health associations), a literature review unit (an automatic literature review and evidence synthesis system), and a decision support unit (a system for generating and evaluating environmental health decision-making solutions). This function provides services through web applications and APIs, supporting customized configurations for various application scenarios.

[0061] 1.3.5 Knowledge Update Function: Enables continuous updating and maintenance of environmental health knowledge. This module includes: a literature monitoring unit (automatic literature scanning and change detection system), a knowledge extraction unit (incremental knowledge extraction and quality control process), a conflict resolution unit (rule-based and evidence-based conflict resolution strategies), and a version control unit (knowledge version management and history tracking system). This function executes automatically according to a schedule (weekly updates) and supports manual triggering and expert review processes.

[0062] 1.4 Application Layer Implementation

[0063] The application layer provides the user interface and services to end users, and includes the following components: 1.4.1 Web Application Interface: A front-end application developed based on mainstream front-end frameworks (such as React) and using a component-based design.

[0064] 1.4.2 API Services: The system provides a unified API gateway that supports multiple service access methods: RESTful API (based on the OpenAPI 3.0 specification, supporting JSON and XML response formats), GraphQL API (supporting flexible query construction and reducing data transmission redundancy), and WebSocket interface (supporting real-time notifications and long-connection scenarios).

[0065] 1.4.3 Mobile Applications: Provide native applications for iOS and Android platforms, developed based on cross-platform mobile frameworks (such as Flutter).

[0066] The system deployment architecture adopts a hybrid cloud solution, with core computing resources deployed in a private cloud and data storage using a hybrid strategy (core data is stored in the private cloud, while non-sensitive data can be stored in a public cloud).

[0067] 2. Knowledge mining module driven by large language models

[0068] As shown in Figure 2, the knowledge mining module driven by the large language model is the core component of the system, responsible for automatically extracting structured knowledge from massive amounts of environmental health literature. This module includes four key sub-modules: the intelligent literature acquisition and preprocessing sub-module, the large language model sub-module for the environmental health domain, the knowledge triple extraction sub-module, and the knowledge quality assessment sub-module. These sub-modules work collaboratively to form a complete processing chain from raw literature to high-quality structured knowledge.

[0069] 2.1 Intelligent Document Acquisition and Preprocessing Submodule

[0070] The intelligent document acquisition and preprocessing submodule aims to address the issues of automatic acquisition, format standardization, and content preprocessing of environmental health documents. This submodule includes a multi-source document automatic acquisition unit, a document structure parsing unit, a technical terminology recognition unit, and a text standardization processing unit.

[0071] 2.1.1 Multi-Source Document Automatic Acquisition Unit: This unit adopts a distributed crawler framework and API integration solution, supporting automatic retrieval and acquisition of documents from 25 academic databases, including PubMed, Web of Science, ScienceDirect, and CNKI. The system configuration is as follows: The retrieval strategy optimization engine uses a Bayesian optimization algorithm, with keyword weights adaptively adjusted weekly. The initial keyword library contains thousands of environmental pollutant names and thousands of health effect terms. The complexity threshold for the generated retrieval expression is set to 30 (the number of Boolean operators), ensuring retrieval accuracy of no less than 80% and recall of no less than 75%. The API access configuration uses a token bucket algorithm for rate limiting. The maximum request frequency for each data source is set to 100-300 times / hour (adjusted according to platform policies), the request timeout is set to 30 seconds, the number of retries for failure is 3, and the retry interval increases exponentially (initially 5 seconds). Distributed crawlers use distributed crawling frameworks (such as Scrapy), are configured with multiple worker nodes, each node has a maximum concurrency of 20, use random user agents and IP proxy pools (containing dozens of proxy addresses), follow robots.txt rules, and have a request interval of random 30-60 seconds.
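The rate limiting and retry settings above can be illustrated with a small token bucket and an exponential backoff schedule. This is a sketch under stated assumptions: the bucket capacity and the doubling factor are not given in the source (which only says the interval grows exponentially from 5 seconds), and the injectable clock exists only to make the example deterministic.

```python
import time


class TokenBucket:
    """Token bucket limiter: tokens refill continuously at rate_per_hour."""

    def __init__(self, rate_per_hour, capacity, clock=time.monotonic):
        self.rate = rate_per_hour / 3600.0  # tokens per second
        self.capacity = capacity            # assumed burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def retry_delays(initial=5.0, retries=3, factor=2.0):
    """Waiting time before each retry; factor=2.0 is an assumed growth rate."""
    return [initial * factor ** attempt for attempt in range(retries)]


# Deterministic demo with a fake clock (seconds).
fake_now = [0.0]
bucket = TokenBucket(rate_per_hour=300, capacity=2, clock=lambda: fake_now[0])
first_three = [bucket.try_acquire() for _ in range(3)]
fake_now[0] = 13.0  # at 300 requests/hour one token refills roughly every 12 s
after_wait = bucket.try_acquire()
print(first_three, after_wait, retry_delays())
```

A caller would sleep for each value of `retry_delays()` in turn after a failed request and give up once the list is exhausted.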

[0072] 2.1.2 Document Structure Parsing Unit: This unit employs a hybrid document understanding engine to process multi-format document content. PDF parsing combines PDF.js with a self-developed layout analysis algorithm, achieving page segmentation accuracy of over 95%. Table extraction utilizes the deep learning model TableNet (ResNet-50 backbone network, U-Net decoder structure), achieving an F1 score of over 0.9 for both table and cell recognition. Image processing employs the YOLOv5 model for chart detection and a ResNet-34 and GRU combination architecture for chart understanding, supporting data extraction for seven common chart types, including bar charts, line charts, and scatter plots, with a numerical accuracy of over 85%. Document structure recognition uses a BERT-based sequence labeling model (fine-tuned parameters: learning rate 2e-5, batch size 16, training epochs 5) to label structural elements such as chapters, paragraphs, and titles, achieving an F1 score of over 0.9.

[0073] 2.1.3 Terminology Recognition Unit: This unit enables the automatic recognition and standardization of professional terms in the environmental health field. Terminology recognition employs a hybrid approach combining BiLSTM-CRF and a domain dictionary. The model configuration is as follows: BiLSTM hidden layer dimension is 512, CRF transition matrix regularization coefficient is 0.01, and word embedding uses a domain-adaptive Word2Vec model (window size 5, negative sampling number 10, vector dimension 300). The domain dictionary contains over 100,000 terms, categorized into pollutant dictionary (several thousand items), health effect dictionary (several thousand items), biomarker dictionary (approximately two thousand items), mechanism of action dictionary (nearly two thousand items), and intervention measure dictionary (over one thousand items). Terminology recognition accuracy reaches over 95% (exact match) and over 90% (partial match). Terminology standardization employs a multi-level mapping strategy, mapping variant forms (such as abbreviations, synonyms, and colloquialisms) to standard names, achieving a standardization accuracy of over 95%.

[0074] 2.1.4 Text Standardization Processing Unit: This unit cleans, standardizes, and enhances text content. Text cleaning combines a rule engine with statistical methods, achieving an accuracy of over 98% in handling noise (such as headers, footers, and special characters). Sentence segmentation employs a hybrid approach based on rules and machine learning, supporting sentence boundary detection for 10 languages with an accuracy of over 99%. Text enhancement utilizes context-aware methods, including abbreviation expansion (accuracy over 90%), entity linking (F1 score over 0.9), and key attribute annotation (F1 score over 0.85). Numerical standardization supports multiple unit conversions and unifications, covering over a hundred common environmental monitoring and health indicator units.

[0075] 2.2 Large Language Model Submodule in the Environmental Health Domain

[0076] The Environmental Health Domain Large Language Model submodule is the core algorithm engine of the system, responsible for understanding and analyzing the content of environmental health literature. Based on a general large language model, this submodule achieves deep understanding of professional environmental health texts through domain-adaptive pre-training and task-specific fine-tuning.

[0077] 2.2.1 Base Model Selection and Configuration: This system uses a large language model with a Transformer architecture as the base model, employing a Mixture-of-Experts (MoE) structure to improve efficiency. The model has hundreds of billions of parameters, including 32 Transformer layers, 128 attention heads, an embedding dimension of 8,192, a feedforward network dimension of 32,768, and uses SwiGLU as the activation function. The model supports a context window of 8,192 tokens, sufficient to process a single complete paper. For quantization, 8-bit quantization is used during the inference stage to convert FP16 weights to INT8 representation, compressing the model to about half of its FP16 size (about one-quarter of an FP32 baseline) while keeping performance loss within 1%. Distributed deployment uses the DeepSpeed ZeRO-3 strategy. With a multi-GPU configuration, the average inference speed reaches several hundred tokens per second, and the average processing time for a single paper (approximately 5,000 tokens) is controlled within 30 seconds.

[0078] 2.2.2 Domain-Adaptive Pre-training: The system uses a professional corpus in the environmental health domain to further pre-train the base model. The pre-training corpus contains hundreds of thousands of environmental health documents, totaling approximately 200GB of text data, covering related disciplines such as environmental science, toxicology, epidemiology, and public health. Pre-training employs a continual pre-training method, preserving the general knowledge of the base model while enhancing the understanding of environmental health expertise. The pre-training task configuration is as follows: in the Masked Language Modeling (MLM) task, the masking rate is 15%; in the document-level prediction task, the negative sample ratio is 3:1, achieving an accuracy of over 95% in adjacent document recognition; and in the knowledge relationship prediction task, the relationship types cover various common environmental health relationships. Training uses mixed precision (bfloat16), a global batch size of 1,024, a learning rate of 1e-5, linear warm-up (first 1% of steps), and cosine learning rate decay, training for 4 epochs. After training, the model improved its accuracy by nearly 20 percentage points on the environmental health text understanding benchmark and its F1 score by about 20 percentage points on the technical terminology recognition task.
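The learning rate schedule described above (linear warm-up over the first 1% of steps to a peak of 1e-5, followed by cosine decay) can be written as a small function. The total step count and the final learning rate of zero are assumptions made for this illustration; the source does not state either.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-5, warmup_frac=0.01, min_lr=0.0):
    """Linear warm-up followed by cosine decay to min_lr (min_lr is assumed)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 10,000 optimizer steps: warm-up covers steps 0-99.
TOTAL_STEPS = 10_000
for step in (0, 99, 5_050, 10_000):
    print(step, learning_rate(step, TOTAL_STEPS))
```

A low peak rate with a short warm-up is a common choice in continued pre-training because it limits drift away from the general knowledge of the starting checkpoint.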

[0079] 2.2.3 Knowledge Extraction Instruction Fine-tuning: The system performs instruction fine-tuning for knowledge extraction tasks based on manually annotated environmental health literature. The fine-tuning dataset contains thousands of annotated documents and approximately 100,000 knowledge triple annotations, covering five core knowledge categories: pollutant characteristic knowledge, exposure-effect relationships, biomarker relationships, mechanism-pathway relationships, and intervention effect relationships. Fine-tuning employs a combination of supervised fine-tuning and reinforcement learning from human feedback (RLHF). Supervised fine-tuning uses specific prompt templates and task formats, with a learning rate of 5e-6, a batch size of 64, and 3 training epochs. The RLHF stage uses the Proximal Policy Optimization (PPO) algorithm, with the reward model trained using human-annotated preference pairs. The KL penalty coefficient is set to 0.05, and the initial reward model accuracy reaches over 85%. Fine-tuning uses Low-Rank Adaptation (LoRA), updating only a small number of low-rank adapter parameters. The LoRA rank is 16, the α value is 32, and the target modules are the Q and V matrices of the attention layers. The fine-tuned model achieved an F1 score of over 0.85 on the environmental health knowledge extraction test set, which is about 20 percentage points higher than the general model.
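To give a sense of scale for the LoRA settings above, the following worked arithmetic counts the adapter parameters for rank 16 on the Q and V projections. It assumes each projection is a square 8,192 × 8,192 matrix and that all 32 layers are adapted, using the embedding dimension and layer count stated for the model; the actual projection shapes are not given in the source.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


D_MODEL = 8192          # embedding dimension stated for the model
RANK, ALPHA = 16, 32    # LoRA rank and alpha stated above
LAYERS = 32             # Transformer layers stated for the model
TARGETS_PER_LAYER = 2   # Q and V projections

per_matrix = lora_params(D_MODEL, D_MODEL, RANK)
adapter_total = per_matrix * TARGETS_PER_LAYER * LAYERS
frozen_total = D_MODEL * D_MODEL * TARGETS_PER_LAYER * LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update B @ A

print(per_matrix, adapter_total, adapter_total / frozen_total, scaling)
```

Under these assumptions the adapters add about 16.8 million trainable parameters, roughly 0.4% of the size of the projections they modify, and the low-rank update is scaled by α/r = 2.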

[0080] 2.2.4 Prompt Engineering and Few-Shot Learning: The system maintains a professional prompt template library to guide the model in extracting specific types of environmental health knowledge from different types of text. The prompt templates employ a chain-of-thought design, breaking down the extraction steps to improve the accuracy of extracting complex knowledge. The library contains hundreds of validated prompt templates, covering the five core knowledge categories and various sub-categories. For few-shot learning, each prompt template contains 3-5 high-quality examples. The example selection algorithm uses k-means clustering (k=100) to select the most representative samples from the labeled data, with a sample diversity score of no less than 0.7 (based on cosine similarity of vector representations). Prompt templates are updated monthly, adjusted based on extraction performance feedback and the need for new knowledge types. For specific domains such as emerging pollutants, the system employs the MAML (Model-Agnostic Meta-Learning) method, using 10-shot adaptation, a learning rate of 1e-5, and 5 inner-loop steps to achieve rapid adaptation to new knowledge patterns.

[0081] 2.3 Knowledge Triple Extraction Submodule

[0082] The knowledge triple extraction submodule is responsible for extracting structured knowledge representations from text, namely knowledge triples in the form of (entity 1, relation, entity 2). This submodule integrates a large language model and specialized rules, and adopts a two-stage strategy that combines high-precision targeted extraction with broad-coverage general extraction to achieve high-quality environmental health knowledge extraction.

[0083] 2.3.1 Text Segmentation and Processing Strategy: The system employs an intelligent segmentation strategy to process long documents, ensuring comprehensive extraction and contextual understanding. Segmentation uses a sliding window method with a window size of 1,024 tokens and an overlap of 200 tokens (approximately 20%) between adjacent windows, ensuring the coherence of knowledge across paragraphs. The paragraph importance scoring algorithm combines TF-IDF with keyword density, prioritizing high-information-density paragraphs such as abstracts, results, and discussions, with three priority levels. For tabular data, the system uses a dedicated tabular knowledge extraction model, supporting row-column mapping and relation parsing, achieving an accuracy rate of over 85%. The system generates processing metadata for each document block, including block ID, document location, importance score, and contextual references, for subsequent knowledge integration.
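The sliding window segmentation above can be sketched as follows, assuming the document has already been tokenized into a list. The function name and the integer stand-in tokens are hypothetical; only the window size of 1,024 and the overlap of 200 come from the description.

```python
def sliding_windows(tokens, window=1024, overlap=200):
    """Split a token list into overlapping chunks, returned as (start, chunk) pairs."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    stride = window - overlap
    chunks, start = [], 0
    while True:
        chunks.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


# A hypothetical 5,000-token paper, with integers standing in for tokens.
paper_tokens = list(range(5000))
chunks = sliding_windows(paper_tokens)
print([(start, len(chunk)) for start, chunk in chunks])
```

Keeping the start offset with each chunk lets later stages record the document location in the block metadata and merge triples extracted twice from the overlap region.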

[0084] 2.3.2 Two-Stage Triple Extraction Strategy: The system employs a large language model-driven knowledge extraction method to extract structured knowledge triples from text blocks. The extraction process uses a two-stage strategy: the first stage employs high-precision targeted extraction, using professional prompt templates to target specific knowledge types, achieving an accuracy of over 90%; the second stage employs broad-coverage general extraction, using open prompts to capture knowledge not covered by the first stage, improving recall to over 85%. Prompt engineering configurations include: prompt length controlled at 200-300 tokens, instruction clarity score no less than 4 points (out of 5), example diversity covering major knowledge variants, and output format using a structured JSON template. Large language model inference parameters are set as follows: temperature 0.3 (prioritizing accuracy), top_p 0.92, maximum output token count 1,024, repetition penalty 1.2, and batch size 16. The system performs post-processing on the model output, including format validation, structure repair, and consistency checks, achieving a format compliance rate of over 99%.
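The post-processing step (format validation, structure repair, and consistency checks) might look like the sketch below. The JSON field names `head`, `relation`, and `tail`, the repair heuristic, and the sample model output are all assumptions for illustration; the source does not specify the JSON template.

```python
import json

REQUIRED_KEYS = ("head", "relation", "tail")  # hypothetical field names


def parse_triples(raw_output):
    """Return (valid_triples, errors) parsed from raw model output text."""
    # Structure repair: ignore any chatter around the outermost JSON array.
    start, end = raw_output.find("["), raw_output.rfind("]")
    if start == -1 or end <= start:
        return [], ["no JSON array found"]
    try:
        items = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]

    valid, errors, seen = [], [], set()
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"item {index}: not an object")
            continue
        # Format validation: every field must be a non-empty string.
        missing = [
            key for key in REQUIRED_KEYS
            if not (isinstance(item.get(key), str) and item[key].strip())
        ]
        if missing:
            errors.append(f"item {index}: missing or empty {missing}")
            continue
        # Consistency check: drop case-insensitive duplicates.
        fingerprint = tuple(item[key].strip().lower() for key in REQUIRED_KEYS)
        if fingerprint not in seen:
            seen.add(fingerprint)
            valid.append(item)
    return valid, errors


# Hypothetical model output with leading chatter, one bad item, one duplicate.
raw = (
    'Here are the triples:\n'
    '[{"head": "PM2.5", "relation": "increases risk of", "tail": "asthma"},'
    ' {"head": "lead", "relation": "", "tail": "anemia"},'
    ' {"head": "pm2.5", "relation": "Increases risk of", "tail": "Asthma"}]'
)
triples, problems = parse_triples(raw)
print(triples, problems)
```

Returning the rejected items with reasons, rather than silently dropping them, makes it possible to track the format compliance rate over time.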

[0085] The first stage of high-precision targeted extraction uses prompt templates designed for the five core knowledge categories: pollutant characteristic knowledge extraction templates focus on chemical names, CAS numbers, physicochemical properties, and environmental behavior parameters; exposure-effect relationship extraction templates focus on exposure pathways, exposure doses, effect endpoints, and dose-response coefficients; biomarker relationship extraction templates focus on biomarker names, associated pollutants, associated health effects, and sensitivity/specificity parameters; mechanism-pathway relationship extraction templates focus on molecular targets, signaling pathways, intermediate steps, and temporal relationships; and intervention effect relationship extraction templates focus on intervention type, target pollutant or effect, effect size, and applicable conditions. Each template includes 3-5 high-quality examples and uses a chain-of-thought design to guide the model through step-by-step analysis and extraction.

[0086] 2.3.3 Entity Recognition and Normalization: The system realizes the recognition and normalization of professional entities in the environmental health domain. Entity recognition adopts a hybrid approach combining Named Entity Recognition (NER) and a large language model. The NER model is based on a BiLSTM-CRF architecture, using domain-specific training data (containing hundreds of thousands of annotated text sentences). The model configuration is as follows: character embedding dimension of 30, word embedding dimension of 300, BiLSTM hidden layer dimension of 512, and CRF transition matrix regularization coefficient of 0.01. The model achieves an F1 score of over 0.9 on the environmental health entity recognition task. The large language model is used to recognize complex, long-tailed, and newly emerging entity representations, complementing the NER model. Entity normalization employs a three-level mapping strategy: exact matching (using a terminology dictionary, coverage approximately 70%), fuzzy matching (using edit distance and character n-grams, threshold 0.85, coverage approximately 20%), and semantic matching (using entity embedding cosine similarity, threshold 0.92, coverage approximately 10%). The standardized entity is linked to a unique identifier, and the standardization accuracy can reach over 90%.
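The three-level mapping above can be sketched as a cascade that stops at the first level that succeeds. This is an illustration only: `difflib.SequenceMatcher` is used as a stand-in for the edit distance and character n-gram similarity, the three-dimensional vectors stand in for real entity embeddings, and the dictionary entries and identifiers are hypothetical. Only the thresholds 0.85 and 0.92 come from the description.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalize_entity(mention, dictionary, embeddings,
                     fuzzy_threshold=0.85, semantic_threshold=0.92):
    """Map a mention to (concept_id, level) via exact, fuzzy, then semantic matching."""
    key = mention.strip().lower()
    if key in dictionary:                                   # level 1: exact
        return dictionary[key], "exact"

    term, score = max(                                      # level 2: fuzzy
        ((t, SequenceMatcher(None, key, t).ratio()) for t in dictionary),
        key=lambda pair: pair[1], default=(None, 0.0),
    )
    if score >= fuzzy_threshold:
        return dictionary[term], "fuzzy"

    if key in embeddings:                                   # level 3: semantic
        term, score = max(
            ((t, cosine(embeddings[key], embeddings[t]))
             for t in dictionary if t in embeddings),
            key=lambda pair: pair[1], default=(None, 0.0),
        )
        if score >= semantic_threshold:
            return dictionary[term], "semantic"
    return None, "unlinked"


# Hypothetical dictionary (surface form -> identifier) and toy embeddings.
terms = {"benzene": "CHEM:0001", "particulate matter": "CHEM:0002"}
vectors = {
    "benzene": [0.0, 0.2, 0.9],
    "particulate matter": [0.9, 0.1, 0.0],
    "fine dust": [0.88, 0.15, 0.02],
}
for mention in ("Benzene", "benzen", "fine dust", "ozone"):
    print(mention, normalize_entity(mention, terms, vectors))
```

Ordering the levels from cheapest and most precise to most permissive means most mentions are resolved by the dictionary lookup, and mentions that fail every level remain unlinked for later review.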

[0087] 2.3.4 Relation Extraction and Representation: The system extracts and represents professional relations in the environmental health field. It defines several core relation types, covering physicochemical relations (e.g., "dissolve," "adsorption"), biological relations (e.g., "inhibition," "activation"), and epidemiological relations (e.g., "increase risk," "related to"). Relation extraction combines distant supervision with a large language model; the training data contains approximately 100,000 labeled relation instances. Relation mapping uses a two-stage strategy: rule-based mapping (based on keyword and pattern matching, with a coverage of approximately 60%-70%) and semantic mapping (using a BERT relation classifier fine-tuned from a pre-trained BERT-base model with a learning rate of 2e-5, a batch size of 32, and 10 training epochs, with a coverage of approximately 30%). For relations that cannot be mapped (approximately 5%), the system retains the original expression and marks them as "non-standard relations," which are then manually reviewed by experts. The overall F1 score for relation extraction reaches over 0.8, significantly higher than traditional methods.

[0088] 2.3.5 Attribute and Quantitative Parameter Extraction: The system extracts attributes and quantitative parameters related to knowledge triples, enriching knowledge representation. Attribute extraction uses an attention-based parameter extraction model that focuses on attribute expressions in syntactic dependency relations. The model configuration is: a 4-layer Transformer encoder, 8 attention heads, a hidden dimension of 512, and the GELU activation function. Quantitative parameter extraction focuses on six key indicators: exposure-response coefficient, relative risk (RR), hazard ratio (HR), correlation coefficient (r), p-value, and confidence interval. Numerical recognition combines rule-based patterns with a statistical model, supporting multiple expression formats (such as scientific notation, percentages, and ratios) and unit recognition (over a hundred common units and their variants), achieving a numerical extraction accuracy of over 90%. Attribute classification covers time attributes (such as research period and year of discovery), spatial attributes (such as research area and scope of application), and conditional attributes (such as applicable population and environmental conditions), achieving an attribute classification accuracy of over 90%.
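
As a minimal sketch of the rule-based part of numerical recognition, the following Python fragment extracts a few of the listed indicators with regular expressions. The patterns and the function name `extract_parameters` are illustrative assumptions and cover only simple English phrasings; the statistical model and the unit recognition described above are not shown.

```python
import re

# Decimal or scientific-notation number, e.g. 1.08 or 2.5e-4.
NUMBER = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"

# Illustrative patterns only; real literature needs far broader coverage.
PATTERNS = {
    "RR": re.compile(rf"\bRR\s*[=:]\s*({NUMBER})"),
    "HR": re.compile(rf"\bHR\s*[=:]\s*({NUMBER})"),
    "p_value": re.compile(rf"\bp\s*[<=]\s*({NUMBER})", re.IGNORECASE),
    "ci_95": re.compile(
        rf"95%\s*CI[:\s]*\(?\s*({NUMBER})\s*(?:-|to|,)\s*({NUMBER})"
    ),
}


def extract_parameters(sentence):
    """Return the first match of each indicator found in the sentence."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            values = [float(group) for group in match.groups()]
            # Single-valued indicators become floats, intervals become tuples.
            found[name] = values[0] if len(values) == 1 else tuple(values)
    return found
```

For the p-value pattern, the comparison operator is discarded, so "p < 0.001" and "p = 0.001" yield the same number; a fuller implementation would likely keep the operator as a separate attribute.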

[0089] 2.4 Knowledge Quality Assessment Submodule

[0090] The knowledge quality assessment submodule performs a comprehensive quality assessment on the extracted knowledge triples to ensure the accuracy and reliability of the information included in the knowledge base. This submodule implements multi-dimensional quality assessment and filtering functions, providing high-quality knowledge input for subsequent knowledge graph construction.

[0091] 2.4.1 Multi-dimensional Quality Assessment: The system assesses the quality of each piece of knowledge from multiple dimensions. Reliability assessment: This assesses how certain the large language model's knowledge extraction is. The system employs a combined algorithm based on attention distribution and token probability to calculate the confidence score for knowledge extraction. The algorithm configuration is as follows: attention weight α = 0.6, token probability weight β = 0.4, low confidence threshold of 0.65, and high confidence threshold of 0.85. For cross-validation, the system repeatedly extracts knowledge using different prompt variations and temperature settings. The consistency score is calculated based on Jaccard similarity, with a threshold of 0.8. The reliability assessment accuracy (agreement with human assessment) reaches over 85%.
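
The reliability assessment may be illustrated with the following Python sketch. It assumes that the attention-based score and the token probability are each already reduced to a single value in [0, 1], and that the cross-validation consistency is the mean pairwise Jaccard similarity of the triple sets returned by repeated extraction runs; neither aggregation is specified in the description, and the "medium" band name is likewise an assumption.

```python
from itertools import combinations

ALPHA, BETA = 0.6, 0.4            # weights from the description
LOW_CONF, HIGH_CONF = 0.65, 0.85  # thresholds from the description
CONSISTENCY_THRESHOLD = 0.8       # Jaccard threshold from the description


def confidence(attention_score, token_prob):
    """Weighted combination of the two certainty signals."""
    return ALPHA * attention_score + BETA * token_prob


def confidence_band(score):
    """Map a confidence score to a band ("medium" is an assumed label)."""
    if score >= HIGH_CONF:
        return "high"
    if score < LOW_CONF:
        return "low"
    return "medium"


def jaccard(a, b):
    """Jaccard similarity of two collections of triples."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all pairs of extraction runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


def is_consistent(runs):
    return mean_pairwise_jaccard(runs) >= CONSISTENCY_THRESHOLD
```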

[0092] Content consistency assessment: This function detects the consistency of knowledge within the same document. The system uses a graph-based consistency check algorithm to construct an internal knowledge graph for the document, detecting logical contradictions (such as A causing B and A not causing B simultaneously) and numerical conflicts (such as parameter value differences exceeding 50%). Conflict detection is based on path consistency rules and relation opposition rules, covering consistency constraints for various relation types. The accuracy of internal consistency assessment can reach over 90%.
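
A minimal Python sketch of the two checks named above is given below. The set of opposed relation pairs is hypothetical, and the 50% rule is interpreted here as a relative difference measured against the larger magnitude, which is an assumption because the description does not state the reference value.

```python
# Hypothetical pairs of mutually exclusive relation types.
OPPOSITES = {("causes", "does_not_cause"), ("activates", "inhibits")}


def is_opposed(r1, r2):
    return (r1, r2) in OPPOSITES or (r2, r1) in OPPOSITES


def find_logical_contradictions(triples):
    """Return (head, relation_a, relation_b, tail) for each opposed pair."""
    seen = {}
    conflicts = []
    for head, rel, tail in triples:
        for prev in seen.get((head, tail), []):
            if is_opposed(prev, rel):
                conflicts.append((head, prev, rel, tail))
        seen.setdefault((head, tail), []).append(rel)
    return conflicts


def relative_difference(x, y):
    """Difference relative to the larger magnitude (assumed convention)."""
    denom = max(abs(x), abs(y))
    return abs(x - y) / denom if denom else 0.0


def find_numeric_conflicts(values, threshold=0.5):
    """values maps a parameter name to all values reported in the document."""
    return {
        name: vals
        for name, vals in values.items()
        if len(vals) > 1
        and relative_difference(min(vals), max(vals)) > threshold
    }
```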

[0093] Source reliability assessment: This evaluates the authority of the literature source. Assessment indicators include journal impact factor (divided into 5 levels: >10, 5-10, 3-5, 1-3, <1), publisher reputation (divided into 4 levels: top, high, medium, general), author h-index (divided into 4 levels: >50, 30-50, 10-30, <10), and institution type (divided into 5 categories: top research institutions, general research institutions, government agencies, industry institutions, others).

[0094] Research quality assessment is based on indicators such as research design, sample size, and methodological rigor. The assessment criteria include research type (divided into 7 categories: meta-analysis, randomized controlled trial, cohort study, case-control study, cross-sectional study, laboratory study, and expert opinion), sample size (divided into 4 levels: >1000, 100-1000, 10-100, <10), statistical methods (divided into 3 levels: advanced, standard, and basic), and research completeness (divided into 3 levels: complete, partially complete, and incomplete).

[0095] The system calculates a comprehensive quality score based on the above five dimensions. The formula for the comprehensive quality score is as follows:

$$Q = \sum_{i=1}^{5} w_i S_i$$

[0096] where $S_1$ to $S_5$ are the scores of the individual dimensions and $w_1$ to $w_5$ are the weight coefficients for each dimension, set by default to $w_1 = 0.3$, $w_2 = 0.2$, $w_3 = 0.15$, $w_4 = 0.15$, and $w_5 = 0.2$, and satisfying $\sum_{i=1}^{5} w_i = 1$. The threshold for high-quality knowledge is set at 0.8, the threshold for medium-quality knowledge is set at 0.6, and knowledge below 0.5 is filtered out. The screening accuracy (the proportion of useful knowledge retained) can reach over 90%.
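
As an illustration, and assuming that the comprehensive quality score is a weighted sum of five dimension scores in [0, 1], the scoring and threshold filtering may be sketched in Python as follows. The order in which the weights are paired with the dimensions is not recoverable from the description, and the label "low" for scores between 0.5 and 0.6 is an assumption, since that band is not named.

```python
WEIGHTS = (0.3, 0.2, 0.15, 0.15, 0.2)  # default weights; they sum to 1


def composite_quality(scores, weights=WEIGHTS):
    """Weighted sum of the per-dimension scores."""
    if len(scores) != len(weights):
        raise ValueError("one score per weight is required")
    return sum(w * s for w, s in zip(weights, scores))


def quality_tier(q):
    """Classify a composite score using the stated thresholds."""
    if q >= 0.8:
        return "high"
    if q >= 0.6:
        return "medium"
    if q >= 0.5:
        return "low"  # band not named in the description; assumed label
    return "filtered"
```

For example, dimension scores of 0.9, 0.8, 0.7, 0.6, and 0.9 give 0.27 + 0.16 + 0.105 + 0.09 + 0.18 = 0.805, which falls in the high-quality band.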

[0097] 2.4.2 Evidence Level Assessment: The system adopts an evidence grading system applicable to the environmental health field to assess the evidentiary strength of knowledge. The system uses a modified version of the GRADE method (Grading of Recommendations Assessment, Development and Evaluations) to classify evidence into six levels: Level I is evidence from high-quality systematic reviews or meta-analyses; Level II is evidence from well-designed randomized controlled trials; Level III is evidence from well-designed non-randomized controlled studies; Level IV is evidence from cohort studies or case-control studies; Level V is evidence from cross-sectional studies or case series; and Level VI is evidence from expert opinions or mechanistic speculations.

[0098] Evidence Level Determination: The system automatically assigns evidence levels based on factors such as research type, design quality, sample size, and analysis methods. The determination algorithm uses a decision tree model, containing over twenty determination rules covering seven research types and 15 common research design variations. The model accuracy (consistent with expert ratings) reaches over 90%.
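
The study-type branch of the level determination can be sketched as a simple lookup in Python. This is a deliberately reduced illustration: the decision tree described above also weighs design quality, sample size, and analysis methods, none of which is modeled here, and the key names are hypothetical.

```python
# Study type to evidence level, following the six-level scheme above.
EVIDENCE_LEVELS = {
    "systematic_review": "I",
    "meta_analysis": "I",
    "randomized_controlled_trial": "II",
    "non_randomized_controlled_study": "III",
    "cohort_study": "IV",
    "case_control_study": "IV",
    "cross_sectional_study": "V",
    "case_series": "V",
    "expert_opinion": "VI",
    "mechanistic_speculation": "VI",
}


def evidence_level(study_type):
    """Return the level (I-VI) for a study type, ignoring design quality."""
    key = study_type.strip().lower().replace("-", "_").replace(" ", "_")
    if key not in EVIDENCE_LEVELS:
        raise ValueError(f"unknown study type: {study_type!r}")
    return EVIDENCE_LEVELS[key]
```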

[0099] Evidence Comprehensive Assessment: When there are multiple sources for the same knowledge point, the system performs a comprehensive evidence assessment. The formula for the comprehensive evidence scoring is:

[0100] where the three weight coefficients are set to 0.4, 0.3, and 0.3. Evidence consistency is assessed statistically, and the threshold is set to 50%.

[0101] Evidence Level Labeling: The system labels each knowledge triple with evidence level information, including a level rating (I-VI), credibility score (0-1), a list of evidence sources, and a description of uncertainty. The level information is stored as structured metadata, supporting subsequent filtering and reasoning.

[0102] 2.4.3 Knowledge Conflict Detection and Handling: The system detects conflicts between newly extracted knowledge and existing knowledge bases, and implements intelligent conflict handling. Conflict Type Identification: The system identifies four main types of conflicts: direct contradictions (e.g., "A causes B" vs. "A does not cause B"), numerical inconsistencies (e.g., different dose-response coefficients), conditional conflicts (surface conflicts caused by different applicable conditions), and temporal evolution conflicts (conflicts caused by knowledge updates over time). Conflict detection uses a knowledge graph traversal algorithm combined with semantic comparison, achieving an accuracy rate of over 90%.

[0103] Conflict Analysis: The system performs root cause analysis on detected conflicts, categorizing them into genuine conflicts (differences in research results) and statement conflicts (differences in expression). Analysis dimensions include comparison of research designs, sample characteristics, measurement methods, and publication dates. The analysis algorithm uses feature-weighted scoring, with research design weighted at 0.4, sample characteristics at 0.3, measurement methods at 0.2, and publication date at 0.1. Conflict classification accuracy can reach over 85%.

[0104] 2.4.4 Knowledge Screening and Optimization: Based on the quality assessment results, the system screens and optimizes knowledge. Quality Threshold Filtering: The system sets multiple quality thresholds and filters based on a comprehensive quality score. The threshold for high-quality knowledge is set at 0.8, the threshold for medium-quality knowledge is set at 0.6, and knowledge below 0.5 is filtered out. The filtering accuracy (the proportion of useful knowledge retained) can reach over 90%.

[0105] Deduplication and merging: The system identifies and processes duplicate or similar knowledge entries.

[0106] Standardized expression: The system unifies the format of knowledge expression to improve readability and consistency.

[0107] Association Enhancement: This feature analyzes isolated knowledge points, adds missing implicit associations, and enhances the connectivity of the knowledge network. Association enhancement uses algorithms based on path heuristics and similarity reasoning, adding an average of 2-3 new associations per entity, achieving an enhancement coverage of over 75%. Newly added associations are marked as inference types, and the basis for the inference is recorded to support subsequent verification.

[0108] The filtered knowledge triples are output in standard JSON format, containing entity information, relation information, attribute sets, evidence information, and metadata. The output format is designed to be compatible with the input requirements of the knowledge graph construction module, ensuring seamless integration. The quality assessment submodule processes an average of tens of triples per second, supporting batch processing and incremental updates.
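
One possible shape of such a JSON record is sketched below in Python. All field names and values are illustrative placeholders rather than the actual output schema; only the five top-level groups (entity information, relation information, attribute set, evidence information, and metadata) follow the description.

```python
import json

# Illustrative record; identifiers and values are placeholders, not results.
triple = {
    "head": {"id": "ENT:PM2.5", "name": "PM2.5", "category": "Pollutant"},
    "relation": {"type": "increases_risk", "standard": True},
    "tail": {"id": "ENT:ASTHMA", "name": "asthma", "category": "HealthEffect"},
    "attributes": {"applicable_population": "children"},
    "evidence": {
        "level": "IV",
        "credibility": 0.8,
        "sources": ["example-source-id"],
        "uncertainty": "placeholder description",
    },
    "metadata": {"quality_score": 0.82, "extracted_at": "2026-01-01"},
}

# Serialize for hand-off to the graph construction module.
serialized = json.dumps(triple, ensure_ascii=False, indent=2)
```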

[0109] 3. Environmental Health Knowledge Graph Construction Module

[0110] As shown in Figure 3, the environmental health knowledge graph construction module is one of the core components of the system, responsible for transforming knowledge triples extracted from literature into a structured environmental health knowledge graph. This module consists of four key sub-modules: ontology model design, graph construction and fusion, knowledge graph representation and storage, and knowledge graph visualization and exploration. These sub-modules work collaboratively to form a complete construction process from knowledge triples to an interactive knowledge graph.

[0111] 3.1 Ontology Model Design Submodule

[0112] The ontology model design submodule aims to construct a conceptual system and semantic framework for the environmental health domain, providing a structured organizational solution for knowledge graphs. This submodule includes a domain concept analysis unit, a relation type definition unit, an attribute system design unit, and an ontology evolution management unit.

[0113] 3.1.1 Domain Concept Analysis Unit: This unit is responsible for identifying and classifying core concepts in the field of environmental health. The analysis process employs a combination of top-down and bottom-up methods. The top-down approach determines the high-level conceptual framework through interviews with domain experts and analysis of authoritative literature. The bottom-up approach extracts candidate concepts from a large volume of literature through text mining and cluster analysis. The selection of core concepts uses an importance scoring mechanism; the scoring formula is as follows:

[0114] where the three inputs are the frequency of the concept in the literature (normalized to 0-1), the degree of connectivity between the concept and other concepts (normalized to 0-1), and the expert score (0-1). The system sets an importance threshold of 0.65; concepts scoring above this value are included in the core ontology. Concept classification uses a hierarchical clustering algorithm, calculating semantic similarity between concepts using concept vectors (300 dimensions) generated by Word2Vec. The clustering threshold is set to 0.75, forming a multi-level concept category system. The system ultimately defines 7 core entity categories: Pollutant, Environmental Media, Exposure Pathway, Health Effect, Biomarker, Mechanism, and Intervention. Each category has 2-4 subcategories, forming over a hundred category nodes in total.

[0115] The ontology concept system is constructed based on national standard terminology, fully integrating standard terms and relational definitions from the "Environmental and Health Data Dictionary". For the classification of environmental pollutants, health effects, and exposure pathways, it strictly follows the classification framework in the "General Guidelines for Ecological and Environmental Health Risk Assessment" (HJ 1111-2020) to ensure consistency between knowledge representation and national standards. The system also establishes a mapping relationship between standard terminology and scientific literature terminology to resolve differences in terminology expression. For example, it uniformly maps variant expressions such as "PM2.5", "fine particulate matter", and "fine-diameter particulate matter" in the literature to the standard term "PM2.5 (particulate matter with a particle size less than or equal to 2.5 μm)".

[0116] In practical applications, the operational steps of concept analysis are as follows: (a) Import approximately 10,000 high-quality environmental health documents into the system as the analysis corpus; (b) Use a term extraction algorithm (based on CRF and a domain dictionary) to extract specialized terms from the documents, initially obtaining approximately 20,000 candidate terms; (c) Apply filtering rules (minimum frequency of 50 occurrences per 10 million words, minimum professional score of 0.7) and deduplication to retain approximately 10,000 effective terms; (d) Use a hierarchical clustering algorithm to organize the terms into a multi-level category structure; (e) Verify and optimize the category system through expert review (by multiple experts in the field of environmental health), and finally determine the conceptual hierarchy of the ontology.

[0117] 3.1.2 Relationship Type Definition Unit: This unit is responsible for designing the semantic relationship types between concepts in the environmental health domain. The relationship type design adopts a goal-oriented approach, determining necessary relationship types based on core application scenarios (risk assessment, mechanism analysis, intervention analysis). Relationship extraction employs a semi-automatic method. First, dependency parsing is used to extract verb phrases from the literature, generating candidate relationship expressions. Then, semantic similarity clustering is used to calculate the similarity between relationship expressions based on word vectors (generated using domain-BERT), with a clustering threshold set to 0.82, merging semantically similar expressions. Finally, after expert review, standard relationship types and variant mapping rules are determined. The system ultimately defines several standard relationship types, including physicochemical relationships (such as "dissolves in," "converts into," etc.), biological relationships (such as "activates," "inhibits," etc.), epidemiological relationships (such as "increases risk," "leads to," etc.), and intervention relationships (such as "mitigates," "prevents," etc.). Each relation type defines domain and range constraints. For example, the domain of the "increase risk" relation is "pollutant" or "mechanism," and the range is "health effect." The relation types are organized in a three-level structure: the top level is divided into four categories (causal, association, compositional, and functional relations), the second level is divided into 12 categories, and the third level is the specific relation type.
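
The domain and range constraints can be checked with a small lookup, sketched below in Python. Only the "increase risk" constraint comes from the description; the identifier spellings and the function name `satisfies_constraints` are assumptions.

```python
# Domain and range constraints per relation type (one example from the text).
CONSTRAINTS = {
    "increases_risk": {
        "domain": {"Pollutant", "Mechanism"},
        "range": {"HealthEffect"},
    },
}


def satisfies_constraints(head_category, relation, tail_category):
    """True if the triple's categories fit the relation's domain and range."""
    rule = CONSTRAINTS.get(relation)
    if rule is None:
        return False  # unknown relation types are rejected in this sketch
    return head_category in rule["domain"] and tail_category in rule["range"]
```

A triple such as (Biomarker, increases_risk, HealthEffect) would be rejected, because "Biomarker" is outside the relation's domain.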

[0118] 3.1.3 Attribute System Design Unit: This unit is responsible for defining attribute sets for concepts and relationships. Attribute design follows the principles of completeness, standardization, and scalability, and is determined based on domain requirements and data availability. For entity categories, the system defines general attributes (such as ID, name, description, source, etc.) and category-specific attributes. For example, specific attributes for the "pollutant" category include CAS number, molecular formula, half-life, toxicity rating, etc.; specific attributes for the "health effect" category include ICD code, affected organs, severity, reversibility, etc. For relationship types, the system defines general attributes such as relationship strength, level of evidence, time characteristics, and applicable conditions, as well as relationship-specific attributes. For example, specific attributes for the "increased risk" relationship include hazard ratio, dose dependence, population specificity, etc. Attribute value types support string, numeric, date, enumeration, list, and composite types. Attribute definitions adopt JSON Schema format, supporting type validation, value range constraints, and default value settings. The system defines over a hundred attributes for multiple core entity categories and dozens of attributes for various relationship types.

[0119] 3.1.4 Ontology Evolution Management Unit: This unit implements version control and evolution management of the ontology model. The system adopts a dual-track ontology management strategy: a core ontology (a stable ontology determined by expert consensus) and an extended ontology (parts dynamically expanded based on new knowledge). Ontology version control uses semantic version numbers (Major.Minor.Patch format). Major version changes indicate incompatible structural changes, Minor version changes indicate backward-compatible feature enhancements, and Patch version changes indicate backward-compatible bug fixes. Ontology updates follow a proposal-review-release process: new concepts or relationships are first submitted as proposals, and after expert review, it is determined whether to include them in the core or extended ontology. Ontology consistency checks include three aspects: logical consistency (e.g., circular dependency detection), structural consistency (e.g., orphaned node detection), and semantic consistency (e.g., constraint violation detection), performed using automated verification tools. Ontology documentation is automatically generated, including a concept dictionary, relation tables, and a graphical category hierarchy.
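
The Major.Minor.Patch rule can be illustrated with the following Python sketch. The change-kind labels are hypothetical names for the three cases above; resetting the lower components to zero follows the usual semantic versioning convention.

```python
def bump(version, change):
    """Return the next ontology version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "incompatible":   # incompatible structural change
        return f"{major + 1}.0.0"
    if change == "enhancement":    # backward-compatible enhancement
        return f"{major}.{minor + 1}.0"
    if change == "fix":            # backward-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change!r}")
```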

[0120] 3.2 Graph Construction and Fusion Submodule

[0121] The knowledge graph construction and fusion submodule is responsible for transforming knowledge triples into knowledge graphs that conform to the ontology model and fusing them with external knowledge sources. This submodule includes an entity linking unit, a relation mapping unit, a knowledge graph construction unit, an external knowledge fusion unit, and a standard knowledge integration unit.

[0122] 3.2.1 Entity Linking Unit: This unit is responsible for linking entity representations in knowledge triples to standard entities in the ontology model. Entity linking employs a four-level strategy. The levels are executed in sequence, and a successful match at one level skips the subsequent levels: exact matching (using a standard entity dictionary containing tens of thousands of entities, achieving over 99% accuracy and approximately 40%-50% coverage), fuzzy matching (using edit distance and character n-gram similarity, with a threshold of 0.85, achieving over 95% accuracy and increasing coverage by approximately 20%-25%), semantic matching (using cosine similarity of entity embedding vectors, with a threshold of 0.92, achieving over 95% accuracy and increasing coverage by approximately 15%-20%), and contextual matching (a deep matching model considering entity context information, achieving over 90% accuracy and increasing coverage by approximately 5%-10%). For entities that cannot be linked (approximately 2%), the system creates temporary entities and marks them as "pending verification". Entity linking also includes an entity disambiguation step. For ambiguous entities, the system uses a context-aware disambiguation algorithm that calculates the relevance score of candidate entities based on BERT context embeddings, achieving a disambiguation accuracy of over 90%. Entity category assignment combines rule-based and machine learning methods: rules cover common patterns, while machine learning handles complex cases, achieving a classification accuracy of over 90%.

[0123] 3.2.2 Relation Mapping Unit: This unit is responsible for mapping relation representations in triples to standard relation types in the ontology model. Relation mapping employs a two-stage strategy: First, a rule-based mapper is used to handle common relation expressions through relation expression templates and keyword matching. The rule base contains thousands of mapping rules, covering approximately 60%-70% of relation representations. Then, a BERT-based relation classifier is used to handle complex or ambiguous relation expressions. The classifier uses a fine-tuned BERT model, taking the relation context text as input and outputting the probability distribution of 42 standard relation types, achieving a classification accuracy of over 85% and increasing coverage to over 90%. For relations that still cannot be mapped (approximately 5%), the system retains the original representation and marks it as a "non-standard relation," which is then manually reviewed by experts. Relation direction determination uses rules based on dependency syntax and semantic roles, achieving an accuracy of over 95%. Relationship attribute extraction uses specialized extractors for specific attributes, such as numerical extractors (handling numerical attributes such as risk ratio and p-value), time extractors (handling duration, lag time, etc.), and conditional extractors (handling applicable population, environmental conditions, etc.).

[0124] 3.2.3 Knowledge Graph Construction Unit: This unit is responsible for constructing a complete knowledge graph based on linked entities and mapped relationships. Graph construction employs an incremental approach, supporting the updating of existing knowledge and the addition of new knowledge. The entity node creation process includes: checking if the entity already exists (using standard identifiers or semantic matching); for new entities, creating nodes and setting categories and attributes; for existing entities, updating attributes or merging information. The relationship edge creation process includes: checking if the relationship already exists (based on head and tail entities and relationship type); for new relationships, creating edges and setting attributes; for existing relationships, deciding whether to update, retain, or mark conflicts based on evidence strength and timeliness. Consistency checks during graph construction include type consistency (checking whether entities and relationships conform to ontology constraints), logical consistency (detecting contradictory knowledge, such as A causing B versus A not causing B), and structural integrity (detecting isolated nodes and dangling relationships). Graph optimization includes redundant relationship elimination, transitive relationship compression, and synonym entity merging; the optimized graph query performance can be improved by more than 30%.
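
The incremental handling of relationship edges may be sketched in Python as follows. The decision rule is an assumption made for illustration: stronger evidence wins, newer knowledge wins at equal evidence strength, and a differing value at equal strength and year is marked as a conflict. The description states only that evidence strength and timeliness are considered.

```python
def upsert_edge(edges, head, relation, tail, value, evidence_rank, year):
    """Insert or reconcile an edge; evidence_rank 1 (Level I) is strongest.

    `edges` maps (head, relation, tail) to a dict of edge attributes.
    Returns "created", "updated", "retained", or "conflict".
    """
    key = (head, relation, tail)
    new = {"value": value, "evidence_rank": evidence_rank, "year": year}
    old = edges.get(key)
    if old is None:
        edges[key] = new
        return "created"
    # Smaller tuples are preferred: stronger evidence first, then newer year.
    new_pref = (evidence_rank, -year)
    old_pref = (old["evidence_rank"], -old["year"])
    if new_pref < old_pref:
        edges[key] = new
        return "updated"
    if new_pref == old_pref and value != old["value"]:
        old["conflict"] = True  # keep the old value and flag it for review
        return "conflict"
    return "retained"
```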

[0125] 3.2.4 External Knowledge Fusion Unit: This unit is responsible for fusing external knowledge sources with the knowledge graph constructed by this system, expanding the knowledge coverage. The system supports integration with more than ten external knowledge bases, including chemical substance databases (such as PubChem, ChEBI), medical knowledge bases (such as UMLS, DisGeNET), environmental databases (such as EPA CompTox, ATSDR), and biological knowledge bases (such as Reactome, KEGG). Knowledge fusion adopts a two-step approach: entity alignment and relation mapping. First, entity alignment is performed using a multi-feature matching algorithm (including name similarity, structural similarity, attribute consistency, and context similarity), achieving an entity alignment accuracy of over 90%. Then, relation mapping is performed, mapping the relation schemas of external knowledge bases to the ontology model, achieving a relation mapping accuracy of over 85%. The fusion process employs a quality control mechanism, including credibility assessment (based on the authority of the knowledge source and the timeliness of the data), consistency checks (identifying and resolving conflicting knowledge), and expert verification (manual review of key knowledge points). The system implements incremental fusion updates, and synchronizes with external knowledge bases regularly (quarterly). On average, each update adds tens of thousands to hundreds of thousands of entities and hundreds of thousands of relationships.

[0126] 3.2.5 Standard Knowledge Integration Unit: This unit is responsible for integrating the parsed environmental health standard knowledge into the knowledge graph. The integration process includes four steps: First, the system maps the structured knowledge parsed from the standards to the ontology model, including mapping standard clauses to knowledge entities and standard requirements to attributes or relationships; second, the system establishes connections between standard knowledge and scientific knowledge, such as linking limit requirements in the standards with relevant research evidence to construct a "limit-evidence" relationship network; then, the system handles cross-references and dependencies between standards to construct a complete standard relationship network; finally, the system adds time and source attributes to all standard knowledge to support version management and source tracing queries.

[0127] The integration of standard knowledge adopts the "authoritative priority" principle. When there are discrepancies between standard knowledge and scientific literature knowledge, the standard definition is adopted by default, while scientific knowledge is retained as supplementary information. For example, for default values of exposure parameters, the system prioritizes the recommended values in the "Basic Dataset for Exposure Parameter Survey" (HJ 968-2019), while also associating them with parameter distribution characteristics reported in scientific research to provide users with a more comprehensive reference. Regarding standard updates, the system adopts a time-based version management strategy, retaining historical versions of standard knowledge while clearly marking their validity period, supporting queries for specific versions of standard requirements by time point.
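
A minimal Python sketch of the time-point query and the "authoritative priority" rule follows. The record layout, the half-open validity interval, and the placeholder values are assumptions made for illustration and do not reproduce any actual standard.

```python
from datetime import date


def value_at(versions, when):
    """Return the standard value in force on `when`, or None.

    Each version is a dict with valid_from, valid_to (None means still in
    force), and value; validity is treated as [valid_from, valid_to).
    """
    for version in versions:
        started = version["valid_from"] <= when
        not_ended = version["valid_to"] is None or when < version["valid_to"]
        if started and not_ended:
            return version["value"]
    return None


def resolve(standard_value, literature_values):
    """Prefer the standard value; keep literature values as supplementary."""
    source = "standard" if standard_value is not None else None
    return {
        "value": standard_value,
        "source": source,
        "supplementary": list(literature_values),
    }


# Placeholder history of one standard requirement.
history = [
    {"valid_from": date(2010, 1, 1), "valid_to": date(2020, 1, 1), "value": "limit_v1"},
    {"valid_from": date(2020, 1, 1), "valid_to": None, "value": "limit_v2"},
]
```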

[0128] The technical challenge of standard knowledge integration lies in handling the differences in the structure and expression of standard content. The system employs a knowledge normalization strategy to unify the same content in different forms of expression into a standardized representation. For example, for air quality limits, different standards may use different units or time periods; the system automatically performs unit conversions and period adjustments to ensure data comparability. For qualitative requirements, the system uses semantic normalization methods to identify the same requirements expressed differently as equivalent relationships. The overall accuracy of standard knowledge integration can reach over 95%, significantly enhancing the authority and practical value of the knowledge graph.
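
The unit conversion step may be illustrated by the following Python sketch, which normalizes mass concentrations to micrograms per cubic meter. The unit spellings and the function name are assumptions; averaging-period adjustment and volume-based units such as ppm, which require additional physical parameters, are not covered.

```python
# Conversion factors to micrograms per cubic meter.
TO_UG_PER_M3 = {"ug/m3": 1.0, "mg/m3": 1000.0, "ng/m3": 0.001}


def to_ug_per_m3(value, unit):
    """Convert a mass concentration to ug/m3; reject unsupported units."""
    try:
        return value * TO_UG_PER_M3[unit]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None
```

With both limits expressed in the same unit, a limit of 0.035 mg/m3 in one standard and 35 ug/m3 in another can be recognized as equivalent.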

[0129] Through the collaborative work of the aforementioned units, the graph construction and fusion submodule can transform scattered knowledge triples into a structured and standardized environmental health knowledge graph, and continuously expand and enrich the graph content through external knowledge fusion. The environmental health knowledge graph currently constructed by the system contains approximately one million entity nodes and millions of relation edges, covering thousands of environmental pollutants and thousands of health effects. The entity category coverage rate reaches over 90%, and the relation type coverage rate reaches over 90%, providing a solid foundation for subsequent knowledge reasoning and applications.

[0130] 3.3 Knowledge Graph Representation and Storage Submodule

[0131] The knowledge graph representation and storage submodule is responsible for the efficient representation, storage, and querying of the environmental health knowledge graph. This submodule includes a graph data model design unit, a distributed storage optimization unit, an indexing and query optimization unit, and a data consistency guarantee unit.

[0132] 3.3.1 Graph Data Model Design Unit: This unit is responsible for designing a graph data model suitable for representing environmental health knowledge. The system adopts a property graph model, supporting labeled nodes, typed edges, and key-value pair attributes.
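
As a minimal sketch, the property graph model can be expressed with two Python data classes. The class and field names, identifiers, and property values are hypothetical; the sketch shows only the three ingredients named above: labeled nodes, typed edges, and key-value attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: frozenset                               # e.g. {"Pollutant"}
    properties: dict = field(default_factory=dict)  # key-value attributes


@dataclass
class Edge:
    source: str                                     # node_id of the head
    edge_type: str                                  # e.g. "increases_risk"
    target: str                                     # node_id of the tail
    properties: dict = field(default_factory=dict)


# Illustrative instances with placeholder values.
pm25 = Node("ENT:PM2.5", frozenset({"Pollutant"}), {"name": "PM2.5"})
asthma = Node("ENT:ASTHMA", frozenset({"HealthEffect"}), {"name": "asthma"})
edge = Edge(pm25.node_id, "increases_risk", asthma.node_id,
            {"evidence_level": "IV"})
```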

[0133] 3.3.2 Distributed Storage Optimization Unit: This unit implements a distributed storage scheme for large-scale environmental health knowledge graphs. The system adopts a distributed architecture combining sharding and replication, supporting horizontal scaling and high availability. The sharding strategy is based on graph topology characteristics, employing an "edge-cut" method to divide the graph into multiple shards according to node type and connection mode.

[0134] 3.3.3 Indexing and Query Optimization Unit: This unit implements efficient indexing and query optimization for the environmental health knowledge graph. The system constructs a multi-level indexing system, including: entity tag index (B+ tree index, accelerating type-based queries), attribute index (composite B+ tree index, supporting fast lookup of highly selective attributes), full-text index (based on inverted index, supporting text search of entity names and descriptions), time index (a dedicated index for time attributes), and spatial index (R-tree index, supporting geolocation queries). For common query patterns in the environmental health field, the system pre-builds more than twenty query templates, covering over 80% of query needs.

[0135] 3.3.4 Data Consistency Assurance Unit: This unit ensures the consistency and reliability of the environmental health knowledge graph data. The system implements a multi-layered consistency assurance mechanism: at the architecture level, it adopts a master-slave replication mode, supporting both synchronous replication (ensuring strong consistency) and asynchronous replication (optimizing performance). The default configuration is to synchronously replicate the master node and one slave node, and asynchronously replicate the remaining slave nodes.

[0136] 3.4 Knowledge Graph Visualization and Exploration Submodule

[0137] The Knowledge Graph Visualization and Exploration submodule provides intuitive and interactive browsing and retrieval functions for environmental health knowledge graphs. This submodule includes multi-level visualization units, interactive graph exploration units, intelligent knowledge search units, and personalized knowledge recommendation units.

[0138] 3.4.1 Multi-level Visualization Representation Unit: This unit implements multi-level visualization of the knowledge graph. The system supports three visualization levels: the global overview layer (displaying the overall structure and main category distribution of the knowledge graph, using a force-directed layout algorithm to cluster entities by category), the domain exploration layer (displaying a knowledge subgraph for a specific domain, such as all associated knowledge of a specific pollutant, using a hierarchical or radial layout), and the detailed viewing layer (displaying complete attribute information and direct associations of a single entity or relationship). Visualization rendering uses WebGL technology, supporting smooth interaction with large-scale graphs (millions of nodes), maintaining a frame rate above 30fps. The visual encoding of nodes and edges includes color (representing entity category or relationship type), size (representing connectivity or importance), shape (distinguishing different attribute features), and labels (displaying names or key attributes).

[0139] 3.4.2 Interactive Knowledge Graph Exploration Unit: This unit allows users to explore the knowledge graph through interactive operations. Supported interactive operations include: node expansion (clicking a node displays its directly related entities and relationships), path query (specifying a start and end point to visualize the connecting path), subgraph filtering (filtering subgraphs by entity category, relationship type, evidence level, etc.), and time slicing (selecting a specific time range to view the status of knowledge during that period). The system supports a bookmark function, allowing users to save views and query results of interest for later review.

[0140] 3.4.3 Intelligent Knowledge Search Unit: This unit supports multiple knowledge retrieval methods, including keyword search (supporting fuzzy matching and synonym expansion), structured query (combined queries based on entity category, relationship type, and attribute conditions), and natural language query (converting user questions into graph query statements through the natural language understanding module). Search results are sorted by relevance and support secondary filtering based on evidence level, timeliness, and source reliability.

[0141] 3.4.4 Personalized Knowledge Recommendation Unit: This unit proactively recommends relevant knowledge content based on the user's browsing history, query records, and research interests. The recommendation algorithm combines collaborative filtering and content-based recommendation methods, utilizing the structural features of the knowledge graph to identify knowledge points related to the user's interests. The system also supports a knowledge update reminder function, automatically pushing notifications to the user when new knowledge is added to the knowledge domain they are interested in or when existing knowledge is updated.

[0142] 4. Knowledge Reasoning Engine Module

[0143] As shown in Figure 4, the knowledge reasoning engine module is the core intelligent component of the system, responsible for implementing complex knowledge reasoning functions based on the environmental health knowledge graph. This module includes a multimodal knowledge representation submodule, a hybrid reasoning strategy submodule, a causal chain discovery and verification submodule, and an uncertainty reasoning and representation submodule. These submodules work collaboratively to form a complete capability chain from knowledge representation to intelligent reasoning.

[0144] 4.1 Multimodal Knowledge Representation Submodule

[0145] The multimodal knowledge representation submodule is responsible for transforming the environmental health knowledge graph into a computable representation, supporting efficient knowledge reasoning and analysis. This submodule includes a knowledge graph embedding unit, a semantic relation representation unit, a temporal knowledge representation unit, and a probabilistic knowledge representation unit.

[0146] 4.1.1 Knowledge Graph Embedding Unit: This unit implements the distributed representation of the environmental health knowledge graph, mapping entities and relationships to a low-dimensional vector space. The system implements three core embedding algorithms: TransE (modeling relationships as translation operations in the entity embedding space, suitable for one-to-one relationships), RotatE (modeling relationships as rotation operations on the complex plane, suitable for complex relationship patterns), and ComplEx (using complex vectors to represent entities and relationships, handling asymmetric relationships). The embedding training configuration is as follows: embedding dimension 256 (TransE and RotatE are real vectors, ComplEx is a 128-dimensional complex vector), negative sample ratio 5:1, self-adversarial negative sampling to improve training efficiency, margin parameter γ=24, 100 training epochs, batch size 1024, Adam optimizer, initial learning rate 1e-4, and a step learning rate decay strategy (multiplying the learning rate by 0.5 every 20 epochs). The training data contains all entities and relationships in the graph, approximately one million entity nodes and millions of relationship edges. Training is performed on a multi-GPU cluster, with a total training time of approximately 10-15 hours. Model evaluation metrics include: link prediction (MRR exceeding 0.4, Hits@10 exceeding 0.85), relation prediction (MRR exceeding 0.35, Hits@10 exceeding 0.8), and triple classification (accuracy exceeding 0.9). The embedding update strategy combines full retraining (monthly) with incremental updates (weekly). Embedding applications include: entity similarity calculation (using cosine similarity, threshold 0.85), relation inference (vector space algebraic operations), and knowledge completion (link prediction).
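The translation idea behind TransE, the cosine-similarity check, and the step decay schedule described above can be illustrated with a minimal sketch. The three-dimensional toy vectors and the function names are assumptions made for illustration only; the source describes 256-dimensional embeddings and does not specify an implementation.

```python
import math

def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between (head + relation) and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def step_decay_lr(epoch, initial_lr=1e-4, factor=0.5, step=20):
    """Learning rate after step decay: multiplied by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

# Hypothetical toy embeddings (3 dimensions instead of 256).
pm25 = [0.9, 0.1, 0.3]
causes = [-0.4, 0.5, 0.2]
asthma = [0.5, 0.6, 0.5]

score = transe_score(pm25, causes, asthma)                 # close to 0: plausible triple
similar = cosine_similarity(pm25, [0.8, 0.2, 0.3]) >= 0.85  # threshold from the text
```

A score near zero means the translated head lands close to the tail, which is how TransE ranks candidate triples for link prediction.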

[0147] 4.1.2 Semantic Relation Representation Unit: This unit implements a refined representation of semantic relations in the environmental health domain. The system employs a combination of relation vectors and relation operators to support the representation and computation of complex relations. The relation representation framework includes: basic relation representation (vector representation of various standard relation types), relation strength representation (5 strength levels, from "very weak" to "very strong"), relation directionality (positive, negative, or neutral), and relation timeliness (permanent, long-term, medium-term, short-term, or instantaneous). The system supports relation composition operations based on tensor operations, implementing sequential composition (R3 = R1 ∘ R2) and parallel combination (R3 = R1 ⊕ R2). Relation representation learning adopts a path-based approach, using a path-constrained loss function to train relation embeddings so that the composition property is satisfied. The system also implements context-aware relation representation, where the same relation can have different representations under different conditions. Condition factors include population characteristics (e.g., age, gender, health status), environmental conditions (e.g., temperature, humidity, coexisting pollutants), and time factors (e.g., exposure duration, latency period). Evaluation metrics for relation representation include: path prediction accuracy (0.8 or higher), relation classification accuracy (0.85 or higher), and analogy task accuracy (0.75 or higher).

[0148] 4.1.3 Temporal Knowledge Representation Unit: This unit implements the representation method for temporal knowledge in the environmental health domain. The system supports three time mode representations: point time (an event occurring at a specific moment, such as a policy release date), interval time (a process lasting for a period of time, such as an exposure period), and periodic time (a phenomenon that repeats regularly, such as a seasonal disease outbreak). Time representation uses standardized time points and intervals, supporting absolute time (a specific date and time) and relative time (a time period relative to a reference point). Temporal relationship representation uses Allen's time interval algebra, supporting the representation and reasoning of 13 basic time relationships (such as before, after, during, overlaps, etc.). The system implements the time-bound triple representation, using time information as the constraint condition of the triple, represented as (entity 1, relation, entity 2, [start time, end time]). For the unique temporal patterns in the environmental health field, the system defines specialized temporal relationship types, such as "exposure-effect time window" (representing the time range from exposure to the appearance of an effect), "cumulative effect threshold" (representing the minimum cumulative exposure time required to achieve a health effect), and "recovery period" (representing the time required for the effect to subside). The temporal representation supports multi-level granularity conversion, from years to seconds, and dynamically adjusts according to the application scenario. Temporal knowledge assessment includes temporal reasoning accuracy (accuracy of correctly inferring temporal relationships can reach over 0.9) and temporal pattern recognition capability (F1 score of over 0.8 for recognizing typical temporal patterns in the environmental health field).
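As a rough illustration of the time-bound triple and of Allen's interval algebra mentioned above, the following sketch classifies the relation between two intervals into one of the 13 basic relations. The tuple layout, the example dates, and the function name are assumptions; intervals are assumed to satisfy start < end.

```python
from datetime import date

def allen_relation(a, b):
    """Return the Allen relation of interval a = (start, end) with respect to b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"

# Hypothetical time-bound triples: (entity 1, relation, entity 2, [start, end]).
exposure = ("cohort A", "exposed_to", "benzene", (date(2020, 1, 1), date(2020, 6, 30)))
symptoms = ("cohort A", "shows", "anemia", (date(2020, 3, 1), date(2020, 9, 30)))

relation = allen_relation(exposure[3], symptoms[3])  # "overlaps"
```

Domain-specific relations such as the "exposure-effect time window" could then be expressed as constraints on top of these basic relations.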

[0149] 4.1.4 Probabilistic Knowledge Representation Unit: This unit implements a representation framework for probabilistic knowledge in the environmental health domain, supporting uncertainty reasoning. The system adopts a probabilistic graphical model representation framework, supporting Bayesian networks (representing conditional dependencies between variables) and Markov random fields (representing interactions between variables). Probabilistic representations include parametric representations (using conditional probability tables or potential functions to parameterize variable relationships) and non-parametric representations (using kernel density estimation or Gaussian processes). The system implements mixed variable processing, supporting joint modeling of continuous variables (such as pollutant concentrations and biomarker levels) and discrete variables (such as disease states and exposure levels).

[0150] Through the collaborative work of the aforementioned units, the multimodal knowledge representation submodule realizes the computational representation of environmental health knowledge, transforming the static knowledge graph into a dynamic knowledge resource that supports uncertainty modeling and reasoning. Based on multimodal representation and probabilistic modeling mechanisms, the system can uniformly model different types of variables and provide a structured computational foundation for subsequent weighted reasoning and risk assessment.

[0151] 4.2 Hybrid Reasoning Strategy Submodule

[0152] This submodule integrates multiple inference strategies to flexibly handle different types of environmental health problems. As shown in Figure 4, the hybrid reasoning strategy submodule includes four core components: a rule-based reasoning unit, a case-based reasoning unit, a statistical reasoning unit, and a strategy fusion unit, which together realize intelligent reasoning on environmental health issues from different perspectives.

[0153] The rule-based inference unit employs a combined forward and backward chaining inference mechanism. The operation flow is as follows: First, the system extracts domain rules from the environmental health knowledge graph, including causal rules (e.g., "If substance A is carcinogenic and population B is exposed to substance A, then population B faces a risk of carcinogenesis"), transitive rules (e.g., "A causes B and B causes C, then A may cause C"), and constraint rules (e.g., "The same pollutant cannot both promote and inhibit the same biological process"). Second, the system formalizes these rules as predicate logic formulas, constructing an inference rule base. Then, when a user query is received, the system automatically selects either a forward chaining strategy (deriving conclusions from facts) or a backward chaining strategy (finding supporting evidence for a target conclusion) based on the query type. Finally, the system performs rule matching and inference calculations, generating logical deduction results. The rule engine uses the Rete algorithm to optimize matching efficiency, and an inference caching mechanism reduces redundant calculations, keeping the average response time within 100ms.
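The forward direction of this mechanism can be sketched with a naive fixed-point loop over the two example rules quoted above. This is only an illustration: it does not use the Rete algorithm, and the fact tuples, predicate names, and rule functions are hypothetical.

```python
def carcinogen_risk_rule(facts):
    """If substance A is carcinogenic and population B is exposed to A, B faces a cancer risk."""
    carcinogens = {f[1] for f in facts if f[0] == "carcinogenic"}
    return {("cancer_risk", f[1]) for f in facts
            if f[0] == "exposed_to" and f[2] in carcinogens}

def transitive_cause_rule(facts):
    """If A causes B and B causes C, then A may cause C."""
    causes = [(f[1], f[2]) for f in facts if f[0] == "causes"]
    return {("may_cause", a, c)
            for (a, b) in causes for (b2, c) in causes
            if b == b2 and a != c}

def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_facts = rule(derived) - derived
            if new_facts:
                derived |= new_facts
                changed = True
    return derived

facts = {
    ("carcinogenic", "benzene"),
    ("exposed_to", "refinery workers", "benzene"),
    ("causes", "PM2.5", "airway inflammation"),
    ("causes", "airway inflammation", "asthma"),
}
conclusions = forward_chain(facts, [carcinogen_risk_rule, transitive_cause_rule])
```

Backward chaining would instead start from a goal such as `("cancer_risk", "refinery workers")` and search for facts that satisfy the rule premises.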

[0154] The confidence score of a result produced by the rule-based reasoning unit is calculated from the confidence level of each premise (the confidence level of the i-th premise) and the confidence level of the rule itself. The rule representation adopts the SWRL (Semantic Web Rule Language) format and supports more than 300 rules in the environmental health domain. Rule priority is divided into 5 levels, and the conflict resolution strategy combines priority ranking with a specificity-first principle. Rule reasoning supports non-monotonic logic, allowing previous conclusions to be retracted when new evidence appears.

[0155] The case-based reasoning unit implements analogical reasoning based on a historical environmental health case library. The operation process is as follows: First, the system maintains a structured case library containing over ten thousand environmental health cases. Each case includes four parts: problem description, situational characteristics, solution, and outcome evaluation. Second, when faced with a new problem, the system calculates the similarity between the new problem and each case in the case library. The similarity calculation formula is: $Sim(N, C) = \sum_{j} w_j \cdot sim_j(f_j^{N}, f_j^{C})$, where $w_j$ is the weight of feature j, $sim_j$ is the similarity function for feature j, and $f_j^{N}$ and $f_j^{C}$ are the values of feature j for the new case and the historical case, respectively; feature weights are automatically optimized using machine learning. Then, the system selects the top K cases with the highest similarity (K defaults to 5, but is configurable) and analyzes the solutions to these cases and their applicable conditions. Finally, the system uses a case adaptation algorithm to adjust historical solutions based on the differences between the new and old problems, generating recommended solutions suitable for the current problem. A dynamic case library update mechanism automatically adds newly solved typical cases monthly to ensure the timeliness of knowledge.
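A minimal sketch of the retrieval step follows, assuming the similarity is a weighted sum of per-feature similarities. The feature names, weights, value ranges, and the linear scan (instead of an index structure) are illustrative assumptions, not details taken from the source.

```python
def numeric_similarity(value_range):
    """Build a similarity function for a numeric feature with a known value range."""
    return lambda x, y: 1.0 - abs(x - y) / value_range

def categorical_similarity(x, y):
    """Exact-match similarity for categorical features."""
    return 1.0 if x == y else 0.0

def case_similarity(new_case, old_case, weights, sim_functions):
    """Weighted sum of per-feature similarities between two cases."""
    return sum(w * sim_functions[f](new_case[f], old_case[f]) for f, w in weights.items())

def retrieve_top_k(new_case, case_library, weights, sim_functions, k=5):
    """Return the k most similar historical cases as (similarity, case) pairs."""
    scored = [(case_similarity(new_case, c, weights, sim_functions), c) for c in case_library]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical features and weights (weights sum to 1 so scores stay in [0, 1]).
weights = {"pollutant": 0.5, "concentration": 0.3, "population": 0.2}
sim_functions = {
    "pollutant": categorical_similarity,
    "concentration": numeric_similarity(100.0),
    "population": categorical_similarity,
}
library = [
    {"id": 1, "pollutant": "lead", "concentration": 40.0, "population": "children"},
    {"id": 2, "pollutant": "lead", "concentration": 90.0, "population": "adults"},
    {"id": 3, "pollutant": "arsenic", "concentration": 35.0, "population": "children"},
]
query = {"pollutant": "lead", "concentration": 30.0, "population": "children"}
top_cases = retrieve_top_k(query, library, weights, sim_functions, k=2)
```

The adaptation step would then adjust the solutions of the retrieved cases according to the feature differences between the query and each retrieved case.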

[0156] The implementation parameters of the case-based reasoning unit technology are as follows: Case representation adopts an attribute-value pair structure, with each case containing dozens of feature attributes on average; Case retrieval adopts a KD-tree index structure, which supports fast nearest neighbor lookup, and the retrieval time complexity is O(log n); Case adaptation adopts a regularized set of transformation functions, including four types of adaptation operations: replacement, adjustment, combination, and specialization.

[0157] The statistical inference unit implements uncertainty inference based on probabilistic models. The operation process is as follows: First, the system extracts the probabilistic dependencies between environmental health variables from the knowledge graph and constructs a Bayesian network model. Second, the system learns the parameters of conditional probability tables from environmental health literature and research data, and uses the EM algorithm to handle missing data. Then, when an inference task is received, the system selects an appropriate inference algorithm based on the task type: the variable elimination algorithm for predictive queries, the belief propagation algorithm for diagnostic queries, and the do-calculus method for intervention queries. Finally, the system performs probability calculations and generates inference results containing uncertainty estimates. The system supports approximate inference based on Monte Carlo sampling: when processing large-scale complex networks, a single inference run with tens of thousands of samples takes no more than 500ms on average.
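To make the Monte Carlo approximation concrete, the sketch below answers a diagnostic query on a two-node network (Exposure → Disease) by rejection sampling and compares it with the exact posterior. The network, its probabilities, and the choice of rejection sampling are assumptions for illustration; the source does not state which sampling scheme is used here.

```python
import random

# Hypothetical parameters of a two-node Bayesian network: Exposure -> Disease.
P_EXPOSED = 0.3
P_DISEASE_GIVEN = {True: 0.2, False: 0.05}

def exact_posterior_exposed_given_disease():
    """Exact P(Exposure | Disease) by Bayes' rule."""
    joint_exposed = P_EXPOSED * P_DISEASE_GIVEN[True]
    joint_unexposed = (1 - P_EXPOSED) * P_DISEASE_GIVEN[False]
    return joint_exposed / (joint_exposed + joint_unexposed)

def sampled_posterior_exposed_given_disease(n_samples, seed=0):
    """Approximate P(Exposure | Disease) by rejection sampling."""
    rng = random.Random(seed)
    accepted = 0
    exposed_count = 0
    for _ in range(n_samples):
        exposed = rng.random() < P_EXPOSED
        disease = rng.random() < P_DISEASE_GIVEN[exposed]
        if disease:  # keep only samples consistent with the evidence
            accepted += 1
            exposed_count += int(exposed)
    if accepted == 0:
        raise ValueError("no sample matched the evidence; increase n_samples")
    return exposed_count / accepted

exact = exact_posterior_exposed_given_disease()
approx = sampled_posterior_exposed_given_disease(50_000)
```

Rejection sampling discards every sample that contradicts the evidence, so for rare evidence a weighted scheme such as likelihood weighting or importance sampling is usually preferred.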

[0158] The statistical inference unit technology implementation parameters are as follows: the upper limit of the number of Bayesian network nodes is 200, and the average number of parent nodes per node does not exceed 5; the conditional probability table storage adopts a decision tree structure with a compression rate of over 60%; the inference accuracy control parameter ε=0.01 ensures that the deviation between the inference result and the accurate calculation does not exceed 1%; it supports hybrid Bayesian networks to simultaneously handle discrete variables (such as disease state) and continuous variables (such as pollutant concentration), with continuous variables represented using a conditional Gaussian model.

[0159] The strategy fusion unit intelligently integrates multiple reasoning strategies. The operation process is as follows: First, the system analyzes the user query and extracts query features, including query type (factual, relational, mechanism-based, predictive), degree of structure, uncertainty requirements, and time constraints. Second, based on query features and historical performance evaluation, the system assigns the optimal combination of reasoning strategies and weights to the current query. Then, the system executes the selected reasoning strategies in parallel, obtaining multiple reasoning results. Finally, the system integrates the multiple results using a weighted fusion method to generate the final reasoning answer. The system employs an online learning mechanism to continuously optimize the strategy selection model based on user feedback, improving reasoning accuracy.
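The weighted fusion step can be sketched as a normalized weighted average of the scores returned by each strategy. The strategy names, scores, and weights below are hypothetical; in the described system the weights would come from the strategy selection model.

```python
def fuse_results(results, weights):
    """Weighted average of per-strategy scores, normalized by the total weight."""
    total_weight = sum(weights[name] for name in results)
    if total_weight <= 0:
        raise ValueError("strategy weights must sum to a positive value")
    return sum(weights[name] * score for name, score in results.items()) / total_weight

# Hypothetical confidence scores for one candidate answer from three strategies.
results = {"rule": 0.9, "case": 0.6, "statistical": 0.75}
weights = {"rule": 0.5, "case": 0.2, "statistical": 0.3}
fused_score = fuse_results(results, weights)
```

Normalizing by the total weight keeps the fused score on the same scale when one strategy returns no result and is left out of `results`.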

[0160] 4.3 Causal Chain Discovery and Verification Submodule

[0161] This submodule focuses on the discovery and verification of causal relationships in the field of environmental health. As shown in Figure 4, the causal chain discovery and verification submodule includes four core components: a causal path search unit, a mechanism completion reasoning unit, a causal strength evaluation unit, and a multimodal causal verification unit, realizing a complete causal analysis link from knowledge graph to data verification.

[0162] The causal path search unit enables automatic discovery of causal paths based on a knowledge graph. The operation process is as follows: First, the user specifies a starting entity (e.g., a specific pollutant) and an ending entity (e.g., a specific health effect). Second, the system searches the knowledge graph for all possible paths connecting these two entities using an improved bidirectional breadth-first search algorithm. Then, the system applies path filtering rules, filtering based on path length (default limit ≤ 5 hops), relationship type (must contain causal semantic relationships), and intermediate node type (biological rationality check). Finally, the system ranks the paths that meet the criteria, using ranking indicators including path credibility (cumulative calculation based on relationship confidence), path directness (shorter paths preferred), and path completeness (completeness of mechanism explanation). The system supports interactive path exploration, allowing users to further explore path nodes of interest for detailed explanations.
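The search-and-filter flow can be sketched as follows. For brevity this sketch swaps the improved bidirectional breadth-first search for a simple depth-limited depth-first enumeration; the graph contents, relation names, and the set of causal relation types are hypothetical.

```python
CAUSAL_RELATIONS = {"causes", "induces", "increases_risk_of"}  # assumed causal semantics

def find_paths(graph, start, goal, max_hops=5):
    """Enumerate simple paths from start to goal with at most max_hops edges."""
    paths = []

    def extend(node, visited, edges):
        if node == goal and edges:
            paths.append(list(edges))
            return
        if len(edges) == max_hops:
            return
        for relation, neighbor, confidence in graph.get(node, []):
            if neighbor not in visited:
                edges.append((node, relation, neighbor, confidence))
                extend(neighbor, visited | {neighbor}, edges)
                edges.pop()

    extend(start, {start}, [])
    return paths

def causal_paths(paths):
    """Keep paths containing at least one causal relation, shortest first."""
    kept = [p for p in paths if any(rel in CAUSAL_RELATIONS for _, rel, _, _ in p)]
    return sorted(kept, key=len)

# Hypothetical adjacency list: node -> [(relation, neighbor, edge confidence)].
graph = {
    "PM2.5": [("causes", "airway inflammation", 0.9), ("co_occurs_with", "ozone", 0.7)],
    "airway inflammation": [("causes", "asthma", 0.8)],
    "ozone": [("associated_with", "asthma", 0.6)],
}
all_paths = find_paths(graph, "PM2.5", "asthma")
ranked = causal_paths(all_paths)
```

Only the path through "airway inflammation" survives the filter, because the path through "ozone" carries no causal relation.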

[0163] Path confidence is calculated using the geometric mean of the confidence scores of all edges, using the following formula:

$$C_{path} = \left( \prod_{i=1}^{n} c_i \right)^{1/n}$$

[0164] Where n is the number of edges in the path and $c_i$ is the confidence score of the i-th edge. The parameters for the causal path search unit technology are as follows: The search algorithm uses an improved bidirectional BFS, with a time complexity of $O(b^{d/2})$, where b is the average branching factor and d is the path length; the path filtering rules include more than ten domain rules, formulated based on the consensus of environmental health experts; the priority queue is implemented using a Fibonacci heap, which supports fast minimum value extraction and key-value update.
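The geometric mean of edge confidences can be computed directly. The function name and the example values are illustrative.

```python
import math

def path_confidence(edge_confidences):
    """Geometric mean of the confidence scores of all edges in a path."""
    n = len(edge_confidences)
    if n == 0:
        raise ValueError("a path must contain at least one edge")
    return math.prod(edge_confidences) ** (1.0 / n)

two_hop = path_confidence([0.9, 0.8])         # sqrt(0.72)
with_weak_link = path_confidence([0.9, 0.1])  # 0.3
```

Unlike a plain product, the geometric mean does not penalize a path merely for being longer, while a single weak edge still lowers the score noticeably.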

[0165] The mechanism completion reasoning unit automatically completes missing links in causal chains. The operation process is as follows: First, the system analyzes discovered causal paths to identify potential knowledge gaps, such as the lack of a clear intermediate mechanism between two nodes. Second, based on similar path analysis, the system retrieves structurally similar complete paths from the knowledge graph and extracts their intermediate nodes as candidate fillers. Then, the system applies domain-specific rule reasoning, inferring possible intermediate processes based on biological mechanisms. Finally, the system synthesizes the results of both methods to generate completion suggestions for each missing link, while also annotating the reasoning basis and confidence level. The system allows users to confirm or correct the completion results. Confirmed and valid completions can be added to the knowledge graph and marked as "reasoning generated."

[0166] The formula for calculating the confidence level of the completion result is:

$$C_{completion} = \alpha \cdot C_{path} + \beta \cdot C_{rule}$$

[0167] Here, $C_{path}$ is the confidence obtained from similar path analysis, $C_{rule}$ is the confidence obtained from rule reasoning, and α and β are weighting coefficients that satisfy α+β=1, with default settings of α=0.7 and β=0.3. Knowledge gap identification uses path integrity scoring, triggering completion when the semantic distance between adjacent nodes exceeds a threshold (default 0.6); similar path retrieval uses path embedding technology, mapping paths to a 256-dimensional vector space and calculating path similarity using cosine similarity; rule reasoning uses an ontology reasoning engine, containing over a hundred rules on molecular and cellular biological mechanisms.

[0168] The causal strength assessment unit enables the quantitative assessment of the strength of causal relationships in environmental health. The operational process is as follows: First, the system extracts quantitative indicators of specific causal relationships from the knowledge graph, such as relative risk (RR), population attributable fraction (PAF), and regression coefficient (β). Second, when multiple studies exist, the system applies meta-analysis, considering the sample size, design quality, and publication time of each study, to calculate the weighted average effect size. Then, the system assesses the sufficiency of evidence for the causal relationship, scoring it based on the Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response relationship, plausibility, coherence, experimental evidence, and analogy). Finally, the system generates a comprehensive causal strength assessment report, including quantitative effect estimates, uncertainty intervals, and the level of evidence. The system supports conditional causal strength analysis, assessing changes in causal relationships under different population characteristics or environmental conditions.

[0169] The formula for calculating the Bradford Hill criteria total score is:

$$S_{BH} = \sum_{k=1}^{9} w_k \, s_k$$

[0170] Here $s_k$ is the score of the k-th criterion and $w_k$ is the weight of that dimension; the weights of each dimension are determined using the Delphi method. The meta-analysis employs a random effects model, and the DerSimonian-Laird method is used to estimate heterogeneity among studies; the weights of individual studies are calculated using the following formula:

$$w_i^{*} = \frac{1}{\sigma_i^2 + \tau^2}$$

[0171] where $\sigma_i^2$ is the within-study variance of study i and $\tau^2$ is the between-study variance; the level of evidence is divided into four levels: Strong, Moderate, Weak, and Uncertain, determined based on the total score and key criterion scores.
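A compact sketch of random-effects pooling with the DerSimonian-Laird estimator follows. It uses inverse-variance weights only; adjustments for design quality or publication time mentioned earlier are not modeled, and the effect sizes (assumed to be on a log relative risk scale) are made up for illustration.

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2 estimator.

    effects: per-study effect sizes; variances: per-study within-study variances.
    Returns (pooled effect, tau^2, standard error of the pooled effect).
    """
    k = len(effects)
    fixed_weights = [1.0 / v for v in variances]
    sum_w = sum(fixed_weights)
    fixed_mean = sum(w * y for w, y in zip(fixed_weights, effects)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(fixed_weights, effects))
    c = sum_w - sum(w * w for w in fixed_weights) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights: inverse of within-study plus between-study variance.
    random_weights = [1.0 / (v + tau2) for v in variances]
    sum_rw = sum(random_weights)
    pooled = sum(w * y for w, y in zip(random_weights, effects)) / sum_rw
    std_error = math.sqrt(1.0 / sum_rw)
    return pooled, tau2, std_error

# Hypothetical log relative risks from three studies with equal variances.
pooled, tau2, std_error = dersimonian_laird([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
interval = (pooled - 1.96 * std_error, pooled + 1.96 * std_error)
```

When the studies agree closely, the estimated between-study variance is truncated to zero and the result coincides with the fixed-effect estimate.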

[0172] The multimodal causal verification unit compares and verifies causal paths in the knowledge graph against empirical research results reported in external literature and computes a causal consistency score.

[0173] The causal consistency score takes values in the range [0,1]; the verification results are classified as Confirmed, Partially Supported, Inconclusive, or Not Supported, based on the consistency score and the verification results of the critical path.

[0174] 4.4 Uncertainty Reasoning and Representation Submodule

[0175] This submodule addresses the uncertainty in environmental health knowledge, providing reliable and quantifiable reasoning results. As shown in Figure 4, the uncertainty reasoning and representation submodule includes four core components: an uncertainty representation model unit, a Bayesian inference network unit, an evidence theory inference unit, and an uncertainty propagation and decision unit, realizing a complete process from uncertainty representation to decision support.

[0176] The uncertainty representation model unit realizes a multidimensional representation of the uncertainty of environmental health knowledge. The operation process is as follows: First, the system classifies the sources of uncertainty in knowledge, including random uncertainty (intrinsic variability), cognitive uncertainty (incomplete knowledge), linguistic uncertainty (expressive ambiguity), and measurement uncertainty (observation error). Second, the system selects an appropriate mathematical representation model for each type of uncertainty: probability distribution for random uncertainty, evidence interval for cognitive uncertainty, fuzzy sets for linguistic uncertainty, and error propagation model for measurement uncertainty. Then, the system extracts uncertainty parameters from knowledge graphs and literature data, such as distribution parameters, confidence intervals, and membership functions. Finally, the system constructs a multidimensional uncertainty representation of knowledge entries and stores it in the attribute layer of the knowledge graph. The system supports uncertainty visualization, using probability density maps, confidence intervals, and fuzzy membership functions to intuitively display uncertainty.

[0177] The uncertainty representation model unit technology implementation parameters are as follows: the probability distribution supports 15 common distribution types, including normal distribution, log-normal distribution, Weibull distribution, etc.; the evidence interval representation adopts the interval [Bel(A), Pl(A)], where Bel(A) is the belief function and Pl(A) is the plausibility function; the fuzzy set representation adopts triangular or trapezoidal membership functions, and the domain is divided into 5-7 linguistic variables; the measurement error model is based on the GUM (Guide to the Expression of Uncertainty in Measurement) standard and supports Type A and Type B uncertainty assessment.
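A triangular membership function for one linguistic variable can be sketched as below. The linguistic terms and their breakpoints on a pollutant concentration scale are hypothetical; the function assumes a < b < c.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside (a, c), rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical linguistic terms for a concentration in micrograms per cubic meter.
TERMS = {
    "low": (0.0, 25.0, 50.0),
    "moderate": (25.0, 50.0, 75.0),
    "high": (50.0, 75.0, 100.0),
}

def fuzzify(x):
    """Membership degree of x in each linguistic term."""
    return {term: triangular(x, *params) for term, params in TERMS.items()}

memberships = fuzzify(40.0)  # partly "low", partly "moderate"
```

A value of 40 belongs to "low" with degree 0.4 and to "moderate" with degree 0.6, which is how expressive ambiguity is carried into later reasoning steps.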

[0178] The Bayesian inference network unit implements probabilistic inference based on Bayesian networks. The operation process is as follows: First, the system extracts conditional dependencies between variables from the knowledge graph and constructs a Bayesian network structure (directed acyclic graph). Second, the system learns network parameters (conditional probability tables) from literature data and expert knowledge. Then, when a reasoning query is received, the system converts the query into variable assignments and conditional probability calculations. Finally, based on the network size and accuracy requirements, the system selects an exact inference algorithm (such as the junction tree algorithm) or an approximate inference algorithm (such as importance sampling) to calculate the posterior probability distribution of the target variable. The system supports online updates of the Bayesian network; when new evidence appears, the network parameters are updated through Bayes' rule to maintain the model's timeliness.

[0179] The parameters for implementing the Bayesian inference network unit technology are as follows: network structure learning uses a constraint-based PC algorithm with a significance level of α=0.01; parameter learning uses maximum likelihood estimation (complete data) or the EM algorithm (missing data) with a convergence threshold of ε=1e-4; exact inference is implemented with the variable elimination algorithm, whose time complexity is $O(n \cdot \exp(w))$, where n is the number of variables and w is the tree width of the network; approximate inference adopts adaptive importance sampling. For complex networks, tens of thousands of samples are used to ensure that the estimation error does not exceed 5% at a 95% confidence level.

[0180] The evidence theory reasoning unit implements uncertainty reasoning based on Dempster-Shafer theory. The operation process is as follows: First, the system assigns basic probability assignments (BPAs) to environmental health knowledge, reflecting the degree of certainty and uncertainty of the knowledge. Second, when multiple sources of evidence are obtained, the system uses Dempster's combination rule to integrate belief assignments from different sources. The combination formula is as follows:

$$m(A) = \frac{1}{1-k} \sum_{B \cap C = A} m_1(B)\, m_2(C), \quad A \neq \emptyset, \qquad k = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$$

where $m_1$ and $m_2$ are the basic probability assignments of the two evidence sources and k is the degree of conflict between them.

[0181] Then, the system calculates the belief function (Bel) and plausibility function (Pl) for each hypothesis, constructing a belief interval [Bel, Pl]. Finally, the system assesses the uncertainty of the conclusion based on the belief interval and, according to the decision-making needs, selects an appropriate decision rule (such as maximum belief, maximum plausibility, or the midpoint of the interval) to generate the final inference. The system is particularly suitable for handling conflicting and incomplete evidence, and can clearly distinguish between the two states of "don't know" (ignorance) and "uncertainty".
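Dempster's combination rule and the resulting belief interval can be sketched with mass functions stored as dictionaries keyed by `frozenset`. The frame of discernment and the mass values are hypothetical.

```python
def combine(m1, m2):
    """Dempster's rule: combine two mass functions, returning (combined, conflict k)."""
    combined = {}
    conflict = 0.0
    for set_b, mass_b in m1.items():
        for set_c, mass_c in m2.items():
            intersection = set_b & set_c
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}, conflict

def belief(m, hypothesis):
    """Bel(A): total mass of focal sets contained in A."""
    return sum(v for a, v in m.items() if a <= hypothesis)

def plausibility(m, hypothesis):
    """Pl(A): total mass of focal sets intersecting A."""
    return sum(v for a, v in m.items() if a & hypothesis)

HARMFUL = frozenset({"harmful"})
SAFE = frozenset({"safe"})
FRAME = HARMFUL | SAFE  # mass on the full frame expresses "don't know"

source_1 = {HARMFUL: 0.6, FRAME: 0.4}
source_2 = {HARMFUL: 0.5, SAFE: 0.3, FRAME: 0.2}

fused, k = combine(source_1, source_2)
interval = (belief(fused, HARMFUL), plausibility(fused, HARMFUL))
midpoint = sum(interval) / 2  # interval-midpoint decision value
```

The width of the interval reflects the mass still assigned to the whole frame, which is what separates ignorance from a balanced probability.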

[0182] The technical implementation parameters of the evidence theory reasoning unit are as follows: basic probability assignments are generated from evidence level and expert ratings; conflict handling adopts a discount strategy, discounting the BPA (basic probability assignment) of evidence sources with lower reliability when the conflict degree k>0.3; the decision rules include four types, with the default being the midpoint of the interval, i.e. $(Bel(A) + Pl(A))/2$.

[0183] The uncertainty propagation and decision-making unit implements uncertainty propagation and decision support during the reasoning process. The operational flow is as follows: First, the system tracks the sources and degrees of uncertainty at each stage of the reasoning chain. Second, the system applies an uncertainty propagation algorithm to calculate the cumulative uncertainty of complex reasoning. Random uncertainty is propagated using Monte Carlo simulation based on Latin hypercube sampling, and cognitive uncertainty is analyzed using interval analysis based on constraint propagation algorithms. Then, the system generates reasoning results containing uncertainty representations, including point estimates, interval estimates, and probability distributions. Finally, the system provides decision support based on uncertainty analysis, including risk-return assessment, threshold analysis, robustness testing, and sensitivity analysis. The system implements variance decomposition technology to calculate Sobol' indices, identifying the factors that contribute most to the uncertainty of the outcome. The system implements adaptive decision expression, automatically adjusting the tone of the conclusion based on the degree of uncertainty, using cautious expressions such as "may" and "suggest consideration" in cases of high uncertainty.

[0184] The uncertainty propagation and decision unit technology implementation parameters are as follows: Monte Carlo simulation uses Latin hypercube sampling with a sample size of tens of thousands to ensure that the estimation error does not exceed 3% at a 95% confidence level; interval analysis uses the constraint propagation algorithm with a time complexity of O(n·d·e), where n is the number of variables, d is the number of constraints, and e is the constraint evaluation complexity; sensitivity analysis implements variance decomposition technology to calculate Sobol' indices and identify the factors that contribute the most to the uncertainty of the results; the decision support framework includes five decision scenarios (prevention decision, intervention decision, management decision, research decision, and communication decision) and corresponding decision rules.
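Monte Carlo propagation with Latin hypercube sampling can be sketched in plain Python. The exposure model (daily dose = concentration × intake / body weight), the uniform input ranges, and the function names are assumptions chosen only to show the sampling scheme.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Draw n_samples points in [0, 1)^n_dims with one point per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        column = [(stratum + rng.random()) / n_samples for stratum in range(n_samples)]
        rng.shuffle(column)  # decouple the strata across dimensions
        columns.append(column)
    return list(zip(*columns))

def scale(u, low, high):
    """Map a unit-interval sample to a uniform value on [low, high]."""
    return low + u * (high - low)

def simulate_daily_dose(n_samples=10_000, seed=0):
    """Propagate input uncertainty through a hypothetical dose model."""
    rng = random.Random(seed)
    doses = []
    for u_conc, u_intake, u_weight in latin_hypercube(n_samples, 3, rng):
        concentration = scale(u_conc, 0.01, 0.05)  # mg/L, assumed range
        intake = scale(u_intake, 1.0, 3.0)         # L/day, assumed range
        body_weight = scale(u_weight, 50.0, 90.0)  # kg, assumed range
        doses.append(concentration * intake / body_weight)
    doses.sort()
    mean = sum(doses) / n_samples
    lower = doses[int(0.025 * n_samples)]
    upper = doses[int(0.975 * n_samples) - 1]
    return mean, (lower, upper)

mean_dose, dose_interval = simulate_daily_dose()
```

Because every stratum of every input is sampled exactly once, the estimate of the mean typically stabilizes with fewer samples than simple random sampling would need.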

[0185] The uncertainty reasoning and representation submodule is tightly integrated with other modules, providing uncertainty management support for the entire system. When the knowledge graph construction module discovers uncertain knowledge, this submodule provides a standardized uncertainty representation; when the knowledge reasoning engine performs complex reasoning, this submodule tracks and quantifies the uncertainty of the reasoning results; when the knowledge application service module generates risk assessments or policy recommendations, this submodule ensures that uncertainty is properly communicated and handled. This comprehensive uncertainty management ensures the scientific validity and reliability of the system output, avoiding misleading results caused by excessive certainty.

[0186] 5. Knowledge Application Service Module

[0187] This module applies the environmental health knowledge graph and its reasoning capabilities to real-world environmental health management scenarios, providing various knowledge services, including risk assessment, causal explanation, literature review, intelligent question answering, and decision support. As shown in Figure 5, this module adopts a layered architecture design, including an API interface layer, a service core layer, and a data interaction layer, which realizes a seamless transformation from knowledge graph to practical application.

[0188] 5.1 Knowledge-Assisted Risk Assessment Submodule

[0189] This submodule combines environmental monitoring data and health knowledge to provide knowledge-enhanced risk assessment services. It provides end-to-end knowledge support from exposure assessment to risk characterization, significantly improving the scientific rigor and reliability of risk assessments.

[0190] The knowledge-assisted risk assessment submodule comprises four core parts: exposure assessment unit, dose-response relationship application unit, comprehensive risk characterization unit, and risk management suggestion generation unit.

[0191] Exposure Assessment Unit: This unit, based on environmental health knowledge and monitoring data, enables accurate exposure assessment across multiple media and pathways. The specific implementation method is as follows: Environmental monitoring data processing: The system accepts monitoring data in various formats (CSV, Excel, JSON, XML, etc.) and uses an adaptive parsing algorithm to automatically identify the data structure, perform data cleaning (outlier detection, missing value handling), and standardize the data. Data processing is implemented in a combination of R and Python. Cleaning uses the moving median method (window size = 5) for outlier detection and multivariate imputation (based on the MICE algorithm, iteration count = 10) for missing values; standardization uses the z-score method.
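The cleaning steps can be sketched as follows. The text specifies only the moving median and the window size of 5. The decision rule used here (a Hampel-style test against a scaled median absolute deviation with threshold `k = 3.0`) is an assumption, as are the function names. MICE imputation is omitted because it needs a third-party library.

```python
import statistics

def moving_median_outliers(values, window=5, k=3.0):
    """Flag indices whose distance from the local median exceeds k * scaled MAD.

    The MAD-based threshold is an assumed decision rule; the source only
    names the moving median method with window size 5.
    """
    half = window // 2
    flagged = []
    for i, x in enumerate(values):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        neighbourhood = values[lo:hi]
        med = statistics.median(neighbourhood)
        mad = statistics.median([abs(v - med) for v in neighbourhood])
        # 1.4826 rescales the MAD to the standard deviation of a normal distribution.
        if mad > 0 and abs(x - med) > k * 1.4826 * mad:
            flagged.append(i)
    return flagged

def z_scores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical hourly concentrations with one spike at index 4.
readings = [10, 11, 10, 12, 95, 11, 10, 12, 11]
outlier_indices = moving_median_outliers(readings)
clean = [v for i, v in enumerate(readings) if i not in outlier_indices]
standardized = z_scores(clean)
```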

[0192] Knowledge-driven exposure model parameterization: The system extracts key exposure parameters for specific pollutants from a knowledge graph, including physicochemical properties (volatility, solubility, octanol-water partition coefficient, etc.), environmental behavior parameters (half-life, diffusion coefficient, bioaccumulation factor, etc.), and exposure factors (respiration rate, water consumption, skin contact area, etc.). Parameter extraction employs a combination of knowledge graph embedding and attribute retrieval, supporting parameter uncertainty representation (mean ± standard deviation or 95% confidence interval).

[0193] The exposure assessment methodology strictly adheres to the "Technical Specification for Exposure Parameter Investigation" (HJ 877-2017) and the "Technical Specification for Investigation of Permeability Coefficient of Particulate Matter (PM2.5) in Ambient Air of Civil Buildings" (HJ 949-2018), employing standardized exposure parameters and assessment procedures. The system's built-in parameter library is constructed entirely according to the data requirements of the "Basic Dataset for Exposure Parameter Investigation" (HJ 968-2019), including demographic characteristics, temporal activity patterns, exposure behaviors, and physiological parameters, ensuring the scientific rigor and comparability of the assessment results.

[0194] Multi-pathway exposure calculation: The system implements a comprehensive calculation model for three major exposure pathways: inhalation, ingestion, and skin contact. Inhalation exposure uses an improved single-compartment model (considering the indoor-outdoor air exchange rate and the deposition rate); ingestion exposure considers three media: water, food, and soil / dust; and skin contact uses an improved USEPA DERMAL model (considering absorption kinetics). The calculation employs Monte Carlo simulation (iterations = 10,000) to generate an exposure dose distribution rather than a single value, representing the uncertainty of exposure more accurately.
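A minimal sketch of how a Monte Carlo run can yield a dose distribution for a single pathway is given below. It assumes the conventional average daily dose form, dose = C × IR × EF × ED / (BW × AT), which the text does not state explicitly. All distributions and parameter values are hypothetical placeholders, not values from the parameter library.

```python
import random

def percentile(sorted_values, q):
    """Simple rank-based percentile of an ascending list, q in [0, 100]."""
    index = min(len(sorted_values) - 1, int(q / 100.0 * len(sorted_values)))
    return sorted_values[index]

def simulate_inhalation_dose(n_iter=10_000, seed=42):
    """Return a sorted list of simulated inhalation doses in mg/(kg*day)."""
    rng = random.Random(seed)
    exposure_freq = 350.0                 # days/year (assumed)
    exposure_dur = 30.0                   # years (assumed)
    averaging_time = exposure_dur * 365.0 # days
    doses = []
    for _ in range(n_iter):
        conc = rng.lognormvariate(-3.0, 0.4)            # mg/m^3, hypothetical
        inh_rate = max(rng.gauss(15.0, 2.0), 1.0)       # m^3/day, truncated below
        body_weight = max(rng.gauss(60.0, 8.0), 20.0)   # kg, truncated below
        doses.append(conc * inh_rate * exposure_freq * exposure_dur
                     / (body_weight * averaging_time))
    doses.sort()
    return doses

doses = simulate_inhalation_dose()
dose_summary = {q: percentile(doses, q) for q in (5, 50, 95)}
```

Reporting `dose_summary` instead of a single value keeps the spread of the input parameters visible in the result.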

[0195] Spatiotemporal variability processing: The system supports temporal and spatial variability analysis of exposures, enabling the generation of high-resolution exposure maps based on kriging interpolation and spatiotemporal topological relationships. Spatial analysis resolution reaches 100m×100m, and temporal resolution supports hourly, daily, monthly, quarterly, and annual scales. The system uses GPU-accelerated computation, with the computation time for millions of grid points controlled within 30 seconds.

[0196] Dose-Response Relationship Application Unit: This unit provides scientific dose-response relationship assessments based on toxicological and epidemiological data from a knowledge graph. The specific implementation method is as follows: (a) Dose-Response Model Selection: The system retrieves dose-response relationship data for specific pollutant-health effect pairs from the knowledge graph, supporting multiple model formats, including linear no-threshold models, linear with threshold models, log-linear models, multi-segment linear models, and benchmark-dose models. Model selection is based on toxicological mechanism information and epidemiological evidence type in the knowledge graph: when the knowledge graph indicates that the pollutant has a genotoxic mechanism, the linear no-threshold model is used by default; when there is clear effect threshold evidence in the knowledge graph, a threshold model is used. The system also allows users to manually select the model type based on their professional judgment.

[0197] (b) Parameter Extraction and Validation: The system extracts key parameters of the dose-response relationship from the knowledge graph, including the slope factor (carcinogens), reference dose / reference concentration (non-carcinogens), benchmark dose (BMD), and effect estimates (e.g., the relative risk increment corresponding to each 10 μg / m³ increase in concentration). When multiple sets of parameters exist in the knowledge graph, the system selects the optimal parameter set based on the level of evidence and study quality score, and calculates the uncertainty range of the parameters.

[0198] (c) Population-Specific Adjustment: The system performs population-specific adjustments to the dose-response relationship based on population sensitivity information in the knowledge graph. Adjustment factors include age (sensitivity coefficients for children and the elderly), sex (gender-specific effect differences), baseline health status (e.g., increased susceptibility in patients with chronic diseases), and genetic susceptibility (e.g., the influence of specific gene polymorphisms). Adjustment coefficients are extracted from the knowledge graph and derived from systematic reviews and meta-analyses.
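The model selection rule in (a) and the parameter set selection in (b) can be sketched as below. The record fields, the fallback to a benchmark-dose model when neither condition holds, and the convention that a lower evidence level number is stronger (I = 1 to VI = 6) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PollutantEvidence:
    genotoxic: bool                # knowledge graph indicates a genotoxic mechanism
    has_threshold_evidence: bool   # knowledge graph holds clear effect-threshold evidence

def select_model(evidence: PollutantEvidence, user_choice: Optional[str] = None) -> str:
    """Pick a dose-response model type following the default rules in the text."""
    if user_choice is not None:
        return user_choice                 # professional judgement overrides the defaults
    if evidence.genotoxic:
        return "linear_no_threshold"
    if evidence.has_threshold_evidence:
        return "linear_with_threshold"
    return "benchmark_dose"                # assumed fallback, not specified in the text

def pick_parameter_set(candidates):
    """Choose the set with the strongest evidence level, ties broken by quality score."""
    return min(candidates, key=lambda c: (c["evidence_level"], -c["quality_score"]))

# Hypothetical parameter sets for one pollutant-effect pair.
candidates = [
    {"source": "study_A", "evidence_level": 2, "quality_score": 0.70, "slope_factor": 0.030},
    {"source": "study_B", "evidence_level": 1, "quality_score": 0.65, "slope_factor": 0.027},
    {"source": "study_C", "evidence_level": 1, "quality_score": 0.90, "slope_factor": 0.025},
]
best = pick_parameter_set(candidates)
```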

[0199] Comprehensive Risk Characterization Unit: This unit integrates exposure assessment and dose-response relationship analysis results to perform comprehensive risk calculation and characterization. The specific implementation method is as follows: (a) Carcinogenic Risk Calculation: For carcinogenic effects, the system calculates the incremental lifetime cancer risk (ILCR) using the formula: ILCR = Exposure Dose × Slope Factor × Exposure Duration / Average Lifespan. The system calculates the risk separately for the three routes of inhalation, ingestion, and skin contact, and then sums the results to obtain the total carcinogenic risk. The acceptable risk level is set by default to the interval from 10⁻⁶ to 10⁻⁴.

[0200] (b) Non-carcinogenic risk calculation: For non-carcinogenic effects, the system calculates the hazard quotient (HQ) and hazard index (HI), where HQ = exposure dose / reference dose (or reference concentration). When multiple contaminants act on the same target organ, the target-organ hazard index HI = ΣHQ is calculated; HI > 1 indicates a potential health risk.

[0201] (c) Uncertainty Analysis: The system uses Monte Carlo simulation to comprehensively analyze the uncertainty in the risk calculation process, generate the probability distribution of risk values, report the 5th, 25th, 50th, 75th and 95th percentile values, and identify the parameters that have the greatest impact on risk estimation through sensitivity analysis.

[0202] (d) Visual presentation: The system generates various forms of risk assessment visualization results, including risk heat maps (spatial distribution), risk trend maps (time changes), sensitivity tornado maps, and probability distribution maps, supporting user interaction to explore risk assessment results.
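The carcinogenic and non-carcinogenic formulas in (a) and (b) translate directly into code. The sketch below uses made-up doses, slope factors, and reference doses purely to show the arithmetic; the function names are illustrative.

```python
def ilcr(dose, slope_factor, exposure_years, lifespan_years):
    """ILCR = Exposure Dose x Slope Factor x Exposure Duration / Average Lifespan."""
    return dose * slope_factor * exposure_years / lifespan_years

def total_ilcr(routes, exposure_years, lifespan_years):
    """Sum the route-specific ILCR values; routes maps name -> (dose, slope_factor)."""
    return sum(ilcr(dose, sf, exposure_years, lifespan_years)
               for dose, sf in routes.values())

def hazard_quotient(dose, reference_dose):
    """HQ = exposure dose / reference dose (or reference concentration)."""
    return dose / reference_dose

def hazard_index(dose_rfd_pairs):
    """HI = sum of HQ over contaminants acting on the same target organ."""
    return sum(hazard_quotient(dose, rfd) for dose, rfd in dose_rfd_pairs)

# Hypothetical inputs: dose in mg/(kg*day), slope factor in (mg/(kg*day))^-1.
routes = {
    "inhalation": (1e-4, 0.03),
    "ingestion": (2e-4, 0.05),
    "skin_contact": (5e-5, 0.02),
}
cancer_risk = total_ilcr(routes, exposure_years=30, lifespan_years=70)
organ_hi = hazard_index([(0.002, 0.01), (0.003, 0.005)])
potential_risk = organ_hi > 1
```

With these inputs the three route contributions sum to 1.4 × 10⁻⁵ before the 30/70 duration scaling, giving a total ILCR of 6 × 10⁻⁶. The two hazard quotients are 0.2 and 0.6, so the hazard index is 0.8.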

[0203] Risk Management Recommendation Generation Unit: This unit provides scientific risk management recommendations based on risk assessment results and intervention knowledge from the knowledge graph. The specific implementation method is as follows: (a) Risk Level Determination: Based on the risk assessment results and national standard requirements, the system classifies risks into four levels: negligible risk (ILCR < 10⁻⁶ or HI < 0.1), low risk (10⁻⁶ ≤ ILCR < 10⁻⁵ or 0.1 ≤ HI < 1), medium risk (10⁻⁵ ≤ ILCR < 10⁻⁴ or 1 ≤ HI < 10), and high risk (ILCR ≥ 10⁻⁴ or HI ≥ 10).

[0204] (b) Intervention Response Retrieval: The system retrieves intervention response knowledge corresponding to risk sources from the knowledge graph, including pollution source control measures, exposure prevention measures, and health protection measures. The search scope covers four categories of measures: engineering controls, administrative management, personal protective equipment, and medical monitoring.

[0205] (c) Recommendation Generation: Based on risk level and available interventions, the system automatically generates tiered risk management recommendations. For high-risk situations, immediate action is recommended, along with a prioritized list of interventions; for medium-risk situations, continuous monitoring is recommended, and preventative recommendations are provided; for low-risk situations, routine protective measures and health tips are offered. The recommendations cite evidence sources from the knowledge graph to enhance their scientific validity and credibility.
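The four-level classification in (a) can be written as a small lookup. The text does not say how to combine the two indicators when both an ILCR and an HI are available, so the sketch assumes the more severe of the two bands is reported.

```python
LEVELS = ["negligible", "low", "medium", "high"]
ILCR_CUTOFFS = (1e-6, 1e-5, 1e-4)
HI_CUTOFFS = (0.1, 1.0, 10.0)

def _band(value, cutoffs):
    """Count how many lower bounds the value reaches (0 = negligible, 3 = high)."""
    return sum(value >= cutoff for cutoff in cutoffs)

def risk_level(ilcr=None, hi=None):
    """Classify a risk from ILCR and/or HI; the worse band wins (an assumption)."""
    bands = []
    if ilcr is not None:
        bands.append(_band(ilcr, ILCR_CUTOFFS))
    if hi is not None:
        bands.append(_band(hi, HI_CUTOFFS))
    if not bands:
        raise ValueError("provide at least one of ilcr or hi")
    return LEVELS[max(bands)]
```

For example, `risk_level(ilcr=6e-6)` falls in the low band, while `risk_level(ilcr=5e-7, hi=12)` is reported as high because the hazard index dominates.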

[0206] 5.2 Knowledge-Driven Causal Explanation Submodule

[0207] This submodule provides mechanistic explanations for observed environment-health associations, upgrading statistical associations to mechanism-based causal understandings. It deeply integrates data analysis and knowledge reasoning to achieve scientific explanations from "what" to "why" and "how."

[0208] The knowledge-driven causal explanation submodule comprises four core parts: a correlation phenomenon identification and classification unit, a knowledge base causal chain retrieval unit, a multi-level mechanism explanation unit, and a knowledge-data fusion analysis unit.

[0209] Correlation Identification and Classification Unit: This unit enables the automatic identification and classification of statistical associations between environmental factors and health outcomes. The specific implementation method is as follows: (a) Association Pattern Detection: The system receives user input of environmental factors and health effect names, or directly receives statistical analysis results (such as correlation coefficients, regression coefficients, and relative risk values), and identifies the association phenomena to be explained. The system performs a standardized assessment of the association strength, classifying associations into three categories according to Cohen's effect size conventions: weak association (effect size < 0.3), moderate association (0.3 ≤ effect size < 0.5), and strong association (effect size ≥ 0.5).

[0210] (b) Association Type Classification: The system classifies identified associations into five categories: positive linear association, negative linear association, nonlinear association, threshold effect, and interaction effect. The classification is based on statistical descriptions provided by the user or existing association features retrieved by the system from the knowledge graph.

[0211] (c) Confounding Factor Identification: Based on causal relationship knowledge in the knowledge graph, the system automatically identifies confounding factors that may affect the observed associations. The identification method is based on the search for common causal entities that "simultaneously affect exposure factors and health effects" in the knowledge graph. The system generates a list of confounding factors and their levels of evidence in the knowledge graph.
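The strength classification in (a) and the common-cause search in (c) can be sketched on a toy causal graph. The graph contents are hypothetical, only direct causal edges are considered, and taking the absolute value of the effect size (so that negative associations are graded by magnitude) is an assumption.

```python
def classify_strength(effect_size):
    """Grade an association by the magnitude of its standardized effect size."""
    size = abs(effect_size)
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"

def find_confounders(causes, exposure, outcome):
    """Return entities with a causal edge to both the exposure and the outcome.

    causes maps each entity to the set of entities it directly affects.
    """
    return sorted(
        node for node, effects in causes.items()
        if node not in (exposure, outcome)
        and exposure in effects and outcome in effects
    )

# Toy causal edges, for illustration only.
toy_graph = {
    "smoking": {"indoor PM2.5", "lung cancer"},
    "indoor PM2.5": {"lung cancer"},
    "age": {"lung cancer"},
    "traffic density": {"indoor PM2.5"},
}
confounders = find_confounders(toy_graph, "indoor PM2.5", "lung cancer")
```

In this toy graph only "smoking" affects both the exposure and the outcome, so it is the single candidate confounder returned.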

[0212] Knowledge Base Causal Chain Retrieval Unit: This unit retrieves causal chains from the knowledge graph that may explain observed associations. The specific implementation method is as follows: (a) Entity Mapping: The system maps the environmental and health variables involved in the association analysis to the corresponding entities in the knowledge graph, using the same four-level entity linking strategy as in Section 3.2.1 to ensure mapping accuracy.

[0213] (b) Causal Path Retrieval: Based on the mapped entities, the causal path search algorithm described in Section 4.3 is used to retrieve causal paths connecting environmental factors and health effects in the knowledge graph. The system sets the maximum path length to 6 hops and returns a list of causal paths sorted by path credibility.

[0214] (c) Path Relevance Assessment: The system assesses the relevance between the retrieved causal paths and the observed associations. The assessment dimensions include directional consistency (whether the causal direction and the association direction are consistent), strength matching (whether the magnitude of the causal effect matches the strength of the association), and temporal reasonableness (whether the temporal sequence of the causal path is reasonable). The assessment results are used to perform a secondary ranking of the causal paths.
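A hop-limited causal path search of the kind described in (b) can be sketched as a depth-first enumeration of simple paths. The search algorithm of Section 4.3 is not reproduced here: scoring a path by the product of its edge confidences is an assumption, and the graph and confidence values are invented for illustration.

```python
def causal_paths(graph, source, target, max_hops=6):
    """Enumerate simple paths from source to target with at most max_hops edges.

    graph maps a node to {successor: edge_confidence}. Results are
    (credibility, path) tuples sorted by credibility, highest first.
    """
    results = []

    def dfs(node, path, credibility):
        if node == target:
            results.append((credibility, path))
            return
        if len(path) - 1 >= max_hops:      # hop budget exhausted
            return
        for successor, confidence in graph.get(node, {}).items():
            if successor not in path:      # keep paths simple (no cycles)
                dfs(successor, path + [successor], credibility * confidence)

    dfs(source, [source], 1.0)
    results.sort(key=lambda item: item[0], reverse=True)
    return results

# Hypothetical edges and confidences.
toy_graph = {
    "PM2.5": {"oxidative stress": 0.9, "systemic inflammation": 0.8},
    "oxidative stress": {"endothelial dysfunction": 0.8},
    "systemic inflammation": {"endothelial dysfunction": 0.7, "cardiovascular disease": 0.5},
    "endothelial dysfunction": {"cardiovascular disease": 0.9},
}
paths = causal_paths(toy_graph, "PM2.5", "cardiovascular disease")
```

The secondary ranking in (c) could then re-sort `paths` with an additional relevance score for direction, strength, and temporal order.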

[0215] Multi-level mechanism explanation unit: This unit generates a detailed multi-level mechanism explanation based on the retrieved causal chains. The specific implementation method is as follows: (a) Molecular and cellular level interpretation: The system extracts the mechanisms of action of pollutants at the molecular and cellular levels from the knowledge graph, including mechanisms such as receptor binding, gene expression regulation, signaling pathway activation, oxidative stress, and DNA damage. The system extracts mechanistic knowledge node by node along the causal path, constructing a complete mechanism chain from initial pollutant exposure to cellular effects. Specific experimental evidence and literature sources are cited for each mechanism link.

[0216] (b) Organ System Level Explanation: This system integrates molecular and cellular level mechanistic information to explain how pollutants lead to functional abnormalities at the organ and system levels. The system extracts the "cellular effect → tissue damage → organ dysfunction" pathway from the knowledge graph, covering pathological processes such as inflammatory responses, tissue fibrosis, and functional decline. The explanation includes the identification of affected organs and a temporal description of the pathological process.

[0217] (c) Population-level interpretation: The system integrates epidemiological evidence from the knowledge graph to explain how exposure leads to changes in disease incidence and mortality at the population level. This includes susceptible population characteristics, descriptions of exposure-response relationships, and public health impact assessments. The system pays particular attention to the shape (linear / non-linear / threshold) of dose-response relationships and to differences between populations.

[0218] (d) Explanatory Text Generation: The system uses a large language model to transform structured, multi-layered mechanism information into coherent natural language explanatory text. The generation process employs a knowledge-constrained text generation method: first, an explanatory framework is constructed (organized hierarchically from molecule to cell to organ to population); then, specific mechanism descriptions are filled in for each level; and finally, text coherence is optimized. Each key assertion in the generated explanatory text is labeled with its evidence source and level, facilitating user verification and traceability.

[0219] Knowledge-Data Fusion Analysis Unit: This unit enables in-depth data analysis guided by knowledge, strengthening the empirical foundation of causal explanations. The specific implementation method is as follows: (a) Knowledge-guided variable selection: Based on the causal and correlation relationships in the knowledge graph, the system guides the selection of variables in statistical analysis. The system extracts a list of covariates and confounding factors related to the target from the knowledge graph and suggests including them in the variable set of the analysis model to avoid omitting important variables or including irrelevant variables.

[0220] (b) Model enhancement with prior constraints: Using the causal direction and effect size range in the knowledge graph as prior constraints enhances the interpretability of the statistical model. For example, when a relationship is explicitly marked as a positive causal relationship in the knowledge graph, this information can be used as a prior constraint for Bayesian regression to improve the stability of the model estimation.

[0221] (c) Knowledge Consistency Verification: The system compares the user-provided data analysis results with the existing knowledge in the knowledge graph and outputs a consistency report. The report includes: findings consistent with existing knowledge, findings inconsistent with existing knowledge and analysis of possible causes, and new findings not yet covered in the knowledge graph. Inconsistent findings are marked as potential knowledge update candidates for processing in subsequent knowledge base update processes.

[0222] 5.3 Intelligent Literature Review and Evidence Synthesis Submodule

[0223] This submodule automatically generates a comprehensive literature review and evidence synthesis based on the user's research questions. It integrates knowledge graph content and evidence evaluation methods, transforming fragmented knowledge into a systematic synthesis of evidence to support evidence-based decision-making.

[0224] The intelligent literature review and evidence synthesis submodule comprises four core parts: research question decomposition and mapping, knowledge aggregation and organization, evidence level assessment, and adaptive literature review generation.

[0225] Research Question Decomposition and Mapping Unit: This unit intelligently decomposes the user-input research question into structured components and maps them to a knowledge graph. The specific implementation method is as follows: (a) Question Parsing: The system uses a large language model in the environmental health domain to semantically parse the research questions input by users, and uses the PECO framework to extract key elements, including Population, Exposure, Comparison, and Outcome. The parsing uses specially designed prompt templates, achieving an accuracy rate of over 90%.

[0226] (b) Knowledge Graph Mapping: The system maps the parsed elements to corresponding entities and relationships in the knowledge graph, establishing a connection between the research question and the knowledge graph. The mapping adopts a four-level entity linking strategy, and the mapping coverage can reach over 90%.

[0227] (c) Retrieval Strategy Generation: Based on the mapping results, the system automatically generates a knowledge graph retrieval strategy, including a core entity list, relation type filtering conditions, and path depth restrictions, providing structured guidance for subsequent knowledge aggregation.

[0228] Knowledge Aggregation and Organization Unit: This unit retrieves relevant knowledge from the knowledge graph based on the decomposed research question and organizes it in a structured manner. The specific implementation method is as follows: (a) Knowledge Retrieval: Based on the retrieval strategy, the system performs multi-hop retrieval in the knowledge graph to obtain all knowledge entries related to the research question. The retrieval scope includes directly related knowledge (1 hop) and indirectly related knowledge (2-4 hops).

[0229] (b) Knowledge Classification: The system classifies and organizes the retrieved knowledge according to topic categories (exposure characteristics, health effects, mechanisms of action, intervention measures, etc.) and evidence types (experimental studies, observational studies, systematic reviews, etc.) to form a hierarchical knowledge structure.

[0230] (c) Knowledge Synthesis: The system synthesizes multiple pieces of knowledge on the same topic, identifying consistent conclusions and points of disagreement. For points of disagreement, the system lists the viewpoints of each party and their levels of evidence, providing users with a balanced presentation of information.
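The multi-hop retrieval in (a) amounts to a breadth-first traversal that records the hop distance of each entity from the core entity list. The sketch below uses generic entity identifiers and an adjacency-list graph as stand-ins for the actual knowledge graph store.

```python
from collections import deque

def multi_hop_retrieve(neighbors, core_entities, max_hops=4):
    """Split reachable entities into directly related (1 hop) and indirectly related (2..max_hops)."""
    distance = {entity: 0 for entity in core_entities}
    queue = deque(core_entities)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue                       # do not expand beyond the depth limit
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:        # first visit gives the shortest hop count
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    direct = sorted(n for n, d in distance.items() if d == 1)
    indirect = sorted(n for n, d in distance.items() if 2 <= d <= max_hops)
    return direct, indirect

# Toy chain E0 -> E1 -> ... -> E5; E5 lies 5 hops away and is excluded.
toy_neighbors = {"E0": ["E1"], "E1": ["E2"], "E2": ["E3"], "E3": ["E4"], "E4": ["E5"]}
direct, indirect = multi_hop_retrieve(toy_neighbors, ["E0"])
```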

[0231] Evidence Level Assessment Unit: This unit systematically assesses and synthesizes the retrieved knowledge. It reuses the modified GRADE evidence level assessment method described in Section 2.4, assigning each piece of knowledge in the literature review an evidence level rating of I-VI, and comprehensively assessing multiple pieces of evidence related to the same research question to generate an overall evidence strength determination.

[0232] Adaptive Literature Review Generation Unit: This unit generates an adaptive literature review report based on the organized knowledge and the evidence assessment. The specific implementation method is as follows: (a) Review Structure Design: The system automatically designs the review structure based on the research question type and knowledge distribution. The standard structure includes background introduction, description of exposure characteristics, review of health effects, analysis of mechanisms of action, evaluation of intervention measures, evidence synthesis, and conclusions and recommendations. The system dynamically adjusts the structure and length based on the quantity and quality of available knowledge in each section.

[0233] (b) Content Generation: The system uses a large language model to transform structured knowledge into coherent natural language text. The generation process employs a knowledge constraint strategy to ensure that each assertion is supported by knowledge entries in the knowledge graph. Each paragraph is accompanied by corresponding evidence sources and level labels.

[0234] (c) Adaptive Formatting: The system automatically adjusts the content and presentation of the review based on user needs. It supports three levels of detail: concise abstract (500-1000 words), standard review (3000-5000 words), and detailed report (8000 words or more). It also supports language style adjustments for different reader groups: professional academic style, policy recommendation style, and popular science style.

[0235] 5.4 Intelligent Question-and-Answer Submodule for Environmental Health

[0236] This submodule provides professional answers to environmental health questions, allowing users to obtain accurate knowledge through natural language interaction. The submodule deeply integrates knowledge graphs and large language model capabilities to achieve intelligent solutions to complex environmental health problems.

[0237] The intelligent question-and-answer submodule for environmental health comprises four core parts: natural language understanding and intent recognition, knowledge retrieval and reasoning, evidence integration and answer generation, and interactive dialogue management.

[0238] Natural Language Understanding and Intent Recognition Unit: This unit enables deep understanding of natural language issues in the environmental health domain. The specific implementation method is as follows: (a) Question Classification: The system uses a BERT-based classification model to classify input questions. Classification categories include fact-based queries (e.g., "What is the national standard limit for PM2.5?"), relational queries (e.g., "What diseases are benzene associated with?"), mechanism explanation queries (e.g., "How does PM2.5 cause cardiovascular disease?"), comparative analysis queries (e.g., "Which is more harmful to the respiratory system, PM2.5 or PM10?"), and advice / consultation queries (e.g., "How to reduce the risk of indoor formaldehyde exposure?"). The classification model is trained using labeled data in the environmental health domain, achieving an accuracy rate of over 90%.

[0239] (b) Entity and Relationship Recognition: The system extracts the environmental health entities and relationships involved in the question text, and uses a combination of BiLSTM-CRF model and domain dictionary to map the recognized entities to the corresponding nodes in the knowledge graph.

[0240] (c) Query Structure: The system transforms natural language questions into knowledge graph query statements. Fact-based queries are transformed into attribute retrieval queries; relation-based queries are transformed into neighbor node queries; mechanism explanation queries are transformed into path search queries; comparative analysis queries are transformed into multi-entity attribute comparison queries; and suggestion / consultation queries are transformed into intervention knowledge retrieval queries.

[0241] Knowledge Retrieval and Reasoning Unit: This unit retrieves information from the knowledge graph based on structured queries and performs necessary reasoning. The specific implementation method is as follows: (a) Direct Retrieval: For fact-based and relational query questions, the system directly executes the query in the knowledge graph to obtain matching results. The search results are sorted by evidence level and timeliness.

[0242] (b) Reasoning Solution: For mechanism explanation and comparative analysis questions, the system invokes the hybrid reasoning strategy of the knowledge reasoning engine module. Mechanism explanation questions utilize causal path search and multi-level mechanism explanation functions; comparative analysis questions extract the attributes and related knowledge of the two entities from the knowledge graph and perform structured comparative analysis.

[0243] (c) Knowledge Supplementation: When the information in the knowledge graph is insufficient to fully answer the question, the system calls the large language model to supplement the answer and clearly marks the source type of the information: content marked as "from knowledge graph" has high reliability (after quality assessment and evidence classification), while content marked as "from model generation" requires further verification by the user.

[0244] Evidence Integration and Response Generation Unit: This unit generates accurate and comprehensive responses based on the retrieval and reasoning results. The specific implementation method is as follows: (a) Answer organization: The system organizes the answer structure according to the question type. For fact-finding questions, a concise and direct answer format is adopted; for mechanism-explanation questions, a hierarchical explanation format (by molecule → cell → organ → population levels) is used; for comparative analysis questions, a comparative table format is used; for advice-seeking questions, a hierarchical advice list format is used.

[0245] (b) Evidence annotation: The system annotates the source and level of evidence for each key assertion in the answer, including the knowledge item number supporting the assertion, the evidence level (I - VI levels), and the source literature information, enabling users to trace the knowledge source and evaluate the reliability.

[0246] (c) Uncertainty expression: Based on the uncertainty analysis results in the reasoning process, the system uses appropriate certainty expressions in the answer. Conclusions with Bel > 0.8 in the belief interval [Bel, Pl] are expressed using statements such as "clearly", "sufficient evidence shows"; for 0.5 < Bel ≤ 0.8, statements like "very likely", "most studies support" are used; for Bel ≤ 0.5, cautious statements such as "possibly", "some studies suggest", "uncertain" are used.
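The wording rule in (c) maps the lower bound Bel of the belief interval to a set of hedging phrases. A minimal sketch follows, with the function name and the input validation as illustrative additions.

```python
def certainty_phrases(bel):
    """Return the hedging phrases for a conclusion whose belief lower bound is bel."""
    if not 0.0 <= bel <= 1.0:
        raise ValueError("Bel must lie in [0, 1]")
    if bel > 0.8:
        return ("clearly", "sufficient evidence shows")
    if bel > 0.5:
        return ("very likely", "most studies support")
    return ("possibly", "some studies suggest", "uncertain")
```

For instance, a conclusion with Bel = 0.8 falls in the middle band, since the top band requires Bel strictly greater than 0.8.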

[0247] Interactive dialogue management unit: This unit maintains the multi-round dialogue state and supports continuous and natural knowledge exchange. The specific implementation methods are as follows: (a) Dialogue state tracking: The system maintains the dialogue history and context information, supporting anaphora resolution (e.g., "it" in "How toxic is it" refers to the pollutant mentioned above) and ellipsis completion (e.g., "What about PM10" is automatically completed to a full question). The dialogue state is represented using structured slots, recording the entities, relationships, and topics under discussion.

[0248] (b) Probing guidance: When the user's question is too broad or ambiguous, the system narrows the scope through probing. The probing strategy is based on the concept hierarchy in the knowledge graph, gradually guiding from the upper-level concept to the specific query target. For example, when the user asks "The impact of air pollution on health", the system probes for the specific pollutant type and health effect type of concern.

[0249] (c) Association recommendation: Based on the current dialogue content, the system retrieves relevant associated knowledge points from the knowledge graph and actively recommends subsequent questions or supplementary information that the user may be interested in, supporting in-depth exploration of knowledge.

[0250] 5.5 Knowledge-driven decision support sub-module

[0251] This sub-module provides knowledge-based advice and evaluation for environmental health management decisions. This sub-module includes four core parts: a decision problem construction unit, an intervention plan generation unit, a multi-criteria decision evaluation unit, and a scenario analysis and adaptive management unit.

[0252] Decision Problem Construction Unit: This unit transforms environmental health decision problems into a structured representation, laying the foundation for solution generation and evaluation. The specific implementation method is as follows: (a) Construction of the Decision-Making Objective System: The system employs the Analytic Hierarchy Process (AHP) to construct the decision-making objective system. The objective decomposition supports a three-level structure: the top level represents the overall objective (e.g., "reducing PM2.5-related health risks"); the middle level comprises sub-objectives (e.g., "reducing emission sources," "reducing population exposure," and "strengthening health protection"); and the bottom level represents specific objectives (e.g., "controlling industrial emissions," "improving indoor air quality," and "identifying high-risk groups"). Two methods are used to determine objective weights: expert scoring (experts directly rate the relative importance of objectives) and the Analytic Hierarchy Process (constructing a pairwise judgment matrix and calculating its principal eigenvector). The system performs a consistency check to ensure that the consistency ratio (CR) of the judgment matrix is less than 0.1. Weight sensitivity analysis assesses the impact of weight changes on the decision-making outcome using a perturbation method (weight changes of ±20%).

[0253] (b) Decision Constraint Identification: The system automatically identifies and formalizes decision constraints. Constraints are divided into hard constraints (conditions that must be met) and soft constraints (preferences to be met as much as possible). Constraint types include technical constraints (such as technical feasibility, equipment availability), economic constraints (such as budget limits, cost caps), time constraints (such as implementation deadlines, response times), and social constraints (such as public acceptance, regulatory compliance). Constraints are represented using a parametric model, with each constraint defined as a triple (attribute, operator, threshold), such as (implementation cost ≤ 1 million RMB). The system supports constraint priority settings (levels 1-5), with higher priority constraints having greater weight in the solution selection process.

[0254] (c) Parameterization of Decision-Making Scenarios: The system implements parameterized representations of environmental health decision-making scenarios. Scenario parameters fall into three main categories: environmental characteristic parameters (e.g., pollutant type, concentration level, spatial distribution), demographic characteristic parameters (e.g., population density, age structure, health status), and socioeconomic parameters (e.g., economic development level, medical resources, policy environment). Parameter values are acquired from multiple data sources: on-site monitoring data, statistical yearbooks, population censuses, expert estimates, etc. For parameter uncertainty, the system supports interval representation, probability distribution representation, and fuzzy set representation to capture the uncertainty of the real world. The system maintains a scenario template library containing more than ten typical environmental health decision-making scenarios; new scenarios can be quickly constructed based on these templates.

[0255] (d) Knowledge Mapping and Association: The system maps decision problem elements to relevant nodes in the knowledge graph, establishing decision-knowledge connections. The mapping process includes three steps: entity matching (mapping decision entities to knowledge graph entities, such as mapping "PM2.5" to the corresponding pollutant node); relationship inference (inferring relevant causal relationships and impact paths based on goals and constraints); and knowledge extraction (retrieving background knowledge, mechanism understanding, and experience cases related to the decision from the knowledge graph). The mapping algorithm employs a hybrid strategy, combining exact matching and semantic similarity calculation, achieving a mapping accuracy of over 90%. Based on the mapping results, the system constructs a decision knowledge subgraph, providing a knowledge foundation for subsequent solution generation.
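The AHP weighting and consistency check described in item (a) of this unit can be illustrated with a minimal sketch. It assumes a reciprocal pairwise judgment matrix, approximates the principal eigenvector by power iteration, and uses Saaty's commonly tabulated random index values; the function name `ahp_weights` and the example matrix are hypothetical and not taken from the original disclosure.

```python
# Random consistency index (RI) by matrix order, as commonly tabulated for AHP.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Return (weights, consistency_ratio) for a pairwise judgment matrix."""
    n = len(matrix)
    weights = [1.0 / n] * n
    # Power iteration: repeatedly multiply by the matrix and renormalize.
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [p / total for p in product]
    # Estimate the largest eigenvalue as the mean of (A w)_i / w_i.
    product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(product[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 2 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return weights, cr


# Hypothetical judgments for three sub-objectives (emission sources,
# population exposure, health protection).
judgment = [[1, 2, 4],
            [1 / 2, 1, 2],
            [1 / 4, 1 / 2, 1]]
weights, cr = ahp_weights(judgment)
accepted = cr < 0.1  # the consistency threshold stated in the text
```

A matrix that fails the check would be returned to the experts for revised judgments rather than used for weighting.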

[0256] Intervention Plan Generation Unit: This unit generates a set of feasible intervention plans based on the decision problem and knowledge base. The specific implementation method is as follows: (a) Knowledge Base Intervention Measure Retrieval: The system retrieves intervention measure knowledge related to the decision-making objectives from the knowledge graph. The retrieval dimensions include target pollutant / health effect, intervention type (source control, process interruption, end-of-pipe protection), applicable scenarios, and implementation conditions. Based on the intervention effect evidence in the knowledge graph, the system performs preliminary screening of measures, excluding measures with insufficient evidence or that are inapplicable.

[0257] (b) Solution Combination Generation: The system employs rule-based and heuristic search methods to combine individual interventions into complete intervention plans. Combination rules consider synergistic effects between measures (e.g., the combined effect of source control and exposure blocking) and conflict constraints (e.g., mutually exclusive measures due to resource competition). The system typically generates a set of 5-15 candidate plans, covering different intervention intensities and resource input levels.

[0258] (c) Pre-screening of feasibility: The system pre-screens candidate solutions based on decision constraints, excluding solutions that violate hard constraints and scoring the degree of violation of soft constraints. The pre-screened solution set then proceeds to the multi-criteria evaluation stage.
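The constraint triples (attribute, operator, threshold) and the feasibility pre-screening can be sketched as follows. This is only an illustration: the `Constraint` class, the plan dictionaries, and the soft-constraint penalty (sum of the priorities of violated soft constraints, with a larger number meaning higher priority) are assumptions made for the example.

```python
import operator
from dataclasses import dataclass

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt,
       ">": operator.gt, "==": operator.eq}


@dataclass(frozen=True)
class Constraint:
    attribute: str
    op: str
    threshold: float
    hard: bool = True
    priority: int = 1  # levels 1-5; assumed here that 5 is the most important

    def satisfied(self, plan):
        return OPS[self.op](plan[self.attribute], self.threshold)


def prescreen(plans, constraints):
    """Drop plans violating any hard constraint; score soft violations."""
    survivors = []
    for plan in plans:
        if any(c.hard and not c.satisfied(plan) for c in constraints):
            continue
        penalty = sum(c.priority for c in constraints
                      if not c.hard and not c.satisfied(plan))
        survivors.append((plan, penalty))
    # Plans with the smallest soft-constraint penalty come first.
    return sorted(survivors, key=lambda item: item[1])


# Hypothetical candidate plans and constraints.
plans = [
    {"name": "A", "cost_rmb": 800_000, "months": 10},
    {"name": "B", "cost_rmb": 1_200_000, "months": 6},
    {"name": "C", "cost_rmb": 900_000, "months": 14},
]
constraints = [
    Constraint("cost_rmb", "<=", 1_000_000, hard=True, priority=5),
    Constraint("months", "<=", 12, hard=False, priority=3),
]
screened = prescreen(plans, constraints)
```

In this example plan B is excluded by the hard budget cap, while plan C survives with a penalty for missing the preferred deadline.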

[0259] Multi-criteria decision evaluation unit: This unit performs multi-dimensional evaluation of candidate solutions, supporting solution comparison and selection. The specific implementation method is as follows: (a) Assessment Framework Construction: The system implements a multi-dimensional assessment framework for environmental health decision-making, comprising four main assessment dimensions: health benefits (weight w1=0.35), technical feasibility (weight w2=0.25), economic rationality (weight w3=0.25), and social impact (weight w4=0.15). Each dimension is further subdivided into several specific indicators: health benefits include health risk reduction rate, population coverage, response timeliness, and long-term protection effect; technical feasibility includes technology maturity, implementation complexity, technology reliability, and management difficulty; economic rationality includes initial investment, operating costs, indirect benefits, and cost-benefit ratio; and social impact includes public acceptance, social equity, environmental friendliness, and sustainable development. The system supports customization of the assessment framework, allowing users to adjust indicators and weights according to the characteristics of their decision-making.

[0260] (b) Indicator Quantification and Scoring: Based on the knowledge graph and decision-making models, the system quantifies the performance of each option on each indicator. Health benefit assessment is based on the exposure-effect relationship in the knowledge graph, calculating the reduction in health risk caused by the intervention; technical feasibility assessment uses the Technology Readiness Level scale (TRL 1-9) and the Implementation Complexity Scale (levels 1-5); economic rationality assessment uses life-cycle cost analysis and cost-benefit ratio calculation; social impact assessment combines expert scoring and case analysis. The quantification process considers data uncertainty, generating point estimates and confidence intervals. The system standardizes indicators with different units onto a uniform scale (0-100). The standardization method is chosen according to the characteristics of the indicator; available methods include the maximum value method, the range method, and the Z-score method.

[0261] (c) Comprehensive Evaluation and Ranking: The system employs multiple multi-criteria decision analysis (MCDA) methods to comprehensively evaluate candidate solutions. The comprehensive score calculation formula is as follows:

$$S = \sum_{i=1}^{n} w_i s_i$$

[0262] where $w_i$ is the weight of the i-th indicator, satisfying $\sum_{i=1}^{n} w_i = 1$, and $s_i$ is the standardized score on the i-th indicator. The system generates a ranking of alternatives and calculates dominance metrics (such as advantage rate) and relative distances between them. The system supports comparisons of results from multiple methods; when different methods yield consistent rankings, the credibility of the results is enhanced; when discrepancies arise, the system analyzes the reasons for the differences and provides decision-making references. The system also supports subgroup analysis to assess the differentiated impact of the alternatives on different population groups (such as children, the elderly, and vulnerable groups).

[0263] (d) Sensitivity and Robustness Analysis: The system implements comprehensive sensitivity and robustness analysis functions. Sensitivity analysis assesses the impact of changes in input parameters on the evaluation results, including three types of analysis: weight sensitivity analysis (changing indicator weights ±30%), indicator sensitivity analysis (changing indicator scores ±20%), and parameter sensitivity analysis (changing model parameters ±25%). Robustness analysis evaluates the stability of the scheme under different scenarios, using Monte Carlo simulation to randomly sample key parameters (N=10,000 times) and generate result distributions. The system calculates the robustness indicators of the scheme, including robustness probability (the probability that the scheme will remain in the top three) and maximum regret value (performance loss under the worst-case scenario). The analysis results are presented in intuitive charts, including tornado diagrams, butterfly diagrams, and probability distribution diagrams.
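The evaluation steps of this unit (range standardization to 0-100, weighted-sum scoring with the four dimension weights, ranking, and a Monte Carlo robustness probability) can be sketched as below. The sketch is a simplification under stated assumptions: only the weights are randomly perturbed (uniformly within ±30% and then renormalized), whereas the text samples "key parameters" more generally; all function names and example scores are hypothetical.

```python
import random

# Dimension weights as stated in the assessment framework.
WEIGHTS = {"health": 0.35, "technical": 0.25, "economic": 0.25, "social": 0.15}


def range_standardize(values, higher_is_better=True):
    """Range method: map raw indicator values onto a 0-100 scale."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [100.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) * 100.0 for v in values]
    return scaled if higher_is_better else [100.0 - s for s in scaled]


def composite_score(scores, weights):
    """Weighted sum of standardized scores."""
    return sum(weights[k] * scores[k] for k in weights)


def rank(alternatives, weights):
    return sorted(alternatives,
                  key=lambda name: composite_score(alternatives[name], weights),
                  reverse=True)


def robustness_probability(alternatives, weights, target, top_k=3,
                           n_samples=10_000, spread=0.30, seed=0):
    """Share of random weight perturbations in which `target` stays in the top k."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        perturbed = {k: w * rng.uniform(1 - spread, 1 + spread)
                     for k, w in weights.items()}
        total = sum(perturbed.values())
        perturbed = {k: w / total for k, w in perturbed.items()}
        if target in rank(alternatives, perturbed)[:top_k]:
            hits += 1
    return hits / n_samples


# Hypothetical standardized scores for two candidate plans.
alternatives = {
    "A": {"health": 80, "technical": 60, "economic": 70, "social": 50},
    "B": {"health": 60, "technical": 90, "economic": 80, "social": 70},
}
ranking = rank(alternatives, WEIGHTS)
```

A plan whose robustness probability stays high under perturbation is less dependent on any particular choice of weights.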

[0264] Scenario Analysis and Adaptive Management Unit: This unit considers the uncertainty of the decision-making environment and designs robust implementation strategies. The specific implementation method is as follows: (a) Scenario Design: Based on environmental change trends and policy development information in the knowledge graph, the system designs multiple future scenarios. Scenario dimensions include pollutant emission trends (increase, no change, decrease), population change trends (aging population, urbanization level), climate change impacts (temperature rise, frequency of extreme weather), and policy environment changes (stricter standards, increased investment). The system generates scenario sets by combining different states of key dimensions, typically containing 4-8 representative scenarios.

[0265] (b) Scenario Impact Assessment: The system evaluates the performance of each candidate solution under different scenarios, using a multi-criteria evaluation framework to score each solution under each scenario. The evaluation results are presented in the form of a scenario-solution matrix to help decision-makers identify solutions that perform stably in most scenarios.

[0266] (c) Adaptive Management Strategy Design: Based on the scenario analysis results, the system designs adaptive management strategies. Each strategy includes: a trigger condition definition (the strategy is adjusted when a specific environmental or health indicator reaches a preset threshold), contingency adjustment plans (pre-designed adjustment paths for different trigger conditions), and a monitoring and evaluation plan (defining key monitoring indicators and evaluation cycles to support continuous optimization of the plan).
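One possible way to read a scenario-solution matrix for stability is a minimax-regret comparison, using the maximum regret value mentioned for robustness analysis. The sketch below is an assumption about how such a comparison could be computed; the function name `max_regret` and the example scores are hypothetical.

```python
def max_regret(matrix):
    """matrix[solution][scenario] = score. Return {solution: maximum regret}."""
    scenarios = list(next(iter(matrix.values())))
    # Best achievable score in each scenario across all solutions.
    best = {s: max(row[s] for row in matrix.values()) for s in scenarios}
    # Regret = shortfall against the best solution for that scenario.
    return {solution: max(best[s] - row[s] for s in scenarios)
            for solution, row in matrix.items()}


# Hypothetical scenario-solution matrix (composite scores on a 0-100 scale).
scenario_matrix = {
    "A": {"emissions_up": 80, "emissions_down": 60},
    "B": {"emissions_up": 70, "emissions_down": 75},
}
regrets = max_regret(scenario_matrix)
most_stable = min(regrets, key=regrets.get)
```

Here solution B gives up some performance in one scenario but has the smaller worst-case loss overall.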

[0267] 6. Knowledge Base Dynamic Update and Maintenance Module

[0268] This module ensures the continuous updating, accuracy, and consistency of the environmental health knowledge graph. As shown in Figure 6, this module comprises four core parts: a knowledge change monitoring and acquisition submodule, a knowledge consistency verification submodule, an expert collaborative knowledge verification submodule, and a knowledge graph incremental update submodule.

[0269] 6.1 Knowledge Change Monitoring and Acquisition Submodule

[0270] This submodule continuously monitors knowledge updates in the field of environmental health, promptly capturing new knowledge and integrating it into the knowledge base. This submodule comprises four parts: an automatic scientific literature scanning unit, a knowledge change detection unit, an external knowledge base integration unit, and an expert knowledge acquisition unit. The system features a specially designed standards and specifications monitoring unit to specifically monitor the updates to environmental health standards. Standard monitoring covers environmental health-related standards issued by departments such as the Ministry of Ecology and Environment, the National Health Commission, and the Standardization Administration of China, acquiring standard update information in real time through regular scanning and API interfaces. When a new standard is detected or an existing standard is updated, the system automatically triggers a standard parsing process, converting the standard text into structured knowledge and comparing it with the existing knowledge base to identify the changes. The system pays particular attention to the impact of standard changes on the knowledge base, such as limit adjustments, method changes, or terminology updates, automatically assessing the scope of the changes and generating change reports.

[0271] Automatic Scientific Literature Scanning Unit: This unit enables continuous monitoring and acquisition of environmental health literature. The specific implementation method is as follows: (a) Scheduled scanning mechanism: The system automatically scans 25 academic databases, including PubMed and Web of Science, at a preset cycle (weekly by default), using predefined search strategies to retrieve newly published environmental health literature. The scanning adopts an incremental mode, retrieving only new literature published after the last scan time each time, reducing redundant processing.

[0272] (b) Keyword and Topic Tracking: The system maintains a keyword database and topic list in the field of environmental health, supporting multi-dimensional tracking of literature updates by keyword, topic, author, institution, etc. Based on hot entities and relationships in the knowledge graph, the system automatically identifies high-interest areas and prioritizes scanning them.

[0273] (c) Document Relevance Assessment: The system uses a BERT-based document relevance classifier to score the relevance of newly acquired documents, and only documents with high relevance (score > 0.7) are sent to the subsequent knowledge extraction process. Documents with low relevance are archived and stored for manual retrieval.

[0274] Knowledge Change Detection Unit: This unit extracts knowledge from new documents and detects differences from the existing knowledge base. The specific implementation method is as follows: (a) Incremental Knowledge Extraction: The system uses a knowledge mining module to automatically extract knowledge from new documents, generating a new set of knowledge triples. The extraction process adopts the same two-stage strategy and quality assessment process as the initial construction.

[0275] (b) Difference Detection: The system automatically compares newly extracted knowledge with the existing knowledge graph, detecting three types of changes: newly added knowledge (new entities or relationships not existing in the existing graph), updated knowledge (corrections or supplements to existing knowledge attribute values, such as new dose-response coefficients), and conflicting knowledge (information that contradicts existing knowledge). Difference detection employs entity alignment and relationship matching algorithms, achieving an accuracy rate of over 90%.

[0276] (c) Change Priority Ranking: The system prioritizes detected changes based on evidence level (higher-level evidence takes priority), scope of impact (changes involving safety thresholds take priority) and timeliness (changes related to emergency public health take priority).
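The three change types and the priority ordering can be illustrated with a small sketch. It assumes entity alignment has already been done, so triples can be compared by exact key; the `CONTRADICTS` table, the relation names, and the lexicographic ordering of the three priority criteria are assumptions for illustration only.

```python
# Hypothetical pairs of mutually contradictory relation types.
CONTRADICTS = {"causes": "does_not_cause", "does_not_cause": "causes"}


def classify_change(triple, attrs, graph):
    """graph maps (head, relation, tail) -> attribute dict."""
    head, relation, tail = triple
    opposite = CONTRADICTS.get(relation)
    if opposite is not None and (head, opposite, tail) in graph:
        return "conflict"
    if triple not in graph:
        return "new"
    return "updated" if graph[triple] != attrs else "unchanged"


def priority_key(change):
    """Sort key: stronger evidence first (level 1 = strongest), then changes
    touching safety thresholds, then urgent public-health items."""
    return (change["evidence_level"],
            not change["affects_safety_threshold"],
            not change["urgent"])


graph = {
    ("PM2.5", "increases_risk_of", "asthma"): {"relative_risk": 1.10},
    ("X", "causes", "Y"): {},
}
kind = classify_change(("PM2.5", "increases_risk_of", "asthma"),
                       {"relative_risk": 1.15}, graph)
```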

[0277] External Knowledge Base Integration Unit: This unit enables dynamic integration with external professional knowledge bases. The system regularly synchronizes with more than ten external knowledge bases, including PubChem, UMLS, EPA CompTox, and DisGeNET, to obtain updated chemical substance information, medical terminology, and toxicology data. Integration employs an incremental synchronization strategy, performing a full synchronization quarterly and a differential synchronization weekly. The synchronization process uses the same entity alignment and relation mapping methods as in Section 3.2.4 to ensure consistency between external knowledge and the system's knowledge.

[0278] Expert Knowledge Acquisition Unit: This unit supports direct contributions of professional knowledge from environmental health experts. The system provides a structured knowledge entry interface, allowing experts to directly add new knowledge entries, modify existing knowledge attributes, or mark knowledge quality. Knowledge entered by experts is automatically incorporated into the quality assessment process and labeled as an "Expert Contribution" source. The system records the contribution history of each expert, supporting knowledge traceability and accountability.

[0279] 6.2 Knowledge Consistency Verification Submodule

[0280] This submodule verifies the consistency between new knowledge and the existing knowledge base, handles potential conflicts, and ensures the logical coherence of the knowledge base. This submodule consists of four parts: a logical consistency check unit, a statistical consistency check unit, a knowledge conflict detection unit, and a conflict resolution strategy unit.

[0281] Logical Consistency Check Unit: This unit verifies the logical consistency of the knowledge graph, ensuring it conforms to predefined domain rules and constraints. The specific implementation method is as follows: (a) Type Constraint Check: Verify whether the category labels of entities and the domain constraints of relations conform to the ontology model definition. For example, check whether the subject of the "increase risk" relation is a pollutant entity and the object is a health effect entity. Knowledge entries that violate type constraints are marked as "type error" and need to be corrected before being added to the database.

[0282] (b) Logical Contradiction Detection: Detects logical contradictions in the knowledge graph, including direct contradictions (such as the simultaneous existence of "A causes B" and "A does not cause B" relationships for the same entity pair) and indirect contradictions (such as contradictory conclusions derived through transitive reasoning). Detection employs a combination of graph traversal and rule matching, with a rule base covering hundreds of consistency rules.

[0283] (c) Transitivity Consistency Check: Verifies whether the transitivity derivation of knowledge introduces contradictions. For example, if the knowledge graph contains "A promotes B", "B promotes C", and "A inhibits C", the system detects transitivity inconsistencies and marks the relevant knowledge entries. The checking algorithm traverses all paths with a length ≤ 4 in the knowledge graph and calculates the consistency of path effects.
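The transitivity check in item (c) can be sketched by treating "promotes" as +1 and "inhibits" as -1 and comparing the sign product along each path with any direct edge between the path's endpoints. The signed-edge encoding and the function name are assumptions for illustration; the path length limit of 4 follows the text.

```python
def find_transitivity_conflicts(edges, max_len=4):
    """edges maps (head, tail) -> +1 (promotes) or -1 (inhibits).

    Returns (path, derived_sign, direct_sign) for each simple path of
    2..max_len edges whose sign product disagrees with a direct edge.
    """
    adjacency = {}
    for (head, tail), sign in edges.items():
        adjacency.setdefault(head, []).append((tail, sign))
    conflicts = []

    def walk(start, node, sign, path):
        hops = len(path) - 1
        direct = edges.get((start, node))
        if hops >= 2 and direct is not None and direct != sign:
            conflicts.append((tuple(path), sign, direct))
        if hops == max_len:
            return
        for nxt, edge_sign in adjacency.get(node, []):
            if nxt not in path:  # simple paths only
                walk(start, nxt, sign * edge_sign, path + [nxt])

    for start in adjacency:
        walk(start, start, 1, [start])
    return conflicts


# The example from the text: A promotes B, B promotes C, A inhibits C.
example = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): -1}
found = find_transitivity_conflicts(example)
```

Flagged paths would be marked for review rather than corrected automatically, since the disagreement may reflect different conditions rather than an error.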

[0284] Statistical Consistency Check Unit: This unit assesses the consistency of knowledge based on statistical methods, identifying outliers and pattern biases. The specific implementation method is as follows: (a) Numerical Anomaly Detection: Statistical analysis is performed on numerical attributes in the knowledge graph to identify outliers. For quantitative parameters (such as relative risk values) of the same type of relationship, the system calculates the mean and standard deviation of the parameter distribution and marks values that deviate from the mean by more than 3 standard deviations as outliers. Outliers are not automatically deleted but are submitted for manual review after being marked.

[0285] (b) Pattern Consistency Assessment: Assess whether the newly added knowledge is consistent with the statistical patterns of existing knowledge. For example, when the relative risk value of a newly reported pollutant-health effect differs from the average value of multiple existing studies by more than a preset threshold, the system generates a "pattern bias" warning, indicating that there may be differences in research methods or new scientific findings.
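The two statistical checks can be sketched with the standard library. The sketch assumes the sample standard deviation is used and that the "pattern bias" threshold is an absolute difference from the mean; both choices, and the example values, are illustrative assumptions.

```python
from statistics import mean, stdev


def flag_outliers(values, k=3.0):
    """Return values more than k standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) > k * sigma]


def pattern_deviation(new_value, existing, threshold):
    """True if a newly reported value departs from the mean of existing studies."""
    return abs(new_value - mean(existing)) > threshold


# Hypothetical relative-risk values for one relationship type.
relative_risks = [1.1, 1.2, 1.0, 1.3, 1.2, 1.1, 1.0, 1.2, 1.3, 1.1] * 2 + [9.0]
flagged = flag_outliers(relative_risks)  # marked for manual review, not deleted
```

With very small samples no value can exceed three sample standard deviations, so the check is only meaningful once a relationship type has accumulated enough reported values.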

[0286] Knowledge Conflict Detection Unit: This unit automatically identifies conflicts and inconsistencies in the knowledge graph. The system identifies four types of conflicts: direct contradictions (e.g., "A causes B" vs. "A does not cause B"), numerical inconsistencies (e.g., dose-response coefficients reported from different sources differ by more than 50%), conditional conflicts (surface conflicts caused by different applicable conditions, such as differences in effects among different populations or at different exposure concentrations), and temporal evolution conflicts (conflicts caused by knowledge updates over time, such as old research conclusions being overturned by new research). The detection employs a knowledge graph traversal algorithm combined with semantic comparison methods, achieving an accuracy rate of over 90%.

[0287] Conflict Resolution Strategy Unit: This unit implements automated and semi-automated resolution of knowledge conflicts. The specific implementation method is as follows: (a) Automatic resolution strategy: For conflicts that can be resolved automatically, the system applies the following rules: selection based on the level of evidence (higher-level evidence takes precedence, such as Level I evidence taking precedence over Level VI evidence); selection based on timeliness (when the levels of evidence are comparable, the conclusion of the more recent study is adopted); and selection based on consistency (conclusions supported by the majority of studies take precedence over conclusions supported by a minority of studies).

[0288] (b) Conditional conflict resolution: For conditional conflicts, the system analyzes the differences in the applicable conditions of conflicting knowledge and transforms conflicting relationships into conditional relationships. For example, "PM2.5 increases the risk of cardiovascular disease (in the elderly population)" and "PM2.5 is not significantly associated with the risk of cardiovascular disease (in the young population)" are not considered contradictions, but are stored as conditional knowledge in the knowledge graph.

[0289] (c) Human Adjudication: For complex conflicts that cannot be resolved by automated strategies, the system packages the conflict information and submits it to experts for review, providing a conflict analysis report (including conflict type, relevant knowledge items, a summary of evidence from each party, and system recommendations). The experts then make the final adjudication. The adjudication results are recorded as conflict resolution cases to optimize automated resolution strategies.
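The automatic resolution rules in item (a), with escalation to human adjudication as the fallback, can be sketched as a cascade of comparisons. The `Claim` fields are hypothetical, and "comparable" evidence levels are simplified here to "equal"; a real system might use a tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    statement: str
    evidence_level: int      # 1 = Level I (strongest) ... 6 = Level VI
    year: int
    supporting_studies: int


def resolve(a, b):
    """Return the preferred claim, or None to escalate to expert adjudication."""
    rules = (
        lambda c: -c.evidence_level,    # stronger evidence wins
        lambda c: c.year,               # then the more recent study
        lambda c: c.supporting_studies, # then the majority position
    )
    for rule in rules:
        if rule(a) != rule(b):
            return a if rule(a) > rule(b) else b
    return None


older = Claim("A causes B", evidence_level=2, year=2015, supporting_studies=4)
newer = Claim("A does not cause B", evidence_level=2, year=2022, supporting_studies=2)
winner = resolve(older, newer)
```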

[0290] Through the collaborative work of these four core units, the knowledge consistency verification submodule achieves comprehensive consistency management of the environmental health knowledge graph, ensuring the logical coherence, statistical consistency, and version coordination of knowledge. The system can automatically detect various types of knowledge conflicts and provide scientific and systematic solutions, significantly improving the quality and reliability of the knowledge base. These functions are particularly important in the environmental health field because knowledge in this area is rapidly updated and diverse in perspectives, requiring rigorous consistency management to ensure the scientific validity and reliability of knowledge application.

[0291] The knowledge base update mechanism pays special attention to the dynamic changes in national environmental health standards. A dedicated standard monitoring mechanism is established, and when national standards such as the "Technical Guidelines for Ecological and Environmental Health Risk Assessment" are updated, the system automatically triggers consistency checks and necessary updates to the relevant parts of the knowledge base, ensuring that the knowledge base is always synchronized with the latest national standards. The system saves historical versions and change records of standards, supporting queries of historical standards by time point, providing a foundation for longitudinal analysis of standard evolution. A "smooth transition" strategy is adopted for standard update processing. Before a new standard officially comes into effect, the system maintains both old and new standard versions simultaneously, clearly marking their respective applicable periods to avoid discontinuity issues caused by standard switching. The system also implements impact propagation analysis of standard changes, assessing the impact of standard updates on knowledge items and application functions that rely on the standard, ensuring the overall consistency and effectiveness of the system.

[0292] 6.3 Expert Collaboration Knowledge Verification Submodule

[0293] This submodule builds an environmental health expert network to support knowledge auditing and verification. It comprises four core parts: an expert network management unit, an intelligent task allocation unit, a collaborative audit workflow unit, and an audit quality control unit.

[0294] Expert Network Management Unit: This unit builds and maintains a collaborative network of environmental health experts. The system maintains an expert database, recording each expert's basic information, research field (e.g., air pollution and health, water pollution toxicology, environmental epidemiology), area of expertise (e.g., research experience with specific pollutants or health effects), academic qualifications (professional title, h-index, representative achievements), and review history (number of reviewed knowledge items, review quality score). Experts are grouped by field to ensure sufficient review capacity in each major environmental health area. The system supports two methods of joining the network: expert self-registration and administrator invitation.

[0295] Intelligent Task Allocation Unit: This unit intelligently allocates knowledge verification tasks. The specific implementation method is as follows: Based on the subject area and complexity of the knowledge items to be reviewed, the system automatically matches the most suitable review experts. The allocation algorithm comprehensively considers three factors: domain matching degree (the degree of matching between the expert's research field and the knowledge item's topic, weight 0.5), workload balancing (prioritizing allocation to experts with fewer current tasks, weight 0.3), and review quality history (prioritizing allocation to experts with higher review quality scores, weight 0.2). For critical knowledge items that involve safety thresholds or have high impact, the system assigns two or more experts for independent review.
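The three-factor allocation can be sketched as a weighted score using the stated weights (0.5, 0.3, 0.2). How each factor is normalized is not specified in the text; the sketch assumes all three lie in [0, 1] and that workload is scored as one minus the expert's share of the largest open-task count. Expert names and values are hypothetical.

```python
def allocation_score(domain_match, open_tasks, max_open_tasks, quality):
    """Weighted score; domain_match and quality are assumed to lie in [0, 1]."""
    workload = 1.0 - open_tasks / max_open_tasks if max_open_tasks else 1.0
    return 0.5 * domain_match + 0.3 * workload + 0.2 * quality


def assign(experts, n_reviewers=1):
    """Pick the top-scoring experts; use n_reviewers >= 2 for critical items."""
    max_open = max(e["open_tasks"] for e in experts)
    ranked = sorted(
        experts,
        key=lambda e: allocation_score(e["domain_match"], e["open_tasks"],
                                       max_open, e["quality"]),
        reverse=True,
    )
    return [e["name"] for e in ranked[:n_reviewers]]


experts = [
    {"name": "expert_a", "domain_match": 0.9, "open_tasks": 4, "quality": 0.8},
    {"name": "expert_b", "domain_match": 0.6, "open_tasks": 0, "quality": 0.9},
    {"name": "expert_c", "domain_match": 0.3, "open_tasks": 2, "quality": 0.5},
]
reviewers = assign(experts, n_reviewers=2)  # e.g. an item involving a safety threshold
```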

[0296] Collaborative Review Workflow Unit: This unit implements a structured and efficient knowledge review process. The review process includes three stages: initial review (a single expert reviews the accuracy, completeness, and level of evidence of the knowledge item, providing a "pass," "pass after revision," or "fail" review opinion); secondary review (items with differing opinions from the initial review or involving key knowledge are independently reviewed by a second expert); and final review (items with remaining disagreements from the secondary review are submitted to senior experts or an expert committee for adjudication). Each stage has a time limit (7 days for initial review, 5 days for secondary review, and 10 days for final review), and tasks not completed within the time limit are automatically reassigned. The system provides an online review interface that displays complete contextual information about the knowledge item (including original literature, extraction process, and quality assessment results), supporting experts in completing reviews efficiently.

[0297] Audit Quality Control Unit: This unit ensures the quality and consistency of the knowledge audit process. The specific implementation method is as follows: Audit consistency assessment uses Cohen's Kappa coefficient to measure the consistency between two audit experts on the same knowledge item. A Kappa value below 0.6 triggers a re-audit. The system regularly (quarterly) organizes audit calibration activities, selecting standard cases for all audit experts to conduct independent reviews, statistically analyzing the consistency of review results, discussing points of disagreement, and unifying audit standards. The system records all audit comments and results, generating audit quality reports for continuous improvement of the audit process.
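Cohen's Kappa corrects the observed agreement between two reviewers for the agreement expected by chance. The sketch below assumes the coefficient is computed over a batch of knowledge items that both experts reviewed, since a single item does not yield a meaningful Kappa; the label strings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters labelling the same sequence of items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_1 = ["pass", "pass", "fail", "fail"]
reviewer_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
needs_reaudit = kappa < 0.6  # threshold stated in the text
```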

[0298] 6.4 Knowledge Graph Incremental Update Submodule

[0299] This submodule enables efficient incremental updates of the knowledge graph, maintaining the timeliness of knowledge. It comprises four core parts: a change operation definition unit, an incremental update algorithm unit, a version control and rollback unit, and an update triggering and scheduling unit.

[0300] Change Operation Definition Unit: This unit defines the standardized change operations for the knowledge graph. The system defines six basic change operations: Node Creation (adding a new entity, requiring the entity category and attribute values to be specified), Node Update (modifying the attribute values of an existing entity, requiring the target entity and the content to be modified to be specified), Node Deletion (removing an entity and all its associated relationships, requiring a check for cascading effects), Edge Creation (adding a new relationship, requiring the head and tail entities, relationship type, and attributes to be specified), Edge Update (modifying the attribute values of an existing relationship, such as updating the evidence level or numerical parameters), and Edge Deletion (removing a relationship, requiring a check for whether it affects the integrity of the reasoning path). Each operation includes three stages: precondition checking (verifying the legality of the operation, such as entity category constraints and relationship domain and range constraints), execution logic (executing the graph database operation), and post-verification (verifying the correctness and consistency of the operation results).

[0301] Incremental Update Algorithm Unit: This unit implements efficient incremental updates of the knowledge graph. The specific implementation method is as follows: (a) Transaction mechanism: All change operations are encapsulated in database transactions to ensure the atomicity of updates (operations either all succeed or all are rolled back), consistency (the knowledge graph before and after the update satisfies all constraints), isolation (concurrent update operations do not interfere with each other) and durability (successfully committed updates will not be lost).

[0302] (b) Batch Update Optimization: For large-scale knowledge updates (such as batch additions after periodic document scanning), the system uses an optimized graph traversal algorithm to calculate the optimal execution order of change operations, reducing the overhead of consistency checks in intermediate states. Batch updates support parallel execution, using a fragmented locking mechanism to avoid update conflicts. Operations within the same fragment are executed serially, while operations between different fragments are executed in parallel.

[0303] (c) Impact Propagation Analysis: Before executing a change operation, the system analyzes the impact of the change on other parts of the knowledge graph. For example, deleting an intermediate mechanism node may cause a break in the causal path; the system will identify the affected path in advance and prompt the user for confirmation. Impact analysis uses a limited range of graph traversal, analyzing the impact within a 3-hop range by default.
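The bounded impact analysis in item (c) can be sketched as a breadth-first traversal limited to a fixed number of hops (3 by default, as stated). The adjacency-dictionary representation and the function name are assumptions for illustration.

```python
from collections import deque


def affected_nodes(adjacency, start, max_hops=3):
    """Return {node: hop distance} for nodes within max_hops of `start`."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


# Hypothetical causal chain: pollutant -> mechanism -> biomarker -> disease -> outcome.
chain = {
    "pollutant": ["mechanism"],
    "mechanism": ["biomarker"],
    "biomarker": ["disease"],
    "disease": ["outcome"],
}
impact = affected_nodes(chain, "pollutant")
```

The nodes returned would be presented to the user for confirmation before a deletion is executed.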

[0304] Version Control and Rollback Unit: This unit manages the version history of the knowledge graph, supporting state tracking and rollback. The specific implementation method is as follows: (a) Semantic Version Number: The system uses a semantic version number in the format Major.Minor.Patch to identify the knowledge graph version. Major version changes correspond to incompatible structural changes in the ontology model (such as adding new core entity categories); Minor version changes correspond to significant updates to knowledge content (such as adding knowledge in batches or updating important standards); Patch version changes correspond to minor corrections (such as modifying the attributes of individual knowledge items or fixing errors).

[0305] (b) Version Snapshots: Each update automatically creates a version snapshot, recording the changes (a list of changes), the reason for the change (e.g., "Document scanning update in the third week of 2024" or "HJ 1111-2020 standard update"), and the time of the change. Snapshots use an incremental storage strategy, recording only the differences from the previous version to reduce storage overhead.

[0306] (c) Rollback Operation: The system supports viewing and rolling back any version. Viewing reconstructs the knowledge graph state of historical versions by reversing the application of difference snapshots. Rollback restores the knowledge graph to a specified version by reversing the execution of change logs. Before rolling back, the system automatically assesses the impact and indicates potentially affected application functions and inference results.
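A minimal sketch of how semantic version numbers, incremental snapshots, and rollback could fit together is given below. The class and function names are hypothetical, the graph is reduced to a set of triples, and the change timestamp recorded by the system is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    version: str
    reason: str
    added: set    # triples added by this update
    removed: set  # triples removed by this update

def bump(version, level):
    """Increment a Major.Minor.Patch version string at the given level."""
    major, minor, patch = (int(p) for p in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")

class VersionedGraph:
    def __init__(self):
        self.triples = set()
        self.version = "0.0.0"
        self.history = []  # incremental snapshots, oldest first

    def commit(self, added, removed, level, reason):
        added = set(added) - self.triples      # store only real differences
        removed = set(removed) & self.triples
        self.triples = (self.triples - removed) | added
        self.version = bump(self.version, level)
        self.history.append(Snapshot(self.version, reason, added, removed))

    def state_at(self, version):
        """Rebuild a historical state by undoing snapshots, newest first."""
        state = set(self.triples)
        for snap in reversed(self.history):
            if snap.version == version:
                return state
            state = (state - snap.added) | snap.removed
        if version == "0.0.0":
            return state
        raise KeyError(version)

    def rollback(self, version):
        target = self.state_at(version)
        while self.history and self.history[-1].version != version:
            self.history.pop()
        self.triples, self.version = target, version
```

Storing only `added` and `removed` per snapshot mirrors the incremental storage strategy: any earlier state is recovered by applying the differences in reverse.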

[0307] Update Triggering and Scheduling Unit: This unit manages the triggering mechanism and scheduling process for knowledge graph updates. The specific implementation method is as follows: (a) Triggering modes: The system supports three triggering modes: scheduled triggering (the literature scanning and knowledge update process runs automatically every week), event triggering (the update process starts immediately when an important knowledge change is detected, such as an update to a national standard or the publication of a major research finding), and manual triggering (administrators start the update process as needed).

[0308] (b) Scheduling Management: The system maintains and updates a task queue, processing tasks according to priority. Priority, from highest to lowest, is as follows: safety-related updates (e.g., changes in pollutant toxicity assessment results) > standard updates (e.g., adjustments to national standard limits) > routine knowledge updates (e.g., adding new literature knowledge to the database) > optimization updates (e.g., supplementing knowledge attributes and improving quality). The system monitors the execution status of the update process, automatically retrying failed tasks (up to 3 times), and submitting persistently failing tasks to the administrator for processing.

[0309] (c) Update Report Generation: After each update, the system automatically generates an update report, including statistics for this update (number of new entities, number of new relationships, number of modified entries, number of deleted entries), a summary of important changes (addition or modification of high-level evidence knowledge), conflict resolution records, and the planned time for the next update. The report is sent to the administrator and relevant experts via email.
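The priority ordering and retry behaviour described in (b) can be sketched with a binary heap. The category keys, function names, and the simulated flaky tasks are hypothetical; "up to 3 times" is read here as one initial attempt followed by at most three retries, which is an assumption.

```python
import heapq
import itertools

# Lower number means higher priority, following the order given in the text.
PRIORITY = {"safety": 0, "standard": 1, "routine": 2, "optimization": 3}
MAX_RETRIES = 3

def run_queue(tasks):
    """tasks: iterable of (category, name, callable). Returns (done, escalated)."""
    counter = itertools.count()  # tie-breaker keeps FIFO order within a priority
    heap = [(PRIORITY[cat], next(counter), name, fn) for cat, name, fn in tasks]
    heapq.heapify(heap)
    done, escalated = [], []
    while heap:
        _, _, name, fn = heapq.heappop(heap)
        for _attempt in range(1 + MAX_RETRIES):
            try:
                fn()
                done.append(name)
                break
            except Exception:
                continue  # retry until the attempts are used up
        else:
            escalated.append(name)  # handed over to the administrator
    return done, escalated

def make_flaky(failures):
    """Hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"left": failures}
    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("transient failure")
    return task

done, escalated = run_queue([
    ("routine", "add literature knowledge", make_flaky(0)),
    ("safety", "toxicity assessment change", make_flaky(2)),
    ("optimization", "supplement attributes", make_flaky(10)),
    ("standard", "national limit adjustment", make_flaky(0)),
])
```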

[0310] Through the collaborative work of the aforementioned core units, the incremental update submodule of the knowledge graph achieves continuous integration and version management of environmental health knowledge, enabling the knowledge graph to be dynamically updated as research progresses. Integrating knowledge extraction, quality assessment, knowledge graph construction, and reasoning processes, this invention forms a quality-evidence-weight synergy mechanism that runs through all processing stages.

[0311] As shown in Figure 7, the quality evaluation results and evidence level determination results formed in the knowledge extraction stage are mapped to weight parameters in the knowledge graph and embedded in the graph structure.

[0312] During the knowledge graph construction phase, the weight parameters are used as relation confidence scores to participate in the graph structure organization; during the reasoning phase, different reasoning paths are weighted according to the weight parameters; during the conflict resolution phase, the weights of relevant relations are adjusted based on the differences in evidence levels and the comprehensive scoring results; at the same time, the confidence information of the reasoning results can be used to update the weight parameters.

[0313] Through the aforementioned collaborative mechanism, knowledge quality, evidence strength, and reasoning computation are coordinated to support risk assessment and decision analysis in the field of environmental health.

[0314] 7. Application Example: Urban PM2.5 Health Risk Assessment

[0315] To verify the technical effectiveness and practical application value of the system of the present invention, implementation verification was carried out using urban PM2.5 health risk assessment as a typical application scenario.

[0316] 7.1 Knowledge Mining Stage

[0317] Through its intelligent literature acquisition and preprocessing submodule, the system automatically retrieved over 10,000 articles on the health impacts of PM2.5, published between 2010 and 2024, from the PubMed and Web of Science databases. After literature relevance assessment and screening, nearly 10,000 highly relevant articles were retained. The system used a large language model that had undergone domain-adaptive pre-training for environmental health and knowledge extraction instruction fine-tuning, and applied a two-stage extraction strategy to extract knowledge triples from the literature. The first stage used specialized prompt templates for targeted extraction of five core knowledge categories: a pollutant characteristics template extracted PM2.5 composition (such as organic carbon, elemental carbon, sulfates, and nitrates) and physicochemical parameters; an exposure-effect relationship template extracted the dose-response coefficients between PM2.5 concentration and various health endpoints; a biomarker relationship template extracted information on inflammatory markers (such as C-reactive protein and interleukin-6) and oxidative stress markers (such as 8-OHdG) related to PM2.5 exposure; a mechanism-pathway relationship template extracted mechanisms of action such as oxidative stress, inflammatory response, and autonomic nervous system regulation; and an intervention effect relationship template extracted effect data on interventions such as air purifier use, mask wearing, and indoor ventilation. The second stage used open-ended prompts to extract knowledge not covered in the first stage. In total, the system extracted over 20,000 knowledge triples.

[0318] After multi-dimensional quality assessment, the system calculated the comprehensive quality score for each triple, with the five default dimension weights set to 0.3, 0.2, 0.15, 0.15, and 0.2. After screening, nearly 20,000 high-quality knowledge triples were retained (retention rate of approximately 80%), of which approximately 12% were high-quality evidence at levels I-II, approximately 55% were medium-quality evidence at levels III-IV, and approximately 33% were preliminary evidence at levels V-VI. Compared with the results of manual extraction, the system's extraction precision can reach over 90%, recall can reach over 85%, and the F1 score can reach over 0.85.
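As an illustration of the scoring step, the sketch below treats the comprehensive quality score as a weighted sum of five dimension scores in [0, 1]. Several points are assumptions and not taken from this paragraph: the weighted-sum form, the assignment of the five weights to the dimensions in the order used in the claims, the example scores, and the threshold value of 0.6.

```python
# Assumed mapping of the five default weights to the assessment dimensions.
WEIGHTS = {
    "extraction_credibility": 0.30,
    "content_consistency": 0.20,
    "source_reliability": 0.15,
    "research_quality": 0.15,
    "evidence_level": 0.20,
}
QUALITY_THRESHOLD = 0.6  # hypothetical preset threshold

def composite_score(scores):
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def keep(scores, threshold=QUALITY_THRESHOLD):
    return composite_score(scores) >= threshold

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical scores for one extracted triple.
example = {
    "extraction_credibility": 0.9,
    "content_consistency": 0.8,
    "source_reliability": 0.7,
    "research_quality": 0.6,
    "evidence_level": 0.5,
}
example_score = composite_score(example)
boundary_f1 = f1(0.90, 0.85)
```

With precision 0.90 and recall 0.85, `f1` evaluates to roughly 0.87, which is consistent with the stated F1 level of over 0.85.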

[0319] 7.2 Knowledge Graph Construction Stage

[0320] The system, based on an ontology model with multiple core entities and various standard relation types, constructs a PM2.5 health impact knowledge subgraph from extracted knowledge triples. A four-level entity linking strategy connects entities to standard ontologies: exact matching covers approximately 45%, fuzzy matching approximately 25%, semantic matching approximately 20%, and contextual matching approximately 8%, achieving a total linking success rate of over 95%. The system synchronously integrates PM2.5-related standard knowledge from the "Technical Guidelines for Ecological and Environmental Health Risk Assessment" (HJ 1111-2020), including exposure assessment method requirements and risk characterization methods, and adopts an authoritative priority principle to handle discrepancies between standard knowledge and literature knowledge. The completed PM2.5 knowledge subgraph contains thousands of entity nodes and tens of thousands of relation edges, covering dozens of PM2.5 components, dozens of health effects, dozens of mechanisms of action, and dozens of intervention measures.
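The cascading behaviour of the four-level linking strategy (try each level in order and stop at the first success) can be sketched as follows. Everything concrete here is hypothetical: the dictionary, the entity identifiers, the toy two-dimensional embeddings, and the thresholds. The fuzzy level uses the similarity ratio from Python's `difflib` as a stand-in for edit distance and character n-gram similarity, and the contextual level reuses cosine similarity in place of a pre-trained language model.

```python
import difflib
import math

DICTIONARY = {"fine particulate matter": "ENT:PM2.5",
              "c-reactive protein": "ENT:CRP"}
EMBEDDINGS = {"ENT:PM2.5": (1.0, 0.0), "ENT:CRP": (0.0, 1.0)}  # toy vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(vector, threshold):
    best = max(EMBEDDINGS, key=lambda e: cosine(vector, EMBEDDINGS[e]))
    return best if cosine(vector, EMBEDDINGS[best]) >= threshold else None

def exact(mention, **_):
    return DICTIONARY.get(mention.lower())

def fuzzy(mention, **_):
    hits = difflib.get_close_matches(mention.lower(), list(DICTIONARY),
                                     n=1, cutoff=0.85)
    return DICTIONARY[hits[0]] if hits else None

def semantic(mention, vector=None, **_):
    return nearest(vector, 0.9) if vector is not None else None

def contextual(mention, context_vector=None, **_):
    return nearest(context_vector, 0.8) if context_vector is not None else None

LEVELS = [("exact", exact), ("fuzzy", fuzzy),
          ("semantic", semantic), ("contextual", contextual)]

def link(mention, **features):
    """Run the levels in order; later levels are skipped after a match."""
    for level, matcher in LEVELS:
        entity = matcher(mention, **features)
        if entity:
            return entity, level
    return f"TMP:{mention}", "pending_verification"  # temporary entity
```

For example, `link("fine particulate mater")` falls through the exact level and is resolved at the fuzzy level, while a mention that no level can resolve is returned as a temporary entity marked for verification.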

[0321] 7.3 Knowledge Reasoning Stage

[0322] A user queries "How does PM2.5 cause cardiovascular disease?". After analyzing the query characteristics, the strategy fusion unit determines the query to be "mechanism-explanatory" and dynamically selects a combination strategy of rule-based reasoning (weight 0.4) and statistical reasoning (weight 0.6). The causal path search algorithm uses an improved bidirectional breadth-first search to search for causal paths in the PM2.5 knowledge subgraph. After path length constraints (≤5 hops), causal semantic relationship type filtering, and intermediate node biological plausibility checks, the top 5 ranked causal paths are returned:

Pathway 1: PM2.5 → Oxidative stress → Endothelial damage → Atherosclerosis → Cardiovascular disease (path confidence approximately 0.89)
Pathway 2: PM2.5 → Systemic inflammation → Elevated C-reactive protein → Coronary heart disease (path confidence approximately 0.85)
Pathway 3: PM2.5 → Autonomic dysfunction → Decreased heart rate variability → Arrhythmia (path confidence approximately 0.82)
Pathway 4: PM2.5 → Coagulation dysfunction → Thrombosis → Myocardial infarction (path confidence approximately 0.78)
Pathway 5: PM2.5 → Lipid metabolism disorder → Hyperlipidemia → Atherosclerosis (path confidence approximately 0.73)

Path confidence is calculated as the geometric mean of the confidence scores of all edges.
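The geometric-mean scoring and the hop limit can be illustrated with a few lines of Python. The edge confidence values below are hypothetical and are not taken from the knowledge graph; the function names are likewise illustrative.

```python
import math

def path_confidence(edge_confidences):
    """Geometric mean of the confidence scores of all edges on a path."""
    return math.prod(edge_confidences) ** (1 / len(edge_confidences))

def rank_paths(paths, max_hops=5, top_k=5):
    """paths: list of (node list, edge confidence list). Highest score first."""
    kept = [(path_confidence(conf), nodes)
            for nodes, conf in paths if len(conf) <= max_hops]
    return sorted(kept, key=lambda item: item[0], reverse=True)[:top_k]

# Hypothetical candidate paths with made-up edge confidences.
candidates = [
    (["PM2.5", "Oxidative stress", "Endothelial damage",
      "Atherosclerosis", "Cardiovascular disease"], [0.92, 0.90, 0.88, 0.86]),
    (["PM2.5", "Coagulation dysfunction", "Thrombosis",
      "Myocardial infarction"], [0.80, 0.78, 0.76]),
    (["PM2.5", "A", "B", "C", "D", "E", "Cardiovascular disease"],
     [0.99, 0.99, 0.99, 0.99, 0.99, 0.99]),  # 6 hops, filtered out
]
ranked = rank_paths(candidates)
```

Compared with a plain product of edge confidences, the geometric mean does not penalise a path merely for being longer, so the hop limit is applied as a separate filter.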

[0323] The causal relationship between PM2.5 and cardiovascular disease was quantitatively assessed based on the Bradford Hill criteria. The scores and weights of the nine dimensions are as follows: strength 4.2 / 5 (weight 0.15), consistency 4.5 / 5 (weight 0.15), specificity 3.0 / 5 (weight 0.08), temporality 4.8 / 5 (weight 0.12), dose-response 4.3 / 5 (weight 0.15), plausibility 4.5 / 5 (weight 0.12), coherence 4.0 / 5 (weight 0.08), experimental evidence 3.8 / 5 (weight 0.10), and analogy 3.5 / 5 (weight 0.05). The weighted total score is approximately 4.2 / 5, and the level of evidence is rated as "strong".
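The weighted total can be checked directly from the nine scores and weights listed above. The dictionary layout and function name are illustrative only; the numbers are those given in this paragraph.

```python
# (score out of 5, weight) for each Bradford Hill dimension.
CRITERIA = {
    "strength": (4.2, 0.15),
    "consistency": (4.5, 0.15),
    "specificity": (3.0, 0.08),
    "temporality": (4.8, 0.12),
    "dose_response": (4.3, 0.15),
    "plausibility": (4.5, 0.12),
    "coherence": (4.0, 0.08),
    "experimental_evidence": (3.8, 0.10),
    "analogy": (3.5, 0.05),
}

def weighted_total(criteria):
    """Sum of score * weight over all dimensions."""
    return sum(score * weight for score, weight in criteria.values())

total_weight = sum(weight for _, weight in CRITERIA.values())
total_score = weighted_total(CRITERIA)
```

The weights sum to 1.0 and the weighted sum evaluates to about 4.18 out of 5.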

[0324] The system integrates evidence from multiple sources based on Dempster-Shafer theory to calculate the combined effect estimate of the increase in cardiovascular disease mortality caused by each 10 μg / m³ increase in PM2.5 exposure. The belief interval [Bel≈0.82, Pl≈0.95] indicates that the causal relationship has a high degree of confidence.
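Dempster's combination rule and the resulting belief interval can be sketched as follows. The frame of discernment, the two evidence sources, and their mass values are hypothetical and chosen only for illustration; they are not the inputs behind the interval reported above.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: returns (combined masses, conflict mass K)."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + x * y
        else:
            conflict += x * y  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1 - conflict) for k, v in combined.items()}, conflict

def belief(masses, hypothesis):
    """Bel(H): total mass of all subsets of H."""
    return sum(v for k, v in masses.items() if k <= hypothesis)

def plausibility(masses, hypothesis):
    """Pl(H): total mass of all sets that intersect H."""
    return sum(v for k, v in masses.items() if k & hypothesis)

CAUSAL = frozenset({"causal"})
NOT_CAUSAL = frozenset({"not_causal"})
THETA = CAUSAL | NOT_CAUSAL  # full frame, represents ignorance

# Hypothetical basic belief assignments from two evidence sources.
cohort_studies = {CAUSAL: 0.6, NOT_CAUSAL: 0.1, THETA: 0.3}
toxicology = {CAUSAL: 0.7, THETA: 0.3}

fused, k = combine(cohort_studies, toxicology)
interval = (belief(fused, CAUSAL), plausibility(fused, CAUSAL))
```

The width of `interval` reflects the mass that remains on the full frame after combination, that is, the part of the evidence that supports neither hypothesis specifically.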

[0325] 7.4 Knowledge Application Stage

[0326] Based on the above knowledge reasoning results, the system generates a PM2.5 health risk assessment report for a certain city. The exposure assessment unit extracts the physicochemical properties and exposure factors of PM2.5 from the knowledge graph (preferably using recommended values from the "Basic Dataset for Exposure Parameter Survey"), and combines this with the city's PM2.5 monitoring data (annual average concentration 45 μg / m³), using Monte Carlo simulation (tens of thousands of iterations) to calculate the population exposure dose distribution. The dose-response relationship application unit extracts the concentration-response coefficient of PM2.5-cardiovascular disease from the knowledge graph. The comprehensive risk characterization unit calculates the excess risk of PM2.5-related cardiovascular disease in the city.
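A minimal Monte Carlo sketch of the exposure dose calculation is shown below. It assumes a simplified inhalation dose model, dose = concentration × inhalation rate / body weight, and all distribution shapes and parameters other than the 45 μg/m³ annual average are hypothetical placeholders, not values from the exposure parameter dataset.

```python
import math
import random
import statistics

def simulate_dose(n=10_000, seed=0):
    """Sample daily inhaled PM2.5 dose per kg body weight (ug/kg/day)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    doses = []
    for _ in range(n):
        # Concentration: lognormal with median 45 ug/m3 (spread is assumed).
        concentration = rng.lognormvariate(math.log(45.0), 0.2)
        # Inhalation rate in m3/day and body weight in kg: assumed normal
        # distributions, truncated below to stay physically meaningful.
        inhalation_rate = max(rng.gauss(15.0, 2.0), 1.0)
        body_weight = max(rng.gauss(60.0, 10.0), 20.0)
        doses.append(concentration * inhalation_rate / body_weight)
    return doses

doses = simulate_dose()
cut_points = statistics.quantiles(doses, n=20)  # 5%, 10%, ..., 95%
median_dose = cut_points[9]
p95_dose = cut_points[18]
```

Reporting the median together with an upper percentile such as the 95th gives the population dose distribution in a form that can be fed into the risk characterization step.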

[0327] The knowledge-driven decision support submodule constructs a decision objective system based on the analytic hierarchy process (AHP), retrieves intervention measure knowledge from the knowledge graph, and generates three candidate solutions: "strengthening industrial emission control," "promoting clean energy," and "constructing urban green belts." The multi-criteria decision evaluation unit assesses the solutions on four dimensions: health benefits, technical feasibility, economic rationality, and social impact. The comprehensive scores are approximately 82 points for Solution 1, approximately 79 points for Solution 2, and approximately 71 points for Solution 3. Monte Carlo robustness analysis (N = 10,000 iterations) shows that Solution 1 stays in the top two with a probability exceeding 90%, and Solution 2 does so with a probability of approximately 87%, indicating that the ranking results have good robustness.
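One way to run such a robustness analysis is to perturb the criterion weights at random and count how often each solution stays in the top two. The sketch below assumes this perturbation scheme; the per-criterion scores, the base weights, and the ±20% spread are hypothetical and will not reproduce the probabilities reported above.

```python
import random

def top_two_probability(scores, base_weights, n=10_000, spread=0.2, seed=0):
    """Share of weight perturbations in which each solution ranks in the top two."""
    rng = random.Random(seed)
    hits = {name: 0 for name in scores}
    for _ in range(n):
        # Perturb each weight by up to +/- spread, then renormalise to sum to 1.
        weights = [w * rng.uniform(1 - spread, 1 + spread) for w in base_weights]
        total = sum(weights)
        weights = [w / total for w in weights]
        totals = {name: sum(w * s for w, s in zip(weights, values))
                  for name, values in scores.items()}
        for name in sorted(totals, key=totals.get, reverse=True)[:2]:
            hits[name] += 1
    return {name: count / n for name, count in hits.items()}

# Hypothetical scores on: health benefits, technical feasibility,
# economic rationality, social impact.
SCORES = {
    "industrial emission control": [90, 80, 70, 85],
    "clean energy": [85, 75, 72, 82],
    "urban green belts": [65, 85, 78, 60],
}
BASE_WEIGHTS = [0.4, 0.2, 0.2, 0.2]  # hypothetical AHP-derived weights

probabilities = top_two_probability(SCORES, BASE_WEIGHTS, n=2_000)
```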

[0328] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. An environmental health knowledge graph system based on a large language model, characterized in that the system comprises: a knowledge mining module driven by a large language model, an environmental health knowledge graph construction module, a knowledge reasoning engine module, a knowledge application service module, and a knowledge base dynamic update and maintenance module. The knowledge mining module uses an environmental health large language model that has undergone domain-adaptive pre-training and knowledge extraction instruction fine-tuning. It employs a two-stage extraction strategy that combines high-precision targeted extraction with broad-coverage general extraction to automatically extract structured knowledge triples in the field of environmental health from scientific literature. The module then filters these triples through a multi-dimensional quality assessment that includes five dimensions: extraction credibility, content consistency, source reliability, research quality, and evidence level. The knowledge graph construction module is based on a domain ontology model that includes multiple core entity categories such as environmental pollutants, environmental media, exposure pathways, biomarkers, health effects, mechanisms of action, and intervention measures, as well as multiple standard relation types. It adopts a four-level entity linking strategy of exact matching, fuzzy matching, semantic matching, and context matching to organize the extracted knowledge into a multi-level knowledge graph. The knowledge reasoning engine module integrates three strategies: rule-based reasoning, case-based reasoning, and statistical reasoning. Through the strategy fusion unit, it dynamically selects the combination of reasoning strategies and corresponding weights based on query characteristics, and realizes complex reasoning and question answering based on the knowledge graph.
The knowledge application service module is used to provide environmental health risk assessment, causal mechanism explanation, intelligent literature review and evidence synthesis, intelligent environmental health question answering, and multi-criteria decision support services based on the knowledge graph and reasoning results. The knowledge base dynamic update and maintenance module ensures the timeliness and consistency of knowledge through knowledge change monitoring, logical consistency and statistical consistency verification, knowledge conflict detection and resolution, incremental update and version control mechanisms.

2. The environmental health knowledge graph system and method based on a large language model according to claim 1, characterized in that, The knowledge mining module driven by the large language model includes: a document intelligent acquisition and preprocessing submodule, a large language model submodule for the environmental health domain, a knowledge triple extraction submodule, and a knowledge quality assessment submodule. The environmental health domain large language model submodule is built on a Transformer-based large language model through domain-adaptive pre-training and knowledge extraction instruction fine-tuning. The domain-adaptive pre-training uses a professional corpus containing literature from environmental science, toxicology, epidemiology, and public health disciplines to continue pre-training the base model. Pre-training tasks include masked language modeling, document-level prediction, and knowledge relationship prediction. The knowledge extraction instruction fine-tuning employs a combination of supervised fine-tuning and reinforcement learning from human feedback. The supervised fine-tuning stage uses prompt templates with a chain-of-thought design to decompose the extraction steps, while the reinforcement learning from human feedback stage uses a proximal policy optimization algorithm. Fine-tuning uses low-rank adaptation techniques to update only the query matrix and value matrix of the attention layer.
The knowledge triplet extraction submodule uses a two-stage strategy based on a large language model to extract five types of core knowledge: the first stage is high-precision targeted extraction, which uses professional prompt templates to target knowledge of pollutant characteristics, exposure-effect relationships, biomarker relationships, mechanism-pathway relationships, and intervention effect relationships; the second stage is broad-coverage general extraction, which uses open prompts to capture knowledge not covered in the first stage. The knowledge conflict detection and processing function in the knowledge quality assessment submodule performs preliminary conflict screening on newly extracted knowledge triplets from the knowledge extraction stage, identifying four types of conflicts: direct contradiction conflict, i.e., a contradiction between affirmation and denial regarding the existence or directionality of the same relation; numerical inconsistency conflict, i.e., statistically significant differences in the quantitative parameters of the same relation; conditional conflict, i.e., different conclusions are drawn for the same relation under different constraints; and temporal evolution conflict, i.e., the conclusions of the same relation change due to different research times. The detected conflicts are marked and preliminarily classified, and the information of both parties in the conflict, along with the evidence level, is output to the knowledge graph construction module for processing in the subsequent fusion stage.

3. The environmental health knowledge graph system and method based on a large language model according to claim 1, characterized in that, The environmental health knowledge graph construction module includes: an ontology model design submodule, a graph construction and fusion submodule, a knowledge graph representation and storage submodule, and a knowledge graph visualization and exploration submodule; The ontology model design submodule uses a combination of top-down and bottom-up approaches to construct an ontology for the environmental health domain, including but not limited to the following core entity categories: environmental pollutants, environmental media, exposure pathways, biomarkers, health effects, mechanisms of action, and intervention measures. The ontology model also defines a variety of standard relation types, which are organized in a three-layer structure. The top layer is divided into four categories: causal relations, association relations, compositional relations, and functional relations. Each relation type defines domain and value range constraints. The graph construction and fusion submodule adopts a four-level entity linking strategy: the exact matching layer uses a standard entity dictionary for full matching, the fuzzy matching layer uses edit distance and character n-gram similarity for matching, the semantic matching layer uses the cosine similarity of entity embedding vectors for matching, and the context matching layer uses context embedding based on a pre-trained language model for deep matching. The four levels are executed sequentially. If the current layer successfully matches, the subsequent layers are skipped. For entities that fail to be linked, temporary entities are created and marked as pending verification. The graph construction and fusion submodule also includes a standard knowledge integration unit, which integrates national and industry standards for environmental health into the knowledge graph. 
It adopts the principle of prioritizing authority to handle the differences between standard knowledge and scientific literature knowledge, and uses a time version management strategy to retain historical versions of standard knowledge.

4. The environmental health knowledge graph system and method based on a large language model according to claim 1, characterized in that, The knowledge reasoning engine module includes: a multimodal knowledge representation submodule, a hybrid reasoning strategy submodule, a causal chain discovery and verification submodule, and an uncertainty reasoning and representation submodule; The multimodal knowledge representation submodule integrates three knowledge graph embedding algorithms: TransE, RotatE, and ComplEx. It maps entities and relations to a low-dimensional vector space and supports temporal knowledge representation using Allen's interval algebra and probabilistic knowledge representation based on probabilistic graphical models. The hybrid reasoning strategy submodule integrates rule-based reasoning, case-based reasoning, and statistical reasoning methods. Rule-based reasoning employs a combination of forward and backward chaining mechanisms, uses the Rete algorithm to optimize rule matching efficiency, and calculates the confidence of the reasoning result as the product of the minimum confidence of all premises and the rule confidence. Case-based reasoning uses a structured case library, employs a weighted feature matching algorithm to calculate the similarity between the new problem and historical cases, selects the top K cases with the highest similarity, and adjusts the solution using a case adaptation algorithm. Statistical reasoning uses a Bayesian network model, employs the EM algorithm to handle missing data during parameter learning, and selects a variable elimination algorithm, belief propagation algorithm, or do-calculus method to calculate the posterior probability distribution of the target variable based on the task type.
The strategy fusion unit dynamically selects the combination and weights of reasoning strategies based on query characteristics such as query type, degree of structuring, uncertainty requirements, and time constraints, and integrates multi-path reasoning results through weighted fusion. The causal chain discovery and verification submodule employs an improved bidirectional breadth-first search algorithm to search for causal paths connecting specified starting and ending entities in the knowledge graph. Filtering is performed through path length limits, relation type selection, and biological plausibility checks of intermediate node types. Path credibility is calculated using the geometric mean of the confidence scores of all edges in the path. Paths meeting the criteria are ranked based on path credibility, path directness, and path completeness. The causal chain discovery and verification submodule also includes a mechanism completion reasoning function, identifying knowledge gaps in causal paths and inferring missing intermediate mechanisms based on similar path analysis and domain rule reasoning. The confidence score of the completion result is calculated based on a weighted combination of similar path analysis scores and domain rule support scores. Furthermore, the causal chain discovery and verification submodule performs weighted quantitative evaluation of causal relationships based on nine dimensions of the Bradford Hill criteria: strength, consistency, specificity, temporality, dose-response relationship, plausibility, coherence, experimental evidence, and analogy. When multiple research results exist, a meta-analysis method using a random-effects model is applied, calculating research weights based on the inverses of the within-study and between-study variances of each study to obtain the weighted average effect size. 
The uncertainty reasoning and representation submodule classifies the sources of knowledge uncertainty into random uncertainty, cognitive uncertainty, linguistic uncertainty, and measurement uncertainty, and represents them using probability distributions, evidence intervals, fuzzy sets, and error propagation models, respectively. Based on Dempster-Shafer theory, it assigns basic belief values to knowledge, and uses Dempster's combination rule to normalize and combine belief values from multiple sources of evidence, constructing belief intervals to represent the uncertainty of the reasoning result. When the degree of conflict exceeds a preset threshold, the basic belief values from the less reliable evidence source are discounted. Uncertainty propagation employs Monte Carlo simulation to propagate random uncertainty and interval analysis to propagate cognitive uncertainty, generating reasoning results that include point estimates, interval estimates, and probability distributions, and automatically adjusting how definitively the conclusion is worded based on the degree of uncertainty.

5. The environmental health knowledge graph system and method based on a large language model according to claim 1, characterized in that, The knowledge application service module includes a knowledge-assisted risk assessment submodule, a knowledge-driven causal explanation submodule, an intelligent literature review and evidence synthesis submodule, an environmental health intelligent question-and-answer submodule, and a knowledge-driven decision support submodule. The knowledge-assisted risk assessment submodule provides a systematic risk assessment from exposure assessment to risk characterization based on exposure-effect relationships and dose-response data in the knowledge graph, and automatically generates risk management recommendations. The knowledge-driven causal explanation submodule performs causal chain retrieval and multi-level mechanism explanation for the correlation between environmental factors and health effects, realizing knowledge reasoning from phenomenon description to mechanism explanation. The intelligent literature review and evidence synthesis submodule decomposes the research question into four dimensions—population, exposure, comparison, and outcome—according to the PECO framework, maps them to the corresponding subgraph regions of the knowledge graph, performs knowledge aggregation and evidence level assessment, and automatically generates a structured evidence synthesis report. The environmental health intelligent question-and-answer submodule provides users with evidence-based professional question-and-answer services through natural language understanding and intent recognition, combined with knowledge graph retrieval and reasoning engines. The knowledge-driven decision support submodule automatically generates intervention plans and performs multi-criteria evaluation and scenario analysis based on knowledge graphs and reasoning results to assist in environmental health management decisions.

6. The environmental health knowledge graph system and method based on a large language model according to claim 1, characterized in that, The knowledge base dynamic update and maintenance module includes: a knowledge change monitoring and acquisition submodule, a knowledge consistency verification submodule, an expert collaborative knowledge verification submodule, and a knowledge graph incremental update submodule; The knowledge consistency verification submodule performs systematic consistency verification on newly added or modified knowledge items before they are added to the knowledge base, including logical consistency checks and statistical consistency checks. Logical consistency checks include type constraint checks, logical contradiction detection, and transitivity consistency checks. Statistical consistency checks include numerical range reasonableness checks and distribution consistency checks. At the same time, it performs global conflict detection between the new knowledge and the existing knowledge base, identifying four types of conflicts: direct contradiction conflicts, numerical inconsistency conflicts, conditional conflicts, and temporal evolution conflicts. For the detected conflicts, it executes conflict resolution strategies, including priority adjudication based on evidence level, knowledge replacement based on time decay, expert arbitration triggering, and conflict coexistence annotation, ensuring the overall consistency of the knowledge base before the new knowledge is added. The knowledge graph incremental update submodule defines six basic change operations: node creation, node update, node deletion, edge creation, edge update, and edge deletion. Each operation includes three stages: precondition check, execution logic, and post-verification.
A transaction mechanism is used to ensure the atomicity and consistency of the update, and semantic version numbers are used to manage the version history of the knowledge graph. Each update automatically creates a version snapshot and supports version tracking and rollback operations.

7. A method for environmental health knowledge graphs based on a large language model, characterized in that, Includes the following steps: Step S1: Using a domain-adaptive pre-trained and knowledge extraction instruction-fine-tuned large language model for the environmental health domain, a two-stage strategy combining high-precision targeted extraction and broad-coverage general extraction is adopted to automatically extract structured knowledge triples from environmental health literature. The extracted knowledge triples are then subjected to multi-dimensional quality assessment and screening, including five dimensions: extraction credibility, content consistency, source reliability, research quality, and evidence level. Step S2: Construct a domain ontology model that includes multiple core entity categories such as environmental pollutants, environmental media, exposure pathways, biomarkers, health effects, mechanisms of action, and intervention measures, as well as various standard relation types. Employ a four-level entity linking strategy of exact matching, fuzzy matching, semantic matching, and context matching to construct a multi-level knowledge graph of the environmental health domain from the extracted knowledge. Step S3: Map entities and relationships in the knowledge graph to a low-dimensional vector space using a knowledge graph embedding algorithm. Dynamically select a combination of rule-based reasoning, case-based reasoning, and statistical reasoning strategies based on query features for hybrid reasoning. Use an improved bidirectional breadth-first search algorithm to discover causal paths and perform causal strength quantification based on the Bradford Hill criterion. Handle uncertainties in the reasoning process through evidence reasoning based on Dempster-Shafer theory. Achieve hybrid reasoning and causal discovery based on the knowledge graph. 
Step S4: Execute knowledge application services based on knowledge graphs and reasoning results, specifically including: conducting environmental health risk assessment based on multi-pathway exposure models and dose-response relationships; retrieving and constructing causal evidence chains to generate multi-level causal mechanism explanations; conducting intelligent literature review and evidence synthesis according to the PECO framework; realizing intelligent environmental health question answering through intent recognition, knowledge retrieval, and reasoning loops; and generating decision support schemes using the analytic hierarchy process and multi-criteria decision analysis methods. Step S5: New knowledge is continuously acquired through automatic document scanning and knowledge change detection. After logical consistency verification and conflict detection, the new knowledge is integrated into the knowledge graph in a transactional manner through an incremental update algorithm. At the same time, version history is maintained through semantic version numbers to achieve continuous updating and consistency maintenance of the knowledge base.

8. The environmental health knowledge graph method based on a large language model according to claim 7, characterized in that Step S1 includes: intelligent literature acquisition and preprocessing, construction of a large language model for the environmental health domain, knowledge triple extraction, and knowledge quality assessment.
The step of constructing the large language model for the environmental health domain includes: performing domain-adaptive continued pre-training on a base Transformer large language model using a professional corpus containing literature from environmental science, toxicology, epidemiology, and public health, where the pre-training tasks include masked language modeling, document-level prediction, and knowledge relationship prediction; and then performing instruction fine-tuning for knowledge extraction based on manually annotated environmental health literature. In the supervised fine-tuning stage, prompt templates designed with chain-of-thought reasoning are used to decompose the extraction steps. In the reinforcement learning from human feedback stage, a proximal policy optimization algorithm is used. The fine-tuning adopts a low-rank adaptation technique that updates only the query and value matrices of the attention layers.
The knowledge triple extraction step uses a two-stage strategy to extract five types of core knowledge: pollutant characteristic knowledge, exposure-effect relationships, biomarker relationships, mechanism-pathway relationships, and intervention effect relationships. The first stage uses professional prompt templates designed with chain-of-thought reasoning for targeted extraction, and the second stage uses open prompts for general extraction.
The knowledge quality assessment step calculates a comprehensive quality score for each knowledge triple from five dimensions: extraction credibility, content consistency, source reliability, research quality, and evidence level. Knowledge triples with a comprehensive quality score below a preset quality threshold are filtered out. The evidence level assessment uses a modified GRADE method to classify evidence into six levels, from I to VI. Knowledge conflict detection identifies four types of conflicts: direct contradictions, numerical inconsistencies, conditional conflicts, and temporal evolution conflicts, and performs root cause analysis to distinguish genuine conflicts from statement-level conflicts.
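
The five-dimension quality score and threshold filter described above can be sketched as a weighted sum. This is an illustrative sketch only: the claim does not state how the dimensions are combined, so the weights in `WEIGHTS`, the value of `QUALITY_THRESHOLD`, the linear mapping of evidence levels, and the assumption that level I is the strongest evidence are all hypothetical.

```python
# Hypothetical weights for the five dimensions (assumption; they sum to 1).
WEIGHTS = {
    "extraction_credibility": 0.25,
    "content_consistency": 0.20,
    "source_reliability": 0.20,
    "research_quality": 0.20,
    "evidence_level": 0.15,
}
QUALITY_THRESHOLD = 0.6  # assumed preset threshold


def evidence_level_score(level: int) -> float:
    """Map evidence level I..VI (given as 1..6) linearly onto [0, 1].

    Assumes level I is the strongest evidence and level VI the weakest.
    """
    if not 1 <= level <= 6:
        raise ValueError("evidence level must be between 1 (I) and 6 (VI)")
    return (6 - level) / 5


def quality_score(scores: dict[str, float]) -> float:
    """Weighted sum of the five per-dimension scores, each in [0, 1]."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())


def filter_triples(candidates, threshold: float = QUALITY_THRESHOLD):
    """Keep only the triples whose comprehensive score reaches the threshold."""
    return [t for t, scores in candidates if quality_score(scores) >= threshold]


candidates = [
    (("lead", "associated_with", "cognitive impairment"), {
        "extraction_credibility": 0.9, "content_consistency": 0.8,
        "source_reliability": 0.9, "research_quality": 0.8,
        "evidence_level": evidence_level_score(2),
    }),
    (("substance X", "associated_with", "effect Y"), {
        "extraction_credibility": 0.4, "content_consistency": 0.3,
        "source_reliability": 0.5, "research_quality": 0.3,
        "evidence_level": evidence_level_score(6),
    }),
]
kept = filter_triples(candidates)  # only the first triple passes
```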

9. The environmental health knowledge graph method based on a large language model according to claim 7, characterized in that Step S2 includes: ontology model design, graph construction and fusion, knowledge graph representation and storage, and knowledge graph visualization and exploration.
The ontology model design step combines top-down and bottom-up approaches to construct an environmental health domain ontology that includes the following core entity categories: environmental pollutants, environmental media, exposure pathways, biomarkers, health effects, mechanisms of action, and intervention measures. The ontology model also defines various standard relation types, each with domain and range constraints.
In the graph construction and fusion step, entity linking is performed sequentially using a four-level strategy: exact matching performs full-string matching against a standard entity dictionary; fuzzy matching uses edit distance and character n-gram similarity; semantic matching uses the cosine similarity of entity embedding vectors; and contextual matching uses the contextual embeddings of a pre-trained language model for deep matching. If a level matches successfully, the subsequent levels are skipped. For entities that fail to link at every level, temporary entities are created and marked as pending verification.
The graph construction and fusion step also includes standard knowledge integration, which maps knowledge from national environmental health standards to the ontology model and integrates it into the knowledge graph. Differences between standard knowledge and scientific literature knowledge are handled on the principle that the more authoritative source takes priority, and a temporal versioning strategy is used to retain historical versions of standard knowledge.
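
The four-level linking cascade described above can be sketched as an ordered list of scoring functions that stops at the first level whose best score reaches its threshold. This is a sketch under stated assumptions: the toy dictionary `NAMES`, the toy vectors, all thresholds, and the averaging of edit similarity with n-gram similarity are hypothetical, and the contextual level is a stub, because a real system would embed the mention in its sentence with a pre-trained language model.

```python
import math

NAMES = {"E1": "benzene", "E2": "lead", "E3": "fine particulate matter"}
# Toy vectors standing in for learned embeddings (assumption).
ENTITY_VECS = {"E1": [1.0, 0.0, 0.0], "E2": [0.0, 1.0, 0.0], "E3": [0.0, 0.0, 1.0]}
MENTION_VECS = {"pm2.5": [0.1, 0.0, 0.9]}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the character n-gram sets of a and b."""
    def grams(s):
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def exact_score(mention, context, eid):
    return float(mention.lower() == NAMES[eid])


def fuzzy_score(mention, context, eid):
    a, b = mention.lower(), NAMES[eid]
    edit_sim = 1 - edit_distance(a, b) / max(len(a), len(b), 1)
    return (edit_sim + ngram_similarity(a, b)) / 2


def semantic_score(mention, context, eid):
    vec = MENTION_VECS.get(mention.lower())
    return cosine(vec, ENTITY_VECS[eid]) if vec else 0.0


def context_score(mention, context, eid):
    return 0.0  # stub: would compare contextual embeddings of mention and entity


LEVELS = [("exact", exact_score, 1.0), ("fuzzy", fuzzy_score, 0.75),
          ("semantic", semantic_score, 0.85), ("context", context_score, 0.85)]


def link_entity(mention: str, context: str = ""):
    """Try each level in order; later levels are skipped after a match."""
    for label, score_fn, threshold in LEVELS:
        best_id = max(NAMES, key=lambda eid: score_fn(mention, context, eid))
        if score_fn(mention, context, best_id) >= threshold:
            return best_id, label
    # No level matched: create a temporary entity pending verification.
    return f"TEMP:{mention}", "pending_verification"
```

With these toy inputs, `link_entity("Benzene")` resolves at the exact level, `link_entity("benzen")` at the fuzzy level, and `link_entity("PM2.5")` at the semantic level, while an unknown mention falls through to a temporary entity.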

10. The environmental health knowledge graph method based on a large language model according to claim 7, characterized in that Step S3 includes: multimodal knowledge representation, hybrid reasoning strategy, causal chain discovery and verification, and uncertainty reasoning and representation.
The hybrid reasoning strategy step includes: analyzing user queries to extract query features, dynamically selecting a combination of reasoning strategies and their weights based on the query features, executing the selected strategies in parallel, and integrating the results of the multiple reasoning paths through weighted fusion. Rule-based reasoning adopts a mechanism that combines forward chaining and backward chaining and uses the Rete algorithm to optimize rule matching efficiency. Case-based reasoning uses a weighted feature matching algorithm to calculate the similarity between a new question and historical cases, and adjusts the solution through a case adaptation algorithm. Statistical reasoning calculates the posterior probability distribution based on a Bayesian network.
The causal chain discovery and verification step includes: searching for causal paths using an improved bidirectional breadth-first search algorithm, and filtering them by path length constraints, relation type selection, and biological plausibility checks on intermediate nodes; calculating path confidence as the geometric mean of the confidence scores of all edges in the path; performing mechanism completion reasoning on knowledge gaps in the discovered causal paths, with completion confidence calculated as a weighted combination of a similar-path analysis score and a domain rule support score; and conducting a weighted quantitative evaluation along the nine dimensions of the Bradford Hill criteria: strength, consistency, specificity, temporality, dose-response relationship, plausibility, coherence, experimental evidence, and analogy. When multiple study results exist, a meta-analysis using a random-effects model is applied, with each study weighted by the inverse of the sum of its within-study variance and the between-study variance.
The uncertainty reasoning and representation step uses Dempster-Shafer theory to assign basic belief assignments to knowledge, and uses Dempster's rule of combination to combine and normalize the belief assignments of multi-source evidence, constructing belief intervals that represent the uncertainty of the reasoning results. Uncertainty propagation uses Monte Carlo simulation to propagate aleatory uncertainty and interval analysis to propagate epistemic uncertainty, generating reasoning results that include point estimates, interval estimates, and probability distributions.
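
Two of the calculations named in this claim, the geometric-mean path confidence and Dempster's rule of combination with the resulting belief interval, can be sketched as follows. This is a minimal sketch rather than the claimed implementation: the two-hypothesis frame (`causal`, `non_causal`) and all mass values are hypothetical illustration data.

```python
import math
from itertools import product

Mass = dict  # maps frozenset of hypotheses -> basic belief mass


def path_confidence(edge_confidences: list[float]) -> float:
    """Geometric mean of the confidence scores of all edges in a path."""
    return math.prod(edge_confidences) ** (1 / len(edge_confidences))


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses, intersect focal sets, renormalize."""
    combined: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {focal: w / (1 - conflict) for focal, w in combined.items()}


def belief(m: Mass, hypothesis: frozenset) -> float:
    """Lower bound: total mass of focal sets contained in the hypothesis."""
    return sum(w for focal, w in m.items() if focal <= hypothesis)


def plausibility(m: Mass, hypothesis: frozenset) -> float:
    """Upper bound: total mass of focal sets intersecting the hypothesis."""
    return sum(w for focal, w in m.items() if focal & hypothesis)


CAUSAL = frozenset({"causal"})
NON_CAUSAL = frozenset({"non_causal"})
THETA = CAUSAL | NON_CAUSAL  # full frame: mass here expresses ignorance

# Hypothetical evidence from two independent sources.
source_a = {CAUSAL: 0.6, THETA: 0.4}
source_b = {CAUSAL: 0.7, NON_CAUSAL: 0.1, THETA: 0.2}

fused = combine(source_a, source_b)
interval = (belief(fused, CAUSAL), plausibility(fused, CAUSAL))
```

The pair `interval` is the belief interval for the causal hypothesis: its width is the mass that remains on the full frame after fusion, that is, the part of the evidence that does not discriminate between the hypotheses.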