A knowledge graph construction method for cyber-physical system security

By combining top-down and bottom-up ontology modeling and multi-task learning methods, a lightweight ontology model is constructed and multi-source heterogeneous data is integrated, which solves the security protection problem of complex multi-source heterogeneous data in cyber-physical systems and realizes efficient and intelligent security decision support.

CN120851165B · Active · Publication Date: 2026-06-19 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG UNIV
Filing Date
2025-07-14
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Traditional security protection methods are insufficient to cope with the complex multi-source heterogeneous data and dynamically changing attack patterns in cyber-physical systems, and there is an urgent need for efficient and intelligent security protection methods.

Method used

We employ a top-down and bottom-up approach to ontology modeling to construct a lightweight ontology model. We preprocess multi-source heterogeneous data, construct a knowledge graph through rule extraction and multi-task learning, and introduce a relay node reasoning mechanism and a multimodal embedding fusion model to achieve comprehensive extraction and fusion of structured, semi-structured and unstructured data.

Benefits of technology

It effectively addresses the challenges of diverse entity types, complex semantics, and multi-hop relationships in cyber-physical systems, improving graph consistency and accuracy, and enhancing the model's rapid adaptability and generalization performance under conditions of scarce samples.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for constructing a knowledge graph for cyber-physical system security. The method includes: constructing a cyber-physical security ontology model; collecting and preprocessing multi-source heterogeneous data; extracting entities and relations from structured and unstructured data using rule-based and multi-task learning methods, respectively; proposing a relay node reasoning mechanism and a path judgment task to complete complex multi-hop semantic relations; achieving synonymous entity alignment and graph fusion through intra-text and global entity disambiguation and fusion models; and improving the completeness and accuracy of the knowledge graph with a graph completion and pruning model that combines semantic and structural embeddings. This invention introduces a multi-task learning mechanism to promote collaborative optimization among the entity recognition, relation extraction, and path judgment tasks, and combines it with the model-agnostic meta-learning (MAML) algorithm to enhance the model's rapid adaptation and generalization capabilities in small-sample scenarios. The constructed cyber-physical security knowledge graph can be widely applied to security analysis and decision support for cyber-physical systems.

Description

Technical Field

[0001] This invention belongs to the field of cyber-physical system security and relates to a method for constructing a knowledge graph for cyber-physical system security.

Background Technology

[0002] Cyber-Physical Systems (CPS) are complex systems that deeply integrate computing, communication, and control technologies. By integrating the physical and information worlds, they enable real-time perception, dynamic control, and intelligent decision-making of physical processes. CPS are widely used in key areas such as smart grids, smart manufacturing, intelligent transportation, and smart cities, becoming an important component of modern infrastructure. However, as the openness and interconnectivity of CPS continue to increase, the security threats they face are also becoming increasingly severe. Traditional security protection methods are insufficient to cope with the complex, multi-source, heterogeneous data and dynamically changing attack patterns in CPS, necessitating an efficient and intelligent security protection method.

[0003] Knowledge graphs, as a powerful knowledge representation and reasoning tool, demonstrate significant advantages in CPS security management. By constructing a CPS security knowledge graph, multi-source heterogeneous data can be effectively integrated, enabling unified representation and reasoning of knowledge such as security events, vulnerability information, and attack patterns. Knowledge graphs not only support real-time anomaly detection and attack attribution but also provide intelligent security decision support through semantic reasoning. Furthermore, the scalability and flexibility of knowledge graphs allow them to adapt to dynamically changing security needs in CPS, providing an efficient and reliable solution for CPS security management.

Summary of the Invention

[0004] The purpose of this invention is to address the current deficiencies and shortcomings in cyber-physical system security by providing a method for constructing a knowledge graph for cyber-physical system security.

[0005] The objective of this invention is achieved through the following technical solution: a method for constructing a knowledge graph for cyber-physical system security, comprising the following steps:

[0006] S1 constructs a cyber-physical security ontology model from top to bottom. Based on the knowledge graph open standard, combined with existing cyber-physical system security-related standards and ontology models, and integrating domain expert knowledge, the ontology model is defined. A lightweight ontology construction method is adopted, which only models the direct relationships between entity categories when constructing the ontology model. Indirect relationships are expressed by combining multiple direct relationships, and are not explicitly defined in the ontology model.

[0007] S2 collects multi-source heterogeneous data in the cyber-physical system security field, performs data preprocessing operations, and classifies the multi-source heterogeneous data into structured data, semi-structured data, and unstructured text data according to data characteristics; based on the actual collected data, it improves the ontology model constructed in S1 from the bottom up.

[0008] S3, based on the preprocessed structured and semi-structured data from S2, establishes mapping rules between the data and entity types, relationships, and attributes in the ontology model; and uses a rule-based approach to extract entities and relationships from the structured and semi-structured data, thereby constructing a rule-based knowledge graph.

[0009] S4, for the unstructured text data preprocessed in S2, adopts an unstructured text information extraction method based on multi-task learning, constructs three sub-tasks: entity recognition, relation extraction and path judgment, extracts entity and relation information, and obtains knowledge triples;

[0010] S5, based on the knowledge triples obtained in S4, constructs an intra-article entity disambiguation and graph fusion model, merging synonymous entities from the same article into unified nodes to generate an article-level knowledge graph; based on the article-level knowledge graph and the rule-based knowledge graph constructed in S3, constructs a global entity disambiguation and graph fusion model to establish a unified knowledge graph;

[0011] S6, based on the unified knowledge graph established by S5, constructs graph completion and pruning tasks. It adopts a semantic and structural bimodal embedding fusion model to complete or prune all entity pairs that need to be completed or pruned, respectively, so as to complete missing relations and delete redundant relations.

[0012] S7 stores the knowledge graph obtained after the graph completion and pruning of S6 in a relational database, organizing entity and relation information in a standardized form; based on the cyber-physical security ontology model constructed in S1 and improved in S2, it builds a D2RQ mapping file, converts the data in the relational database to RDF, and obtains a cyber-physical security knowledge graph in standard RDF representation.

[0013] Furthermore, in S1, the constructed ontology covers core concepts such as cyber-physical devices, vulnerability information, attack patterns, attackers, protection strategies, and security events; in S2, the multi-source heterogeneous data includes: threat intelligence and security event data, on-site operational data and sensor information, network security logs and traffic data, vulnerability database information, system configuration and operational documents, and the data preprocessing operations include: noise data removal, format conversion, and redundancy deduplication.

[0014] Furthermore, in S3, in semi-structured data processing, regular expression matching and JSON file format parsing are used to construct entity and relation extraction rules, thereby realizing the construction of knowledge triples; in structured data processing, based on the correspondence between the original fields and entities / attributes in the ontology model, a field mapping model is constructed to directly realize the structured data migration from the source database to knowledge graph triples.

[0015] Furthermore, in S4, the entity recognition task takes the original text sequence as input, and after passing through the RoBERTa layer, BiLSTM layer, Linear layer, and CRF layer implemented with dynamic mask, the entity category sequence BIO label is obtained; the relation extraction task takes the text sequence with entity information as input, and after passing through the RoBERTa layer, Linear layer, and Softmax layer implemented with dynamic mask, the relationship categories between entities and the candidate path sequence are obtained; the path judgment task takes the combination of the original text sequence and candidate paths as input, and after passing through the RoBERTa layer, Linear layer, and Softmax layer, the reasonableness of the candidate path is determined.

[0016] Furthermore, in S4, in actual semantic text, entities may not have direct relationships predefined in the ontology model constructed based on the lightweight ontology construction method, but are associated through indirect relationships. A relay node reasoning mechanism is adopted. First, the relation extraction task is extended by constructing potential two-hop relationships for all entity category pairs predefined in the ontology model that may have two-hop associations. When the relation extraction model identifies a potential two-hop relationship between two entities, the relay node reasoning mechanism is triggered. First, it determines whether there is a node in the text that can establish a direct relationship with both entities; this node is the relay node. If it exists, the semantic information structure extracted from the text is considered complete. If it does not exist, all entity types that can serve as relay node types are searched based on the ontology model, and all theoretically possible connection paths are constructed as candidate path sequences.

[0017] Furthermore, in S4, the path judgment task evaluates the rationality of the candidate paths obtained through the relay node reasoning mechanism. If the candidate path is judged to be reasonable, a placeholder entity representing the relay node is constructed. This entity acts as a temporary relay node in the path construction stage to maintain the coherence of the knowledge path structure, and is aligned with similar entities in other graphs in the subsequent entity disambiguation and graph fusion stages.

[0018] Furthermore, in S4, a multi-task learning framework is constructed. Based on the shared semantic representation of the RoBERTa encoder, adapters for entity recognition, relation extraction, and path judgment tasks are designed respectively. The label prediction in entity recognition and relation extraction tasks is constrained by a dynamic masking mechanism, and the model is trained by jointly optimizing the loss function.

[0019] Furthermore, in S4, the model-agnostic meta-learning (MAML) method is used for inner-loop and outer-loop training of the entity recognition and relation extraction tasks, while standard supervised training is used for the path judgment task.

[0020] Furthermore, in S5, the intra-article entity disambiguation and graph fusion model and the global entity disambiguation and graph fusion model use the RoBERTa pre-trained language model and multilayer perceptron to obtain the entity pair alignment probability using entity and its context information, and merge synonymous entities into unified nodes based on the hierarchical clustering method of referential chain and cluster average referential probability.
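The merging step can be illustrated with a minimal sketch of average-linkage clustering over pairwise alignment probabilities. It covers only the cluster-average criterion, not the chain-based part of the method, and the function name, the `threshold` value, and the toy probabilities are assumptions made for illustration; in the described method the probabilities would come from the RoBERTa and multilayer perceptron model.

```python
def average_link_clusters(entities, prob, threshold=0.5):
    """Greedy agglomerative clustering on pairwise alignment probabilities.

    prob maps frozenset({a, b}) to the probability that a and b co-refer;
    missing pairs count as 0.0. The pair of clusters with the highest
    average cross-pair probability is merged until that average drops
    below `threshold` (an assumed cut-off).
    """
    clusters = [[e] for e in entities]

    def avg(c1, c2):
        pairs = [prob.get(frozenset((a, b)), 0.0) for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        score, i, j = max(
            ((avg(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy mentions and made-up alignment probabilities.
entities = ["PLC-01", "the controller", "controller PLC-01", "Valve-A"]
prob = {
    frozenset(("PLC-01", "the controller")): 0.8,
    frozenset(("PLC-01", "controller PLC-01")): 0.95,
    frozenset(("the controller", "controller PLC-01")): 0.7,
}
merged = average_link_clusters(entities, prob)
print(merged)
```

Each resulting cluster would correspond to one unified node in the fused graph; "Valve-A" stays separate because its average probability against the other cluster is below the threshold.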

[0021] Furthermore, in S6, a semantic and structural bimodal embedding fusion model is combined. The entity structure representation is obtained through the ComplEx embedding model, and the entity semantic representation is obtained through the RoBERTa pre-trained language model using the entity, its adjacent entities, and their correspondence. The two representations are combined and a multilayer perceptron is used to determine whether entity pairs need to be completed or pruned.
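As an illustration of the structural side, the sketch below computes the standard ComplEx triple score, the real part of the trilinear product of the subject embedding, the relation embedding, and the conjugated object embedding. The patent does not state how the embeddings are trained or how the score is consumed by the multilayer perceptron, so the embedding values and names here are made up.

```python
def complex_score(subject, relation, obj):
    """ComplEx score: Re(sum_k e_s[k] * w_r[k] * conj(e_o[k]))."""
    total = sum(s * r * o.conjugate() for s, r, o in zip(subject, relation, obj))
    return total.real


# Toy 2-dimensional complex embeddings; the values are illustrative only.
e_plc = [1 + 1j, 0.5 - 0.5j]
e_valve = [1 - 1j, 0.5 + 0.5j]
w_controls = [0 - 1j, 1 + 0j]

forward = complex_score(e_plc, w_controls, e_valve)   # (PLC, controls, Valve)
backward = complex_score(e_valve, w_controls, e_plc)  # (Valve, controls, PLC)
print(forward, backward)
```

Because the object embedding is conjugated, the score is not symmetric in subject and object, which is why ComplEx can represent directed relations such as "controls".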

[0022] The beneficial effects of this invention are as follows: By combining ontology modeling, rule extraction, and multi-task learning, it achieves comprehensive extraction and fusion of security knowledge from structured, semi-structured, and unstructured data, effectively addressing the problems of diverse entity types, complex semantics, and multi-hop relationships in cyber-physical systems. The introduction of the relay node reasoning mechanism and the path judgment task enhances the modeling capability for complex semantic relationships; the use of intra-text and global entity disambiguation and fusion methods improves graph consistency and accuracy. The multi-task learning framework enhances information sharing and collaboration among the entity recognition, relation extraction, and path judgment tasks, and the combination with the model-agnostic meta-learning (MAML) algorithm enhances the model's rapid adaptability and generalization performance under conditions of scarce samples.

Attached Figure Description

[0023] Figure 1 is a schematic diagram of the architecture of the knowledge graph construction method for cyber-physical system security provided in an embodiment of the present invention;

[0024] Figure 2 is a schematic diagram of an unstructured text information extraction method based on multi-task learning provided in an embodiment of the present invention;

[0025] Figure 3 is a schematic diagram of a two-stage entity disambiguation and graph fusion task provided in an embodiment of the present invention;

[0026] Figure 4 is a structural diagram of a knowledge graph construction device for cyber-physical system security provided in an embodiment of the present invention.

Detailed Implementation

[0027] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0028] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0029] This embodiment provides a method for constructing a knowledge graph for cyber-physical system security, which includes the following steps:

[0030] (1) Combining top-down and bottom-up ontology modeling strategies, we construct an ontology model for cyber-physical security to address the complexity, heterogeneity, and dynamic characteristics of cyber-physical system security. We also propose a lightweight ontology construction method to reduce the complexity of the ontology model.

[0031] The lightweight ontology construction method addresses the challenges of diverse entities and complex relationships in the cyber-physical system security domain. It proposes modeling only the direct relationships between entity categories during ontology construction. Indirect relationships can be expressed through combinations of multiple direct relationships without being explicitly defined in the ontology model, thus effectively reducing its complexity. For example, in industrial control systems, for the three entity types of HMI (Human-Machine Interface), PLC (Programmable Logic Controller), and actuators, only direct relationships such as "HMI connects to PLC" and "PLC directly controls actuator" are modeled, while indirect relationships like "HMI affects actuator" are not explicitly defined in the model.
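The idea of storing only direct relationships and deriving indirect ones by composition can be sketched as follows. This is a minimal illustration, not the patent's ontology: the type and relation names are hypothetical stand-ins for the HMI, PLC, and actuator example.

```python
from collections import defaultdict

# Direct relations only, as (head_type, relation, tail_type).
# Names are illustrative and not taken from the patent's ontology files.
DIRECT_RELATIONS = [
    ("HMI", "connects_to", "PLC"),
    ("PLC", "controls", "Actuator"),
]


def build_index(relations):
    """Index direct relations by head type for fast lookup."""
    index = defaultdict(list)
    for head, rel, tail in relations:
        index[head].append((rel, tail))
    return index


def two_hop_paths(index, head_type, tail_type):
    """Derive indirect (two-hop) relations by composing two direct ones."""
    paths = []
    for rel1, mid in index.get(head_type, []):
        for rel2, end in index.get(mid, []):
            if end == tail_type:
                paths.append((head_type, rel1, mid, rel2, tail_type))
    return paths


index = build_index(DIRECT_RELATIONS)
print(two_hop_paths(index, "HMI", "Actuator"))
```

The indirect "HMI affects actuator" relationship never appears in `DIRECT_RELATIONS`; it is recovered on demand as the composition of the two stored relations, which is what keeps the ontology small.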

[0032] Specifically, in the top-down ontology modeling phase, based on open knowledge graph standards such as RDF and OWL, combined with cyber-physical system security-related standards such as STIX 2.1, the ATT&CK framework, and IEC 62443, and referencing existing ontology models such as UCO (Unified Cyber Ontology) and MALOnt, and integrating domain expert knowledge, a lightweight ontology construction method is used to define the core concepts, relationships, attributes, and hierarchical structure of the ontology. The ontology modeling tool Protégé is used to construct a standard ontology model. The constructed ontology covers core concepts such as cyber-physical devices, vulnerability information, attack patterns, attackers, protection strategies, and security events, aligning with mainstream standards and existing ontology models to ensure the model's universality and scalability.

[0033] (2) Collect multi-source heterogeneous data to support the construction of high-quality knowledge graphs in the field of cyber-physical system security, and preprocess the data to divide it into structured data, semi-structured data, and unstructured text data.

[0034] The collected data includes: threat intelligence and security incident data (released by organizations such as CNTD, ThreatBook, and FireEye), field operation data and sensor information (data collected by various types of sensors such as temperature, pressure, and voltage), network security logs and traffic data (firewall logs and intrusion detection system logs in SCADA systems and industrial control networks, network traffic data corresponding to industrial communication protocols such as Modbus and DNP3), vulnerability database information (public vulnerability sources such as CVE, CNVD, and NVD), and system configuration and operation documents (PLC / RTU configuration files, control policy descriptions, etc.).

[0035] For the aforementioned multi-source heterogeneous data, data preprocessing operations are performed, including noise data removal, format conversion, and redundancy deduplication, to ensure data accuracy and consistency. Based on data characteristics, the data is divided into three categories: structured data (such as field operation data and sensor information in the database), semi-structured data (such as logs, configuration files, and XML / JSON format vulnerability descriptions), and unstructured text data (such as threat intelligence text and operational documents).

[0036] At the same time, based on the collected multi-source heterogeneous data, the missing concepts in the ontology model constructed in step (1) are supplemented, and the ontology model is improved from the bottom up.

[0037] (3) The rule-based method is used to extract entities and relationships from structured and semi-structured data with fixed format, clear fields and clear semantic boundaries. It has the advantages of high extraction accuracy, fast execution efficiency and low deployment cost.

[0038] Specifically, mapping rules are established between the structured and semi-structured data processed in step (2) and the entity types, relations, and attributes in the ontology. For semi-structured data, regular expression matching and JSON file parsing are used to construct entity and relation extraction rules, from which knowledge triples are built. For structured data, a field mapping model is constructed based on the correspondence between the original fields and the entities / attributes in the ontology model, directly migrating the structured data from the source database to knowledge graph triples. Combining the triples obtained from semi-structured and structured data processing yields the rule-based knowledge graph $G_{rules}$.
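A minimal sketch of such rules is shown below, assuming a JSON vulnerability record and a firewall log line. The field names, the log layout, the relation names, and the identifiers are all hypothetical; real sources would need their own mapping rules.

```python
import json
import re

# Hypothetical field-to-ontology mapping for a JSON vulnerability record:
# field name -> (relation, tail entity type).
FIELD_RULES = {
    "affected_product": ("affects", "Device"),
    "severity": ("has_severity", "Literal"),
}

# Hypothetical firewall log layout: "<time> DENY src=<ip> dst=<ip>".
LOG_PATTERN = re.compile(r"DENY src=(?P<src>\S+) dst=(?P<dst>\S+)")


def triples_from_json(text):
    """Map JSON fields to (head, relation, tail) triples via FIELD_RULES."""
    record = json.loads(text)
    head = record["id"]
    return [
        (head, relation, record[field])
        for field, (relation, _tail_type) in FIELD_RULES.items()
        if field in record
    ]


def triples_from_log(line):
    """Extract a triple from one log line; return [] when no rule matches."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return []
    return [(match.group("src"), "blocked_access_to", match.group("dst"))]


record = '{"id": "VULN-0001", "affected_product": "PLC-X", "severity": "high"}'
log_line = "2025-01-01T00:00:00 DENY src=10.0.0.5 dst=10.0.0.9"
g_rules = triples_from_json(record) + triples_from_log(log_line)
print(g_rules)
```

Keeping the rules in data structures (`FIELD_RULES`, `LOG_PATTERN`) rather than in code paths makes it straightforward to add a new source by adding a new mapping.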

[0039] (4) A method for extracting unstructured text information based on multi-task learning is proposed. Three sub-tasks are constructed: entity recognition, relation extraction and path judgment. The three tasks are optimized and mutually enhanced to gain a deeper understanding of entity boundaries, relationships between entities and semantics in the context of the text, and to extract entity and relation information from unstructured text data.

[0040] Among them, the entity recognition task is responsible for identifying the location and category of entities in the text, and the relation extraction task identifies the relationship categories between multiple entities and constructs "entity-relationship-entity" triples.

[0041] However, in real-world semantic text, entities may not have a direct relationship predefined in the ontology model built with the lightweight ontology construction method, and may instead be associated through indirect relationships. To address incomplete entity relationship coverage and broken path chains in the domain ontology, a relay node reasoning mechanism is further proposed. First, the relation extraction task is extended by constructing "potential two-hop relationships" for all entity category pairs (e.g., HMI and actuator) predefined in the ontology model that may have two-hop associations. When processing text, the relation extraction model identifies not only predefined direct relationships but also "potential two-hop relationships" between entities. When the relation extraction model identifies such a "potential two-hop relationship" between two entities, it triggers the relay node reasoning mechanism. This mechanism first determines whether the text contains a node that can establish a direct relationship with both entities; such a node is the relay node. If it exists, the extracted semantic information structure is considered complete; if not, candidate path sequences are constructed: based on the ontology model, all entity types that can serve as relay node types are searched, and all theoretically possible connection paths are constructed accordingly. For example, if the text only mentions that "HMI-01" and "Valve-A" are related, but the intermediate PLC is missing, the system finds by querying the ontology model that "PLC" is a relay node type that can connect the two entity types, HMI and actuator, and thus automatically constructs a candidate path: HMI-01 → Connection → [PLC] → Control → Valve-A.
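The control flow of the relay node reasoning mechanism can be sketched as below. This is an illustrative reading of the paragraph, not the patented implementation: the ontology is reduced to a list of direct relations, entities are `(name, type)` pairs, and the function names are hypothetical.

```python
# Direct relations allowed by the ontology: (head_type, relation, tail_type).
ONTOLOGY = [
    ("HMI", "Connection", "PLC"),
    ("PLC", "Control", "Actuator"),
]


def direct_relation(head_type, tail_type):
    """Return the direct relation between two types, or None."""
    for h, rel, t in ONTOLOGY:
        if h == head_type and t == tail_type:
            return rel
    return None


def bridge(head, tail, relay_name, relay_type):
    """Return a two-hop path through the relay, or None if it cannot bridge."""
    r1 = direct_relation(head[1], relay_type)
    r2 = direct_relation(relay_type, tail[1])
    if r1 and r2:
        return (head[0], r1, relay_name, r2, tail[0])
    return None


def relay_reasoning(head, tail, text_entities):
    """Run when a potential two-hop relation links head and tail.

    Returns ("complete", paths) if an entity in the text can act as relay,
    otherwise ("candidates", paths) built from ontology types alone.
    """
    found = [p for name, etype in text_entities
             if (p := bridge(head, tail, name, etype))]
    if found:
        return "complete", found
    all_types = sorted({x for h, _, t in ONTOLOGY for x in (h, t)})
    candidates = [p for etype in all_types
                  if (p := bridge(head, tail, f"[{etype}]", etype))]
    return "candidates", candidates


head, tail = ("HMI-01", "HMI"), ("Valve-A", "Actuator")
print(relay_reasoning(head, tail, []))
print(relay_reasoning(head, tail, [("PLC-02", "PLC")]))
```

In the first call no relay appears in the text, so a bracketed type-level candidate is produced for the path judgment task; in the second call the relay is present and the structure is treated as complete.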

[0042] The path judgment task takes these candidate paths as input and evaluates whether each one is reasonable. If a candidate path is judged reasonable, a placeholder entity UNK_ENTITYTYPE_NUM representing the relay node is constructed, where ENTITYTYPE indicates the entity category and NUM is the number of the UNK entity within that category. This entity acts as a temporary relay node during the path construction phase, maintaining the coherence of the knowledge path structure, and is aligned with similar entities in other graphs during the subsequent entity disambiguation and graph fusion phases.
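A small sketch of how the UNK_ENTITYTYPE_NUM placeholders might be issued is given below. The class and function names are hypothetical, and starting the per-category counter at 1 and upper-casing the type name are assumptions; the text only fixes the naming pattern.

```python
from collections import defaultdict


class PlaceholderFactory:
    """Issues UNK_<ENTITYTYPE>_<NUM> names, numbered per entity category."""

    def __init__(self):
        self._counters = defaultdict(int)

    def new(self, entity_type):
        self._counters[entity_type] += 1  # assumed to start counting at 1
        return f"UNK_{entity_type.upper()}_{self._counters[entity_type]}"


def materialize(path, factory):
    """Turn an accepted candidate path into two triples.

    A bracketed relay type such as "[PLC]" is replaced by a fresh
    placeholder entity; a concrete relay name is kept unchanged.
    """
    head, rel1, relay, rel2, tail = path
    if relay.startswith("[") and relay.endswith("]"):
        relay = factory.new(relay[1:-1])
    return [(head, rel1, relay), (relay, rel2, tail)]


factory = PlaceholderFactory()
triples = materialize(("HMI-01", "Connection", "[PLC]", "Control", "Valve-A"), factory)
print(triples)
```

Because the placeholder is an ordinary node name, the later disambiguation stage can merge it with a concrete PLC entity found in another document without special handling.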

[0043] A multi-task learning framework is constructed to train the three tasks above collaboratively. On top of the shared semantic representation of the RoBERTa encoder, separate adapters are designed for entity recognition, relation extraction, and path judgment, and the corresponding loss functions are jointly optimized. The tasks share underlying language knowledge while retaining task-specific modeling capabilities, which improves the overall extraction capability and avoids interference between tasks.

[0044] Meanwhile, considering the scarcity of labeled data in the cyber-physical system security field, a model-agnostic meta-learning (MAML) method is adopted. By constructing multi-task, multi-class training episodes, the model's ability to quickly adapt to small sample scenarios and its ability to transfer across tasks are enhanced.
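The inner and outer loops of MAML can be illustrated on a deliberately tiny problem. The sketch below is not the patent's training setup: it meta-learns a single scalar weight for toy linear-regression tasks, where the meta-gradient through one inner step can be written in closed form. All names, learning rates, and data are assumptions.

```python
def loss_grad_hess(w, xs, ys):
    """Gradient and second derivative of mean((w*x - y)^2) w.r.t. scalar w."""
    n = len(xs)
    grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
    hess = sum(2 * x * x for x in xs) / n
    return grad, hess


def maml_step(w, tasks, inner_lr, outer_lr):
    """One outer-loop update over a batch of (support, query) tasks."""
    meta_grad = 0.0
    for (xs_s, ys_s), (xs_q, ys_q) in tasks:
        g_s, h_s = loss_grad_hess(w, xs_s, ys_s)
        w_adapted = w - inner_lr * g_s              # inner loop: adapt on support
        g_q, _ = loss_grad_hess(w_adapted, xs_q, ys_q)
        # Chain rule through the inner step: d(w_adapted)/dw = 1 - inner_lr * h_s.
        meta_grad += g_q * (1 - inner_lr * h_s)
    return w - outer_lr * meta_grad / len(tasks)    # outer loop: update init


def make_task(slope):
    """Toy task y = slope * x; support and query share data for simplicity."""
    xs = [1.0, 2.0]
    ys = [slope * x for x in xs]
    return (xs, ys), (xs, ys)


tasks = [make_task(1.0), make_task(3.0)]
w = 0.0
for _ in range(200):
    w = maml_step(w, tasks, inner_lr=0.1, outer_lr=0.1)
print(round(w, 6))
```

The learned initialization settles between the two task slopes, so a single inner step moves it close to either task; this is the "fast adaptation from few samples" property the paragraph relies on.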

[0045] Specifically, the entity recognition model consists of a pre-trained language model RoBERTa, a bidirectional long short-term memory network (BiLSTM), a linear layer, and a conditional random field (CRF) layer, enabling accurate recognition of entity boundaries and BIO annotation types in unstructured text.

[0046] The input to the entity recognition model is the original text sequence with <s> and </s> markers added at the beginning and end, respectively, where the <s> and </s> markers identify the start and end of the sentence. The text sequence is denoted as:

[0047] $S_{ner} = \{w_1, w_2, \ldots, w_n\}$

[0048] where $w_i$ denotes the $i$-th word and $n$ is the number of words.

[0049] To match the input format of the pre-trained language model RoBERTa, the text sequence is split at the subword level by a tokenizer based on the Byte-Pair Encoding (BPE) strategy, generating the subword sequence:

[0050] $T_{ner} = \mathrm{BPE}(S_{ner}) = \{t_1, t_2, \ldots, t_m\}, \quad m \ge n$

[0051] where $t_i$ is the $i$-th subword produced by the tokenizer and $m$ is the length of the subword sequence. The subword sequence is fed into the pre-trained language model RoBERTa to obtain a context-dependent semantic representation of each subword, forming the representation sequence:

[0052] $E_{ner} = \mathrm{RoBERTa}(T_{ner}) = [e_1, e_2, \ldots, e_m] \in \mathbb{R}^{m \times d_{RoBERTa}}$

[0053] where $d_{RoBERTa}$ is the dimension of the hidden vectors output by RoBERTa. To further strengthen the modeling of bidirectional dependencies in the word representations, the RoBERTa output $E_{ner}$ is used as the input to the BiLSTM. For each subword $t_i$, the BiLSTM computes its forward hidden state $\overrightarrow{h_i}$ and backward hidden state $\overleftarrow{h_i}$, and concatenating the two gives the bidirectional contextual representation of the subword:

[0054] $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}] \in \mathbb{R}^{2 d_{LSTM}}$

[0055] where $d_{LSTM}$ is the hidden state dimension of a unidirectional LSTM. The resulting BiLSTM output sequence is:

[0056] $H_{ner} = [h_1, h_2, \ldots, h_m] \in \mathbb{R}^{m \times 2 d_{LSTM}}$

[0057] The output of the BiLSTM is mapped to the label space through a linear layer to discriminate between labels. For each token, the representation vector $h_i$ is projected to obtain the label score vector:

[0058] $s_i = W h_i + b, \quad s_i \in \mathbb{R}^{k}$

[0059] where $W \in \mathbb{R}^{k \times 2 d_{LSTM}}$ is a learnable weight matrix, $b \in \mathbb{R}^{k}$ is the bias term, and $k$ is the number of entity categories. Combining the scores of all tokens yields the original emission score sequence:

[0060] $ES_{ner} = [s_1, s_2, \ldots, s_m] \in \mathbb{R}^{m \times k}$

[0061] A dynamic masking mechanism is introduced to constrain label prediction in the entity recognition task. Specifically, on top of the traditional conditional random field (CRF) discriminant structure, the probability of predicting illegal labels is explicitly suppressed during the label scoring stage. Given the set of entity categories that are possible for the current entity recognition task, which is a subset of the set of all entity categories $L_{ner}$, a mask vector $M_{ner} \in \mathbb{R}^{k}$ with the same dimension as the score vector is constructed, whose element in dimension $j$ is defined as:

[0062] $M_{ner}[j] = \begin{cases} 0, & \text{if } \mathrm{dim2entity}(j) \text{ is a possible entity category for the current task} \\ -\infty, & \text{otherwise} \end{cases}$

[0063] Here, `dim2entity` is the mapping from vector dimensions to entity categories. The mask is added to each score vector of the original emission score sequence $ES_{ner}$ to obtain the final emission score:

[0064] $P_i = s_i + M_{ner}$

[0065] Combining the scores of all tokens gives the final emission score sequence:

[0066] $P = [P_1, P_2, \ldots, P_m] \in \mathbb{R}^{m \times k}$
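The masking step can be sketched in a few lines. This is a minimal illustration in plain Python rather than a tensor library; the label layout, the allowed set, and the use of negative infinity as the "minimum value" for illegal dimensions are assumptions.

```python
NEG_INF = float("-inf")


def build_mask(dim2entity, allowed):
    """0.0 for label dimensions whose category is allowed, -inf otherwise."""
    return [0.0 if dim2entity[j] in allowed else NEG_INF
            for j in range(len(dim2entity))]


def apply_mask(emissions, mask):
    """Add the same mask to every token's score vector: P_i = s_i + M."""
    return [[s + m for s, m in zip(row, mask)] for row in emissions]


dim2entity = ["O", "Device", "Vulnerability"]   # hypothetical label layout
allowed = {"O", "Device"}                        # categories valid for this task
emissions = [[0.2, 1.5, 3.0], [1.0, 0.1, 0.4]]  # emission scores, m=2, k=3
masked = apply_mask(emissions, build_mask(dim2entity, allowed))
print(masked)
```

Because the mask is additive and applied before decoding, the "Vulnerability" column can never win regardless of how high its raw score is, while the scores of legal labels are left unchanged.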

[0067] In this approach, the score of every illegal label dimension in each token's output score is set to a minimal value, so that illegal labels are completely filtered out in subsequent label path scoring and prediction. The CRF layer models the transition relationships between labels at the sequence level, improving prediction accuracy under BIO annotation. The scoring function of a label path is defined as:

[0068] $\mathrm{score}(Y) = \sum_{i=2}^{m} A_{Y_{i-1}, Y_i} + \sum_{i=1}^{m} P_{i, Y_i}$

[0069] where $Y$ is the predicted label sequence, $Y_i$ is the label predicted at position $i$, $A \in \mathbb{R}^{k \times k}$ is the learnable label transition matrix, $A_{Y_{i-1}, Y_i}$ is the score of transitioning from the $(i-1)$-th label $Y_{i-1}$ to the $i$-th label $Y_i$, and $P_{i, Y_i}$ is the score with which the BiLSTM output at position $i$ predicts label $Y_i$. The training objective is to maximize the conditional probability of the true label path, and the corresponding negative log-likelihood loss function is defined as:

[0070] $L_{NER} = -\log \dfrac{\exp(\mathrm{score}(Y_{true}))}{\sum_{Y' \in \mathcal{Y}} \exp(\mathrm{score}(Y'))}$

[0071] where $Y_{true}$ is the true label sequence and $\mathcal{Y}$ is the set of all possible legal label paths.

[0072] In the inference stage of entity recognition, CRF decoding uses the Viterbi algorithm to find the highest-scoring path among all possible label paths as the final output, i.e.:

[0073] $\hat{Y} = \arg\max_{Y \in \mathcal{Y}} \mathrm{score}(Y)$

[0074] where the predicted label sequence $\hat{Y}$ is the final output BIO sequence, which represents the entity boundary and category information of each subword in the text. By mapping subwords back to the original word level, the boundary positions and types of the entities in the original text are restored.
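Viterbi decoding over emission and transition scores can be sketched as follows. This is a generic textbook version in plain Python, assuming the path score is the sum of per-token emission scores and transition scores between consecutive labels; the toy scores and the three-label layout are made up.

```python
def viterbi(emissions, transitions):
    """Return (best_score, best_path) for emissions[m][k], transitions[k][k].

    transitions[p][c] is the score of moving from label p to label c.
    """
    m, k = len(emissions), len(emissions[0])
    score = list(emissions[0])      # best score of any path ending in each label
    backptr = []                    # backptr[i-1][c]: best previous label for c
    for i in range(1, m):
        new_score, pointers = [], []
        for cur in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + transitions[p][cur])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][cur]
                             + emissions[i][cur])
        score = new_score
        backptr.append(pointers)
    last = max(range(k), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backptr):   # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


labels = ["O", "B-Device", "I-Device"]          # hypothetical BIO label set
emissions = [[0.1, 2.0, 0.0], [0.5, 0.0, 1.5], [2.0, 0.1, 0.3]]
transitions = [[0.5, 0.0, -1e4],                 # O -> I-Device is discouraged
               [0.0, -1.0, 1.0],
               [0.2, -1.0, 0.5]]
best_score, best_path = viterbi(emissions, transitions)
print([labels[j] for j in best_path], best_score)
```

The decoder runs in O(m·k²) time instead of enumerating all k^m label paths, and masked label dimensions with a score of negative infinity are never selected because any path through them scores negative infinity.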

[0075] The relation extraction model is a multi-layer neural network structure consisting of RoBERTa, linear layers, and Softmax classification, with the goal of achieving accurate classification of entity pair relation types in text sequences containing entity information.

[0076] The input $S_{re}$ of the relation extraction model is a text sequence with entity information, with <s> and </s> markers added at the beginning and end, respectively. The text sequence with entity information carries the result of the entity recognition task by adding identifiers for the entity order (head or tail) and the entity category to the text sequence, as shown in the following example:

[0077] <s> The primary <head InterfaceDevice> HMI </head InterfaceDevice> sends operational commands to the <tail ControlDevice> PLC </tail ControlDevice> to manage the assembly line. </s>
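Constructing such a marked input from entity recognition output can be sketched as below. The helper name and the span convention (token indices, end exclusive) are assumptions; only the marker format follows the example.

```python
def mark_entities(tokens, head, tail):
    """Wrap the head and tail spans with typed markers and add <s>/</s>.

    head and tail are (start, end, entity_type) token spans with `end`
    exclusive; the two spans are assumed not to overlap.
    """
    roles = (("head", head), ("tail", tail))
    out = []
    for i, tok in enumerate(tokens):
        for role, (start, _end, etype) in roles:
            if i == start:
                out.append(f"<{role} {etype}>")      # opening marker
        out.append(tok)
        for role, (_start, end, etype) in roles:
            if i == end - 1:
                out.append(f"</{role} {etype}>")     # closing marker
    return " ".join(["<s>"] + out + ["</s>"])


tokens = "The primary HMI sends operational commands to the PLC".split()
marked = mark_entities(tokens, (2, 3, "InterfaceDevice"), (8, 9, "ControlDevice"))
print(marked)
```

Encoding the role and category directly in the text lets the relation extraction model reuse the same encoder input format as entity recognition, without extra feature channels.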

[0078] $S_{re}$ is converted into the subword sequence $T_{re}$ through the BPE tokenization strategy, and the context-dependent vector representation $H_{re}$ is obtained through the RoBERTa model; both steps are implemented in the same way as in the entity recognition part. The semantic representation at the position of the <s> marker in $H_{re}$ is extracted as the representation vector of the semantic relation for the entire entity pair:

[0079] $h_{sentence} = H_{re}[\texttt{<s>}] \in \mathbb{R}^{d_{RoBERTa}}$

[0080] where $d_{RoBERTa}$ is the dimension of the hidden vectors output by RoBERTa. A linear layer then maps $h_{sentence}$ to the label space, yielding the label score vector $s_{sentence} \in \mathbb{R}^{l}$, where $l$ is the number of all relation categories (including all direct relations defined in the ontology model and the additionally constructed "potential two-hop relations"). The linear layer is implemented in the same way as in the entity recognition part.

[0081] A dynamic masking mechanism is introduced to process the label scoring vector $s_{sentence}$ and explicitly suppress the prediction probability of illegal labels. A mask vector $M_{re} \in \mathbb{R}^{l}$ of the same dimension is constructed and added to the label scoring vector $s_{sentence}$ to obtain the final score $f_{sentence}$. The dynamic masking implementation is similar to the entity recognition part. Softmax classification is then used to calculate the probability distribution over relation categories:

[0082] $p(y \mid T_{re}) = \frac{\exp(f_y)}{\sum_{k=1}^{l} \exp(f_k)}$

[0083] Where $y \in \{1,2,...,l\}$ is the index over all relation categories and $f_y$ is the final score for relation category $y$. The training objective is to minimize the cross-entropy loss function, reducing the difference between the model's predicted relation category distribution and the true labels. The cross-entropy loss function $L_{RE}$ for the relation extraction task is defined as follows:

[0084] $L_{RE} = -\log p(y_{True} \mid T_{re})$

[0085] Where $y_{True}$ is the index of the true relation label of the current sample.

[0086] When performing relation extraction prediction, the relation category with the highest probability among all possible relation categories is selected as the predicted relation between the entity pair, i.e.:

[0087] $\hat{y} = \arg\max_{y \in \{1,2,...,l\}} p(y \mid T_{re})$

[0088] Where $\hat{y}$ is the predicted relation category label. When the predicted relation category is "potential two-hop relation", a candidate path is constructed through the relay node reasoning mechanism, and the subsequent path determination model judges whether the candidate path is reasonable.
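The masking, Softmax, and prediction steps above can be illustrated with a minimal sketch. It assumes that the mask holds 0 for legal labels and a large negative constant for illegal ones; the label names, scores, and the constant `NEG_INF` are hypothetical and are not taken from the invention.

```python
import math

NEG_INF = -1e9  # assumed mask value for illegal labels (hypothetical constant)

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply a numerically stable softmax."""
    final = [s + m for s, m in zip(scores, mask)]
    peak = max(final)
    exps = [math.exp(f - peak) for f in final]
    total = sum(exps)
    return [e / total for e in exps]

def predict_relation(scores, mask, labels):
    """Return the highest-probability label and the full distribution."""
    probs = masked_softmax(scores, mask)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

def cross_entropy(probs, true_index):
    """Negative log-probability of the true relation category."""
    return -math.log(probs[true_index])

# Hypothetical label set and scores for an (InterfaceDevice, ControlDevice) pair.
labels = ["controls", "connects_to", "exploits", "potential_two_hop"]
scores = [1.2, 2.0, 3.5, 0.3]
mask = [0.0, 0.0, NEG_INF, 0.0]  # "exploits" assumed illegal for this category pair

label, probs = predict_relation(scores, mask, labels)
loss = cross_entropy(probs, labels.index("connects_to"))
```

Although "exploits" has the highest raw score, the mask drives its probability to zero, so the prediction falls on the best legal label.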

[0089] The path determination model is a multi-layer neural network model consisting of RoBERTa, a linear layer, and a softmax discriminant, which is used to determine whether a candidate path is reasonable.

[0090] The input to the path determination model is the combined sequence $S_{pr}$ of the form "<s> original sentence </s> candidate path </s>". The candidate path is the entity relationship path from entity A to entity C via entity B and the relay relationships. For example, for the original sentence "The operation of HMI-01 caused an anomaly in valve-A", the relation extraction model identifies a "potential two-hop relation" between the head entity "HMI-01" and the tail entity "valve-A", but the relay entity connecting them (i.e., entity B, "PLC" in this scenario) is missing from the sentence. In this case, based on the ontology model, the system infers that "PLC" is the relay node connecting the two and constructs the candidate path "HMI-01 connects to PLC to control valve-A". The combined sequence $S_{pr}$ is then: <s> The operation of HMI-01 caused an anomaly in valve-A. </s> HMI-01 connects to PLC to control valve-A. </s>
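The relay node reasoning in this example can be sketched as a lookup of two-hop category paths in the ontology. This is only an illustrative sketch: the dictionary `ONTOLOGY`, the category name "Actuator", the relation phrases, and both helper functions are hypothetical stand-ins and not the invention's actual data structures.

```python
# Hypothetical ontology fragment: direct relations between entity categories.
ONTOLOGY = {
    ("InterfaceDevice", "ControlDevice"): "connects to",
    ("ControlDevice", "Actuator"): "to control",
}

def find_relay_paths(head_cat, tail_cat, ontology):
    """Return (relation1, relay_category, relation2) for every two-hop path."""
    paths = []
    for (src, mid), rel1 in ontology.items():
        if src != head_cat:
            continue
        rel2 = ontology.get((mid, tail_cat))
        if rel2 is not None:
            paths.append((rel1, mid, rel2))
    return paths

def build_path_input(sentence, head, tail, relay_name, rel1, rel2):
    """Assemble '<s> original sentence </s> candidate path </s>'."""
    candidate = f"{head} {rel1} {relay_name} {rel2} {tail}."
    return f"<s> {sentence} </s> {candidate} </s>"

paths = find_relay_paths("InterfaceDevice", "Actuator", ONTOLOGY)
rel1, relay_cat, rel2 = paths[0]
sequence = build_path_input(
    "The operation of HMI-01 caused an anomaly in valve-A.",
    "HMI-01", "valve-A", "PLC", rel1, rel2)
```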

[0091] $S_{pr}$ is tokenized with the BPE word segmentation strategy to obtain the sub-word sequence $T_{pr}$, which is encoded by the RoBERTa model to obtain the context-aware vector representation $H_{pr}$. The vector $h_{pr}$ at the <s> position is used to represent the comprehensive semantics of the entire input sequence. $h_{pr}$ is fed into a linear layer and mapped to a binary label space to obtain the path discrimination scoring vector $s_{pr} \in \mathbb{R}^{2}$. The BPE word segmentation strategy, RoBERTa model, and linear layer implementation are similar to the entity recognition part. The probability distribution $p(y' \mid T_{pr})$ of whether the path is reasonable is calculated using Softmax. The training objective is to minimize the difference between the predicted and true labels. The cross-entropy loss function $L_{PR}$ for the path determination task is defined as follows:

[0092] $L_{PR} = -[y' \cdot \log(p(y' \mid T_{pr})) + (1-y') \cdot \log(1 - p(y' \mid T_{pr}))]$

[0093] Where $y' \in \{0,1\}$ represents the judgments of an unreasonable and a reasonable path, respectively.

[0094] When making path predictions, the category with the highest probability is selected as the final decision:

[0095] $\hat{y}' = \arg\max_{y' \in \{0,1\}} p(y' \mid T_{pr})$

[0096] Where $\hat{y}'$ is the judgment of whether the path is reasonable. If the path is judged reasonable, a placeholder entity of the category corresponding to the relay node in that path is constructed, and knowledge triples are constructed between the placeholder entity and the entities at both ends of the candidate path.

[0097] To adapt to the low-sample characteristics of cyber-physical system security, both the entity recognition and relation extraction tasks employ a meta-learning training method based on the MAML (Model-Agnostic Meta-Learning) framework. This method extracts general knowledge from small samples across multiple tasks, improving the model's generalization ability on new tasks. In each episode, the model uses the support set $D_{support}$ to update the current model parameters $\theta$ and obtain the adapted parameters $\theta'$:

[0098] $\theta' = \theta - \alpha \nabla_{\theta} L_{inner}(\theta; D_{support})$

[0099] Where $\alpha$ is the inner loop learning rate and $L_{inner}$ is the training loss of the entity recognition or relation extraction task, respectively.

[0100] The adapted parameters $\theta'$ are used to calculate the loss $L_{outer}$ on the query set $D_{query}$, and a meta-gradient update is performed on the initial parameters $\theta$:

[0101] $\theta \leftarrow \theta - \beta \nabla_{\theta} L_{outer}(\theta'; D_{query})$

[0102] Where $\beta$ is the outer loop learning rate. In each training round, gradients are accumulated over the outer losses of multiple episodes.
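The inner and outer updates can be traced on a toy problem. The sketch below assumes a one-parameter model y = θ·x with a squared-error loss, so that the meta-gradient through the inner update can be written exactly by the chain rule. The data, learning rates, and function names are illustrative assumptions, and averaging the meta-gradient over episodes is one possible way to accumulate it.

```python
def loss(theta, data):
    """Mean squared error of the one-parameter model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    """First derivative of the loss with respect to theta."""
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)

def hessian(data):
    """Second derivative of the loss (constant for this quadratic loss)."""
    return sum(2 * x * x for x, _ in data) / len(data)

def maml_step(theta, episodes, alpha, beta):
    """One outer update, averaging the meta-gradient over all episodes."""
    meta_grad = 0.0
    for support, query in episodes:
        theta_adapted = theta - alpha * grad(theta, support)  # inner loop
        # Chain rule through the inner update: d theta' / d theta = 1 - alpha * H
        meta_grad += grad(theta_adapted, query) * (1 - alpha * hessian(support))
    return theta - beta * meta_grad / len(episodes)           # outer loop

# Two hypothetical tasks (slopes 2 and 4), each with a support and a query set.
episodes = [
    ([(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]),
    ([(1.0, 4.0), (2.0, 8.0)], [(3.0, 12.0)]),
]

theta = 0.0
for _ in range(200):
    theta = maml_step(theta, episodes, alpha=0.05, beta=0.01)
```

In this toy setting the meta-learned initialization settles midway between the two task optima, which is the point from which one inner step adapts equally well to either task.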

[0103] In the multi-task learning phase, entity recognition, relation extraction, and path determination are trained simultaneously. Considering that the entity recognition and relation extraction tasks employ a meta-learning framework, while path determination uses a traditional supervised learning approach, the multi-task learning objective function is a weighted sum of the losses from the three tasks:

[0104] $L_{total} = L_{NER}^{outer} + L_{RE}^{outer} + \lambda \cdot L_{PR}$

[0105] Where $L_{NER}^{outer}$ is the meta-learning outer loop loss for entity recognition, $L_{RE}^{outer}$ is the meta-learning outer loop loss for relation extraction, $L_{PR}$ is the binary cross-entropy loss for the path determination task, and $\lambda$ is the loss weight for the path determination task, used to balance the training intensity among the three.

[0106] For each training round, multiple episodes are sampled from the entity recognition and relation extraction tasks, and meta-learning inner and outer loop training is performed on each. Simultaneously, batch samples are extracted from the path judgment dataset for standard supervised training. The three losses are weighted and merged, backpropagated, and shared parameters are updated. To achieve efficient information exchange and parameter sharing between tasks, the same RoBERTa encoder is used for all three tasks.

[0107] The training data sources for entity recognition tasks include: open-source general-purpose meta-learning entity recognition datasets (such as Few-NERD); specialized domain entity recognition datasets built for cyber-physical system security (such as DNRTI); and small-shot entity recognition datasets based on unstructured text data and manually annotated with domain knowledge. An episode-based strategy is used to organize the data into multiple training tasks for meta-learning. Each episode contains a support set and a query set. The support set contains 5 to 10 entity categories (5-10 ways), and each category contains 5 to 10 entity samples (5-10 shots). The query set contains the same entity categories as the support set, and the number of samples in each category is three times that of the support set. The training data for entity recognition tasks is based on the standard BIO annotation format, where each word corresponds to a label, where B indicates the starting position of the entity, I indicates inside the entity, and O indicates not belonging to any predefined entity category.
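The episode organization described above can be sketched as follows, assuming samples are already grouped by entity category. The function name, the `query_ratio` parameter, and the toy data are hypothetical; the factor of three follows the stated query-to-support ratio.

```python
import random

def sample_episode(samples_by_class, n_way, k_shot, query_ratio=3, rng=random):
    """Sample an N-way K-shot episode whose query set has query_ratio * K samples per class."""
    needed = k_shot * (1 + query_ratio)
    eligible = [c for c, items in samples_by_class.items() if len(items) >= needed]
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(samples_by_class[c], needed)
        support += [(x, c) for x in picked[:k_shot]]   # K support samples
        query += [(x, c) for x in picked[k_shot:]]     # 3K disjoint query samples
    return support, query

# Hypothetical toy data: 6 entity categories with 30 mentions each.
data = {f"category_{i}": [f"mention_{i}_{j}" for j in range(30)] for i in range(6)}
support, query = sample_episode(data, n_way=5, k_shot=5, rng=random.Random(0))
```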

[0108] The training data sources for the relation extraction task include: open-source general-purpose domain meta-learning relation extraction datasets (such as Few-REL); entity relation annotation data constructed based on cyber-physical system security domain entity recognition datasets (such as DNRTI), using existing entity labels to assist in labeling relation types; and few-sample relation extraction datasets based on unstructured text data and manually annotated with domain knowledge. The relation extraction dataset is constructed using an episode construction mechanism consistent with that of the entity recognition task.

[0109] The training data for the path determination task comes from network threat intelligence, which consists of candidate path sequences constructed from two-hop relationship data and original sentences, as well as manually labeled data on whether they are reasonable.

[0110] After the unstructured text data processed in step (2) passes through entity recognition, relation extraction, and path determination, a set of triples is obtained:

[0111] $S_{triples} = \{(h_i, r_i, t_i)\}_{i=1}^{N_{triples}}$

[0112] Where $h_i$ and $t_i$ represent the head entity and the tail entity, $r_i$ represents the semantic relationship between them, and $N_{triples}$ is the number of triples.

[0113] (5) For the set of triples $S_{triples}$ obtained in step (4), a phased entity disambiguation and graph fusion framework is used to address the challenges of constructing a knowledge graph directly from the triple set. These challenges include: the same entity may have different representations in different locations; entities in triples may be in natural language forms that require further normalization; and the UNK (placeholder) entities produced by the path determination task require further alignment and fusion. The phased entity disambiguation and graph fusion framework consists of two stages: intra-text entity disambiguation and graph fusion, and global entity disambiguation and graph fusion.

[0114] In the intra-document entity disambiguation and graph fusion stage, the focus is on entity alignment and fusion within a single document. Since entities within the same document often exhibit explicit or implicit referencing, synonym substitution, and shortening within the context, semantic alignment is performed using entities and their contextual information. Referencing chains and contextual clustering are used to determine whether entities within the document refer to the same object. Finally, synonymous entities are merged into unified nodes to generate a document-level knowledge graph. This stage primarily addresses the issues of local semantic consistency and referential consistency.

[0115] In the global entity disambiguation and graph fusion stage, it is necessary to merge the document-level knowledge graph and the knowledge graph extracted based on rules in step (3) across documents to establish a unified knowledge graph. Unlike the intra-document stage, the global stage completes entity alignment and global fusion based on features such as structural similarity and semantic similarity between triples. This stage mainly solves the consistency problems of cross-text entity consistency and graph fusion.

[0116] For intra-text entity disambiguation and graph fusion, the input is the entity set of the triple set extracted in step (4):

[0117] $S_{entities} = \{(m_i, t_i, s_i, p_i)\}_{i=1}^{N_{entities}}$

[0118] Where $m_i$ is the natural language representation of the entity, $t_i$ is the entity category tag, $s_i$ is the position of the entity's sentence within the text passage, $p_i$ is the position of the entity within the sentence, and $N_{entities}$ is the number of entities. The original document text containing the entity context is retrieved:

[0119] $S_{context} = \{s_1, s_2, ...\}$

[0120] Where $s_i$ is the i-th sentence in the text. Each sentence is a token sequence after BPE processing. The BPE processing is similar to the entity recognition part.

[0121] Specifically, an entity context representation is first constructed. The sentence $s_i$ in which the entity is located serves as its contextual semantic window, and the RoBERTa model trained in step (4) is used to encode the tokens of $s_i$ to obtain a sentence-level representation:

[0122]

[0123] in, For s i The length of the token sequence corresponding to the sentence. Based on the entity's position in the sentence, extract the representation of the corresponding token from the sentence representation, and use average pooling to obtain the entity representation:

[0124]

[0125] Subsequently, intra-text reference detection and referential chain construction are implemented. For any entity pair $(e_i, e_j)$ of the same entity category, its semantic matching feature vector is constructed:

[0126]

[0127] The semantic matching feature vector is then input into a two-layer perceptron (MLP) for discrimination:

[0128] First layer (hidden layer):

[0129]

[0130] Where $W_1$ is a learnable weight matrix and $b_1$ is a learnable bias.

[0131] Second layer (output layer):

[0132]

[0133] Where $W_2$ is a learnable weight matrix and $b_2 \in \mathbb{R}^{1}$ is a learnable bias.

[0134] The final output is the referential probability, which represents the probability that the entity pair represents a referential relationship:

[0135]

[0136] The training data for the multilayer perceptron comes from the entity set $S_{entities}$ extracted in step (4). Positive and negative sample pairs are constructed manually from the text context and the entity pairs within that context. Positive samples are entity pairs of the same category within the same referential chain, while negative samples are entity pairs of the same category in different referential chains. The training objective is to minimize the binary cross-entropy loss:

[0137] $L_{coref} = -[y''_{i,j} \cdot \log(p_{coref}(i,j)) + (1 - y''_{i,j}) \cdot \log(1 - p_{coref}(i,j))]$

[0138] Where $y''_{i,j} \in \{0,1\}$ is a label indicating whether the entity pair belongs to the same referential chain.

[0139] After completing the in-text entity recognition and referential judgment, the probability of referential relationship between entity pairs is obtained. However, since referential recognition often depends on local context, this point-to-point judgment method is prone to introducing two types of errors: some entity pairs are misjudged as referential relationships due to accidental contextual similarity; entity pairs belonging to the same referential chain are not judged as referential relationships due to syntactic distance, expression differences, etc.

[0140] To address this, a hierarchical clustering method based on the average cluster referential probability is introduced. This method takes a global perspective, considering the overall relationship structure among multiple entities rather than only local judgments between single entity pairs, and thus achieves more robust and consistent referential chain construction. First, based on the entity set $S_{entities}$, for each entity category, every entity initially forms its own cluster:

[0141] C = {{e1}, {e2}, ...}

[0142] For two entity clusters $C_a$ and $C_b$, their average referential similarity is defined as:

[0143] $sim(C_a, C_b) = \frac{1}{|C_a| \cdot |C_b|} \sum_{e_i \in C_a} \sum_{e_j \in C_b} p_{coref}(i,j)$

[0144] Where $|C_a|$ and $|C_b|$ represent the number of entities contained in entity clusters $C_a$ and $C_b$.

[0145] In each cluster merging step, the two clusters $(C_a, C_b)$ with the highest average referential similarity are selected, so that:

[0146] $(C_a, C_b) = \arg\max_{C_x, C_y \in C,\ x \neq y} sim(C_x, C_y)$

[0147] If:

[0148] $sim(C_a, C_b) \geq \tau$

[0149] Then $C_a$ is merged with $C_b$:

[0150] $C_{new} = C_a \cup C_b, \quad C \leftarrow (C \setminus \{C_a, C_b\}) \cup \{C_{new}\}$

[0151] Otherwise, clustering stops. Here, $\tau$ is the preset similarity threshold. Finally, the clustering result is obtained:

[0152] $C = \{C_1, C_2, ...\}$

[0153] Where each $C_i$ is a set of equivalent entities forming one referential chain.
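The merging procedure can be sketched in a few lines. The sketch assumes the pairwise referential probabilities are already available in a dictionary keyed by unordered entity pairs; the entity indices, probabilities, and threshold are hypothetical values chosen for illustration.

```python
from itertools import combinations

def average_similarity(cluster_a, cluster_b, prob):
    """Mean pairwise referential probability between two clusters."""
    total = sum(prob[frozenset((i, j))] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def cluster_entities(entity_ids, prob, tau):
    """Greedily merge the most similar cluster pair until similarity drops below tau."""
    clusters = [frozenset([e]) for e in entity_ids]
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(pair[0], pair[1], prob))
        if average_similarity(a, b, prob) < tau:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical referential probabilities for four same-category mentions.
prob = {
    frozenset((0, 1)): 0.9, frozenset((0, 2)): 0.8, frozenset((1, 2)): 0.4,
    frozenset((0, 3)): 0.1, frozenset((1, 3)): 0.2, frozenset((2, 3)): 0.1,
}
chains = cluster_entities([0, 1, 2, 3], prob, tau=0.5)
```

Mentions 1 and 2 have a pairwise probability below the threshold, yet they end up in the same chain because the cluster average with mention 0 is high enough; this is the kind of local misjudgment the cluster-level criterion is meant to absorb.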

[0154] For each referential chain $C_i = \{e_1, e_2, ...\}$ with $e_j = (m_j, t_j, s_j, p_j)$, the longest non-pronoun natural language expression among its members is chosen as the entity name:

[0155]

[0156] The $m_j$ of every $e_j \in C_i$ is then replaced with the selected entity name. This completes intra-text entity disambiguation and graph fusion and yields the text-level knowledge graph $G_{local}$.
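A minimal sketch of the name selection follows. It assumes that "longest" is measured in characters and that pronouns are recognized with a small fixed word list; both choices, as well as the names `PRONOUNS` and `canonical_name`, are illustrative assumptions.

```python
# Hypothetical pronoun list; a real system would need a fuller inventory.
PRONOUNS = {"it", "its", "they", "them", "this", "that", "these", "those"}

def canonical_name(mentions):
    """Pick the longest mention that is not a pronoun, or None if there is none."""
    candidates = [m for m in mentions if m.lower() not in PRONOUNS]
    return max(candidates, key=len) if candidates else None

chain = ["HMI", "it", "the primary HMI", "HMI-01"]
name = canonical_name(chain)
unified = [name for _ in chain]  # every mention in the chain takes the chosen name
```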

[0157] After intra-text entity disambiguation and graph fusion, multiple text-level knowledge graphs $G_{local}$ are obtained. Meanwhile, step (3) yields the rule-based knowledge graphs $G_{rules}$. However, different graphs may still differ in entity representation and have dispersed structures. Therefore, global entity disambiguation and graph fusion is further performed.

[0158] In the global entity disambiguation and graph fusion stage, the input includes the text-level knowledge graphs $G_{local} = \{G_1, G_2, ...\}$ and the knowledge graphs obtained from rule extraction $G_{rules} = \{G_1, G_2, ...\}$, where $G_i = (V_i, E_i)$ represents the knowledge graph extracted from the i-th document or the knowledge graph extracted from the i-th structured / semi-structured data according to rules, $V_i$ is the set of entity nodes, and $E_i$ is the set of edges representing relationships. The goal is to construct a unified knowledge graph $G_{global} = (V, E)$.

[0159] The set of triples is extracted from all input knowledge graphs:

[0160] $S_{new\_triples} = \{(h_i, r_i, t_i)\}_{i=1}^{N_{new\_triples}}$

[0161] Where $h_i, t_i \in S_{entities}$ represent the head entity and tail entity respectively, $r_i$ represents the semantic relationship between them, and $N_{new\_triples}$ is the number of all triples. Each entity $e$ has a corresponding natural language representation $m_e$.

[0162] For each entity $e$ to be aligned, its set of neighboring entity relations (consisting of its adjacent entities and relations) is constructed:

[0163] $Neighbor(e) = \{(r_i, e_i) \mid (e, r_i, e_i) \in S_{new\_triples} \lor (e_i, r_i, e) \in S_{new\_triples}\}$
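The neighbor set can be collected with a single pass over the triples, as in the sketch below. The function name and the sample triples are hypothetical.

```python
def neighbors(entity, triples):
    """Collect (relation, other_entity) pairs where `entity` is head or tail."""
    result = set()
    for head, relation, tail in triples:
        if head == entity:
            result.add((relation, tail))
        if tail == entity:
            result.add((relation, head))
    return result

# Hypothetical triples taken from several input graphs.
triples = [
    ("HMI-01", "connects_to", "PLC-1"),
    ("PLC-1", "controls", "valve-A"),
    ("HMI-02", "connects_to", "PLC-2"),
]
plc_neighbors = neighbors("PLC-1", triples)
```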

[0164] Linearize it into a sequence:

[0165]

[0166] Where $m_{e_j}$ is the natural language representation corresponding to $e_j$. The sequence is segmented with BPE to obtain $T_e$, which is input into RoBERTa to obtain a structure-aware semantic representation of the entity. The BPE segmentation and RoBERTa implementation are similar to the entity recognition part:

[0167] $H_e = \mathrm{RoBERTa}(T_e) \in \mathbb{R}^{L \times d_{RoBERTa}}$

[0168] Where $d_{RoBERTa}$ and $L$ represent the hidden vector dimension and the sequence length. The RoBERTa model used here is obtained by fine-tuning the RoBERTa model from step (4) on constructed positive and negative sample pairs. The data comes from the set of triples $S_{new\_triples}$ of all input knowledge graphs, the annotation method is similar to the intra-text entity disambiguation and graph fusion implementation, and RoBERTa is trained using a contrastive loss:

[0169]

[0170] Where $P_{pair}$ is the set of constructed positive sample pairs, $N_{pair}$ is the set of constructed negative sample pairs, and $\delta$ is the negative sample margin threshold.

[0171] The representation at the <s> position is taken as the final vector representation of the entity:

[0172] $h_e = H_e[\text{<s>}] \in \mathbb{R}^{d_{RoBERTa}}$

[0173] For any pair of entities $(e_a, e_b)$ of the same entity category, the cosine similarity between their semantic vectors is calculated as the probability of entity alignment:

[0174] $\cos(h_{e_a}, h_{e_b}) = \frac{h_{e_a} \cdot h_{e_b}}{\|h_{e_a}\| \, \|h_{e_b}\|}$

[0175] Similarly, a hierarchical clustering method based on the average cluster probability is used to cluster all entities of the same entity category. Its implementation is similar to the intra-text entity disambiguation and graph fusion part, and it finally yields the unified knowledge graph $G_{global}$.

[0176] (6) Graph completion and pruning tasks are proposed to complete missing relationships and delete redundant relationships, addressing the missing and redundant relationships that still exist in the unified knowledge graph $G_{global}$. A fusion model combining semantic and structural bimodal embeddings is proposed to perform the graph completion and graph pruning tasks at the same time.

[0177] The input is the entity pair $(e_a, e_b)$ to be completed or pruned. For completion, the entity categories of $e_a$ and $e_b$ have a relationship in the ontology model, but no corresponding relationship exists in $G_{global}$. For pruning, a relationship exists between $e_a$ and $e_b$ in $G_{global}$.

[0178] Specifically, the semantic embeddings $h_{e_a}$ and $h_{e_b}$ of entities $e_a$ and $e_b$ are first constructed using the RoBERTa model; the implementation is similar to the global entity disambiguation and graph fusion part in step (5). Secondly, using the ComplEx embedding model, all entities and relations are trained on the triple set to obtain complex embedding vector representations of entities and relations. Each head entity $e_h$, relation $r$, and tail entity $e_t$ is embedded into the complex space and represented as follows:

[0179] $e_h = \mathrm{Re}(e_h) + i \cdot \mathrm{Im}(e_h)$

[0180] $e_t = \mathrm{Re}(e_t) + i \cdot \mathrm{Im}(e_t)$

[0181] $r = \mathrm{Re}(r) + i \cdot \mathrm{Im}(r)$

[0182] Here, Re and Im denote the real and imaginary parts of the embedding vector, both of which are real-valued vectors.

[0183] The ComplEx embedding model scores a triple $(e_h, r, e_t)$ using the following function:

[0184] $\mathrm{Re}(\langle e_h, r, \bar{e}_t \rangle) = \mathrm{Re}\left(\sum_{k} e_{h,k} \, r_k \, \bar{e}_{t,k}\right)$

[0185] Where $\bar{e}_t$ is the complex conjugate of $e_t$ and $\langle \cdot,\cdot,\cdot \rangle$ denotes the trilinear dot product. Training uses a contrastive loss function with negative sampling. For the set of positive sample triples $\tau^{+}$ and the set of negative sample triples $\tau^{-}$, the objective function is defined as:

[0186]

[0187] Where $\lambda_{Comp}$ is a hyperparameter used to control the strength of the L2 regularization term, and $\Theta_{Comp}$ is the set of all parameters to be learned in the ComplEx model. Positive samples are drawn from the unified knowledge graph $G_{global}$, and negative samples are obtained from triples by randomly replacing the tail or head entity.
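The ComplEx scoring function, the real part of the trilinear product of the head embedding, the relation embedding, and the conjugated tail embedding, can be checked on small vectors using Python's built-in complex numbers. The two-dimensional embeddings below are made-up values for illustration only.

```python
def complex_score(head, rel, tail):
    """Real part of the trilinear product <head, rel, conj(tail)>."""
    return sum(h * r * t.conjugate() for h, r, t in zip(head, rel, tail)).real

# Hypothetical 2-dimensional complex embeddings.
e_h = [1 + 2j, 0.5 - 1j]
r = [0 + 1j, 1 + 0j]
e_t = [2 - 1j, 1 + 1j]

forward = complex_score(e_h, r, e_t)
backward = complex_score(e_t, r, e_h)  # head and tail swapped
```

Swapping head and tail changes the score here, which reflects the ability of complex-valued embeddings to model asymmetric relations.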

[0188] After training, the real and imaginary parts of each entity embedding are concatenated to obtain the real-valued structure vectors of $e_a$ and $e_b$. The semantic embeddings and the structure vectors are then combined to obtain the fused vector $x_{ab}$:

[0189]

[0190] Score output via MLP layer:

[0191] $s_{ab} = \mathrm{Sigmoid}(W_2 \cdot \mathrm{ReLU}(W_1 x_{ab} + b_1) + b_2)$

[0192] Where W1 and W2 are trainable weight matrices, and b1 and b2 are trainable biases.

[0193] The training data comes from the unified knowledge graph $G_{global}$ constructed in step (5). Graph completion and graph pruning datasets are built by manually labeling the missing or redundant relationships between entity pairs, and the model is trained using a binary cross-entropy loss function:

[0194] $L = -[y \cdot \log(s_{ab}) + (1-y) \cdot \log(1 - s_{ab})]$

[0195] Where $y \in \{0,1\}$ is the label indicating whether an entity pair is to be completed or pruned. After training, in the graph completion phase, the set CP of unconnected entity pairs whose category pairs are legal is enumerated. For each pair $(h,t) \in CP$, the semantic and structural embeddings are computed and fused, and the model predicts the score $s_{ht}$. If $s_{ht} > \tau_{comp}$, a new edge is established whose relation type is the legal type between the entity pair, where $\tau_{comp}$ is a constant threshold for completion. In the graph pruning phase, all triples $(h,r,t)$ of the unified knowledge graph $G_{global}$ are traversed, the semantic and structural embeddings of the head and tail entities are computed and fused, and the model predicts the score $s_{ht}$. If $s_{ht} < \tau_{del}$, the edge is deleted, where $\tau_{del}$ is a constant threshold for pruning. This yields the final knowledge graph.
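The two threshold decisions can be sketched as follows. The `score` callable stands in for the trained fusion model and `legal_relation` for the ontology lookup; both, along with the sample scores and thresholds, are hypothetical placeholders rather than the actual components.

```python
def complete_and_prune(triples, candidate_pairs, legal_relation, score, tau_comp, tau_del):
    """Return the triples kept after pruning plus the newly completed ones."""
    # Pruning: an existing edge is deleted when its score falls below tau_del.
    kept = [(h, r, t) for h, r, t in triples if score(h, t) >= tau_del]
    # Completion: an unconnected legal pair gets an edge when its score exceeds tau_comp.
    added = [(h, legal_relation(h, t), t)
             for h, t in candidate_pairs if score(h, t) > tau_comp]
    return kept + added

# Hypothetical scores standing in for the model output s_ht.
scores = {("HMI-01", "PLC-1"): 0.93, ("PLC-1", "valve-A"): 0.88,
          ("HMI-01", "valve-A"): 0.12, ("HMI-02", "PLC-1"): 0.81}
triples = [("HMI-01", "connects_to", "PLC-1"),
           ("PLC-1", "controls", "valve-A"),
           ("HMI-01", "connects_to", "valve-A")]
candidate_pairs = [("HMI-02", "PLC-1")]

final_graph = complete_and_prune(
    triples, candidate_pairs,
    legal_relation=lambda h, t: "connects_to",
    score=lambda h, t: scores[(h, t)],
    tau_comp=0.7, tau_del=0.3)
```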

[0196] (7) The knowledge graph completed and pruned in step (6) is stored in a MySQL relational database, with entity and relation information organized in a standardized form. Entity information and relation information are stored in entity tables and relation tables, respectively, established according to type. The primary key field id of an entity table is the unique identifier of the entity. The fields of a relation table include head_entity_id (head entity ID) and tail_entity_id (tail entity ID). Based on the cyber-physical security ontology model constructed in step (1) and improved in step (2), a D2RQ mapping file is constructed that specifies the entity and relation categories corresponding to each database table and its fields. The D2RQ software is then used to convert the data in the MySQL database to RDF, resulting in a cyber-physical security knowledge graph in standard RDF representation.

[0197] Figure 1 is a simplified architecture diagram of the knowledge graph construction method for cyber-physical system security. First, based on open knowledge graph standards, and combined with publicly available standards, ontology models, and expert knowledge in the cyber-physical system security field, a cyber-physical security ontology model is constructed. Second, multi-source heterogeneous data is collected and preprocessed to supplement and improve the cyber-physical security ontology model. Then, for structured and semi-structured data, a rule-based entity and relation extraction model is constructed under the guidance of the cyber-physical security ontology model to obtain a rule-based knowledge graph. For unstructured data, an information extraction method based on multi-task learning is used, and a relay node reasoning method is introduced to achieve entity recognition and relation extraction, obtaining knowledge triples and candidate path sequences; relay-entity-related triples are then obtained through candidate path judgment. Next, for the set of triples extracted from the unstructured data of the same document, intra-document entity disambiguation and graph fusion are used to obtain a document-level knowledge graph. All document-level knowledge graphs are then combined with the rule-based knowledge graphs obtained through rule-based information extraction, and a unified knowledge graph is obtained through global entity disambiguation and graph fusion. Graph completion and pruning are then performed on the unified knowledge graph: for all candidate entity pairs, entity structure embeddings are obtained using a ComplEx embedding model, entity semantic embeddings are obtained using a RoBERTa pre-trained language model, and a multilayer perceptron is used to determine missing or incorrect relationships between entity pairs. Triple completion and pruning then yield the cyber-physical security knowledge graph. Finally, the cyber-physical security knowledge graph is stored in a MySQL database, a D2RQ mapping file is constructed based on the ontology model, and the cyber-physical security knowledge graph in standard RDF representation is constructed using the D2RQ open-source RDF conversion software.

[0198] Figure 2 is a detailed architecture diagram of the information extraction method based on multi-task learning. The entity recognition, relation extraction, and path determination models share the pre-trained RoBERTa model, and a specific adapter for each model is constructed based on the proposed relay node reasoning method. The entity recognition task takes a raw text sequence as input, which is processed through a RoBERTa layer, a BiLSTM layer, a Linear layer, and a CRF layer implemented with dynamic masking to obtain the entity category sequence (BIO) annotations. The relation extraction task takes a text sequence containing entity information as input, which is processed through a RoBERTa layer, a Linear layer, and a Softmax layer implemented with dynamic masking to obtain the relation categories between entities and the candidate path sequence. The path determination task takes a combination of the raw text sequence and a candidate path as input, which is processed through a RoBERTa layer, a Linear layer, and a Softmax layer to determine whether the candidate path is reasonable.

[0199] Figure 3 is an architecture diagram of the two-stage entity disambiguation and graph fusion task. The intra-text entity disambiguation and graph fusion module targets entity pairs of the same entity category extracted from the same text. It uses a RoBERTa pre-trained model and performs average pooling based on entity position to obtain semantic representations of the entity pairs. A multilayer perceptron is used to obtain the probability that an entity pair refers to the same entity. Based on these probabilities for all entity pairs, a hierarchical clustering method based on the average cluster referential probability is used to construct referential chains. Entities in the same referential chain are unified to obtain a text-level knowledge graph. The global entity disambiguation and graph fusion module uses the text-level knowledge graphs obtained from the intra-text module, as well as the rule-based knowledge graphs, to construct entity pairs of the same entity category. It also constructs an entity pair structural context sequence based on the entity pair, its adjacent entities, and the relationships between them. This sequence is processed by a RoBERTa pre-trained model, and the semantic representation corresponding to the <s> token is used as the semantic representation of the entity pair. The probability that the entity pair refers to the same entity is obtained from the cosine similarity between the semantic representations. Based on these probabilities for all entity pairs, the hierarchical clustering method based on the average cluster referential probability is used to construct referential chains. Entities in the same referential chain are unified to obtain a unified knowledge graph.

[0200] Corresponding to the aforementioned embodiments of the knowledge graph construction method for cyber-physical system security, the present invention also provides embodiments of a knowledge graph construction apparatus for cyber-physical system security.

[0201] Referring to Figure 4, the knowledge graph construction apparatus for cyber-physical system security provided in this embodiment includes a memory and one or more processors. The memory stores executable code, and when the processor executes the executable code, it implements the knowledge graph construction method for cyber-physical system security in the above embodiment.

[0202] The embodiments of the knowledge graph construction apparatus for cyber-physical system security of the present invention can be applied to any device with data processing capabilities, such as a computer or other similar device. The apparatus embodiments can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, as a logical apparatus, it is formed by the processor of the data processing device on which it resides loading the corresponding computer program instructions from non-volatile memory into memory for execution. From a hardware perspective, Figure 4 shows a hardware structure diagram of a device with data processing capabilities in which the knowledge graph construction apparatus for cyber-physical system security of the present invention resides. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 4, the data processing device in the embodiment may also include other hardware depending on its actual function, which will not be described in detail here.

[0203] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0204] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0205] This invention also provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements the knowledge graph construction method for cyber-physical system security described in the above embodiments.

[0206] The computer-readable storage medium can be an internal storage unit of any data processing device as described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device of any data processing device, such as a plug-in hard disk, smart media card (SMC), SD card, flash card, etc., equipped on the device. Furthermore, the computer-readable storage medium can include both internal storage units and external storage devices of any data processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.

[0207] The above description is merely a preferred embodiment of one or more embodiments of this specification and is not intended to limit the scope of one or more embodiments of this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this specification should be included within the protection scope of one or more embodiments of this specification.

Claims

1. A method for constructing a knowledge graph for cyber-physical system security, characterized in that it includes the following steps:

S1: Construct a cyber-physical security ontology model top-down. Based on knowledge graph open standards, combined with existing cyber-physical system security-related standards and ontology models, and integrating domain expert knowledge, the ontology model is defined. A lightweight ontology construction method is adopted, which models only the direct relationships between entity categories when constructing the ontology model; indirect relationships are expressed by combining multiple direct relationships and are not explicitly defined in the ontology model.

S2: Collect multi-source heterogeneous data in the cyber-physical system security field, perform data preprocessing operations, and classify the multi-source heterogeneous data into structured data, semi-structured data, and unstructured text data according to data characteristics; based on the data actually collected, improve the ontology model constructed in S1 bottom-up.

S3: Based on the structured and semi-structured data preprocessed in S2, establish mapping rules between the data and the entity types, relationships, and attributes in the ontology model, and use a rule-based approach to extract entities and relationships from the structured and semi-structured data, thereby constructing a rule-based knowledge graph.

S4: For the unstructured text data preprocessed in S2, adopt an unstructured text information extraction method based on multi-task learning, construct three sub-tasks (entity recognition, relation extraction, and path judgment), extract entity and relation information, and obtain knowledge triples.

The entity recognition task takes a raw text sequence as input. After passing through RoBERTa, BiLSTM, and Linear layers and a CRF layer implemented with dynamic masks, it obtains a BIO-labeled entity category sequence.

The relation extraction task takes a text sequence with entity information as input. After passing through RoBERTa and Linear layers and a Softmax layer implemented with dynamic masks, it obtains the relationship categories between entities and a candidate path sequence.

The path judgment task takes a combination of a raw text sequence and candidate paths as input. After passing through RoBERTa, Linear, and Softmax layers, it obtains a judgment on whether each candidate path is reasonable.

In actual semantic text, two entities may have no direct relationship predefined in the ontology model built with the lightweight ontology construction method, and may instead be associated through indirect relationships. A relay node reasoning mechanism is therefore employed. The relation extraction task is first extended by constructing potential two-hop relations for all predefined entity category pairs in the ontology model that may have two-hop associations. When the relation extraction model identifies a potential two-hop relation between two entities, the relay node reasoning mechanism is triggered. First, it is determined whether there is a node in the text that can establish a direct relationship with both entities; this node is the relay node. If it exists, the semantic information structure extracted from the text is considered complete. If it does not exist, all entity types that can serve as relay node types are searched based on the ontology model, and all theoretically possible connection paths are constructed as candidate path sequences.

S5: Based on the knowledge triples obtained in S4, construct an intra-article entity disambiguation and graph fusion model, merging synonymous entities from the same article into unified nodes to generate an article-level knowledge graph; based on the article-level knowledge graph and the rule-based knowledge graph constructed in S3, construct a global entity disambiguation and graph fusion model to establish a unified knowledge graph.

S6: Based on the unified knowledge graph established in S5, construct graph completion and pruning tasks. A semantic and structural bimodal embedding fusion model is adopted to complete or prune, respectively, all entity pairs that need to be completed or pruned, so as to complete missing relations and delete redundant relations.

S7: Store the knowledge graph obtained after graph completion and pruning in S6 in a relational database, organizing entity and relation information in a standardized form; based on the cyber-physical security ontology model constructed in S1 and improved in S2, build a D2RQ mapping file, complete the standard conversion of the data in the relational database to RDF, and obtain a cyber-physical security knowledge graph in standard RDF representation.

2. The knowledge graph construction method for cyber-physical system security according to claim 1, characterized in that, in S1, the constructed ontology covers core concepts such as cyber-physical devices, vulnerability information, attack patterns, attackers, protection strategies, and security events; in S2, the multi-source heterogeneous data includes: threat intelligence and security event data, on-site operational data and sensor information, network security logs and traffic data, vulnerability database information, and system configuration and operational documents; and the data preprocessing operations include: noise data removal, format conversion, and redundancy deduplication.
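
As a rough illustration of the relay node reasoning mechanism described in S4 of claim 1, the following Python sketch enumerates two-hop candidate paths from a lightweight ontology. The ontology entries, relation names, the `Entity` type, and all function names are hypothetical and are not taken from the patent. The check for a relay node in the text is approximated here by entity type compatibility, whereas the claimed method relies on the relation extraction model.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    name: str
    etype: str


# Hypothetical lightweight ontology: only direct relations between entity
# categories, written as (head type, relation, tail type).
ONTOLOGY = {
    ("Attacker", "uses", "AttackPattern"),
    ("AttackPattern", "exploits", "Vulnerability"),
    ("Vulnerability", "affects", "Device"),
    ("ProtectionStrategy", "mitigates", "Vulnerability"),
}


def direct_relations(head_type, tail_type):
    """Relations the ontology defines directly between two categories."""
    return [r for (h, r, t) in ONTOLOGY if h == head_type and t == tail_type]


def candidate_paths(head_type, tail_type):
    """All two-hop type paths head -> relay -> tail allowed by the ontology."""
    paths = []
    for (h1, r1, mid) in sorted(ONTOLOGY):
        if h1 != head_type:
            continue
        for r2 in direct_relations(mid, tail_type):
            paths.append((head_type, r1, mid, r2, tail_type))
    return paths


def relay_node_reasoning(head, tail, text_entities):
    """Look for a relay node in the text; otherwise return candidate paths.

    Assumption: an entity counts as a relay node if its type can sit between
    the two entity types according to the ontology.
    """
    paths = candidate_paths(head.etype, tail.etype)
    relay_types = {p[2] for p in paths}
    relays = [e for e in text_entities
              if e.etype in relay_types and e not in (head, tail)]
    if relays:
        return {"status": "complete", "relays": relays, "candidates": []}
    # No relay node in the text: pass candidates on to the path judgment task.
    return {"status": "needs_path_judgment", "relays": [], "candidates": paths}
```

In this sketch, the candidate paths returned in the second case would be the input to the path judgment task, which decides whether each path is reasonable for the given text.
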
3. The knowledge graph construction method for cyber-physical system security according to claim 1, characterized in that, in S3, during semi-structured data processing, regular expression matching and JSON file format parsing are used to construct entity and relation extraction rules, thereby building knowledge triples; during structured data processing, a field mapping model is constructed based on the correspondence between the original fields and the entities and attributes in the ontology model, directly migrating structured data from the source database to knowledge graph triples.

4. The knowledge graph construction method for cyber-physical system security according to claim 1, characterized in that, in S4, the path judgment task evaluates the reasonableness of the candidate paths obtained through the relay node reasoning mechanism. If a candidate path is judged to be reasonable, a placeholder entity representing the relay node is constructed. This entity acts as a temporary relay node in the path construction stage to maintain the coherence of the knowledge path structure, and is aligned with similar entities in other graphs in the subsequent entity disambiguation and graph fusion stages.

5. The knowledge graph construction method for cyber-physical system security according to claim 1, characterized in that, in S4, a multi-task learning framework is constructed. Based on the shared semantic representation of the RoBERTa encoder, adapters are designed for the entity recognition, relation extraction, and path judgment tasks respectively. Label prediction in the entity recognition and relation extraction tasks is constrained by a dynamic masking mechanism, and the model is trained by jointly optimizing the loss functions.

6. The knowledge graph construction method for cyber-physical system security according to claim 1, characterized in that, in S4, a model-agnostic meta-learning (MAML) method is used for inner-loop and outer-loop training of the entity recognition and relation extraction tasks, and standard supervised training is performed for the path judgment task.
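
The rule-based extraction from semi-structured data described in claim 3 (JSON parsing combined with regular expression matching) can be sketched roughly as follows. The record layout, the field names `device`, `vendor`, and `advisory`, the relation names, and the identifiers in the sample record are made-up placeholders for illustration; the patent does not specify them.

```python
import json
import re

# Matches vulnerability identifiers of the form CVE-YYYY-NNNN in free text.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def extract_triples(record_json):
    """Turn one semi-structured JSON record into (head, relation, tail) triples."""
    record = json.loads(record_json)
    device = record["device"]
    # Field-to-relation rule: the vendor field becomes a triple for the device.
    triples = [(device, "hasVendor", record["vendor"])]
    # Regex rule: every identifier mentioned in the advisory affects the device.
    for cve in CVE_PATTERN.findall(record.get("advisory", "")):
        triples.append((cve, "affects", device))
    return triples


# Placeholder record; the identifiers below are invented for the example.
SAMPLE = json.dumps({
    "device": "PLC-A",
    "vendor": "ExampleCorp",
    "advisory": "Firmware is affected by CVE-2099-0001 and CVE-2099-0002.",
})

if __name__ == "__main__":
    for triple in extract_triples(SAMPLE):
        print(triple)
```

Keeping each rule as a small, explicit mapping from a field or pattern to a relation makes it straightforward to check the rules against the entity types and relations defined in the ontology model.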

7. The knowledge graph construction method for cyber-physical system security according to claim 1, characterized in that, in S5, the intra-article entity disambiguation and graph fusion model and the global entity disambiguation and graph fusion model use the RoBERTa pre-trained language model and a multilayer perceptron to obtain the alignment probability of each entity pair from the entities and their context information, and merge synonymous entities into unified nodes using a hierarchical clustering method based on reference chains and the cluster-average reference probability.

8. The knowledge graph construction method for cyber-physical system security according to claim 1, characterized in that, in S6, the semantic and structural bimodal embedding fusion model obtains the entity structure representation through the ComplEx embedding model and the entity semantic representation through the RoBERTa pre-trained language model using the entity, its adjacent entities, and their correspondence. The two representations are combined, and a multilayer perceptron is used to determine whether entity pairs need to be completed or pruned.
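
The merging step in claim 7 can be illustrated with a minimal average-linkage clustering sketch over pairwise alignment probabilities. This is an assumption-laden illustration and not the patented procedure: the probabilities below are invented stand-ins for the output of the RoBERTa and multilayer perceptron model, and the `threshold` parameter, the mention names, and the function names are hypothetical.

```python
from itertools import combinations


def average_link_probability(cluster_a, cluster_b, prob):
    """Mean alignment probability over all cross-cluster mention pairs."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(prob(a, b) for a, b in pairs) / len(pairs)


def cluster_mentions(mentions, prob, threshold=0.5):
    """Greedily merge the two clusters with the highest average probability
    until no pair of clusters reaches the (assumed) threshold."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (average_link_probability(clusters[i], clusters[j], prob), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Invented pairwise alignment probabilities for four mentions.
PAIR_PROB = {
    frozenset(("PLC-7", "the PLC")): 0.9,
    frozenset(("PLC-7", "controller 7")): 0.7,
    frozenset(("the PLC", "controller 7")): 0.8,
}


def lookup(a, b):
    """Symmetric lookup; unlisted pairs get a low default probability."""
    return PAIR_PROB.get(frozenset((a, b)), 0.1)


if __name__ == "__main__":
    mentions = ["PLC-7", "the PLC", "controller 7", "HMI-2"]
    print(cluster_mentions(mentions, lookup))
```

Using the cluster average rather than the single best pair means that one spuriously high pairwise probability is less likely to merge two otherwise unrelated groups of mentions.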