Software supply chain risk assessment method based on knowledge enhancement and graph neural network

By constructing knowledge graphs and graph attention network models, and combining them with business rules, the problems of data fragmentation and poor interpretability in software supply chain risk assessment are solved, enabling accurate assessment and risk tracing of software projects, and providing closed-loop security analysis support.

CN122198660A · Pending · Publication Date: 2026-06-12 · THE THIRD RES INST OF MIN OF PUBLIC SECURITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
THE THIRD RES INST OF MIN OF PUBLIC SECURITY
Filing Date
2026-04-29
Publication Date
2026-06-12


Abstract

The application relates to the technical field of network security and information security, and discloses a software supply chain risk assessment method and system based on knowledge enhancement and a graph neural network and an electronic device. In the method, a knowledge graph of a software project is constructed, and a dependency subgraph of a target software project is extracted from the knowledge graph, wherein the dependency subgraph at least includes risk features of each software package in the target software project, dependency relationships between the software packages, and association relationships between the software packages and the target software project; the dependency subgraph is input into a pre-trained graph attention network model to generate a basic risk prediction value; and the basic risk prediction value is post-processed through a preset business rule to output a final risk assessment result.

Description

Technical Field

[0001] This application relates to the fields of network security and information security technology, and in particular to a software supply chain risk assessment method, a software supply chain risk assessment system, and an electronic device based on knowledge enhancement and graph neural networks.

Background Technology

[0002] The software supply chain is a core support for the nation's critical information infrastructure and digital economy system, and its security is directly related to national security and social stability. With the acceleration of digital transformation, open-source components are being reused extensively, leading to an explosive growth in software supply chain attacks. The stealth, transmissibility, and dynamic evolution of open-source components make network risks easy to spread and difficult to control, making fine-grained risk assessment during the development phase a necessity for the industry.

[0003] Related software supply chain risk assessment technologies have significant shortcomings. Rule-based reasoning methods rely on a single data source, and manually predefined rules cannot cover risk propagation scenarios with complex dependencies; such methods lack self-learning capability, their rule updates lag behind, and their maintenance costs are high. Pure graph neural network methods are black-box models that lack business priors, which results in poor interpretability, difficulty in capturing implicit risks, and a heavy dependence on labeled data, so that evaluation results for new components are poor.

Summary of the Invention

[0004] The purpose of this application is to provide a software supply chain risk assessment method, system, and electronic device based on knowledge enhancement and graph neural networks, which enables accurate determination of the risk assessment results of target software projects.

[0005] To address the aforementioned technical problems, embodiments of this application provide a software supply chain risk assessment method based on knowledge enhancement and graph neural networks, comprising: constructing a knowledge graph of a software project; extracting a dependency subgraph of the target software project from the knowledge graph; inputting the dependency subgraph into a pre-trained graph attention network model to generate basic risk prediction values; and post-processing the basic risk prediction values using preset business rules to output a final risk assessment result; wherein the dependency subgraph at least includes the risk characteristics of each software package in the target software project, the dependency relationships between the software packages, and the association relationships between the software packages and the target software project.

[0006] This application also provides a software supply chain risk assessment system, comprising: a graph construction module for constructing a knowledge graph of a software project and extracting a dependency subgraph of the target software project from the knowledge graph, wherein the dependency subgraph includes at least the risk characteristics of each software package in the target software project, the dependency relationships between the software packages, and the association relationships between the software packages and the target software project; a model prediction module for inputting the dependency subgraph into a pre-trained graph attention network model to generate basic risk prediction values; and a business rule module for post-processing the basic risk prediction values according to preset business rules to output a final risk assessment result.

[0007] Compared to related technologies, the implementation of this application is based on constructing a software project knowledge graph and extracting a dependency subgraph of the target software project. This dependency subgraph includes at least the risk characteristics of each software package in the target software project, the dependencies between these packages, and the associations between each package and the target software project. The dependency subgraph is input into a pre-trained graph attention network model. This model can fully identify the direct and indirect dependencies of each software package in the target software project and, based on this, determine the risk characteristics of each package and the risk associations between them, thereby generating basic risk prediction values. Furthermore, the basic risk prediction values are post-processed using preset business rules to correct them, thus accurately determining the risk assessment results of the target software project.

Brief Description of the Drawings

[0008] One or more embodiments are illustrated by way of example in the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings denote similar elements. Unless otherwise stated, the figures in the drawings are not drawn to scale.

[0009] Figure 1 is a flowchart of a software supply chain risk assessment method according to some embodiments of this application; Figure 2 is another flowchart of a software supply chain risk assessment method according to some embodiments of this application; Figure 3 is a schematic diagram of the structure of a software supply chain risk assessment system according to some embodiments of this application; Figure 4 is a schematic diagram of the structure of an electronic device according to some embodiments of this application.

Detailed Implementation

[0010] To illustrate the technical solutions of the embodiments in this specification more clearly, the embodiments are described in detail below with reference to the accompanying drawings. Obviously, the content described below is only some examples or embodiments of this specification. Those skilled in the art can, without creative effort, apply the technical solutions or means disclosed in this specification to other scenarios based on this technical content.

[0011] It should be understood that the terms "system," "device," "unit," and / or "module" used in this specification are a method of distinguishing different components, elements, parts, sections, or assemblies at different levels. However, if other words can achieve the same purpose, they may be replaced by other expressions.

[0012] Unless otherwise specified, the technical terms used to describe components, elements, etc. in this specification are not singular but may include plural. Generally speaking, terms such as "comprising" or "including" only indicate that explicitly identified steps, elements, or components are included, and these steps, elements, and components do not constitute an exclusive list, as the described method or apparatus may also include other steps or components.

[0013] This specification uses flowcharts to illustrate the operational steps performed by the apparatus or system of related embodiments. However, unless otherwise specified, the order in which these steps are described should not be construed as a limitation on the order of execution. Those skilled in the art can adjust the order of these steps based on the knowledge and information conveyed by the embodiments in this specification. Adjustments include, but are not limited to, reversing the order of steps, merging multiple steps, and splitting a step.

[0014] As noted in the background section, related software supply chain risk assessment technologies mainly comprise rule-based reasoning methods and pure graph neural network methods. Both suffer from data fragmentation and biased assessments, and therefore fail to accurately determine the risk assessment results for the target software project. This application constructs a unified knowledge graph, identifies the direct and indirect dependencies of each software package in the target software project through a pre-trained graph attention network model, and determines the risk characteristics of each software package and the risk relationships between them, thereby generating basic risk prediction values. Furthermore, the basic risk prediction values are post-processed using preset business rules to correct them, thereby accurately determining the risk assessment results for the target software project.

[0015] In view of this, this application proposes a software supply chain risk assessment method based on knowledge enhancement and graph neural networks. This method constructs a knowledge graph of the software project, extracts a dependency subgraph of the target software project from the knowledge graph, inputs the dependency subgraph into a pre-trained graph attention network model to generate basic risk prediction values, and performs post-processing on the basic risk prediction values using preset business rules to output the final risk assessment result. The dependency subgraph includes at least the risk characteristics of each software package in the target software project, the dependencies between the software packages, and the association between the software packages and the target software project. This method enables accurate determination of the risk assessment result of the target software project.

[0016] Figure 1 is an exemplary flowchart of a software supply chain risk assessment method based on knowledge enhancement and graph neural networks, according to some embodiments of this application. As shown in Figure 1, in some embodiments, the software supply chain risk assessment method based on knowledge enhancement and graph neural networks may include the following steps.

[0017] Step 110: Construct a knowledge graph of the software project and extract the dependency subgraph of the target software project from the knowledge graph.

[0018] In step 110, the dependency subgraph includes at least the risk characteristics of each software package in the target software project, the dependency relationships between the software packages, and the association relationships between the software packages and the target software project.

[0019] In some embodiments, the knowledge graph in step 110 is constructed as follows: generating attribute features of multiple software packages in the software project, and generating risk features of each software package based on the attribute features; determining the dependencies between the multiple software packages and the association between each software package and the software project according to the configuration file of the software project; and generating a knowledge graph based on the risk features, dependencies, and associations.

[0020] In one example, this application extracts the attribute features of each software package within a software project. These attribute features may include core information such as the package's unique identifier (key), package name (package_name), installed version (installed_version), and sub-dependencies (dependencies). Next, the software project's configuration file is parsed to determine the dependencies between multiple software packages. For example, using the software project as the root node, the association between each software package and the software project, as well as the dependencies between each software package, are determined. Finally, using software packages and the software project as entities, risk features as entity attributes, and the dependencies between software packages and the association between software packages and the software project as relationships between entities, the knowledge graph is constructed according to a triple structure of entities, relationships, and attributes.

[0021] This application generates the attribute feature in the following manner: generating a dependency tree based on the software project's configuration file, and traversing the dependency tree to extract the hierarchical features of each software package, wherein the root node of the dependency tree is the software project and the leaf nodes are software packages; using the hierarchical features, the version number of each software package, and the association relationship as basic features; using the creation time of the software project to which each software package belongs, as well as the download volume, last update time, and community recognition of each software package as knowledge-enhancing features; and concatenating the basic features and knowledge-enhancing features to form the attribute feature.

[0022] Specifically, this application batch-obtains configuration files from publicly available software projects and uses package management tools (such as pipdeptree) to parse and generate a hierarchical dependency tree structure with the project as the root node. This dependency tree is stored in JSON (JavaScript Object Notation) format, and each node contains core information such as a unique package identifier (key), package name (package_name), version number (installed_version), and a list of sub-dependencies (dependencies). Based on the constructed dependency tree, the application extracts the basic features and knowledge-enhancing features of each software package, transforming the unstructured dependencies between software packages and between software packages and the software project into computable numerical feature vectors, which serve as the attribute features of the software packages.

[0023] For the basic features of each software package, a breadth-first search (BFS) is used to traverse the dependency tree. Starting from the project root node at depth 0, the hierarchical feature of each software package, i.e., the depth of the package within the dependency tree, is calculated layer by layer to reflect its importance in the dependency chain. The version number of each package (e.g., 1.2.3.post4) is decomposed into a list of numbers, with non-numeric parts deleted, and the result is fixed at 3 dimensions to eliminate differences between version formats. The association between each software package and the software project is determined and represented by a binary or discrete feature label. For example, the label is named "is_direct", where is_direct=1 indicates a direct association between the software package and the software project, and is_direct=0 indicates an indirect association.
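The basic-feature extraction described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the node layout (a dict with "key", "installed_version", and "dependencies", with the project as the root) follows the field names given in the text, while the function names parse_version and extract_basic_features are hypothetical.

```python
import re
from collections import deque


def parse_version(version, dims=3):
    """Decompose a version string into a fixed-length list of integers.

    Non-numeric parts are dropped; the result is truncated or zero-padded
    to `dims` entries (3 in the described embodiment).
    """
    numbers = [int(n) for n in re.findall(r"\d+", version)]
    return (numbers + [0] * dims)[:dims]


def extract_basic_features(tree):
    """Breadth-first traversal of a dependency tree rooted at the project.

    Returns {package_key: {"depth", "is_direct", "version"}}. Because BFS
    visits shallow nodes first, a package seen at several depths keeps its
    smallest depth.
    """
    features = {}
    queue = deque([(tree, 0)])  # the project root sits at depth 0
    while queue:
        node, depth = queue.popleft()
        if depth > 0 and node["key"] not in features:
            features[node["key"]] = {
                "depth": depth,
                "is_direct": 1 if depth == 1 else 0,
                "version": parse_version(node.get("installed_version", "")),
            }
        for child in node.get("dependencies", []):
            queue.append((child, depth + 1))
    return features
```

Keeping the smallest depth for a package that appears more than once is a design choice of this sketch; the source does not state how repeated packages are handled.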

[0024] The knowledge enhancement features of each software package are obtained by calling the application programming interfaces (APIs) of external open-source ecosystems, which enriches the ecosystem-level features of each package. In one example, this application calls the API to obtain the download count (download_count) of each package over the past month, the creation time of the software project (i.e., the number of days since the software project's first release, project_age), and the last update time (last_update) of the package. In addition, this application also obtains community recognition data reflecting the package's popularity and maintenance activity, including the number of stars and forks the package has received in the community. During the acquisition of these knowledge enhancement features, a request delay (0.1 s per request) and an exception handling mechanism are added to avoid process interruptions caused by API rate limits and call failures.
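A minimal sketch of the request delay and exception handling described above is given below. The source does not name the API being called, so the lookup is passed in as a hypothetical fetch callable; the function name collect_enhancement_features and the shape of the error record are likewise assumptions.

```python
import time


def collect_enhancement_features(packages, fetch, delay=0.1, sleep=time.sleep):
    """Query ecosystem metadata for each package without aborting on errors.

    `fetch(name)` is a caller-supplied (hypothetical) function that returns
    a dict such as {"download_count": ..., "project_age": ...}.
    A fixed pause of `delay` seconds follows every request.
    """
    results = {}
    for name in packages:
        try:
            results[name] = fetch(name)
        except Exception as exc:
            # Record the failure and continue with the next package.
            results[name] = {"error": str(exc)}
        sleep(delay)
    return results
```

Injecting the sleep function keeps the pause out of unit tests while leaving the default behaviour at 0.1 s per request.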

[0025] After the basic features and knowledge-enhancing features are obtained, the basic features (a 3-dimensional version number, a 1-dimensional relationship marker, and a 1-dimensional hierarchy depth, 5 dimensions in total) and the knowledge-enhancing features (download volume, creation time, last update time, and community recognition expressed as star and fork counts, 5 features in total) are concatenated to form a 10-dimensional attribute feature vector, which is then stored as a JSON file or a CSV (Comma-Separated Values) file.
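The concatenation into a 10-dimensional vector might look like the sketch below. The ordering of the components and the dictionary keys "stars" and "forks" are assumptions made for illustration; the source only lists which features are included.

```python
def build_attribute_vector(basic, enhanced):
    """Concatenate 5 basic and 5 knowledge-enhancing features.

    `basic` holds "version" (3 ints), "is_direct", and "depth";
    `enhanced` holds the ecosystem-level values. The ordering is an
    assumption of this sketch.
    """
    vector = [
        *basic["version"],           # 3 dimensions
        basic["is_direct"],          # 1 dimension
        basic["depth"],              # 1 dimension
        enhanced["download_count"],  # downloads over the past month
        enhanced["project_age"],     # days since first release
        enhanced["last_update"],     # last update time, as a number
        enhanced["stars"],           # community recognition
        enhanced["forks"],           # community recognition
    ]
    return [float(v) for v in vector]
```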

[0026] In this way, by calling the API to obtain the number of software packages downloaded, the project age, and the last update time, knowledge-enhanced features are formed to improve the risk assessment feature system. These features can quantify the popularity and maintenance activity of software packages, thereby improving the accuracy of risk assessment. Setting up request delay and exception handling mechanisms can avoid API rate limits and prevent call failures from interrupting the process, ensuring the stable execution of data collection and subsequent assessment.

[0027] It should be clarified that the 10-dimensional attribute feature vector described in this application is merely illustrative and does not limit the attribute features to being composed only of the aforementioned features, nor does it limit the feature vector dimension to only 10 dimensions. Those skilled in the art can select different features according to actual needs and adaptively expand and adjust the dimensions of the feature vector.

[0028] In this application, the risk features of the software packages are generated from the attribute features in the following manner: determining vulnerability information, and marking information indicating vulnerability severity, based on the package name and version number of the software package; generating vulnerability features based on the vulnerability information and marking information, and determining the core risk value of the software package from the vulnerability information; and generating the risk features based on the package name, version number, vulnerability features, core risk value, and attribute features.

[0029] In one example, this application calls an open-source vulnerability knowledge base API, passing in the package name and version number of a software package. It then matches and retrieves vulnerability information such as the vulnerability ID, Common Vulnerability Scoring System (CVSS) score, and number of vulnerabilities corresponding to the package. Simultaneously, it generates binary markers indicating the presence of critical vulnerabilities based on the vulnerability severity level. Based on the vulnerability and marker information, it statistically generates vulnerability features such as the total number of vulnerabilities, the number of vulnerabilities at each severity level, vulnerability density, and vulnerability quantity classification. Furthermore, it maps vulnerability severity to continuous numerical values based on the CVSS score, taking the highest value for the software package as the core risk value. Finally, it integrates the package name and version number as basic identifiers with the aforementioned vulnerability features, core risk value, and attribute features to generate complete risk features for risk assessment and knowledge graph construction.

[0030] Specifically, this application calls a vulnerability query interface, passing in the package name, version number, and ecosystem (e.g., PyPI) of the software package. It employs two query modes, exact matching and semantic version (semver) matching, to ensure accurate matching between the software package and vulnerability information and to improve vulnerability identification accuracy. An API request session with a retry policy is constructed, with a total of 5 retries and a backoff factor of 1.5. The policy handles abnormal interface status codes such as 429 (too many requests) and 500 (server error), mitigating API call timeouts and failures and keeping the query process continuous and stable. Furthermore, to eliminate matching biases caused by version format differences and further improve the vulnerability query matching rate, this application standardizes the software package version number by removing non-standard prefixes and truncating non-standard suffixes, unifying it into a standard version format (e.g., x.y.z). Thus, through vulnerability querying, version standardization, and retry assurance, vulnerability data collection and standardization are completed, thereby obtaining the original vulnerability information of the software package.
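The version standardization and retry policy described above can be sketched with the standard library alone. The names normalize_version and call_with_retry are hypothetical, the request is modelled as a caller-supplied function returning a (status, body) pair, and the exponential wait schedule is an assumption: the source gives only the retry count (5) and the backoff factor (1.5).

```python
import re
import time


def normalize_version(raw):
    """Strip prefixes and suffixes, returning 'x.y.z' or None if no digits."""
    match = re.search(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", raw)
    if match is None:
        return None
    major, minor, patch = (group or "0" for group in match.groups())
    return f"{major}.{minor}.{patch}"


def call_with_retry(request, total=5, backoff_factor=1.5,
                    retry_statuses=(429, 500), sleep=time.sleep):
    """Call `request()` and retry while it returns a retryable status.

    `request` is a hypothetical zero-argument callable returning
    (status_code, body). Up to `total` retries follow the first attempt;
    the wait before retry k (1-based) is backoff_factor * 2 ** (k - 1).
    """
    for attempt in range(total + 1):
        status, body = request()
        if status not in retry_statuses:
            return status, body
        if attempt < total:
            sleep(backoff_factor * (2 ** attempt))
    return status, body  # retries exhausted; return the last response
```

With the default arguments, the waits in this sketch would be 1.5 s, 3 s, 6 s, 12 s, and 24 s before the final failure is returned to the caller.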

[0031] Furthermore, the original vulnerability information obtained in the aforementioned steps covers the vulnerabilities themselves, their severity levels, and the number of vulnerabilities per unit time (i.e., vulnerability density). This application evaluates each vulnerability of the software package based on the CVSS scoring system and maps vulnerability severity to a quantitative risk value between 0.5 and 5.0 (no vulnerability=0.5, INFO=1.0, LOW=2.0, MEDIUM=3.0, HIGH=4.0, CRITICAL=5.0); the highest risk value within a single package is taken as its core risk value. The total number of vulnerabilities of the package is counted, together with the number of vulnerabilities at each of the CRITICAL, HIGH, MEDIUM, LOW, and INFO levels, to form the severity-level counts. The total number of vulnerabilities is also discretized into four categories: no vulnerabilities, few vulnerabilities, a medium number of vulnerabilities, and a large number of vulnerabilities. The vulnerability density is calculated based on the age of the software package in days. The marking information indicates whether the software package contains a vulnerability of the highest severity level: if a CRITICAL vulnerability exists, the marking information is assigned a value of 1; otherwise it is assigned a value of 0. After normalization, the severity-level counts, the vulnerability density, and the marking information are concatenated to obtain the vulnerability features.
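The severity mapping and vulnerability statistics can be illustrated as follows. The risk values come from the text; the category thresholds (few_max, medium_max), the density formula (count divided by package age in days), and the function name vulnerability_features are assumptions of this sketch, and normalization is omitted.

```python
SEVERITY_RISK = {"CRITICAL": 5.0, "HIGH": 4.0, "MEDIUM": 3.0, "LOW": 2.0, "INFO": 1.0}
NO_VULN_RISK = 0.5  # risk value assigned when a package has no vulnerability


def vulnerability_features(severities, package_age_days, few_max=2, medium_max=5):
    """Summarize a package's vulnerabilities from their severity labels.

    `few_max` and `medium_max` are hypothetical thresholds; the source
    names four categories without giving their boundaries.
    """
    counts = {level: 0 for level in SEVERITY_RISK}
    for severity in severities:
        counts[severity] += 1
    total = len(severities)

    if total == 0:
        category = "none"
    elif total <= few_max:
        category = "few"
    elif total <= medium_max:
        category = "medium"
    else:
        category = "many"

    return {
        "total": total,
        "counts": counts,
        # Highest per-vulnerability risk value, or 0.5 when there is none.
        "core_risk": max((SEVERITY_RISK[s] for s in severities), default=NO_VULN_RISK),
        "has_critical": 1 if counts["CRITICAL"] > 0 else 0,
        "category": category,
        # Assumed definition: vulnerabilities per day of package age.
        "density": total / max(package_age_days, 1),
    }
```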

[0032] Finally, after the package name, version number, vulnerability features, core risk value, and attribute features are converted to numerical form and normalized, a unified feature vector is generated by feature concatenation. This vector integrates the package's attribute features, core risk value, and other content into a structured record, which is then written line by line to a CSV file for persistent storage. In this CSV file, the states "query successful," "query failed or timed out," and "invalid package" are distinguished for subsequent data cleaning. Risk values or vulnerability counts that cannot be converted are assigned a value of -1, and invalid packages are marked with a risk value of -2 to ensure data integrity.
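A minimal sketch of the record writing with the sentinel values -1 and -2 is shown below. The status strings ("success", "failed_or_timeout", "invalid_package"), the column names, and the helper names are hypothetical; the source specifies only the three states and the two sentinel values.

```python
import csv
import io

FIELDS = ["package_name", "version", "status", "risk_value"]


def to_record(name, version, status, risk_value):
    """Build one CSV row, applying the sentinel values from the text."""
    if status == "invalid_package":
        risk = -2  # invalid packages are marked with -2
    else:
        try:
            risk = float(risk_value)
        except (TypeError, ValueError):
            risk = -1  # values that cannot be converted become -1
    return {"package_name": name, "version": version,
            "status": status, "risk_value": risk}


def write_records(records, handle):
    """Write records line by line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Usage with an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_records([to_record("requests", "2.31.0", "success", "4.0")], buffer)
```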

[0033] In this way, through the above process, the original vulnerability information is transformed into a computable, trainable, and interpretable structured risk label, providing standard supervised learning samples for graph neural network models, while forming multi-dimensional risk quantification indicators to support the accurate assessment, level determination, and result traceability of software supply chain risks.

[0034] After obtaining the risk characteristics, dependencies, and relationships of each software package, a knowledge graph is constructed based on these characteristics and relationships. Specifically, this includes the following steps: using the software project, multiple software packages within the project, and the vulnerabilities of each package as nodes in the knowledge graph; constructing edges between nodes based on relationships and dependencies; setting attribute values for the edges between each software package and its corresponding vulnerability based on the core risk value; and deduplicating and/or merging duplicate nodes and redundant edges to generate the knowledge graph.

[0035] The above steps aim to integrate scattered and isolated software package information, as well as the complex dependencies between software packages, into a unified and structured knowledge representation, namely a knowledge graph of the software supply chain. Constructing this knowledge graph is a core step in overcoming the "data fragmentation" problem: it provides the subsequent Graph Attention Network (GAT) model with input that represents node attributes and topological relationships at the same time, and it forms the foundation for the deep integration of rules and GAT. The knowledge graph construction in this stage follows the "entity-relationship-attribute" triple model and is implemented as follows.

[0036] First, the ontology schema of the knowledge graph is defined, clarifying entity types, relation types, and core attributes to ensure data standardization and consistency. Specifically, software projects, the packages that projects depend on, and the vulnerabilities corresponding to each package are designated as the three core node types of the knowledge graph, completing the instantiation of graph nodes. Attribute values are then set for each instantiated node entity. The Package attributes include at least the package name, version number, and the 10-dimensional feature vector generated in the aforementioned embodiment; the Vulnerability attributes include the vulnerability ID (osv_id), CVSS score (cvss_score), and risk level (severity). Then, based on the affiliation between software packages and software projects, ROOT_OF edges are constructed from software project nodes to software package nodes; based on the dependency relationships between software packages, DEPENDS_ON edges are constructed from each depending software package node to the software package node it depends on, forming the topological connection structure of the graph; and HAS_VULN edges are defined between software packages and their corresponding vulnerabilities, indicating that the software package has a known vulnerability. The core risk value generated in the aforementioned embodiment is set as the attribute value of the HAS_VULN edge between the software package node and its corresponding vulnerability node, thus binding the risk quantification information to the graph relationships.

[0037] Then, a graph data layer is constructed. Based on the ontology schema of the defined knowledge graph, this application extracts entities, relationships and attributes from existing structured data and instantiates the knowledge graph.

[0038] From the dependency tree JSON and CSV files generated in the aforementioned embodiments, all unique Package entities are extracted. Each entity is uniquely identified by its package name and version, and its corresponding multidimensional feature vector, dependency depth, and other attributes are aggregated. Additionally, based on the vulnerability query results, Vulnerability entities are extracted, using the returned vulnerability ID as a unique identifier, and their CVSS score, severity level, and other attributes are stored. Furthermore, a Project entity is created for each software project, storing its project name, repository URL, and other metadata.

[0039] Next, based on the business logic between entities, three types of core relational edges are established to define the topology of the graph: the generated dependency tree is traversed to create relational edges for direct dependencies between parent and child packages, representing dependency propagation between software packages; relational edges are created between each software project entity and all software package entities belonging to it, establishing the attribution association between the project and its complete dependency network; and the vulnerability list queried for each software package is traversed to create relational edges between the software package and each corresponding vulnerability, with the quantified risk value of the software package used as the attribute of the edge, thereby binding risk information to the relationships.

[0040] Finally, a Neo4j graph database or an in-memory graph structure is used for storage, forming a large-scale heterogeneous graph centered on Package nodes that is associated with vulnerability knowledge and carries project dependencies. During construction, duplicate entities and redundant relationships are deduplicated and merged to keep the knowledge graph concise and its data consistent.
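The in-memory variant of this storage, including deduplication and merging, can be sketched as below. The class KnowledgeGraph, its method names, and the "name@version" node identifier are hypothetical; the node types and the edge labels ROOT_OF, DEPENDS_ON, and HAS_VULN follow the schema described in the text.

```python
class KnowledgeGraph:
    """Tiny in-memory store for entity-relationship-attribute triples."""

    def __init__(self):
        self.nodes = {}  # (node_type, node_id) -> attribute dict
        self.edges = {}  # (source, relation, target) -> attribute dict

    def add_node(self, node_type, node_id, **attrs):
        """Insert a node, merging attributes if it already exists."""
        key = (node_type, node_id)
        self.nodes.setdefault(key, {}).update(attrs)
        return key

    def add_edge(self, source, relation, target, **attrs):
        """Insert an edge; a repeated (source, relation, target) is merged."""
        self.edges.setdefault((source, relation, target), {}).update(attrs)


# Usage: one project, two packages, one vulnerability (illustrative data).
graph = KnowledgeGraph()
project = graph.add_node("Project", "demo-project")
pkg_a = graph.add_node("Package", "pkg-a@1.0.0", depth=1)
pkg_b = graph.add_node("Package", "pkg-b@2.1.0", depth=2)
vuln = graph.add_node("Vulnerability", "EXAMPLE-0001", severity="HIGH")
graph.add_edge(project, "ROOT_OF", pkg_a)
graph.add_edge(pkg_a, "DEPENDS_ON", pkg_b)  # pkg-a depends on pkg-b
graph.add_edge(pkg_b, "HAS_VULN", vuln, core_risk=4.0)
```

Keying nodes and edges by their identifying tuple means that re-adding an existing entity or relationship merges attributes instead of creating a duplicate.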

[0041] After the knowledge graph is constructed, the GAT model is used to predict node risks in the software dependency graph. Based on the complete dependency relationships in the knowledge graph, the dependency chain is then traversed in reverse, starting from a high-risk software package, to automatically locate the source of risk, the complete transmission path, and the scope of impact, so that the root cause of the risk is located accurately. This forms a closed security analysis loop of risk prediction, risk tracing, and disposal decision, which provides a clear basis for vulnerability repair, dependency management, and risk blocking, and supports proactive prevention and closed-loop governance of software supply chain risks.
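Reverse tracing from a high-risk package can be sketched as a backward walk over dependency edges. In this illustration the project is treated as an ordinary root node whose outgoing edges point to its direct dependencies, which is a simplification of the ROOT_OF and DEPENDS_ON edges; the function names are hypothetical.

```python
from collections import defaultdict


def trace_risk_paths(depends_on, risky, root):
    """Return every path root -> ... -> risky, found by walking backwards.

    `depends_on` is a list of (parent, child) pairs meaning that the
    parent depends on the child.
    """
    parents = defaultdict(list)
    for parent, child in depends_on:
        parents[child].append(parent)

    paths = []
    stack = [[risky]]  # partial paths, grown from the risky package upward
    while stack:
        path = stack.pop()
        head = path[0]
        if head == root:
            paths.append(path)
            continue
        for parent in parents[head]:
            if parent not in path:  # guard against dependency cycles
                stack.append([parent] + path)
    return sorted(paths)


def impact_scope(paths, risky):
    """All nodes that reach the risky package, excluding the package itself."""
    return {node for path in paths for node in path if node != risky}
```

Each returned path is one transmission route, and the union of the nodes on those routes gives the set of components affected by the risky package.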

[0042] Step 120: Input the dependency subgraph into a pre-trained graph attention network model to generate basic risk prediction values.

[0043] In one example, the features of the software package nodes in the dependency subgraph are cleaned, normalized, and dimension-unified, transforming unstructured features into fixed-dimensional numerical feature vectors. A graph structure is constructed based on the dependencies between software packages, with software packages as nodes and dependencies as edges, adapted to the input format of the Graph Attention Network (GAT) model. The preprocessed dependency subgraph is then input into a pre-trained GAT model. The model assigns differentiated weights to different neighboring nodes through an attention mechanism, aggregating node features with topological dependency features, automatically learning the complex and non-linear risk propagation patterns in the software supply chain, and outputting basic risk prediction values for each software package (including regression-based continuous risk scores and binary critical vulnerability classification results).
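To make the attention mechanism concrete, the sketch below computes one single-head graph attention layer in plain Python, following the standard GAT formulation (a shared linear map W, an attention vector a applied to concatenated node pairs, LeakyReLU, and a softmax over each node's neighborhood). It is an illustration only and not the trained model of this application: the weights are toy values supplied by the caller, and the function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def leaky_relu(value, slope=0.2):
    return value if value > 0 else slope * value


def gat_layer(features, neighbors, W, a):
    """One attention head.

    features:  {node: feature vector}
    neighbors: {node: list of neighbor nodes, usually including itself}
    W:         shared weight matrix; a: attention vector of length 2 * out_dim
    Returns (new_features, attention), where attention[i][j] is the weight
    node i assigns to neighbor j.
    """
    z = {node: matvec(W, h) for node, h in features.items()}
    new_features, attention = {}, {}
    for i, nbrs in neighbors.items():
        # Unnormalized score for each neighbor: LeakyReLU(a . [z_i || z_j]).
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
                  for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[i] = dict(zip(nbrs, alphas))
        # Weighted sum of the transformed neighbor features.
        new_features[i] = [sum(al * z[j][k] for al, j in zip(alphas, nbrs))
                           for k in range(len(z[i]))]
    return new_features, attention
```

The attention weights of each node sum to 1, so a neighbor with a higher score contributes proportionally more to the aggregated feature of that node.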

[0044] Specifically, the CSV file data obtained in the aforementioned embodiments is used to construct the dataset, which is preprocessed as follows. First, invalid data (failed status, abnormal risk values, unknown package name/version) is filtered out and valid samples are retained. The generated 10-dimensional feature vectors are forced to a uniform dimension (overlong vectors are truncated and short vectors are zero-padded). Numerical features (such as the number of vulnerabilities and download volume) are processed with median imputation followed by standardization (StandardScaler), and categorical features (such as vuln_count_category) are processed with mode imputation followed by one-hot encoding (OneHotEncoder). Finally, the processed data is divided into training and test sets in an 8:2 ratio, with stratified sampling used for the classification task to ensure a balanced sample distribution.
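Two of the preprocessing steps above, dimension unification and median imputation followed by standardization, can be sketched with the standard library alone. This is a simplified stand-in for the scikit-learn transformers named in the text, not the applicant's pipeline; the function names and the sample column are hypothetical, and missing values are assumed to be represented as `None`.

```python
import statistics


def fix_dimension(vector, dim=10):
    """Truncate long vectors and zero-pad short ones to exactly `dim` entries."""
    return list(vector[:dim]) + [0.0] * max(0, dim - len(vector))


def impute_and_standardize(column):
    """Fill missing values (None) with the median, then scale to zero mean, unit variance."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in column]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)  # population standard deviation, as StandardScaler uses
    if std == 0:
        return [0.0 for _ in filled]  # constant column: nothing to scale
    return [(v - mean) / std for v in filled]


vuln_counts = [0, 2, None, 4, 10]   # one missing vulnerability count
scaled = impute_and_standardize(vuln_counts)
short_vector = fix_dimension([1.0, 2.0])
long_vector = fix_dimension([1.0] * 12)
```

The stratified 8:2 split and the one-hot encoding of categorical features are not covered by this sketch.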

[0045] Next, a graph structure (PackageGraphDataset) is constructed with software packages as nodes and dependencies as edges. Each node corresponds to a dependency package, and the node features are the preprocessed numerical feature vectors. The connection rule of the dependency graph is simplified (each node is connected to the 5 nodes before and after it) to simulate the dependencies between software packages. The graph data (node features x, edge index edge_index, label y) is encapsulated using the Data class of PyTorch Geometric to match the input format of the GNN model.
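The simplified connection rule can be sketched as a sliding window over node indices. The sketch assumes that "5 nodes before and after" means up to five predecessors and up to five successors in sample order; the function name `window_edge_index` is hypothetical, and plain lists are used in place of tensors so the example runs without extra dependencies.

```python
def window_edge_index(num_nodes, window=5):
    """Connect node i to nodes i-window .. i+window, excluding itself.

    Returns [sources, targets], the two-row layout used for edge_index.
    """
    sources, targets = [], []
    for i in range(num_nodes):
        for j in range(max(0, i - window), min(num_nodes, i + window + 1)):
            if i != j:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


edge_index = window_edge_index(num_nodes=8, window=5)
```

In a PyTorch Geometric setting, this list would typically be converted to a long tensor and passed to `Data(x=..., edge_index=..., y=...)` together with the node feature matrix and labels.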

[0046] The pre-trained graph attention network model is a GAT model implemented with PyTorch Geometric. Its network structure is customized for the regression and classification tasks in software supply chain risk assessment, as follows. The input layer takes the unified feature dimension produced by preprocessing, composed of the unified and regularized 10-dimensional basic feature vector, the vulnerability quantity features, and the encoded categorical features. These multi-source fused structured features serve as the model input and provide a standardized data foundation for subsequent risk feature learning. The hidden part consists of two GAT layers. The first GAT layer uses four attention heads, with an output dimension of hidden_channels*4; concatenating the attention heads strengthens the extraction of topological dependency features and ecosystem features of package nodes. The second GAT layer outputs hidden_channels // 2 dimensions, refining the high-order fused features to suit the output layer. The output layer handles both the risk regression task and the classification task: the regression task outputs a 1-dimensional result that predicts the continuous risk value of the software package, and the classification task outputs a 2-dimensional result that determines whether the software package contains serious vulnerabilities, achieving the dual goals of risk quantification and risk type determination. To avoid overfitting and improve generalization, this application adds a Dropout layer (dropout rate of 0.3) to the network structure of the GAT model and uses the Adam optimizer with a weight decay parameter of 1e-5. Random neuron dropout and weight constraints help ensure stable training and accurate evaluation.
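The layer widths stated above can be traced with a small configuration object. This is only a bookkeeping sketch, not a trainable model: the class `GATConfig`, the layer names, and the example values `in_channels=16` and `hidden_channels=64` are assumptions, since the text gives neither value, and whether the two outputs are separate heads of one network or separate models is not specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GATConfig:
    in_channels: int            # width of the preprocessed feature vector (assumed)
    hidden_channels: int        # not specified in the text; chosen for illustration
    heads: int = 4              # attention heads in the first GAT layer
    dropout: float = 0.3        # dropout rate stated in the text
    weight_decay: float = 1e-5  # weight decay passed to the Adam optimizer

    def layer_shapes(self):
        """Return (layer name, input width, output width) for each layer."""
        first_out = self.hidden_channels * self.heads  # head outputs are concatenated
        second_out = self.hidden_channels // 2
        return [
            ("gat1", self.in_channels, first_out),
            ("gat2", first_out, second_out),
            ("regression_head", second_out, 1),
            ("classification_head", second_out, 2),
        ]


config = GATConfig(in_channels=16, hidden_channels=64)
for name, n_in, n_out in config.layer_shapes():
    print(f"{name}: {n_in} -> {n_out}")
```

Tracing the widths this way makes it easy to check that the second layer's input matches the concatenated output of the four heads before building the actual network.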

[0047] After the GAT model structure has been built and the data encapsulated, the model training and evaluation phase begins. This phase quantitatively verifies the model's risk assessment performance, monitors training convergence, and persists the model and the preprocessing workflow. For the risk value regression and critical vulnerability classification tasks defined in this application, matching evaluation metrics are used. The regression task, which predicts continuous risk values of software packages, uses three metrics, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE), to quantify the error between the predicted and actual risk values and thus measure numerical prediction accuracy. The classification task, a binary classification of critical vulnerabilities, uses four metrics, Accuracy, Precision, Recall, and F1 score, to evaluate how accurately the model identifies critical vulnerabilities, how well it avoids missed detections, and its overall classification performance. During training, loss curves and evaluation metric curves are plotted for both the training and test sets. The trends in these curves show whether the model has converged and whether it is overfitting or underfitting, which validates the training effectiveness and generalization ability. After training, the GAT model weight file and the complete data preprocessing pipeline are saved together, so that both can be loaded directly for risk inference on new samples with data processing logic identical to that of the training phase. This guarantees the stability and consistency of risk assessment results and supports subsequent business rule integration and deployment in real-world scenarios.
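The seven metrics named above follow their standard definitions, which the sketch below spells out in plain Python. The function names and the toy labels are illustrative assumptions; label 1 is taken to mean that a package contains a critical vulnerability.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE and MAE between actual and predicted risk values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
clf = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

Recall is the metric that reflects missed detections: every critical package predicted as non-critical lowers it.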

[0048] Step 130: Post-process the basic risk prediction values using preset business rules to output the final risk assessment result.

[0049] On top of the trained GAT model, a business rule module integrating four types of business rules is adopted, using hierarchical rule configuration, switch-based control, and non-intrusive post-processing. The rules are executed in a fixed order: anomaly correction, feature weighting, scenario adaptation, and hard threshold. Each type of rule is enabled or disabled through its own independent global switch. The GAT model performs the core function of software supply chain risk assessment and prediction. The business rule module only corrects, adjusts, or ultimately overrides the basic risk prediction values output by the GAT model. This retains the model's ability to learn topological features while giving the risk assessment results business interpretability and scenario adaptability.

[0050] The overall technical architecture of the business rules module follows four core design principles. First, rules and model are decoupled: the GAT model independently completes risk value prediction, while the business rules module runs independently as a post-processing module that only adjusts, overrides, and re-judges the basic risk prediction values output by the GAT model, without intruding on the core logic of model training and inference. Second, configuration-based management: all rule thresholds, weights, bonuses, and deductions are defined as key-value pairs that support dynamic modification and updates, so rules can be adjusted without retraining the GAT model. Third, layered execution logic: rules are executed strictly in the order of anomaly correction, feature weighting, scenario adaptation, and hard threshold. Data anomaly correction is completed first, followed by fine-tuning of risk values, and finally the industry red-line rules override the results, ensuring the rationality and rigor of the evaluation logic. Fourth, global switch control: an independent Boolean switch is set for each type of business rule, so that a single type of rule can be quickly enabled or disabled and the rule combination adapted to different application scenarios.
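The four principles can be pictured as a small post-processing pipeline driven by a key-value configuration. Everything concrete in this sketch is hypothetical: the configuration keys, the numeric values, the package fields, and the one-line rule bodies are placeholders chosen for illustration; only the fixed execution order, the per-rule Boolean switches, and the override behavior of the hard threshold come from the text.

```python
RULE_CONFIG = {
    "anomaly_correction": {"enabled": True, "penalty": 0.25},
    "feature_weighting": {"enabled": True, "stale_bonus": 0.5},
    "scenario_adaptation": {"enabled": False, "finance_bonus": 1.0},
    "hard_threshold": {"enabled": True, "override_score": 10.0},
}


def anomaly_correction(score, pkg, cfg):
    return score + cfg["penalty"] if pkg.get("has_missing_features") else score


def feature_weighting(score, pkg, cfg):
    return score + cfg["stale_bonus"] if pkg.get("stale") else score


def scenario_adaptation(score, pkg, cfg):
    return score + cfg["finance_bonus"] if pkg.get("scenario") == "finance" else score


def hard_threshold(score, pkg, cfg):
    # A red-line rule overrides whatever the earlier layers produced.
    return cfg["override_score"] if pkg.get("has_critical_vulnerability") else score


# Fixed execution order; each rule is looked up in the configuration by name.
PIPELINE = [
    ("anomaly_correction", anomaly_correction),
    ("feature_weighting", feature_weighting),
    ("scenario_adaptation", scenario_adaptation),
    ("hard_threshold", hard_threshold),
]


def apply_rules(base_score, pkg, config):
    """Post-process a model score without touching the model itself."""
    score = base_score
    for name, rule in PIPELINE:
        if config[name]["enabled"]:  # independent switch per rule type
            score = rule(score, pkg, config[name])
    return score


package = {"has_missing_features": True, "stale": True, "scenario": "finance"}
final_score = apply_rules(4.0, package, RULE_CONFIG)
```

Because the rules read only the model's output score and a configuration dictionary, thresholds and switches can be changed at run time without retraining or modifying the model.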

[0051] Referring to Figure 2, step 115 is executed before step 120. In step 115, missing or extreme values in the attribute features of the nodes are corrected according to the anomaly correction rules, and a preset risk score is added to the basic risk prediction value. In step 130, post-processing the basic risk prediction value according to the preset business rules includes correcting the basic risk prediction value sequentially according to the feature weighting rules, the scenario adaptation rules, and the hard threshold rules.

[0052] In some embodiments, correcting the basic risk prediction value sequentially according to the feature weighting rules, scenario adaptation rules, and hard threshold rules includes the following. Step 1301: Apply the feature weighting rules. The basic risk prediction value processed by the anomaly correction rules is adjusted by adding or subtracting points based on the download volume and update time of each software package and the number of software packages in the software project.

[0053] Step 1302: Apply the scenario adaptation rules. Based on the business scenario corresponding to the software project, the basic risk prediction value processed by the feature weighting rules is adjusted a second time.

[0054] Step 1303: Apply the hard threshold rule. If the basic risk prediction value meets a preset threshold condition, a direct judgment result is output that overrides all previous adjustments and serves as the final risk assessment result.

[0055] The anomaly correction rules, as a pre-execution step, perform final anomaly handling on the feature data input to the GAT model, such as download volume and star count. Missing features are filled with the median values of similar software, and extreme values are truncated at the 99th percentile. At the same time, a small risk score with a fixed weight is added for each missing feature and each truncated extreme value, quantifying the potential risk introduced by data anomalies and ensuring the validity of the data used in subsequent calculations. The feature weighting rules start from the original basic risk prediction values of the GAT model and conditionally add or subtract points along three core dimensions: timeliness, community activity, and dependency scale. Each dimension has a score cap so that no single feature interferes excessively. Specifically, timeliness is calculated from the last update time of the dependencies; activity is assessed by the number of stars and forks; and dependency scale is scored cumulatively from the total number of dependencies (e.g., 0.3 points for every 50 additional dependencies), in line with industry consensus on software supply chain risk assessment. The scenario adaptation rules customize risk adjustments for typical business scenarios such as finance, open source, and enterprise. In the finance scenario, high-risk dependencies and newly launched dependencies receive increased risk weights; in the open source scenario, highly active projects receive reduced risk scores; and in the enterprise scenario, self-developed libraries receive deductions and unregistered third-party libraries receive additions, meeting the differentiated assessment needs of different scenarios. The hard threshold rules serve as a safety net for industry red lines. A high-risk assessment takes effect immediately if any one threshold condition is met (e.g., percentage of medium-to-high-risk dependencies ≥ threshold, highest risk level of a single dependency ≥ threshold, dependency not updated for more than 2 years, project star count < unpopularity threshold). A low-risk assessment requires all joint threshold conditions to be met simultaneously (e.g., project age ≥ threshold, download volume ≥ threshold, number of stars ≥ threshold). These rules directly override the results of all preceding rule adjustments, ensuring that risk level assessments comply with industry security standards.
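Two of the rules in this paragraph lend themselves to a short sketch: the capped dependency-scale score and the any/all logic of the hard threshold. The sketch rests on several assumptions that the text does not settle: points are counted per full block of 50 dependencies, the cap of 1.5 and all threshold values are invented for illustration, the "not updated for more than 2 years" condition is simplified to a single project-level field, and high-risk conditions are checked before low-risk ones.

```python
def dependency_scale_bonus(total_dependencies, step=50, points_per_step=0.3, cap=1.5):
    """Add points_per_step for every full `step` dependencies, up to `cap`."""
    return min(cap, (total_dependencies // step) * points_per_step)


def hard_threshold_verdict(project, thresholds):
    """Return "high", "low", or None when no red-line rule fires."""
    high_conditions = [
        project["medium_high_ratio"] >= thresholds["medium_high_ratio"],
        project["max_dependency_risk"] >= thresholds["max_dependency_risk"],
        project["years_since_update"] > 2,
        project["stars"] < thresholds["unpopular_stars"],
    ]
    if any(high_conditions):  # a single red line is enough
        return "high"
    low_conditions = [
        project["age_years"] >= thresholds["min_age_years"],
        project["downloads"] >= thresholds["min_downloads"],
        project["stars"] >= thresholds["min_stars"],
    ]
    if all(low_conditions):   # every joint condition must hold
        return "low"
    return None               # keep the score from the earlier rule layers


# Hypothetical threshold values, stored as key-value pairs.
thresholds = {
    "medium_high_ratio": 0.3, "max_dependency_risk": 9.0, "unpopular_stars": 50,
    "min_age_years": 3, "min_downloads": 100_000, "min_stars": 1000,
}
mature = {
    "medium_high_ratio": 0.05, "max_dependency_risk": 3.0, "years_since_update": 0.5,
    "stars": 5000, "age_years": 6, "downloads": 2_000_000,
}
verdict = hard_threshold_verdict(mature, thresholds)
```

Returning `None` when neither branch fires lets the caller fall back to the cumulatively adjusted score, which matches the role of the hard threshold as a final override rather than a replacement for the earlier layers.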

[0056] The business rules module and the GAT model adopt a decoupled post-processing fusion architecture. The core execution logic is that the model makes the basic prediction and the rules correct the result. The model structure and the training and inference logic remain unchanged throughout, and the fusion is completed only through interaction at the data layer. Specifically, it consists of three steps. First, the GAT model generates the basic risk prediction values from the preprocessed feature vectors and the dependency subgraph, and these values serve as the input data of the rules module. Second, in the hierarchical order of anomaly correction, feature weighting, scenario adaptation, and hard threshold, the basic risk prediction values are sequentially corrected for anomalies, weighted by dimension, customized for the scenario, and subjected to red-line fallback judgment. Finally, the first three layers of rules adjust the risk values cumulatively, and the hard threshold rules make the overall judgment, outputting the final risk assessment result after rule fusion. All rules support independent on/off control and dynamic parameter configuration without modifying the model structure or retraining. This retains the high-precision risk assessment advantages of graph neural networks while adapting to the needs of actual business scenarios, achieving a balance between assessment accuracy and business interpretability.

[0057] This application constructs a knowledge graph integrating multi-source heterogeneous data through multiple implementations, unifying the representation of multi-dimensional security elements such as vulnerabilities, components, and developers. This makes the input information for risk assessment more comprehensive and richer, effectively reducing risk omissions caused by single data sources. The rule-guided GAT collaborative assessment model proposed in this application combines the interpretability of business rules with the powerful topology learning capabilities of GAT. The rule layer provides professional judgment benchmarks consistent with security experts, while the GAT layer captures complex and non-linear risk propagation patterns. The dynamic fusion mechanism makes the model biased towards interpretable rules in simple scenarios and towards powerful data-driven predictions in complex scenarios, thus achieving a balance between high accuracy and high interpretability overall. In summary, this application provides a complete software supply chain risk assessment framework, from data integration, knowledge graph, and graph attention network modeling to result visualization. It can directly transform technological achievements into practically applicable systems, providing development and security operations teams with closed-loop decision support from risk discovery, analysis, to handling. This effectively solves core problems in existing technologies such as data fragmentation, biased assessment, and the difficulty in balancing interpretability and accuracy.

[0058] The division of the above methods into steps is only for clarity of description. In practice, steps can be combined into one step, or a step can be split into multiple steps; as long as they include the same logical relationship, they are all within the scope of protection of this application. Adding insignificant modifications or introducing insignificant designs to the algorithm or process, without changing its core design, also falls within the scope of protection of this application.

[0059] This application also provides a software supply chain risk assessment system. As shown in Figure 3, it includes: a graph construction module, used to construct a knowledge graph of the software project and extract a dependency subgraph of the target software project from the knowledge graph, wherein the dependency subgraph includes at least the risk characteristics of each software package in the target software project, the dependency relationships between the software packages, and the association relationships between the software packages and the target software project; a model prediction module, used to input the dependency subgraph into a pre-trained graph attention network model to generate basic risk prediction values; and a business rule module, used to post-process the basic risk prediction values according to preset business rules to output the final risk assessment result.

[0060] It is worth mentioning that all modules involved in this embodiment are logical modules. In practical applications, a logical unit can be a physical unit, a part of a physical unit, or a combination of multiple physical units. Furthermore, to highlight the innovative aspects of this application, this embodiment does not introduce units that are not closely related to solving the technical problems proposed in this application; however, this does not mean that other units are absent in this embodiment.

[0061] This application also provides an electronic device. As shown in Figure 4, it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the above-described software supply chain risk assessment method based on knowledge enhancement and graph neural networks.

[0062] The memory and processor are connected via a bus, which can include any number of interconnecting buses and bridges, connecting various circuits of one or more processors and memories. The bus can also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and will not be described further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver can be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over the wireless medium via an antenna, which further receives data and transmits it to the processor.

[0063] The processor manages the bus and general processing, and also provides various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. Memory is used to store data used by the processor during operation.

[0064] Those skilled in the art will understand that the above embodiments are specific implementations of this application, and in practical applications, various changes can be made in form and detail without departing from the spirit and scope of this application.

Claims

1. A software supply chain risk assessment method based on knowledge enhancement and graph neural networks, characterized in that it comprises: constructing a knowledge graph of a software project, and extracting a dependency subgraph of a target software project from the knowledge graph; inputting the dependency subgraph into a pre-trained graph attention network model to generate a basic risk prediction value; and post-processing the basic risk prediction value using preset business rules to output a final risk assessment result; wherein the dependency subgraph includes at least the risk characteristics of each software package in the target software project, the dependency relationships between the software packages, and the association relationships between the software packages and the target software project.

2. The software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to claim 1, characterized in that, The knowledge graph is constructed in the following manner: Generate attribute characteristics of multiple software packages in the software project, and generate risk characteristics of each of the multiple software packages based on the attribute characteristics; Based on the configuration file of the software project, determine the dependencies between the multiple software packages, as well as the association between each software package and the software project; The knowledge graph is generated based on the risk characteristics, dependencies, and associations.

3. The software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to claim 2, characterized in that, The attribute characteristics of generating multiple software packages in the software project include: A dependency tree is generated based on the configuration file of the software project, and the dependency tree is traversed to extract the hierarchical features of each software package, wherein the root node of the dependency tree is the software project, and the leaf nodes are the software packages; The hierarchical features, the version number of each software package, and the association relationship are used as basic features; The creation time of the software project to which each package belongs, as well as the download volume, last update time, and community recognition of each package, are used as knowledge enhancement features. The basic features and the knowledge-enhancing features are concatenated to form the attribute features.

4. The software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to claim 3, characterized in that, The generation of risk features for each of the multiple software packages based on the attribute features includes: Based on the package name and version number of the software package, determine the vulnerability information of the software package and the marking information used to indicate the severity of the vulnerability; Vulnerability features are generated based on the vulnerability information and the tagging information, and the vulnerability information is set as the core risk value of the software package. The risk feature is generated based on the package name, the version number, the vulnerability characteristics, the core risk value, and the attribute characteristics.

5. The software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to claim 4, characterized in that, The vulnerability information includes the vulnerability, the vulnerability level, and the number of vulnerabilities per unit time; the tagging information is used to indicate whether the software package contains the highest severity level vulnerability; The step of generating vulnerability features based on the vulnerability information and the tagging information, and setting the vulnerability features as the core risk value of the software package based on the vulnerability information, includes: After normalizing the vulnerability quantity level, the number of vulnerabilities per unit time, and the marking information, the vulnerability features are obtained by splicing them together. Determine the risk value of each vulnerability in the software package, and use the highest risk value as the core risk value.

6. The software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to claim 5, characterized in that, The step of generating the knowledge graph based on the risk characteristics, the dependencies, and the associations includes: The software project, multiple software packages of the software project, and vulnerabilities of each software package are used as nodes in the knowledge graph; Construct the associated edges between the nodes based on the association and dependency relationships; Based on the core risk value, an attribute value is set for the edge between each software package and its corresponding vulnerability; Duplicate nodes and redundant associated edges are deduplicated and / or merged to generate the knowledge graph.

7. The software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to claim 1, characterized in that, The preset business rules include feature weighting rules, scenario adaptation rules, and hard threshold rules; The post-processing of the basic risk prediction value using preset business rules includes: The basic risk prediction value is corrected sequentially according to the feature weighting rule, the scenario adaptation rule, and the hard threshold rule.

8. The software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to claim 7, characterized in that, The step of sequentially correcting the basic risk prediction value according to the feature weighting rule, the scenario adaptation rule, and the hard threshold rule includes: The feature weighting rules are applied to adjust the scores of the basic risk prediction value after anomaly correction rule processing based on the download volume, update time, and number of each software package in the software project. Applying the scenario adaptation rules, the basic risk prediction value after processing by the feature weighting rules is adjusted a second time according to the business scenario corresponding to the software project; Applying the hard threshold rule, if the basic risk prediction value meets the preset threshold condition, a direct judgment result is output to cover all previous adjustment values, which is then used as the final risk assessment result.

9. A software supply chain risk assessment system, characterized in that it is used to perform the software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to any one of claims 1 to 8, and comprises: a knowledge graph construction module, used to construct a knowledge graph of a software project and extract a dependency subgraph of the target software project from the knowledge graph, wherein the dependency subgraph includes at least the risk characteristics of each software package in the target software project, the dependency relationships between the software packages, and the association relationships between the software packages and the target software project; a model prediction module, used to input the dependency subgraph into a pre-trained graph attention network model to generate a basic risk prediction value; and a business rules module, used to post-process the basic risk prediction value according to preset business rules to output the final risk assessment result.

10. An electronic device, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the software supply chain risk assessment method based on knowledge enhancement and graph neural networks according to any one of claims 1 to 8.