Open source software graph updating method and device based on local subgraph incremental fusion
By using a local subgraph incremental fusion method and a graph neural network model to perform embedding representation learning and similarity calculation on local subgraphs, the problem of low update efficiency of knowledge graph in open-source software supply chain is solved, and efficient and high-precision global knowledge graph update is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- INST OF SOFTWARE - CHINESE ACAD OF SCI
- Filing Date
- 2026-01-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing open-source software supply chain knowledge graphs struggle to meet the timeliness requirements of information when faced with continuously evolving and asynchronously updated data, resulting in low update efficiency.
A local subgraph-based incremental fusion method is adopted. The embedding representation of entities in the local subgraph is learned through a graph neural network model, the vector similarity between entities is calculated, and the local subgraph is updated based on this. Finally, the updated local subgraph is written into the global knowledge graph.
It improves the efficiency and accuracy of knowledge graph updates, reduces computational load, and achieves high-precision global knowledge graph updates.
Figure CN122240146A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to an open-source software graph update method and apparatus based on local subgraph incremental fusion.
Background Technology
[0002] With the widespread adoption of open-source software in the software supply chain, information on software, developers, organizations, and vulnerabilities is increasingly becoming decentralized and distributed across ecosystems. Constructing an open-source software supply chain knowledge graph can integrate software, dependencies, developers, organizations, and security information from different ecosystems, forming a panoramic view of the supply chain and providing fundamental data support for cross-ecosystem software analysis, risk management, and development support.
[0003] Because data in the open-source software supply chain is continuously evolving, distributed across multiple sources, and updated asynchronously, the entire open-source software supply chain knowledge graph needs to be re-computed whenever new software is released, a version is updated, or a vulnerability is disclosed, making it difficult to meet the requirements for information timeliness. Therefore, improving the update efficiency of the knowledge graph is a key technical problem that urgently needs to be solved.
Summary of the Invention
[0004] This invention provides an open-source software graph update method and apparatus based on incremental fusion of local subgraphs, which can improve the update efficiency of knowledge graphs.
[0005] This invention provides an open-source software graph update method based on incremental fusion of local subgraphs, comprising the following steps: Obtain the data to be updated from the global knowledge graph, which is constructed based on entities and relationships extracted from open-source software supply chain data; Based on the entities to be updated in the data to be updated, a local subgraph is determined from the global knowledge graph; Based on the graph neural network model, embedding representation learning is performed on each entity in the local subgraph and the entity to be updated to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph. Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph, the local subgraph is updated, and the global knowledge graph is updated based on the updated local subgraph to obtain the updated global knowledge graph.
[0006] According to an open-source software graph update method based on incremental fusion of local subgraphs provided by the present invention, the process of determining the embedding vector of the current entity in the local subgraph includes: Obtain the initial feature vector of the current entity in the local subgraph; Based on the graph attention mechanism in the graph neural network model, the attention coefficient between the current entity and its neighboring entities is calculated in the topology of the local subgraph. Based on the attention coefficient, the feature information of the neighboring entities is aggregated to update the initial feature vector of the current entity, resulting in an aggregated feature vector; The aggregated feature vector is iteratively updated until a preset iteration threshold is reached, and the iterated aggregated feature vector is used as the embedding vector of the current entity.
[0007] According to an open-source software knowledge graph update method based on incremental fusion of local subgraphs provided by the present invention, the step of determining local subgraphs from the global knowledge graph based on the entities to be updated in the data to be updated includes: Based on the cosine similarity between the entity to be updated and each entity in the global knowledge graph, multiple similar entities are determined from the global knowledge graph; The local subgraph is constructed based on the target entity and the edges connecting the target entity; the target entity includes the entity to be updated, the similar entity, the entity that has a second-degree relationship with the entity to be updated, and the entity that has a second-degree relationship with the similar entity.
[0008] According to an open-source software graph update method based on incremental fusion of local subgraphs provided by the present invention, the method for updating the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph includes: Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph, the candidate entity with the highest similarity to the entity to be updated is determined from the entities in the local subgraph. If the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is greater than a similarity threshold, the entity to be updated and the candidate entity are merged. If the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is less than or equal to the similarity threshold, the entity to be updated is added as a new entity to the local subgraph.
[0009] According to an open-source software graph update method based on incremental fusion of local subgraphs provided by the present invention, the step of merging the entity to be updated with the candidate entity includes: Create a merged entity in the local subgraph, and link the candidate entity and the entity to be updated to the merged entity; Based on a preset conflict resolution strategy, the attributes of the entity to be updated and the candidate entity are merged, and the merged attributes are assigned to the merged entity; Migrate all relationships of the entity to be updated and of the candidate entity to the merged entity.
[0010] According to the present invention, an open-source software graph update method based on incremental fusion of local subgraphs is provided, wherein the fusion of attributes of the entity to be updated and the candidate entities based on a preset conflict resolution strategy includes at least one of the following: For attributes of the text description type, the description text of the entity to be updated is concatenated with the description text of the candidate entity; For attributes of the category label type, the union of the label set of the entity to be updated and the label set of the candidate entity is taken.
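The merge step and the two conflict resolution rules above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Entity` class, the `merged_into` link name, the skipping of duplicate description text, and the dropping of self-loops after migration are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Hypothetical entity record: one text-description attribute, one label set.
    entity_id: str
    description: str = ""
    labels: set = field(default_factory=set)


def merge_entities(pending, candidate, edges, merged_id):
    """Merge `pending` (entity to be updated) with `candidate`.

    `edges` is a list of (source_id, relation, target_id) triples.
    Returns (merged_entity, migrated_edges, links_to_merged_entity).
    """
    # Text description attributes: concatenate, skipping empty or repeated text.
    parts = [candidate.description, pending.description]
    description = " ".join(dict.fromkeys(p for p in parts if p))
    # Category label attributes: take the union of both label sets.
    merged = Entity(merged_id, description, candidate.labels | pending.labels)

    old_ids = {pending.entity_id, candidate.entity_id}
    # Link both original entities to the merged entity.
    links = [(old, "merged_into", merged_id) for old in sorted(old_ids)]

    # Migrate every relationship of either original entity to the merged entity.
    migrated = set()
    for src, rel, dst in edges:
        src = merged_id if src in old_ids else src
        dst = merged_id if dst in old_ids else dst
        if src != dst:  # assumption: an edge between the two originals is dropped
            migrated.add((src, rel, dst))
    return merged, sorted(migrated), links
```

Returning new objects rather than mutating the inputs keeps the original entities available for the update log and for rollback.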
[0011] This invention also provides an open-source software graph update device based on incremental fusion of local subgraphs, comprising the following modules: The data to be updated module is used to acquire the data to be updated in the global knowledge graph, which is constructed based on entities and relationships extracted from open-source software supply chain data. The local subgraph determination module is used to determine local subgraphs from the global knowledge graph based on the entities to be updated in the data to be updated; The embedding vector determination module is used to perform embedding representation learning on each entity in the local subgraph and the entity to be updated based on the graph neural network model, and to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph. The update module is used to update the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph, and update the global knowledge graph based on the updated local subgraph to obtain the updated global knowledge graph.
[0012] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the open-source software graph update method based on local subgraph incremental fusion as described above.
[0013] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the open-source software graph update method based on local subgraph incremental fusion as described above.
[0014] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the open-source software graph update method based on local subgraph incremental fusion as described above.
[0015] This invention provides an open-source software knowledge graph update method and apparatus based on incremental fusion of local subgraphs. Instead of directly manipulating the global knowledge graph after acquiring the data to be updated, it determines a small local subgraph from the global knowledge graph based on the entity to be updated, significantly reducing the computational load of subsequent update operations. An embedding representation learning process is performed on each entity in the local subgraph and the entity to be updated using a graph neural network model to calculate the vector similarity between the entity to be updated and each entity in the local subgraph. Based on the calculated similarity of the embedding vectors, a high-precision update process for the local subgraph is achieved, and the updated local subgraph is further written into the global knowledge graph, realizing a high-precision update process for the global knowledge graph. This high-precision update process based on local subgraphs not only improves the update efficiency of the global knowledge graph but also enhances the update accuracy.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0017] Figure 1 is a flowchart illustrating the open-source software graph update method based on incremental fusion of local subgraphs provided by this invention.
[0018] Figure 2 is a schematic diagram of the global knowledge graph construction process provided by the present invention.
[0019] Figure 3 is a schematic diagram of the global knowledge graph update process provided by the present invention.
[0020] Figure 4 is a flowchart illustrating the open-source software graph update method based on incremental fusion of local subgraphs provided by this invention.
[0021] Figure 5 is a schematic diagram of the structure of the open-source software graph update device based on local subgraph incremental fusion provided by the present invention.
[0022] Figure 6 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0024] Figure 1 is a flowchart illustrating the open-source software graph update method based on incremental fusion of local subgraphs provided by this invention. As shown in Figure 1, the method includes the following steps: Step 110: Obtain the data to be updated in the global knowledge graph, which is constructed based on entities and relationships extracted from open-source software supply chain data; Step 120: Based on the entities to be updated in the data to be updated, determine a local subgraph from the global knowledge graph; Step 130: Based on the graph neural network model, perform embedding representation learning on each entity in the local subgraph and the entity to be updated to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph; Step 140: Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph, update the local subgraph, and update the global knowledge graph based on the updated local subgraph to obtain the updated global knowledge graph.
[0025] The execution subject of the open-source software graph update method based on local subgraph incremental fusion provided by this invention can be an electronic device, a component in an electronic device, an integrated circuit, or a chip. The electronic device can be a mobile electronic device or a non-mobile electronic device. For example, a mobile electronic device can be a mobile phone, tablet computer, laptop computer, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), while a non-mobile electronic device can be a server, network attached storage (NAS), or personal computer (PC). This invention does not impose specific limitations in this regard.
[0026] The following example, using a computer executing the open-source software graph update method based on local subgraph incremental fusion provided by this invention, illustrates the technical solution of this invention in detail.
[0027] In step 110, the data to be updated in the global knowledge graph is obtained.
[0028] In this invention, the global knowledge graph refers to a structured knowledge base encompassing the entire business logic of the open-source software supply chain ecosystem. Its construction is based on open-source software supply chain data, which is collected from various open-source software ecosystem nodes, such as code hosting platforms, software package repositories, and vulnerability databases. Entities in the global knowledge graph can include software, storage repositories, developers, licenses, vulnerabilities, organizations, etc.; relationships include dependencies, inclusion, contributions, and influences.
[0029] The global knowledge graph is constructed by extracting entities (software, storage repositories, developers, licenses, vulnerabilities, etc.) and their relationships (dependencies, inclusions, contributions, impacts, etc.) from existing open-source software supply chain data. Nodes represent entities, and edges represent relationships. The specific construction process is shown in Figure 2, a schematic diagram of the global knowledge graph construction process provided by this invention, and specifically includes: Step 210: Define the ontology structure of the knowledge graph based on the open-source software supply chain scenario. The open-source software supply chain is a real-world business system. It is the supply chain network formed, during software development and operation, by all upstream open-source software projects, source code packages, binary packages, package managers, and storage repositories, together with developers and maintainers, communities, vendors, and end users, through relationships such as dependency, packaging, building, hosting, collaboration, guidance, delivery, and feedback. As the top-level framework for knowledge organization, the ontology of the open-source software supply chain knowledge graph can systematically define multi-dimensional entities such as "software," "hardware," "programming languages," "package managers," "instruction set architecture," "organization," "personnel," and "repository," as well as the attribute fields contained within these entities, based on the business logic of the entire open-source software ecosystem. It clarifies the complex semantic relationships between entities, such as the development relationship between software and personnel, the usage relationship between software and programming languages, and the dependency relationship between software programs. This ontology structure builds a logical framework in the structured form of triples, providing a paradigm for data standardization in the current scenario and ensuring that the knowledge graph architecture can adapt to the complexity and diversity of the open-source software supply chain.
[0030] Step 220: Based on the ontology structure, perform structured transformation and information extraction on multi-source heterogeneous data to identify entities and the relationships between them. Extract entities (software, storage repositories, developers, licenses, vulnerabilities, etc.) and their relationships (dependencies, inclusion, contributions, impacts, etc.) from existing open-source software supply chain data to construct an open-source software supply chain knowledge graph. Specifically, to achieve the construction of a cross-ecosystem knowledge graph, a joint embedding learning framework based on a pre-trained language model and a graph attention network is used to integrate the textual semantic features and graph structural features of entities, generating high-quality entity vector representations. Equivalent entities are identified based on vector similarity calculations as anchor points, integrating the various single-ecosystem knowledge graphs into a unified cross-ecosystem knowledge graph.
[0031] In this invention, structured data is directly mapped to specified fields of specified entities defined in the ontology. For semi-structured data (such as JSON format), the program extracts the corresponding field content according to the entity structure defined in the ontology. For unstructured data, a step-by-step strategy is used to drive a large language model to identify entities defined in the ontology structure, and semantic association analysis methods are used to extract semantic relationships between entities.
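For the semi-structured case, field extraction driven by the ontology can be sketched as follows. This is a minimal illustration under stated assumptions: the `ONTOLOGY_FIELDS` table, its field names, and the `extract_entity` helper are hypothetical and do not come from the source.

```python
import json

# Hypothetical ontology fragment: entity type -> {ontology field: JSON key}.
ONTOLOGY_FIELDS = {
    "software": {"name": "name", "version": "version", "license": "license"},
}


def extract_entity(raw_json, entity_type):
    """Keep only the fields that the ontology defines for `entity_type`."""
    record = json.loads(raw_json)
    mapping = ONTOLOGY_FIELDS[entity_type]
    # Fields missing from the record are skipped rather than guessed.
    fields = {name: record[key] for name, key in mapping.items() if key in record}
    return {"type": entity_type, **fields}
```

Keys that the ontology does not define (for example a homepage URL) are ignored, so the output always conforms to the entity structure.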
[0032] Specifically, a large language model can be used to identify all entities in a text and extract detailed information for each entity. First, the name of each entity needs to be extracted; next, the type of the entity needs to be determined; then, a concise description summarizing the basic meaning or characteristics of the entity should be written. Based on this, according to the attributes and relational structure defined by the ontology model, constraint rules that the large language model must follow during the recognition process are set. Under the guidance of these rules, all attribute information related to the entity will be extracted and organized as key-value pairs.
[0033] After integrating the entity tuples, a large language model is further used to identify the relationships between the entities identified in the previous step. First, all clearly related initial entity pairs need to be identified; each initial entity pair consists of two entities: a source entity and a target entity. Then, the relationship type is extracted for each initial entity pair to explain the reasons for their association.
[0034] The large language model used can be a pre-trained language model based on the Transformer architecture with no fewer than 9B (9 billion) parameters that supports multi-turn dialogue and text generation and is used for natural language understanding. The model is fine-tuned to adapt to the domain of unstructured open-source software supply chain data and is deployed locally to perform knowledge extraction from that data.
[0035] Based on the extracted entities and relationships, multiple initial knowledge graphs corresponding to different open-source software supply chain data can be constructed. Any two initial knowledge graphs can be denoted as the source graph and the target graph, respectively. Entities in the source graph are called source entities, and entities in the target graph are called target entities. The purpose of knowledge fusion is to merge the two knowledge graphs by identifying target entities in the target graph that point to the same objects in the real world as the source entities, thereby forming a larger, more comprehensive, and more accurate fused software supply chain knowledge graph, also known as the global knowledge graph.
[0036] Step 230: Perform knowledge fusion within the open-source software supply chain. After the global knowledge graph is constructed, a further knowledge fusion process can be performed on it. This involves acquiring the textual features (entity name, description, etc.), structural features, and attribute features of each entity in the global knowledge graph, each of which is represented as a vector. The similarity between the vectors of the source and target entities is then used to determine whether the two entities can be fused, thereby optimizing the fused global knowledge graph.
[0037] Specifically, a pre-trained text embedding model is used to encode the text features of entities, outputting a text feature vector for each entity; a graph neural network is used to encode the structural information of entities, outputting a structural feature vector.
[0038] Graph attention neural networks are used to capture structural information. These networks aggregate neighbor information through a message-passing mechanism, assigning different attention coefficients to each neighbor during aggregation. This encodes the topological relationships of the graph structure into low-dimensional vectors, allowing for better utilization of the graph structure itself within the knowledge graph. For attribute features, each entity's attributes and values are first processed into an attribute-value sequence. Then, a pre-trained text embedding model encodes the attribute features into feature vectors. A learnable gating mechanism is used to weight and sum these three vectors to form the final entity embedding vector.
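The source does not specify the form of the gating mechanism. The following is a minimal sketch, assuming the gate is a softmax over three learnable scalar logits and that the text, structural, and attribute vectors have already been projected to the same dimension; the function name `gated_fusion` is hypothetical.

```python
import math


def gated_fusion(text_vec, struct_vec, attr_vec, gate_logits):
    """Weighted sum of three same-length vectors.

    `gate_logits` holds three scalars (learned during training in practice);
    softmax turns them into weights that sum to 1.
    """
    vectors = [text_vec, struct_vec, attr_vec]
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("all three feature vectors must share one dimension")
    peak = max(gate_logits)  # subtract the maximum for numerical stability
    exps = [math.exp(g - peak) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * v[i] for w, v in zip(weights, vectors))
        for i in range(len(text_vec))
    ]
    return fused, weights
```

With equal logits the three views contribute equally; training would shift the logits toward whichever view is most informative for alignment.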
[0039] After obtaining the embedded representations of the entities, for any two entities, the cosine similarity of their joint embedding vectors is calculated to determine whether they refer to the same real-world object, thereby optimizing the fusion process.
[0040] After constructing the global knowledge graph, the data to be updated in the global knowledge graph is obtained. Data to be updated refers to newly incoming data in incremental scenarios, such as newly released software versions, newly disclosed vulnerabilities, or developer change information. This data is typically triggered by change events.
[0041] In step 120, a local subgraph is determined from the global knowledge graph based on the entities to be updated in the data to be updated.
[0042] After acquiring the data to be updated, the entities within that data are identified. To avoid recalculating the entire massive global knowledge graph, computational boundaries can be defined, i.e., local subgraphs.
[0043] A local subgraph refers to a subset of the graph that is directly or indirectly affected by the change. By extracting local subgraphs, subsequent embedding calculations and similarity comparisons are limited to a smaller range, greatly reducing computational complexity.
[0044] The process of determining a local subgraph involves comparing the entity to be updated with entities in the global knowledge graph, filtering based on similarity, and then determining the local subgraph from the global knowledge graph.
[0045] In step 130, embedding representation learning is performed on each entity in the local subgraph and the entity to be updated based on the graph neural network model to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph.
[0046] Graph neural network models, specifically graph attention networks, aim to map nodes (entities) in a graph to a low-dimensional vector space, enabling these vectors to capture the structural relationships and semantic information between entities.
[0047] The embedding representation learning is performed on each entity in the local subgraph and the entity to be updated using a graph neural network model to obtain the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph.
[0048] In step 140, the local subgraph is updated based on the vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph, and the global knowledge graph is updated based on the updated local subgraph to obtain the updated global knowledge graph.
[0049] Optionally, the global knowledge graph update process can be as shown in Figure 3, a schematic diagram of the global knowledge graph update process provided by this invention.
[0050] The global knowledge graph update process can be performed based on an incremental fusion engine.
[0051] Step 310: Merge request triggering and subgraph extraction.
[0052] Once the data acquisition mechanism captures and preprocesses a change event, it sends a FusionRequest to the fusion engine. This request contains the data to be updated, and a set of entities to be updated can be constructed from the entities contained in this data. The fusion engine extracts a local subgraph to be fused from the global knowledge graph based on the entities in the set of entities to be updated. The extraction scope includes: the entities in the set of entities to be updated; the entities in the global knowledge graph that are similar to them (similar entities); the entities that have a second-degree relationship with the entities to be updated or with the similar entities (which can be called context entities); and the edges connecting these entities.
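The extraction scope can be sketched as follows for a single entity to be updated. This is a minimal illustration under several assumptions: similar entities are the `top_k` highest cosine-similarity entities, "second-degree relationship" is read as "reachable within two hops", and the function and parameter names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def within_two_hops(seed, adjacency):
    """Entities reachable from `seed` in one or two hops."""
    one_hop = set(adjacency.get(seed, ()))
    two_hop = set()
    for neighbor in one_hop:
        two_hop |= adjacency.get(neighbor, set())
    return one_hop | two_hop


def extract_local_subgraph(pending_id, pending_vec, vectors, edges, top_k=2):
    """Return (target_entity_ids, edges_between_targets).

    `vectors` maps existing entity ids to feature vectors; `edges` is a list
    of (source_id, relation, target_id) triples, including any edges that
    arrive with the entity to be updated.
    """
    adjacency = {}
    for src, _rel, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    # Similar entities: highest cosine similarity to the entity to be updated.
    ranked = sorted(
        (e for e in vectors if e != pending_id),
        key=lambda e: cosine(pending_vec, vectors[e]),
        reverse=True,
    )
    similar = ranked[:top_k]

    # Context entities: within two hops of the pending or similar entities.
    targets = {pending_id, *similar}
    for seed in [pending_id, *similar]:
        targets |= within_two_hops(seed, adjacency)

    sub_edges = [e for e in edges if e[0] in targets and e[2] in targets]
    return targets, sub_edges
```

Everything outside `targets` is untouched by the later embedding and similarity steps, which is where the reduction in computation comes from.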
[0053] Step 320: Embedding and similarity calculation based on the graph attention neural network.
[0054] Embedding representation learning is performed on each entity in the local subgraph and the entity to be updated based on the graph attention neural network, and the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph are determined.
[0055] The vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph is calculated. Entity alignment and merging are then performed based on the vector similarity to update the local subgraph.
[0056] Vector similarity measures the semantic and structural closeness of two entities. Similarity calculation methods can include cosine similarity.
[0057] Step 330, Entity Alignment and Merging. By calculating the similarity of the embedding vectors between the entity to be updated and other entities in the local subgraph, it can be determined whether the entity to be updated represents entirely new knowledge or points to the same object as an existing entity in the graph. Based on the determination result, the corresponding update operation is performed on the local subgraph, such as adding a new entity or merging identical entities.
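The alignment decision can be sketched as follows, taking the similarities from step 320 as input. This is a minimal illustration: the function name `decide_update` and the default threshold of 0.85 are assumptions, since the source does not give a threshold value.

```python
def decide_update(similarities, threshold=0.85):
    """Choose between merging and adding for one entity to be updated.

    `similarities` maps each local-subgraph entity id to its vector
    similarity with the entity to be updated.
    Returns ("merge", candidate_id) or ("add", None).
    """
    if not similarities:
        return ("add", None)  # nothing to align against
    # Candidate entity: the one with the highest similarity.
    candidate = max(similarities, key=similarities.get)
    # Strictly greater than the threshold merges; otherwise add as new.
    if similarities[candidate] > threshold:
        return ("merge", candidate)
    return ("add", None)
```

A similarity exactly equal to the threshold results in an add, matching the "less than or equal to" condition of the method.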
[0058] Step 340: Subgraph Write Back and Global Update. Write the updated local subgraph (containing the newly inserted entities, the merged unified entities, and the updated relationships) back to the global knowledge graph.
[0059] Optionally, to ensure data consistency, traceability, and recoverability, the update process can be executed using database transactions to guarantee the atomicity of incremental updates. For example, if any failure occurs during the update process, the knowledge graph can be rolled back to a consistent state before the update. Simultaneously, detailed update operation logs can be recorded, including change time, entity ID to be updated, and operation type, to support version tracking and incremental backup of the knowledge graph.
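Atomic write-back with an operation log can be sketched as follows. This is a minimal illustration that uses Python's standard `sqlite3` module as a stand-in for whatever graph store is actually used; the table layout and the `open_graph_store` and `write_back` names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_graph_store():
    """Create an in-memory store with an entity table and an update log."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entity (id TEXT PRIMARY KEY, description TEXT)")
    conn.execute(
        "CREATE TABLE update_log (changed_at TEXT, entity_id TEXT, operation TEXT)"
    )
    conn.commit()
    return conn


def write_back(conn, operations):
    """Apply (operation, entity_id, description) tuples as one transaction.

    Returns True if everything was committed, False if it was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for operation, entity_id, description in operations:
                if operation == "add":
                    conn.execute(
                        "INSERT INTO entity VALUES (?, ?)", (entity_id, description)
                    )
                elif operation == "merge":
                    cursor = conn.execute(
                        "UPDATE entity SET description = ? WHERE id = ?",
                        (description, entity_id),
                    )
                    if cursor.rowcount == 0:
                        raise LookupError(f"unknown entity: {entity_id}")
                else:
                    raise ValueError(f"unknown operation: {operation}")
                # Log change time, entity ID, and operation type.
                conn.execute(
                    "INSERT INTO update_log VALUES (?, ?, ?)",
                    (datetime.now(timezone.utc).isoformat(), entity_id, operation),
                )
        return True
    except (sqlite3.Error, LookupError, ValueError):
        return False
```

Because the log rows are written inside the same transaction, a failed update leaves neither partial entities nor misleading log entries behind.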
[0060] This invention provides an open-source software knowledge graph update method based on incremental fusion of local subgraphs. Instead of directly manipulating the global knowledge graph after acquiring the data to be updated, it determines a small local subgraph from the global knowledge graph based on the entity to be updated, significantly reducing the computational load of subsequent update operations. An embedding representation learning method is performed on each entity in the local subgraph and the entity to be updated using a graph neural network model to calculate the vector similarity between the entity to be updated and each entity in the local subgraph. Based on the calculated similarity of the embedding vectors, a high-precision update process for the local subgraph is achieved, and the updated local subgraph is further written into the global knowledge graph, realizing a high-precision update process for the global knowledge graph. This high-precision update process based on local subgraphs not only improves the update efficiency of the global knowledge graph but also enhances the update accuracy.
[0061] In one embodiment, the process of determining the embedding vector of the current entity in the local subgraph includes: Obtain the initial feature vector of the current entity in the local subgraph; Based on the graph attention mechanism in the graph neural network model, the attention coefficient between the current entity and its neighboring entities is calculated in the topology of the local subgraph. Based on the attention coefficient, the feature information of the neighboring entities is aggregated to update the initial feature vector of the current entity, resulting in an aggregated feature vector; The aggregated feature vector is iteratively updated until a preset iteration threshold is reached, and the iterated aggregated feature vector is used as the embedding vector of the current entity.
[0062] It should be noted that the current entity is any entity in the local subgraph.
[0063] The initial feature vector is a vectorized representation of the entity's own attributes. The specific determination process can be as follows: extract the entity's textual attributes, such as the entity's name and description, and use Term Frequency-Inverse Document Frequency (TF-IDF) or a pre-trained language model (such as Bidirectional Encoder Representations from Transformers (BERT)) to convert them into the initial feature vector.
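The TF-IDF option can be sketched as follows. This is a minimal, dependency-free illustration; the whitespace tokenizer and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are one common variant chosen here as an assumption, and a production system would more likely use an existing library or a pre-trained language model.

```python
import math
from collections import Counter


def tfidf_features(texts):
    """Turn entity texts (name plus description) into TF-IDF vectors.

    Returns (vocabulary, rows), where rows[i] is the vector for texts[i].
    """
    docs = [text.lower().split() for text in texts]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many texts each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {
        token: math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0
        for token in vocab
    }
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency is normalised by text length; empty texts give zeros.
        rows.append(
            [counts[token] / len(doc) * idf[token] if doc else 0.0 for token in vocab]
        )
    return vocab, rows
```

Tokens that appear in fewer entity texts receive a larger weight, so distinctive name tokens dominate the initial feature vector.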
[0064] Optionally, the graph neural network model can specifically employ a graph attention network. The graph attention mechanism in a graph attention network assigns different importance weights, or attention coefficients, to different neighbors in the topology of a local subgraph through a self-attention process. This coefficient represents the importance of a neighboring node to the current node when aggregating information.
[0065] Based on the attention coefficients, the feature information of the neighboring entities is aggregated to update the initial feature vector of the current entity, resulting in an aggregated feature vector. Specifically, the feature vector of each neighboring entity is multiplied by its corresponding normalized attention coefficient, and then all weighted neighbor feature vectors are summed and aggregated to obtain a new feature vector representation containing neighborhood information, i.e., the aggregated feature vector.
[0066] The aggregation process described above can be viewed as a layer in a graph attention network. By stacking multiple layers of graph attention networks, an entity can aggregate information from its more distant neighbors. With each aggregation update, the entity gains a wider receptive field. A preset number of iterations (i.e., the number of network layers) can be set, for example, L iterations. After the iterations are complete, the final aggregated feature vector is used as the final embedding vector for that entity.
[0067] Optionally, the local embedding and similarity calculation process based on the graph attention network can be as follows: Input: the entities in the local subgraph and the entity to be updated.
[0068] Output: The vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph.
[0069] First: Construct input features.
[0070] For each entity $i$ in the local subgraph, its attribute features (such as the TF-IDF vector of its name and description) are used as the initial feature vector $h_i^{(0)}$.
[0071] Second: Graph attention layer update.
[0072] An L-layer graph attention network is used to aggregate neighbor information and update node representations. This is achieved through L-layer iterations, where the embedding of each node is aggregated with the latest information from its neighbors.
[0073] The computation process for each layer includes: Calculate the attention coefficient: for node $i$ and each of its neighbors $j \in \mathcal{N}(i)$, calculate an attention coefficient $e_{ij}^{(l)}$, which indicates, in the context of this fusion, the importance of neighbor $j$ to node $i$.
[0074] $e_{ij}^{(l)} = \mathrm{LeakyReLU}\left( (a^{(l)})^{\top} \left[ W^{(l)} h_i^{(l-1)} \,\|\, W^{(l)} h_j^{(l-1)} \right] \right)$; where $l$ is the layer index ($l = 1, \dots, L$); $W^{(l)}$ is the learnable weight matrix shared within layer $l$, used to transform node features; $a^{(l)}$ is the learnable attention vector shared within layer $l$; $\|$ is the vector concatenation operation; and LeakyReLU is a non-linear activation function.
[0075] Normalized attention weight calculation: the softmax function is used to normalize the attention coefficients of all neighbors, yielding the final attention weight $\alpha_{ij}^{(l)}$.
[0076] $\alpha_{ij}^{(l)} = \dfrac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}$; where $\alpha_{ij}^{(l)}$ expresses the normalized importance of neighbor $j$ to node $i$.
[0077] Aggregate neighbor information and update node embeddings: based on the attention weights, perform a weighted sum of the transformed features of the neighbors, and then apply a non-linear activation function to obtain the new representation $h_i^{(l)}$ of node $i$ in layer $l$.
[0078] $h_i^{(l)} = \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l-1)} \right)$; where $\sigma$ is a non-linear activation function, such as ELU or ReLU.
[0079] After propagation through $L$ layers, the final context-aware embedding vector of each node $i$ is obtained: $z_i = h_i^{(L)}$; at this point, $z_i$ has integrated the latest information from the multi-hop neighbors of node $i$, including the new relationships introduced by the newly added entities.
[0080] Calculate entity pair similarity: for each pair of candidate entities $(i, j)$ that needs to be judged (for example, the entity to be updated and any entity in the local subgraph), calculate the cosine similarity of their final embeddings.
[0081] $\mathrm{sim}(i, j) = \dfrac{z_i \cdot z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$; the range of cosine similarity is [-1, 1]. The closer the value is to 1, the more similar the representations of the two entities are in the current local graph context, and the higher the probability that they are synonymous entities.
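The layer computation and the similarity step above can be sketched in plain Python. This is an illustrative sketch only: the weights are random and untrained, each node's neighbor list is assumed to include the node itself (a self-loop, so that no neighborhood is empty), and the names `gat_layer` and `cosine` and the toy graph are hypothetical.

```python
import math
import random

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_layer(H, neighbors, W, a):
    """One graph attention layer. H maps node id -> feature list."""
    Wh = {i: matvec(W, h) for i, h in H.items()}
    H_next = {}
    for i in H:
        nbrs = neighbors[i]
        # Raw coefficients: LeakyReLU(a^T [W h_i || W h_j]) for each neighbor j.
        e = [leaky_relu(sum(p * q for p, q in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        # Softmax over the neighborhood, shifted by the max for stability.
        m = max(e)
        exps = [math.exp(x - m) for x in e]
        total = sum(exps)
        alpha = [x / total for x in exps]
        # Weighted sum of transformed neighbor features, then ELU.
        dim = len(Wh[i])
        H_next[i] = [elu(sum(al * Wh[j][d] for al, j in zip(alpha, nbrs)))
                     for d in range(dim)]
    return H_next

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

# Toy local subgraph: "new" is the entity to be updated (hypothetical data).
rng = random.Random(0)
in_dim, out_dim, num_layers = 3, 4, 2
H = {"new": [1.0, 0.0, 1.0], "a": [1.0, 0.1, 0.9], "b": [0.0, 1.0, 0.0]}
neighbors = {"new": ["new", "a"], "a": ["a", "new", "b"], "b": ["b", "a"]}
for layer in range(num_layers):
    d_in = in_dim if layer == 0 else out_dim
    W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(out_dim)]
    a = [rng.uniform(-1, 1) for _ in range(2 * out_dim)]
    H = gat_layer(H, neighbors, W, a)

similarities = {k: cosine(H["new"], H[k]) for k in ("a", "b")}
```

With trained weights, `similarities` would hold the scores that the later merge decision compares against the similarity threshold; with the random weights used here the values carry no meaning beyond showing the data flow.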
[0082] By employing a lightweight graph attention network for embedding representation learning, while ensuring the model's expressive power, the efficiency of embedding representation and similarity calculation can be improved by orders of magnitude by limiting the computational range. This ensures that incremental entities can fully combine the structural and semantic information of their local neighborhoods to achieve high-quality vectorized representation and accurate correlation discrimination.
[0083] In one embodiment, determining a local subgraph from the global knowledge graph based on the entities to be updated in the data to be updated includes: Based on the cosine similarity between the entity to be updated and each entity in the global knowledge graph, multiple similar entities are determined from the global knowledge graph; The local subgraph is constructed based on the target entity and the edges connecting the target entity; the target entity includes the entity to be updated, the similar entity, the entity that has a second-degree relationship with the entity to be updated, and the entity that has a second-degree relationship with the similar entity.
[0084] To determine similar entities in the global knowledge graph for a given entity to be updated, cosine similarity can be calculated between its initial feature vector (e.g., a text vector based on name and description) and the feature vectors of all or some entities in the global knowledge graph. To improve efficiency, a coarse screening process can be performed first, such as indexing, to narrow down the candidate pool. Then, entities with similarity scores higher than a preset first threshold are identified as the set of similar entities.
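One possible reading of the coarse screening and thresholding described above is sketched below. It assumes an inverted token index as the coarse filter and raw token counts as the feature vectors; `build_index`, `find_similar`, the entity identifiers, and the threshold value 0.5 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict

def cosine(c1, c2):
    """Cosine similarity between two token-count mappings."""
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def build_index(entity_texts):
    """Inverted index: token -> ids of entities whose text contains it."""
    index = defaultdict(set)
    for eid, text in entity_texts.items():
        for tok in set(text.lower().split()):
            index[tok].add(eid)
    return index

def find_similar(query_text, entity_texts, index, first_threshold):
    q = Counter(query_text.lower().split())
    # Coarse screening: only entities sharing at least one token are scored.
    candidates = set().union(*(index.get(tok, set()) for tok in q))
    scored = {eid: cosine(q, Counter(entity_texts[eid].lower().split()))
              for eid in candidates}
    # Keep entities whose similarity exceeds the preset first threshold.
    return {eid: s for eid, s in scored.items() if s > first_threshold}

global_texts = {
    "pypi:requests": "requests http library for python",
    "npm:axios": "promise based http client",
    "deb:openssl": "openssl tls toolkit",
}
index = build_index(global_texts)
similar = find_similar("requests python http library", global_texts, index, 0.5)
```

In this toy data only `pypi:requests` passes the threshold; `deb:openssl` is never scored because the index filters it out first.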
[0085] The target entities are the similar entities, the entities with a second-degree relationship to the similar entities, the entity to be updated, and the entities with a second-degree relationship to the entity to be updated. An entity with a second-degree relationship to a given entity is a "neighbor of a neighbor" of that entity. These target entities and all relationships connecting them are extracted from the global knowledge graph to form the local subgraph.
[0086] Optionally, to further balance efficiency and context scope, the target entity may also include only the similar entities, the entity to be updated, entities that have a first-degree relationship (i.e., are directly connected) with the entity to be updated, and entities that have a first-degree relationship with the similar entities. This approach results in a smaller local subgraph and higher computational efficiency.
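The two extraction scopes described above (second-degree and first-degree) can be sketched as a bounded breadth-first search. The sketch assumes that edges are treated as undirected for the purpose of neighborhood expansion and that every entity within `k` hops of a seed is kept, including the intermediate first-degree neighbors; the function names and the toy edge list are hypothetical.

```python
from collections import deque

def k_hop(adj, seeds, k):
    """Entities within k hops of any seed (multi-source BFS)."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the hop limit
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

def local_subgraph(edges, seeds, k=2):
    """edges: list of (src, relation, dst). Returns (nodes, induced edges)."""
    adj = {}
    for u, _, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = k_hop(adj, seeds, k)
    return nodes, [(u, r, v) for u, r, v in edges if u in nodes and v in nodes]

edges = [
    ("A", "depends_on", "B"),
    ("B", "depends_on", "C"),
    ("C", "depends_on", "D"),
    ("D", "depends_on", "E"),
]
nodes, sub_edges = local_subgraph(edges, seeds={"A"}, k=2)
small_nodes, _ = local_subgraph(edges, seeds={"A"}, k=1)
```

Here `seeds` would be the entity to be updated together with its similar entities; `k=2` corresponds to the second-degree variant and `k=1` to the smaller first-degree variant.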
[0087] In this way, it is ensured that potential matching objects highly related to the entity to be updated and their surrounding context are included in the local subgraph, providing the necessary information for subsequent accurate similarity recalculation and entity merging, and effectively improving the accuracy of association determination.
[0088] In one embodiment, updating the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph includes: Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph, the candidate entity with the highest similarity to the entity to be updated is determined from the entities in the local subgraph. If the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is greater than a similarity threshold, the entity to be updated and the candidate entity are merged. If the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is less than or equal to the similarity threshold, the entity to be updated is added as a new entity to the local subgraph.
[0089] After the embedding representation learning is completed, each entity to be updated and each context entity in the local subgraph obtains an embedding vector that reflects its structural and semantic information. At this point, the vector similarity between the entity to be updated and all context entities in the local subgraph is calculated again, and the context entity with the highest similarity score is selected as the candidate entity.
[0090] If the similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is greater than a similarity threshold, the entity to be updated and the candidate entity are merged. This similarity threshold (e.g., a value close to 1.0, such as 0.95) is a preset high threshold used to determine whether two entities refer to the same object in the real world. If the similarity exceeds this threshold, they are considered equivalent, triggering the entity merging operation.
[0091] Alternatively, if the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is less than or equal to a similarity threshold, the entity to be updated is added as a new entity to the local subgraph. If the highest similarity score also fails to exceed the threshold, the entity to be updated is considered to have no corresponding entity in the existing knowledge graph, representing entirely new knowledge. In this case, it should be added as a new node to the local subgraph.
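The merge-or-add decision described above reduces to a single comparison against the similarity threshold. A minimal sketch, in which the function name `decide_update` and the return format are hypothetical and 0.95 is the example threshold mentioned in the text:

```python
def decide_update(similarities, threshold=0.95):
    """Return ("merge", candidate_id) or ("add", None) for one entity to be updated.

    similarities maps each entity id in the local subgraph to its vector
    similarity with the entity to be updated.
    """
    if not similarities:
        return ("add", None)
    # Candidate entity: the context entity with the highest similarity.
    candidate = max(similarities, key=similarities.get)
    if similarities[candidate] > threshold:
        return ("merge", candidate)
    # Less than or equal to the threshold: treat as entirely new knowledge.
    return ("add", None)
```

For example, `decide_update({"a": 0.97, "b": 0.40})` selects `"a"` for merging, while a best score of exactly 0.95 results in the entity being added as a new node, because the comparison is strict.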
[0092] In one embodiment, merging the entity to be updated with the candidate entity includes: Create a merged entity in the local subgraph, and link the candidate entity and the entity to be updated to the merged entity; Based on a preset conflict resolution strategy, the attributes of the entity to be updated and the candidate entities are merged, and the merged attributes are assigned to the merged entity. Migrate all relationships between the entity to be updated and the candidate entity to the merged entity.
[0093] Logically, a new merged entity is created, which will serve as the sole representative of the entity to be updated and the candidate entities. Simultaneously, the existing entities to be updated and candidate entities are marked as merged and linked to this new merged entity. Structurally, the candidate entity nodes can be retained and updated as merged entities, or entirely new nodes can be created.
[0094] Since both the entity to be updated and the candidate entity may have their own attributes, conflicts may occur during merging, such as different version numbers or different description texts. These conflicts must be resolved according to pre-defined rules to ensure the consistency and accuracy of the entity attributes after merging.
[0095] The migration process involves resetting all incoming edges pointing to the entity to be updated and the candidate entity, as well as all outgoing edges originating from them, to point to the newly created merged entity.
[0096] Through the above systematic merging operations, it can be ensured that after identifying equivalent entities, their information can be integrated into a single authoritative representation without loss, which significantly improves the overall data quality of the knowledge graph.
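The merging operations above (create the merged entity, link the originals, migrate relationships) can be sketched as follows. The sketch assumes the variant in which an entirely new node is created, stores the link as a `merged_into` attribute, and drops self-loops and duplicate edges produced by the redirection; these choices, the function name, and the sample data are assumptions for illustration.

```python
def merge_entities(nodes, edges, new_id, cand_id, merged_id, fuse):
    """nodes: id -> attribute dict; edges: list of (src, relation, dst).

    merged_id is assumed to be a fresh id, distinct from new_id and cand_id.
    fuse(new_attrs, cand_attrs) implements the conflict resolution strategy.
    """
    nodes[merged_id] = fuse(nodes[new_id], nodes[cand_id])
    # Keep the originals, marked as merged and linked to the merged entity.
    for old in (new_id, cand_id):
        nodes[old] = dict(nodes[old], merged_into=merged_id)
    remap = {new_id: merged_id, cand_id: merged_id}
    migrated = []
    for src, rel, dst in edges:
        # Redirect incoming and outgoing edges to the merged entity.
        edge = (remap.get(src, src), rel, remap.get(dst, dst))
        if edge[0] != edge[2] and edge not in migrated:
            migrated.append(edge)
    return nodes, migrated

nodes = {
    "new": {"name": "requests"},
    "cand": {"name": "python-requests"},
    "urllib3": {"name": "urllib3"},
}
edges = [
    ("new", "depends_on", "urllib3"),
    ("cand", "depends_on", "urllib3"),
    ("app", "depends_on", "cand"),
]
nodes, edges = merge_entities(
    nodes, edges, "new", "cand", "m", fuse=lambda a, b: {**b, **a}
)
```

After the call, both original edges to `urllib3` collapse into a single edge from `m`, and the incoming edge from `app` points at `m`.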
[0097] In one embodiment, fusing the attributes of the entity to be updated and the candidate entities based on a preset conflict resolution strategy includes at least one of the following: For attributes of the text description type, the description text of the entity to be updated is concatenated with the description text of the candidate entity; For attributes of the category label type, the union of the label set of the entity to be updated and the label set of the candidate entity is taken.
[0098] For example, if two entities both have attributes describing text types, the two text strings can be concatenated to form a new, more complete descriptive text. Natural language processing techniques can also be used to summarize or deduplicate the concatenated text, generating a more refined description.
[0099] For attributes of the category label type, the union of the label set of the entity to be updated and the label set of the candidate entity is taken, which can retain all category information most comprehensively.
[0100] Optionally, for ordered attributes such as version numbers, a strategy of using the latest version number or the highest version number can be adopted.
[0101] By configuring conflict resolution strategies, inconsistencies between equivalent entities at the attribute level can be addressed, preventing information loss or corruption and thus significantly improving the quality of knowledge graphs.
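The three strategies above (text concatenation, label union, highest version) can be combined in one fusion function. This is a sketch under stated assumptions: the attribute keys `description`, `labels`, and `version` are hypothetical, versions are assumed to be purely numeric dotted strings, and for any other attribute the value of the entity to be updated is assumed to win.

```python
def version_key(v):
    # Assumes purely numeric dotted versions such as "2.10.0".
    return tuple(int(p) for p in v.split("."))

def fuse_attributes(new, cand):
    """Merge attributes of the entity to be updated (new) and the candidate."""
    merged = dict(cand)
    merged.update(new)  # default rule (an assumption): the incoming value wins
    if "description" in new and "description" in cand:
        parts = [cand["description"], new["description"]]
        # Concatenate the two texts, skipping an exact duplicate.
        merged["description"] = " ".join(dict.fromkeys(parts))
    if "labels" in new or "labels" in cand:
        # Union of both label sets keeps all category information.
        merged["labels"] = sorted(
            set(new.get("labels", [])) | set(cand.get("labels", []))
        )
    if "version" in new and "version" in cand:
        # Highest version, compared numerically rather than as strings.
        merged["version"] = max(new["version"], cand["version"], key=version_key)
    return merged

merged = fuse_attributes(
    {"description": "HTTP library.", "labels": ["http", "python"], "version": "2.10.0"},
    {"description": "HTTP for Humans.", "labels": ["http", "network"], "version": "2.9.1"},
)
```

The numeric comparison matters here: a plain string comparison would rank "2.9.1" above "2.10.0".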
[0102] This invention also provides a flowchart illustrating the application of the open-source software graph update method based on local subgraph incremental fusion provided by this invention. As shown in Figure 4, it includes: Step 410: Collect and store open-source software supply chain data in real time.
[0103] A distributed crawler system with event monitoring capabilities was built to continuously collect supply chain-related data from various open-source software ecosystem nodes. Data sources cover mainstream code hosting platforms, software package repositories, vulnerability databases, license information repositories, and community development platforms. The system uses an event-driven mechanism to achieve incremental data capture and near real-time synchronization by monitoring API updates, webhook messages, and repository dynamics, ensuring data timeliness and integrity. Collected data includes software metadata, dependency information, developer and organization information, license text, vulnerability announcements, and community development activities, and is stored in a non-relational database to support flexible structured and semi-structured data management.
[0104] Step 420: Construct a global knowledge graph of open-source software supply chain data.
[0105] In this step, entities (software, repositories, developers, licenses, vulnerabilities, etc.) and their relationships (dependencies, inclusions, contributions, impacts, etc.) are extracted from existing open-source software supply chain data to construct an open-source software supply chain knowledge graph. Nodes represent entities, and edges represent relationships. To achieve cross-ecosystem knowledge graph construction, a joint embedding learning framework based on a pre-trained language model and a graph attention network is used to fuse the textual semantic features and graph structural features of entities, generating high-quality entity vector representations. Equivalent entities identified through vector similarity calculation serve as anchor points for integrating the various single-ecosystem knowledge graphs into a unified cross-ecosystem knowledge graph.
[0106] Step 430: Update the global knowledge graph using incremental fusion of local subgraphs.
[0107] When new data flows in, the system fuses only the newly added or modified entities and their associated subgraphs, avoiding full graph reconstruction. The fusion process consists of the following sub-steps: Subgraph extraction: local subgraphs related to the newly added entities are extracted from the global graph; Similarity recalculation: a graph attention network is used to re-embed the entities within the subgraph and recalculate their similarity; Entity merging: if the similarity exceeds the threshold 𝜃 and there is no conflict, entity merging is performed; Subgraph write-back and global update: the merged subgraph is written back into the cross-ecosystem knowledge graph.
[0108] Step 440: Global knowledge graph update and version management.
[0109] To ensure data consistency, traceability, and recoverability during frequent incremental updates of the knowledge graph, a comprehensive graph update and version management mechanism is designed. After each update of the knowledge graph, the updated version is saved, and detailed information for each version update is recorded, providing a timestamp and version number for each version. This supports backtracking or comparing differences along the timeline.
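A version management mechanism of this kind could be sketched as below. The sketch assumes that each version is stored as a full snapshot (the text does not state whether full snapshots or deltas are kept); the class name `GraphVersionStore` and its methods are hypothetical.

```python
import copy
from datetime import datetime, timezone

class GraphVersionStore:
    """Keeps one snapshot per update, with a version number and timestamp."""

    def __init__(self):
        self.versions = []

    def commit(self, nodes, edges, note=""):
        snapshot = {
            "version": len(self.versions) + 1,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "note": note,
            "nodes": copy.deepcopy(nodes),
            "edges": set(edges),
        }
        self.versions.append(snapshot)
        return snapshot["version"]

    def get(self, version):
        # Backtracking: retrieve the graph as it was at a given version.
        return self.versions[version - 1]

    def diff(self, old, new):
        a, b = self.get(old), self.get(new)
        return {
            "added_nodes": sorted(b["nodes"].keys() - a["nodes"].keys()),
            "removed_nodes": sorted(a["nodes"].keys() - b["nodes"].keys()),
            "added_edges": sorted(b["edges"] - a["edges"]),
            "removed_edges": sorted(a["edges"] - b["edges"]),
        }

store = GraphVersionStore()
v1 = store.commit({"a": {}, "b": {}}, [("a", "depends_on", "b")], note="initial")
v2 = store.commit(
    {"a": {}, "b": {}, "c": {}},
    [("a", "depends_on", "b"), ("c", "depends_on", "a")],
    note="incremental update",
)
changes = store.diff(v1, v2)
```

Full snapshots keep the sketch short; a production system handling frequent incremental updates would more plausibly store per-update deltas.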
[0110] This approach employs a local subgraph extraction and incremental fusion mechanism based on change events to address the high computational overhead and low efficiency caused by full graph reconstruction when new data enters the knowledge graph. By limiting the computational scope from the global graph to the affected local subgraphs, it avoids much of the computational resource consumption and time delay associated with traditional full-scale fusion methods, providing the timeliness that core application scenarios require.
[0111] This method employs a lightweight graph attention network to recompute embeddings and similarities within local subgraphs, addressing the challenge of quickly and accurately assessing the correlation between new or updated entities and existing entities in incremental scenarios. While maintaining the model's expressive power, it achieves an order-of-magnitude improvement in the efficiency of embedding representation and similarity calculation by limiting the computational scope of the graph neural network. This ensures that incremental entities can fully leverage the structural and semantic information of their local neighborhoods for high-quality vectorized representation and accurate correlation discrimination, effectively balancing efficiency and accuracy in the incremental fusion process.
[0112] This system employs an entity merging operation with conflict resolution strategies to address the challenge of integrating the attributes and relationships of equivalent entities after entity alignment, thereby generating a unified entity and ensuring the integrity and accuracy of knowledge. Through pre-defined and configurable conflict resolution strategies, this operation systematically handles inconsistencies at the attribute level among equivalent entities, preventing information loss or corruption. This significantly improves the overall data quality, logical consistency, and reliability of the knowledge graph, providing a stable and reliable knowledge foundation for upper-layer applications.
[0113] The open-source software graph update device based on local subgraph incremental fusion provided by the present invention will be described below. The open-source software graph update device based on local subgraph incremental fusion described below can be referred to in correspondence with the open-source software graph update method based on local subgraph incremental fusion described above.
[0114] As shown in Figure 5, the device includes: The data acquisition module 510 is used to acquire the data to be updated in the global knowledge graph, which is constructed based on entities and relationships extracted from open-source software supply chain data; The local subgraph determination module 520 is used to determine local subgraphs from the global knowledge graph based on the entities to be updated in the data to be updated; The embedding vector determination module 530 is used to perform embedding representation learning on each entity in the local subgraph and the entity to be updated based on a graph neural network model, and to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph; The update module 540 is used to update the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph, and to update the global knowledge graph based on the updated local subgraph to obtain the updated global knowledge graph.
[0115] This invention provides an open-source software graph update device based on incremental fusion of local subgraphs. Instead of directly manipulating the global knowledge graph after acquiring the data to be updated, it determines a small local subgraph from the global knowledge graph based on the entity to be updated, significantly reducing the computational load of subsequent update operations. A graph neural network model performs embedding representation learning on each entity in the local subgraph and on the entity to be updated, and the vector similarity between the entity to be updated and each entity in the local subgraph is then calculated from the resulting embedding vectors. Based on this similarity, the local subgraph is updated with high precision, and the updated local subgraph is written back into the global knowledge graph, so that the global knowledge graph is also updated with high precision. This update process based on local subgraphs improves both the update efficiency and the update accuracy of the global knowledge graph.
[0116] In one embodiment, the embedding vector determination module 530 is specifically used for: The process of determining the embedding vector of the current entity in the local subgraph includes: Obtain the initial feature vector of the current entity in the local subgraph; Based on the graph attention mechanism in the graph neural network model, the attention coefficient between the current entity and its neighboring entities is calculated in the topology of the local subgraph. Based on the attention coefficient, the feature information of the neighboring entities is aggregated to update the initial feature vector of the current entity, resulting in an aggregated feature vector; The aggregated feature vector is iteratively updated until a preset iteration threshold is reached, and the iterated aggregated feature vector is used as the embedding vector of the current entity.
[0117] In one embodiment, the local subgraph determination module 520 is specifically used for: The step of determining a local subgraph from the global knowledge graph based on the entities to be updated in the data to be updated includes: Based on the cosine similarity between the entity to be updated and each entity in the global knowledge graph, multiple similar entities are determined from the global knowledge graph; The local subgraph is constructed based on the target entity and the edges connecting the target entity; the target entity includes the entity to be updated, the similar entity, the entity that has a second-degree relationship with the entity to be updated, and the entity that has a second-degree relationship with the similar entity.
[0118] In one embodiment, the update module 540 is specifically used for: The step of updating the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph includes: Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph, the candidate entity with the highest similarity to the entity to be updated is determined from the entities in the local subgraph. If the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is greater than a similarity threshold, the entity to be updated and the candidate entity are merged. If the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is less than or equal to the similarity threshold, the entity to be updated is added as a new entity to the local subgraph.
[0119] In one embodiment, the update module 540 is further configured to: The step of merging the entity to be updated with the candidate entity includes: Create a merged entity in the local subgraph, and link the candidate entity and the entity to be updated to the merged entity; Based on a preset conflict resolution strategy, the attributes of the entity to be updated and the candidate entities are merged, and the merged attributes are assigned to the merged entity. Migrate all relationships between the entity to be updated and the candidate entity to the merged entity.
[0120] In one embodiment, the update module 540 is further configured to: The method of fusing the attributes of the entity to be updated and the candidate entities based on a preset conflict resolution strategy includes at least one of the following: For attributes of the text description type, the description text of the entity to be updated is concatenated with the description text of the candidate entity; For attributes of the category label type, the union of the label set of the entity to be updated and the label set of the candidate entity is taken.
[0121] Figure 6 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communications interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 can call logical instructions in the memory 630 to execute the open-source software graph update method based on local subgraph incremental fusion. This method includes: acquiring data to be updated of a global knowledge graph, wherein the global knowledge graph is constructed based on entities and relationships extracted from open-source software supply chain data; Based on the entities to be updated in the data to be updated, a local subgraph is determined from the global knowledge graph; Based on the graph neural network model, embedding representation learning is performed on each entity in the local subgraph and the entity to be updated to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph; Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph, the local subgraph is updated, and the global knowledge graph is updated based on the updated local subgraph to obtain the updated global knowledge graph.
[0122] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0123] In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the open-source software graph update method based on local subgraph incremental fusion provided by the above methods, the method including: obtaining the data to be updated of the global knowledge graph, the global knowledge graph being constructed based on entities and relationships extracted from open-source software supply chain data; Based on the entities to be updated in the data to be updated, a local subgraph is determined from the global knowledge graph; Based on the graph neural network model, embedding representation learning is performed on each entity in the local subgraph and the entity to be updated to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph; Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph, the local subgraph is updated, and the global knowledge graph is updated based on the updated local subgraph to obtain the updated global knowledge graph.
[0124] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the open-source software graph update method based on local subgraph incremental fusion provided by the above methods, the method comprising: acquiring data to be updated of a global knowledge graph, wherein the global knowledge graph is constructed based on entities and relationships extracted from open-source software supply chain data; Based on the entities to be updated in the data to be updated, a local subgraph is determined from the global knowledge graph; Based on the graph neural network model, embedding representation learning is performed on each entity in the local subgraph and the entity to be updated to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph. Based on the vector similarity between the embedding vector of the entity to be updated and the embedding vectors of each entity in the local subgraph, the local subgraph is updated, and the global knowledge graph is updated based on the updated local subgraph to obtain the updated global knowledge graph.
[0125] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0126] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0127] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. An open-source software graph update method based on incremental fusion of local subgraphs, characterized in that it comprises: obtaining data to be updated for a global knowledge graph, the global knowledge graph being constructed from entities and relationships extracted from open-source software supply chain data; determining a local subgraph from the global knowledge graph based on the entity to be updated in the data to be updated; performing, based on a graph neural network model, embedding representation learning on each entity in the local subgraph and on the entity to be updated, to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph; and updating the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph, and updating the global knowledge graph based on the updated local subgraph to obtain an updated global knowledge graph.
2. The open-source software graph update method based on incremental fusion of local subgraphs according to claim 1, characterized in that determining the embedding vector of a current entity in the local subgraph comprises: obtaining an initial feature vector of the current entity in the local subgraph; calculating, based on the graph attention mechanism in the graph neural network model, the attention coefficients between the current entity and its neighboring entities in the topology of the local subgraph; aggregating, based on the attention coefficients, the feature information of the neighboring entities to update the initial feature vector of the current entity, thereby obtaining an aggregated feature vector; and iteratively updating the aggregated feature vector until a preset iteration threshold is reached, and taking the aggregated feature vector after iteration as the embedding vector of the current entity.
3. The open-source software graph update method based on incremental fusion of local subgraphs according to claim 1, characterized in that determining a local subgraph from the global knowledge graph based on the entity to be updated in the data to be updated comprises: determining multiple similar entities from the global knowledge graph based on the cosine similarity between the entity to be updated and each entity in the global knowledge graph; and constructing the local subgraph from target entities and the edges connecting the target entities, wherein the target entities include the entity to be updated, the similar entities, the entities that have a second-degree relationship with the entity to be updated, and the entities that have a second-degree relationship with the similar entities.
4. The open-source software graph update method based on incremental fusion of local subgraphs according to claim 1, characterized in that updating the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph comprises: determining, from the entities in the local subgraph, the candidate entity with the highest vector similarity to the entity to be updated; if the vector similarity between the embedding vector of the entity to be updated and the embedding vector of the candidate entity is greater than a similarity threshold, merging the entity to be updated with the candidate entity; and if that vector similarity is less than or equal to the similarity threshold, adding the entity to be updated to the local subgraph as a new entity.
5. The open-source software graph update method based on incremental fusion of local subgraphs according to claim 4, characterized in that merging the entity to be updated with the candidate entity comprises: creating a merged entity in the local subgraph, and linking the candidate entity and the entity to be updated to the merged entity; fusing, based on a preset conflict resolution strategy, the attributes of the entity to be updated and the attributes of the candidate entity, and assigning the fused attributes to the merged entity; and migrating all relationships of the entity to be updated and of the candidate entity to the merged entity.
6. The open-source software graph update method based on incremental fusion of local subgraphs according to claim 5, characterized in that fusing the attributes of the entity to be updated and the attributes of the candidate entity based on a preset conflict resolution strategy comprises at least one of the following: for attributes of the text description type, concatenating the description text of the entity to be updated with the description text of the candidate entity; and for attributes of the category label type, taking the union of the label set of the entity to be updated and the label set of the candidate entity.
7. An open-source software graph update device based on incremental fusion of local subgraphs, characterized in that it comprises: a data-to-be-updated module, configured to acquire data to be updated for a global knowledge graph, the global knowledge graph being constructed from entities and relationships extracted from open-source software supply chain data; a local subgraph determination module, configured to determine a local subgraph from the global knowledge graph based on the entity to be updated in the data to be updated; an embedding vector determination module, configured to perform, based on a graph neural network model, embedding representation learning on each entity in the local subgraph and on the entity to be updated, and to determine the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph; and an update module, configured to update the local subgraph based on the vector similarity between the embedding vector of the entity to be updated and the embedding vector of each entity in the local subgraph, and to update the global knowledge graph based on the updated local subgraph to obtain an updated global knowledge graph.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the open-source software graph update method based on incremental fusion of local subgraphs according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the open-source software graph update method based on incremental fusion of local subgraphs according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the open-source software graph update method based on incremental fusion of local subgraphs according to any one of claims 1 to 6.
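As a non-limiting illustration of the embedding step recited in claim 2, the following minimal Python sketch performs attention-weighted neighbor aggregation for a fixed number of iterations. Several details are assumptions made only for this example and are not taken from the claims: the attention score is a scaled dot product of raw features, each entity also attends to itself, and no learned projection weights are used (a trained graph neural network model would normally supply them). The names `attention_step` and `embed` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into attention coefficients that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_step(features, neighbors):
    """One round of attention-weighted aggregation over the subgraph topology."""
    updated = {}
    for node, h in features.items():
        # Assumption: the entity attends to itself as well as to its
        # neighbors, so an isolated entity keeps its own features.
        group = [node] + list(neighbors.get(node, []))
        # Assumption: scaled dot-product score instead of a learned one.
        scale = math.sqrt(len(h))
        coeffs = softmax([dot(h, features[n]) / scale for n in group])
        updated[node] = [
            sum(c * features[n][d] for c, n in zip(coeffs, group))
            for d in range(len(h))
        ]
    return updated


def embed(features, neighbors, iterations=2):
    """Repeat the aggregation until the preset iteration threshold is reached."""
    for _ in range(iterations):
        features = attention_step(features, neighbors)
    return features
```

In this sketch `iterations` plays the role of the preset iteration threshold, and the returned dictionary maps each entity to its embedding vector.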
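As a non-limiting illustration of the local subgraph determination recited in claim 3, the sketch below selects the most similar entities by cosine similarity and then collects every entity reachable within two hops of the entity to be updated or of a similar entity. Interpreting "second-degree relationship" as "within two hops" is an assumption of this example, as are the undirected adjacency dictionary, the parameter `k`, and the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def top_k_similar(query_vec, vectors, k):
    """Return the k entities of the global graph most similar to query_vec."""
    ranked = sorted(vectors, key=lambda e: cosine(query_vec, vectors[e]),
                    reverse=True)
    return ranked[:k]


def within_hops(adjacency, seeds, hops=2):
    """Collect the seeds plus every entity at most `hops` edges away."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for s in frontier for n in adjacency.get(s, ())} - seen
        seen |= frontier
    return seen


def local_subgraph(adjacency, vectors, new_entity, new_vec, k=1):
    """Build the target entity set and the edges connecting target entities."""
    similar = top_k_similar(new_vec, vectors, k)
    nodes = within_hops(adjacency, [new_entity, *similar], hops=2)
    # Keep only edges whose two endpoints are both target entities.
    edges = {
        tuple(sorted((u, v)))
        for u in nodes
        for v in adjacency.get(u, ())
        if v in nodes
    }
    return nodes, edges
```

Here `adjacency` is assumed to hold the global knowledge graph together with any edges carried by the incoming data; an entity to be updated that has no edges yet simply contributes itself to the target entity set.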
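As a non-limiting illustration of the merge-or-add decision and the attribute fusion recited in claims 4 to 6, the sketch below picks the most similar candidate entity, compares the similarity with a threshold, and either merges or adds the entity to be updated. The graph layout (a dictionary of nodes plus a set of `(source, relation, target)` triples), the `merged_into` relation, the merged-entity naming scheme, the space used when concatenating descriptions, and the default threshold of 0.9 are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def fuse_attributes(new_attrs, old_attrs):
    """Conflict resolution: concatenate descriptions, take the label union.

    Other attribute types are omitted from this sketch.
    """
    texts = [t for t in (old_attrs.get("description"),
                         new_attrs.get("description")) if t]
    labels = set(old_attrs.get("labels", ())) | set(new_attrs.get("labels", ()))
    return {"description": " ".join(texts), "labels": labels}


def update_local_subgraph(graph, embeddings, new_id, new_attrs, new_edges,
                          threshold=0.9):
    """Merge new_id into its best candidate, or add it as a new entity."""
    new_vec = embeddings[new_id]
    # Candidate entity: the existing entity with the highest similarity.
    best = max(graph["nodes"],
               key=lambda n: cosine(new_vec, embeddings[n]), default=None)
    graph["nodes"][new_id] = dict(new_attrs)
    graph["edges"] |= set(new_edges)
    if best is None or cosine(new_vec, embeddings[best]) <= threshold:
        return "added", new_id

    merged_id = f"merged:{best}+{new_id}"  # hypothetical naming scheme
    graph["nodes"][merged_id] = fuse_attributes(new_attrs, graph["nodes"][best])
    originals = {best, new_id}
    migrated = set()
    for src, rel, dst in graph["edges"]:
        # Re-point every relationship of either original to the merged entity.
        src2 = merged_id if src in originals else src
        dst2 = merged_id if dst in originals else dst
        if src2 != dst2:  # self-loops are dropped, including ones made by merging
            migrated.add((src2, rel, dst2))
    # Link both originals to the merged entity so provenance is kept.
    links = {(best, "merged_into", merged_id), (new_id, "merged_into", merged_id)}
    graph["edges"] = migrated | links
    return "merged", merged_id
```

Keeping the two original entities and linking them to the merged entity, rather than deleting them, is one way to read "link the candidate entity and the entity to be updated to the merged entity"; other realizations are equally possible.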