Large-scale complex network node alignment method based on large language model
By synergistically integrating structural embedding models and large language models, a multi-channel similarity joint decision-making framework is constructed, which solves the problems of alignment accuracy and computational cost in large-scale complex networks and achieves efficient and accurate node alignment.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: HANGZHOU NORMAL UNIVERSITY
- Filing Date: 2026-01-28
- Publication Date: 2026-06-19
AI Technical Summary
Existing network alignment techniques have limited information representation capabilities in large-scale complex networks, rely on a single method for similarity determination, are significantly affected by noise and heterogeneity, have high computational costs, and are difficult to guarantee alignment accuracy and scalability.
We employ a collaborative fusion of structural embedding models and large language models to construct an alignment framework that combines rapid structural screening, fine discrimination by large language models, and joint decision-making based on multi-channel similarity. We make decisions by combining structural similarity with prior community information through pseudo-text description and normalized similarity scoring.
It improves alignment accuracy in highly obfuscated scenarios, significantly reduces the calling cost of large language models, ensures system stability and effective control of computing resources, and achieves high-precision node alignment.
Figure CN122241016A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer technology, particularly the field of complex network analysis and cross-network entity alignment, specifically relating to a network alignment method for large-scale complex networks based on a large language model. This method leverages the capabilities of large language models in structural semantic understanding, pattern abstraction, and reasoning discrimination to achieve efficient, accurate, and scalable alignment of nodes representing the same real entity in different large-scale networks.
Background Technology
[0002] With the development of information technology, various types of complex networks, such as social networks, knowledge graphs, biomolecular networks, and transportation and infrastructure networks, are constantly emerging in practical applications. Against this backdrop, cross-network entity alignment has gradually become one of the key fundamental issues in network analysis and data fusion. Network alignment refers to identifying and matching nodes representing the same real object or entity in multiple interconnected networks. For example, in cross-platform social systems, it is necessary to align account nodes belonging to the same user on different platforms to support user profile fusion, abnormal behavior detection, and recommendation services; in the construction of multi-source knowledge graphs, it is necessary to identify nodes describing the same entity in different data sources to achieve knowledge integration and unified retrieval; in the field of bioinformatics, aligning homologous protein nodes in protein-protein interaction networks of different species helps in functional prediction and drug development.
[0003] Existing network alignment techniques mainly rely on the structural features and attribute similarity of nodes. Typical methods include the following categories: one is the combinatorial optimization method based on graph isomorphism or subgraph matching. This type of method theoretically has high alignment accuracy, but the computational complexity increases sharply as the network size increases, making it difficult to apply to real-world large-scale scenarios; another is the alignment method based on vector embedding, which uses random walks, graph neural networks, or linear mappings to map nodes in different networks to the same low-dimensional representation space, and then performs matching based on distance or similarity; and yet another is the alignment method based on community or cluster structure, which first identifies relatively stable groups in each network, and then performs node alignment at the group level and within the group to improve overall robustness.
[0004] However, in real-world applications of large and complex networks, the aforementioned methods still face significant shortcomings. First, their information representation capabilities are limited. Most methods focus on local adjacency relationships or low-dimensional numerical representations, making it difficult to explicitly characterize the comprehensive features of a node's local structural patterns, relationship combinations, and attribute semantics. Furthermore, they struggle to support flexible reasoning and discrimination across networks. Second, similarity determination methods are simplistic. Commonly used metrics such as cosine similarity and Euclidean distance only provide continuous numerical measurements, making it difficult to clearly distinguish between "whether they are the same entity" at the semantic level. This is especially problematic when candidate nodes are highly similar or when structural noise is significant, easily leading to mismatches. Third, noise and heterogeneity have a significant impact. Different networks commonly suffer from structural gaps, redundant connections, inconsistent attribute distributions, and scale differences. Relying solely on structural or numerical features often fails to maintain stable performance. Moreover, in large-scale scenarios, directly applying complex models for detailed discrimination of all node pairs would result in unacceptable computational costs and service overhead, while using only lightweight models struggles to guarantee alignment accuracy.
[0005] In recent years, large language models have demonstrated significant advantages in text understanding, structured information representation, and complex reasoning and decision-making. If the structural relationships and attribute features of network nodes can be transformed into pseudo-textual descriptions that large language models can process, and these descriptions can be used for semantic-level matching and judgment among candidate nodes, it is hoped that the limitations of traditional methods that rely solely on numerical similarity can be overcome. However, existing research largely remains at the stage of simply textualizing node information and directly calling the model, without systematically addressing the following key issues: how to automatically construct compact and discriminative node descriptions from high-dimensional structural features; how to conduct fine-grained evaluations of large language models only for nodes with high uncertainty while controlling model call costs; and how to integrate the output of large language models with various structural similarity and community prior information to form stable and reliable alignment decisions.
[0006] Therefore, there is an urgent need for a large-scale complex network alignment method that deeply integrates large language models with traditional structural alignment mechanisms. This method would fully leverage the advantages of large language models in reasoning, while effectively controlling computational resources and improving alignment accuracy and scalability.
Summary of the Invention
[0007] The purpose of this invention is to provide a method for aligning nodes in large-scale complex networks based on a large language model. This method constructs a pseudo-textual structural description of nodes, aggregates candidate target nodes by source nodes, and uses a large language model to output a normalized node similarity score. It then combines structural similarity with community prior information to make joint decisions, thereby achieving high-precision node alignment between large-scale, noisy, and structurally heterogeneous networks while ensuring alignment accuracy and effectively controlling computational overhead.
[0008] To achieve the above objectives, the network alignment method proposed in this invention integrates structural embedding models and large language models to construct an overall alignment framework of "rapid structural screening—fine discrimination by the large language model—joint decision-making based on multi-channel similarity." The method mainly includes the following steps: First, structural embedding representations of nodes are learned in both the source and target networks, and the structural similarity between nodes is calculated based on these embeddings to generate a candidate node set. Node pairs with high structural similarity and strong confidence are then quickly aligned. Second, the local structural features and attribute information of nodes are compressed and organized to construct a pseudo-text representation suitable for understanding by the large language model. Only for node pairs with low confidence in the structural model, the large language model is called once within their corresponding candidate node set to perform a similarity evaluation, obtaining a normalized large language model similarity score. Finally, the structural similarity, mapping similarity, community similarity, and large language model similarity are jointly fused in a learnable manner to output a one-to-one matching result between source network nodes and target network nodes. Specifically, the method of this invention is as follows:
[0009] Step (1) Construct a network model and calculate structural similarity;
[0010] (1-1) Construct the source and target networks to be aligned: Select two networks to be aligned, one as the source network and the other as the target network. The source network is represented as an undirected graph $G_s = (V_s, E_s)$, where $V_s$ is the node set, $E_s$ is the edge set, and $N = |V_s|$ is the number of nodes in the source network; the target network is represented as an undirected graph $G_t = (V_t, E_t)$, where $V_t$ is the node set, $E_t$ is the edge set, and $M = |V_t|$ is the number of nodes in the target network;
[0011] (1-2) Input the source network into the structure embedding model, perform representation learning on the nodes in the network, and obtain the corresponding source network node embedding matrix $Z_s \in \mathbb{R}^{N \times d}$, where the $n$-th row of $Z_s$ is the embedding vector of source node $v_n$. Likewise, input the target network into the structure embedding model, perform representation learning on its nodes, and obtain the corresponding target network node embedding matrix $Z_t \in \mathbb{R}^{M \times d}$, where the $m$-th row of $Z_t$ is the embedding vector of target node $u_m$. Here $d$ is the embedding dimension and $\mathbb{R}$ denotes the real number field;
[0012] The structural embedding model is used to encode the multi-level topological relationships of nodes, and can comprehensively characterize the local connectivity features and global structural positions of nodes.
[0013] The embedding vectors are normalized, and based on the normalized node embeddings, the structural similarity matrix between the source network and the target network is calculated as $S^{\text{struct}} = \hat{Z}_s \hat{Z}_t^{\top}$, where $\hat{Z}_s$ and $\hat{Z}_t$ are the normalized embedding matrices of the source and target networks and $\top$ denotes the transpose. This matrix is used for subsequent candidate set generation and high-confidence determination.
[0014] Step (2) Generate pseudo-text descriptions for nodes and filter candidate nodes;
[0015] (2-1) Obtaining node structure and feature information:
[0016] For any node $v_i$ in the source network, obtain its set of neighboring nodes $\mathcal{N}(v_i)$ and calculate the node's degree value $\deg(v_i) = |\mathcal{N}(v_i)|$, that is, the number of neighboring nodes. If the source network dataset provides the original feature matrix of the nodes, $X_s$, obtain the feature vector $x_{v_i} \in \mathbb{R}^{F}$ of that node; otherwise, initialize it as a zero vector, where $F$ is the feature dimension;
[0017] For any node $u_j$ in the target network, obtain its set of neighboring nodes $\mathcal{N}(u_j)$ and calculate the node's degree value $\deg(u_j) = |\mathcal{N}(u_j)|$. If the target network dataset provides the original feature matrix of the nodes, $X_t$, obtain the feature vector $x_{u_j}$ of that node; otherwise, initialize it as a zero vector;
[0018] (2-2) Generate node pseudo-text descriptions:
[0019] For each source node $v_i$ and target node $u_j$, the pseudo-text description includes at least the following: node identification information, used to uniquely identify the node; neighbor structure summary information, including the number of neighbor nodes and some neighbor node identifiers; node feature summary information, including approximate numerical representations of several dimensions in the node feature vector; and structure embedding summary information, including approximate numerical representations of several dimensions in the node embedding vector.
[0020] (2-3) Selection and filtering of candidate target nodes based on structural similarity:
[0021] Based on the structural similarity matrix $S^{\text{struct}}$, for each source node $v_i$, the corresponding similarity row vector is sorted, and the top $K$ target nodes with the highest scores are selected to form the candidate target node set of the source node, $C_i = \{u_{j_1}, u_{j_2}, \ldots, u_{j_K}\}$, $|C_i| = K$;
[0022] If the maximum similarity value of source node $v_i$ in the structural similarity matrix, $\max_j S^{\text{struct}}_{ij}$, is greater than or equal to the preset threshold $\tau$, the alignment result of the structural embedding model for that source node is considered to have high confidence, and the alignment is completed directly using the structural similarity result. Otherwise, if the maximum similarity value is less than the preset threshold $\tau$, or the source node $v_i$ is located within the neighborhood of a training anchor point, the large language model is invoked to perform a one-time joint evaluation of the source node $v_i$ and its corresponding candidate target node set $C_i$, and outputs the similarity score vector of the candidate target nodes.
[0023] The similarity score vector is recorded as the large language model similarity information of the source node $v_i$. It is used to reorder the candidate target node set and, after subsequent fusion with structural similarity and other prior information, to determine the final target node alignment result for the source node;
[0024] Step (3) Similarity estimation based on a large language model;
[0025] (3-1) For the selected source node $v_i$ and its corresponding set of candidate target nodes $C_i$, construct a prompt template for reasoning by the large language model;
[0026] (3-2) Large Language Model Invocation and JSON Parsing: Input the prompt template into the large language model (LLM), obtain the output text through dialogue or API call, parse the JSON object from it, and obtain a similarity vector of length $K$, $p_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,K})$, where $p_{i,k} \in [0, 1]$ is the LLM's alignment probability score between source node $v_i$ and the $k$-th candidate node;
[0027] Step (4) Construction and normalization of the similarity matrix of the large language model:
[0028] For all source nodes for which the large language model is called, their similarity vectors are filled into the corresponding positions of the large language model similarity matrix $S^{\text{LLM}}$, that is, $S^{\text{LLM}}_{i, j_k} = p_{i,k}$, where $j_k$ is the index of the $k$-th candidate node in the target network. For node pairs not participating in the large language model evaluation, the entry is set to 0 or to the locally normalized value of structural similarity. Global normalization is then performed to obtain the processed LLM similarity matrix $\tilde{S}^{\text{LLM}}$, in which a very small constant $\epsilon$ is used to prevent division by zero;
[0029] Step (5) Construct a multi-channel similarity joint decision framework, including: the GNN structural similarity $S^{\text{struct}}$, calculated by a multi-layer embedding model; the PALE structure mapping similarity $S^{\text{PALE}}$, obtained from structural embeddings and linear mappings trained through negative sampling; the minimum-cut-based community similarity $S^{\text{comm}}$, obtained by softly partitioning the embedding space, learning soft community assignments for the source and target networks, and then constructing an alignment matrix between communities based on community center similarity and anchor co-occurrence frequency; and the community prior matrix $P^{\text{comm}}$, which, based on the soft community alignment results, gives a prior bonus to node pairs that are in similar communities;
[0030] The above channels and the LLM similarity matrix $\tilde{S}^{\text{LLM}}$ are jointly input into the decision-making model, which employs a linear or nonlinear combination structure with learnable weights and is optimized end-to-end using a ranking loss on the training anchor points. The fused output is a basic similarity matrix $S^{\text{base}}$, on top of which the community prior is added: $S^{\text{final}} = S^{\text{base}} + \lambda P^{\text{comm}}$, where $\lambda$ is the community prior weight;
[0031] Step (6) On the similarity matrix $S^{\text{final}}$ obtained from the final joint decision, perform row-wise maximum selection, choosing for each source node the target node with the highest score as the alignment result.
[0032] Compared with existing network alignment methods that rely solely on structural similarity or simple attribute similarity, the present invention has the following advantages:
[0033] 1. Introducing the LLM's structural understanding and reasoning improves accuracy in highly confusing scenarios. By compressing node local structure, features, and embedding information into natural language pseudo-text and explicitly displaying structural prior scores, the large language model can perform "graph structure understanding and reasoning" between candidate nodes that more closely resembles human intuition, thereby significantly improving alignment accuracy in highly confusing and structurally similar scenarios. On the test dataset used in this invention, the method offers at least a 4% improvement over the version without LLM integration.
[0034] 2. Candidate evaluation within source node groups significantly reduces LLM call costs. Candidates are aggregated by source node and compared within a single prompt, so that source node A is evaluated against multiple candidate nodes Bi in one call, avoiding the explosion in call volume caused by traditional pairwise calls. Combined with structural confidence threshold filtering and a strategy of calling the model only near anchor points, this further reduces the computational and cost overhead of the large language model.
[0035] 3. Robust parsing and fallback mechanism ensure system stability. The system employs multi-strategy parsing on the JSON output from the large language model, coupled with a fallback mechanism based on structural similarity normalization. Even if the large language model output is abnormal, missing, or parsing fails, the system can still provide a reasonable similarity estimate, ensuring the continuous and stable operation of the overall alignment process.
[0036] 4. Multi-channel similarity joint decision-making, balancing global consistency and fine-grained local matching. GNN embedding similarity, PALE structure mapping similarity, minimum cut community alignment similarity, and LLM similarity are integrated into a unified joint decision-making framework. Training is performed using a learnable weighting mechanism combined with ranking loss. This ensures global topological consistency while leveraging the fine-grained discrimination capability of LLM on locally difficult samples, achieving higher overall accuracy.
[0037] 5. Wide applicability and engineering application value. It can be widely applied to complex scenarios such as social network account alignment, multi-source knowledge graph entity alignment, cross-species biological network alignment, and transportation or supply chain network node alignment, providing high-quality basic alignment results for cross-platform recommendation, knowledge fusion, function prediction and risk control, and has significant engineering application value and theoretical innovation significance.
Attached Figure Description
[0038] Figure 1 is a flowchart illustrating the overall framework of the method of the present invention.
Detailed Implementation
[0039] The invention will be further described in detail below with reference to specific embodiments. This embodiment uses the Douban social network dataset for verification. This dataset comes from Douban.com and consists of two sub-networks: Douban Online and Douban Offline, containing 3906 and 1118 user nodes and their corresponding social connections, respectively. There are some overlapping entities belonging to the same user in the online and offline networks. This embodiment aims to use the method proposed in this invention to accurately identify and align node pairs representing the same real user in these two networks. As shown in Figure 1, the specific method is as follows:
[0040] Step (1) Construct a network model and calculate structural similarity;
[0041] (1-1) Construct the source and target networks to be aligned:
[0042] Read the Douban dataset file, using the Douban Online and Douban Offline sub-networks as the source and target networks respectively, and read the G.json files of each network to obtain the network topology; at the same time, read feats.npy to obtain the node attribute feature matrix, which will be used for subsequent pseudo-text description construction and multi-channel fusion discrimination.
[0043] The source network is represented as an undirected graph $G_s = (V_s, E_s)$, where $V_s$ is the node set, $E_s$ is the edge set, and $N = |V_s|$ is the number of nodes in the source network;
[0044] The target network is represented as an undirected graph $G_t = (V_t, E_t)$, where $V_t$ is the node set, $E_t$ is the edge set, and $M = |V_t|$ is the number of nodes in the target network.
[0045] Preprocessing operations are performed on the source and target networks respectively, including: removing self-loop edges and duplicate edges in the network, and retaining only undirected and unweighted edge connections, thereby obtaining a structurally normalized network representation. This allows the preprocessed network to truly reflect the topological characteristics of the original network, providing a unified data foundation for subsequent node embedding learning and alignment calculation.
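As a non-limiting illustration, the preprocessing described above can be sketched as follows. The function name `normalize_edges` and the edge-list input format are assumptions made for this sketch; the embodiment does not specify how the contents of G.json are held in memory.

```python
def normalize_edges(edges):
    """Return sorted unique undirected, unweighted edges with self-loops removed."""
    cleaned = set()
    for u, v in edges:
        if u == v:
            continue  # drop self-loop edges
        # Store each edge in canonical order so (u, v) and (v, u) collapse into one.
        cleaned.add((u, v) if u <= v else (v, u))
    return sorted(cleaned)


# Hypothetical raw edge list containing a reversed duplicate, a self-loop and a repeat.
raw_edges = [(0, 1), (1, 0), (2, 2), (1, 2), (1, 2)]
clean_edges = normalize_edges(raw_edges)
```

Storing each edge in a canonical order is one simple way to make duplicate detection independent of edge direction.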
[0046] (1-2) Input the source network into the structure embedding model, perform representation learning on the nodes in the network, and obtain the corresponding source network node embedding matrix $Z_s \in \mathbb{R}^{N \times d}$, where the $n$-th row of $Z_s$ is the embedding vector of source node $v_n$. Likewise, input the target network into the structure embedding model, perform representation learning on its nodes, and obtain the corresponding target network node embedding matrix $Z_t \in \mathbb{R}^{M \times d}$, where the $m$-th row of $Z_t$ is the embedding vector of target node $u_m$. Here $d$ is the embedding dimension and $\mathbb{R}$ denotes the real number field.
[0047] The structural embedding model is used to encode the multi-level topological relationships of nodes, comprehensively characterizing the local connectivity features and global structural positions of nodes. Preferably, the structural embedding model performs multi-level information propagation on the source and target networks respectively to obtain the node representations at each layer, and obtains the final node embedding through inter-layer weighted convergence. More preferably, the structural embedding model is a graph neural network NAME model that integrates Euclidean space and hyperbolic space representations, and achieves effective expression of the hierarchical structural features of complex networks by introducing attention mechanisms and non-Euclidean geometric constraints.
[0048] The embedding vectors are L2-normalized to eliminate the influence of different dimensional scales on similarity calculation. Based on the normalized node embeddings, the structural similarity matrix between the source and target networks is calculated as $S^{\text{struct}} = \hat{Z}_s \hat{Z}_t^{\top}$, where $\hat{Z}_s$ and $\hat{Z}_t$ are the normalized embedding matrices of the source and target networks and $\top$ denotes the transpose. This matrix is used for subsequent candidate set generation and high-confidence determination.
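As a non-limiting illustration, the row-wise L2 normalization and the resulting structural similarity matrix can be sketched in plain Python. The function names and the toy embeddings are assumptions made for this sketch; a practical implementation at the scale discussed here would use a numerical library rather than nested lists.

```python
import math


def l2_normalize(rows):
    """Scale every embedding vector to unit L2 norm (zero vectors are left unchanged)."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


def structural_similarity(z_src, z_tgt):
    """Entry [n][m] is the dot product of normalized source row n and target row m."""
    zs, zt = l2_normalize(z_src), l2_normalize(z_tgt)
    return [[sum(a * b for a, b in zip(src, tgt)) for tgt in zt] for src in zs]


# Hypothetical 2-dimensional embeddings: 2 source nodes, 3 target nodes.
z_src = [[3.0, 4.0], [1.0, 0.0]]
z_tgt = [[6.0, 8.0], [0.0, 2.0], [1.0, 0.0]]
S = structural_similarity(z_src, z_tgt)
```

Because every row has unit norm after normalization, each entry is the cosine similarity of the two original embedding vectors, so vectors that differ only in scale (such as the first source and first target rows here) receive the maximum score.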
[0049] Step (2) Generate pseudo-text descriptions for nodes and filter candidate nodes;
[0050] Before generating pseudo-text descriptions of nodes and filtering candidate nodes, this embodiment first obtains known aligned node pairs from the ground-truth alignment dictionary provided by the dataset, forming an anchor set. `node,split=0.2.train.dict` is loaded, and its node pairs are divided into training and validation sets for training and parameter selection of the joint decision module; `node,split=0.2.test.dict` is loaded as the test set for final accuracy evaluation. Furthermore, the neighborhood of the anchor nodes (e.g., the set of 2-hop neighbors of anchor nodes) can be calculated from the network topology. It is used by the subsequent filtering strategy, in which membership in an anchor neighborhood triggers large language model reordering and thereby extends its coverage.
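As a non-limiting illustration, the 2-hop anchor neighborhood mentioned above can be computed with a breadth-first expansion. The adjacency-dictionary format, the function name `k_hop_neighborhood`, and the choice to include the anchor nodes themselves in the result are assumptions made for this sketch.

```python
def k_hop_neighborhood(adj, seeds, hops=2):
    """Return all nodes within `hops` edges of any seed node (seeds included)."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Expand one hop, keeping only nodes that have not been reached before.
        frontier = {nb for node in frontier for nb in adj.get(node, ())} - visited
        visited |= frontier
    return visited


# Hypothetical path graph 0-1-2-3-4 with node 0 as the only training anchor.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
anchor_zone = k_hop_neighborhood(adj, seeds=[0], hops=2)
```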
[0051] (2-1) Obtaining node structure and feature information:
[0052] For any node $v_i$ in the source network, obtain its set of neighboring nodes $\mathcal{N}(v_i)$ and calculate the node's degree value $\deg(v_i) = |\mathcal{N}(v_i)|$, that is, the number of neighboring nodes. Read the original feature matrix $X_s$ from the feats.npy file of the source network data and obtain the feature vector $x_{v_i} \in \mathbb{R}^{F}$ of this node, where $F$ is the feature dimension. For any node $u_j$ in the target network, obtain its set of neighboring nodes $\mathcal{N}(u_j)$ and calculate the node's degree value $\deg(u_j) = |\mathcal{N}(u_j)|$. Read the original feature matrix $X_t$ from the feats.npy file of the target network data and obtain the feature vector $x_{u_j}$ of this node.
[0053] (2-2) Generate node pseudo-text descriptions:
[0054] Based on node-specific structural embedding representations, structural statistics, and node feature information, structured pseudo-text descriptions are constructed for each node in the source and target networks for subsequent large language model processing. For each source node $v_i$ and target node $u_j$, the pseudo-text description includes at least the following: node identification information, used to uniquely identify the node; neighbor structure summary information, including the number of neighbor nodes and some neighbor node identifiers; node feature summary information, including approximate numerical representations of several dimensions in the node feature vector; and structure embedding summary information, including approximate numerical representations of several dimensions in the node embedding vector.
[0055] Pseudo-text descriptions do not rely on the true semantic text attributes of nodes. Instead, they compress and map numerical structural and attribute features into structured labels that conform to the form of natural language, enabling large language models to understand and compare the structural patterns and feature combinations of nodes.
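As a non-limiting illustration, a pseudo-text description containing the four kinds of information listed above could be generated as follows. The function name `describe_node`, the truncation limits `max_neighbors` and `dims`, and the three-decimal rounding are assumptions made for this sketch; the wording of the fields follows the example phrasing used for the prompt in this embodiment.

```python
def describe_node(node_id, neighbors, features, embedding, max_neighbors=3, dims=3):
    """Compress a node's structure, features and embedding into one pseudo-text line."""
    shown = ", ".join(str(n) for n in neighbors[:max_neighbors])
    if len(neighbors) > max_neighbors:
        shown += ", ..."  # only some neighbor identifiers are listed
    feat = ", ".join(f"{x:.3f}" for x in features[:dims])
    emb = ", ".join(f"{x:.3f}" for x in embedding[:dims])
    return (
        f"Node ID: {node_id}, Number of Neighbors: {len(neighbors)}, "
        f"List of Neighbor Node IDs: [{shown}], "
        f"Approximate values of the first {dims} dimensions of the node feature vector: {feat}, "
        f"Approximate values of the first {dims} dimensions of the structural embedding vector: {emb}"
    )


# Hypothetical node data for demonstration only.
text = describe_node(137, [3, 19, 27, 41], [0.183, -0.041, 1.207, 0.5], [-0.317, 0.884, 0.129])
```

Truncating the neighbor list and the vectors keeps the description compact, which matters because every candidate's description is placed in the same prompt.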
[0056] (2-3) Selection and filtering of candidate target nodes based on structural similarity:
[0057] Based on the structural similarity matrix $S^{\text{struct}}$, for each source node $v_i$, the corresponding similarity row vector is sorted, and the top $K$ target nodes with the highest scores are selected to form the candidate target node set of the source node, $C_i = \{u_{j_1}, u_{j_2}, \ldots, u_{j_K}\}$, $|C_i| = K$.
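As a non-limiting illustration, selecting the top K candidates from one row of the structural similarity matrix can be sketched with the standard-library `heapq.nlargest`, which avoids fully sorting the row. The function name and the toy scores are assumptions made for this sketch.

```python
import heapq


def top_k_candidates(sim_row, k):
    """Return (target_index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(sim_row), key=lambda pair: pair[1])


# Hypothetical similarity row of one source node against five target nodes.
row = [0.12, 0.86, 0.40, 0.91, 0.05]
candidates = top_k_candidates(row, k=3)
```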
[0058] To reduce the frequency of calls to large language models and control overall computational costs, a source node selection strategy based on structural similarity is introduced. If the maximum similarity value of source node $v_i$ in the structural similarity matrix, $\max_j S^{\text{struct}}_{ij}$, is greater than or equal to the preset threshold $\tau$ (0.90 in this embodiment), the structural embedding model is deemed to have high confidence in aligning the source node, and the structural similarity result is used directly for alignment without invoking the large language model. If the maximum similarity value is less than the preset threshold $\tau$, or the source node $v_i$ is located within the 2-hop neighborhood of a training anchor point, the large language model is invoked to perform a one-time joint evaluation of the source node $v_i$ and its corresponding candidate target node set $C_i$, and outputs the similarity score vector of the candidate target nodes.
[0059] The similarity score vector is recorded as the large language model similarity information of the source node $v_i$. It is used to reorder the candidate target node set and, after subsequent fusion with structural similarity and other prior information, to determine the final alignment result of the target node corresponding to the source node. By introducing structural similarity filtering and threshold filtering mechanisms, this invention can significantly reduce the number of calls to the large language model, effectively reducing computation and service costs while ensuring alignment accuracy.
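As a non-limiting illustration, the selection strategy can be sketched as a routing function that splits source nodes into those aligned directly and those queued for the large language model. The names `needs_llm` and `route_source_nodes` are hypothetical. This sketch also assumes one reading of the condition above: a node inside an anchor neighborhood is sent to the model even when its maximum similarity reaches the threshold.

```python
def needs_llm(sim_row, in_anchor_zone, tau=0.90):
    """True when the structural model is not confident or the node is near an anchor."""
    return max(sim_row) < tau or in_anchor_zone


def route_source_nodes(S, anchor_zone, tau=0.90):
    """Split source nodes into direct structural alignments and an LLM queue."""
    direct, llm_queue = {}, []
    for i, row in enumerate(S):
        if needs_llm(row, i in anchor_zone, tau):
            llm_queue.append(i)
        else:
            # High confidence: take the best-scoring target from the structural model.
            direct[i] = max(range(len(row)), key=row.__getitem__)
    return direct, llm_queue


# Hypothetical 3x2 structural similarity matrix; source node 2 lies near an anchor.
S = [[0.95, 0.10], [0.60, 0.55], [0.20, 0.93]]
direct, llm_queue = route_source_nodes(S, anchor_zone={2})
```

In this toy case only source node 0 is aligned directly; node 1 falls below the threshold and node 2 is queued because of its anchor proximity.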
[0060] Step (3) Similarity estimation based on a large language model;
[0061] (3-1) Design the prompt template structure:
[0062] For the selected source node $v_i$ and its corresponding set of candidate target nodes $C_i$, construct a prompt template for reasoning by the large language model. The prompt template includes the following three types of information:
[0063] I. Source Node Description: Provide the pseudo-text description of the source node $v_i$ in the following form: "Node A Information: Node ID: i, Number of Neighbors: ..., Approximate values of the first d dimensions of the feature vector: ..., Approximate values of the first d dimensions of the structural embedding vector: ...". For example: "Node A Information: Node ID: 137, Number of Neighbors: 12, List of Neighbor Node IDs: [3, 19, 27, ...], Approximate values of the first d dimensions of the node feature vector: 0.183, −0.041, 1.207..., Approximate values of the first d dimensions of the structural embedding vector: −0.317, 0.884, 0.129...".
[0064] II. Candidate Node Descriptions: List in sequence the pseudo-text descriptions of the candidate nodes in $C_i$, denoted B1, B2, ..., BK. For example: "B1 (structural score s1=0.862): Node ID: 905, Number of neighbors: 11, List of neighbor node IDs: [12, 33, 48...], Approximate values of the first d dimensions of the node feature vector: 0.170, −0.055, 1.194...; B2...". The structural score is the numerical value at the corresponding position in the structural similarity matrix, used to explicitly suggest structural priors to the large language model.
[0065] III. Problem and Output Format Constraints: Clearly define the task objective and output format constraints in the prompt template. For example: state that, in a network alignment scenario, the task is to determine whether node A in the source network and candidate node Bk in the target network represent the same real entity; require a similarity score between 0 and 1 for each candidate Bk, where 1 indicates they are definitely the same entity and 0 indicates they are definitely different; remind the large language model that most candidate nodes do not actually correspond to the same entity, so the scores should be conservative; and strictly require that the last line contain only a JSON object whose keys are B1…BK and whose values are the corresponding decimal similarity scores, e.g. {"B1": 0.8, "B2": 0.1, "B3": 0.5…}.
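As a non-limiting illustration, the three parts of the prompt template can be assembled as follows. The function name `build_prompt` and the exact instruction wording are assumptions made for this sketch; only the overall structure (source description, scored candidate list, task and output constraints) is taken from the description above.

```python
def build_prompt(source_text, candidates):
    """Assemble one prompt comparing source node A with candidates B1..BK.

    candidates: list of (structural_score, pseudo_text) pairs, best first.
    """
    k = len(candidates)
    lines = [
        "Task: network alignment. Decide whether node A in the source network and "
        "each candidate node Bk in the target network represent the same real entity.",
        f"Node A Information: {source_text}",
    ]
    for idx, (score, text) in enumerate(candidates, start=1):
        # The structural score is shown explicitly as a prior for the model.
        lines.append(f"B{idx} (structural score s{idx}={score:.3f}): {text}")
    json_shape = ", ".join(f'"B{idx}": <score>' for idx in range(1, k + 1))
    lines += [
        "Give every candidate a similarity score between 0 and 1, where 1 means "
        "definitely the same entity and 0 means definitely different.",
        "Most candidates do not correspond to the same entity, so score conservatively.",
        "The last line must contain only a JSON object of the form {" + json_shape + "}",
    ]
    return "\n".join(lines)


# Hypothetical, shortened pseudo-text descriptions for demonstration only.
prompt = build_prompt(
    "Node ID: 137, Number of Neighbors: 12",
    [(0.862, "Node ID: 905, Number of Neighbors: 11"),
     (0.640, "Node ID: 77, Number of Neighbors: 4")],
)
```

Placing all K candidates in one prompt means one model call per source node instead of K pairwise calls.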
[0066] (3-2) Large Language Model Calling and JSON Parsing:
[0067] In practical implementation, different models can be substituted to trade off deployment and cost. Here, the ChatGPT-4 model is selected for inference scoring. The prompt template is input into the large language model (LLM), and the output text is obtained through dialogue or API calls. A JSON object is then parsed from this text to obtain a similarity vector of length $K$, $p_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,K})$, where $p_{i,k} \in [0, 1]$ is the LLM's alignment probability score between source node $v_i$ and the $k$-th candidate node.
[0068] To improve robustness, the following strategy is adopted during the parsing process: first, attempt to parse the entire last line of the output as JSON; if this fails, extract the JSON substring within curly braces from the text using regular expressions and parse it a second time; if this still fails, use the normalized structural similarity within the candidate set as the fallback score, or assign a neutral value of 0.5. The fallback score is a backfill score: when the large language model fails to produce a valid score, the score at the corresponding position in the structural similarity matrix fills the position that the LLM score would have occupied. For successfully parsed scores, values exceeding 1 (which may indicate a 0-100 scale) are automatically scaled to the 0-1 range, and NaN or infinite values are corrected.
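As a non-limiting illustration, the multi-strategy parsing and fallback can be sketched with the standard `json` and `re` modules. All function names are hypothetical. The sketch makes several assumptions not fixed by the description: non-finite scores are replaced by the neutral value 0.5, the whole vector is divided by 100 when any score exceeds 1, and the structural fallback uses min-max normalization within the candidate set.

```python
import json
import math
import re


def _clean(obj, k):
    """Turn a parsed JSON object with keys B1..Bk into k scores in [0, 1]."""
    scores = [float(obj[f"B{i}"]) for i in range(1, k + 1)]
    finite = [s for s in scores if math.isfinite(s)]
    scale = 100.0 if finite and max(finite) > 1.0 else 1.0  # assume a 0-100 scale
    return [min(1.0, max(0.0, s / scale)) if math.isfinite(s) else 0.5 for s in scores]


def _fallback(k, struct_scores):
    """Min-max normalized structural scores, or neutral 0.5 if unavailable."""
    if not struct_scores:
        return [0.5] * k
    lo, hi = min(struct_scores), max(struct_scores)
    if hi - lo < 1e-12:
        return [0.5] * k
    return [(s - lo) / (hi - lo) for s in struct_scores]


def parse_llm_scores(output_text, k, struct_scores=None):
    """Parse LLM output into k similarity scores, falling back when parsing fails."""
    lines = output_text.strip().splitlines()
    attempts = lines[-1:]                                     # 1) the whole last line
    attempts += re.findall(r"\{[^{}]*\}", output_text)[-1:]   # 2) last {...} substring
    for text in attempts:
        try:
            return _clean(json.loads(text), k)
        except (ValueError, KeyError, TypeError):
            continue  # malformed JSON, missing key or non-numeric value
    return _fallback(k, struct_scores)                        # 3) backfill


ok = parse_llm_scores('Some reasoning...\n{"B1": 0.8, "B2": 0.1}', k=2)
rescued = parse_llm_scores('Answer: {"B1": 80, "B2": 10} done', k=2)
backfilled = parse_llm_scores("no json here", k=2, struct_scores=[0.9, 0.7])
```

The three calls exercise the three paths in order: a clean last line, a regular-expression rescue with rescaling from a 0-100 scale, and the structural backfill.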
[0069] Step (4) Construction and normalization of the similarity matrix of the large language model:
[0070] For every source node for which the large language model was called, its similarity vector is written into the corresponding positions of the large language model similarity matrix: the k-th score is stored in the row of the source node and the column given by the index of the k-th candidate node in the target network. For node pairs not involved in the large language model evaluation, the entry is set to 0 or to the locally normalized value of the structural similarity. Global normalization is then performed to obtain the processed LLM similarity matrix, with a very small constant included to prevent division by zero.
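A possible NumPy sketch of this fill-and-normalize step follows. The exact normalization formula is not legible in this text, so the global min-max form with a small constant `eps` in the denominator is an assumption, as are the function name and the dictionary-based inputs.

```python
import numpy as np


def build_llm_similarity(n_src, n_tgt, llm_scores, candidates, eps=1e-8):
    """Fill LLM scores into an (n_src, n_tgt) matrix and normalize it globally.

    llm_scores: dict mapping source index -> list of K scores
    candidates: dict mapping source index -> list of K target indices
    """
    s_llm = np.zeros((n_src, n_tgt), dtype=float)  # unevaluated pairs stay at 0
    for i, scores in llm_scores.items():
        # Column of the k-th score is the target index of the k-th candidate.
        s_llm[i, candidates[i]] = scores
    lo, hi = s_llm.min(), s_llm.max()
    # Assumed min-max normalization; eps guards against hi == lo.
    return (s_llm - lo) / (hi - lo + eps)
```

Here unevaluated pairs are left at 0; the alternative mentioned above, filling them with locally normalized structural similarity, would replace the zero initialization.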
[0071] Step (5) Multi-channel similarity joint decision:
[0072] By combining other structural channels and community channels, a multi-channel similarity joint decision-making framework is constructed. The channels include: GNN structural similarity, calculated by a multi-layer embedding model; PALE structure mapping similarity, obtained from structural embeddings and a linear mapping trained with negative sampling; community similarity, based on minimum cut, which softly partitions the embedding space to learn soft community assignments for the source and target networks and then constructs an alignment matrix between communities from community center similarity and anchor co-occurrence frequency; and the community prior matrix, which, based on the soft community alignment results, gives a prior bonus to node pairs that lie in similar communities.
[0073] The above channels and the LLM similarity matrix are jointly input into a decision-making model that uses a linear or nonlinear combination structure with learnable weights and is optimized end to end using a ranking loss on the training anchor points. The fused output is a basic similarity matrix. The community prior matrix, scaled by a community prior weight, is then added on top of it to give the final similarity matrix.
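For the linear variant of the fusion, a minimal NumPy sketch might look as follows. Several choices here are assumptions not stated in the description: the weights are parameterized through a softmax over `logits`, the ranking loss is a margin-based hinge loss, and the optimization loop that would update the weights is omitted.

```python
import numpy as np


def fuse_channels(channels, logits, prior, prior_weight):
    """Softmax-weighted sum of channel matrices plus a scaled community prior."""
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()  # weights are positive and sum to 1
    base = sum(wi * ch for wi, ch in zip(w, channels))  # basic similarity matrix
    return base + prior_weight * prior


def anchor_ranking_loss(sim, anchors, margin=0.1):
    """Mean hinge loss pushing each anchor's true target above the other targets in its row."""
    total = 0.0
    for src, tgt in anchors:
        negatives = np.delete(sim[src], tgt)  # scores of all non-matching targets
        total += np.maximum(0.0, margin - sim[src, tgt] + negatives).mean()
    return total / len(anchors)
```

In a trainable version, `logits` and `prior_weight` would be parameters of an autodiff framework and `anchor_ranking_loss` would be minimized over the training anchors; the functions above only show what is being computed.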
[0074] Step (6) Analyze the similarity matrix obtained from the final joint decision: perform row-wise maximum selection, choosing for each source node the target node with the highest score as the alignment result. In engineering deployment, a threshold can be set to filter out low-confidence matches, and when many-to-one conflicts occur, they are resolved by giving priority to the pair with the higher overall score, so that the output satisfies the one-to-one alignment constraint.
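One way to realize thresholding and score-prioritized conflict resolution is a greedy pass over all pairs in descending score order. This is a sketch of one possible interpretation; the description does not say how a source node that loses a conflict is handled, so letting it fall back to its next-best free target is an assumption, as is the function name.

```python
import numpy as np


def select_alignment(sim, threshold=0.0):
    """Greedy one-to-one matching: accept pairs in descending score order."""
    n_src, n_tgt = sim.shape
    order = np.argsort(-sim, axis=None)  # flat indices, highest score first
    used_src, used_tgt, result = set(), set(), {}
    for flat in order:
        i, j = divmod(int(flat), n_tgt)
        if sim[i, j] < threshold:
            break  # every remaining pair scores lower
        if i in used_src or j in used_tgt:
            continue  # conflict: a higher-scoring pair already took this node
        result[i] = j
        used_src.add(i)
        used_tgt.add(j)
    return result
```

Without conflicts and with a threshold of 0 on non-negative scores, this reduces to the row-wise maximum selection of Step (6).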
[0075] To quantitatively evaluate the above effects, alignment quality metrics are used: alignment accuracy (Top-1), Top-K hit rate (Hits@K), and mean reciprocal rank (MRR). Alignment accuracy measures the consistency between the predicted alignment results and the true mapping; it is defined as the fraction of source nodes in the test set whose predicted target node equals the true aligned target node. The Top-K hit rate examines whether the true aligned node falls within the top K candidates; it is defined as the fraction of test-set source nodes whose true aligned target node belongs to the set of the first K target nodes obtained by sorting the source node's row of the comprehensive similarity matrix from highest to lowest similarity. The mean reciprocal rank reflects the relative position of the true aligned target node in this ordering; taking the rank to be the position of the true aligned target node in the sorted row, MRR is the average over the test set of the reciprocal of that rank.
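The three quality metrics can be computed directly from the final similarity matrix. The sketch below uses the standard definitions of Top-1 accuracy, Hits@K, and MRR; the function name and the `(source, true_target)` pair format for the test set are assumptions.

```python
import numpy as np


def alignment_metrics(sim, test_pairs, k=5):
    """Return Top-1 accuracy, Hits@k and MRR over (source, true_target) test pairs."""
    top1 = hits = rr = 0.0
    for src, tgt in test_pairs:
        ranking = np.argsort(-sim[src], kind="stable")  # target indices, best first
        rank = int(np.where(ranking == tgt)[0][0]) + 1  # 1-based rank of the true target
        top1 += rank == 1
        hits += rank <= k
        rr += 1.0 / rank
    n = len(test_pairs)
    return {"top1": top1 / n, f"hits@{k}": hits / n, "mrr": rr / n}
```

Ties in a row are broken by target index because of the stable sort, which is a choice of this sketch rather than something the description specifies.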
[0076] To characterize the cost and efficiency of large language model invocation, invocation metrics are introduced: large language model invocation coverage, average number of large language model invocations per source node, and invocation cost per unit precision improvement. Invocation coverage is the ratio of the number of source nodes that actually invoked the large language model to the total number of source nodes to be aligned. The average number of invocations per source node is the total number of actual calls to the large language model throughout the alignment process divided by the total number of source nodes to be aligned. The baseline Top-1 accuracy is the accuracy on the test set of the method that uses only the structural similarity channel without calling the LLM, and the Top-1 accuracy of this invention is measured after introducing the large language model and performing multi-channel similarity joint decision-making. The overhead per unit precision improvement is defined from the invocation overhead and the difference between these two accuracies.
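The invocation metrics reduce to a few ratios, sketched below. Coverage and average calls follow the description; the formula for cost per unit precision improvement is not legible in this text, so dividing the total number of calls by the absolute Top-1 gain is an assumption, as are the function and parameter names.

```python
def invocation_metrics(n_sources, n_sources_with_llm, n_calls, acc_baseline, acc_full):
    """Coverage, average calls per source node, and calls per unit of Top-1 gain."""
    coverage = n_sources_with_llm / n_sources
    avg_calls = n_calls / n_sources
    gain = acc_full - acc_baseline
    # Assumed definition: total calls divided by the absolute Top-1 improvement.
    cost_per_gain = n_calls / gain if gain > 0 else float("inf")
    return coverage, avg_calls, cost_per_gain
```

Returning infinity when the LLM channel brings no accuracy gain keeps the metric defined; any inputs used with this function are to be supplied from actual experiments.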
Claims
1. A large-scale complex network node alignment method based on a large language model, characterized in that: First, structural embedding representations of nodes are learned in both the source and target networks. Based on these embeddings, the structural similarity between nodes is calculated to generate a candidate node set, and node pairs with high structural similarity and strong confidence are quickly aligned. Second, the local structural features and attribute information of nodes are compressed and organized to construct a pseudo-text representation suitable for understanding by large language models. Only for node pairs on which the structural model has low confidence is a one-time similarity evaluation performed by the large language model within the corresponding candidate node set, yielding a normalized large language model similarity score. Finally, the structural similarity, mapping similarity, community similarity, and large language model similarity are jointly fused in a learnable manner to output a one-to-one matching result between source network nodes and target network nodes.
2. The large language model-based large-scale complex network node alignment method of claim 1, characterized in that the method comprises the following steps:
Step (1) Construct a network model and calculate structural similarity;
(1-1) Construct the source and target networks to be aligned: select two networks to be aligned, one as the source network and the other as the target network; represent the source network as an undirected graph defined by its node set and edge set, where N is the number of nodes of the source network; represent the target network as an undirected graph defined by its node set and edge set, where M is the number of nodes of the target network;
(1-2) Input the source network into the structural embedding model and perform representation learning on its nodes to obtain the source network node embedding matrix, whose n-th row is the embedding vector of the n-th source node; input the target network into the structural embedding model and perform representation learning on its nodes to obtain the target network node embedding matrix, whose m-th row is the embedding vector of the m-th target node; d is the embedding dimension, and the entries of both matrices are real numbers; the structural embedding model encodes the multi-level topological relationships of nodes and can comprehensively characterize their local connectivity features and global structural positions; normalize the embedding vectors and calculate the structural similarity matrix between the source network and the target network as the product of the normalized source embedding matrix and the transpose of the normalized target embedding matrix; this matrix is used for subsequent candidate set generation and high-confidence decision making;
Step (2) Generate pseudo-text descriptions for nodes and filter candidate nodes;
(2-1) Obtain node structure and feature information: for any node in the source network, obtain its set of neighboring nodes and calculate its degree, that is, the number of neighboring nodes; if the source network dataset provides an original node feature matrix, obtain the feature vector of that node, otherwise initialize it as a zero vector, where F is the feature dimension; for any node in the target network, obtain its set of neighboring nodes and calculate its degree; if the target network dataset provides an original node feature matrix, obtain the feature vector of that node, otherwise initialize it as a zero vector;
(2-2) Generate node pseudo-text descriptions: for each source node and target node, the pseudo-text description includes at least the following: node identification information, used to uniquely identify the node; neighbor structure summary information, including the number of neighbor nodes and the identifiers of some neighbor nodes; node feature summary information, including approximate numerical representations of several dimensions of the node feature vector; and structural embedding summary information, including approximate numerical representations of several dimensions of the node embedding vector;
(2-3) Select and filter candidate target nodes based on structural similarity: for each source node, sort the corresponding row of the structural similarity matrix and select the K target nodes with the highest scores to form the candidate target node set of that source node; if the maximum similarity value of the source node in the structural similarity matrix is greater than or equal to a preset threshold, the structural embedding model is considered to have high confidence in aligning the source node, and the structural similarity result is used directly to complete the alignment; if the maximum similarity value is less than the preset threshold, or the source node is located within the neighborhood of a training anchor point, the large language model is invoked to perform a one-time joint evaluation of the source node and its candidate target node set and to output a similarity score vector for the candidate target nodes; this vector is recorded as the large language model similarity information of the source node and is used to reorder the candidate target node set, and the final target node alignment result for the source node is determined through the subsequent fusion with structural similarity and other prior information;
Step (3) Similarity estimation based on a large language model;
(3-1) For the selected source node and its corresponding candidate target node set, construct a prompt word template for large language model reasoning;
(3-2) Large language model invocation and JSON parsing: input the prompt word template into the large language model (LLM), obtain the output text through dialogue or an API call, and parse a JSON object from it to obtain a similarity vector of length K, whose k-th element is the alignment probability score that the LLM assigns to the source node and its k-th candidate node;
Step (4) Construction and normalization of the large language model similarity matrix: for every source node for which the large language model was called, write its similarity vector into the corresponding positions of the large language model similarity matrix, the column of the k-th score being the index of the k-th candidate node in the target network; for node pairs not participating in the large language model evaluation, set the entry to 0 or to the locally normalized value of the structural similarity; perform global normalization to obtain the processed LLM similarity matrix, with a very small constant included to prevent division by zero;
Step (5) Construct a multi-channel similarity joint decision framework, including: GNN structural similarity, calculated by a multi-layer embedding model; PALE structure mapping similarity, obtained from structural embeddings and a linear mapping trained with negative sampling; community similarity, based on minimum cut, which softly partitions the embedding space to learn soft community assignments for the source and target networks and then constructs an alignment matrix between communities from community center similarity and anchor co-occurrence frequency; and the community prior matrix, which, based on the soft community alignment results, gives a prior bonus to node pairs that lie in similar communities; the above channels and the LLM similarity matrix are jointly input into a decision-making model that uses a linear or nonlinear combination structure with learnable weights and is optimized end to end using a ranking loss on the training anchor points; the fused output is a basic similarity matrix, to which the community prior matrix, scaled by a community prior weight, is then added;
Step (6) Analyze the similarity matrix obtained from the final joint decision: perform row-wise maximum selection and select, for each source node, the target node with the highest score as the alignment result.
3. The method for aligning nodes in large-scale complex networks based on a large language model as described in claim 2, characterized in that: In step (1-1), preprocessing operations are performed on the source network and the target network respectively, including: deleting self-loop edges and duplicate edges in the network, retaining only undirected and unweighted edge connections, and obtaining a structurally normalized network representation.
4. The method for aligning nodes in large-scale complex networks based on a large language model as described in claim 2, characterized in that: The structural embedding model described in step (1-2) adopts the NAME model, a graph neural network that integrates Euclidean space and hyperbolic space representations; by introducing an attention mechanism and non-Euclidean geometric constraints, it effectively expresses the hierarchical structural features of complex networks.
5. The method for aligning nodes in large-scale complex networks based on a large language model as described in claim 2, characterized in that: In step (2-2), the pseudo-text description does not rely on the real semantic text attributes of the nodes, but compresses and maps numerical structural features and attribute features into structured labels that conform to the form of natural language, so that the large language model can understand and compare the structural patterns and feature combinations of the nodes.
6. The method for aligning nodes in large-scale complex networks based on a large language model as described in claim 2, characterized in that: The prompt word template includes the following three types of information content: I. Source Node Description: provides the pseudo-text description of the source node; II. Candidate Node Descriptions: lists, in sequence, the pseudo-text descriptions of the candidate nodes; III. Problem and Output Format Constraints: clearly states the task objective and the output format constraints in the prompt word template.
7. The method for aligning nodes in large-scale complex networks based on a large language model as described in claim 2, characterized in that: In step (3-2), to improve robustness, the parsing process adopts the following strategy: first, attempt to parse the entire last line of the output as JSON; if this fails, use a regular expression to extract the JSON substring enclosed in curly braces from the text and parse it a second time; if this still fails, use the normalized structural similarity within the candidate set as the fallback score, or assign a neutral value of 0.5; the fallback score is a backfill score, meaning that when the large language model fails to generate a valid score, the score at the corresponding position in the structural similarity matrix is used to fill the position that should have been filled by the LLM score; for successfully parsed scores, if the scores exceed 1, they are automatically rescaled to the 0-1 range, and NaN or infinite values are corrected.