A method and apparatus for generating a propagation graph of information

By constructing a directed acyclic graph and pruning non-essential users, and combining breadth-first search with graph-level and edge-level generation models, the problems of low accuracy and efficiency in propagation graph generation are solved. The generated propagation graph more accurately reflects the connection relationships between users and is suitable for complex information systems.

CN117456037B · Active · Publication Date: 2026-06-19 · NORTHWESTERN POLYTECHNICAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NORTHWESTERN POLYTECHNICAL UNIV
Filing Date: 2023-12-01
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for generating propagation graphs have low accuracy and efficiency, and do not include user feature information, which limits their application scenarios.

Method used

The method collects propagation data to construct a directed acyclic graph, prunes non-important users, sorts users with breadth-first search (BFS), constructs an adjacency matrix, and trains the graph propagation model with graph-level and edge-level generative models. Encoding and prediction use a Set2Set model and a variational autoencoder, and training combines user features with topological structure.

Benefits of technology

It improves the training efficiency and accuracy of graph propagation models, and the generated propagation graphs more accurately reflect the connection relationships between users, adapting to changes in complex information systems.

✦ Generated by Eureka AI based on patent content.

Figure CN117456037B_ABST

Abstract

This invention provides a method and apparatus for generating an information propagation graph. The method includes: collecting propagation data; constructing a directed acyclic graph (DAG) based on the propagation data; pruning non-important users from the constructed DAG to obtain a pruned DAG with non-important users removed; sorting all users on the DAG according to importance to obtain a corresponding user index sequence, and constructing an adjacency matrix based on the user index sequence; training a graph propagation model based on the adjacency matrix to obtain a trained graph propagation model; and inputting new propagation data into the trained graph propagation model to obtain the propagation graph corresponding to the new propagation data. This invention improves the training efficiency and accuracy of the graph propagation model.

Description

Technical Field

[0001] This invention relates to the field of information dissemination technology, and in particular to a method and apparatus for generating information dissemination graphs.

Background Technology

[0002] In today's digital age, information dissemination is widespread and rapid. To better understand and analyze this dissemination process, studying how information spreads among users has become crucial. Currently, information diffusion in information systems is often modeled as a stochastic process on a propagation network. The topology of this network is a complex network composed of nodes and edges. Nodes typically represent individuals or entities, and edges represent the propagation relationships between them—the actual paths of information dissemination. Furthermore, many downstream tasks, such as information source localization and misinformation detection, require large amounts of propagation network topologies as training sets to drive the discovery of potential propagation patterns. However, in real-world scenarios, monitoring and acquiring realistic propagation network structures requires significant resources, resulting in relatively small datasets. Using small datasets for training can limit the model's generalization ability, increase the risk of overfitting, and reduce its robustness. Moreover, due to privacy and security concerns, obtaining publicly available real-world propagation network data is becoming increasingly difficult. To address this challenge, the goal of propagation graph generation methods is to simulate and model the information dissemination process to generate realistic propagation network topologies.

[0003] Currently, propagation graph generation methods are mainly divided into two categories: simulation methods based on information propagation models and graph generation methods based on deep learning. However, regardless of the method, current propagation graph generation methods suffer from low accuracy, small generation scale, slow generation speed, and lack of user feature information, thus limiting the application scenarios of the generated propagation graphs.

Summary of the Invention

[0004] This invention provides a method and apparatus for generating an information propagation graph, to address the shortcomings of existing technologies in both the efficiency and the accuracy of propagation graph generation.

[0005] This invention provides a method for generating an information propagation graph, comprising:

[0006] S1: Collect propagation data, which includes users, user characteristics, and propagation relationships between users;

[0007] S2: Construct a directed acyclic graph G, G = (V, E, F), based on the propagation data and initialize it; where V represents the user set, E represents the set of propagation relationships between users, and F represents user features;

[0008] S3: Prune the non-important users from the constructed directed acyclic graph G to obtain a pruned directed acyclic graph G' with the non-important users removed; the non-important users are users who only receive information from the propagation source and do not forward it again;

[0009] S4: Use Breadth-First Search (BFS) to sort all users on the directed acyclic graph G' according to their importance, obtain a user index sequence, and construct an adjacency matrix based on the user index sequence;

[0010] S5: Train the graph propagation model based on the adjacency matrix to obtain the trained graph propagation model;

[0011] S6: Input the new propagation data into the trained graph propagation model to obtain the propagation graph corresponding to the new propagation data.

[0012] According to a method for generating an information propagation graph provided by the present invention, step S3 includes:

[0013] Step 31: Extract a portion of the propagation data and identify the non-important users within that portion;

[0014] Step 32: Train the non-important user prediction model based on the extracted non-important users to obtain the trained non-important user prediction model;

[0015] Step 33: Input the remaining propagation data into the trained non-important user prediction model to predict the corresponding non-important users;

[0016] Step 34: Remove all non-important users from the directed acyclic graph G to obtain the pruned directed acyclic graph G'.

[0017] According to the information propagation graph generation method provided by the present invention, step 32 includes:

[0018] The non-important user prediction model is trained using the following model:

[0019]

[0020]

[0021] Here, Pooling(·) encodes the directed acyclic graph network G using a Set2Set model, and Nonlinear(·) is a nonlinear model; G is the directed acyclic graph of the propagation network, with the complete form G = (V, E, F), where V represents the user set, E represents the set of propagation relationships between users, and F represents user features; $f_\theta(G)$ is the predicted number of non-important users in the directed acyclic graph G of the propagation network; the remaining two symbols of the formula denote, respectively, the number of non-important users and the relationship edges directly connected to non-important users.

[0022] According to the information propagation graph generation method provided by the present invention, step S4 includes:

[0023] S41: Obtain the user features corresponding to all users on the directed acyclic graph G', the user features including: the number of followers of the user, the number of posts forwarded by the user, the number of accounts the user follows, the registration time of the user, and whether the user is verified;

[0024] S42: Normalize the feature dimensions of each user to obtain the normalized feature dimensions;

[0025] S43: Sort the feature dimensions of each user according to their importance using a chi-square distribution to obtain the sorted feature dimensions of each user and the weight of each feature;

[0026] S44: Determine the importance I(v) of all users based on the sorted feature dimensions and corresponding weights of each user;

[0027] S45: Determine the user index sequence based on the importance I(v) of all users.

[0028] According to the information propagation graph generation method provided by the present invention, step S5 includes:

[0029] Train the graph propagation model using the following formula:

[0030]

[0031]

[0032] Here, $f_G$ is the graph-level generative model and $f_E$ is the edge-level generative model; $h_i$ is the latent vector of the sequence model at the i-th intermediate step, i.e. the low-dimensional vector corresponding to the currently generated directed acyclic graph; F is the user features corresponding to the currently generated directed acyclic graph. One further symbol of the formula denotes the adjacency probability information of user $v_i$, corresponding to the i-th row of the adjacency matrix; another denotes the one-hot decoding of that probability information, in which the maximum value is set to 1 and all other values are set to 0; $h_{i+1}$ is the low-dimensional vector corresponding to the next directed acyclic graph generated from the current one.

[0033] According to the information propagation graph generation method provided by the present invention, the graph-level generative model $f_G$ is trained using the following formula:

[0034] $\log q_\phi(z \mid \mathrm{GRU}(x^{(k)})) = \log \mathrm{Exp}(z; \lambda^{(k)})$

[0035] Here, z is the latent variable of the variational autoencoder for the low-dimensional encoding vector $h_i$ of the current directed acyclic graph; the GRU, as a time-series processing model, determines the parameter of the exponential distribution of the latent variable z, conditioned on the relationship information between the newest user and its neighbors; $q_\phi(z \mid \mathrm{GRU}(x^{(k)}))$ is the approximate posterior distribution of z, parameterized by φ; $\lambda^{(k)}$ is the rate parameter of the exponential distribution; $\log \mathrm{Exp}(z; \lambda^{(k)})$ is the natural logarithm of the probability density function of the exponential distribution with rate parameter $\lambda^{(k)}$; $x^{(k)}$ is the k-th row of the transposed adjacency matrix.
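For reference, the right-hand side of [0034] can be expanded with the standard density of the exponential distribution, $\mathrm{Exp}(z; \lambda) = \lambda e^{-\lambda z}$ for $z \ge 0$. This expansion is textbook material, not a statement taken from the patent:

```latex
\log q_\phi\!\left(z \mid \mathrm{GRU}(x^{(k)})\right)
  = \log \mathrm{Exp}\!\left(z; \lambda^{(k)}\right)
  = \log \lambda^{(k)} - \lambda^{(k)} z,
  \qquad z \ge 0,\ \lambda^{(k)} > 0 .
```

Written this way, the log-density is linear in z, and the GRU only has to output the single positive rate $\lambda^{(k)}$ per step.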

[0036] According to the information propagation graph generation method provided by the present invention, the edge-level generative model $f_E$ is trained using the following formulas:

[0037] $Q_m = \sigma(W_q \cdot \mathrm{query}_m + b_q)$

[0038]

[0039]

[0040] $\mathrm{likelihood}_{mn} = \mathrm{score}_{mn} \cdot \mathrm{value}_n$

[0041] Here, $\mathrm{query}_m$ is the comprehensive feature information of the new user m relative to the previously generated directed acyclic graph. It is formed by concatenating the new user's feature information $F_m$ with the low-dimensional vector $h_i$ of the directed acyclic graph: $\mathrm{query}_m = \mathrm{cat}(F_m, h_i)$; $\mathrm{key}_n$ is the feature information of historical user n; $\mathrm{score}_{mn}$ indicates the degree of attribute matching between historical user n and current user m; $\mathrm{likelihood}_{mn}$ is the probability that a propagation relationship exists between historical user n and current user m; $W_q$, $b_q$, $W_k$, $b_k$ are the parameters the model learns during training; $Q_m$ is the output of the linear transformation of $\mathrm{query}_m$ with weight parameter $W_q$ and bias parameter $b_q$; $K_n$ is the output of the linear transformation of the feature information of historical user n with weight parameter $W_k$ and bias parameter $b_k$; $\mathrm{value}_n$ is a vector of size 1×|V| with the n-th position set to 1 and all other positions set to 0, where |V| is the number of nodes in the directed acyclic graph.
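The definitions in [0041] can be sketched in plain Python. The source text does not reproduce the formulas for $K_n$ and $\mathrm{score}_{mn}$, so this sketch assumes $K_n = \sigma(W_k \cdot \mathrm{key}_n + b_k)$ by analogy with $Q_m$, and assumes $\mathrm{score}_{mn}$ is a softmax over the dot products $Q_m \cdot K_n$. Both are illustrative guesses, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine_sigmoid(weight, bias, vector):
    """sigma(W . x + b), with the weight matrix given as a list of rows."""
    return [sigmoid(sum(w * x for w, x in zip(row, vector)) + b)
            for row, b in zip(weight, bias)]


def edge_likelihoods(f_m, h_i, keys, w_q, b_q, w_k, b_k, num_nodes):
    """Probability vector (length |V|) of edges between new user m and each historical user."""
    query_m = list(f_m) + list(h_i)                      # query_m = cat(F_m, h_i)
    q_m = affine_sigmoid(w_q, b_q, query_m)              # Q_m
    # ASSUMPTION: K_n mirrors Q_m, i.e. sigma(W_k . key_n + b_k)
    k_all = [affine_sigmoid(w_k, b_k, key_n) for key_n in keys]
    # ASSUMPTION: score_mn is a softmax over n of the dot product Q_m . K_n
    logits = [sum(q * k for q, k in zip(q_m, k_n)) for k_n in k_all]
    peak = max(logits)                                   # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # likelihood_mn = score_mn * value_n; value_n is one-hot at position n,
    # so summing over n writes score_mn into slot n of a length-|V| vector.
    likelihood = [0.0] * num_nodes
    for n, score_mn in enumerate(scores):
        likelihood[n] += score_mn
    return likelihood
```

The function expects at least one historical user in `keys`; positions beyond the current number of historical users stay at zero.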

[0042] According to the present invention, a method for generating a propagation graph of information is provided, wherein the graph propagation model is trained based on the following loss function:

[0043] $\mathrm{loss} = \mathrm{loss}_{bce} + \alpha \cdot \mathrm{loss}_{KL}$

[0044]

[0045]

[0046] Here, $\mathrm{loss}_{bce}$ is the reconstruction error loss function; $\mathrm{loss}_{KL}$ is the KL divergence loss function; $A_G$ is the adjacency matrix of the directed acyclic graph G, uniquely determined by the user importance I(v); a further symbol denotes the predicted probability of the corresponding edge of G; y indicates whether the edge exists in the adjacency matrix $A_G$; α is the weight factor; $\lambda^{(k)}$ is the rate parameter of the exponential distribution predicted by the variational autoencoder E-VAE; z is the latent representation generated by the E-VAE.
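A minimal sketch of the combined loss in [0043]. The source text does not reproduce the two component formulas, so this uses the standard binary cross-entropy and the closed-form KL divergence between two exponential distributions. The unit-rate prior `prior_rate=1.0` and the averaging over entries are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def bce_loss(targets, probabilities, eps=1e-12):
    """Mean binary cross-entropy between 0/1 edge labels y and predicted edge probabilities."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


def kl_exponential(rate, prior_rate=1.0):
    """Closed form of KL( Exp(rate) || Exp(prior_rate) ); the prior is an assumption."""
    return math.log(rate / prior_rate) + prior_rate / rate - 1.0


def total_loss(targets, probabilities, rates, alpha):
    """loss = loss_bce + alpha * loss_KL, with the KL term averaged over the predicted rates."""
    kl = sum(kl_exponential(r) for r in rates) / len(rates)
    return bce_loss(targets, probabilities) + alpha * kl
```

With this choice the KL term vanishes exactly when the predicted rate equals the prior rate, so α controls how strongly the predicted rates $\lambda^{(k)}$ are pulled toward the prior.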

[0047] The present invention also provides an information propagation graph generation apparatus, comprising:

[0048] A collection unit is used to collect propagation data, which includes users, user characteristics, and propagation relationships between users;

[0049] The construction unit is used to construct and initialize a directed acyclic graph G, G = (V, E, F), based on the propagation data; where V represents the user set, E represents the set of propagation relationships between users, and F represents user features.

[0050] The pruning unit is used to prune non-important users from the constructed directed acyclic graph G, and obtain a directed acyclic graph G' after pruning to remove non-important users;

[0051] The processing unit is used to sort all users on the directed acyclic graph G' according to importance using breadth-first search (BFS) to obtain a corresponding user index sequence.

[0052] The construction unit is also used to construct an adjacency matrix based on the user index sequence;

[0053] The training unit is used to train the graph propagation model based on the adjacency matrix to obtain the trained graph propagation model;

[0054] The prediction unit is used to input new propagation data into the trained graph propagation model to obtain the propagation graph corresponding to the new propagation data.

[0055] The information propagation graph generation method and apparatus provided by this invention improves the training efficiency of the graph propagation model by pruning non-important users; by using breadth-first search (BFS) to sort all users on the directed acyclic graph G' according to importance to obtain a user index sequence, and constructing an adjacency matrix based on the user index sequence to train the graph propagation model, the trained graph propagation model has a unique representation, thus improving the accuracy of the model.

Attached Figure Description

[0056] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0057] Figure 1 is a flowchart of the information propagation graph generation method provided by the present invention;

[0058] Figure 2 illustrates the process, provided by the present invention, of generating a propagation graph based on real-world propagation data.

Detailed Implementation

[0059] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0060] Figure 1 is a flowchart of the information propagation graph generation method provided by the present invention. As shown in Figure 1, the method includes the following steps:

[0061] S1: Collect propagation data, which includes users, user characteristics, and propagation relationships between users.

[0062] Specifically, propagation data is collected from social media platforms and used as input: propagation events for different content are collected, covering the characteristics of nearly 1 million users and the propagation relationships between those users.

[0063] S2: Construct a directed acyclic graph G, G = (V, E, F) based on the propagation data and initialize it; where V represents the user set, E represents the set of propagation relationships between users, and F represents user features.

[0064] Specifically, a directed acyclic graph G, G = (V, E, F), is constructed and initialized as follows: After the user relationships are input, the actual relationships are mapped to the directed acyclic graph G, where V is the set of vertices corresponding to users on the social media platform, E is the set of edges, and the connection between two users indicates that they have a propagation relationship during the propagation process. F represents the user's characteristic attributes, including the number of followers, the number of reposts, the number of followings, the registration time, and whether the user is verified.
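One way to hold G = (V, E, F) in memory is sketched below, assuming the propagation records arrive as (sender, receiver) pairs together with a per-user feature dictionary. The names `PropagationGraph`, `build_graph`, and `is_acyclic` are hypothetical and do not come from the patent; the acyclicity check uses Kahn's algorithm as one possible choice.

```python
from dataclasses import dataclass, field


@dataclass
class PropagationGraph:
    """G = (V, E, F): users, directed propagation edges, user features."""
    users: set = field(default_factory=set)        # V
    edges: set = field(default_factory=set)        # E, as (sender, receiver) pairs
    features: dict = field(default_factory=dict)   # F, user -> feature dict


def build_graph(edge_records, user_features):
    """Map raw propagation records onto a directed graph and verify it is acyclic."""
    g = PropagationGraph()
    for sender, receiver in edge_records:
        g.users.update((sender, receiver))
        g.edges.add((sender, receiver))
    # Users without recorded features get an empty feature dict.
    g.features = {u: user_features.get(u, {}) for u in g.users}
    if not is_acyclic(g):
        raise ValueError("propagation records contain a cycle")
    return g


def is_acyclic(g):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed in topological order."""
    indegree = {u: 0 for u in g.users}
    children = {u: [] for u in g.users}
    for sender, receiver in g.edges:
        indegree[receiver] += 1
        children[sender].append(receiver)
    ready = [u for u, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for c in children[u]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return removed == len(g.users)
```

For example, `build_graph([("s", "a"), ("a", "c")], {"s": {"followers": 10}})` yields a graph with three users and two edges.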

[0065] S3: Prune the non-important users from the constructed directed acyclic graph G to obtain a pruned directed acyclic graph G' with the non-important users removed; the non-important users are users who only receive information from the propagation source and do not forward it again.

[0066] Specifically, this invention prunes non-essential users from the directed acyclic graph G, which reduces redundant information in the pruned directed acyclic graph G' and improves the training efficiency of the graph propagation model in the later stages.

[0067] S4: Use Breadth-First Search (BFS) to sort all users on the directed acyclic graph G' according to their importance, obtain a corresponding user index sequence, and construct an adjacency matrix based on the user index sequence.

[0068] Specifically, an adjacency matrix can represent the unique connections between all users in a directed acyclic graph G'. This invention uses an adjacency matrix to reflect the connections between users, providing a structured way for the model to capture user interaction patterns. When the model is trained, it will be able to identify which users have stronger connections with other users, thereby predicting possible paths for information propagation in the information system. Therefore, this invention trains a graph propagation model by constructing an adjacency matrix.

[0069] This invention obtains a user index sequence by sorting all users on a directed acyclic graph G' according to their importance. This ensures that during the later training of the graph propagation model, each user corresponds to only one unique index, thereby improving the training efficiency of the graph propagation model. Furthermore, by sorting all users according to their importance, the model training process can prioritize users who play a key role in the propagation process, thereby improving the accuracy of the model.

[0070] This invention uses Breadth-First Search (BFS) to sort all users on a directed acyclic graph G' according to their importance. Since BFS can consider the propagation relationship starting from the source node layer by layer, it ensures that the network is traversed in a layer-by-layer order. Furthermore, within each layer, by considering the user index sequence, it fully considers the sequential relationship of the social influence importance of users within each layer, avoiding omissions or biases in relationships. This ensures the uniqueness of the adjacency matrix and more accurately reflects the true position and importance of users in the propagation graph.
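The dependence of the adjacency matrix on the user index sequence can be illustrated with a short sketch. The function name `adjacency_matrix` and the edge-list input format are assumptions made for illustration.

```python
def adjacency_matrix(edges, index_sequence):
    """Build a 0/1 adjacency matrix whose rows and columns follow index_sequence."""
    position = {user: i for i, user in enumerate(index_sequence)}
    n = len(index_sequence)
    matrix = [[0] * n for _ in range(n)]
    for sender, receiver in edges:
        matrix[position[sender]][position[receiver]] = 1
    return matrix
```

The same graph produces different matrices under different orderings: with edges `[("s", "a"), ("a", "c")]`, the sequence `["s", "a", "c"]` and the sequence `["s", "c", "a"]` place the ones in different cells. Fixing one ordering per graph is what makes the matrix representation unique.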

[0071] S5: Train the graph propagation model based on the adjacency matrix to obtain the trained graph propagation model.

[0072] S6: Input the new propagation data into the trained graph propagation model to obtain the propagation graph corresponding to the new propagation data.

[0073] The information propagation graph generation method provided by this invention improves the training efficiency of the graph propagation model by pruning non-important users; by using breadth-first search (BFS) to sort all users on the directed acyclic graph G' according to importance to obtain a user index sequence, and constructing an adjacency matrix based on the user index sequence to train the graph propagation model, the trained graph propagation model has a unique representation, thus improving the model's accuracy.

[0074] Furthermore, the following describes in detail how non-important users are pruned from the constructed directed acyclic graph G. The procedure includes the following steps:

[0075] Step 31: Extract a portion of the propagation data and identify the non-important users within that portion;

[0076] Step 32: Train the non-important user prediction model based on the extracted non-important users to obtain the trained non-important user prediction model;

[0077] Step 33: Input the remaining propagation data into the trained non-important user prediction model to predict the corresponding non-important users;

[0078] Step 34: Remove the non-important users predicted in Step 33 and the non-important users extracted in Step 31 from the directed acyclic graph G to obtain the pruned directed acyclic graph G'.

[0079] Specifically, this invention uses the graph pooling technique Set2Set to encode the topology and user features of the directed acyclic graph network G = (V, E, F) into corresponding encoding vectors; these encoding vectors are then used to train the non-important user prediction model $f_\theta(G)$. During training, the model is trained using the following formula:

[0080]

[0081]

[0082] Here, Pooling(·) encodes the directed acyclic graph network G using a Set2Set model, and Nonlinear(·) is a nonlinear model; G is the directed acyclic graph of the propagation network, with the complete form G = (V, E, F), where V represents the user set, E represents the set of propagation relationships between users, and F represents user features; $f_\theta(G)$ is the predicted number of non-important users in the directed acyclic graph G of the propagation network (a mapping function from graph topology to a number of users); the remaining two symbols of the formula denote, respectively, the number of non-important users and the relationship edges directly connected to non-important users.

[0083] That is, the present invention counts, in different directed acyclic graphs G, the number of users who receive information only from the propagation source and do not forward it again. Here, v denotes a user with an out-degree of 0 that is directly connected to the propagation source. Letting s be the propagation source node, there is an edge from s to v, i.e. (s, v) ∈ E.
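When the full edge set is available, the definition above can be checked directly. The sketch below assumes edges are (sender, receiver) pairs and reads "only receives information from the propagation source" as "the source is the user's only sender". The function names are hypothetical, and this is a direct structural check rather than the learned prediction model $f_\theta(G)$.

```python
def find_non_important_users(edges, source):
    """Users whose only sender is `source` and whose out-degree is 0."""
    parents, out_degree = {}, {}
    for sender, receiver in edges:
        parents.setdefault(receiver, set()).add(sender)
        out_degree[sender] = out_degree.get(sender, 0) + 1
    return {
        user
        for user, senders in parents.items()
        if senders == {source} and out_degree.get(user, 0) == 0
    }


def prune(edges, source):
    """Return the edges of G' plus the set of removed non-important users."""
    removed = find_non_important_users(edges, source)
    kept = [(s, r) for s, r in edges if s not in removed and r not in removed]
    return kept, removed
```

Note that a leaf user who received the message from someone other than the source (user `c` in `[("s", "a"), ("a", "c")]`) is kept, since it lies on a multi-hop propagation path.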

[0084] This invention designs a nonlinear model $f_\theta(G)$ to predict the number of non-important users. First, graph pooling is used to encode the directed acyclic graph G, transforming the vertex and edge information (V, E) of network G into lower-dimensional vector information. Then, based on the encoded low-dimensional vectors, the number of users who only receive information from the propagation source and do not forward it is predicted. Pooling(·) is a graph pooling technique used to encode the propagation graph G directly; this invention employs a Set2Set model to implement the pooling operation on graph G. Nonlinear(·) is a nonlinear model, for which models with aggregation capabilities, such as Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), can be used. This invention uses a backpropagation neural network to train and predict the number of users.

[0085] The method provided by this invention encodes the topology and user features of the directed acyclic graph network G = (V, E, F) using a Set2Set model to obtain corresponding encoding vectors. This method has the beneficial effect of improving prediction accuracy and reducing algorithm complexity when training prediction models for non-critical users. The principle behind this beneficial effect is as follows:

[0086] (1) Topology considerations: The topology of a directed acyclic graph contains the interaction relationships between users in the network. These relationships are very valuable in many practical scenarios, such as information systems and information dissemination. When these interaction relationships are encoded as vectors, they provide the model with rich contextual information, thereby helping the model capture those patterns that may be ignored but play a key role in predicting non-important users.

[0087] (2) Encoding user features: Each user may have multiple features associated with them, such as interests, behaviors, attributes, etc. Combining these features with the network topology can generate a comprehensive representation, enabling the model to evaluate users from multiple dimensions and further improve the accuracy of predictions.

[0088] (3) Utilization of comprehensive information: By comprehensively encoding the topology and user characteristics, the model can gain a more comprehensive understanding of information flow and user behavior patterns in the network. This not only improves the accuracy of predictions but also helps the model remain robust when facing complex and changing network structures.

[0089] Furthermore, the following section provides a detailed description of how to obtain a unique user index sequence, which includes the following steps:

[0090] S41: Obtain the user features F corresponding to all users in the pruned directed acyclic graph G', wherein the user features include: the number of followers of the user, the number of posts forwarded by the user, the number of accounts the user follows, the registration time of the user, and whether the user is verified.

[0091] S42: Normalize the feature dimensions of each user to obtain the normalized feature dimensions;

[0092] S43: Sort the feature dimensions of each user according to importance using a chi-square distribution to obtain the sorted feature dimensions {f′1, f′2, ...} and the weights {w′1, w′2, ...} of each feature, where w′1 represents the normalized weight of f′1, w′2 represents the normalized weight of f′2, and so on, and w′1>w′2>....;

[0093] S44: Determine the importance I(v) of all users based on the ranked feature dimensions and corresponding weights of each user.

[0094] S45: Determine the user index sequence based on the importance I(v) of all users.

[0095] Specifically, the present invention determines the user index sequence in the following manner:

[0096] S451: Initialize the variables required by the BFS algorithm as follows:

[0097]

[0098] Here, the empty set and the empty list `[]` denote initialization. `root` is the source user of the propagation network; `visited` is the set of users that have been visited starting from the propagation source; `curLevel` holds the neighbor users in the current layer, i.e. the n-th hop neighbors (users reached when the same message is sent from the propagation source and propagated through n users); `bfsSequence` is the user index sequence, initialized to an empty list.

[0099] S452: Evaluate the importance I(v) of each user v, then sort the users in the current access layer curLevel by importance, yielding the current-layer users in order of importance.

[0100] S453: Initialize the variable nextLevel to an empty set. It holds the users that still need to be visited after all visits in the current layer have been completed.

[0101] S454: Traverse each user $v_i$ in the current level curLevel. When user $v_i$ has not yet been visited, update the variables required by the BFS algorithm as follows:

[0102] (1) Add user $v_i$ to the visited collection to mark it, i.e., visited.add(v_i);

[0103] (2) Append the user, which is the most important one not yet processed, to the importance-sorted sequence, i.e., bfsSequence.add(v_i);

[0104] (3) If the user has neighbors in the next level, mark those neighbors so that they are visited after the current level (curLevel) is completed, i.e., nextLevel.add(Neighbor_list(v_i)).

[0105] Here, add(·) is the operation for adding items to a list, and Neighbor_list(v_i) is the list of neighbors of node $v_i$;

[0106] S455: After all visits in the current layer curLevel are completed, update the access variables for the next round before visiting the next layer of users, including:

[0107] curLevel←nextLevel, i.e., replace the current level with the users to be visited in the next level;

[0108] Set nextLevel to an empty set;

[0109] S456: Check the termination condition. If it is met, output the unique user sequence representation bfsSequence of the directed acyclic graph G; otherwise, return to S452 and continue execution.
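Steps S451 to S456 can be sketched as follows. The source text does not reproduce the termination condition, so this sketch assumes the loop stops when no users remain in the current level. Breaking importance ties by user identifier is a further assumption added to keep the output deterministic. `children` maps each user to the users it propagated to, and all names are hypothetical.

```python
def bfs_sequence(children, importance, root):
    """Layer-by-layer traversal; within a layer, users are visited in descending importance."""
    visited = set()                 # S451: initialization
    bfs_seq = []
    cur_level = [root]
    while cur_level:                # S456 (assumed stop condition: nothing left to visit)
        # S452: sort the current layer by importance, ties broken by user id
        cur_level = sorted(set(cur_level), key=lambda v: (-importance[v], v))
        next_level = []             # S453
        for v in cur_level:         # S454
            if v not in visited:
                visited.add(v)
                bfs_seq.append(v)
                next_level.extend(children.get(v, []))
        cur_level = next_level      # S455
    return bfs_seq
```

Because both the layer order and the within-layer order are fixed, the same graph and the same importance scores always give the same sequence.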

[0110] The following section details how to determine the importance I(v) of all users based on the ranked feature dimensions {f′1, f′2, ...} and {w′1, w′2, ...}:

[0111] Specifically, first obtain the ranked features {f′1, f′2, ...} and weights {w′1, w′2, ...} of all users; then weight the features {f′1, f′2, ...} by {w′1, w′2, ...}. For a given user $v_i$, obtain the normalized value $f'_1(v_i)$ corresponding to each feature f′1 of the user, and compute the feature-based importance score of the user as $I(v_i) = w'_1 f'_1(v_i) + w'_2 f'_2(v_i) + \cdots$. Finally, determine the importance score for all users in this way.
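The weighted sum in [0111] can be sketched as below. The patent only says the features are normalized, so min-max scaling is an assumption here. The weights are taken as given inputs, standing in for the output of the chi-square ranking step. The function names and feature keys are hypothetical.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def importance_scores(features, weights):
    """I(v) = sum over features of weight * normalized feature value.

    features: user -> {feature name: raw value}
    weights:  feature name -> weight (assumed already normalized)
    """
    users = sorted(features)
    scores = {u: 0.0 for u in users}
    for name, w in weights.items():
        normalized = min_max_normalize([features[u][name] for u in users])
        for u, f in zip(users, normalized):
            scores[u] += w * f
    return scores
```

For instance, with two users, weights `{"followers": 0.75, "reposts": 0.25}`, and user `b` leading on followers while user `a` leads on reposts, `b` scores 0.75 and `a` scores 0.25.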

[0112] This invention uses user characteristic dimensions and weights to determine the importance of all users, which can improve decision-making efficiency and accuracy. The reasons are as follows: (1) Quantitative measurement: By ranking, a clear and quantitative metric is provided for each user, which is more accurate than simply relying on intuition or subjective judgment. (2) Consideration of global information: Ranking considers information about all users in the system, not just information about a single user or a small group. This means that the decision-making of this invention is based on a global perspective of the entire system, rather than on a local or biased perspective. (3) Flexibility: Although ranking can be based on multiple metrics, once a metric is selected, the method provides a unified and consistent way to assess user importance. (4) Ease of interpretation and communication: The ranked user list is easy to interpret and communicate. Other teams or stakeholders can easily understand which users are considered the most important and why.

[0113] Furthermore, the present invention trains the graph propagation model according to the following formula:

[0114]

[0115]

[0116] Here, $f_G$ is the graph-level generative model and $f_E$ is the edge-level generative model. $h_i$ is the low-dimensional vector corresponding to the currently generated directed acyclic graph, that is, the latent vector of the sequence model at the i-th intermediate layer. F is the user features corresponding to the currently generated directed acyclic graph. The prediction at step i can be understood as the predicted probability of a connection between user $v_i$ on the currently generated directed acyclic graph and all historical users; from the perspective of the input data, it is the adjacency probability information of user $v_i$, corresponding to the i-th row of the adjacency matrix. Its dimension is 1×|V|, where |V| is the number of nodes in the directed acyclic graph, which is also the row or column length of the adjacency matrix. One-hot decoding is then applied to this prediction: the maximum value is set to 1 and all other values are set to 0, which converts the adjacency probability information of user $v_i$ in the i-th row into the 0/1 entries of a standard adjacency matrix. $h_{i+1}$ is the low-dimensional vector corresponding to the next directed acyclic graph generated from the currently generated one.
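The one-hot decoding step can be illustrated with a short sketch. The variable name theta_i and the probability values are hypothetical, and ties are resolved by taking the first maximum, which is an assumption.

```python
def one_hot_decode(probabilities):
    """Set the largest entry to 1 and every other entry to 0."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return [1 if position == best else 0 for position in range(len(probabilities))]


# Hypothetical adjacency probabilities of user v_i against four users.
theta_i = [0.1, 0.6, 0.05, 0.25]
row_i = one_hot_decode(theta_i)
print(row_i)
```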

[0117] Specifically, $f_G$ focuses on the graph-level information of the generation process, while $f_E$ focuses on the connection information between the latest user and historical users during the generation process.

[0118] The information propagation graph generation method provided by this invention uses the graph-level generation model $f_G$ and the edge-level generation model $f_E$. Training the graph propagation model with the above formulas makes training more efficient and accurate.

[0119] Furthermore, $f_G$, the graph-level generative model of the generation process, includes a gated recurrent unit (GRU) and a variational autoencoder (VAE) module. $f_G$ is trained using the following formula:

[0120] $\log q_\phi(z \mid \mathrm{GRU}(x^{(k)})) = \log \mathrm{Exp}(z; \lambda^{(k)})$

[0121] Where z is the latent variable, based on the variational autoencoder, of the low-dimensional encoding vector $h_i$ of the current directed acyclic graph; it follows an exponential distribution, a choice derived from feature analysis of information system propagation data. The GRU, as a time series processing model, determines the parameters of the exponential form of the latent variable z, conditioned on the relationship information between the latest user and its neighbors. $q_\phi(z \mid \mathrm{GRU}(x^{(k)}))$ is the approximate posterior distribution of the latent variable z, parameterized by $\phi$. $\lambda^{(k)}$ is the rate parameter of the exponential distribution, and $\log \mathrm{Exp}(z; \lambda^{(k)})$ is the natural logarithm of the probability density function of the exponential distribution with rate parameter $\lambda^{(k)}$. $x^{(k)}$ is the in-degree information of the adjacency matrix input to the GRU time series model at step k (i.e., the k-th row of the transposed adjacency matrix).

[0122] Specifically, this invention employs an exponential variational autoencoder (E-VAE) to establish a latent space vector representation of the graph structure. $f_G$ is a graph-level generation model that combines the functionality of a gated recurrent unit (GRU) and a variational autoencoder (VAE). In this model, when new user and neighbor relationship information is received, the GRU first processes and encodes this information. Then, the VAE takes the GRU's encoded output and further extracts latent structural information from the graph using a variational method with an exponential distribution.
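For reference, the log-density of an exponential distribution is log Exp(z; λ) = log λ − λz for z ≥ 0. The sketch below evaluates it and draws a sample by inverting the cumulative distribution function. It stands in for the E-VAE only loosely: the rate value is a hypothetical placeholder for what the GRU would output, and the sampling scheme is an assumption, not something stated in the text.

```python
import math
import random


def log_exp_pdf(z, rate):
    """Natural log of the exponential density: log(rate) - rate * z, for z >= 0."""
    if z < 0 or rate <= 0:
        raise ValueError("z must be non-negative and rate must be positive")
    return math.log(rate) - rate * z


def sample_exponential(rate, rng):
    """Draw z ~ Exp(rate) by inverse-CDF sampling: z = -log(1 - u) / rate."""
    u = rng.random()  # u lies in [0, 1), so 1 - u is never zero
    return -math.log(1.0 - u) / rate


rng = random.Random(0)
rate_k = 2.0  # hypothetical rate; in the model it would come from the GRU output
z = sample_exponential(rate_k, rng)
log_q = log_exp_pdf(z, rate_k)
print(z, log_q)
```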

[0123] The method provided by this invention, by employing the above formula, can produce more efficient, accurate, and robust results for training graph-level generation models. The principle behind these beneficial effects lies in:

[0124] (1) Multi-level modeling: By combining graph-level generation models and edge-level generation models, this invention can simultaneously consider the global structure and local connectivity patterns of the entire graph. This hierarchical approach enables the model to capture complex social interactions and propagation patterns.

[0125] (2) Enhanced representation learning: By training using the corresponding formulas mentioned above, the model can more effectively capture the relationships between nodes and edges, thereby obtaining richer and more discriminative feature representations. This provides a strong foundation for subsequent propagation prediction and analysis.

[0126] (3) Adaptability: Since the model is trained based on actual information system data, it can adapt to different propagation patterns and trends, thus maintaining efficiency and accuracy in the ever-changing information system environment.

[0127] (4) Modular design: The graph-level and edge-level generation models can be optimized and adjusted separately, making the entire system more flexible. Depending on the specific application scenario, these two models can be fine-tuned or replaced to achieve the best results.

[0128] (5) Computational efficiency: Through effective training strategies and algorithms, the training and inference processes of the model can be greatly accelerated, meeting the needs of large-scale information systems.

[0129] Furthermore, $f_E$ is the edge generation model of the generation process. Based on the current network topology representation $h_i$ and the current topological context, $f_E$ represents the connection density between users with unique attributes. The edge-level generative model $f_E$ is trained using the following formulas:

[0130] $Q_m = \sigma(W_q \cdot \mathrm{query}_m + b_q)$

[0131]

[0132]

[0133] $\mathrm{likelihood}_{mn} = \mathrm{score}_{mn} \cdot \mathrm{value}_n$

[0134] Here, $\mathrm{query}_m$ is the comprehensive feature information of the new user m with respect to the previously generated directed acyclic graph; it is synthesized from the new user's feature information $F_m$ and the low-dimensional vector $h_i$ corresponding to the directed acyclic graph, $\mathrm{query}_m = \mathrm{cat}(F_m, h_i)$. $\mathrm{key}_n$ is the feature information of historical user n. $\mathrm{score}_{mn}$ is the degree to which historical user n and current user m match in terms of attributes. $\mathrm{likelihood}_{mn}$ is the probability that a propagation relationship exists between historical user n and current user m. $W_q$, $b_q$, $W_k$ and $b_k$ are parameters that the model learns during training. $Q_m$ is the output of the linear transformation of $\mathrm{query}_m$ with weight parameter $W_q$ and bias parameter $b_q$; it is generally referred to as the query in the attention mechanism. $K_n$ is the output of the linear transformation of the feature information of historical user n with weight parameter $W_k$ and bias parameter $b_k$; it is generally referred to as the key in the attention mechanism. $\mathrm{value}_n$ is a vector of size 1×V in which the n-th position is 1 and all other positions are 0, where V is the number of nodes in the directed acyclic graph, which is also the row or column length of the adjacency matrix; it is generally referred to as the value in the attention mechanism.

[0135] Specifically, this invention trains the edge-level representation $f_E$ of the generation process of graph G based on the graph-level encoding information. A user attribute attention mechanism evaluates the probability of a directed edge forming between the new user and each existing user, which gives the propagation probability between the new user and all users in the current topology graph. The user with the highest propagation probability is selected from the current topology graph and linked to the new user, and the current topology graph is updated with the new user's link relationships. The newly generated edges are then integrated into the next graph generation sequence, so that the process can repeat. In detail: $f_E$ is the edge generation model of the generation process; based on the current network topology representation $h_i$ and the current topological context, it represents the connection density between users with unique attributes. The likelihood of a propagation relationship between users is calculated with the user attribute attention mechanism. This invention generates a query and a key for each user. For a new user m, $\mathrm{query}_m = \mathrm{cat}(F_m, h_i)$ is synthesized from its feature information $F_m$ and the current network state $h_i$, and represents the comprehensive feature information of new user m under the currently generated network structure. For a historical user n, the key is composed of that user's features and represents the user's individual characteristics. The queries and keys thus represent the features of new and historical users under the current network structure. $Q_m$ and $K_n$ are obtained through a non-linear transformation, which ensures that the query and key have consistent dimensions. These queries and keys are then used to calculate a score that measures how well the attributes of new user m and historical user n match. Finally, to determine the probability of a propagation relationship between two users, this invention combines the score with the value vector to calculate $\mathrm{likelihood}_{mn}$, which represents the possibility that a propagation relationship exists between historical user n and current user m, taking into account user characteristics and the current network structure. $W_q$, $b_q$, $W_k$ and $b_k$ are parameters that the model learns during training; they determine how queries and keys are generated from user attributes and the current network state. By adjusting these parameters, the model learns to better capture and represent propagation relationships between users.
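The attention computation described above can be sketched as follows. Several points are assumptions rather than statements from the text: $K_n$ is taken to have the same form as $Q_m$, the score is taken to be a softmax over dot products of $Q_m$ and $K_n$, and the value vectors have one position per historical user instead of one per node. All weights and feature values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine_sigmoid(weight, bias, vector):
    """sigma(W . v + b), with the weight matrix given as a list of rows."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weight, bias)
    ]


def softmax(values):
    shift = max(values)
    exps = [math.exp(value - shift) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


def edge_likelihood(F_m, h_i, keys, W_q, b_q, W_k, b_k):
    query_m = F_m + h_i  # query_m = cat(F_m, h_i)
    Q_m = affine_sigmoid(W_q, b_q, query_m)
    K = [affine_sigmoid(W_k, b_k, key_n) for key_n in keys]  # assumed form of K_n
    # ASSUMPTION: score_mn is a softmax over n of the dot product Q_m . K_n.
    scores = softmax([sum(q * k for q, k in zip(Q_m, K_n)) for K_n in K])
    # likelihood_mn = score_mn * value_n, where value_n is one-hot at position n.
    count = len(keys)
    likelihood = [0.0] * count
    for n, score in enumerate(scores):
        value_n = [1.0 if j == n else 0.0 for j in range(count)]
        likelihood = [acc + score * v for acc, v in zip(likelihood, value_n)]
    return likelihood


# Hypothetical new user (2 features), graph state (1 dimension), 3 historical users.
F_m, h_i = [1.0, 0.0], [0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q, b_q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]
W_k, b_k = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

likelihood = edge_likelihood(F_m, h_i, keys, W_q, b_q, W_k, b_k)
parent = max(range(len(likelihood)), key=likelihood.__getitem__)
print(likelihood, parent)
```

The index stored in parent is the historical user that would be linked to the new user before the next generation step.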

[0136] The method provided by this invention, by employing the above formula, can produce beneficial effects such as higher connection accuracy, better model generalization ability, and faster training convergence speed for training edge-level generative models. The principle behind these beneficial effects is as follows:

[0137] (1) Local Feature Capture and Enhanced User Relationship Learning: The edge-level generative model uses an attention mechanism that integrates current global information and user features to focus on the connections between nodes in the graph. This means that it can capture the local features and interaction patterns between nodes in greater detail. For some applications, such as recommendation systems or link prediction, this fine-grained local structure recognition is crucial.

[0138] (2) Model complexity and data matching: In graph structures, the number of edges is usually much greater than the number of nodes. Edge-level models are designed specifically for this situation, so their parameters and structure are better adapted to this data distribution, thereby improving efficiency and accuracy.

[0139] (3) Faster convergence speed: Since the edge generation model mainly focuses on connections, its parameters and structure are relatively few, which enables the model to converge to the optimal solution faster during training.

[0140] (4) Model interpretability: By analyzing the parameters and output of the edge generation model, the connection patterns and feature importance in the graph can be understood more intuitively, thereby improving the interpretability of the model.

[0141] The graph propagation model described herein is trained based on the following loss function:

[0142] $\mathrm{loss} = \mathrm{loss}_{bce} + \alpha \cdot \mathrm{loss}_{KL}$

[0143]

[0144]

[0145] Where $\mathrm{loss}_{bce}$ is the reconstruction error loss function; $\mathrm{loss}_{KL}$ is the KL divergence loss function; $A_G$ is the adjacency matrix of the directed acyclic graph G, its unique representation based on user importance I(v); the prediction for $G_\Phi$ gives the predicted probability of the corresponding edge; y indicates whether an edge exists in the adjacency matrix $A_G$; α is a weighting factor; λ is the rate parameter of the exponential distribution predicted by the E-VAE output; and z is the latent representation generated by the E-VAE.

[0146] Specifically, to enable the graph generation model to successfully learn and generate graph structures, this invention designs a loss function (loss) and optimizes the model's parameters during training until the model converges. The loss function consists of two parts:

[0147] Reconstruction error $\mathrm{loss}_{bce}$: this part of the loss function measures the difference between the graph generated by the model and the actual graph. Specifically, this invention calculates the difference between the edge-existence probability predicted by the model and the actual presence of edges, which allows the model to better reproduce the actual graph structure.

[0148] KL divergence $\mathrm{loss}_{KL}$: this part of the loss function measures the difference between the latent representation in the model and the expected distribution. In other words, this invention requires the representation within the model to follow a specific distribution, and the KL divergence helps enforce this. Here, z is the latent representation generated by the E-VAE. To balance the two parts, this invention introduces the weighting factor α; adjusting it controls which part of the loss the model emphasizes, thereby improving the model's accuracy.
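A minimal sketch of such a two-part loss is given below. It assumes that the reconstruction term is the mean binary cross-entropy over edge entries and that the KL term compares the predicted exponential posterior with an exponential prior, for which KL(Exp(λ_q) ‖ Exp(λ_p)) = log(λ_q/λ_p) + λ_p/λ_q − 1. The prior rate, the averaging, and all numeric values are hypothetical choices.

```python
import math


def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 edge labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def kl_exponential(rate_q, rate_p):
    """KL(Exp(rate_q) || Exp(rate_p)) = log(rate_q / rate_p) + rate_p / rate_q - 1."""
    return math.log(rate_q / rate_p) + rate_p / rate_q - 1.0


def total_loss(y_true, y_prob, rate_q, rate_prior, alpha):
    """loss = loss_bce + alpha * loss_KL."""
    return bce_loss(y_true, y_prob) + alpha * kl_exponential(rate_q, rate_prior)


# Hypothetical adjacency row, predicted probabilities, and rates.
y_row = [0, 1, 0, 0]
p_row = [0.1, 0.7, 0.1, 0.1]
loss = total_loss(y_row, p_row, rate_q=1.5, rate_prior=1.0, alpha=0.5)
print(loss)
```

Raising alpha pulls the predicted rate toward the prior; lowering it lets the reconstruction term dominate.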

[0149] The present invention also provides an information system-driven propagation graph generation device, including:

[0150] A collection unit is used to collect propagation data, which includes users, user characteristics, and propagation relationships between users;

[0151] The construction unit is used to construct and initialize a directed acyclic graph G, G = (V, E, F), based on the propagation data; where V represents the user set, E represents the set of propagation relationships between users, and F represents user features.

[0152] The pruning unit is used to prune non-important users from the constructed directed acyclic graph G, and obtain a directed acyclic graph G' after pruning to remove non-important users;

[0153] The processing unit is used to sort all users on the directed acyclic graph G' according to importance using breadth-first search (BFS) to obtain a corresponding user index sequence.

[0154] The construction unit is also used to construct an adjacency matrix based on the user index sequence;

[0155] The training unit is used to train the graph propagation model based on the adjacency matrix to obtain the trained graph propagation model;

[0156] The prediction unit is used to input new propagation data into the trained graph propagation model to obtain the propagation graph corresponding to the new propagation data.

[0157] Figure 2 illustrates the graph generation stage of the information propagation graph generation method of this invention. The graph generation process (blue line) primarily focuses on the topological structure of a directed acyclic graph. This process employs an exponential variational autoencoder (E-VAE) to establish a latent space vector representation of the graph topology. Based on this fundamental topology, the edge generation process (green line) evaluates the similarity between potential new users and each existing user. In the evaluation, an attention mechanism (purple line) focusing on user attributes is used to assess the likelihood of a directed edge forming between a new user and each existing user, and the newly generated edges are integrated into the next graph generation sequence.

[0158] Table 2 Performance Evaluation of Different Methods

[0159]

[0160] Table 2 presents the performance evaluation of different methods based on the MMD metric. The MMD metric is widely used in graph generation to assess the similarity between two data distributions and is crucial for evaluating the quality of the generated graph structure; a smaller MMD value indicates that the generated graphs are closer to the real ones. In the table, DAVA denotes the method proposed in this invention, and time denotes the hours required for the model to generate a single graph with 100, 1000, and 10000 nodes, respectively. The symbol "-" indicates that the model failed to generate graphs of the corresponding scale, and bold values represent the best results. DAVA achieves the best results on all test datasets; compared with the best baseline, D-VAE, it reduces MMD² by 40% on average and increases the generation scale by two orders of magnitude. The superiority of the propagation graph generation method provided by this invention can be attributed to four key factors: (1) By using statistical analysis of real-world data as prior knowledge, the E-VAE is better able to represent the latent space of real-world topology. (2) The constructed relational network provides historical prior knowledge of user relationships. (3) Through an attention mechanism based on user feature representation, the model dynamically focuses on the likelihood that edges exist in the topology. (4) The model takes into account the credibility erosion effect in information system propagation and incorporates it into the generation process. DAVA can generate large-scale propagation graphs in less time for two reasons: (1) The BFS ordering based on node importance creates a strongly relevant context for each node, allowing efficient sequential generation with a lightweight sliding window, without attending to the global context during graph autoregression. (2) The interpretable attention mechanism allows for a lightweight module design, significantly reducing the parameters required for the various attention processes. The above experiments show that the propagation directed acyclic graphs generated by DAVA are closer to reality, larger in scale, and faster to generate.
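To make the MMD comparison concrete, the sketch below computes a biased estimate of squared MMD between two one-dimensional samples using a Gaussian kernel. The kernel, its bandwidth, and the use of node degree as the compared statistic are assumptions for illustration; the text does not state which graph statistics or kernel were used in Table 2.

```python
import math


def gaussian_kernel(x, y, sigma):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two 1-D samples."""

    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical degree samples from a real graph and two generated graphs.
real_degrees = [1, 1, 2, 3, 5]
close_degrees = [1, 2, 2, 3, 5]
far_degrees = [8, 9, 9, 10, 12]

print(mmd_squared(real_degrees, close_degrees))
print(mmd_squared(real_degrees, far_degrees))
```

A generated sample whose statistics resemble the real ones yields a value near zero, which is the sense in which smaller MMD is better.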

[0161] Table 3 Source detection accuracy of localization methods for different groups on different training sets.

[0162]

[0163] Table 3 illustrates how the propagation graph generation method provided by this invention improves downstream tasks. Many downstream tasks in the field of information propagation, such as influence maximization, misinformation detection, information diffusion analysis, and propagation source localization, rely on propagation data from real-world scenarios. To verify that the generated propagation graph data can serve downstream tasks, a comparative experiment was conducted on an information platform using two advanced propagation source localization algorithms, GCNSI and TGASI. In the experiment, the original group trained a localization model on 90% of the propagation data and tested it on the remaining 10%. For comparison, the enhancement group generated an additional 1000 realistic propagation graphs for training using DAVA or other state-of-the-art methods, while the control group simulated an additional 1000 propagation snapshots based on the SI, SIR, IC, and LT models. Experimental results show that using simulated data from traditional information propagation models decreases the performance of downstream tasks in real-world propagation scenarios, indicating that traditional information propagation models have limited relevance to real-world tasks. Conversely, augmentation with realistically generated data can significantly improve training results, especially with propagation data generated by the DAVA method proposed in this invention, which outperforms other state-of-the-art methods. This experiment further emphasizes the importance of propagation graph generation and the excellent performance of the method provided in this invention.

[0164] Table 4 illustrates the impact of each module in DAVA on graph generation performance, demonstrating their necessity. The key modules of the proposed DAVA method include E-VAE, the user relationship attention mechanism, the loss function, and a decay mechanism based on the credibility erosion effect (CEE). Performance comparison experiments on the Twitter16 dataset using DAVA and its variants reveal that removing or replacing key modules leads to reduced similarity, decreased quality, reduced generation size, and increased generation time in the generated propagation graph. This experiment underscores the essential role of the modules included in the proposed method in graph generation models, providing valuable assistance in generating realistic propagation graph data.

[0165] Table 4. Performance evaluation of DAVA variant models based on MMD metric

[0166]

[0167]

[0168] In summary, the method provided by this invention can learn the underlying distribution of propagation data in real information systems and generate realistic, larger-scale propagation graph data. Analysis of the CCDF values of propagation graph features and user features in information systems shows that most features follow an exponential distribution and that a credibility erosion effect is present in information systems. These findings are integrated into the graph generation model as prior knowledge, which significantly improves the quality of the generated propagation graph data while reducing the complexity of the algorithm. This allows the method to generate larger-scale information system propagation graphs at a lower cost.

[0169] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for generating an information propagation graph, characterized in that it includes:

S1: collecting propagation data, which includes users, user characteristics, and propagation relationships between users;

S2: constructing and initializing a directed acyclic graph G = (V, E, F) based on the propagation data, where V represents the set of users, E represents the set of propagation relationships between users, and F represents user features;

S3: pruning non-important users from the constructed directed acyclic graph G to obtain a pruned directed acyclic graph G' with non-important users removed; the non-important users are users that only receive information from the propagation source and do not forward it any further;

S4: using breadth-first search (BFS) to sort all users on the directed acyclic graph G' according to importance to obtain a user index sequence, and constructing an adjacency matrix based on the user index sequence;

S5: training the graph propagation model based on the adjacency matrix to obtain the trained graph propagation model;

S6: inputting new propagation data into the trained graph propagation model to obtain the propagation graph corresponding to the new propagation data;

Step S5 includes: the graph propagation model is trained using the following formula:

wherein $f_G$ is the graph-level generation model; $f_E$ is the edge-level generation model; $h_i$ is the hidden vector of the sequence model corresponding to the currently generated directed acyclic graph at the i-th intermediate layer; F is the user features corresponding to the currently generated directed acyclic graph; the prediction at step i is the adjacency probability information of user $v_i$ corresponding to the i-th row of the adjacency matrix; one-hot decoding is performed on this prediction, setting the maximum value to 1 and all other values to 0; $h_{i+1}$ is the low-dimensional vector corresponding to the next directed acyclic graph generated from the currently generated directed acyclic graph;

the graph propagation model is trained based on the following loss function:

wherein $\mathrm{loss}_{bce}$ is the reconstruction error loss function; $\mathrm{loss}_{KL}$ is the KL divergence loss function; $A_G$ is the unique adjacency matrix of the directed acyclic graph G based on user importance I(v); the predicted probability is that of the corresponding edge; y indicates whether an edge exists in the adjacency matrix $A_G$; α is a weighting factor; λ is the rate parameter of the exponential distribution predicted by the output of the exponential variational autoencoder (E-VAE); and z is the latent representation generated by the E-VAE.

2. The information propagation graph generation method according to claim 1, characterized in that step S3 includes:

Step 31: extracting a portion of the propagation data and identifying the non-important users within that portion;

Step 32: training the non-important user prediction model based on the identified non-important users to obtain the trained non-important user prediction model;

Step 33: inputting the remaining propagation data into the trained non-important user prediction model to predict the corresponding non-important users;

Step 34: removing all non-important users from the directed acyclic graph G to obtain the directed acyclic graph with non-important users removed, i.e. the pruned directed acyclic graph G'.

3. The information propagation graph generation method according to claim 2, characterized in that step 32 includes: the non-important user prediction model is trained using the following model:

wherein the directed acyclic graph network G is encoded by the adopted encoding model, and the encoding is passed to a nonlinear model; G represents the directed acyclic graph of the propagation network, whose complete form is G = (V, E, F), where V represents the set of users, E represents the set of propagation relationships between users, and F represents user features; the remaining terms denote, respectively, the predicted number of non-important users in the directed acyclic graph G of the propagation network, the number of non-important users, and the relationship edges directly connected to non-important users.

4. The information propagation graph generation method according to claim 1, characterized in that step S4 includes:

S41: obtaining the user characteristics corresponding to all users on the directed acyclic graph G', the user characteristics including: the number of followers of the user, the number of posts forwarded by the user, the number of accounts the user follows, the registration time of the user, and whether the user is verified;

S42: normalizing the feature dimensions of each user to obtain the normalized feature dimensions;

S43: sorting the feature dimensions of each user according to importance by using the chi-square distribution, to obtain the sorted feature dimensions {f′1, f′2, ...} and the weight {w′1, w′2, ...} corresponding to each feature;

S44: determining the importance I(v) of all users according to the sorted feature dimensions and the corresponding weights;

S45: determining the user index sequence according to the importance of all users.

5. The information propagation graph generation method according to claim 1, characterized in that the graph-level generation model is trained using the following equation:

$\log q_\phi(z \mid \mathrm{GRU}(x^{(k)})) = \log \mathrm{Exp}(z; \lambda^{(k)})$

wherein z is the latent variable, based on the variational autoencoder, of the low-dimensional encoding vector $h_i$ of the current directed acyclic graph; the GRU, as a time series processing model, determines the parameters of the exponential form of the latent variable z, conditioned on the relationship information between the latest user and its neighbors; $q_\phi(z \mid \mathrm{GRU}(x^{(k)}))$ is the approximate posterior distribution of the latent variable z parameterized by $\phi$; $\lambda^{(k)}$ is the rate parameter of the exponential distribution; $\log \mathrm{Exp}(z; \lambda^{(k)})$ is the natural logarithm of the probability density function of the exponential distribution with rate parameter $\lambda^{(k)}$; and $x^{(k)}$ is the k-th row of the transposed adjacency matrix.

6. The information propagation graph generation method according to claim 1, characterized in that the edge-level generative model is trained using the following equations:

wherein $\mathrm{query}_m$ is the comprehensive feature information of the new user m with respect to the previously generated directed acyclic graph, synthesized from the new user's feature information $F_m$ and the low-dimensional vector $h_i$ corresponding to the directed acyclic graph, $\mathrm{query}_m = \mathrm{cat}(F_m, h_i)$; $\mathrm{key}_n$ is the feature information of historical user n; $\mathrm{score}_{mn}$ is the degree to which historical user n and current user m match in terms of attributes; $\mathrm{likelihood}_{mn}$ is the possibility that a propagation relationship exists between historical user n and current user m; $W_q$, $b_q$, $W_k$ and $b_k$ are parameters that the model learns during training; $Q_m$ is the output of the linear transformation of the comprehensive feature information $\mathrm{query}_m$ of the new user m with weight parameter $W_q$ and bias parameter $b_q$; $K_n$ is the output of the linear transformation of the feature information of historical user n with weight parameter $W_k$ and bias parameter $b_k$; $\mathrm{value}_n$ is a vector of size 1×V, where V is the number of nodes in the directed acyclic graph.

7. An information propagation graph generation apparatus, characterized in that it includes:

a collection unit, used to collect propagation data, which includes users, user characteristics, and propagation relationships between users;

a construction unit, used to construct and initialize a directed acyclic graph G = (V, E, F) according to the propagation data, where V denotes the set of users, E denotes the set of propagation relationships between users, and F denotes user features;

a pruning unit, used to prune non-important users from the constructed directed acyclic graph G to obtain a pruned directed acyclic graph G' with non-important users removed;

a processing unit, used to sort all users on the directed acyclic graph G' according to importance using breadth-first search (BFS) to obtain a user index sequence;

the construction unit is also used to construct an adjacency matrix based on the user index sequence;

a training unit, used to train the graph propagation model based on the adjacency matrix to obtain the trained graph propagation model;

a prediction unit, used to input new propagation data into the trained graph propagation model to obtain the propagation graph corresponding to the new propagation data;

the training unit trains the graph propagation model according to the following formula:

wherein $f_G$ is the graph-level generation model; $f_E$ is the edge-level generation model; $h_i$ is the hidden vector of the sequence model corresponding to the currently generated directed acyclic graph at the i-th intermediate layer; F is the user features corresponding to the currently generated directed acyclic graph; the prediction at step i is the adjacency probability information of user $v_i$ corresponding to the i-th row of the adjacency matrix; one-hot decoding is performed on this prediction, setting the maximum value to 1 and all other values to 0; $h_{i+1}$ is the low-dimensional vector corresponding to the next directed acyclic graph generated from the currently generated directed acyclic graph;

the graph propagation model is trained based on the following loss function:

wherein $\mathrm{loss}_{bce}$ is the reconstruction error loss function; $\mathrm{loss}_{KL}$ is the KL divergence loss function; $A_G$ is the unique adjacency matrix of the directed acyclic graph G based on user importance I(v); the predicted probability is that of the corresponding edge; y indicates whether an edge exists in the adjacency matrix $A_G$; α is a weighting factor; λ is the rate parameter of the exponential distribution predicted by the output of the exponential variational autoencoder (E-VAE); and z is the latent representation generated by the E-VAE.