Event causality detection method fusing lexical and dependency features
By fusing part-of-speech and dependency features and combining the RoBERTa model with an edge feature graph attention network, the method addresses the limited accuracy of causal relationship detection and detects causal relationships more effectively.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2023-03-21
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies struggle to effectively detect sentence-level causal relationships, lacking comprehensive utilization of part-of-speech and dependency features, resulting in insufficient accuracy in causal relationship detection.
The RoBERTa model is used to generate word embedding vectors, and one-hot encoding is used to obtain part-of-speech and dependency features. A dependency directed graph is constructed, and its features are aggregated by an edge feature graph attention network followed by a bidirectional long short-term memory network. Finally, causal relationship detection is performed by a conditional random field.
It enhances the semantic representation capability of causal relationship detection and improves the detection accuracy of causal sentences.
Figure CN116796727B_ABST
Description
TECHNICAL FIELD
[0001] The application relates to an event causality detection method fusing part-of-speech and dependency relation features, and belongs to the technical fields of deep learning and natural language processing.
BACKGROUND
[0002] Event causality extraction, one of the subtasks of information extraction, reflects the mutual influence between events, helps people understand the logic of large-scale texts, and captures the causality between events, thereby supporting downstream application tasks such as passage understanding, event prediction, and intelligent question answering. As a more fine-grained task than event causality extraction, causality detection focuses on causality at the sentence level and has gradually become a research hotspot in the field of natural language processing.
[0003] Therefore, it is necessary to provide an event causality detection method fusing part-of-speech and dependency relation features to solve the above problems.
SUMMARY
[0004] The application aims to provide an event causality detection method fusing part-of-speech and dependency relation features, which can better detect the causality of a sentence.
[0005] To achieve the above object, the application provides an event causality detection method fusing part-of-speech and dependency relation features, mainly comprising the following steps:
[0006] Step S1, node feature modeling: a RoBERTa model is used as the base model to generate initial word embedding vectors; one-hot encoding is used to encode the part of speech of every word to generate part-of-speech features; and the initial word embedding vectors are fused with the part-of-speech features to generate the node features of a dependency directed graph;
[0007] Step S2, edge feature modeling: one-hot encoding is used to encode all dependency relations to generate dependency relation features, which serve as the edge features of the dependency directed graph;
[0008] Step S3, constructing the dependency directed graph: a dependency directed graph carrying dependency relation features on its edges is constructed;
[0009] Step S4, aggregating node and edge features: the dependency directed graph is input into an improved edge feature graph attention network model, which aggregates the node and edge features of the dependency directed graph;
[0010] Step S5, enhancing semantic information in the dependency directed graph: the output of step S4 is input into a bidirectional long short-term memory network to capture the dependencies between long-distance nodes in the forward and backward directions, and a graph embedding vector is generated;
[0011] Step S6, event causality detection: the word embedding vectors fused with part-of-speech features and the graph embedding vector are concatenated and passed through a linear layer and a conditional random field to obtain the final causality detection result.
[0012] As a further improvement of the present invention, in step 1, the RoBERTa model's tokenizer is used to standardize the causal event dataset according to the input format. A text sequence $T = [t_1, t_2, \dots, t_s]$ of length s in the dataset is standardized according to the input format and then fed into the RoBERTa model. The RoBERTa encoder obtains the vector representations of all words in the sequence and generates the initial word embedding vectors. Then, the Stanford NLP toolkit is used to obtain the part-of-speech tags of all words in the sequence. The part-of-speech tags are one-hot encoded, and corresponding part-of-speech features are generated for the standardized sequence. The initial word embedding vectors generated by RoBERTa are concatenated with the part-of-speech features to obtain new embedding features, which are then used as the node features of the subsequent dependency directed graph.
[0013] As a further improvement of the present invention, in step 2, the syntactic dependencies between all words in sequence T are obtained using the Stanford NLP toolkit, and all dependencies are one-hot encoded to generate edge features.
[0014] As a further improvement of the present invention, in step 3, a corresponding directed dependency graph is constructed for sequence T. The word embedding vector that incorporates part-of-speech features is used as the node feature of the syntactic dependency graph, the dependency relationship between one word and another word is used as the directed edge of the syntactic dependency graph, and the one-hot encoding corresponding to the dependency relationship is used as the feature on the edge.
[0015] As a further improvement of the present invention, in step 4, the node features and edge features of the syntactic dependency directed graph are simultaneously input into the edge feature graph attention network model. Each edge graph attention network layer iterates the node and edge features, generating higher-level node and edge features that are fed into the next layer. At the same time, each edge graph attention network layer generates a set of node features that integrates the edge features, $\mathrm{Node\_Feature}_{edge}$. All the $\mathrm{Node\_Feature}_{edge}$ sets generated by the edge graph attention network layers are saved and concatenated in the merging layer to achieve multi-scale fusion of node features and edge features, ultimately yielding multi-level graph embedding features that incorporate edge features.
[0016] As a further improvement of the present invention, step 4 also includes the following: each edge graph attention network layer contains two different blocks, a node attention module and an edge attention module, and receives as input a set of node features $\mathrm{Node\_Feature} = (h_1, h_2, \dots, h_n),\ h_i \in \mathbb{R}^{h}$ and a set of edge features $\mathrm{Edge\_Feature} = (e_1, e_2, \dots, e_m),\ e_i \in \mathbb{R}^{e}$, where n and m are the numbers of nodes and edges, and h and e are the dimensions of the respective features. After receiving these two sets of features, each edge graph attention network layer produces a higher-level output, which includes a new set of node features $\mathrm{Node\_Feature}' = (h'_1, h'_2, \dots, h'_n),\ h'_i \in \mathbb{R}^{h'}$ and a new set of edge features $\mathrm{Edge\_Feature}' = (e'_1, e'_2, \dots, e'_m),\ e'_i \in \mathbb{R}^{e'}$;
[0017] The edge index matrix is transformed into an edge mapping matrix $M_{edge}$ ($M_{edge} \in \mathbb{R}^{n \times n \times m}$), and the edge mapping matrix $M_{edge}$ is then used to transform the edge feature vector Edge_Feature into an adjacency-form vector $\mathrm{Edge\_Feature}^{*}$. The calculation formula is as follows:
[0018] $\mathrm{Edge\_Feature}^{*} = M_{edge} \cdot \mathrm{Edge\_Feature}$
[0019] Each element of $\mathrm{Edge\_Feature}^{*}$ can be written as $e_{i,j}$, where $e_{i,j}$ is the feature vector of the directed edge from node i to node j.
[0020] As a further improvement of the present invention, after the edge feature vector is preprocessed, an attention mechanism integrating edge features is applied to each node to generate attention weights that contain the features of both neighboring nodes and adjacent edges. For each node i, the weight $w_{ij}$ of each neighboring node j is calculated. In this process, all features are concatenated, parameterized with a weight vector $a$, and activated using the LeakyReLU function. Finally, the Softmax function normalizes these weights, and the normalized weight $\alpha_{i,j}$ can be expressed as:
[0021] $$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[h_i \,\|\, h_j \,\|\, e_{i,j}\right]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[h_i \,\|\, h_k \,\|\, e_{i,k}\right]\right)\right)}$$
[0022] where $h_i$ represents the features of node i; $N_i$ is the set containing the first-order neighbors of node i and node i itself; $h_j$ and $h_k$ represent the node features of neighboring nodes j and k, respectively; $e_{i,j}$ and $e_{i,k}$ represent the edge features from node i to node j and to node k, respectively; and $a^{T}$ represents the weight vector.
[0023] As a further improvement of the present invention, after node i obtains the normalized attention weight of each neighbor, it performs a weighted summation of the neighbor node features and passes the result through a nonlinear Sigmoid function to give the output of the node attention block, which can be expressed as:
[0024] $$h'_i = \sigma\left(\sum_{j \in N_i} \alpha_{i,j}\, h_j\right)$$
[0025] where $N_i$ is the set containing the first-order neighbors of node i and node i itself; $h_j$ represents the node features of node i's neighboring node j; $\alpha_{i,j}$ represents the normalized attention weight between nodes i and j; and $\sigma$ represents the Sigmoid function.
[0026] As a further improvement of the present invention, in step 5, a bidirectional long short-term memory network is added after the graph embedding features to capture the dependencies between long-distance nodes in the forward and reverse directions in the syntactic dependency graph.
[0027] As a further improvement of the present invention, in step 6, the graph embedding features enhanced with semantic information are concatenated with the word embedding vectors fused with part-of-speech features to obtain the final embedding features of the entire model. The final embedding features are then connected to a linear layer, the feature dimensions of the model are mapped to the number of labels, and then connected to a conditional random field to predict the final causal relationship.
[0028] The beneficial effects of this invention are: this invention incorporates part-of-speech and dependency features into the event causality detection task, thereby enhancing the semantic representation of causal sentences in the event causality detection task.
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] Figure 1 is a diagram of the overall model architecture of the event causality detection method of the present invention.
[0030] Figure 2 is an example diagram of the dependency directed graph of the event causality detection method of the present invention.
[0031] Figure 3 is a model architecture diagram of the edge feature graph attention network of the event causality detection method of the present invention.
[0032] Figure 4 is an architecture diagram of a single edge graph attention network layer in the event causality detection method of the present invention.
[0033] Figure 5 is a schematic diagram of the specific calculation process of the node attention module in the event causality detection method of the present invention.
[0034] Figure 6 is a schematic diagram of the node and edge switching process of the edge attention module in the event causality detection method of the present invention.
DETAILED DESCRIPTION
[0035] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
[0036] It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and / or processing steps closely related to the present invention are shown in the accompanying drawings, while other details that are not closely related to the present invention are omitted.
[0037] Additionally, it should be noted that the terms “comprising,” “including,” or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.
[0038] As shown in Figures 1 to 6, the present invention discloses an event causality detection method fusing part-of-speech and dependency relation features, comprising the following steps:
[0039] Step S1: The RoBERTa model's tokenizer is used to standardize the causal event dataset according to the input format. For a text sequence $T = [t_1, t_2, \dots, t_s]$ of length s in the dataset, [CLS] and [SEP] are added to the beginning and end of T, respectively. The maximum sequence length is set to n, and sequences shorter than n are padded with [PAD]. The standardized sequence is then input into the RoBERTa model, whose encoder obtains vector representations of all words in the sequence and generates the initial word embedding vectors. The part-of-speech (POS) features of the words are then fused into these vectors. First, the POS of [CLS], [SEP], and [PAD] in the sequence is represented using a single encoding. Then, the Stanford NLP toolkit is used to obtain the POS tags of all words in the sequence, the tags are one-hot encoded, and the corresponding POS features are generated for the standardized sequence. The initial word embedding vectors generated by RoBERTa are concatenated with the POS features to obtain new embedding features, which are then used as the node features of the subsequent dependency directed graph.
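The fusion in step S1 can be illustrated with a minimal sketch. It assumes toy 2-dimensional vectors in place of real RoBERTa outputs, a hypothetical four-tag POS inventory, and one shared code (here named `SPECIAL`) for [CLS], [SEP], and [PAD]; none of these values come from the patent, and the helper names are illustrative only.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_node_features(embeddings, pos_tags, tagset):
    """Concatenate each token embedding with the one-hot code of its POS tag."""
    tag_to_id = {tag: i for i, tag in enumerate(tagset)}
    return [
        emb + one_hot(tag_to_id[tag], len(tagset))
        for emb, tag in zip(embeddings, pos_tags)
    ]


# Hypothetical tag inventory; "SPECIAL" is an assumed shared code for
# the [CLS], [SEP], and [PAD] positions.
TAGSET = ["SPECIAL", "NN", "VBD", "DT"]

# Toy embeddings standing in for RoBERTa encoder outputs (assumption).
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
pos_tags = ["SPECIAL", "NN", "VBD", "SPECIAL"]

node_features = build_node_features(embeddings, pos_tags, TAGSET)
```

Each node feature has dimension (embedding size + number of POS tags), so the graph network downstream sees lexical and part-of-speech information in a single vector.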
[0040] Step S2: Use the Stanford NLP toolkit to obtain the syntactic dependencies between all words in sequence T, and perform one-hot encoding on all dependencies to generate edge features.
[0041] Step S3: A corresponding dependency directed graph is constructed for sequence T. The word embedding vectors that incorporate part-of-speech features are used as the node features of the syntactic dependency graph, the dependency relation between one word and another is used as a directed edge of the graph, and the one-hot encoding of that dependency relation is used as the feature on the edge. Figure 2 shows an example of a dependency directed graph, in which the cause and effect parts are highlighted in red and blue, respectively, while the black parts belong to neither (these are usually causal indicators).
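Steps S2 and S3 together can be sketched as follows. The sketch assumes that the parser output is already available as (head, dependent, relation) triples with 0-based token indices, that edges point from head to dependent, and that the relation inventory has three labels; these are illustrative assumptions, and `build_dependency_graph` is a hypothetical helper rather than a toolkit function.

```python
def build_dependency_graph(arcs, relations):
    """Build the edge list and one-hot edge features of a dependency graph.

    arcs: list of (head, dependent, relation) triples, 0-based token indices.
    Returns (edge_index, edge_features), where edge_index[k] is the
    (source, target) pair of edge k and edge_features[k] is the one-hot
    code of its dependency relation.
    """
    rel_to_id = {rel: i for i, rel in enumerate(relations)}
    edge_index, edge_features = [], []
    for head, dep, rel in arcs:
        code = [0.0] * len(relations)
        code[rel_to_id[rel]] = 1.0
        edge_index.append((head, dep))  # assumed direction: head -> dependent
        edge_features.append(code)
    return edge_index, edge_features


# Hypothetical relation inventory and parse of "The storm caused floods".
RELATIONS = ["nsubj", "obj", "det"]
arcs = [(2, 1, "nsubj"), (2, 3, "obj"), (1, 0, "det")]

edge_index, edge_features = build_dependency_graph(arcs, RELATIONS)
```

Keeping the edge list and the edge features as parallel lists mirrors the "edge index matrix" plus Edge_Feature pair that the attention layers consume.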
[0042] Step S4: The overall architecture of the edge feature graph attention network model consists of multiple edge graph attention network layers and a merging layer. The node features and edge features of the syntactic dependency graph are simultaneously input into the edge feature graph attention network model. Each edge graph attention network layer iterates the node and edge features, generating higher-level node and edge features that are fed into the next edge graph attention network layer. At the same time, each edge graph attention network layer generates a set of node features that integrates the edge features, $\mathrm{Node\_Feature}_{edge}$. All the $\mathrm{Node\_Feature}_{edge}$ sets generated by the edge graph attention network layers are saved and concatenated in the merging layer to achieve multi-scale fusion of node features and edge features, ultimately yielding multi-level graph embedding features that incorporate edge features. Figure 3 shows the model architecture of the edge feature graph attention network.
[0043] Each edge graph attention network layer contains two distinct blocks, a node attention module and an edge attention module, and receives as input a set of node features $\mathrm{Node\_Feature} = (h_1, h_2, \dots, h_n),\ h_i \in \mathbb{R}^{h}$ and a set of edge features $\mathrm{Edge\_Feature} = (e_1, e_2, \dots, e_m),\ e_i \in \mathbb{R}^{e}$, where n and m are the numbers of nodes and edges, and h and e are the dimensions of the respective features. Each edge graph attention network layer receives these two sets of features and produces a higher-level output, which includes a new set of node features $\mathrm{Node\_Feature}' = (h'_1, h'_2, \dots, h'_n),\ h'_i \in \mathbb{R}^{h'}$ and a new set of edge features $\mathrm{Edge\_Feature}' = (e'_1, e'_2, \dots, e'_m),\ e'_i \in \mathbb{R}^{e'}$.
[0044] Figure 4 illustrates the architecture of a single edge graph attention network layer. In the node attention module, the received edge features are $\mathrm{Edge\_Feature} = (e_1, e_2, \dots, e_m),\ e_i \in \mathbb{R}^{e}$. Because these features are arranged in the order of the edge index matrix, it is difficult to find the relationship between an edge and its neighboring nodes. Therefore, the edge index matrix is first transformed into an edge mapping matrix $M_{edge}$ ($M_{edge} \in \mathbb{R}^{n \times n \times m}$), and $M_{edge}$ is then used to transform the edge feature vector Edge_Feature into an adjacency-form vector $\mathrm{Edge\_Feature}^{*}$. The calculation formula is as follows:
[0045] $\mathrm{Edge\_Feature}^{*} = M_{edge} \cdot \mathrm{Edge\_Feature}$
[0046] Each element of $\mathrm{Edge\_Feature}^{*}$ can be written as $e_{i,j}$, where $e_{i,j}$ is the feature vector of the directed edge from node i to node j.
[0047] Figure 5 shows the specific calculation process by which the node attention module transforms the edge feature vector into adjacency form: panel (a) shows the edge mapping matrix $M_{edge}$ obtained from the dependency directed graph, and panel (b) illustrates the process of transforming the edge features into adjacency form.
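The transformation of edge features into adjacency form can be sketched in plain Python. The sketch assumes that the edge mapping matrix holds a 1 at position (i, j, k) exactly when edge k runs from node i to node j, which is one reading consistent with its n × n × m shape; the function names are hypothetical.

```python
def edge_mapping_matrix(edge_index, n):
    """Build M_edge (n x n x m): M[i][j][k] = 1 if edge k goes from i to j."""
    m = len(edge_index)
    mapping = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    for k, (i, j) in enumerate(edge_index):
        mapping[i][j][k] = 1.0
    return mapping


def to_adjacency_form(mapping, edge_features):
    """Compute Edge_Feature* = M_edge . Edge_Feature (shape n x n x e)."""
    n = len(mapping)
    m = len(edge_features)
    e = len(edge_features[0])
    return [
        [
            [sum(mapping[i][j][k] * edge_features[k][d] for k in range(m))
             for d in range(e)]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy graph: edge 0 goes 0 -> 1, edge 1 goes 1 -> 2 (assumed example).
toy_edge_index = [(0, 1), (1, 2)]
toy_edge_features = [[1.0, 0.0], [0.0, 1.0]]

m_edge = edge_mapping_matrix(toy_edge_index, 3)
edge_adj = to_adjacency_form(m_edge, toy_edge_features)
```

After the transformation, `edge_adj[i][j]` is the feature vector of the edge from node i to node j, and node pairs without an edge hold an all-zero vector.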
[0048] After the edge feature vectors are preprocessed, an attention mechanism integrating edge features is applied to each node, generating attention weights that incorporate the features of both neighboring nodes and adjacent edges. For each node i, the weight $w_{ij}$ of each neighboring node j is calculated. In this process, all features are concatenated, parameterized with a weight vector $a$, and activated using the LeakyReLU function. Finally, the Softmax function normalizes these weights. The normalized weight $\alpha_{i,j}$ can be expressed as:
[0049] $$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[h_i \,\|\, h_j \,\|\, e_{i,j}\right]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[h_i \,\|\, h_k \,\|\, e_{i,k}\right]\right)\right)}$$
[0050] where $h_i$ represents the features of node i; $N_i$ is the set containing the first-order neighbors of node i and node i itself; $h_j$ and $h_k$ represent the node features of neighboring nodes j and k, respectively; $e_{i,j}$ and $e_{i,k}$ represent the edge features from node i to node j and to node k, respectively; and $a^{T}$ represents the weight vector.
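The weight computation described above (concatenate, project with the weight vector, apply LeakyReLU, normalize with Softmax) can be sketched as follows. The LeakyReLU slope of 0.2, the toy feature values, and the all-zero edge feature used for the self-loop are assumptions made for illustration; the patent does not state them.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU; the 0.2 negative slope is an assumed value."""
    return x if x > 0 else slope * x


def attention_weights(i, neighbors, h, e_adj, a):
    """Normalized attention weights of node i over `neighbors` (N_i).

    h: list of node feature vectors; e_adj[i][j]: adjacency-form edge
    feature from i to j; a: weight vector over [h_i || h_j || e_ij].
    """
    scores = []
    for j in neighbors:
        concat = h[i] + h[j] + e_adj[i][j]
        scores.append(leaky_relu(sum(w * x for w, x in zip(a, concat))))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {j: v / total for j, v in zip(neighbors, exps)}


# Toy inputs (assumed): 3 nodes, 2-d node features, 1-d edge features.
h_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_toy = [[[0.0] for _ in range(3)] for _ in range(3)]
e_toy[0][1] = [1.0]  # only the edge 0 -> 1 carries a non-zero feature
a_toy = [0.5, -0.5, 1.0, 0.0, 2.0]

alpha = attention_weights(0, [0, 1, 2], h_toy, e_toy, a_toy)
```

In this toy setting node 1 receives the largest weight because its edge feature contributes to the score, which shows how edge information steers the attention without being added to the node features themselves.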
[0051] After node i obtains the normalized attention weight of each neighbor, a weighted summation of the features of these neighbor nodes is performed, and a nonlinear Sigmoid function is applied to the sum. The result, which is the output of the node attention block for this node, can be expressed as:
[0052] $$h'_i = \sigma\left(\sum_{j \in N_i} \alpha_{i,j}\, h_j\right)$$
[0053] where $N_i$ is the set containing the first-order neighbors of node i and node i itself; $h_j$ represents the node features of node i's neighboring node j; $\alpha_{i,j}$ represents the normalized attention weight between nodes i and j; and $\sigma$ represents the Sigmoid function. Note that only node features are aggregated to generate the new set of node features. Edge features play a role only in the weight calculation and are not included in the new node features.
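The aggregation step can be sketched in a few lines. The sketch takes the normalized weights as a dictionary from neighbor index to weight, which is an illustrative data layout rather than anything prescribed by the patent, and applies no learned projection because the description does not mention one.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_node(alpha, h):
    """Weighted sum of neighbor node features followed by Sigmoid.

    alpha: dict mapping neighbor index j to its normalized weight.
    h: list of node feature vectors. Edge features are deliberately
    absent here: they only influence `alpha`.
    """
    dim = len(h[0])
    summed = [sum(alpha[j] * h[j][d] for j in alpha) for d in range(dim)]
    return [sigmoid(v) for v in summed]


# Toy example (assumed values): two neighbors with weights 0.25 and 0.75.
new_h = aggregate_node({0: 0.25, 1: 0.75}, [[4.0, 0.0], [0.0, 4.0]])
```

Here the weighted sums are 1.0 and 3.0, so `new_h` holds the Sigmoid of those two values.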
[0054] The node attention module also generates a set of node features that integrates edge features, $\mathrm{Node\_Feature}_{edge} = (r_1, r_2, \dots, r_n)$. Each edge graph attention network layer generates one such set; it is not passed to the next edge graph attention network layer but is retained until the final merging layer. For each node i, $r_i$ is calculated as follows:
[0055]
[0056] In the edge attention module, to perform aggregation, the roles of nodes and edges in the graph are first switched; Figure 6 shows an example of the switching process. Then, using the same method as shown in Figure 5, the node mapping matrix $M_{node}$ ($M_{node} \in \mathbb{R}^{m \times m \times n}$) is obtained, and the node feature set is converted into adjacency form. The normalized attention weights are calculated in the same way as before; the normalized attention weight from each edge p to its neighbor edge q can be expressed as:
[0057] $$\beta_{pq} = \frac{\exp\left(\mathrm{LeakyReLU}\left(b^{T}\left[e_p \,\|\, e_q \,\|\, h_{pq}\right]\right)\right)}{\sum_{k \in M_p} \exp\left(\mathrm{LeakyReLU}\left(b^{T}\left[e_p \,\|\, e_k \,\|\, h_{pk}\right]\right)\right)}$$
[0058] where $e_p$ represents the features of edge p; $M_p$ is the set containing the first-order neighbors of edge p and edge p itself; $e_q$ and $e_k$ represent the edge features of neighbor edges q and k, respectively; $h_{pq}$ and $h_{pk}$ represent the node features from edge p to edge q and to edge k, respectively; and $b^{T}$ represents the weight vector.
[0059] Similar to the node features, the computation of the new edge feature set can be expressed as:
[0060] $$e'_p = \sigma\left(\sum_{q \in M_p} \beta_{pq}\, e_q\right)$$
[0061] where $M_p$ is the set containing the first-order neighbors of edge p and edge p itself; $e_q$ represents the edge features of edge q; $\beta_{pq}$ represents the normalized attention weight between edges p and q; and $\sigma$ represents the Sigmoid function.
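The role switching used by the edge attention module can be sketched as the construction of a node mapping matrix. The sketch assumes that two distinct edges count as neighbors when they share an endpoint, regardless of direction, and that the shared node plays the role of the "edge" between them; this reading fits the m × m × n shape but is not spelled out in the text, and the function name is hypothetical.

```python
def node_mapping_matrix(edge_index, n):
    """Build M_node (m x m x n) for the role-switched graph.

    M[p][q][k] = 1 when distinct edges p and q both touch node k
    (assumed neighbor rule; edge direction is ignored).
    """
    m = len(edge_index)
    mapping = [[[0.0] * n for _ in range(m)] for _ in range(m)]
    for p, ends_p in enumerate(edge_index):
        for q, ends_q in enumerate(edge_index):
            if p == q:
                continue
            for k in set(ends_p) & set(ends_q):
                mapping[p][q][k] = 1.0
    return mapping


# Toy dependency graph with 4 nodes and 3 edges (assumed example).
switch_edges = [(2, 1), (2, 3), (1, 0)]
m_node = node_mapping_matrix(switch_edges, 4)
```

With this matrix, the node features can be brought into adjacency form for edge pairs in the same way that the edge features are brought into adjacency form for node pairs, after which the edge attention weights follow the same concatenate, LeakyReLU, and Softmax pattern.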
[0062] After iteratively generating new sets of nodes and edge features through multiple layers of edge graph attention networks, a merging layer is finally connected to concatenate the node and edge feature sets generated by each edge graph attention network layer. The merging layer employs a multi-head attention mechanism to concatenate features. Unlike ordinary graph attention networks, the multi-head attention mechanism in edge graph attention networks is calculated uniformly across all edge graph attention network layers. The calculation formula can be expressed as:
[0063]
[0064] where L represents the number of edge graph attention network layers, and the per-layer term represents the node features of node i, integrating edge features, that are generated by the k-th edge graph attention network layer.
[0065] Step S5: After the graph embedding features are obtained, a bidirectional long short-term memory network layer is added after them to capture the dependencies between long-distance nodes in the syntactic dependency graph. The calculation formulas are as follows:
[0066] $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$,
[0067] $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$,
[0068] $\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$,
[0069] $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$,
[0070] $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$,
[0071] $h_t = o_t * \tanh(C_t)$,
[0072] where $\sigma$ represents the Sigmoid activation function; $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate, respectively; $x_t$, $C_t$, and $h_t$ are the input vector, the final cell state, and the output vector, respectively; $W_f$, $W_i$, $W_c$, and $W_o$ are the weight matrices; $b_f$, $b_i$, $b_c$, and $b_o$ are the bias vectors; and $\tilde{C}_t$ represents the current (candidate) cell state.
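The gate equations of step S5 can be traced with a small pure-Python sketch of one LSTM step and a bidirectional pass. It assumes zero initial states, separate parameter sets for the two directions, and concatenation of the forward and backward outputs at each position; the parameter dictionary layout and function names are illustrative, not taken from the patent.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, z, b):
    """Compute W . z + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the forget/input/output gate equations."""
    z = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(p["W_f"], z, p["b_f"])]
    i = [sigmoid(v) for v in affine(p["W_i"], z, p["b_i"])]
    c_tilde = [math.tanh(v) for v in affine(p["W_c"], z, p["b_c"])]
    o = [sigmoid(v) for v in affine(p["W_o"], z, p["b_o"])]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def run_lstm(xs, p, hidden):
    """Run the cell over a sequence, starting from zero states (assumed)."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(xs, p_fwd, p_bwd, hidden):
    """Concatenate forward and backward hidden states at every position."""
    forward = run_lstm(xs, p_fwd, hidden)
    backward = run_lstm(xs[::-1], p_bwd, hidden)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


def zero_params(hidden, inp):
    """All-zero parameters, used only to make the example reproducible."""
    p = {}
    for gate in ("f", "i", "c", "o"):
        p["W_" + gate] = [[0.0] * (hidden + inp) for _ in range(hidden)]
        p["b_" + gate] = [0.0] * hidden
    return p


graph_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy inputs (assumed)
outputs = bilstm(graph_embeddings, zero_params(2, 3), zero_params(2, 3), 2)
```

Running the backward pass on the reversed sequence and reversing its outputs again aligns both directions position by position, so each node sees context from earlier and later nodes.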
[0073] Step S6: After the graph embedding features carrying forward and backward information are obtained, they are concatenated with the word embedding vectors that incorporate part-of-speech features to obtain the final embedding features of the entire model. The final embedding features are fed into a linear layer, which maps the model's feature dimension to the number of labels, and the result is passed to a conditional random field to predict the final causal relationship.
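The prediction stage of step S6 can be illustrated with a sketch of how a conditional random field picks a label sequence from per-token scores. The sketch assumes Viterbi decoding, which is the usual inference procedure for a linear-chain CRF but is not named in the patent, and it uses made-up emission and transition scores in place of the linear layer output and learned CRF parameters.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence of a linear-chain CRF.

    emissions[t][y]: score of label y at position t (linear layer output).
    transitions[a][b]: score of moving from label a to label b.
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for b in range(num_labels):
            best_a = max(range(num_labels),
                         key=lambda a: score[a] + transitions[a][b])
            new_score.append(score[best_a] + transitions[best_a][b] + emit[b])
            pointers.append(best_a)
        score = new_score
        backpointers.append(pointers)
    best = max(range(num_labels), key=lambda y: score[y])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]


# Toy scores for 3 tokens and 2 hypothetical labels (assumed values).
toy_emissions = [[1.0, 0.0], [0.0, 0.5], [0.0, 1.0]]
free_transitions = [[0.0, 0.0], [0.0, 0.0]]
penalized = [[0.0, -10.0], [0.0, 0.0]]  # discourages label 0 -> label 1

path_free = viterbi_decode(toy_emissions, free_transitions)
path_penalized = viterbi_decode(toy_emissions, penalized)
```

With no transition preferences the decoder simply follows the best emission at each token; once the 0 to 1 transition is penalized, the whole sequence changes, which is the reason a conditional random field is placed after the linear layer instead of a per-token argmax.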
[0074] In summary, this invention first incorporates part-of-speech and dependency features into the event causality detection task, enhancing the semantic representation of causal sentences in this task. Second, it models a dependency directed graph by using word embedding vectors fused with part-of-speech features as node features, dependency relations as directed edges, and dependency features as edge features. Third, it employs an edge feature graph attention network model to aggregate node and edge features. During aggregation, edge information is incorporated into the feature representation, and node and edge features are iterated in a parallel and interactive manner to generate high-level feature output.
[0075] The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims
1. A method for detecting event causal relationships by integrating part-of-speech and dependency relationship features, characterized in that the main steps include: Step S1: node feature modeling, in which the RoBERTa model is used as the base model to generate initial word embedding vectors, one-hot encoding is used to encode the part of speech of all words to generate part-of-speech features, and the initial word embedding vectors and part-of-speech features are fused to generate the node features of the dependency directed graph; Step S2: edge feature modeling, in which one-hot encoding is used to encode all dependencies to generate dependency features, which are then used as the edge features of the dependency directed graph; Step S3: constructing a dependency directed graph that has dependency relationship features on its edges; Step S4: aggregating node and edge features, in which the dependency directed graph is input into the improved edge feature graph attention network model and this network model is used to aggregate the node and edge features of the dependency directed graph; Step S5: enhancing the semantic information in the dependency directed graph, in which the output of step S4 is input into a bidirectional long short-term memory network to capture the dependency relationship between long-distance nodes in both directions and to generate graph embedding vectors; Step S6: event causality detection, in which the word embedding vectors fused with part-of-speech features and the graph embedding vectors are concatenated and passed through the linear layer and conditional random field to obtain the final causality detection result. In step S1, the RoBERTa model's tokenizer is used to standardize the causal event dataset according to the input format. A text sequence $T = [t_1, t_2, \dots, t_s]$ of length s in the dataset is standardized according to the input format and then fed into the RoBERTa model. The RoBERTa encoder obtains the vector representations of all words in the sequence and generates the initial word embedding vectors. Then, the Stanford NLP toolkit is used to obtain the part-of-speech tags of all words in the sequence. The part-of-speech tags are one-hot encoded, and corresponding part-of-speech features are generated for the standardized sequence. The initial word embedding vectors generated by RoBERTa are concatenated with the part-of-speech features to obtain new embedding features, which are then used as the node features of the subsequent dependency directed graph.
2. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 1, characterized in that: In step S2, the Stanford NLP toolkit is used to obtain the syntactic dependencies between all words in sequence T, and all dependencies are one-hot encoded to generate edge features.
3. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 1, characterized in that: In step S3, a corresponding directed dependency graph is constructed for sequence T. The word embedding vectors that incorporate part-of-speech features are used as node features of the syntactic dependency graph. The dependency relationship between one word and another word is used as the directed edge of the syntactic dependency graph. The one-hot encoding corresponding to the dependency relationship is used as the feature on the edge.
4. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 1, characterized in that: In step S4, the node features and edge features of the syntactic dependency directed graph are simultaneously input into the edge feature graph attention network model. Each edge graph attention network layer iterates the node and edge features, generating higher-level node and edge features that are then fed into the next layer. At the same time, each edge graph attention network layer generates a set of node features that integrates the edge features, $\mathrm{Node\_Feature}_{edge}$. All the $\mathrm{Node\_Feature}_{edge}$ sets generated by the edge graph attention network layers are saved and concatenated in the merging layer to achieve multi-scale fusion of node features and edge features, ultimately yielding multi-level graph embedding feature information that incorporates edge features.
5. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 4, characterized in that step S4 further includes the following: each edge graph attention network layer contains two distinct blocks, a node attention module and an edge attention module, and receives as input a set of node features $\mathrm{Node\_Feature} = (h_1, h_2, \dots, h_n),\ h_i \in \mathbb{R}^{h}$ and a set of edge features $\mathrm{Edge\_Feature} = (e_1, e_2, \dots, e_m),\ e_i \in \mathbb{R}^{e}$, where n and m are the numbers of nodes and edges, and h and e are the dimensions of the respective features. Each edge graph attention network layer receives these two sets of features and produces a higher-level output, which includes a new set of node features $\mathrm{Node\_Feature}' = (h'_1, h'_2, \dots, h'_n),\ h'_i \in \mathbb{R}^{h'}$ and a new set of edge features $\mathrm{Edge\_Feature}' = (e'_1, e'_2, \dots, e'_m),\ e'_i \in \mathbb{R}^{e'}$; the edge index matrix is transformed into an edge mapping matrix $M_{edge} \in \mathbb{R}^{n \times n \times m}$, and the edge mapping matrix $M_{edge}$ is then used to transform the edge feature vector Edge_Feature into the adjacency-form vector $\mathrm{Edge\_Feature}^{*}$, calculated as $\mathrm{Edge\_Feature}^{*} = M_{edge} \cdot \mathrm{Edge\_Feature}$, where each element of $\mathrm{Edge\_Feature}^{*}$ can be written as $e_{i,j}$, the feature vector of the directed edge from node i to node j.
6. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 5, characterized in that: After the edge feature vectors are preprocessed, an attention mechanism integrating edge features is applied to each node to generate attention weights that include the features of both neighboring nodes and adjacent edges. For each node i, the weight $w_{ij}$ of each neighboring node j is calculated. In this process, all features are concatenated, parameterized with a weight vector $a$, and activated using the LeakyReLU function. Finally, the Softmax function normalizes these weights, which can be expressed as: $\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[h_i \,\|\, h_j \,\|\, e_{i,j}\right]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[h_i \,\|\, h_k \,\|\, e_{i,k}\right]\right)\right)}$, where $h_i$ represents the features of node i; $N_i$ is the set that includes the first-order neighbors of node i and node i itself; $h_j$ and $h_k$ represent the node features of neighboring nodes j and k, respectively; $e_{i,j}$ and $e_{i,k}$ represent the edge features from node i to node j and to node k, respectively; and $a^{T}$ represents the weight vector.
7. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 6, characterized in that: After obtaining the normalized attention weight of each neighbor, node i performs a weighted summation of the neighbor node features and passes the result through a nonlinear Sigmoid function to give the output of the node attention block, which can be represented as: $h'_i = \sigma\left(\sum_{j \in N_i} \alpha_{i,j}\, h_j\right)$, where $N_i$ is the set that includes the first-order neighbors of node i and node i itself; $h_j$ represents the node features of node j, a neighbor of node i; $\alpha_{i,j}$ represents the normalized attention weight between nodes i and j; and $\sigma$ represents the Sigmoid function.
8. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 1, characterized in that: In step S5, a bidirectional long short-term memory network is added after the graph embedding features to capture the dependencies between long-distance nodes in the forward and reverse directions in the syntactic dependency graph.
9. The event causal relationship detection method integrating part-of-speech and dependency relationship features according to claim 1, characterized in that: In step S6, the graph embedding features enhanced with semantic information are concatenated with the word embedding vectors fused with part-of-speech features to obtain the final embedding features of the entire model. The final embedding features are then connected to a linear layer, mapping the feature dimensions of the model to the number of labels, and then connected to a conditional random field to predict the final causal relationship.