An open domain relation extraction method and system based on adaptive clustering
By introducing a bilateral boundary loss and a relation repository, and by combining inter-instance and inter-cluster contrastive learning, the model's decision boundary is optimized. This addresses the semantic entanglement between classes and clusters in open-domain relation extraction and improves the quality of relation extraction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INST OF TECH
- Filing Date
- 2023-06-01
- Publication Date
- 2026-06-19
AI Technical Summary
Existing open-domain relation extraction methods struggle to effectively address the semantic entanglement between clusters when dealing with unknown relations. Furthermore, traditional methods fail to explicitly align relation semantics with cluster semantics, resulting in poor clustering performance.
An adaptive clustering-based approach is adopted. It introduces a bilateral boundary loss to constrain difficult samples, uses a relation repository to construct semantic differences between clusters, and combines inter-instance and inter-cluster contrastive learning to optimize the decision boundary of the model, thereby modeling intra-class consistency and inter-class differences.
It improves the performance of open-domain relation extraction, alleviates semantic overlap between clusters, enhances the model's sensitivity to decision boundaries and its ability to distinguish labeled data, and improves the quality of open-domain relation extraction.
Figure CN116662457B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to an open-domain relation extraction method and system, specifically an open-domain relation extraction method and system based on adaptive clustering, belonging to the field of natural language processing application technology.
Background Technology
[0002] With the development of deep learning, relation extraction has made significant progress. As an information extraction method, relation extraction uses models to identify and extract the relations between entity pairs in textual data, helping users further understand and utilize the data. However, the traditional supervised and distant supervision paradigms for relation extraction are designed for predefined relations and cannot handle relations that newly emerge in the real world. To avoid the time-consuming and laborious manual definition of templates for emerging relations, researchers have proposed open-domain relation extraction, which extracts implicit structured information from unstructured text without being limited by the relation type or domain of the original text. This better reflects the knowledge diversity and broad domain scope of the relations to be identified in the real world.
[0003] Currently, the mainstream approach to improving open-domain relation extraction methods for clustering unknown relations is to use metric learning or contrastive learning to constrain the distance between original samples and positive and negative samples. While this approach is straightforward, this single boundary limit restricts the similarity of sample pairs, hindering the flexible construction of clusters. Secondly, self-supervised contrastive learning based on instance discrimination to optimize feature representations lacks awareness of intra-class variations, and directly applying it to deep clustering fails to achieve optimal class differentiation. Furthermore, the method of iteratively training feature representation and clustering modules in stages can only optimize features with explicit supervision signals and cannot achieve joint updates of the two modules.
[0004] These methods cannot explicitly align relational semantics and cluster semantics, making it difficult to resolve semantic entanglement between some clusters. Therefore, to address the widespread problem of poor clustering performance in open-domain relation extraction under undefined relation scenarios, it is first necessary to clarify the classification of difficult samples in semantic entanglement and model the semantic differences between classes; secondly, it is necessary to adopt a joint optimization objective to achieve joint optimization of features and clustering. Improving the quality of open-domain relation extraction can provide an important data source for natural language processing tasks such as knowledge graph construction and intelligent question answering, and has great application value.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of existing open-domain relation extraction methods and creatively propose an open-domain relation extraction method and system based on adaptive clustering. This method utilizes bilateral boundary loss to constrain the clustering of samples with varying difficulty, constructs a relation repository, and employs inter-instance and inter-cluster contrastive learning to model intra-class consistency and inter-class differences. During the clustering process, the decision boundary is adaptively adjusted, thereby improving the effectiveness of open-domain relation extraction.
[0006] The innovations of this invention are as follows: First, existing open-domain relation extraction models do not focus on different types of difficult samples, and existing self-supervised contrastive learning methods only consider the consistency of relation instances of the same category, without considering the differences between clusters after intra-cluster changes. To address this, a two-sided boundary loss is introduced to provide a constraint boundary for difficult samples, which provides upper and lower bounds on the similarity of difficult samples. Simultaneously, based on instance self-supervised contrastive learning, supervised contrastive learning is introduced using labeled data to improve the model's ability to distinguish labeled data. Furthermore, a relation repository is used to store the relation representations of each category of unlabeled data separately, and inter-cluster semantic differences are constructed based on this repository. The predicted results of cluster decision boundaries and the probabilities of cluster semantic assignments are trained by minimizing cross-entropy, improving the model's sensitivity to decision boundaries.
[0007] The objective of this invention is achieved through the following technical solutions.
[0008] This invention proposes an open-domain relation extraction method based on adaptive clustering, comprising the following steps:
[0009] Step 1: Input a relation instance and encode it to generate a relation representation.
[0010] Specifically as follows:
[0011] Step 1.1: Convert relation instances into sequences of their word vector representations using an embedding layer. This can be achieved using the following methods:
[0012] Step 1.1.1: Expand the sentences in the relation instance. Specifically, add the paired special symbols "<e1>", "</e1>" and "<e2>", "</e2>" around the head entity and the tail entity, respectively, to mark their start and end positions.
[0013] Step 1.1.2: The sentence processed in Step 1.1.1 is mapped word by word into a sequence of word vectors through an embedding layer.
[0014] Step 1.2: Encode the input sequence using an encoder and output a relational representation. This can be achieved using the following methods:
[0015] Step 1.2.1: Obtain the encoded representation of the input sequence through the encoder. After max pooling the encoded representations related to the head entity and the tail entity, obtain the encoded representation of each entity.
[0016] Step 1.2.2: Concatenate the head and tail entity encoding representations obtained in Step 1.2.1 to obtain the relation representation of the relation instance.
[0017] Step 2: Define and construct different types of difficult samples for the labeled data, and use bilateral boundary loss to measure the difference between the two types of difficult samples in the semantic space.
[0018] Specifically as follows:
[0019] Step 2.1: Construct different types of difficult samples using different strategies. Specifically, this can be achieved using the following methods:
[0020] Step 2.1.1: For cases with the same relationship type in different contexts, construct positive samples using an entity replacement strategy. Specifically, randomly replace the head entity and tail entity with other words of the same entity type.
[0021] Step 2.1.2: For cases with similar contexts but different relation types, randomly select instances with different relation types from the original relation instances, modify their head and tail entities to replace them with synonyms from the original instances, and construct negative samples.
[0022] Step 2.2: After obtaining the relation representations of the samples constructed in Step 2.1, construct the bilateral boundary loss. Specifically, the difference between the two cosine similarities (original sample versus positive sample, and original sample versus negative sample) is constrained to a range bounded by -m1 and -m2.
[0023] Step 3: Construct instance-level supervised contrastive learning and self-supervised contrastive learning losses using labeled and unlabeled data.
[0024] Specifically as follows:
[0025] Step 3.1: Construct positive samples for each relation instance in the labeled and unlabeled datasets using the strategy in Step 2.1.1.
[0026] Step 3.2: The relation representations of the instances and their positive samples are obtained via Step 1. The negative samples of an instance are the relation representations of the other positive samples in the same batch. Cosine similarity is used to measure similarity, and a self-supervised contrastive loss between instances is constructed.
[0027] Step 3.3: For the labeled data, use the relation instances belonging to the same category in the same batch as positive samples and the other samples in the same batch as negative samples. After obtaining the relation representations of the instances, positive samples, and negative samples via Step 1, construct the supervised contrastive loss between instances.
[0028] Step 4: Build a relation repository for unlabeled data, use the relation representations in each storage queue to model the semantic differences between clusters to achieve adaptive clustering of unlabeled data, and use a classification task to predict the labels of labeled data.
[0029] Specifically as follows:
[0030] Step 4.1: The relation repository maintains a set of queues, one per relation category, each storing the positive sample relation representations of that category. After each encoder update, the new relation representation enters the corresponding queue, and the earliest relation representation in that queue is deleted.
[0031] Step 4.2: Minimize the cross-entropy between cluster assignment based on semantic similarity in the feature space and predictions generated based on the decision boundary. This can be achieved using the following method:
[0032] Step 4.2.1: Calculate the semantic similarity between the current representation and each relation category using the relation representation of each category in the relation repository.
[0033] Step 4.2.2: Use the vector after the current representation is processed by the clustering decision boundary, i.e. the vector after mapping and Softmax operation, as the probability that the current representation belongs to each relation category.
[0034] Step 4.2.3: For each category, minimize the cross-entropy of the semantic similarity and decision boundary output predictions, and update the parameters of the decision boundary.
[0035] Step 4.2.4: After each round of training, update the parameters of the encoder and decision boundary, update the labels of each positive sample using maximum likelihood estimation, and update the relation repository.
[0036] Step 4.3: Assign higher weights to samples that are classified differently in adjacent training rounds for instance-level contrastive loss.
[0037] Step 4.4: For unlabeled data, use maximum likelihood estimation to output the class prediction of the sample; for labeled data, use classification cross-entropy to output the class prediction of the sample.
[0038] Repeat the above steps until the loss has not changed significantly for 10 rounds, or until the maximum number of training rounds is reached.
[0039] In another aspect, based on the above method, an open-domain relation extraction system based on adaptive clustering is proposed, including a relation encoding module, an adaptive hard sample module, a supervised and self-supervised contrastive learning module, and an adaptive clustering module.
[0040] Among them, the relation encoding module is responsible for encoding relation instances into corresponding relation representations;
[0041] The adaptive hard sample module is responsible for constructing positive and negative samples and providing upper and lower bounds for the similarity between positive and negative samples;
[0042] The supervised and self-supervised contrastive learning module performs inter-instance self-supervised contrastive learning on labeled and unlabeled data and supervised contrastive learning on labeled data, and updates the encoder parameters.
[0043] The adaptive clustering module is used for clustering decision boundary updates for unlabeled data and classification of labeled data.
[0044] Furthermore, the relation encoding module includes a sample acquisition unit, a preprocessing unit, and an embedding layer unit.
[0045] Wherein:
[0046] The sample acquisition unit is used to acquire relation instances from the corpus;
[0047] The preprocessing unit is used to obtain the required data, including sentences, head entities, and tail entities, and is responsible for adding special markers to the sentences;
[0048] The embedding layer unit is used to convert the processed text information into corresponding word vector sequences, and further generate relation representations.
[0049] The adaptive hard sample module includes sample construction units and bilateral boundary constraint units. Among them:
[0050] The sample construction unit is responsible for obtaining the positive and negative samples corresponding to the relation instances;
[0051] The bilateral boundary constraint unit provides upper and lower bounds for the similarity between instances and positive samples, as well as the difference in similarity between instances and negative samples.
[0052] The supervised and self-supervised contrastive learning module includes a self-supervised contrastive unit and a supervised contrastive unit. Among them:
[0053] The self-supervised contrastive unit constructs a contrastive loss over the relation instances of labeled and unlabeled data;
[0054] The supervised contrastive unit targets labeled data, where positive samples are data of the same category from the same batch, and negative samples are data of different categories from the same batch.
[0055] The adaptive clustering module includes relation repository units and adaptive clustering units. Wherein:
[0056] The relation repository unit maintains positive samples corresponding to relation instances in queues of different relation categories;
[0057] The adaptive clustering unit uses a relation repository and decision boundary to classify unlabeled data and uses a classification task to provide class predictions for labeled data.
[0058] The connections between the above components are as follows:
[0059] The input of the adaptive hard sample module is connected to the output of the relation encoding module;
[0060] The input of the supervised and self-supervised contrastive learning module is connected to the output of the relation encoding module;
[0061] The input of the adaptive clustering module is connected to the output of the relation encoding module;
[0062] The input of the supervised and self-supervised contrastive learning module is connected to the output of the adaptive hard sample module;
[0063] The input of the adaptive clustering module is connected to the output of the adaptive hard sample module;
[0064] The input of the supervised and self-supervised contrastive learning module is connected to the output of the adaptive clustering module.
[0065] In the relation encoding module, the input of the preprocessing unit is connected to the output of the sample acquisition unit, and the input of the embedding layer unit is connected to the output of the preprocessing unit.
[0066] In the adaptive hard sample module, the input of the bilateral boundary constraint unit is connected to the output of the sample construction unit.
[0067] In the supervised and self-supervised contrastive learning module, the input of the supervised contrastive unit is connected to the output of the self-supervised contrastive unit.
[0068] In the adaptive clustering module, the input of the adaptive clustering unit is connected to the output of the relation repository unit.
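The module-level connections listed above form a directed acyclic graph, so one valid processing order can be derived mechanically. The following is a minimal sketch, not part of the claimed system: the dictionary keys are hypothetical identifiers chosen here for the four modules, and each entry lists the modules whose outputs feed that module's input.

```python
from graphlib import TopologicalSorter

# Hypothetical identifiers for the four modules; each value is the set of
# modules whose output is connected to this module's input ([0059]-[0064]).
MODULE_INPUTS = {
    "relation_encoding": set(),
    "adaptive_hard_sample": {"relation_encoding"},
    "adaptive_clustering": {"relation_encoding", "adaptive_hard_sample"},
    "contrastive_learning": {
        "relation_encoding",
        "adaptive_hard_sample",
        "adaptive_clustering",
    },
}


def execution_order(module_inputs):
    """Return an order in which every module runs after all of its inputs."""
    return list(TopologicalSorter(module_inputs).static_order())
```

Under these connections the order is fully determined: encoding first, then hard sample construction, then adaptive clustering, then contrastive learning.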
[0069] Beneficial effects
[0070] Compared with existing technologies, this invention uses a bilateral boundary loss for different types of difficult samples and constructs a supervised contrastive loss on top of instance-level contrastive learning. It fully utilizes the semantic information of the relation repository to achieve adaptive clustering, improving the quality of open-domain relation extraction. Experiments and data visualization on the relation extraction dataset FewRel demonstrate that these components, together with the adaptive clustering method, effectively alleviate the semantic overlap between clusters and improve the performance of open-domain relation extraction, and that the open-domain relation extraction system based on adaptive clustering outperforms unsupervised open-domain relation extraction systems.
Description of the Drawings
[0071] Figure 1 is a flowchart of the method of the present invention;
[0072] Figure 2 is a data visualization diagram of an embodiment of the present invention;
[0073] Figure 3 is a schematic diagram of the system architecture of the present invention.
Detailed Implementation
[0074] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments.
[0075] Example
[0076] As shown in Figure 1, an open-domain relation extraction method based on adaptive clustering includes the following steps:
[0077] Step 1: Input a relation instance and encode it to generate a relation representation. Specifically:
[0078] Step 1.1: Convert relation instances into sequences of their word vector representations through an embedding layer;
[0079] Step 1.2: Encode the input sequence using an encoder and output the relational representation.
[0080] Step 2: Define and construct different types of difficult samples for the labeled data, and use bilateral boundary loss to measure the differences between the two types of difficult samples in the semantic space. Specifically:
[0081] Step 2.1: Construct different types of difficult samples using different strategies;
[0082] Step 2.1.1: For cases with the same relation type in different contexts, construct positive samples using an entity replacement strategy. Specifically, randomly replace the head and tail entities with other words of the same entity type.
[0083] Step 2.1.2: For cases with similar contexts but different relation types, randomly select instances with different relation types from the original relation instances, modify their head and tail entities to replace them with synonyms from the original instances, and construct negative samples.
[0084] Step 2.2: After obtaining the relational representation of the samples obtained in Step 2.1, construct the bilateral boundary loss.
[0085] Step 3: Construct instance-level supervised contrastive learning and self-supervised contrastive learning losses using labeled and unlabeled data. Specifically:
[0086] Step 3.1: Construct positive samples for each relation instance in the labeled and unlabeled datasets using the strategy in Step 2.1.1;
[0087] Step 3.2: The relation representations of the instances and their positive samples are obtained via Step 1. The negative samples of an instance are the relation representations of the other positive samples in the same batch. Cosine similarity is used to measure similarity, and a self-supervised contrastive loss between instances is constructed.
[0088] Step 3.3: For the labeled data, use the relation instances belonging to the same category in the same batch as positive samples and the other samples in the same batch as negative samples. After obtaining the relation representations of the instances, positive samples, and negative samples via Step 1, construct the supervised contrastive loss between instances.
[0089] Step 4: Construct a relation repository for unlabeled data, utilize the relation representations in each storage queue to model semantic differences between clusters to achieve adaptive clustering, provide category prediction for unlabeled data, and simultaneously use a classification task to predict labels for labeled data. Specifically:
[0090] Step 4.1: Build a relation repository, where each queue represents a positive sample relation representation for each relation category;
[0091] Step 4.2: Minimize the cross-entropy between cluster assignment based on semantic similarity in the feature space and predictions generated based on decision boundaries;
[0092] Step 4.3: Assign higher weights to samples that are classified differently in adjacent training rounds for instance-level contrastive loss;
[0093] Step 4.4: For unlabeled data, use maximum likelihood estimation to output the class prediction of the sample; for labeled data, use classification cross-entropy to output the class prediction of the sample.
[0094] In steps 1.1, 2.2, and 3.2, it is necessary to obtain relation instances and related positive and negative samples, and preprocess the samples. For example, 6400 relation instances from 64 relation classes in the FewRel dataset are selected as the training corpus. Each relation instance includes a sentence, the start and end positions of the head entity, and the start and end positions of the tail entity. For example:
[0095] Sentence: Mike was born in British.
[0096] Head entity start and end positions: [0,1]
[0097] Tail entity start and end positions: [4,5]
[0098] In one embodiment, the head and tail entities of the sentence are first marked with the paired special symbols "<e1>", "</e1>" and "<e2>", "</e2>". The tagged sentence constitutes the input sequence, that is:
[0099] Input sequence: <e1> Mike </e1> was born in <e2> British </e2> .
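The marker insertion of step 1.1.1 can be sketched as follows. This is an illustrative sketch only: the function name `mark_entities` is hypothetical, and it assumes the entity positions are token indices with an inclusive start and an exclusive end, which is how the example positions [0,1] and [4,5] read for the tokenized sentence.

```python
def mark_entities(tokens, head_span, tail_span):
    """Wrap the head and tail entity spans in marker tokens.

    Spans are (start, end) token indices, start inclusive and end exclusive
    (an assumption made for this sketch).
    """
    head_start, head_end = head_span
    tail_start, tail_end = tail_span
    marked = []
    for index, token in enumerate(tokens):
        # Closing markers go before the token that follows the entity.
        if index == head_end:
            marked.append("</e1>")
        if index == tail_end:
            marked.append("</e2>")
        # Opening markers go before the first token of the entity.
        if index == head_start:
            marked.append("<e1>")
        if index == tail_start:
            marked.append("<e2>")
        marked.append(token)
    # Handle entities that end at the last token of the sentence.
    if head_end == len(tokens):
        marked.append("</e1>")
    if tail_end == len(tokens):
        marked.append("</e2>")
    return marked
```

Applied to the tokens of the example sentence with head span (0, 1) and tail span (4, 5), this yields the input sequence shown in [0099].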
[0100] Next, an embedding layer mapping is used to convert the input sequence into a word vector representation sequence.
[0101] In step 1.2, after obtaining the word vector representation sequence, it is fed into the encoder to obtain the relational representation of the input sequence.
[0102] First, the entity-related word vector representations are max-pooled to obtain the entity's encoded representation:
[0103] $h_{ent} = \mathrm{MAXPOOL}([h_s, \ldots, h_e])$ (1)
[0104] where $h_s$ and $h_e$ are the word vectors corresponding to the start and end positions of the entity, respectively.
[0105] Secondly, the concatenation of the head and tail entity encoded representations is used as the relation representation $z_i$:
[0106] $z_i = [h_{head}, h_{end}]$ (2)
[0107] where $h_{head}$ and $h_{end}$ are the encoded representations of the head and tail entities, respectively, and $[\cdot,\cdot]$ denotes the concatenation operation.
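Equations (1) and (2) can be illustrated with a small sketch in plain Python. The function names are hypothetical, the encoder outputs are represented as lists of floats, and the span indices are taken as inclusive on both ends, following the notation h_s, ..., h_e in equation (1).

```python
def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def relation_representation(hidden, head_span, tail_span):
    """Concatenate the max-pooled head and tail entity encodings.

    hidden: one encoded vector per token.
    head_span, tail_span: (start, end) token indices, both inclusive.
    """
    h_head = max_pool(hidden[head_span[0]:head_span[1] + 1])
    h_tail = max_pool(hidden[tail_span[0]:tail_span[1] + 1])
    return h_head + h_tail  # list concatenation, i.e. [h_head, h_end]
```

With d-dimensional encoder outputs, the resulting relation representation has 2d dimensions.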
[0108] In step 2.1, for the labeled data, two examples of constructing difficult samples are as follows:
[0109] Original example: Mike was born in British.
[0110] Same relation type in different contexts: The birthplace of Jack was Canada.
[0111] Similar contexts, different relationship types: Mike died in British.
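The entity replacement strategy of step 2.1.1 can be sketched as below. This is a sketch under stated assumptions rather than the patented procedure: the entity pool, its type names, and the function name `replace_entities` are hypothetical, spans are end-exclusive token indices, and the head entity is assumed to precede the tail entity without overlap.

```python
import random


def replace_entities(tokens, head_span, tail_span, head_type, tail_type, pool, rng):
    """Build a positive sample by swapping both entities for same-type entities.

    pool maps an entity type to a list of candidate entity strings
    (a hypothetical resource; the source does not specify where it comes from).
    """
    def pick(span, entity_type):
        current = " ".join(tokens[span[0]:span[1]])
        candidates = [e for e in pool[entity_type] if e != current]
        return rng.choice(candidates).split()

    (head_start, head_end), (tail_start, tail_end) = head_span, tail_span
    new_head = pick(head_span, head_type)
    new_tail = pick(tail_span, tail_type)
    # Context tokens are kept unchanged; only the two entity spans are replaced.
    return (tokens[:head_start] + new_head + tokens[head_end:tail_start]
            + new_tail + tokens[tail_end:])
```

Passing a seeded `random.Random` instance as `rng` makes the sampled replacements reproducible.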
[0112] For step 2.2, the loss $L_H$ constrains the difference between the similarity of the original sample to the negative sample and the similarity of the original sample to the positive sample within a specified range:
[0113]
[0114]
[0115] Here, the two sample symbols in the formula denote the relation representations of the positive and negative samples corresponding to the labeled data, respectively. The max(,) operation takes the maximum value, and m1 and m2 are the upper and lower bounds of the semantic difference, respectively.
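Because the formula itself is not reproduced in this text, the following is only a hedged sketch of one common way to write a two-sided (bilateral) hinge constraint, not the patent's exact expression. It assumes the constrained quantity is the cosine similarity to the negative sample minus the cosine similarity to the positive sample, and it takes generic `lower` and `upper` bounds (which would be set from -m1 and -m2) instead of committing to which margin is which.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def bilateral_boundary_loss(z, z_pos, z_neg, lower, upper):
    """Penalize the similarity difference when it leaves [lower, upper].

    The form max(0, lower - d) + max(0, d - upper) is an assumption
    made for illustration.
    """
    difference = cosine(z, z_neg) - cosine(z, z_pos)
    return max(0.0, lower - difference) + max(0.0, difference - upper)
```

The loss is zero while the difference stays inside the interval and grows linearly once it crosses either bound, which is what gives difficult samples both an upper and a lower constraint.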
[0116] In step 3.1, positive samples from labeled and unlabeled datasets are obtained using the positive sample construction strategy in step 2.1.1, and relation representations are obtained using step 1.
[0117] Step 3.2 constructs a unified instance-level self-supervised contrastive learning model for the two datasets:
[0118]
[0119] Here, the formula uses the relation representation of the positive sample of the current instance $z_i$; τ is the temperature coefficient; and $\mathbb{1}_{[n \neq i]}$ is an indicator that equals 1 if and only if n is not equal to i, and 0 otherwise.
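As the formula is not reproduced here, the sketch below uses a standard InfoNCE-style form as a stand-in; it is an assumption, not necessarily the patent's exact loss. Each instance is pulled toward its own positive sample, and the positive samples of the other instances in the batch (n ≠ i) act as negatives, with cosine similarity scaled by the temperature τ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def self_supervised_contrastive_loss(reps, pos_reps, tau=0.1):
    """Mean InfoNCE-style loss over a batch (assumed form).

    reps[i] is the relation representation of instance i and pos_reps[i]
    that of its positive sample; pos_reps[n] with n != i are negatives.
    """
    total = 0.0
    for i, z in enumerate(reps):
        positive = math.exp(cosine(z, pos_reps[i]) / tau)
        negatives = sum(
            math.exp(cosine(z, pos_reps[n]) / tau)
            for n in range(len(reps)) if n != i
        )
        total += -math.log(positive / (positive + negatives))
    return total / len(reps)
```

The default τ = 0.1 is a placeholder value chosen for the sketch; the source does not state the temperature used.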
[0120] Step 3.3 involves constructing instance-level supervised contrastive learning for the labeled dataset:
[0121]
[0122] Here, $P(i) = \{p \in N \setminus i : y_p = y_i\}$ is the set of samples in the current batch whose labels match the label of the i-th sample, and N is the sample size of each batch. For unlabeled datasets, the two contrastive losses are unified as follows:
[0123]
[0124] Where λ is a hyperparameter used to balance the two losses. N is the sample size of each batch.
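A supervised contrastive loss built on the positive set P(i), and its combination with the self-supervised term, can be sketched as follows. The widely used SupCon form is assumed here because the formula is not reproduced in this text, and placing λ on the supervised term is likewise an assumption about how the two losses are balanced.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def supervised_contrastive_loss(reps, labels, tau=0.1):
    """SupCon-style loss (assumed form); P(i) = same-label samples other than i."""
    n = len(reps)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # instance i has no same-label partner in this batch
        denominator = sum(
            math.exp(cosine(reps[i], reps[a]) / tau) for a in range(n) if a != i
        )
        total += -sum(
            math.log(math.exp(cosine(reps[i], reps[p]) / tau) / denominator)
            for p in positives
        ) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


def combined_contrastive_loss(loss_self, loss_supervised, lam):
    """Balance the two contrastive losses with the hyperparameter lambda."""
    return loss_self + lam * loss_supervised
```

For a batch without any same-label pairs the supervised term evaluates to zero in this sketch, so only the self-supervised term contributes.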
[0125] In step 4.1, for unlabeled data, let B denote the batch size and $C_u$ the number of unlabeled categories. A relation repository collection of size $BN/(C_u - 1)$ is constructed to store the positive sample representations of each category. For a positive sample representation with the current pseudo-label, the positive samples corresponding to the other unlabeled data (those stored outside its own queue) are used as the comparison set $Q_i$. After each backpropagation, the new relation representation enters the corresponding queue, and the earliest entry in that queue is deleted.
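The first-in, first-out behaviour of the relation repository can be sketched with bounded queues. The class and method names are hypothetical, and the reading that the comparison set consists of the representations stored under all other pseudo-labels is an assumption based on the description above.

```python
from collections import deque


class RelationRepository:
    """One bounded FIFO queue of positive sample representations per category."""

    def __init__(self, num_categories, queue_size):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.queues = [deque(maxlen=queue_size) for _ in range(num_categories)]

    def push(self, pseudo_label, representation):
        """Store a new representation in the queue of its pseudo-label."""
        self.queues[pseudo_label].append(list(representation))

    def comparison_set(self, pseudo_label):
        """Representations stored under every other category (assumed Q_i)."""
        return [
            rep
            for category, queue in enumerate(self.queues)
            if category != pseudo_label
            for rep in queue
        ]
```

Using a bounded deque keeps the enqueue and evict step constant-time, which matters because the repository is refreshed after every backpropagation.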
[0126] In step 4.2, the semantic similarity between the current representation and each relation category is calculated using the relation representations of each category in the relation repository. The current representation is then processed by the clustering decision boundary, i.e., mapped and passed through a Softmax operation, yielding the vector $p_i$, each dimension of which represents the probability that the current representation belongs to the corresponding relation category. Finally, the cross-entropy $L_{CD}$ between the cluster assignment based on semantic similarity in the feature space and the prediction $p_{i,j}$ generated by the decision boundary is minimized to update the parameters of the decision boundary:
[0127]
[0128]
[0129]
[0130] Here, the relation representation is that of the unlabeled data, τ is the temperature coefficient, and W and b are the parameters of the decision boundary, which maps the representation to a $C_u$-dimensional vector.
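The three quantities described in step 4.2 can be sketched as follows. Since the formulas are not reproduced in this text, two choices are assumptions made for illustration: the similarity to a category is taken as the mean cosine similarity to the representations in that category's queue, and the soft cluster assignment is a temperature-scaled Softmax over those similarities. The decision boundary is the linear map W z + b followed by Softmax, as described.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_assignment(z, queues, tau):
    """Soft cluster assignment from similarity to each (non-empty) queue."""
    sims = [sum(cosine(z, rep) for rep in queue) / len(queue) for queue in queues]
    return softmax([s / tau for s in sims])


def boundary_prediction(z, W, b):
    """Decision boundary output: Softmax(W z + b), one probability per category."""
    logits = [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


def cluster_decision_loss(assignment, prediction):
    """Cross-entropy between the assignment q and the prediction p."""
    return -sum(q * math.log(p) for q, p in zip(assignment, prediction))
```

Minimizing this cross-entropy pushes the decision boundary's prediction toward the assignment implied by the repository, which is how the boundary adapts to the cluster semantics.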
[0131] After each training round, the labels of instances in the iteration cycle are updated using maximum likelihood estimation.
[0132]
[0133] In step 4.3, samples that differ in class classification during adjacent training rounds are assigned higher weights for instance-level contrastive loss:
[0134]
[0135]
[0136]
[0137]
[0138] Here, the formulas use the relation representation in round e; the indicator takes the value 1 if and only if the labels in adjacent rounds are the same; the weight is that assigned to the relation representation in the e-th round of training; and the score is computed between the relation representations of samples from the same batch in round e.
[0139] In step 4.4, a classification cross-entropy loss is constructed for the labeled data during cluster training:
[0140]
[0141] where $y_c$ is the label corresponding to the current labeled data and $p_c$ is the probability output by the classifier.
[0142] Finally, as shown in Figure 1, the training losses of the modules are combined into the overall loss L:
[0143] $L = \alpha L_H + L_{CL} + L_{CD} + \beta L_{CE}$ (17)
[0144] Here, α and β are hyperparameters.
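As a small worked instance of equation (17), the weighted sum can be written directly. The function name and the numeric values used afterwards are illustrative only; the source does not state the values of α and β.

```python
def total_loss(l_h, l_cl, l_cd, l_ce, alpha, beta):
    """Overall loss of equation (17): alpha * L_H + L_CL + L_CD + beta * L_CE."""
    return alpha * l_h + l_cl + l_cd + beta * l_ce
```

For example, with L_H = 0.5, L_CL = 1.0, L_CD = 0.25, L_CE = 2.0, α = 2.0 and β = 0.5, the terms are 1.0 + 1.0 + 0.25 + 1.0, giving L = 3.25.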
[0145] Repeat the above steps until the loss does not change significantly within 10 rounds, or until the maximum number of training rounds is reached and training is terminated.
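The stopping rule can be sketched as a check on the recorded loss history. The tolerance `tol` and the default `max_epochs` are hypothetical values, since the source only says that the loss "does not change significantly" and does not give a threshold or a round limit.

```python
def should_stop(loss_history, patience=10, tol=1e-4, max_epochs=100):
    """Stop when the loss is flat over `patience` rounds or the round limit is hit.

    loss_history holds one loss value per completed training round.
    """
    if len(loss_history) >= max_epochs:
        return True
    if len(loss_history) <= patience:
        return False  # not enough rounds yet to judge a plateau
    recent = loss_history[-(patience + 1):]
    # "No significant change" is read here as a spread below tol (an assumption).
    return max(recent) - min(recent) < tol
```

Comparing the spread over the last patience + 1 values covers exactly `patience` round-to-round changes.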
[0146] The model uses the Adam gradient update algorithm to update its parameters. After convergence, the model is stored for subsequent testing. The cross-entropy loss function and training method are existing technologies and will not be elaborated upon.
[0147] To highlight the ability of this invention to alleviate semantic entanglement between class clusters, the relation representations of 800 unlabeled data points from 8 categories in the FewRel dataset were randomly selected. Each representation was reduced to 2 dimensions and visualized, as shown in Figure 2, where each category is distinguished by a different color. Subgraphs (a), (b), (c), and (d) show the initial state of the relation representations in the semantic space and the visualization results after 10, 30, and 52 training epochs, respectively. As the number of epochs increases, the representations within each relation semantic cluster become increasingly compact, and the boundaries between clusters become more distinct. Although the orange class is divided into multiple subsets, its semantic overlap with other categories is significantly reduced compared with the initial state. Even the difficult samples located at the boundaries of other clusters are compactly grouped together, reducing semantic entanglement with adjacent clusters.
[0148] According to another aspect of the present invention, an open-domain relation extraction system based on adaptive clustering is proposed. As shown in Figure 3, the system includes a relation encoding module, an adaptive hard sample module, a supervised and self-supervised contrastive learning module, and an adaptive clustering module. The relation encoding module is responsible for preprocessing the corpus and converting the data into corresponding relation representations. The adaptive hard sample module constructs different hard samples and constrains them with a bilateral boundary loss. The supervised and self-supervised contrastive learning module constructs supervised contrastive learning for labeled data and self-supervised contrastive learning for labeled and unlabeled data. The adaptive clustering module constructs a relation repository for unlabeled data, achieves adaptive decision boundary optimization, and constructs a classification task for labeled data.
[0149] Furthermore, the relation encoding module includes a sample acquisition unit, a preprocessing unit, and an embedding layer unit.
[0150] Wherein:
[0151] The sample acquisition unit is used to acquire relation instances from the corpus;
[0152] The preprocessing unit is used to obtain the required data, including sentences, head entities, and tail entities. It is responsible for adding special markers to the sentences, as well as constructing a vocabulary and segmenting the sentences.
[0153] The embedding layer unit is used to convert the processed text information into corresponding word vector sequences, and further generate relation representations.
[0154] The adaptive hard sample module includes sample construction units and bilateral boundary constraint units. Among them:
[0155] The sample construction unit is responsible for obtaining the positive and negative samples corresponding to the relation instances;
[0156] The bilateral boundary constraint unit provides upper and lower bounds for the similarity between instances and positive samples, as well as the difference in similarity between instances and negative samples.
[0157] The supervised and self-supervised contrastive learning module includes a self-supervised contrastive unit and a supervised contrastive unit. Wherein:
[0158] The self-supervised contrastive unit constructs a contrastive loss between relation instances for both labeled and unlabeled data;
[0159] The supervised contrastive unit targets labeled data, where positive samples are data of the same category from the same batch, and negative samples are data of different categories from the same batch.
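The in-batch selection of positive and negative samples for the supervised unit can be sketched as follows. This is a minimal illustration only: the function name `in_batch_pairs` and the use of plain label lists are assumptions made for the example and are not taken from the embodiment.

```python
def in_batch_pairs(labels):
    """For each sample index, list its in-batch positives and negatives.

    Positives share the sample's label (the sample itself is excluded);
    negatives are all samples in the batch with a different label.
    """
    pairs = []
    for i, label_i in enumerate(labels):
        positives = [j for j, label_j in enumerate(labels)
                     if j != i and label_j == label_i]
        negatives = [j for j, label_j in enumerate(labels)
                     if label_j != label_i]
        pairs.append((positives, negatives))
    return pairs


# Hypothetical batch of four labeled relation instances.
batch_labels = ["founder_of", "born_in", "founder_of", "capital_of"]
pairs = in_batch_pairs(batch_labels)
```

A sample whose label is unique within the batch ends up with an empty positive list, so a real loss would need to skip or mask such samples.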
[0160] The adaptive clustering module includes a relation repository unit and an adaptive clustering unit. Wherein:
[0161] The relation repository unit maintains positive samples corresponding to relation instances in queues of different relation categories;
[0162] The adaptive clustering unit utilizes a relation repository and decision boundaries to classify unlabeled data and is trained together with the labeled data classification task.
[0163] In the sample acquisition unit, the corpus of the FewRel dataset can be selected as the training set and the test set.
[0164] In the preprocessing unit, entity-related special tokens are added to the sentence, and byte-pair encoding (BPE) is then used to obtain a vocabulary. The sentence is then segmented based on the vocabulary obtained from the BPE encoding. The specific method is as described above.
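The entity-marking part of this preprocessing can be illustrated with a short sketch. It assumes whitespace tokens and inclusive (start, end) token spans, and it leaves out the BPE segmentation step entirely; the function name `mark_entities` is hypothetical.

```python
def mark_entities(tokens, head_span, tail_span):
    """Wrap the head entity in <e1>...</e1> and the tail entity in <e2>...</e2>.

    Spans are (start, end) token indices, inclusive, and are assumed
    not to overlap.
    """
    (head_start, head_end), (tail_start, tail_end) = head_span, tail_span
    marked = []
    for i, token in enumerate(tokens):
        if i == head_start:
            marked.append("<e1>")
        if i == tail_start:
            marked.append("<e2>")
        marked.append(token)
        if i == head_end:
            marked.append("</e1>")
        if i == tail_end:
            marked.append("</e2>")
    return marked


tokens = "Marie Curie was born in Warsaw".split()
marked = mark_entities(tokens, head_span=(0, 1), tail_span=(5, 5))
```

In a full pipeline the marked sentence would then be passed to the subword tokenizer, with the four markers registered as special tokens so that they are not split.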
[0165] In the embedding layer unit, sentences can be converted into a sequence of word vector representations through mapping to construct relation representations. The specific method is as described above.
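As an illustration of how a relation representation can be formed from encoded token vectors, the sketch below max-pools the vectors inside each entity span and concatenates the two pooled vectors, following the max pooling and concatenation described for the method. The plain-list vectors and the function names are assumptions for the example; a real system would operate on encoder output tensors.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def relation_representation(encoded, head_span, tail_span):
    """Concatenate the max-pooled head and tail entity encodings.

    `encoded` holds one vector per token; spans are inclusive
    (start, end) token indices.
    """
    head_start, head_end = head_span
    tail_start, tail_end = tail_span
    head = max_pool(encoded[head_start:head_end + 1])
    tail = max_pool(encoded[tail_start:tail_end + 1])
    return head + tail  # list concatenation


# Toy 2-dimensional encodings for four tokens.
encoded = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 5.0]]
rep = relation_representation(encoded, head_span=(0, 1), tail_span=(2, 3))
```

The resulting vector has twice the encoder's hidden size, one half per entity.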
[0166] In the sample construction unit, two strategies are used to construct samples with similar contexts but different relations (negative samples) and samples with the same relation but different contexts (positive samples), as described above.
[0167] In the bilateral boundary constraint unit, upper and lower bound constraints are provided for the difference in similarity between each sample pair, and the specific method is as described above.
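One plausible way to express an upper and lower bound on a similarity difference is a pair of hinge terms, sketched below. This is not the formula of the embodiment: the hinge form, the default bounds `lower` and `upper`, and the function names are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_sided_boundary_loss(anchor, positive, negative, lower=0.1, upper=0.5):
    """Penalize a similarity gap that falls outside [lower, upper].

    gap = sim(anchor, positive) - sim(anchor, negative). The first hinge
    is active when the gap is too small, the second when it is too large.
    """
    gap = cosine(anchor, positive) - cosine(anchor, negative)
    return max(0.0, lower - gap) + max(0.0, gap - upper)


in_range = two_sided_boundary_loss([1.0, 0.0], [1.0, 0.0], [3.0, 4.0])
too_large = two_sided_boundary_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
too_small = two_sided_boundary_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

The upper bound keeps hard negatives from being pushed arbitrarily far away, which is the part that distinguishes a two-sided constraint from an ordinary triplet margin.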
[0168] In the self-supervised contrastive unit and the supervised contrastive unit, instance-level contrastive losses are constructed for labeled data and unlabeled data, and the specific method is as described above.
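An instance-level self-supervised contrastive loss of this kind can be sketched in the common InfoNCE form, where each instance is pulled toward its own positive sample and the positives of the other instances in the batch act as negatives. The exact loss of the embodiment may differ (for example in which terms enter the denominator); the temperature value and function names below are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(instances, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive sample of instances[i]; every other
    entry of `positives` serves as a negative for instance i.
    """
    losses = []
    for i, z in enumerate(instances):
        logits = [cosine(z, p) / temperature for p in positives]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


instances = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(instances, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(instances, [[0.0, 1.0], [1.0, 0.0]])
```

When positives are aligned with their instances the loss is close to zero; when they are swapped it is large, which is the behaviour the contrastive objective relies on.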
[0169] In the relation repository unit, positive samples corresponding to relation instances are maintained in queues of different relation categories, as described above.
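A relation repository with one fixed-length queue per relation category can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest element automatically when a full queue receives a new one. The class and method names are hypothetical, and the queue size is an assumed parameter.

```python
from collections import deque


class RelationRepository:
    """One FIFO queue of positive-sample representations per category."""

    def __init__(self, num_relations, queue_size):
        self.queues = [deque(maxlen=queue_size) for _ in range(num_relations)]

    def push(self, pseudo_label, representation):
        # With maxlen set, appending to a full deque drops the oldest member.
        self.queues[pseudo_label].append(representation)

    def members(self, pseudo_label):
        return list(self.queues[pseudo_label])

    def contrast_set(self, pseudo_label):
        # Representations stored under every other pseudo-label.
        return [rep for label, queue in enumerate(self.queues)
                if label != pseudo_label for rep in queue]


repo = RelationRepository(num_relations=2, queue_size=2)
for value in (1.0, 2.0, 3.0):
    repo.push(0, [value])
repo.push(1, [9.0])
```

After the three pushes to category 0, only the two most recent representations remain in that queue.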
[0170] In the adaptive clustering unit, the relation repository and decision boundary are used to classify unlabeled data, and the classification task is used to provide class prediction for labeled data.
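The interaction between the repository and the decision boundary can be sketched as a cross-entropy between two distributions: a soft cluster assignment derived from similarity to the stored representations, and a prediction produced by a mapping followed by Softmax. The sketch assumes a linear mapping, averages cosine similarity over each non-empty queue, and uses an assumed temperature; none of these specifics are confirmed by the text.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def similarity_assignment(z, queues, temperature=0.1):
    """Soft assignment from mean similarity to each (non-empty) queue."""
    sims = [sum(cosine(z, r) for r in q) / len(q) for q in queues]
    return softmax([s / temperature for s in sims])


def boundary_prediction(z, weights, bias):
    """Assumed linear decision boundary followed by Softmax."""
    logits = [sum(w * x for w, x in zip(row, z)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cluster_cross_entropy(assignment, prediction, eps=1e-12):
    return -sum(q * math.log(p + eps) for q, p in zip(assignment, prediction))


z = [1.0, 0.0]
queues = [[[1.0, 0.0]], [[0.0, 1.0]]]
q = similarity_assignment(z, queues)
p_agree = boundary_prediction(z, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
p_disagree = boundary_prediction(z, [[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0])
pseudo_label = max(range(len(p_agree)), key=p_agree.__getitem__)
```

Minimizing this cross-entropy moves the decision boundary toward the assignments suggested by the repository, and the arg-max of the prediction can serve as the pseudo-label for the next round.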
[0171] Those skilled in the art will understand that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.
Claims
1. An open-domain relation extraction method based on adaptive clustering, characterized in that it includes the following steps:
Step 1: Input a relation instance and encode it to generate a relation representation, specifically:
Step 1.1: Convert the relation instance into a sequence of word vector representations through an embedding layer;
Step 1.1.1: Expand the sentence in the relation instance, specifically by using the pairs of special symbols "<e1>", "</e1>" and "<e2>", "</e2>" to mark the start and end positions of the head entity and the tail entity, respectively;
Step 1.1.2: Map the sentence processed in Step 1.1.1 word by word into a sequence of word vectors through the embedding layer;
Step 1.2: Encode the input sequence using an encoder and output a relation representation;
Step 1.2.1: Obtain the encoded representation of the input sequence through the encoder, and then perform max pooling on the encoded representations related to the head entity and the tail entity to obtain the encoded representation of each entity;
Step 1.2.2: Concatenate the head and tail entity encoded representations obtained in Step 1.2.1 to obtain the relation representation of the relation instance;
Step 2: Define and construct different types of hard samples for the labeled data, and use a bilateral boundary loss to measure the differences between the two types of hard samples in the semantic space, specifically:
Step 2.1: Construct different types of hard samples using different strategies;
Step 2.1.1: For cases with the same relation type in different contexts, construct positive samples using an entity replacement strategy, specifically by randomly replacing the head and tail entities with other words of the same entity type;
Step 2.1.2: For cases with similar contexts but different relation types, randomly select instances with different relation types from the original relation instances and replace their head and tail entities with synonyms from the original instances to construct negative samples;
Step 2.2: After obtaining the relation representations of the samples constructed in Step 2.1, construct the bilateral boundary loss;
Step 3: Construct instance-level supervised contrastive learning and self-supervised contrastive learning losses using labeled and unlabeled data, specifically:
Step 3.1: Construct positive samples for each relation instance in the labeled and unlabeled datasets using the strategy in Step 2.1.1;
Step 3.2: After obtaining the relation representations of the instances and positive samples through Step 1, use the relation representations of the other positive samples in the same batch as the negative samples, measure similarity with cosine similarity, and construct the self-supervised contrastive loss between instances;
Step 3.3: For the labeled data, use the relation instances belonging to the same category in the same batch as positive samples and the other samples in the same batch as negative samples; after obtaining the relation representations of the instances, positive samples and negative samples through Step 1, construct the supervised contrastive loss between instances;
Step 4: Construct a relation repository for unlabeled data, use the relation representations in each storage queue to model semantic differences between clusters so as to achieve adaptive clustering, provide category prediction for unlabeled data, and simultaneously use a classification task to predict labels for labeled data, specifically:
Step 4.1: Build a relation repository, where each queue holds the positive sample relation representations of one relation category;
Step 4.2: Minimize the cross-entropy between the cluster assignment based on semantic similarity in the feature space and the prediction generated by the decision boundary;
Step 4.2.1: Calculate the semantic similarity between the current representation and each relation category using the relation representations of each category in the relation repository;
Step 4.2.2: Use the vector obtained after the current representation is processed by the clustering decision boundary, i.e., after mapping and a Softmax operation, as the probability that the current representation belongs to each relation category;
Step 4.2.3: For each category, minimize the cross-entropy between the semantic similarity and the prediction output by the decision boundary, and update the parameters of the decision boundary;
Step 4.2.4: After each round of training, update the parameters of the encoder and the decision boundary, update the label of each positive sample using maximum likelihood estimation, and update the relation repository;
Step 4.3: Assign higher weights in the instance-level contrastive loss to samples that are classified differently in adjacent training rounds;
Step 4.4: For unlabeled data, use maximum likelihood estimation to output the class prediction of the sample; for labeled data, use classification cross-entropy to output the class prediction of the sample pair;
Repeat all the above steps until the loss does not change significantly within 10 rounds, or terminate when the maximum number of training rounds is reached.
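The termination rule at the end of claim 1 can be illustrated with a small helper. The claim fixes the window of 10 rounds; the numeric tolerance that stands in for "does not change significantly" and the default maximum number of rounds are assumptions made for this sketch.

```python
def should_stop(loss_history, patience=10, tolerance=1e-4, max_rounds=100):
    """Decide whether training should terminate.

    Stops when the maximum number of rounds is reached, or when the loss
    has varied by less than `tolerance` over the last `patience` rounds.
    """
    if len(loss_history) >= max_rounds:
        return True
    if len(loss_history) <= patience:
        return False
    # patience + 1 values span exactly `patience` round-to-round changes.
    window = loss_history[-(patience + 1):]
    return max(window) - min(window) < tolerance


plateau = should_stop([1.0] * 11)
still_improving = should_stop([float(20 - i) for i in range(20)])
```

Calling the helper once per round with the accumulated loss values is enough to drive the outer training loop.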
2. The open-domain relation extraction method based on adaptive clustering as described in claim 1, characterized in that: In steps 1.1, 2.2 and 3.2, relation instances and their related positive and negative samples are obtained and preprocessed, and 6400 relation instances from 64 types of relations in the FewRel dataset are selected as the training corpus. Each relation instance includes a sentence, the start and end positions of the head entity, and the start and end positions of the tail entity. The embedding layer is then used to map the input sequence into a sequence of word vector representations. In step 1.2, after the sequence of word vector representations is obtained, it is fed into the encoder to obtain the relation representation of the input sequence: first, the entity-related vector representations, running from the entity's start position to its end position, are max-pooled to obtain the encoded representation of the entity; second, the encoded representations of the head entity and the tail entity are concatenated to form the relation representation.
3. The open-domain relation extraction method based on adaptive clustering of claim 2, wherein: In step 2.2, a two-sided boundary loss constrains the difference between the similarity of the original sample to the negative sample and the similarity of the original sample to the positive sample to lie within a specified range. The loss is computed from the relation representations of the positive and negative samples corresponding to the labeled data, uses an operation that takes the maximum value, and is bounded by an upper bound and a lower bound on the semantic difference.
4. The open-domain relation extraction method based on adaptive clustering as described in claim 3, characterized in that: In step 3.1, positive samples for the labeled and unlabeled datasets are obtained using the positive sample construction strategy of step 2.1.1, and their relation representations are obtained using step 1. In step 3.2, a unified instance-level self-supervised contrastive loss is constructed for the two datasets; it is defined in terms of the relation representation of the positive sample of the current instance, a temperature coefficient, and an indicator function that takes the value 1 if and only if its two arguments are not equal and 0 otherwise. In step 3.3, an instance-level supervised contrastive loss, equation (6), is constructed for the labeled dataset; it is defined over the set of samples in the current batch whose labels are identical to the label of a given sample, together with the sample size of each batch. The two contrastive losses are then unified into a single loss, in which hyperparameters balance the two losses.
5. The open-domain relation extraction method based on adaptive clustering of claim 4, wherein: In step 4.1, for the unlabeled data, a relation repository is constructed whose size is determined by the batch size and the number of unlabeled categories; the repository is a set of queues used to store the positive sample representations of each category. For a positive sample representation with a given current pseudo-label, the positive sample representations stored under the other pseudo-labels are used as its comparison set. After each backpropagation, each new relation representation enters the queue corresponding to its pseudo-label, and the earliest member added to that queue is deleted. In step 4.2, the semantic similarity between the current representation and each relation category is calculated using the relation representations of each category in the relation repository. The current representation is then processed by the clustering decision boundary, i.e., by a mapping followed by a Softmax operation, and each dimension of the resulting vector represents the probability that the current representation belongs to the corresponding relation category. Finally, the cross-entropy between the cluster assignment based on semantic similarity in the feature space and the prediction based on the decision boundary is minimized, and the parameters of the decision boundary are updated; the quantities involved are the relation representation of the unlabeled data, a temperature coefficient, and the parameters of the decision boundary, which map the relation representation to that vector. After each round of training, the labels of the instances in the iteration are updated using maximum likelihood estimation. In step 4.3, samples whose class assignments differ between adjacent training rounds are assigned higher weights in the instance-level contrastive loss; the weight of a relation representation in a given training round is defined in terms of the relation representation score of the sample in that round, the relation representation scores of the samples in the same batch, and an indicator that takes the value 1 if and only if the labels in adjacent rounds are the same. In step 4.4, a classification cross-entropy loss, equation (16), is constructed for the labeled data during cluster training; it is defined in terms of the label corresponding to the current labeled data and the probability output by the classifier. Finally, the overall loss is the combination of the training losses of each module, weighted by hyperparameters.
6. An open-domain relation extraction system based on adaptive clustering for implementing the method of any one of claims 1 to 5, characterized in that it includes a relation encoding module, an adaptive hard sample module, a supervised and self-supervised contrastive learning module, and an adaptive clustering module;
Wherein, the relation encoding module is responsible for encoding relation instances into corresponding relation representations; the adaptive hard sample module is responsible for constructing positive and negative samples and providing upper and lower bounds for the similarity between positive and negative samples; the supervised and self-supervised contrastive learning module is used to perform self-supervised contrastive learning between instances for labeled and unlabeled data and supervised contrastive learning for labeled data, and to update the encoder parameters; the adaptive clustering module is used for updating the clustering decision boundary for unlabeled data and classifying labeled data;
The relation encoding module includes a sample acquisition unit, a preprocessing unit, and an embedding layer unit, wherein: the sample acquisition unit is used to acquire relation instances from the corpus; the preprocessing unit is used to obtain the required data, including sentences, head entities, and tail entities, and is responsible for adding special markers to the sentences; the embedding layer unit is used to convert the processed text information into corresponding word vector sequences and further generate relation representations;
The adaptive hard sample module includes a sample construction unit and a bilateral boundary constraint unit, wherein: the sample construction unit is responsible for obtaining the positive and negative samples corresponding to the relation instances; the bilateral boundary constraint unit provides upper and lower bounds for the difference between the similarity of an instance to its positive sample and the similarity of the instance to its negative sample;
The supervised and self-supervised contrastive learning module includes a self-supervised contrastive unit and a supervised contrastive unit, wherein: the self-supervised contrastive unit constructs a contrastive loss between relation instances for both labeled and unlabeled data; the supervised contrastive unit targets labeled data, where positive samples are data of the same category from the same batch, and negative samples are data of different categories from the same batch;
The adaptive clustering module includes a relation repository unit and an adaptive clustering unit, wherein: the relation repository unit maintains positive samples corresponding to relation instances in queues of different relation categories; the adaptive clustering unit uses the relation repository and decision boundary to classify unlabeled data and uses a classification task to provide class prediction for labeled data;
The connections between the above modules are as follows: the input of the adaptive hard sample module is connected to the output of the relation encoding module; the input of the supervised and self-supervised contrastive learning module is connected to the output of the relation encoding module; the input of the adaptive clustering module is connected to the output of the relation encoding module; the input of the supervised and self-supervised contrastive learning module is connected to the output of the adaptive hard sample module; the input of the adaptive clustering module is connected to the output of the adaptive hard sample module; the input of the supervised and self-supervised contrastive learning module is connected to the output of the adaptive clustering module;
In the relation encoding module, the input of the preprocessing unit is connected to the output of the sample acquisition unit, and the input of the embedding layer unit is connected to the output of the preprocessing unit; in the adaptive hard sample module, the input of the bilateral boundary constraint unit is connected to the output of the sample construction unit; in the supervised and self-supervised contrastive learning module, the input of the supervised contrastive unit is connected to the output of the self-supervised contrastive unit; in the adaptive clustering module, the input of the adaptive clustering unit is connected to the output of the relation repository unit.