Data enhancement method for low-resource news event extraction based on large language model and generative adversarial network

By employing a data augmentation method combining a large language model and generative adversarial networks, the problems of data scarcity and semantic bias in low-resource news event extraction are addressed, generating high-quality augmented training data and significantly improving the accuracy of event extraction.

CN122286232A · Pending · Publication Date: 2026-06-26 · DALIAN UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DALIAN UNIV OF TECH
Filing Date
2026-04-03
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

News event extraction in low-resource settings suffers from data scarcity, which degrades the performance of deep neural network models. In addition, text generated directly by large language models exhibits severe semantic shift and structural hallucination, which reduces the accuracy of event extraction.

Method used

We employ a data augmentation method based on a large language model and generative adversarial networks. Through adversarial training of the generator and discriminator, we generate high-quality augmented text data. We then use the Sentence-BERT model to calculate information entropy and semantic similarity, remove data that does not conform to the news domain, and construct a high-quality augmented training dataset.

Benefits of technology

It significantly improved the accuracy of event extraction tasks, with event trigger word classification jumping from 11.3% to 33.6% and argument classification jumping from 0.41% to 6.82%, achieving a dramatic leap in performance.

✦ Generated by Eureka AI based on patent content.

Figure CN122286232A_ABST

Abstract

This invention discloses a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks. The method acquires original news events and extracts a training dataset from them. The dataset is parsed and separated, processed by a generator and a discriminator, and then combined with a Sentence-BERT model to obtain a high-quality augmented training dataset. Through a generative paradigm, the invention significantly improves the accuracy scores of both trigger word classification and argument classification in the event extraction subtask. A model trained on the augmented training data generated by this invention achieves higher accuracy on the event extraction task than models trained with other methods.

Description

Technical Field

[0001] This invention relates to the field of data augmentation technology, and in particular to a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks.

Background Technology

[0002] Event extraction in the news domain is a fundamental and highly valuable task in natural language processing. It aims to accurately identify event triggers and arguments from massive amounts of unstructured news reports, thereby constructing structured news event graphs or intelligence networks. Although deep neural network models have made significant progress in event extraction, their performance is highly dependent on large-scale, high-quality annotated corpora.

[0003] In real-world news scenarios, event data naturally exhibits a pronounced "long-tail distribution." Specifically, a few routine news event types, such as "high-level meetings" and "financial report releases," dominate, while available training samples for most fine-grained or breaking events are extremely scarce. Furthermore, news texts possess strong timeliness, political seriousness, and domain-specific expertise, leading to extremely high costs for manual annotation and a heavy reliance on specialized knowledge. This further exacerbates the data scarcity problem, making news event extraction a typical low-resource task.

[0004] Finally, generating data directly with large models causes semantic shift and "news hallucination." While large language models excel at generating fluent text, news text demands high levels of objectivity, logical rigor, and domain consistency. Because of their unconstrained open-domain nature, large language models often produce severe semantic shift and structural hallucination when applied directly to low-resource news tasks. This introduces significant out-of-distribution noise into the augmented data and severely pollutes the feature learning space of downstream extraction models.

Summary of the Invention

[0005] Therefore, it is necessary to propose a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks to address the above problems.

[0006] A data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks, the method comprising:

[0007] obtaining original news-domain events and extracting a training dataset from them;

[0008] extracting, through dataset reading and field parsing and separation, a plain-text dataset and the event structure dataset corresponding to that text dataset from the training dataset;

[0009] extracting the first six words from the text dataset as the initial context vector and inputting it to the generator, so that the generator, whose bottom-layer parameters are frozen, outputs generated text in an autoregressive manner;

[0010] inputting the text dataset and the generated text set to the discriminator, which calculates a first probability score that each text data item in the text dataset belongs to the news-domain distribution and a second probability score that each generated text in the generated text set belongs to the news-domain distribution;

[0011] determining a loss function based on the first probability score and the second probability score, and adversarially training the generator and the discriminator based on the loss function to obtain an optimized generator and an optimized discriminator;

[0012] sampling from the text dataset with the optimized generator to generate augmented text data;

[0013] calculating, with the Sentence-BERT model, the information entropy of each event structure data item in the event structure dataset; inputting the event structure data with the highest information entropy, together with the augmented text data, into an autoregressive large language model, which extracts event trigger words from the augmented text data;

[0014] inputting the event trigger words and the augmented text data into the autoregressive large language model, which extracts event arguments from the augmented text data based on its dynamically loaded list of event arguments;

[0015] forming an augmented event structure dataset from the event trigger words and the event arguments;

[0016] calculating, with the optimized discriminator, a first domain semantic similarity score for each sample of the event trigger words and a second domain semantic similarity score for each sample of the event arguments, and removing the samples whose first or second domain semantic similarity score is below the semantic fidelity threshold, to obtain updated event trigger words and updated event arguments;

[0017] when the updated event trigger words and the updated event arguments satisfy a first condition, forming a high-quality augmented training dataset from the updated event trigger words and the updated event arguments.

[0018] In one embodiment, the event structure dataset consists of two parts: the core vocabulary that triggers a specific type of event in the text dataset, i.e., the event trigger words, and the objective entities of the specific type of event in the text data, i.e., the event argument information.

[0019] In one embodiment, the first probability score of each text data belonging to the news domain distribution and the second probability score of each generated text belonging to the news domain distribution are implemented by the following expression:

[0020]

[0021]

[0022] where the symbols denote, respectively: the first probability score; the second probability score; a text data item; a generated text; and the text dataset.

[0023] In one embodiment, the expression for the loss function is as follows:

[0024]

[0025] where the symbols denote, respectively: the loss value; the expectation taken over the distribution of the text dataset; the expectation taken over the generated text set output by the generator; the first probability score; and the second probability score.
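For orientation, a standard adversarial objective that is consistent with the quantities listed above can be written as follows. This is an assumed reconstruction, not the patent's own formula: the symbols $D$ (discriminator), $x$ (a text in the text dataset $X$), $\hat{x}$ (a generated text in the generated text set $\hat{X}$), $p_{\mathrm{real}}$, $p_{\mathrm{fake}}$, and $\mathcal{L}$ are notation introduced here for illustration.

```latex
\begin{aligned}
p_{\mathrm{real}} &= D(x), \quad x \in X \\
p_{\mathrm{fake}} &= D(\hat{x}), \quad \hat{x} \in \hat{X} \\
\mathcal{L} &= \mathbb{E}_{x \sim X}\big[\log p_{\mathrm{real}}\big]
             + \mathbb{E}_{\hat{x} \sim \hat{X}}\big[\log\big(1 - p_{\mathrm{fake}}\big)\big]
\end{aligned}
```

Under this assumed form, the discriminator is trained to increase $\mathcal{L}$ while the generator is trained to decrease it.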

[0026] In one embodiment, the information entropy of the event structure data is achieved through the following expression:

[0027]

[0028] where the symbols denote, respectively: the information entropy of each event structure data item in the event structure dataset; the probability that the semantic feature corresponding to the i-th event structure data item appears in the overall distribution; and the total number of semantic feature categories.
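The quantities described above match the usual Shannon entropy. The expression below is a sketch under that assumption; the symbols $H$, $p_i$, and $N$ are notation chosen here, standing for the information entropy, the probability of the $i$-th semantic feature, and the total number of semantic feature categories.

```latex
H = -\sum_{i=1}^{N} p_i \log p_i,
\qquad \sum_{i=1}^{N} p_i = 1,
\qquad 0 \le H \le \log N
```

The upper bound $\log N$ is reached when all $N$ categories are equally likely, so a higher value indicates a more evenly spread semantic feature distribution.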

[0029] In one embodiment, the step in which the updated event trigger words and the updated event arguments constitute a high-quality augmented training dataset when they satisfy the first condition includes:

[0030] The positions of the updated event trigger words and the updated event arguments are located in the generated augmented text data. If it is determined that the updated event trigger words and the updated event arguments exist in the augmented text data, or that the event type of the updated event trigger words and the updated event arguments exists in the event argument list, then the updated event trigger words and the updated event arguments constitute a high-quality augmented training dataset.

[0031] By adopting a generative paradigm, this invention raises the accuracy score of trigger word classification in the event extraction subtask from 11.3 to 33.6 and the argument classification score from 0.41 to 6.82. A model trained on the augmented training data generated by this method achieves higher accuracy on the event extraction task than models trained with other methods, which demonstrates the advantage of this data augmentation method.

Attached Figure Description

[0032] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0033] In the drawings:

[0034] Figure 1 is an application environment diagram of a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial network in one embodiment.

[0035] Figure 2 is a flowchart of a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial network in one embodiment.

[0036] Figure 3 is a structural block diagram of a computer device in one embodiment.

Detailed Implementation

[0037] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0038] To address the technical problems in the background art, this application provides a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks.

[0039] Figure 1 illustrates the application environment of the data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks in one embodiment. Referring to Figure 1, the method is applied to a corresponding data augmentation system that includes a terminal 110 and a server 120 connected via a network. The terminal 110 can be a desktop terminal or a mobile terminal; the mobile terminal can be at least one of a mobile phone, tablet, or laptop. The server 120 can be a standalone server or a server cluster consisting of multiple servers. The terminal 110 is used to acquire original news-domain events and extract a training dataset from them.
The server 120 is used to extract a plain-text dataset and the event structure dataset corresponding to that text dataset from the training dataset through dataset reading and field parsing and separation; to extract the first six words from the text dataset as the initial context vector and input it to the generator, so that the generator with frozen bottom-layer parameters outputs generated text in an autoregressive manner; to input the text dataset and the generated text set to the discriminator, which calculates a first probability score that each text data item in the text dataset belongs to the news-domain distribution and a second probability score that each generated text in the generated text set belongs to the news-domain distribution; to determine a loss function based on the first and second probability scores and adversarially train the generator and discriminator on it to obtain an optimized generator and an optimized discriminator; to sample from the text dataset with the optimized generator to generate augmented text data; and to calculate, with the Sentence-BERT model, the information entropy of each event structure data item in the event structure dataset. The event structure data with the highest information entropy and the augmented text data are input into an autoregressive large language model, which extracts event trigger words from the augmented text data. The event trigger words and the augmented text data are then input into the autoregressive large language model, which extracts event arguments from the augmented text data based on its dynamically loaded list of event arguments. The event trigger words and event arguments constitute the augmented event structure dataset.
The optimized discriminator then calculates a first domain semantic similarity score for each sample of the event trigger words and a second domain semantic similarity score for each sample of the event arguments. Samples whose first or second domain semantic similarity score is below the semantic fidelity threshold are removed, yielding updated event trigger words and updated event arguments.

[0040] When the updated event trigger words and the updated event arguments satisfy the first condition, they constitute a high-quality augmented training dataset.

[0041] As shown in Figure 2, in one embodiment, a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks is provided. This method can be applied to both terminals and servers; this embodiment illustrates its application to a terminal. The specific steps of the method are as follows:

[0042] S10: Obtain the original news-domain events and extract the training dataset from them;

[0043] S20: Through dataset reading and field parsing and separation, extract the plain-text dataset and the event structure dataset corresponding to that text dataset from the training dataset;

[0044] S30: Extract the first six words from the text dataset as the initial context vector and input it to the generator, so that the generator with frozen bottom-layer parameters outputs generated text in an autoregressive manner;

[0045] S40: Input the text dataset and the generated text set to the discriminator, which calculates the first probability score that each text data item in the text dataset belongs to the news-domain distribution and the second probability score that each generated text in the generated text set belongs to the news-domain distribution;

[0046] S50: Determine the loss function based on the first probability score and the second probability score, and adversarially train the generator and the discriminator based on the loss function to obtain the optimized generator and the optimized discriminator;

[0047] S60: Sample from the text dataset with the optimized generator to generate augmented text data;

[0048] S70: Calculate the information entropy of each event structure data item in the event structure dataset using the Sentence-BERT model; input the event structure data with the highest information entropy, together with the augmented text data, into an autoregressive large language model, which extracts the event trigger words from the augmented text data;

[0049] S80: Input the event trigger words and the augmented text data into the autoregressive large language model, which extracts the event arguments from the augmented text data based on its dynamically loaded list of event arguments;

[0050] S90: The event trigger words and the event arguments constitute the augmented event structure dataset;

[0051] S100: With the optimized discriminator, calculate the first domain semantic similarity score of each sample of the event trigger words and the second domain semantic similarity score of each sample of the event arguments; remove the samples whose first or second domain semantic similarity score is below the semantic fidelity threshold to obtain the updated event trigger words and the updated event arguments;

[0052] S110: When the updated event trigger words and the updated event arguments satisfy the first condition, they constitute the high-quality augmented training dataset.

[0053] In one embodiment, the event structure dataset consists of two parts: the core vocabulary in the text dataset that triggers a specific type of event, i.e., the event trigger words, and the objective entities of that specific type of event in the text data, i.e., the event argument information.

[0054] In one embodiment, the first probability score that each text data item belongs to the news-domain distribution and the second probability score that each generated text belongs to the news-domain distribution are obtained by the following expressions:

[0055]

[0056]

[0057] where the symbols denote, respectively: the first probability score; the second probability score; a text data item; a generated text; and the text dataset.

[0058] In one embodiment, the expression for the loss function is as follows:

[0059]

[0060] where the symbols denote, respectively: the loss value; the expectation taken over the distribution of the text dataset; the expectation taken over the generated text set output by the generator; the first probability score; and the second probability score.

[0061] In one embodiment, the information entropy of the event structure data is obtained by the following expression:

[0062]

[0063] where the symbols denote, respectively: the information entropy of each event structure data item in the event structure dataset; the probability that the semantic feature corresponding to the i-th event structure data item appears in the overall distribution; and the total number of semantic feature categories.

[0064] In one embodiment, the step in which the updated event trigger words and the updated event arguments constitute a high-quality augmented training dataset when they satisfy the first condition includes:

[0065] The positions of the updated event trigger words and the updated event arguments are located in the generated augmented text data. If it is determined that the updated event trigger words and the updated event arguments exist in the augmented text data, or that the event type of the updated event trigger words and the updated event arguments exists in the event argument list, then the updated event trigger words and the updated event arguments constitute a high-quality augmented training dataset.

[0066] The specific embodiments of the present invention are as follows:

[0067] This embodiment provides a data augmentation method for event extraction and re-annotation in the news field. This method completely eliminates the reliance on manually preset templates and fixed syntactic skeletons at the underlying logic level. The specific steps are as follows:

[0068] Step S101: Obtain the news-domain event extraction training dataset to be augmented, together with its text data and the corresponding event structure data.

[0069] Step S102: Construct an adversarial network architecture composed of a generator and a discriminator.

[0070] First, the generator is initialized and some of its parameters are frozen. In this preferred embodiment, the generator uses a pre-trained GPT-2 large language model. To prevent catastrophic forgetting during adversarial training and to preserve the fluency of the underlying syntax, the parameters of the model's bottom network layers are frozen, and only the top-layer parameters are updated by gradients.

[0071] Second, the first few lexical units (e.g., the first six) are cut from the text data and input to the generator as the initial context vector. The generator outputs the probability distribution of the next word in an autoregressive manner, thereby generating discrete text.
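The prefix-cutting step can be sketched as below. This is a minimal illustration, not the embodiment's code: it assumes whitespace tokenization, whereas a GPT-2 generator would normally operate on subword tokens, and the function name `prefix_context` and the sample sentence are hypothetical.

```python
def prefix_context(text: str, prefix_len: int = 6) -> list[str]:
    """Return the first `prefix_len` whitespace-separated tokens of `text`.

    Assumption: whitespace tokenization stands in for the generator's
    real tokenizer. Texts shorter than `prefix_len` are returned whole.
    """
    return text.split()[:prefix_len]


# Hypothetical news sentence used only for illustration.
news = "The central bank raised interest rates by 25 basis points on Tuesday"
prefix = prefix_context(news)
print(prefix)
```

The resulting prefix is what the generator would continue autoregressively; the remainder of the original sentence is not shown to it.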

[0072] Monte Carlo search and hybrid loss function optimization: Because text generation is discrete, gradients cannot be backpropagated directly. This embodiment therefore employs a policy gradient algorithm. When the sequence is not yet fully generated, a rollout strategy with multiple samples is used to approximate the final expected reward.
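The rollout idea can be illustrated with a small, self-contained sketch. Everything here is a stand-in: `toy_next_token` replaces the generator's sampling step, `toy_score` replaces the discriminator, and the vocabulary is invented. Only the averaging of scores over several completions reflects the technique described above.

```python
import random


def rollout_reward(prefix, next_token, score, max_len, n_rollouts=2, seed=0):
    """Estimate the reward of a partial sequence.

    The partial sequence is completed `n_rollouts` times by sampling,
    each completion is scored, and the scores are averaged.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < max_len:
            seq.append(next_token(seq, rng))
        total += score(seq)
    return total / n_rollouts


# Hypothetical stand-ins for the generator and the discriminator.
VOCAB = ["bank", "rates", "market", "lol"]


def toy_next_token(seq, rng):
    return rng.choice(VOCAB)


def toy_score(seq):
    # Pretend "lol" is out-of-domain: more of it means a lower score.
    return 1.0 - seq.count("lol") / len(seq)


r = rollout_reward(["The", "central"], toy_next_token, toy_score, max_len=6)
print(r)
```

In a policy gradient update, this estimate would weight the log-probability of the token chosen at the current step; that update is omitted here.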

[0073] Then, an adaptive discriminator evaluation is performed. The discriminator uses a multi-scale one-dimensional convolutional neural network configured with a set of convolution kernels of different sizes and a set number of filters. The discriminator outputs a probability score that a generated text sequence belongs to the real domain distribution. With the generator learning rate, the discriminator learning rate, and the batch size set, alternating adversarial training is performed, and the trained generator is finally used to generate the augmented data of the text data.

[0074] The loss function used during training is defined by the formula below, in which the symbols denote, respectively: the loss value; the expectation over the text data; the expectation over the generated text; the probability score, output by the discriminator, that a text data item belongs to the news domain; and the probability score, output by the discriminator, that a generated text belongs to the news domain.

[0075]
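A numeric sketch of how such a loss could be evaluated from discriminator scores is given below. It assumes the standard binary cross-entropy form of the adversarial loss, which may differ from the formula in the original filing; the function name `discriminator_loss` and the sample scores are hypothetical.

```python
import math


def discriminator_loss(real_scores, fake_scores, eps=1e-12):
    """Assumed standard form: -(E[log p_real] + E[log(1 - p_fake)]).

    `real_scores` are discriminator outputs on real news text,
    `fake_scores` are outputs on generated text; both lie in [0, 1].
    `eps` guards against log(0).
    """
    real_term = sum(math.log(p + eps) for p in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - p + eps) for p in fake_scores) / len(fake_scores)
    return -(real_term + fake_term)


# A discriminator that separates the two sets well ...
good = discriminator_loss([0.9, 0.8], [0.2, 0.1])
# ... versus one that cannot tell them apart.
confused = discriminator_loss([0.5], [0.5])
print(good, confused)
```

A lower value means the discriminator separates real from generated text more cleanly; the generator is trained to push this value back up.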

[0076] Step S103: Obtain the augmented text data output in step S102. A large language model with billions of parameters (such as Llama-3.1-8B-Instruct) is introduced as an automated annotation tool that requires zero manual intervention.

[0077] The Sentence-BERT model is used to calculate the information entropy of the event structure data, which serves as auxiliary information to help the large language model annotate the event structure.
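One way such an entropy-based selection could work is sketched below. The scoring recipe is an assumption pieced together from the description (cosine similarity between embeddings, temperature-scaled softmax, Shannon entropy); the toy two-dimensional vectors stand in for real Sentence-BERT embeddings, and all function names and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def entropy_scores(embeddings, temperature=0.1):
    """Entropy of each item's softmax-normalized similarity to the others.

    Requires at least two embeddings.
    """
    scores = []
    for i, u in enumerate(embeddings):
        logits = [cosine(u, v) / temperature
                  for j, v in enumerate(embeddings) if j != i]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        scores.append(-sum(p * math.log(p) for p in probs if p > 0))
    return scores


def top_k_by_entropy(items, embeddings, k):
    """Return the k items whose similarity distribution has highest entropy."""
    scores = entropy_scores(embeddings)
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:k]]


# Toy embeddings (assumption): item "a" is equally similar to "b" and "c".
items = ["a", "b", "c"]
embs = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(top_k_by_entropy(items, embs, k=1))
```

The selected items would then be passed to the large language model as demonstrations alongside the augmented text.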

[0078] Here, a specific JSON-formatted prompt is used to guide the large model. Phase 1: the large language model outputs event trigger words. Phase 2: the event trigger words predicted in Phase 1 are injected into a new prompt, and the large language model predicts event argument information based on a dynamically loaded ontology role list, generating a preliminary re-annotated dataset of event trigger words and event arguments.
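The two-phase prompting can be sketched as two small prompt builders. The JSON field names, the task labels, and the example ontology are all hypothetical; the source only states that the prompts are JSON-formatted and that Phase 2 receives the Phase 1 trigger plus a role list loaded per event type. No model call is made here.

```python
import json


def trigger_prompt(text, demonstrations):
    """Phase 1: ask for the event trigger (field names are assumptions)."""
    return json.dumps({
        "task": "extract_event_trigger",
        "demonstrations": demonstrations,
        "text": text,
        "output_format": {"trigger": "string", "event_type": "string"},
    }, ensure_ascii=False)


def argument_prompt(text, trigger, event_type, role_ontology):
    """Phase 2: inject the predicted trigger and load roles for its type."""
    roles = role_ontology.get(event_type, [])
    return json.dumps({
        "task": "extract_event_arguments",
        "text": text,
        "trigger": trigger,
        "event_type": event_type,
        "allowed_roles": roles,
        "output_format": {"arguments": [{"role": "string", "span": "string"}]},
    }, ensure_ascii=False)


# Hypothetical ontology and sentence for illustration.
ontology = {"Acquisition": ["Buyer", "Target"]}
text = "Acme Corp acquired Beta Ltd on Monday"
p1 = trigger_prompt(text, demonstrations=[])
p2 = argument_prompt(text, "acquired", "Acquisition", ontology)
print(p2)
```

Restricting Phase 2 to the roles of the predicted event type keeps the model from inventing roles outside the ontology.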

[0079] Step S104: Dual-channel quality verification and data distillation (filtering stage)

[0080] To eliminate the "structural hallucination" inherent in large language models and the "semantic drift" introduced by the generator, a two-channel filtering method is implemented:

[0081] Channel 1 reuses the discriminator trained to convergence in step S102 to calculate a domain similarity score for each sample of the re-annotated event trigger words and event arguments. A semantic fidelity threshold is set; when a sample's score is below the threshold, the sample is discarded directly.
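Channel 1 amounts to a score-and-threshold filter, sketched below. The scores and the threshold value of 0.5 are invented for illustration, and `score_fn` stands in for the trained discriminator.

```python
def filter_by_fidelity(samples, score_fn, threshold):
    """Keep only samples whose domain similarity score reaches the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]


# Hypothetical discriminator scores for three re-annotated samples.
toy_scores = {"sample_a": 0.91, "sample_b": 0.32, "sample_c": 0.50}
kept = filter_by_fidelity(list(toy_scores), toy_scores.get, threshold=0.5)
print(kept)
```

Samples that score exactly at the threshold are kept in this sketch, since the description only discards scores below it.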

[0082] Channel 2 performs a strict span validity check against the character set of the generated sentence and the set of entity spans extracted by the model. If an extracted span does not appear in the generated sentence, or if the predicted event_type does not exist in the preset event ontology mapping, the sample is judged to be hallucinated data and is removed.
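A minimal sketch of the Channel 2 check follows. It assumes spans are compared by exact substring match and that the ontology is a mapping from event type to role list; the function name, the argument layout, and the example data are hypothetical.

```python
def is_valid_sample(sentence, trigger, arguments, event_type, ontology):
    """Reject samples with unknown event types or spans absent from the text.

    `arguments` is a list of (role, span) pairs; only the spans are checked.
    """
    if event_type not in ontology:
        return False
    if trigger not in sentence:
        return False
    return all(span in sentence for _role, span in arguments)


# Hypothetical ontology and generated sentence.
ontology = {"Acquisition": ["Buyer", "Target"]}
sentence = "Acme Corp acquired Beta Ltd on Monday"

ok = is_valid_sample(sentence, "acquired",
                     [("Buyer", "Acme Corp"), ("Target", "Beta Ltd")],
                     "Acquisition", ontology)
bad_span = is_valid_sample(sentence, "acquired",
                           [("Target", "Gamma Inc")],
                           "Acquisition", ontology)
bad_type = is_valid_sample(sentence, "acquired", [], "Merger", ontology)
print(ok, bad_span, bad_type)
```

Exact substring matching is deliberately strict: a span that the model paraphrased or invented fails the check even if it is semantically plausible.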

[0083] After filtering, the final high-quality augmented event extraction dataset is obtained. It is mixed with the original dataset and input into a downstream event extraction network (such as BERT-Large) for supervised training.

[0084] Embodiment 2: A data augmentation system for news event extraction

[0085] This embodiment provides a data augmentation system for event extraction. The functional modules within this system logically correspond one-to-one with the method steps in Embodiment 1, including:

[0086] Data preprocessing and feature mapping module: used to receive low-resource events in the target domain, extract initial text, perform lexical analysis and vectorization, and construct the initial tensor space.

[0087] Adversarial domain distribution alignment module: It embeds a generator engine with a partial parameter freezing mechanism and an adaptive multi-scale convolutional discriminator engine to perform prefix-guided text sampling and Monte Carlo policy gradient backpropagation.

[0088] Information entropy-driven large model scheduling module: It has a built-in vector retrieval engine and large model inference API interface, and is responsible for calculating the cosine similarity of high-dimensional features and the information entropy distribution after temperature scaling, and arranging two-stage instruction prompt scheduling.

[0089] Dual-channel data cleaning and distillation module: Couples an adversarial model discriminator with a hard-coded ontology rule validator to achieve confidence interception and span consistency verification.

[0090] Embodiment 3: A computer device suitable for the above method (core hardware level decomposition)

[0091] This embodiment provides a computer device that serves as the underlying hardware carrier supporting the complex large language models (such as the GPT-2 Large pre-trained model and the Llama-3.1-8B-Instruct inference model) and the high-intensity matrix operations of generative adversarial networks in Embodiment 1.

[0092] This computer device is suitable not only for large-scale AI training clusters in the cloud, but also for high-performance computing workstations at the edge. Specifically, the computer device includes a processor, memory, communication interfaces, and a communication bus.

[0093] Processor architecture:

[0094] The processor includes a central processing unit (CPU, such as a high-performance multi-core processor with an x86 or ARM architecture) for handling basic logic and operating system scheduling and, more importantly, a tensor computing accelerator (AI accelerator / GPU).

[0095] Since the method of the present invention involves large-scale language model inference and deep adversarial training with up to billions of parameters (such as 8B parameters), the tensor computation accelerator is embedded with a large number of computational cores (such as CUDA cores) and tensor cores specifically designed to accelerate matrix multiplication.

[0096] During the adversarial training in step S102, the generator's multi-head attention matrices and feedforward network weight matrices and the discriminator's multi-scale convolution kernel weights are all distributed in parallel to different computing units of the tensor computing accelerator for high-throughput forward propagation and backward gradient computation. During the large-model re-annotation in step S103, the processor uses KV cache technology to reuse the attention state of the context in the accelerator's GPU memory, which greatly reduces the computation incurred by repeatedly inputting the demonstration examples selected by information entropy.

[0097] Memory hierarchy:

[0098] The memory is divided into volatile memory (such as high-bandwidth video memory HBM and system memory RAM) and non-volatile memory (such as high-speed NVMe solid-state drives).

[0099] The high-speed NVMe hard disk permanently stores the initial low-resource dataset, ontology rule base, and unloaded GPT-2 generator model weights (such as .bin or .safetensors format) and large language model weights.

[0100] The high-bandwidth video memory is the "core working area" during method execution. When the computer device executes the method of this application, the processor loads the generator, the discriminator, and the large language model (such as the Llama-3.1 model parameters) from the hard disk into the high-bandwidth video memory via the communication bus, either in full precision (such as FP32) or in reduced or quantized precision (such as FP16 or INT8, to save space).
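The space saving from lower precision can be estimated with simple arithmetic. The sketch below counts weight storage only for a nominal 8-billion-parameter model; it ignores activations, the KV cache, and optimizer state, and the helper name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_memory_gib(n_params, precision):
    """Approximate weight storage in GiB (weights only)."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30


for precision in ("FP32", "FP16", "INT8"):
    print(precision, round(weight_memory_gib(8e9, precision), 1), "GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why reduced precision is attractive when several models must share the same video memory.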

[0101] Communication interface and bus mechanism:

[0102] The communication bus (e.g., PCIe Gen4 or Gen5 protocol bus) enables high-speed data exchange between the CPU, tensor computation accelerator, and memory. The communication interface is responsible for data interaction with external systems (such as data annotation platforms or distributed storage clusters), receiving the initial input event dataset, and outputting the final high-quality enhanced event extraction dataset obtained after dual quality filtering (step S104).

[0103] In summary, through hardware collaboration, this computer device can efficiently and stably support the computationally intensive generation, re-annotation, and adversarial verification stages in this invention by calling the computer program (instruction sequence) embedded in the memory, making this method highly feasible for industrial-grade implementation.

[0104] Embodiment 4: Computer-readable storage medium

[0105] This embodiment provides a non-volatile or volatile computer-readable storage medium on which a computer program / instruction sequence is stored. When this computer program / instruction sequence is executed by the processor (especially the collaborative processing of CPU and GPU) of the computer device in Embodiment 3 above, the computer device is able to implement all the steps of the event-oriented extraction generation and relabeling data augmentation method described in Embodiment 1. The storage medium may include various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory, a random access memory, a magnetic disk, or an optical disk.

[0106] Example 5: Model Training and Hyperparameter Configuration

[0107] To ensure that the data augmentation method for news event extraction described in this invention can be stably reproduced on electronic devices and achieve the best feature manifold alignment effect, this embodiment provides a preferred underlying hyperparameter configuration scheme.

[0108] 1. Network architecture parameter settings

[0109] Generator backbone: A pre-trained GPT-2 large language model is preferred, with a maximum sequence length of 256. To ensure generation diversity and quality, decoding combines a Top-k sampling strategy (assuming …) with Top-p (nucleus) sampling (assuming …). The prefix length is set to 6 lexical units (tokens).
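The combined decoding strategy can be sketched as follows. This is a minimal pure-Python illustration, not the implementation of this embodiment: the function name `top_k_top_p_sample`, the toy logits, and the values `k=3` and `p=0.9` are all hypothetical, and renormalizing after the Top-k cut is one of several possible conventions.

```python
import math
import random


def top_k_top_p_sample(logits, k, p, rng):
    """Sample one token id after a Top-k filter followed by a Top-p (nucleus) filter."""
    # Softmax over the full vocabulary (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable token ids, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in ranked)

    # Top-p: keep the shortest prefix whose renormalized cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= p:
            break

    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.5, 0.3, -1.0, -2.0]  # toy 5-token vocabulary (hypothetical)
# k=3 and p=0.9 are illustrative values only; the source values are not given here.
samples = [top_k_top_p_sample(toy_logits, k=3, p=0.9, rng=rng) for _ in range(200)]
greedy_like = top_k_top_p_sample(toy_logits, k=3, p=0.5, rng=rng)
```

With these toy values, only the three most probable tokens can ever be drawn, and lowering `p` to 0.5 leaves only the single most probable token, which shows how the two filters trade diversity against quality.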

[0110] Discriminator backbone: An adaptive multi-scale one-dimensional convolutional neural network is used, with the set of convolution kernel sizes set as follows: …. The number of filters at each kernel size is set to 64, to capture the multi-dimensional local n-gram features of the news text.
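How a multi-scale one-dimensional convolution captures n-gram features can be sketched in pure Python. The kernel widths `(2, 3, 4)`, the embedding dimension of 8, and the random weights below are hypothetical placeholders; only the 64 filters per kernel size and the sequence length of 256 come from the text. A real discriminator would additionally apply a nonlinearity and a classification layer.

```python
import random


def conv1d_max_pool(embeddings, kernel, width):
    """Slide one filter of the given width over the sequence; return the max response."""
    # embeddings: list of token vectors; kernel: `width` weight vectors of the same dimension.
    responses = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        responses.append(sum(w * x
                             for w_vec, x_vec in zip(kernel, window)
                             for w, x in zip(w_vec, x_vec)))
    return max(responses)


def multi_scale_features(embeddings, filters_by_width):
    """Concatenate the max-pooled responses of every filter at every kernel width."""
    return [conv1d_max_pool(embeddings, kernel, width)
            for width, kernels in filters_by_width.items()
            for kernel in kernels]


rng = random.Random(0)
EMBED_DIM, SEQ_LEN = 8, 256   # EMBED_DIM is hypothetical; SEQ_LEN follows [0109]
KERNEL_WIDTHS = (2, 3, 4)     # hypothetical; the kernel-size set is not given in the text
FILTERS_PER_WIDTH = 64        # from [0110]

tokens = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(SEQ_LEN)]
filters = {w: [[[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(w)]
               for _ in range(FILTERS_PER_WIDTH)]
           for w in KERNEL_WIDTHS}
features = multi_scale_features(tokens, filters)
```

Each kernel width responds to n-grams of that length, so the concatenated feature vector has one entry per filter per width (here 3 × 64 = 192 entries) regardless of the input sequence length.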

[0111] 2. Training and Optimizer Parameter Settings

[0112] Adversarial training mechanism: Maximum likelihood estimation (MLE) pre-training is run for 100 epochs, followed by 10 epochs of policy-gradient adversarial training. During policy-gradient training, the Monte Carlo search uses 2 rollout branches; the discriminator is updated by 1 step for every 1-step update of the generator. The weight of the pre-training loss term in the adversarial loss function is set to 1.0.
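The role of the 2 Monte Carlo branches can be sketched as follows, assuming a SeqGAN-style setup in which the reward of a partial sequence is the discriminator score averaged over sampled completions. The text does not name this scheme, and the toy generator, the toy discriminator, and all names below are hypothetical stand-ins.

```python
import random

MAX_LEN = 8  # toy sequence length, for the demo only


def make_toy_generator(rng, vocab):
    """Return a stand-in for the generator's 'complete this prefix' step."""
    def complete(prefix):
        # Pad the prefix with randomly sampled tokens up to MAX_LEN.
        return prefix + [rng.choice(vocab) for _ in range(MAX_LEN - len(prefix))]
    return complete


def toy_discriminator(sequence):
    """Stand-in for the discriminator score: fraction of tokens equal to 'news'."""
    return sum(tok == "news" for tok in sequence) / len(sequence)


def rollout_reward(prefix, complete, score, num_rollouts=2):
    """Average the discriminator score over `num_rollouts` sampled completions."""
    return sum(score(complete(prefix)) for _ in range(num_rollouts)) / num_rollouts


rng = random.Random(0)
complete = make_toy_generator(rng, vocab=["news", "other"])
partial_reward = rollout_reward(["news", "news"], complete, toy_discriminator)
full_reward = rollout_reward(["news"] * MAX_LEN, complete, toy_discriminator)
```

A small number of rollouts keeps the cost of each policy-gradient step low at the price of a noisier reward estimate, which is one plausible reason for choosing 2 branches.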

[0113] Optimization algorithm and learning rate: Both the generator and the discriminator use the AdamW optimizer for gradient descent. To maintain the balance of the adversarial game, the learning rate of the discriminator (…) is set slightly higher than that of the generator. The batch size during training is set to 64, and the number of gradient accumulation steps is set to 2.
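For reproduction purposes, the settings of this embodiment could be collected in a single configuration object, as in the sketch below. The class name `TrainingConfig` and its field names are hypothetical; the learning rates are left unset because their values are not given in the text, and only the constraint that the discriminator rate exceeds the generator rate is encoded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters stated in this embodiment (field names are illustrative)."""
    max_seq_len: int = 256
    prefix_len: int = 6
    filters_per_kernel_size: int = 64
    mle_pretrain_epochs: int = 100
    adversarial_epochs: int = 10
    mc_rollouts: int = 2
    d_steps_per_g_step: int = 1
    pretrain_loss_weight: float = 1.0
    optimizer: str = "AdamW"
    batch_size: int = 64
    grad_accum_steps: int = 2
    # Learning-rate values are not given in the text, so they stay unset here.
    generator_lr: Optional[float] = None
    discriminator_lr: Optional[float] = None

    def __post_init__(self):
        # Encode the stated constraint: discriminator rate above generator rate.
        if self.generator_lr is not None and self.discriminator_lr is not None:
            if self.discriminator_lr <= self.generator_lr:
                raise ValueError("discriminator_lr should exceed generator_lr")

    @property
    def effective_batch_size(self) -> int:
        """Samples contributing to each optimizer step."""
        return self.batch_size * self.grad_accum_steps


config = TrainingConfig()
print(config.effective_batch_size)
```

With a batch size of 64 and 2 accumulation steps, each optimizer step aggregates gradients from 128 samples.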

[0114] Example 6: Experimental Data and Verification of Beneficial Effects

[0115] To verify the actual technical effect of the data augmentation method provided by this invention, this embodiment uses a standard event extraction benchmark dataset containing a large amount of news corpus for comparative verification experiments. Limited training sets ranging from extremely low-resource (1% of the data volume) to medium-resource (30% of the data volume) are constructed through random sampling.
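Constructing such limited training sets can be sketched as below. The function name `sample_low_resource_subset`, the fixed seed, and the placeholder corpus of 1,000 examples are assumptions made for illustration; the text specifies only random sampling at 1% to 30% of the data volume.

```python
import random


def sample_low_resource_subset(examples, fraction, seed=0):
    """Draw a random subset containing roughly `fraction` of the examples."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    n = max(1, round(len(examples) * fraction))    # keep at least one example
    return rng.sample(examples, n)


# Placeholder corpus; a real run would use the benchmark's training split.
full_train = [f"news_event_{i}" for i in range(1000)]

subset_1pct = sample_low_resource_subset(full_train, 0.01)
subset_30pct = sample_low_resource_subset(full_train, 0.30)
print(len(subset_1pct), len(subset_30pct))
```

Fixing the seed keeps the subsets identical across the compared augmentation methods, so that differences in accuracy are not caused by different training samples.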

[0116] Tables 1 and 2 below compare the accuracy scores of the downstream deep event extraction model (based on BERT-Large) on the event extraction task under four training settings: (1) no augmented data; (2) augmented data generated by traditional data augmentation methods (synonym replacement, back translation, mask filling) and mixed with the original data; (3) augmented data generated directly by a large model and mixed with the original data; and (4) augmented data generated by the augmentation method described in this invention and mixed with the original data.

[0117] Table 1: Accuracy Comparison of Event Extraction Subtasks (Event Trigger Word Recognition and Trigger Word Classification)

[0118]

[0119] Table 2: Accuracy Comparison of Event Extraction Subtasks (Argument Recognition and Argument Classification)

[0120]

[0121] Explanation of beneficial effects:

[0122] As shown in Tables 1 and 2, in news extraction scenarios with extremely limited data (1% data volume), traditional local replacement and mask-filling methods yield only slight improvements because they cannot synthesize new news syntactic structures. In contrast, the method of this invention, through a generative paradigm, raises the accuracy of trigger word classification in the event extraction subtask from 11.3 to 33.6 (an absolute improvement of 22.3 percentage points) and the accuracy of argument classification from 0.41 to 6.82 (an absolute improvement of 6.41 percentage points). A model trained on the augmented training data generated by the data augmentation method of this invention ultimately achieves higher accuracy in the event extraction task than the other methods, demonstrating the superiority of this data augmentation method.
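The distinction between an absolute gain (in percentage points) and a relative gain (in percent of the baseline) can be checked with a short calculation on the scores quoted above. The helper name `improvement` is hypothetical, and the relative figures are derived here for illustration only; they are not reported in the text.

```python
def improvement(before: float, after: float):
    """Return (absolute gain in points, relative gain in percent of the baseline)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 2), round(relative, 1)


# Scores quoted in [0122] for the 1% data setting.
trigger_abs, trigger_rel = improvement(11.3, 33.6)
argument_abs, argument_rel = improvement(0.41, 6.82)

print(f"trigger classification: +{trigger_abs} points ({trigger_rel}% relative)")
print(f"argument classification: +{argument_abs} points ({argument_rel}% relative)")
```

The argument classification gain is small in absolute terms but large relative to its very low baseline, which is why both views are worth reporting.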

[0123] Figure 3 shows an internal structural diagram of a computer device in one embodiment. This computer device can specifically be a terminal or a server. As shown in Figure 3, the computer device includes a processor, memory, and a network interface connected via a system bus. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system and may also store a computer program. When executed by the processor, this computer program enables the processor to implement a data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks. The internal memory may also store a computer program, which, when executed by the processor, enables the processor to implement the same data augmentation method. Those skilled in the art will understand that the structure shown in Figure 3 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0124] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0125] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0126] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks, characterized in that the method includes:
obtaining original news domain events, and extracting a training dataset from the obtained original news domain events;
extracting a plain text dataset and the corresponding event structure dataset from the training dataset by separating dataset reading from field parsing;
extracting the first six words from the text dataset and using them as the initial context vector input to the generator, so that the generator, with its underlying parameters frozen, outputs generated text in an autoregressive manner;
outputting the text dataset and the generated text set to the discriminator, which calculates the first probability score of each text data in the text dataset belonging to the news domain distribution and the second probability score of each generated text in the generated text set belonging to the news domain distribution, respectively;
determining the loss function based on the first probability score and the second probability score, and performing adversarial training on the generator and the discriminator based on the loss function to obtain the trained optimized generator and optimized discriminator;
sampling the text dataset with the optimized generator to generate augmented text data;
calculating, using the Sentence-BERT model, the information entropy of each event structure data in the event structure dataset, the event structure data with the highest information entropy then being selected;
inputting the event structure data and the augmented text data into the autoregressive large language model, which extracts the event trigger words from the augmented text data;
inputting the event trigger words and the augmented text data into the autoregressive large language model, which extracts event arguments from the augmented text data based on its dynamically loaded list of event arguments, the event trigger words and the event arguments constituting an enhanced event structure dataset;
calculating, using the optimized discriminator, the first domain semantic similarity score of each sample in the event trigger words, then calculating the second domain semantic similarity score of each sample in the event arguments, and removing the samples whose first domain semantic similarity score and second domain semantic similarity score are lower than the semantic fidelity threshold, to obtain the updated event trigger word and the updated event argument; and
when the updated event trigger word and the updated event argument satisfy the first condition, constituting a high-quality augmented training dataset from the updated event trigger word and the updated event argument.

2. The data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks according to claim 1, characterized in that, The event structure dataset consists of two parts: the core words that trigger specific types of events in the text dataset, namely event trigger words, and the objective entities of specific types of events in the text data, namely event argument information.

3. The data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks according to claim 1, characterized in that the first probability score for each text data belonging to the news domain distribution and the second probability score for each generated text belonging to the news domain distribution are obtained by the following expressions: … where … is the first probability score; … is the second probability score; … is the text data; … is the generated text; and … is the text dataset.

4. The data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks according to claim 1, characterized in that the expression for the loss function is as follows: … where … is the loss value; … denotes the expectation over the distribution of the text dataset …; … denotes the expectation over the generated text set … output by the generator; … denotes the first probability score; and … denotes the second probability score.

5. The data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks according to claim 1, characterized in that the information entropy of the event structure data is obtained through the following expression: … where … represents the information entropy of each event structure data in the event structure dataset; … represents the probability that the semantic features corresponding to the …-th event structure data in the event structure dataset appear in the overall distribution; and … represents the total number of categories of semantic features.

6. The data augmentation method for low-resource news event extraction based on a large language model and generative adversarial networks according to claim 1, characterized in that constituting the high-quality augmented training dataset from the updated event trigger word and the updated event argument when they satisfy the first condition includes: determining, in the generated augmented text data, the position information of the updated event trigger word and the updated event argument; and if it is determined that the updated event trigger word and the updated event argument exist in the augmented text data, or that the event type of the updated event trigger word and the updated event argument exists in the event argument list, then the updated event trigger word and the updated event argument constitute a high-quality augmented training dataset.