A data augmentation method for multi-attribute word extraction
By constructing new training samples based on text fragment annotation and label datasets in the attribute word extraction task, the problems of scarce labeled data and semantic destruction in existing technologies are solved, thereby improving the robustness and generalization ability of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SICHUAN JIUZHOU ELECTRIC GROUP CO LTD
- Filing Date
- 2023-06-05
- Publication Date
- 2026-06-26
AI Technical Summary
Existing data augmentation methods tend to corrupt sentence semantics in attribute word extraction tasks and ignore the potential of labeled data, making it difficult to effectively solve the problem of scarce labeled data.
By annotating attribute words based on text fragments, a multi-attribute word label dataset is constructed. New training samples are generated from the label set, and combined with the original sample sentences, new training data is generated, thereby enhancing the robustness and generalization ability of the model.
It reduces the cost of manual annotation, expands the training data, enhances the robustness and generalization ability of the model, solves the problem of insufficient labeled data in multi-attribute word extraction tasks, and achieves simple and effective data augmentation.
Figure CN116663554B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing, and more specifically to a data augmentation method for extracting multiple attribute words.
Background Technology
[0002] The statements in this section are provided only as background information in connection with this disclosure and may not constitute prior art.
[0003] In machine learning, supervised learning has been widely used due to its excellent performance. However, model training in supervised learning requires sufficient labeled data, which typically consumes significant human and financial resources, resulting in generally small-scale labeled datasets. This makes it difficult for models to learn enough features from the labeled data, thus limiting the improvement of model performance. To alleviate the problem of insufficient labeled data in supervised learning and provide models with more training data, researchers have proposed data augmentation techniques. These techniques expand the original training dataset by creating or synthesizing new data based on the existing dataset. Data augmentation techniques help enhance the robustness of models, prevent overfitting, and improve generalization ability, and have been widely used in fields such as computer vision and natural language processing. In the field of natural language processing, the data augmentation techniques currently used can be mainly divided into five categories: (1) Data augmentation methods based on word substitution, such as using the WordNet dictionary for synonym substitution and back translation of different languages to generate new training data; (2) Data augmentation methods based on generation, such as using deep learning models to generate new training data; (3) Data augmentation methods based on sample sampling, including upsampling of minority class data and downsampling of majority class data in the dataset; (4) Data augmentation methods based on noise injection, which perform random insertion, random swapping and random deletion operations on the original training data; (5) Data augmentation methods based on pre-trained language models, which use high-performance pre-trained language models to generate new training data.
[0004] Attribute word extraction is an important subtask in opinion mining within natural language processing. It involves extracting attribute words from text sentences; these words are entity words appearing in the sentence, consisting of single words or phrases. For example, in a review of a laptop, such as "This laptop has good performance, but the screen is a bit large and it's a bit heavy," the manufacturer or consumer wants to understand which specific attributes of the laptop the review evaluates. The first step is to automatically extract all relevant attributes from the review text, such as "performance," "screen," and "weight." Currently, attribute word extraction is primarily based on supervised learning methods. Although significant progress has been made thanks to deep learning technology, its performance is often limited by finite labeled data. Therefore, researchers typically introduce data augmentation techniques to improve the performance of attribute word extraction.
[0005] However, the above data augmentation methods still have the following shortcomings:
[0006] (1) Data augmentation methods based on word replacement may replace the information to be extracted in the text sentence samples, thereby destroying the semantics of that information. Preventing this requires adding an information protection mechanism to the data augmentation process, which increases the complexity of the information extraction task.
(2) Sentence samples produced by generation-based data augmentation methods may change the attribute word information and labels in the sentence, which makes them unsuitable for information extraction tasks.
(3) Data augmentation methods based on sample sampling are hard to apply to opinion mining datasets, because in attribute word extraction datasets the text sentence samples carry no category labels; only the attribute words in the sentence carry fine-grained labels (for example, under the BIO labeling method).
(4) Data augmentation methods based on noise injection may cause the positions of attribute words and their labels to become inconsistent when symbols are randomly inserted into the original training data or words in the sentence are randomly swapped and deleted.
(5) Data augmentation methods based on pre-trained language models can produce good results, but they are inefficient and costly, and are not commonly used in attribute word extraction tasks.
[0007] In summary, the above data augmentation methods may alter the word order, attribute words, and their label information in sentences, easily disrupting the overall semantics of phrase-type attribute words and failing to effectively address the problem of scarce attribute word annotation data.
[0008] Furthermore, the aforementioned data augmentation methods focus on augmenting the text data in the samples, neglecting the potential for data augmentation through labels.
Summary of the Invention
[0009] The purpose of this invention is to provide a data augmentation method for multi-attribute word extraction that addresses the above problems in the prior art: it alleviates the scarcity of labeled data in supervised learning and improves model performance in multi-attribute word extraction application scenarios.
[0010] The technical solution of the present invention is as follows:
[0011] A data augmentation method for multi-attribute word extraction includes:
[0012] Step S1: For the original training sample sentence containing multiple attribute words, perform attribute word annotation based on the text fragment;
[0013] Step S2: Construct a multi-attribute word label dataset based on the annotation results;
[0014] Step S3: Combine each subset of the multi-attribute word label dataset with the original sample sentences to generate new training samples.
[0015] Further, step S1 includes:
[0016] Treat the text segment corresponding to each attribute word as a whole, and mark the start and end positions of the corresponding text segment.
[0017] Further, step S2 includes:
[0018] Step S21: Based on the annotation results, generate a list of attribute words and its corresponding set of label pairs;
[0019] Step S22: Based on the list of attribute words and the set of label pairs, construct a set $G_{label}$ of non-empty proper subsets.
[0020] Further, step S22 includes:
[0021] From the set of label pairs corresponding to the attribute word list, take any number of label pairs to form each subset of the set $G_{label}$.
[0022] Further, step S3 includes:
[0023] Each subset of the set $G_{label}$ is combined with the sentence of the original training sample to generate new training samples.
[0024] Further, step S21 includes:
[0025] Attribute word list $A = \{a_1, a_2, \ldots, a_n\}$, where $a$ represents an attribute word and $n$ represents the total number of attribute words;
[0026] After each attribute word in the attribute word list $A = \{a_1, a_2, \ldots, a_n\}$ is labeled, a set of start and end position label pairs $B = \{(s_1, e_1), (s_2, e_2), \ldots, (s_n, e_n)\}$ is generated, where $(s, e)$ represents a label pair, $s$ represents the start position of the text segment corresponding to the attribute word, and $e$ represents the end position of the text segment corresponding to the attribute word.
[0027] Further, step S22 includes:
[0028] The set of non-empty proper subsets is $G_{label} = \{G_1, G_2, \ldots, G_m\}$, where $G_m$ represents the set consisting of the combinations of $m$ label pairs taken from the label pair set $B$.
[0029] Furthermore, the total number T of new training samples that can be generated from the original training sample sentence containing multiple attribute words is calculated as follows:
[0030]
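As a worked derivation under the definitions above, and not a reproduction of the patent's own formula: assuming $G_{label}$ contains every non-empty proper subset of the label pair set $B$ with $|B| = n$, and that $G_m$ collects all $\binom{n}{m}$ combinations of $m$ label pairs for $m = 1, \ldots, n-1$, the count of new samples per sentence would be:

```latex
T \;=\; \sum_{m=1}^{n-1} \lvert G_m \rvert
  \;=\; \sum_{m=1}^{n-1} \binom{n}{m}
  \;=\; 2^{n} - \binom{n}{0} - \binom{n}{n}
  \;=\; 2^{n} - 2
```

For a sentence with $n = 3$ attribute words this gives $T = 6$; for $n = 2$ it gives $T = 2$.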
[0031] Furthermore, it also includes:
[0032] Adding new training samples to the original training set achieves data augmentation.
[0033] Compared with existing technologies, the advantages of this invention are:
[0034] 1. The data augmentation method for multi-attribute word extraction comprises: Step S1: for the original training sample sentence containing multiple attribute words, attribute words are labeled based on text fragments; Step S2: a multi-attribute word label dataset is constructed based on the labeling results; Step S3: each subset in the multi-attribute word label dataset is combined with the original sample sentence to generate new training samples. This invention reduces the cost of manual data labeling and alleviates the problem of insufficient labeled data. For multi-attribute extraction application scenarios, it provides a data augmentation method based on text fragment labels that synthesizes new training samples and expands the training data. The augmented training dataset enables the model to learn more features from the data, which enhances robustness, helps prevent overfitting, and improves generalization ability.
[0035] 2. The technical solution is simple and effective and requires no additional tools or resources. It solves some existing technical problems in multi-attribute word extraction and has practical significance and application value. It can be applied not only to multi-attribute word extraction tasks in natural language processing but also to tasks such as multi-named entity recognition. Similar data augmentation methods can also be designed for the field of computer vision.
Attached Figure Description
[0036] Figure 1 is a schematic diagram of the data augmentation process for multi-attribute word extraction.
Detailed Implementation
[0037] It should be noted that relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0038] The features and performance of the present invention will be further described in detail below with reference to embodiments.
[0039] Example 1
[0040] In attribute-level opinion mining datasets, some samples contain multiple attribute words, but their proportion is relatively small, and the number of words in each attribute word varies. For example, in the SemEval 2014 Task 4 Restaurant dataset and Laptop dataset, sentence samples containing multiple attributes account for approximately 42.95% and 37.23%, respectively. Most attribute words contain 1 to 2 words, while a few contain more: some attribute words in the Restaurant dataset contain 14 words, and some in the Laptop dataset contain 9 words.
[0041] This invention addresses the characteristics of current attribute-level opinion mining datasets, which often contain a limited number of samples with multiple attribute words and varying attribute word lengths. Based on text fragment annotation, it designs a data augmentation method for multi-attribute word extraction, primarily solving the following two types of technical problems:
[0042] (1) This invention addresses the problem that existing sequence labeling methods (such as BIO labeling) easily lead to a decrease in the accuracy of phrase-type attribute word extraction. Existing sequence labeling methods often use BIO tags to label each word in a sentence. In information extraction tasks, this method easily decomposes a single phrase-type attribute word into multiple attribute words, resulting in the erroneous extraction of multiple attribute words. The data augmentation method proposed in this invention treats each attribute word as a whole for labeling, thereby protecting the overall semantics of phrase-type attribute words.
[0043] (2) Addressing the problem that existing data augmentation methods easily affect the original semantics of samples. Existing technologies may alter the word order and label information of sentences and their attribute words, leading to the introduction of excessive data noise and thus changing the original semantics of the samples. The data augmentation method proposed in this invention can construct new training samples by combining different label sets without changing the original semantics of the sample text, providing more supervision information for model training.
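To make problem (1) concrete, the sketch below decodes BIO tags into spans and shows a single mis-predicted tag splitting one phrase-type attribute word into two. The tokens, the tag sequences, and the function name `bio_to_spans` are hypothetical illustrations, not taken from the patent, and the lenient handling of a stray `I` tag is an assumption (decoders differ on this point).

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Decode BIO tags into 1-indexed, inclusive (start, end) spans.

    Lenient decoding (an assumption): an "I" tag that does not continue
    an open span is treated as the start of a new span.
    """
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for pos, tag in enumerate(tags, start=1):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.append((start, pos - 1))  # close the previous span
            start = pos
        elif tag == "O":
            if start is not None:
                spans.append((start, pos - 1))
            start = None
        # tag == "I" with an open span: the span simply continues
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical sentence with one three-word attribute word: "hard disk drive".
tokens = ["the", "hard", "disk", "drive", "is", "noisy"]
gold = ["O", "B", "I", "I", "O", "O"]
pred = ["O", "B", "O", "I", "O", "O"]  # one wrong tag in the middle

gold_spans = bio_to_spans(gold)
pred_spans = bio_to_spans(pred)
print(gold_spans)  # [(2, 4)]
print(pred_spans)  # [(2, 2), (4, 4)]
```

With span-level annotation, the phrase is represented by a single pair (2, 4), so there is no per-word tag inside the phrase that can be mislabeled independently.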
[0044] In machine learning, supervised learning methods have been widely used due to their excellent performance. However, supervised learning models typically rely on high-quality, sufficient labeled data; insufficient labeled data can limit the potential for performance improvement. Therefore, to alleviate the problem of scarce labeled data in multi-attribute extraction tasks based on supervised methods, this invention proposes a novel data augmentation method.
[0045] Since each training sample in the labeled dataset for attribute word extraction tasks includes both text and labels, augmentation methods for training samples can be designed from both text and label perspectives. However, most existing research focuses on data augmentation from the text perspective, neglecting the potential of augmenting data from label data. Therefore, this invention addresses multi-attribute word extraction applications by augmenting training data using label data without requiring additional external computing resources.
[0046] The method proposed in this invention mainly includes three parts: attribute word annotation based on text fragments, construction of a multi-attribute word label dataset, and generation of new training samples. To make the invention's objectives, technical solutions, and advantages clearer, the invention is described in further detail below with reference to the accompanying drawings. As shown in Figure 1, a data augmentation method for multi-attribute word extraction is as follows:
[0047] Step S1: For the original training sample sentence containing multiple attribute words, perform attribute word annotation based on the text fragment;
[0048] Step S2: Construct a multi-attribute word label dataset based on the annotation results;
[0049] Step S3: Combine each subset of the multi-attribute word label dataset with the original sample sentences to generate new training samples.
[0050] In this embodiment, it should be noted that the attribute words of the original training sample sentences are composed of single words or combinations of multiple words, and their lengths vary. For phrase-type attribute words containing multiple words, existing sequence labeling methods such as BIO tend to erroneously extract them as multiple attribute words, thus destroying their overall semantics. Therefore, in this embodiment, step S1 specifically includes:
[0051] The text fragment corresponding to each attribute word is treated as a whole, and the start and end positions of that fragment are marked. As shown in Figure 1, the original training sample sentence is: "This laptop has good performance, but the screen is a bit big and it's a bit heavy." Step S1 yields the following text fragments (i.e., attribute words), annotated as follows:
[0052] 1. "Performance": the label pair after annotation is "(s1=6,e1=7)";
[0053] 2. "Screen": the label pair after annotation is "(s2=13,e2=15)";
[0054] 3. "Weight": the label pair after annotation is "(s3=20,e3=21)".
[0055] It should be noted that the numbers give the positions of the start and end words of the text fragment within the original training sample sentence; for example, (s1=6,e1=7) indicates that the fragment runs from the 6th to the 7th word of the original training sample sentence.
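Step S1 can be sketched as a small data structure, assuming 1-indexed, inclusive positions as described above. The class `SpanLabel`, the helper `span_text`, and the English sentence are hypothetical; the positions below belong to this toy sentence and are not the positions of the patent's example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpanLabel:
    """Label pair (s, e): 1-indexed, inclusive start and end positions."""
    start: int
    end: int


def span_text(tokens: List[str], label: SpanLabel) -> str:
    """Return the text fragment covered by a label pair, treated as a whole."""
    if not (1 <= label.start <= label.end <= len(tokens)):
        raise ValueError(f"label pair out of range: {label}")
    return " ".join(tokens[label.start - 1 : label.end])


# Hypothetical sentence: "battery life" is a phrase-type attribute word,
# "screen" is a single-word attribute word.
tokens = "the battery life is great but the screen is dim".split()
labels = [SpanLabel(2, 3), SpanLabel(8, 8)]

fragments = [span_text(tokens, lab) for lab in labels]
print(fragments)  # ['battery life', 'screen']
```

Each attribute word, whatever its length, is described by exactly one pair, which is what later allows label pairs to be selected or omitted independently.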
[0056] In this embodiment, specifically, step S2 includes:
[0057] Step S21: Based on the annotation results, generate a list of attribute words and its corresponding set of label pairs; preferably, all original training sample sentences containing multiple attribute words can be selected from the original training set (the original training sample sentences are included in the original training set), and the list of attribute words and the set of label pairs for each original training sample sentence can be obtained.
[0058] Step S22: Based on the list of attribute words and the set of label pairs, construct a set $G_{label}$ of non-empty proper subsets.
[0059] In this embodiment, specifically, step S22 includes:
[0060] From the set of label pairs corresponding to the attribute word list, take any number of label pairs to form each subset of the set $G_{label}$.
[0061] In this embodiment, specifically, step S3 includes:
[0062] Each subset of the set $G_{label}$ is combined with the sentence of the original training sample to generate new training samples; the new training samples are then added to the original training set to achieve data augmentation.
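Steps S22 and S3 can be sketched as follows, using the three label pairs from the example above. This is a minimal illustration under stated assumptions: the dictionary layout of a sample, the function names `nonempty_proper_subsets` and `augment`, and the placeholder sentence string are all hypothetical, and the patent does not specify how a model consumes the resulting samples.

```python
from itertools import combinations
from typing import Dict, List, Tuple

LabelPair = Tuple[int, int]  # (s, e): start and end position of a fragment
Sample = Dict[str, object]   # hypothetical layout: {"sentence": ..., "labels": ...}


def nonempty_proper_subsets(pairs: List[LabelPair]) -> List[Tuple[LabelPair, ...]]:
    """All subsets of `pairs` with 1 to n-1 elements, ordered by size."""
    n = len(pairs)
    return [c for m in range(1, n) for c in combinations(pairs, m)]


def augment(sentence: str, pairs: List[LabelPair]) -> List[Sample]:
    """Pair the unchanged sentence with each non-empty proper label subset."""
    return [
        {"sentence": sentence, "labels": list(subset)}
        for subset in nonempty_proper_subsets(pairs)
    ]


# Placeholder for the sentence text; only the label pairs matter here.
sentence = "<original training sample sentence>"
pairs = [(6, 7), (13, 15), (20, 21)]

train_set: List[Sample] = [{"sentence": sentence, "labels": pairs}]
new_samples = augment(sentence, pairs)
train_set.extend(new_samples)  # add the new samples to the original training set

print(len(new_samples))  # 6
print(len(train_set))    # 7
```

The sentence text is never modified, so word order and fragment positions stay valid in every new sample. A sentence with a single attribute word yields no new samples, since a one-element set has no non-empty proper subset.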
[0063] In this embodiment, specifically, step S21 includes:
[0064] Attribute word list $A = \{a_1, a_2, \ldots, a_n\}$, where $a$ represents an attribute word and $n$ represents the total number of attribute words;
[0065] After each attribute word in the attribute word list $A = \{a_1, a_2, \ldots, a_n\}$ is labeled, a set of start and end position label pairs $B = \{(s_1, e_1), (s_2, e_2), \ldots, (s_n, e_n)\}$ is generated, where $(s, e)$ represents a label pair, $s$ represents the start position of the text segment corresponding to the attribute word, and $e$ represents the end position of the text segment corresponding to the attribute word; for example, as shown in Figure 1, $(s_1, e_1)$ represents label pair 1 and $(s_2, e_2)$ represents label pair 2.
[0066] In this embodiment, specifically, step S22 includes:
[0067] The set of non-empty proper subsets is $G_{label} = \{G_1, G_2, \ldots, G_m\}$, where $G_m$ represents the set consisting of the combinations of $m$ label pairs taken from the label pair set $B$.
[0068] In this embodiment, the total number T of new training samples that can be generated from each original training sample sentence containing multiple attribute words is calculated as follows:
[0069]
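As a sanity check on the count, the sketch below builds each $G_m$ by enumeration and compares the total with the closed form $2^n - 2$. The closed form is inferred from the definition of non-empty proper subsets given above; it is not a reproduction of the patent's formula, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Tuple

LabelPair = Tuple[int, int]


def build_g_label(pairs: List[LabelPair]) -> Dict[int, List[Tuple[LabelPair, ...]]]:
    """Map m -> G_m, the list of all m-combinations of the label pairs."""
    n = len(pairs)
    return {m: list(combinations(pairs, m)) for m in range(1, n)}


def total_new_samples(n: int) -> int:
    """Sum of C(n, m) over m = 1 .. n-1 (assumed reading of T)."""
    return sum(comb(n, m) for m in range(1, n))


pairs = [(6, 7), (13, 15), (20, 21)]
g_label = build_g_label(pairs)
t = sum(len(g_m) for g_m in g_label.values())

print({m: len(g_m) for m, g_m in g_label.items()})  # {1: 3, 2: 3}
print(t, total_new_samples(3), 2**3 - 2)            # 6 6 6
```

Because the count grows as $2^n - 2$, a sentence with many attribute words produces many new samples; whether to cap or subsample them is a design decision the source does not address.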
[0070] The embodiments described above merely illustrate specific implementation methods of this application, and while the descriptions are detailed and specific, they should not be construed as limiting the scope of protection of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the technical solution of this application, and these modifications and improvements all fall within the scope of protection of this application.
[0071] This background section is provided to generally present the context of the invention. The work of the currently named inventors, the work to the extent described in this background section, and aspects of this section that did not constitute prior art at the time of application are neither expressly nor impliedly acknowledged as prior art to the invention.
Claims
1. A data augmentation method for multi-attribute word extraction, characterized in that it includes: Step S1: for the original training sample sentence containing multiple attribute words, performing attribute word annotation based on text fragments; Step S2: constructing a multi-attribute word label dataset based on the annotation results; Step S3: combining each subset of the multi-attribute word label dataset with the original sample sentence to generate new training samples; wherein step S1 includes: treating the text segment corresponding to each attribute word as a whole, and marking the start and end positions of the corresponding text segment; step S2 includes: Step S21: based on the annotation results, generating a list of attribute words and its corresponding set of label pairs; Step S22: based on the list of attribute words and the set of label pairs, constructing a set $G_{label}$ of non-empty proper subsets; and step S22 includes: from the set of label pairs corresponding to the attribute word list, taking any number of label pairs to form each subset of the set $G_{label}$.
2. The data augmentation method for multi-attribute word extraction according to claim 1, characterized in that step S3 includes: combining each subset of the set $G_{label}$ with the original training sample sentence to generate new training samples.
3. The data augmentation method for multi-attribute word extraction according to claim 2, characterized in that step S21 includes: an attribute word list $A = \{a_1, a_2, \ldots, a_n\}$, where $a$ indicates an attribute word and $n$ indicates the total number of attribute words; after each attribute word in the attribute word list $A$ is labeled, a set of start and end position label pairs $B = \{(s_1, e_1), (s_2, e_2), \ldots, (s_n, e_n)\}$ is generated, where $(s, e)$ indicates a label pair, $s$ indicates the start position of the text segment corresponding to the attribute word, and $e$ indicates the end position of the text segment corresponding to the attribute word.
4. The data augmentation method for multi-attribute word extraction according to claim 3, characterized in that step S22 includes: the set of non-empty proper subsets $G_{label} = \{G_1, G_2, \ldots, G_m\}$, where $G_m$ represents the set consisting of the combinations of $m$ label pairs in the label pair set $B$.
5. The data augmentation method for multi-attribute word extraction according to claim 4, characterized in that the total number $T$ of new training samples that can be generated from the original training sample sentence containing multiple attribute words is calculated as follows:
6. The data augmentation method for multi-attribute word extraction according to claim 1, characterized in that it also includes: adding the new training samples to the original training set to achieve data augmentation.