Crop disease and pest named entity recognition method fusing RoBERTa-wwm and adversarial training
By integrating RoBERTa-wwm and the RGC-ADV model trained adversarially, the problems of insufficient word-level semantic information acquisition and sample imbalance in the named entity recognition of crop diseases and pests are solved, achieving more efficient disease and pest entity recognition and improving recognition accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- FUZHOU UNIV
- Filing Date
- 2023-03-08
- Publication Date
- 2026-06-12
AI Technical Summary
Existing named entity recognition models for crop diseases and pests are weak at capturing the features of specialized disease and pest texts and cannot acquire semantic information at the Chinese word level. Entity boundaries are unclear and entities are hard to delimit, and the uneven distribution of label samples degrades recognition performance.
A named entity recognition method for crop diseases and pests is adopted, which integrates RoBERTa-wwm and adversarial training. The RoBERTa-wwm model is used to obtain word-level semantic information, adversarial training is introduced to solve the boundary ambiguity problem, and Focal Loss is used to improve the imbalance of sample labels. The RGC-ADV model is constructed for training.
It improves the performance of named entity recognition for crop diseases and pests, enhances the model's semantic representation ability of Chinese disease and pest text information, improves the recognition effect of entities with unclear boundaries, alleviates the impact of uneven sample distribution, and provides a high-quality technical foundation for crop disease and pest knowledge graphs and question answering systems.

Figure CN116629259B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of information technology, specifically relating to a method for named entity recognition of crop diseases and pests that integrates RoBERTa-wwm and adversarial training.
Background Technology
[0002] To prevent and control crop diseases and pests effectively, it is essential to obtain control information quickly and accurately. However, with the rapid development of information technology, the volume of textual data on crop diseases and pests is growing exponentially, and extracting knowledge about crop diseases and pests from massive, heterogeneous data sources has become a pressing problem. Named entity recognition (NER) for crop diseases and pests aims to identify disease- and pest-related entities accurately and efficiently from massive amounts of unstructured data, helping people quickly obtain valuable control information, which is of great significance for crop disease and pest control. Crop disease and pest entity recognition is also the research foundation for constructing disease and pest knowledge graphs and question-answering systems, and its recognition performance directly affects the quality of these systems. Therefore, developing effective entity recognition models for the disease and pest domain and accurately identifying the various entities in this field has significant research value and practical implications for agricultural informatization.
[0003] Currently, Chinese named entity recognition (NER) is widely used in professional fields such as biomedicine, geology, and finance; however, in-depth research on NER for crop diseases and pests is relatively limited. Previous NER work on crop diseases and pests has mostly relied on rule-based and dictionary-based methods and on machine learning methods. Dictionary-based and rule-based extraction requires pre-designed rules or predefined dictionaries, relies heavily on experts, and adapts poorly to the ever-changing and expanding agricultural data of the big data era, so these methods are gradually being replaced by machine learning methods. Commonly used machine learning models for entity extraction in agriculture include Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). Zhang Jian et al. combined a word segmentation system with CRF to perform agricultural named entity recognition, with a fine-grained division into multiple entity types. Malarkodi et al. proposed a CRF-based NER method for real agricultural data. Statistical machine learning methods have improved the accuracy of agricultural named entity recognition, but researchers must spend considerable effort on feature engineering and data labeling, and problems such as high-dimensional sparse data and poor scalability remain.
[0004] The development of deep learning has brought new breakthroughs to named entity recognition in agriculture. Deep learning methods effectively address the shortcomings of traditional named entity recognition methods in extracting agricultural pest and disease information, such as reliance on manual dictionaries and insufficient feature extraction. Commonly used models include Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNN), and their improved variants. Quoc et al. used BiLSTM and CRF to achieve entity recognition in the agricultural field, providing support for digital agriculture. Yu et al. proposed a recognition method based on BiLSTM and CRF that recognizes named entities of rice variety information in unstructured text. Although deep learning methods have achieved good results, the complex internal structure of LSTM means that models built on a BiLSTM and CRF backbone require a lot of time and resources to train. Some scholars have therefore simplified and improved LSTM, leading to the adoption of the Gated Recurrent Unit (GRU). However, these studies rely on traditional word vector models that produce static representations and cannot distinguish the different meanings of the same word in different contexts. Large-scale pre-trained models based on the Transformer can represent polysemous words and have achieved better performance in tasks such as crop disease and pest entity recognition. Zhang et al. introduced BERT for dynamic vector representation in a crop disease entity recognition model, then used BiLSTM to learn contextual information, and finally obtained the globally optimal label sequence through CRF. Liu et al. proposed an ALBERT-BiLSTM-CRF model for wheat disease and pest named entity recognition, which integrates ALBERT with rules and achieved improvements in accuracy, recall, and F1 score.
[0005] The main drawbacks of the existing technology are: (1) The existing crop pest and disease named entity recognition model is still lacking in the ability to acquire professional text features in the field of pests and diseases, and cannot acquire semantic information at the level of Chinese words. It has problems such as unclear boundaries, difficulty in defining entities, and inaccurate recognition of proper nouns; (2) There is an imbalance in the distribution of entity label samples in the crop pest and disease corpus data. The imbalance in sample distribution will seriously affect the overall performance of entity recognition.
Summary of the Invention
[0006] The purpose of this invention is to provide a method for named entity recognition of crop diseases and pests that integrates RoBERTa-wwm and adversarial training, which is beneficial to improving the performance of named entity recognition of crop diseases and pests.
[0007] To achieve the above objectives, the technical solution adopted by this invention is: a method for named entity recognition of crop diseases and pests that integrates RoBERTa-wwm and adversarial training, characterized in that it includes:
[0008] Data is obtained from different data sources to construct a raw corpus of crop diseases and pests. The raw corpus is preprocessed to remove redundant data, resulting in a standardized corpus of crop diseases and pests. Entity data is labeled, and the labeled data is divided into training set, validation set, and test set according to a set ratio.
[0009] A crop pest and disease named entity recognition model, RGC-ADV, was constructed by integrating RoBERTa-wwm and adversarial training. The RGC-ADV model extracts word-level semantic information from pest and disease texts using the RoBERTa-wwm model to obtain dynamic word vectors, thereby solving the problem of incomplete word recognition. Adversarial training is used to address the difficulty in defining pest and disease entities with ambiguous boundaries. Focal Loss is also introduced to improve the imbalance of pest and disease sample labels. The constructed RGC-ADV model is trained and tested using the obtained training set, validation set, and test set to obtain the trained RGC-ADV model.
[0010] Named entity recognition of crop diseases and pests is performed using a trained RGC-ADV model.
[0011] Furthermore, the RGC-ADV model comprises seven network layers: an input layer, a RoBERTa-wwm layer, an adversarial training layer, a BiGRU layer, a fully connected layer, a CRF layer, and an output layer. The input text is encoded by the pre-trained RoBERTa-wwm layer to obtain word-level semantic information, converting each word in the pest and disease text into a feature vector. The adversarial training layer adds perturbations to the feature vectors output by the RoBERTa-wwm layer to generate adversarial examples, improving the model's robustness to input perturbations. The original feature vectors and the adversarial examples are input together into the BiGRU layer for training, so that contextual relationships are fully learned. The features extracted by the BiGRU layer are integrated by a fully connected layer and then input into the CRF layer to obtain the final prediction result. Focal Loss is introduced to address the imbalance of pest and disease sample labels.
[0012] Furthermore, the RoBERTa-wwm layer is implemented based on the RoBERTa-wwm model, which employs a Chinese whole-word masking strategy. The pest and disease corpus text is first segmented into words; masking is then applied at the word level, so that all Chinese characters making up the same word are masked together and the word is predicted as a whole. This yields dynamic word vectors with word-level features, which are better suited to Chinese crop pest and disease named entity recognition. The RGC-ADV model uses RoBERTa-wwm as the pre-trained model for extracting text features, and its parameters are fine-tuned during training so that the model better learns the semantic features of the pest and disease data. The RoBERTa-wwm model consists of 12 Transformer layers. Each Transformer layer uses a multi-head attention mechanism, which reduces the distance between two words at any positions in the input pest and disease sequence to a constant. Assume the input is $Tok = \{Tok_{[CLS]}, Tok_1, Tok_2, \ldots, Tok_n, Tok_{[SEP]}\}$, where $Tok_i$, $T_i$, and $E_i$ denote, respectively, the i-th character of the pest and disease text and its word vectors before and after Transformer encoding. RoBERTa-wwm produces the vector $E = \{E_{[CLS]}, E_1, E_2, \ldots, E_n, E_{[SEP]}\}$ corresponding to $Tok$; $E$ contains the pest and disease semantic information obtained by RoBERTa-wwm during the pre-training phase.
[0013] Furthermore, the adversarial training layer adds perturbations to the model: perturbations are added to the original input pest and disease samples to obtain adversarial samples, which are then input into the model for training. This guards against noise from private information in the pest and disease data and improves the model's generalization ability. The projected gradient descent (PGD) method is introduced into the RGC-ADV model for iterative attacks, with the perturbation kept within a specified range $S$ at each iteration. Once the perturbation exceeds $S$, the adversarial sample $x_t$ is projected back onto the specified range $x+S$. The iterative process is shown in formula (1);
[0014]
[0015] Where $\alpha$ represents the magnitude of the perturbation in each PGD iteration, $\Pi_{x+S}$ denotes the projection operation, and $x_t$ and $x_{t+1}$ denote the adversarial samples at iterations $t$ and $t+1$. The RGC-ADV model generates adversarial examples $E_{ADV}$ by adding perturbations to the initial pest and disease vector $E$ output by the RoBERTa-wwm layer; $E$ and $E_{ADV}$ are then used together as input to the BiGRU layer for training.
[0016] Furthermore, the BiGRU layer is composed of a forward GRU and a backward GRU. The GRU consists of an update gate and a reset gate, denoted $z_t$ and $r_t$ respectively. The update gate determines the extent to which information from the previous moment enters the current moment, and the reset gate controls how much information is forgotten. The calculation process of the GRU network is shown in formulas (2) to (5);
[0017] $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$ (2)
[0018] $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$ (3)
[0019]
[0020]
[0021] Where $x_t$ represents the pest and disease information input at time $t$; $\sigma$ represents the Sigmoid activation function; $h_t$ and $h_{t-1}$ represent the output vectors of the hidden layer at times $t$ and $t-1$, respectively; $\tilde{h}_t$ represents the current candidate state; $W_r$, $W_z$, and the candidate-state weight matrix are the input weight matrices; $*$ represents the Hadamard product; and tanh() represents the activation function.
[0022] The BiGRU layer uses a bidirectional gated recurrent unit (BiGRU) to train the pest and disease text vectors output by the adversarial training layer, captures the semantic dependencies of the pest and disease information context, and acquires and utilizes forward and backward data features to improve prediction accuracy.
[0023] Furthermore, the fully connected layer is used to map the learned distributed feature representations to the sample label space to achieve sample classification; the RGC-ADV model integrates the pest and disease sample feature results output by the BiGRU layer through the fully connected layer to weaken the influence of location features on the classification results, thereby improving the classification effect of pest and disease samples.
[0024] Furthermore, the CRF layer uses the sequence P output by the fully connected layer as the input of the CRF to calculate the score of the output sequence H. The calculation process is shown in formula (6).
[0025]
[0026] Where $n$ is the sequence length, $A$ represents the transition score matrix, and $A_{ij}$ represents the transition score from the i-th label to the j-th label of a pest or disease sample. During decoding, the Viterbi algorithm is used to find the label sequence $H_{\max}$ with the highest probability among all candidate sequences $H$, as shown in formula (7);
[0027] $H_{\max} = \arg\max(\mathrm{score}(H, y))$ (7)
[0028] The RGC-ADV model obtains the globally optimal label prediction sequence through the CRF layer and introduces the Focal Loss function to optimize the CRF model. Focal Loss controls the weights of positive and negative samples so that training focuses more on hard-to-classify samples, continuously improving model performance. Its calculation process is shown in formula (8).
[0029] $\mathrm{Loss}_{\mathrm{Focal}} = -\alpha\,(1 - P(y|x))^{\gamma}\,\ln(P(y|x))$ (8)
[0030] Where α∈[0,1] is a balancing factor used to balance the number of positive and negative samples in the pest and disease samples, γ≥0 is a modulation coefficient used to reduce the loss of easily classified samples, i.e., non-pest and disease entity samples, so that the model pays more attention to difficult samples, i.e. pest and disease entity samples; P(y|x) represents the probability that the label of pest and disease sample x is y.
[0031] Compared with existing technologies, this invention has the following advantages: It provides a method for named entity recognition of crop diseases and pests that integrates RoBERTa-wwm and adversarial training. This method constructs a named entity recognition model for crop diseases and pests, RGC-ADV, which fully considers the features of the corpus and the implicit features in sentences. It generates word vector representations that integrate information and features from the agricultural disease and pest domain through RoBERTa-wwm, alleviating the bias caused by incomplete semantic feature representation during prediction and enhancing the model's semantic representation ability of Chinese disease and pest text information. Furthermore, adversarial training is introduced during model training, adding adversarial perturbations to the word vector layer to improve the model's generalization ability and further improve the recognition effect of entities with unclear boundaries. At the same time, the Focal Loss loss function is introduced to optimize CRF, effectively mitigating the impact of imbalanced classification of disease and pest label samples, and providing a high-quality technical foundation for downstream tasks such as crop disease and pest knowledge graphs and question answering systems.
Attached Figure Description
[0032] Figure 1 is a flowchart of the method implementation of an embodiment of the present invention;
[0033] Figure 2 is a schematic diagram of the structure of the RGC-ADV model in an embodiment of the present invention;
[0034] Figure 3 is a schematic diagram of the structure of the RoBERTa-wwm model in an embodiment of the present invention;
[0035] Figure 4 is a schematic diagram of the structure of the GRU in an embodiment of the present invention.
Detailed Implementation
[0036] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0037] It should be noted that the following detailed descriptions are exemplary and intended to provide further explanation of this application. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains.
[0038] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to this application. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0039] This embodiment provides a method for named entity recognition of crop diseases and pests that integrates RoBERTa-wwm and adversarial training, including the following steps:
[0040] I. Obtain relevant data from different data sources to construct a raw corpus of crop diseases and pests. Preprocess the raw corpus to remove redundant data and obtain a standardized corpus of crop diseases and pests. Label the entity data and divide the labeled data into training set, validation set and test set according to a set ratio.
[0041] (1) Data Acquisition. Crop pest and disease data released by the Institute of Crop Science, Chinese Academy of Agricultural Sciences, was selected as the primary data source, supplemented by Baidu Encyclopedia, and further supplemented by professional books and pest and disease control experience from agricultural experts in Fujian Province. Based on the different data sources, appropriate data acquisition methods were determined, and corresponding data were acquired to construct a raw corpus of crop pests and diseases. This raw corpus contained a large amount of redundant data. The raw data was preprocessed to remove redundant data, resulting in a standardized corpus of crop pests and diseases.
[0042] (2) Data labeling. Based on the guidance of agricultural experts, entity categories were divided, distinguishing suffixes were defined, and entity data were labeled. The labeled data was then divided into training set, validation set, and test set in a ratio of 7:1:2.
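The 7:1:2 division described above can be illustrated with a minimal sketch. The function name `split_dataset`, the fixed random seed, and the shuffle-then-slice strategy are assumptions for illustration; the embodiment does not state how samples are assigned to each set.

```python
import random

def split_dataset(samples, ratios=(7, 1, 2), seed=42):
    """Shuffle labeled sentences and split them into train/validation/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    total = sum(ratios)
    n = len(shuffled)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

# Hypothetical corpus of 100 labeled sentences.
sentences = [f"sentence_{i}" for i in range(100)]
train_set, val_set, test_set = split_dataset(sentences)
print(len(train_set), len(val_set), len(test_set))  # 70 10 20
```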
[0043] II. Construct the RGC-ADV model for named entity recognition of crop diseases and pests, which integrates RoBERTa-wwm and adversarial training and whose structure is shown in Figure 2. The RGC-ADV model extracts word-level semantic information from pest and disease texts using the RoBERTa-wwm model to obtain dynamic word vectors, thus addressing the problem of incomplete word recognition. It also employs adversarial training to resolve the difficulty in defining pest and disease entities with ambiguous boundaries, and introduces Focal Loss to improve the imbalance of pest and disease sample labels. The constructed RGC-ADV model is trained and tested using the obtained training, validation, and test sets to obtain the trained RGC-ADV model.
[0044] In this embodiment, the RGC-ADV model comprises seven network layers: an input layer, a RoBERTa-wwm layer, an adversarial training layer, a BiGRU layer, a fully connected layer, a CRF layer, and an output layer. Its working mechanism is as follows:
[0045] 1) The input text is encoded by the pre-trained RoBERTa-wwm layer to obtain word-level semantic information, and each word in the pest and disease text is converted into a feature vector.
[0046] 2) The adversarial training layer adds perturbations to the feature vectors output by the RoBERTa-wwm layer to generate adversarial examples, thereby improving the model's robustness to input perturbations.
[0047] 3) Input the original feature vector and adversarial examples together into the BiGRU layer for training to fully learn the relationship between contexts.
[0048] 4) The features extracted by the BiGRU layer are integrated by the fully connected layer and then input into the CRF layer to obtain the final prediction result. At the same time, Focal Loss is introduced to improve the problem of imbalanced labels of pest and disease samples.
[0049] In the above steps, the RoBERTa-wwm layer, adversarial training layer, BiGRU layer, fully connected layer, and CRF layer are the focus of this invention, and will be discussed in detail below.
[0050] (1) RoBERTa-wwm layer
[0051] The RoBERTa-wwm layer is implemented based on the RoBERTa-wwm model. RoBERTa-wwm is developed from BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language model that uses a bidirectional Transformer as its encoder and effectively integrates information from the context on both sides of a word. BERT's training objectives are Next Sentence Prediction (NSP) and the Masked Language Model (MLM). MLM randomly selects 15% of the words in the input pest and disease sentences for replacement; of these, 80% are replaced with a mask token, 10% are replaced with random words, and 10% are left unchanged. When performing the MLM task, BERT uses static masking, meaning each pest and disease sample is randomly masked only once for the entire training process. In contrast, RoBERTa uses dynamic masking, randomly selecting a certain percentage of words from the pest and disease samples for replacement in each iteration. This exposes the model to more sentence patterns, thereby improving the accuracy of named entity recognition for different pest and disease types. RoBERTa also removes the NSP task and instead feeds in consecutive text (Full-Sentences and Doc-Sentences) until the input reaches its maximum length. Research shows that RoBERTa performs better in predicting sentence relationships. For Chinese disease and pest corpora, BERT masks only individual Chinese characters when performing MLM and therefore fails to learn word-level semantic information. WWM (whole-word masking) instead masks samples at the word level during the pre-training phase, effectively acquiring semantic representations at the disease and pest word level. RoBERTa-wwm combines the advantages of RoBERTa and WWM.
[0052] In this embodiment, the RoBERTa-wwm model employs a Chinese whole-word masking strategy. The pest and disease corpus text is first segmented into words; masking is then applied at the word level, so that all Chinese characters making up the same word are masked together and the word is predicted as a whole. This yields dynamic word vectors with word-level features, which are better suited to Chinese crop pest and disease named entity recognition. The RGC-ADV model uses RoBERTa-wwm as the pre-trained model for extracting text features, and its parameters are fine-tuned during training so that the model better learns the semantic features of the pest and disease data. The RoBERTa-wwm model consists of 12 Transformer layers. Each Transformer layer uses a multi-head attention mechanism, which reduces the distance between two words at any positions in the input pest and disease sequence to a constant. The model structure is shown in Figure 3. Assume the input is $Tok = \{Tok_{[CLS]}, Tok_1, Tok_2, \ldots, Tok_n, Tok_{[SEP]}\}$, where $Tok_i$, $T_i$, and $E_i$ denote, respectively, the i-th character of the pest and disease text and its word vectors before and after Transformer encoding. RoBERTa-wwm produces the vector $E = \{E_{[CLS]}, E_1, E_2, \ldots, E_n, E_{[SEP]}\}$ corresponding to $Tok$; $E$ contains the pest and disease semantic information obtained by RoBERTa-wwm during the pre-training phase.
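The difference between character-level and whole-word masking can be illustrated with a small sketch. It assumes the sentence has already been segmented by an external Chinese word segmenter and only shows the selection step: every character of a chosen word is masked together. The function name `whole_word_mask`, the example sentence, and the masking probability are hypothetical, and the 80%/10%/10% replacement scheme is omitted for brevity.

```python
import random

def whole_word_mask(words, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask every character of each selected word (whole-word masking)."""
    rng = random.Random(seed)
    tokens, labels = [], []
    for word in words:
        masked = rng.random() < mask_prob      # one decision per word, not per character
        for ch in word:
            tokens.append(mask_token if masked else ch)
            labels.append(ch if masked else None)  # prediction target only where masked
    return tokens, labels

# Hypothetical segmented sentence: "rice / sheath blight / damages / leaf sheath".
words = ["水稻", "纹枯病", "危害", "叶鞘"]
tokens, labels = whole_word_mask(words, mask_prob=0.5, seed=1)
print(tokens)
```

Because the masking decision is taken once per word, a multi-character entity name such as 纹枯病 is either fully visible or fully hidden, which is what forces the model to predict it as a unit.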
[0053] (2) Adversarial Training Layer
[0054] The adversarial training layer adds perturbations to the model: perturbations are added to the original input pest and disease samples to obtain adversarial samples, which are then input into the model for training. This guards against noise from private information in the pest and disease data and improves the model's generalization ability. In this embodiment, Projected Gradient Descent (PGD) is introduced into the RGC-ADV model for iterative attacks, with the perturbation kept within a specified range $S$ at each iteration. Once the perturbation exceeds $S$, the adversarial sample $x_t$ is projected back onto the specified range $x+S$. The iterative process is shown in formula (1);
[0055]
[0056] Where $\alpha$ represents the magnitude of the perturbation in each PGD iteration, $\Pi_{x+S}$ denotes the projection operation, and $x_t$ and $x_{t+1}$ denote the adversarial samples at iterations $t$ and $t+1$. The RGC-ADV model generates adversarial examples $E_{ADV}$ by adding perturbations to the initial pest and disease vector $E$ output by the RoBERTa-wwm layer; $E$ and $E_{ADV}$ are then used together as input to the BiGRU layer for training.
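The iterate-then-project structure of PGD can be sketched on a toy embedding vector. Several details here are assumptions rather than the patented formula: the allowed set $S$ is taken to be an L2 ball of radius `epsilon`, each step moves by `alpha` along the normalized loss gradient (a common choice for embedding perturbations), and a quadratic toy loss with a known gradient stands in for the model's training loss. The names `pgd_perturb`, `project`, and `toy_grad` are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def project(delta, epsilon):
    """Project the accumulated perturbation back onto the L2 ball of radius epsilon."""
    norm = l2_norm(delta)
    if norm > epsilon:
        return [x * epsilon / norm for x in delta]
    return delta

def pgd_perturb(embedding, grad_fn, alpha=0.3, epsilon=1.0, steps=3):
    """Return an adversarial copy of `embedding` after `steps` PGD iterations."""
    adv = list(embedding)
    for _ in range(steps):
        grad = grad_fn(adv)
        norm = l2_norm(grad)
        if norm == 0.0:
            break
        # Ascent step of size alpha along the normalized loss gradient.
        adv = [a + alpha * g / norm for a, g in zip(adv, grad)]
        # Keep the total perturbation inside the allowed set x + S.
        delta = project([a - e for a, e in zip(adv, embedding)], epsilon)
        adv = [e + d for e, d in zip(embedding, delta)]
    return adv

# Toy loss L(x) = 0.5 * ||x - target||^2, whose gradient is x - target.
target = [1.0, -2.0, 0.5]
toy_grad = lambda x: [xi - ti for xi, ti in zip(x, target)]

E = [0.2, 0.1, -0.3]              # stand-in for one embedding vector
E_adv = pgd_perturb(E, toy_grad)  # adversarial counterpart fed to the next layer with E
print(l2_norm([a - e for a, e in zip(E_adv, E)]))
```

In a real model the gradient would come from backpropagating the training loss to the embedding output, and both the clean and the perturbed vectors would be passed on for training.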
[0057] (3) BiGRU layer
[0058] The BiGRU layer is composed of a forward GRU and a backward GRU. The internal structure of the GRU is shown in Figure 4. The GRU consists of an update gate and a reset gate, denoted $z_t$ and $r_t$ respectively. The update gate determines the extent to which information from the previous time step enters the current time step, while the reset gate controls how much information is forgotten. The calculation process of the GRU network is shown in formulas (2) to (5);
[0059] $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$ (2)
[0060] $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$ (3)
[0061]
[0062]
[0063] Where $x_t$ represents the pest and disease information input at time $t$; $\sigma$ represents the Sigmoid activation function; $h_t$ and $h_{t-1}$ represent the output vectors of the hidden layer at times $t$ and $t-1$, respectively; $\tilde{h}_t$ represents the current candidate state; $W_r$, $W_z$, and the candidate-state weight matrix are the input weight matrices; $*$ represents the Hadamard product; and tanh() represents the activation function.
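For orientation, a commonly used form of the candidate state and hidden state update that formulas (4) and (5) refer to can be written as follows. This is a sketch under stated assumptions: the symbol $\tilde{h}_t$ for the candidate state, the name $W_{\tilde{h}}$ for its weight matrix, and the choice of which term the update gate weights are all assumptions, since both weighting conventions appear in the literature and bias terms are omitted here as in formulas (2) and (3).

```latex
\begin{aligned}
\tilde{h}_t &= \tanh\left(W_{\tilde{h}} \cdot [\, r_t * h_{t-1},\; x_t \,]\right) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
```

Under this convention, a reset gate close to 0 makes the candidate state ignore the previous hidden state, while the update gate interpolates between the previous hidden state and the candidate state.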
[0064] However, since GRU can only process pest and disease data in one direction, it can only make predictions by acquiring features from forward text data. Therefore, the BiGRU layer uses a bidirectional gated recurrent unit (BiGRU) to train the pest and disease text vectors output from the adversarial training layer, capture the semantic dependencies of the pest and disease information context, and simultaneously acquire and utilize forward and backward data features to improve prediction accuracy.
[0065] (4) Fully connected layer
[0066] The fully connected layer is used to map the learned distributed feature representations to the sample label space to achieve sample classification. The RGC-ADV model integrates the pest and disease sample feature results output by the BiGRU layer through the fully connected layer to reduce the influence of location features on the classification results, thereby improving the classification effect of pest and disease samples.
[0067] (5) CRF layer
[0068] The CRF layer uses the sequence P output by the fully connected layer as the input of the CRF to calculate the score of the output sequence H. The calculation process is shown in formula (6).
[0069]
[0070] Where $n$ is the sequence length, $A$ represents the transition score matrix, and $A_{ij}$ represents the transition score from the i-th label to the j-th label of a pest or disease sample. During decoding, the Viterbi algorithm is used to find the label sequence $H_{\max}$ with the highest probability among all candidate sequences $H$, as shown in formula (7);
[0071] $H_{\max} = \arg\max(\mathrm{score}(H, y))$ (7)
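Viterbi decoding over emission and transition scores can be sketched as below. The path score used here, the sum of per-position emission scores and label-to-label transition scores without separate start or end transitions, is a common CRF formulation assumed for illustration. The label set, all numeric scores, and the function names are hypothetical; the large negative O to I-DISEASE transition mimics the constraint that an I- label should follow a B- label.

```python
def sequence_score(emissions, transitions, labels):
    """Score of one label path: emission scores plus transition scores."""
    score = emissions[0][labels[0]]
    for i in range(1, len(labels)):
        score += transitions[labels[i - 1]][labels[i]] + emissions[i][labels[i]]
    return score

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path and its score."""
    n_labels = len(transitions)
    best = list(emissions[0])      # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        step_best, step_back = [], []
        for j in range(n_labels):
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            step_best.append(best[prev] + transitions[prev][j] + emit[j])
            step_back.append(prev)
        best = step_best
        backpointers.append(step_back)
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for step_back in reversed(backpointers):   # follow back-pointers to the start
        path.append(step_back[path[-1]])
    path.reverse()
    return path, best[last]

label_names = ["O", "B-DISEASE", "I-DISEASE"]
# emissions[t][k]: score of label k at position t (output of the fully connected layer).
emissions = [[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [1.0, 0.2, 0.4]]
# transitions[i][j]: score of moving from label i to label j.
transitions = [[0.5, 0.2, -100.0], [0.1, -1.0, 0.8], [0.3, -0.5, 0.6]]

path, score = viterbi_decode(emissions, transitions)
print([label_names[k] for k in path])  # ['B-DISEASE', 'I-DISEASE', 'O']
```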
[0072] In the named entity recognition task for crop diseases and pests, adjacent labels have an order relationship; for example, the I-DISEASE label should appear after B-DISEASE. However, BiGRU has limited ability to learn such dependencies between labels. Therefore, the RGC-ADV model adds a CRF layer to obtain the globally optimal label prediction sequence. In addition, because the classification of disease and pest sample labels is imbalanced, the Focal Loss function is introduced to optimize the CRF model. Focal Loss controls the weights of positive and negative samples so that training focuses more on hard-to-classify samples, continuously improving model performance; its calculation process is shown in formula (8).
[0073] $\mathrm{Loss}_{\mathrm{Focal}} = -\alpha\,(1 - P(y|x))^{\gamma}\,\ln(P(y|x))$ (8)
[0074] Where α∈[0,1] is a balancing factor used to balance the number of positive and negative samples in the pest and disease samples, γ≥0 is a modulation coefficient used to reduce the loss of easily classified samples, i.e., non-pest and disease entity samples, so that the model pays more attention to difficult samples, i.e. pest and disease entity samples; P(y|x) represents the probability that the label of pest and disease sample x is y.
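The effect of the modulation coefficient in formula (8) can be checked numerically with a small sketch. The default values `alpha=0.25` and `gamma=2.0` and the two example probabilities are assumptions for illustration and are not stated in this embodiment; how $P(y|x)$ is obtained from the CRF layer is likewise left open here.

```python
import math

def focal_loss(p, alpha=0.25, gamma=2.0):
    """Formula (8) for one sample, where p = P(y|x) is the probability of the true label."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

def cross_entropy(p):
    """Plain negative log-likelihood, for comparison."""
    return -math.log(p)

easy, hard = 0.95, 0.30   # hypothetical well-classified and poorly classified samples
for name, p in [("easy", easy), ("hard", hard)]:
    ratio = focal_loss(p) / cross_entropy(p)
    print(f"{name}: CE={cross_entropy(p):.4f} focal={focal_loss(p):.6f} ratio={ratio:.6f}")
```

The ratio of Focal Loss to cross-entropy equals $\alpha(1-p)^{\gamma}$, so it shrinks rapidly as $p$ approaches 1. This is how easily classified samples are down-weighted relative to difficult ones; with $\alpha = 1$ and $\gamma = 0$ the expression reduces to ordinary cross-entropy.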
[0075] III. Perform named entity recognition of crop diseases and pests using the trained RGC-ADV model.
[0076] In this embodiment, the constructed entity recognition model RGC-ADV is used to study and analyze named entity recognition of crop diseases and pests, specifically as follows:
[0077] (1) The dataset was encoded with the pre-trained RoBERTa-wwm model, the model was trained with the AdamW optimizer, and a Warmup learning rate strategy was used to assist learning.
[0078] (2) Multiple models were used for comparative analysis, including two types of comparative experiments: different embedding methods and different downstream model structures under the same embedding method.
[0079] (3) Model performance tests were conducted on eight types of entities: pests and diseases, alternative names, pathogens, affected parts, distribution areas, disease onset periods, affected crops, and control agents. The models were evaluated using three evaluation indicators: precision, recall, and F1 score.
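Entity-level precision, recall, and F1 score can be computed from BIO label sequences as in the following minimal sketch. It assumes exact-match scoring, in which a predicted entity counts as correct only when its type and both boundaries agree with the gold annotation; the embodiment does not state its matching criterion. The function names and the CROP label are hypothetical.

```python
def extract_entities(labels):
    """Collect (type, start, end) spans from a BIO label sequence."""
    entities, start, etype = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):   # sentinel closes a trailing span
        continues = label.startswith("I-") and etype == label[2:]
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return entities

def precision_recall_f1(gold, pred):
    """Exact-match entity-level precision, recall, and F1."""
    gold_set, pred_set = extract_entities(gold), extract_entities(pred)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-DISEASE", "I-DISEASE", "O", "B-CROP", "O"]
pred = ["B-DISEASE", "I-DISEASE", "O", "O", "B-CROP"]   # second entity has wrong boundaries
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Under exact matching, an entity whose boundary is off by one character counts as both a false positive and a false negative, which is why categories with ambiguous boundaries tend to score lower.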
[0080] This embodiment selects Fujian Province as the study area. Based on data from the Third National Crop Germplasm Resources Survey of Fujian Academy of Agricultural Sciences and the Fujian Provincial Statistical Yearbook, pest and disease materials were obtained for 10 major crops, including five food crops (rice, soybean, wheat, barley, and sweet potato) and five cash crops (tea, sugarcane, peanut, radish, and rapeseed). The RGC-ADV model proposed in this invention was trained on the crop pest and disease dataset, and its performance was tested on the eight entity classes. The experimental results are shown in Table 1. Overall, the RGC-ADV model effectively learns the textual feature information of pests and diseases and has good recognition performance. The F1 scores for six entity classes (pests and diseases, alternative names, pathogens, distribution areas, affected crops, and control agents) are all greater than 89%. This may be because the descriptions of these entity classes are relatively simple and their data features are obvious. Recognition of affected parts was poorer, with an F1 score of 78.29%. The study found that descriptions of the same crop part varied across the pest and disease texts. For example, words such as "stem" and "stem base" all described the same part, making it difficult to distinguish the boundaries of the affected parts. Recognition of the disease onset period was also poorer, with an F1 score of 81.21%. This is related to the small number of disease onset period samples and to the fact that the boundaries of the affected parts are relatively more difficult to distinguish, making it difficult for the model to learn fully.
[0081] Table 1 Results of Crop Pest and Disease Category Identification
[0082]
[0083] To verify the superiority of the RGC-ADV model proposed in this invention for entity recognition in the field of crop diseases and pests, a variety of models were used for comparative analysis, including two types of comparative experiments: different embedding methods and different downstream model structures under the same embedding method. The selected models included BiGRU-CRF, BERT-BiGRU-CRF, ALBERT-BiGRU-CRF, RoBERTa-wwm-BiGRU-CRF, and RoBERTa-wwm-CRF. Specific results are shown in Table 2.
[0084] Table 2. Comparison of experimental results for different entity recognition models.
[0085]
[0086]
[0087] With the same dataset and downstream model, the RoBERTa-wwm embedding method identifies crop pest and disease entities more accurately than ALBERT and BERT. Relative to BiGRU-CRF, BERT-BiGRU-CRF, and ALBERT-BiGRU-CRF, RGC-ADV improves precision by 9.15, 1.57, and 3.39 percentage points, recall by 9.76, 0.33, and 5.36 percentage points, and F1 score by 9.48, 0.97, and 4.40 percentage points, respectively. This indicates that RoBERTa-wwm embedding helps improve the model's semantic representation of text, thereby improving performance on the pest and disease recognition task. With adversarial training introduced, RGC-ADV improves precision by 0.67 percentage points, recall by 1.24 percentage points, and F1 score by 0.95 percentage points. This improvement shows that adversarial training helps the model adapt better to the pest and disease domain. In addition, compared with the RoBERTa-wwm-CRF model, RGC-ADV improves precision, recall, and F1 score by 4.2, 4.19, and 4.19 percentage points, respectively.
[0088] Overall, the entity extraction performance of the method proposed in this invention is superior to other methods, with a precision of 89.23%, a recall of 90.90%, and an F1 score of 90.04%, indicating that it has good adaptability in entity recognition tasks in the field of crop diseases and pests and can effectively abstract and model text data related to crop diseases and pests.
[0089] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0090] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each process and/or block of the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0091] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0092] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0093] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention in any other way. Any person skilled in the art may make changes or modifications to the above-disclosed technical content to create equivalent embodiments. However, any simple modifications, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the protection scope of the present invention.
Claims
1. A crop disease and pest named entity recognition method fusing RoBERTa-wwm and adversarial training, characterized in that it comprises:
obtaining data from different data sources to construct a raw corpus of crop diseases and pests; preprocessing the raw corpus to remove redundant data, yielding a standardized corpus of crop diseases and pests; labeling the entity data, and dividing the labeled data into a training set, a validation set, and a test set according to a set ratio;
constructing a crop pest and disease named entity recognition model, RGC-ADV, that integrates RoBERTa-wwm and adversarial training, wherein the RGC-ADV model extracts word-level semantic information from pest and disease texts through the RoBERTa-wwm model to obtain dynamic word vectors, thereby solving the problem of incomplete word recognition; adversarial training is used to solve the problem that pest and disease entities with ambiguous boundaries are difficult to define; and Focal Loss is introduced to mitigate the label imbalance of pest and disease samples;
training and testing the constructed RGC-ADV model with the obtained training set, validation set, and test set to obtain a trained RGC-ADV model; and
performing named entity recognition of crop diseases and pests with the trained RGC-ADV model.
The RGC-ADV model comprises seven network layers: an input layer, a RoBERTa-wwm layer, an adversarial training layer, a BiGRU layer, a fully connected layer, a CRF layer, and an output layer. The input text is pre-trained in the RoBERTa-wwm layer to obtain word-level semantic information, and each word in the pest and disease text is converted into a feature vector. The adversarial training layer adds perturbations to the feature vectors output by the RoBERTa-wwm layer to generate adversarial examples and improve the robustness of the model to input perturbations. The original feature vectors and the adversarial examples are fed into the BiGRU layer for training so that the relationships between contexts are fully learned. The features extracted by the BiGRU layer are integrated by the fully connected layer and then input into the CRF layer to obtain the final prediction result. At the same time, Focal Loss is introduced to mitigate the label imbalance of pest and disease samples.
The RoBERTa-wwm layer is implemented based on the RoBERTa-wwm model, which employs a Chinese whole-word masking strategy. First, the pest and disease corpus text is segmented into words, and then the words are masked: all the Chinese characters that make up the same word are masked together, and the masked words are then predicted. Finally, dynamic word vectors with word-level features are generated, which makes them more suitable for Chinese crop pest and disease named entity recognition tasks. The RGC-ADV model uses the RoBERTa-wwm model as the pre-trained model for pest and disease named entity recognition to extract text features. During training, the model parameters are fine-tuned so that the model better learns the semantic features of pest and disease data.
The RoBERTa-wwm model consists of 12 Transformer layers. Each Transformer employs a multi-head attention mechanism to reduce the distance between any two words in the input pest and disease sequence to a constant. Assuming the input is a sequence of characters in which each element denotes one character of the pest and disease text data, RoBERTa-wwm yields the vector corresponding to each character; E represents the word vector after Transformer encoding of the nth word in the pest and disease text data. Vector E contains the pest and disease semantic information obtained by RoBERTa-wwm during the pre-training stage.
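The whole-word masking strategy described in claim 1 can be sketched as follows. This is an illustrative simplification, not the RoBERTa-wwm pre-training procedure itself: it assumes the text has already been segmented into words by some external tool, it always substitutes a `[MASK]` token (real pre-training typically mixes in random and unchanged tokens), and the function name, the 15% default masking rate, and the example words are assumptions.

```python
import random
from typing import List

MASK = "[MASK]"


def whole_word_mask(words: List[str], mask_prob: float = 0.15, seed: int = 0) -> List[str]:
    """Return a character sequence in which selected words are masked as a unit.

    `words` is an already-segmented sentence. When a word is selected, every
    character belonging to it is replaced, never just some of them.
    """
    rng = random.Random(seed)
    chars: List[str] = []
    for word in words:
        if rng.random() < mask_prob:
            chars.extend([MASK] * len(word))  # mask all characters of this word
        else:
            chars.extend(list(word))          # keep the word's characters intact
    return chars


if __name__ == "__main__":
    # Hypothetical segmented sentence: rice / rice blast / damage / leaf
    sentence = ["水稻", "稻瘟病", "危害", "叶片"]
    print(whole_word_mask(sentence, mask_prob=0.5, seed=1))
```

Character-level masking could hide only one character of a three-character disease name and leave the rest visible. Masking the word as a unit forces the model to predict the whole term from context, which is the word-level signal the claim relies on.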
2. The method for named entity recognition of crop diseases and pests integrating RoBERTa-wwm and adversarial training according to claim 1, characterized in that the adversarial training layer adds perturbations to the model, specifically by adding perturbations to the original input pest and disease samples to obtain adversarial samples. These adversarial samples are then input into the model for training to prevent noise from pest and disease privacy information and improve the model's generalization ability. The RGC-ADV model introduces Projected Gradient Descent (PGD) for iterative attacks, keeping the perturbation within a specified range S at each iteration. Once the perturbation value exceeds the specified range S, the adversarial sample is projected back into the specified range. The iterative process is shown in formula (1), whose terms denote: the magnitude of the perturbation at each PGD iteration; the projection operation; and the adversarial examples generated at iterations t and t+1. The RGC-ADV model adds perturbations to the initial pest and disease vector representation output by the RoBERTa-wwm layer to generate pest and disease adversarial samples. The original representation and the adversarial samples are then used together as input to the BiGRU layer for training.
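The PGD iteration in claim 2 can be sketched as below. Formula (1) itself is not legible in this text, so the sketch uses a common variant from the adversarial-training literature: a normalized gradient-ascent step on the loss, followed by projection of the accumulated perturbation onto an L2 ball around the original embedding. The choice of an L2 ball for the range S, the parameter names `alpha`, `epsilon`, and `steps`, and the toy gradient function are all assumptions, not the patent's stated formula.

```python
import math
from typing import Callable, List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(c * c for c in v))


def pgd_perturb(
    x: Vector,
    grad_fn: Callable[[Vector], Vector],
    alpha: float = 0.3,    # assumed step size of each iteration
    epsilon: float = 1.0,  # assumed radius of the allowed range S
    steps: int = 3,
) -> Vector:
    """Return an adversarial version of embedding `x` after `steps` PGD iterations."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = grad_fn(x_adv)  # gradient of the loss w.r.t. the perturbed embedding
        g_norm = l2_norm(grad)
        if g_norm == 0.0:
            break
        # Ascent step along the normalized gradient (increases the loss).
        delta = [di + alpha * gi / g_norm for di, gi in zip(delta, grad)]
        # Projection: if the perturbation leaves the ball, scale it back to the surface.
        d_norm = l2_norm(delta)
        if d_norm > epsilon:
            delta = [di * epsilon / d_norm for di in delta]
    return [xi + di for xi, di in zip(x, delta)]


def toy_grad(v: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * ||v||^2, standing in for a real model's gradient."""
    return list(v)


if __name__ == "__main__":
    original = [1.0, 0.0]
    adversarial = pgd_perturb(original, toy_grad, alpha=0.5, epsilon=1.0, steps=3)
    # As the claim describes, both vectors would then be passed to the BiGRU layer.
    print(original, adversarial)
```

In a real model the gradient would come from backpropagating the training loss to the RoBERTa-wwm output, and the perturbation would be applied per token embedding. The toy gradient only shows the step-then-project structure.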
3. The method for named entity recognition of crop diseases and pests integrating RoBERTa-wwm and adversarial training according to claim 1, characterized in that the BiGRU layer is composed of a forward GRU and a backward GRU; each GRU consists of an update gate and a reset gate, where the update gate determines the extent to which information from the previous moment enters the current moment, and the reset gate controls how much information is forgotten. The calculation process of the GRU network is shown in formulas (2) to (5), whose terms denote: the pest and disease information input at time t; the Sigmoid activation function; the output vectors of the hidden layer at times t and t-1; the candidate state at the current time; the input weight matrices; the Hadamard product, written *; and the activation function. The BiGRU layer uses a bidirectional gated recurrent unit (BiGRU) to train on the pest and disease text vectors output by the adversarial training layer, captures the semantic dependencies of the pest and disease context, and acquires and uses forward and backward data features to improve prediction accuracy.
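The gate computations named in claim 3 can be sketched with one widely used GRU formulation, in which the update gate and reset gate are Sigmoid functions of the concatenated previous hidden state and current input, the candidate state uses tanh, and the new hidden state interpolates between the previous state and the candidate. Formulas (2) to (5) are not legible in this text, so this particular form, the omission of bias terms, the weight names `w_z`, `w_r`, `w_h`, and the function names are assumptions. The patent's formulas may differ in detail, for example in which side of the interpolation the update gate weights.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Weights = Tuple[Matrix, Matrix, Matrix]  # (w_z, w_r, w_h), each of shape hidden x (hidden + input)


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * c for w, c in zip(row, v)) for row in m]


def gru_step(x_t: Vector, h_prev: Vector, w_z: Matrix, w_r: Matrix, w_h: Matrix) -> Vector:
    """One GRU time step (bias terms omitted for brevity)."""
    concat = h_prev + x_t
    z = [sigmoid(a) for a in matvec(w_z, concat)]            # update gate
    r = [sigmoid(a) for a in matvec(w_r, concat)]            # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t     # Hadamard product r * h_prev
    h_cand = [math.tanh(a) for a in matvec(w_h, gated)]      # candidate state
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def bigru(xs: Sequence[Vector], h0: Vector, fwd: Weights, bwd: Weights) -> List[Vector]:
    """Run a forward and a backward GRU and concatenate their states per time step."""
    h, forward = h0, []
    for x in xs:
        h = gru_step(x, h, *fwd)
        forward.append(h)
    h, backward = h0, []
    for x in reversed(xs):
        h = gru_step(x, h, *bwd)
        backward.append(h)
    backward.reverse()  # realign backward states with the original time order
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    zeros: Weights = ([[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]])
    print(bigru([[0.3], [0.1], [0.7]], [1.0], zeros, zeros))
```

Each output vector combines the forward state, which has seen the tokens to the left, with the backward state, which has seen the tokens to the right. This is how the layer exposes both directions of context to the layers that follow.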
4. The method for named entity recognition of crop diseases and pests integrating RoBERTa-wwm and adversarial training according to claim 1, characterized in that the fully connected layer is used to map the learned distributed feature representations to the sample label space to achieve sample classification; the RGC-ADV model integrates the pest and disease sample features output by the BiGRU layer through the fully connected layer to weaken the influence of positional features on the classification results, thereby improving the classification of pest and disease samples.