A named entity recognition method based on a pre-trained model and a progressive convolutional network
By combining a pre-trained language model with a progressive convolutional network, the method addresses the insufficient mining of information from pre-trained models, improves named entity recognition performance, and is applicable to various deep pre-trained language models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHONGKE HEFEI INST OF COLLABORATIVE RES & INNOVATION FOR INTELLIGENT AGRI
- Filing Date
- 2022-10-24
- Publication Date
- 2026-06-19
AI Technical Summary
Existing named entity recognition methods based on pre-trained language models have failed to fully exploit the language features learned by the pre-trained models on large-scale corpora, resulting in insufficient named entity recognition performance.
We employ a method based on pre-trained language models and progressive convolutional networks. By encoding natural language and using progressive convolutional network modules to fuse the encoding of the pre-trained language model from low to high layers, we combine it with a CRF model for decoding to achieve named entity recognition.
It improves the performance of named entity recognition, enhances the ability to represent natural language, reduces the complexity of introducing external knowledge, and is applicable to various deep pre-trained language models.
[Figure CN115906843B_ABST: abstract drawing]
Description
Technical Field
[0001] This invention relates to the field of natural language processing, and in particular to a named entity recognition method based on a pre-trained model and a progressive convolutional network.
Background Technology
[0002] Named entity recognition (NER) is a subtask of information extraction, aiming to locate and classify named entities in text into predefined categories, such as people, organizations, locations, time expressions, quantities, currency values, percentages, etc. Named entities in natural language, such as personal names, place names, and organization names, often function as subjects or objects, directly determining the semantics of the natural language. Therefore, the quality of NER directly impacts the effectiveness of downstream tasks such as information extraction, question answering systems, and machine translation.
[0003] Named entity recognition (NER) methods can be categorized at the model level into rule-based methods, unsupervised learning methods, and supervised learning methods. Early dictionary- and rule-based NER methods required the construction of appropriate dictionaries and rules. Supervised NER methods based on machine learning and deep learning primarily include Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and Long Short-Term Memory (LSTM) networks. These supervised methods treat the NER task as a sequence labeling problem, performing supervised learning on a given dataset to give the model the ability to recognize named entities.
[0004] However, due to the scarcity of labeled data, researchers have begun to explore unsupervised learning, with pre-trained language models obtained through unsupervised learning on large-scale corpora being increasingly proposed in recent years. Specifically, existing methods first subject a large, multi-layered deep language model to self-supervised learning on a large amount of unlabeled text data, enabling the model to learn the relationships between words in sentences. This allows the pre-trained language model to obtain an effective distributed representation of natural language. Then, the model is fine-tuned on different downstream tasks using various datasets. This "pre-training + fine-tuning" paradigm differs from traditional machine learning and deep learning methods that rely on supervised learning on datasets. It utilizes the pre-trained language model obtained through unsupervised learning to represent natural language more effectively, thereby enhancing the model's named entity recognition capabilities.
[0005] Existing methods that obtain distributed representations of natural language from a pre-trained language model such as BERT primarily use only the encoding of the last layer. This approach is relatively coarse-grained and does not fully leverage the diverse linguistic features learned from large-scale corpora. Furthermore, research combining pre-training and fine-tuning in named entity recognition is currently limited.
Summary of the Invention
[0006] To address the shortcomings of insufficient information mining from pre-trained language models in named entity recognition, this invention aims to provide a named entity recognition method based on pre-trained language models and progressive convolutional networks that can improve the performance of named entity recognition tasks.
[0007] To achieve the above objectives, the present invention adopts the following technical solution: a named entity recognition method based on a pre-trained language model and a progressive convolutional network, the method comprising the following sequential steps:
[0008] (1) Based on the pre-trained language model, natural language is encoded to obtain the representation set LS;
[0009] (2) The representation set LS is input into a progressive convolutional network module, which progressively fuses the encodings of adjacent layers from low to high levels to obtain an aggregated distributed representation $AR_c$ that integrates the characteristics of all layers of the pre-trained language model;
[0010] (3) The CRF (Conditional Random Field) model is used to decode the aggregated distributed representation $AR_c$, thereby achieving named entity recognition.
[0011] Step (1) specifically includes the following steps:
[0012] (1a) Generate the input embeddings of the pre-trained language model. The pre-trained language model requires three types of input data during training:
[0013] The word segmentation ID, word_ids, represents the ID of each word in the natural sentence S;
[0014] The sentence IDs, segment_ids, represent the distinction between two input natural sentences S;
[0015] Position encoding represents the location of each word in the natural sentence S;
[0016] The pre-trained language model concatenates the three data types of each word in the natural language, i.e., the natural sentence S, to obtain the final encoded vector, which is used as the input embedding of the natural sentence S.
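Step (1a) can be illustrated with a minimal sketch. It assumes a toy character vocabulary and tiny hand-written embedding tables; the names `build_inputs`, `embed`, and `input_embedding` are hypothetical and do not come from the patent. The sketch follows the wording above and concatenates the three per-token vectors; note that the original BERT implementation sums the three embeddings element-wise instead.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]

def build_inputs(sentence: str, vocab: Dict[str, int],
                 unk_id: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Return (word_ids, segment_ids, position_ids) for one sentence S."""
    word_ids = [vocab.get(ch, unk_id) for ch in sentence]  # one ID per character
    segment_ids = [0] * len(word_ids)                      # single sentence: all zeros
    position_ids = list(range(len(word_ids)))              # 0-based positions
    return word_ids, segment_ids, position_ids

def embed(ids: List[int], table: Matrix) -> Matrix:
    """Look up one embedding row per ID."""
    return [table[i] for i in ids]

def input_embedding(sentence: str, vocab: Dict[str, int], word_table: Matrix,
                    segment_table: Matrix, position_table: Matrix) -> Matrix:
    """Concatenate the word, segment, and position vectors of every token."""
    word_ids, segment_ids, position_ids = build_inputs(sentence, vocab)
    w = embed(word_ids, word_table)
    s = embed(segment_ids, segment_table)
    p = embed(position_ids, position_table)
    return [wi + si + pi for wi, si, pi in zip(w, s, p)]  # list concatenation per token
```

Each row of the returned matrix is the input embedding of one character of S.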
[0017] (1b) Transform the input embedding into a full-text enhanced semantic representation: Input the input embedding of the natural sentence S into the pre-trained language model. The pre-trained language model extracts the context information of each character in the natural sentence S, and finally obtains the representation set LS of the natural sentence S, as shown in Equation (1):
[0018] $$LS = PM(S) = \{L_1, L_2, \ldots, L_l\} \qquad (1)$$
[0019] where PM denotes the pre-trained model; $L_i \in \mathbb{R}^{n \times h}$, $i \in \{1, 2, \ldots, l\}$, is the sentence representation of the natural sentence S encoded by the i-th layer of the pre-trained language model; $l$ is the depth of the pre-trained language model; and $n$ and $h$ are the length of the natural sentence S and the hidden layer dimension of the pre-trained language model, respectively.
[0020] Step (2) specifically includes the following steps:
[0021] (2a) Design a progressive convolutional network module. This module is a multi-layer deep structure with a depth c of l-1, and every layer has the same structure, consisting of three parts: layer concatenation, a convolutional layer, and normalization. This three-part operation is repeated until the encoding of the last layer of the pre-trained language model has been fused.
[0022] (2b) The representation set LS is input to the layer concatenation of the progressive convolutional network module. Layer concatenation stacks the fused output of the previous layer, $AR_{i-1} \in \mathbb{R}^{n \times h}$, with the sentence representation of the current layer, $L_i \in \mathbb{R}^{n \times h}$, to obtain the multi-dimensional hybrid representation $MR_i \in \mathbb{R}^{2 \times n \times h}$, as shown in equation (2):
[0023] $$MR_i = \mathrm{concat}(AR_{i-1}, L_i) = [AR_{i-1}; L_i] \qquad (2)$$
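The layer concatenation of step (2b) amounts to stacking two n×h matrices along a new leading axis. A minimal sketch, assuming matrices are stored as nested Python lists (the function name `concat_layers` is hypothetical):

```python
from typing import List

Matrix = List[List[float]]

def concat_layers(ar_prev: Matrix, l_cur: Matrix) -> List[Matrix]:
    """Stack AR_{i-1} and L_i (both n x h) into MR_i with shape 2 x n x h."""
    n, h = len(ar_prev), len(ar_prev[0])
    same_shape = (
        len(l_cur) == n
        and all(len(row) == h for row in ar_prev)
        and all(len(row) == h for row in l_cur)
    )
    if not same_shape:
        raise ValueError("AR_{i-1} and L_i must both have shape n x h")
    # Copy the rows so that later in-place edits do not alias the inputs.
    return [[row[:] for row in ar_prev], [row[:] for row in l_cur]]
```

The two stacked matrices act as the two input channels of the convolution in step (2c).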
[0024] (2c) The convolutional layer fuses the fused output of the previous layer, $AR_{i-1}$, with the sentence representation of the current layer, $L_i$. That is, for the c-th convolutional layer of the progressive convolutional network module, the input is $MR_i$ and the convolution kernel is $k_c \in \mathbb{R}^{2 \times w \times b \times 1}$, where $w$ and $b$ are the length and width of the convolution kernel, respectively. The calculation process of the convolutional layer is shown in equation (3):
[0025]
[0026]
[0027] In the formula, $E_c \in \mathbb{R}^{w \times b}$ is the output of the current convolutional layer; $LR_{x,y}$ is the element in row x and column y of the sentence representation of the current layer; $E_{c\,i,j}$ is the element in row i and column j of $E_c$; and $AR_{x,y}$ is the element in row x and column y of the matrix AR;
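The two-channel convolution of step (2c) can be sketched as follows. This is not the patent's equation (3), which is not reproduced in this text. It is a hedged illustration that assumes stride 1, zero padding chosen so that the output keeps the n×h shape of the inputs (the description states that fusion leaves the embedding size unchanged), an odd kernel size, and no bias term. It uses the cross-correlation form common in deep-learning libraries; `fuse_conv` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]

def fuse_conv(mr: List[Matrix], kernel: List[Matrix]) -> Matrix:
    """Fuse a 2 x n x h input with a 2 x w x b kernel into one n x h matrix.

    Channel 0 of mr holds AR_{i-1}; channel 1 holds L_i.
    Assumptions: stride 1, zero 'same' padding, odd w and b, no bias.
    """
    n, h = len(mr[0]), len(mr[0][0])
    w, b = len(kernel[0]), len(kernel[0][0])
    pad_r, pad_c = w // 2, b // 2
    out = [[0.0] * h for _ in range(n)]
    for i in range(n):
        for j in range(h):
            acc = 0.0
            for ch in range(2):
                for x in range(w):
                    for y in range(b):
                        r, c = i + x - pad_r, j + y - pad_c
                        if 0 <= r < n and 0 <= c < h:  # outside the matrix counts as zero
                            acc += kernel[ch][x][y] * mr[ch][r][c]
            out[i][j] = acc
    return out
```

With a kernel whose only non-zero entries are a 1 at the centre of each channel, the output is simply the element-wise sum of the two inputs, which is a convenient sanity check.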
[0028] (2d) Finally, the normalization layer keeps the sentence representation after convolution in the current layer consistent in magnitude with that before convolution, which benefits the fusion in the next layer. That is, given $E_c$, its normalized value $AR_c$ is calculated from equation (4):
[0029]
[0030] where $g_i$ and $b$ are trainable parameters. As can be seen from equation (4), this normalization operates along the same hidden layer dimension of the sentence embedding. The normalized value $AR_c$ is the aggregated distributed representation, where c = l-1, $n$ is the sentence length, $u_i$ is the mean of the i-th column of $E_c$, and $\sigma_i$ is the variance of the i-th column of $E_c$.
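The normalization of step (2d) can be sketched from the definitions above: each column i of $E_c$ is centred by its mean and scaled by its variance over the n rows, then rescaled by a gain and shifted by a bias. The exact form of equation (4) is not reproduced in this text, so the small constant `eps` and the per-column bias vector are assumptions, and `normalize_columns` is a hypothetical name.

```python
import math
from typing import List

Matrix = List[List[float]]

def normalize_columns(e: Matrix, gain: List[float], bias: List[float],
                      eps: float = 1e-5) -> Matrix:
    """Normalize each column of an n x h matrix to zero mean and unit variance.

    gain[i] plays the role of g_i; bias[i] plays the role of b (assumed per column).
    eps is an assumed stabilizer that avoids division by zero.
    """
    n, h = len(e), len(e[0])
    out = [[0.0] * h for _ in range(n)]
    for i in range(h):
        col = [e[r][i] for r in range(n)]
        mean = sum(col) / n                            # u_i
        var = sum((v - mean) ** 2 for v in col) / n    # sigma_i (variance)
        scale = gain[i] / math.sqrt(var + eps)
        for r in range(n):
            out[r][i] = scale * (e[r][i] - mean) + bias[i]
    return out
```

Because every column ends up with the same scale, the output of one fusion layer is comparable in magnitude to the next encoder layer it is concatenated with.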
[0031] Step (3) specifically includes the following steps:
[0032] (3a) Obtain the probability matrix P: the aggregated distributed representation $AR_c$ is input to the CRF model and transformed by its fully connected layer into a probability matrix $P \in \mathbb{R}^{n \times k}$, where $k$ is the number of predicted label categories; the probability matrix P is learned as a learnable parameter in the CRF model;
[0033] (3b) Obtain the transition matrix A: the CRF model randomly initializes the label transition matrix A, and during training this matrix is continuously adjusted through backpropagation, finally yielding a transition matrix A that constrains the order of the labels;
[0034] (3d) Select the path with the highest score: the probability matrix P and the transition matrix A jointly determine the label chosen at each position, and the path with the highest score is taken as the final result. During the training phase, dynamic programming is used to obtain the total score of all paths by dividing the calculation into sub-problems: to calculate the score of all paths, i.e., the score at the END point, the column before the END point is calculated first, and so on. For the input sentence sequence $E = \{e_1, e_2, \ldots, e_n\}$, where n is the length of the natural sentence S, the corresponding label sequence is $y = \{y_1, y_2, \ldots, y_n\}$, with $y_i \in Y$, where Y is the set of named entity categories. The probability of a label sequence is calculated as:
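[0035] $$s(E, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=2}^{n} A_{y_{i-1}, y_i}, \qquad p(y \mid E) = \frac{\exp(s(E, y))}{\sum_{y' \in Y_E} \exp(s(E, y'))} \qquad (5)$$
[0036] where $s(E, y)$ is the score function; $P_{i, y_i}$ is the probability that the i-th position in the label sequence is $y_i$; A is the state transition matrix, and $A_{y_{i-1}, y_i}$ is the probability of transitioning from label $y_{i-1}$ to label $y_i$; $Y_E$ is the set of all possible label sequences for E; and $p(y \mid E)$ is the conditional probability of y given E.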
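[0037] During the training phase, the probability matrix P and the state transition matrix A are learned using maximum likelihood estimation, with $Y_E$ denoting the set of all possible label sequences for E:
[0038] $$\log p(y \mid E) = s(E, y) - \log \sum_{y' \in Y_E} \exp(s(E, y')) \qquad (6)$$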
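The dynamic programming used during training can be sketched as the standard forward algorithm of a linear-chain CRF. The sketch assumes that the path score is the sum of per-position scores from P and transition scores from A, and it omits any START and END transitions for brevity; the function names are hypothetical and this is not presented as the patent's exact formulation.

```python
import math
from typing import List

Matrix = List[List[float]]

def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(P: Matrix, A: Matrix, y: List[int]) -> float:
    """Score of one label path: emission scores plus transition scores."""
    score = P[0][y[0]]
    for i in range(1, len(y)):
        score += A[y[i - 1]][y[i]] + P[i][y[i]]
    return score

def log_partition(P: Matrix, A: Matrix) -> float:
    """Log of the summed exp-scores of all k**n paths, computed column by column."""
    k = len(P[0])
    alpha = list(P[0])  # alpha[c]: log-sum of all partial paths ending in label c
    for i in range(1, len(P)):
        alpha = [
            log_sum_exp([alpha[p] + A[p][c] for p in range(k)]) + P[i][c]
            for c in range(k)
        ]
    return log_sum_exp(alpha)

def neg_log_likelihood(P: Matrix, A: Matrix, y: List[int]) -> float:
    """Training loss for one sentence: -log p(y | E)."""
    return log_partition(P, A) - path_score(P, A, y)
```

The forward recursion costs O(n·k²) instead of enumerating all k^n paths, which is the sub-problem decomposition described above.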
[0039] During prediction, the path with the highest score among all paths is selected as the prediction result. After the matrices P and A have been learned, the Viterbi algorithm is used in the prediction stage to find the label sequence $y^*$ that maximizes the conditional probability $p(y \mid E)$, which is used as the label sequence for the input. Here $Y_E$ denotes the set of all possible label sequences for the input sentence sequence E, and $y^*$ is the decoded result, as shown in equation (7):
[0040] $$y^* = \arg\max_{y \in Y_E} p(y \mid E) \qquad (7)$$
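Viterbi decoding for the prediction stage can be sketched as below, under the same assumptions as a plain linear-chain CRF: additive emission scores P and transition scores A, with no START or END transitions. The function name `viterbi` is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the highest-scoring label path and its score.

    P is n x k (position by label); A is k x k (previous label by next label).
    """
    n, k = len(P), len(P[0])
    score = list(P[0])            # best score of any path ending in each label
    back: List[List[int]] = []    # back[i-1][c]: best previous label for label c at position i
    for i in range(1, n):
        new_score, ptr = [], []
        for c in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + A[p][c])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + A[best_prev][c] + P[i][c])
        score = new_score
        back.append(ptr)
    last = max(range(k), key=lambda c: score[c])
    path = [last]
    for ptr in reversed(back):    # follow the back-pointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

Because the normalizing sum in p(y|E) is the same for every path, maximizing the path score is equivalent to maximizing the conditional probability.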
[0041] As can be seen from the above technical solution, the beneficial effects of the present invention are as follows. First, the present invention uses a pre-trained language model to encode natural language. The pre-trained language model can learn the context of the language, and as pre-trained language models adopt increasingly deep architectures, the information contained in the encoding becomes increasingly rich, which benefits downstream tasks. Second, when enhancing the encoded information of natural language, the invention does not need to introduce external knowledge or extra operations to enhance entity information, as existing methods do to improve named entity recognition accuracy. Instead, it reuses the results of computation already performed: the progressive convolutional network module proposed in this invention extracts the full-layer representation of the pre-trained language model, overcoming the insufficient mining of information from the pre-trained language model and reducing the complexity of introducing external data for computation. Third, the present invention can be easily transferred to various deep pre-trained language models without complex model alignment operations. It only requires adding a convolutional module after the pre-trained language model for convolutional extraction, and is applicable to various existing large-scale Transformer-based pre-trained language models.
Attached Figure Description
[0042] Figure 1 is a flowchart of the method of the present invention;
[0043] Figure 2 is a diagram of the representation layer structure of a pre-trained language model;
[0044] Figure 3 is a diagram illustrating the overall framework of the pre-trained language model of this invention;
[0045] Figure 4 is a diagram of the Encoder structure of the Transformer;
[0046] Figure 5 is a schematic diagram of the BERT model structure.
Detailed Implementation
[0047] As shown in Figure 1, a named entity recognition method based on a pre-trained language model and a progressive convolutional network includes the following sequential steps:
[0048] (1) Based on the pre-trained language model, natural language is encoded to obtain the representation set LS;
[0049] (2) The representation set LS is input into a progressive convolutional network module, which progressively fuses the encodings of adjacent layers from low to high levels to obtain an aggregated distributed representation $AR_c$ that integrates the characteristics of all layers of the pre-trained language model;
[0050] (3) The CRF (Conditional Random Field) model is used to decode the aggregated distributed representation $AR_c$, thereby achieving named entity recognition.
[0051] Step (1) specifically includes the following steps:
[0052] (1a) Generate the input embeddings of the pre-trained language model. The pre-trained language model requires three types of input data during training:
[0053] The word segmentation ID, word_ids, represents the ID of each word in the natural sentence S;
[0054] The sentence IDs, segment_ids, represent the distinction between two input natural sentences S;
[0055] Position encoding represents the location of each word in the natural sentence S;
[0056] The pre-trained language model concatenates the three data types of each word in the natural language, i.e., the natural sentence S, to obtain the final encoded vector, which is used as the input embedding of the natural sentence S.
[0057] (1b) Transform the input embedding into a full-text enhanced semantic representation: Input the input embedding of the natural sentence S into the pre-trained language model. The pre-trained language model extracts the context information of each character in the natural sentence S, and finally obtains the representation set LS of the natural sentence S, as shown in Equation (1):
[0058] $$LS = PM(S) = \{L_1, L_2, \ldots, L_l\} \qquad (1)$$
[0059] where $L_i \in \mathbb{R}^{n \times h}$, $i \in \{1, 2, \ldots, l\}$, is the sentence representation of the natural sentence S encoded by the i-th layer of the pre-trained language model; $l$ is the depth of the pre-trained language model (e.g., BERT-base has 12 layers); and $n$ and $h$ are the length of the natural sentence S and the hidden layer dimension of the pre-trained language model, respectively. S is an abbreviation for sentence, and PM is an abbreviation for pre-trained model.
[0060] The encoder consists of multiple stacked Transformer layers, as shown in Figure 2. The natural sentence S is input into the encoder, which extracts the context before and after each character in the sentence and ultimately yields the sentence representation set LS of formula (1). The Transformer is an Encoder-Decoder model; BERT, as a natural language encoder, uses the Encoder part of the Transformer. A single Transformer layer consists of a multi-head self-attention mechanism and a feedforward network, as shown in Figure 4. Multi-head self-attention builds on single-head self-attention to obtain representations in different spaces.
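For orientation, single-head scaled dot-product self-attention, the building block that multi-head attention repeats with different projections before concatenating the results, can be sketched as follows. The projection matrices `wq`, `wk`, `wv` and all function names are illustrative assumptions; the sketch omits masking, residual connections, and the feedforward network.

```python
import math
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x p) and b (p x q)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """One attention head over an n x h input; returns an n x d output."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    # Similarity of every position with every other position, scaled by sqrt(d).
    scores = [[sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d) for kr in k] for qr in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)  # each output row is a weighted mix of all value rows
```

Every output row mixes information from all positions of the sentence, which is how each layer's $L_i$ comes to encode context on both sides of a character.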
[0061] Step (2) specifically includes the following steps:
[0062] (2a) Design a progressive convolutional network module. This module is a multi-layer deep structure with a depth c of l-1, and every layer has the same structure, consisting of three parts: layer concatenation, a convolutional layer, and normalization. This three-part operation is repeated until the encoding of the last layer of the pre-trained language model has been fused.
[0063] By using layer connections and convolutional layers, the size of the sentence embeddings before and after fusion can be guaranteed to remain unchanged. Furthermore, compared to before fusion, the fused sentence embeddings extract features from the current layer. For a specific position in a sequence, convolutional operations allow it to learn the characteristics of its contextual representation, and the range of the learned context is affected by the size of the convolutional kernel. For named entity recognition, named entities are character sequences with a certain span; learning the representations of other characters in a named entity for a specific character should be helpful for the entity recognition task.
[0064] This invention uses convolution to fuse the distributed representations of adjacent layers, progressively aggregating the distributed representations of all layers of the pre-trained language model from low to high levels. The progressive convolutional network designed in this invention has a depth of l-1, with each layer having the same structure, as shown in Figure 3. Instead of simply taking the last layer of the pre-trained language model as the sentence representation, a convolutional network extracts multi-dimensional information from the representation set of the pre-trained language model to obtain an enhanced sentence representation and achieve better entity recognition.
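The low-to-high aggregation can be sketched as a simple loop over the representation set. The sketch assumes that the first fusion combines $L_1$ with $L_2$ (that is, the running representation is initialized to $L_1$), which is one reading consistent with a depth of l-1; the `fuse` callback stands in for the concatenation, convolution, and normalization of one layer, and both names are hypothetical.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]

def progressive_fuse(ls: List[Matrix],
                     fuse: Callable[[Matrix, Matrix], Matrix]) -> Tuple[Matrix, int]:
    """Fold the l layer outputs into one aggregated representation.

    Returns the final representation and the number of fusion steps (l - 1).
    """
    if not ls:
        raise ValueError("the representation set LS must not be empty")
    ar = ls[0]                 # assumption: start from the lowest layer L_1
    steps = 0
    for l_cur in ls[1:]:       # fuse upwards, one encoder layer at a time
        ar = fuse(ar, l_cur)
        steps += 1
    return ar, steps

def add_fuse(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in fusion used only for illustration: element-wise sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

For a 12-layer encoder such as BERT-base the loop runs 11 times, matching the stated depth of l-1.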
[0065] (2b) The representation set LS is input to the layer concatenation of the progressive convolutional network module. Layer concatenation stacks the fused output of the previous layer, $AR_{i-1} \in \mathbb{R}^{n \times h}$, with the sentence representation of the current layer, $L_i \in \mathbb{R}^{n \times h}$, to obtain the multi-dimensional hybrid representation $MR_i \in \mathbb{R}^{2 \times n \times h}$, as shown in equation (2):
[0066] $$MR_i = \mathrm{concat}(AR_{i-1}, L_i) = [AR_{i-1}; L_i] \qquad (2)$$
[0067] (2c) The convolutional layer fuses the fused output of the previous layer, $AR_{i-1}$, with the sentence representation of the current layer, $L_i$. That is, for the c-th convolutional layer of the progressive convolutional network module, the input is $MR_i$ and the convolution kernel is $k_c \in \mathbb{R}^{2 \times w \times b \times 1}$, where $w$ and $b$ are the length and width of the convolution kernel, respectively. The calculation process of the convolutional layer is shown in equation (3):
[0068]
[0069]
[0070] In the formula, $E_c \in \mathbb{R}^{w \times b}$ is the output of the current convolutional layer; $LR_{x,y}$ is the element in row x and column y of the sentence representation of the current layer; $E_{c\,i,j}$ is the element in row i and column j of $E_c$; and $AR_{x,y}$ is the element in row x and column y of the matrix AR;
[0071] (2d) Finally, the normalization layer keeps the sentence representation after convolution in the current layer consistent in magnitude with that before convolution, which benefits the fusion in the next layer. That is, given $E_c$, its normalized value $AR_c$ is calculated from equation (4):
[0072]
[0073] where $g_i$ and $b$ are trainable parameters. As can be seen from equation (4), this normalization operates along the same hidden layer dimension of the sentence embedding. The normalized value $AR_c$ is the aggregated distributed representation, where c = l-1, $n$ is the sentence length, $u_i$ is the mean of the i-th column of $E_c$, and $\sigma_i$ is the variance of the i-th column of $E_c$.
[0074] Step (3) specifically includes the following steps:
[0075] (3a) Obtain the probability matrix P: the aggregated distributed representation $AR_c$ is input to the CRF model and transformed by its fully connected layer into a probability matrix $P \in \mathbb{R}^{n \times k}$, where $k$ is the number of predicted label categories; the probability matrix P is learned as a learnable parameter in the CRF model;
[0076] (3b) Obtain the transition matrix A: the CRF model randomly initializes the label transition matrix A, and during training this matrix is continuously adjusted through backpropagation, finally yielding a transition matrix A that constrains the order of the labels;
[0077] (3d) Select the path with the highest score: the probability matrix P and the transition matrix A jointly determine the label chosen at each position, and the path with the highest score is taken as the final result. During the training phase, dynamic programming is used to obtain the total score of all paths by dividing the calculation into sub-problems: to calculate the score of all paths, i.e., the score at the END point, the column before the END point is calculated first, and so on. For the input sentence sequence $E = \{e_1, e_2, \ldots, e_n\}$, where n is the length of the natural sentence S, the corresponding label sequence is $y = \{y_1, y_2, \ldots, y_n\}$, with $y_i \in Y$, where Y is the set of named entity categories. The probability of a label sequence is calculated as:
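[0078] $$s(E, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=2}^{n} A_{y_{i-1}, y_i}, \qquad p(y \mid E) = \frac{\exp(s(E, y))}{\sum_{y' \in Y_E} \exp(s(E, y'))} \qquad (5)$$
[0079] where $s(E, y)$ is the score function; $P_{i, y_i}$ is the probability that the i-th position in the label sequence is $y_i$; A is the state transition matrix, and $A_{y_{i-1}, y_i}$ is the probability of transitioning from label $y_{i-1}$ to label $y_i$; $Y_E$ is the set of all possible label sequences for E; and $p(y \mid E)$ is the conditional probability of y given E.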
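[0080] During the training phase, the probability matrix P and the state transition matrix A are learned using maximum likelihood estimation, with $Y_E$ denoting the set of all possible label sequences for E:
[0081] $$\log p(y \mid E) = s(E, y) - \log \sum_{y' \in Y_E} \exp(s(E, y')) \qquad (6)$$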
[0082] During prediction, the path with the highest score among all paths is selected as the prediction result. After the matrices P and A have been learned, the Viterbi algorithm is used in the prediction phase to find the label sequence $y^*$ that maximizes the conditional probability $p(y \mid E)$, which is used as the label sequence for the input. Here $Y_E$ denotes the set of all possible label sequences for the input sentence sequence E, and $y^*$ is the decoded result, as shown in equation (7):
[0083] $$y^* = \arg\max_{y \in Y_E} p(y \mid E) \qquad (7)$$
Specific Implementation Example 1
[0085] The first step is to prepare the dataset and determine the evaluation metrics: The experiment selects the PeopleDaily Named Entity Recognition dataset from People's Daily and the MSRA Named Entity Recognition dataset from Microsoft Research Asia. Both the PeopleDaily and MSRA datasets classify named entities into three categories: person names, place names, and organization names. The datasets are divided into training and testing sets, and the relevant information is shown in Table 1. To evaluate the performance of the method on the named entity recognition task datasets, commonly used evaluation metrics for named entity recognition tasks are selected, including precision (microP), recall (R), and F1 score.
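The evaluation metrics can be sketched as entity-level micro-averaged precision, recall, and F1. The patent does not state the tagging scheme or the matching rule, so this sketch assumes BIO tags (for example `B-PER`, `I-PER`, `O`) and exact span-and-type matching; the function names are hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end_exclusive, type)

def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect entity spans from one BIO-tagged sentence."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def micro_prf(gold_seqs: List[List[str]],
              pred_seqs: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all sentences."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)       # exact span and type match
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Micro-averaging pools the counts across all entity types and sentences before computing the ratios, so frequent entity types weigh more than rare ones.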
[0086] The second step is to encode the natural language dataset. This invention uses Google's official BERT-base model for experiments. BERT-base is a deep language model capable of bidirectionally learning natural language representations; it is a type of pre-trained language model, and its structure is shown in Figure 5. It consists of 12 Transformer layers, i.e., the 12 Trm layers stacked vertically in the middle of Figure 5. In Figure 5, $E_i$, $i \in \{1, 2, \ldots, N\}$, represents the input sequence, and N is the length of the input sequence; $T_i$, $i \in \{1, 2, \ldots, N\}$, represents the model output; Trm represents the Encoder part of the Transformer. The selected dataset is input into the BERT-base model for encoding, resulting in the representation set LS. The model hyperparameters were determined through multiple experiments, as shown in Table 2.
[0087] Table 1
[0088]
[0089] Table 2
[0090]
Parameter | Value
Maximum sequence length | 128
Batch size | 32
Learning rate | 0.00005
Dropout rate | 0.5
Kernel size | 5*5
[0091] The third step is to obtain the aggregated distributed representation $AR_c$. The representation set LS output by the BERT-base model is subjected to progressive convolution operations. The convolution kernel size selected experimentally is 5*5, which indicates that smaller kernels are more suitable for this invention. Compared with the 5*768 convolution kernel commonly used in natural language processing tasks such as named entity recognition (which fall under text classification), where 768 is the hidden layer dimension of BERT-base, the smaller kernel achieves better results and requires less training time. The large 5*768 kernel did not significantly improve the F1 score; instead, it gave the convolutional layers many more parameters and therefore required more training time. The corresponding performance and time comparisons are shown in Table 3.
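The difference in parameter count between the two kernel sizes can be worked out directly. The sketch below counts only kernel weights, assuming two input channels, one output channel, no bias terms, and one kernel per fusion layer with 11 fusion layers for a 12-layer BERT-base (depth l-1); the function names are hypothetical.

```python
def kernel_weights(w: int, b: int, in_channels: int = 2, out_channels: int = 1) -> int:
    """Number of weights in one convolution kernel of shape in x w x b x out."""
    return in_channels * w * b * out_channels

def module_weights(w: int, b: int, encoder_layers: int = 12) -> int:
    """Total kernel weights across the l - 1 fusion layers (bias terms ignored)."""
    return (encoder_layers - 1) * kernel_weights(w, b)

small = kernel_weights(5, 5)      # 2 * 5 * 5   = 50
large = kernel_weights(5, 768)    # 2 * 5 * 768 = 7680
ratio = large / small             # 153.6
```

Under these assumptions the 5*768 kernel carries 153.6 times as many weights per layer as the 5*5 kernel (84,480 versus 550 across all 11 layers), which is consistent with the longer training time reported for the large kernel.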
[0092] Table 3
[0093]
[0094] The fourth step is to decode the aggregated distributed representation and experimentally evaluate the decoding results. The aggregated representation $AR_c$ is input to the CRF layer for decoding to obtain the tag sequence set $Y^*$. $Y^*$ is compared with the labels in the corresponding dataset, and the model evaluation metrics are calculated: precision (microP), recall (R), and F1 score. Each experiment was repeated three times, and the average was used as the experimental result to reduce the impact of random initialization of model parameters. The experimental results are shown in Table 4. The Sesame and JAM models are two multi-layer representation fusion models used for comparison. Table 4 shows that the proposed named entity recognition method based on progressive convolutional networks achieves higher F1 scores on the PeopleDaily and MSRA datasets than the other models, and its overall performance is superior to the Sesame and JAM multi-layer representation fusion models. Compared with the initial BERT-base model, the proposed method improves the F1 score by 0.51% on the PeopleDaily dataset and by 0.84% on the MSRA dataset. The experimental results demonstrate that the proposed method can enhance the model's ability to represent natural language to a certain extent and improve the model's accuracy in the named entity recognition task.
[0095] Table 4
[0096]
[0097] In summary, this invention uses a pre-trained language model to encode natural language. The pre-trained language model can learn the context of the language. As the pre-trained language model has an increasingly deep architecture, the encoded information becomes richer, which is beneficial for downstream tasks. When enhancing the encoded information of natural language, it does not need to introduce external knowledge or operations to enhance entity information and improve named entity recognition accuracy, as in existing methods. Instead, it focuses on the results obtained using the computational power already employed. The progressive convolutional network module proposed in this invention is used to extract the full-layer representation of the pre-trained language model, overcoming the deficiency of insufficient information mining from the pre-trained language model and reducing the complexity of introducing external data for computation.
Claims
1. A named entity recognition method based on a pre-trained language model and a progressive convolutional network, characterized in that the method includes the following steps in sequence: (1) based on the pre-trained language model, natural language is encoded to obtain the representation set LS; (2) the representation set LS is input into a progressive convolutional network module, which progressively fuses the encodings of adjacent layers from low to high levels to obtain an aggregated distributed representation $AR_c$ that incorporates features from all layers of the pre-trained language model; (3) the CRF model, i.e., Conditional Random Field, is used to decode the aggregated distributed representation $AR_c$ to achieve named entity recognition; Step (2) specifically includes the following steps: (2a) design a progressive convolutional network module, which is a multi-layer deep structure with a depth c of l-1, in which each layer has the same structure, consisting of three parts: layer concatenation, convolutional layer, and normalization; the three-part operation is repeated until the last layer of the pre-trained language model encoding is reached; (2b) the representation set LS is input to the layer concatenation of the progressive convolutional network module, which stacks the fused output of the previous layer, $AR_{i-1} \in \mathbb{R}^{n \times h}$, with the sentence representation of the current layer, $L_i \in \mathbb{R}^{n \times h}$, to obtain a multi-dimensional hybrid representation $MR_i \in \mathbb{R}^{2 \times n \times h}$, as shown in equation (2): $MR_i = \mathrm{concat}(AR_{i-1}, L_i) = [AR_{i-1}; L_i]$ (2); (2c) the convolutional layer fuses the fused output of the previous layer, $AR_{i-1}$, with the sentence representation of the current layer, $L_i$, that is, for the c-th convolutional layer of the progressive convolutional network module, the input is $MR_i$ and the convolution kernel is $k_c \in \mathbb{R}^{2 \times w \times b \times 1}$, where w and b are the length and width of the convolution kernel, respectively; the calculation process of the convolutional layer is shown in equation (3): (3) in the formula, $E_c \in \mathbb{R}^{w \times b}$ is the output of the current convolutional layer, $LR_{x,y}$ is the element in row x and column y of the sentence representation of the current layer, $E_{c\,i,j}$ is the element in row i and column j of $E_c$, and $AR_{x,y}$ is the element in row x and column y of the matrix AR; (2d) finally, the normalization layer keeps the sentence representation after convolution in the current layer consistent in magnitude with that before convolution, which benefits the fusion in the next layer, i.e., given $E_c$, its normalized value $AR_c$ is calculated from equation (4): (4) where $g_i$ and $b$ are trainable parameters; as can be seen from equation (4), this normalization operates along the same hidden layer dimension of the sentence embedding; the normalized value $AR_c$ is the aggregated distributed representation, where c = l-1, n is the sentence length, $u_i$ is the mean of the i-th column of $E_c$, and $\sigma_i$ is the variance of the i-th column of $E_c$.
2. The named entity recognition method based on a pre-trained language model and a progressive convolutional network according to claim 1, characterized in that Step (1) specifically includes the following steps: (1a) generate the input embeddings of the pre-trained language model, where the pre-trained language model requires three types of input data during training: the word segmentation ID, word_ids, represents the ID of each word in the natural sentence S; the sentence IDs, segment_ids, represent the distinction between two input natural sentences S; position encoding represents the location of each word in the natural sentence S; the pre-trained language model concatenates the three data types of each word in the natural language, i.e., the natural sentence S, to obtain the final encoded vector, which is used as the input embedding of the natural sentence S; (1b) transform the input embedding into a full-text enhanced semantic representation: the input embedding of the natural sentence S is input into the pre-trained language model, which extracts the context information of each character in the natural sentence S and finally obtains the representation set LS of the natural sentence S, as shown in equation (1): $LS = PM(S) = \{L_1, L_2, \ldots, L_l\}$ (1), where PM denotes the pre-trained model, $L_i \in \mathbb{R}^{n \times h}$, $i \in \{1, 2, \ldots, l\}$, is the sentence representation of the natural sentence S encoded by the i-th layer of the pre-trained language model, $l$ is the depth of the pre-trained language model, and n and h are the length of the natural sentence S and the hidden layer dimension of the pre-trained language model, respectively.
3. The named entity recognition method based on a pre-trained language model and a progressive convolutional network according to claim 1, characterized in that step (3) specifically includes the following steps:
(3a) Obtain the probability matrix P: the aggregated distributed representation is input to the CRF model and transformed by its fully connected layer into a probability matrix $P \in \mathbb{R}^{n \times k}$, where k is the number of predicted label categories; the probability matrix P is learned as a learnable parameter of the CRF model;
(3b) Obtain the transition matrix A: the CRF model randomly initializes the label transition matrix A; as the CRF model is trained, A is adjusted continuously during backpropagation, finally yielding a transition matrix A that constrains the order of the labels;
(3d) Select the path with the highest score: the probability matrix P and the transition matrix A jointly determine how a label path is selected, and the path with the highest score is the final result. During the training phase, dynamic programming is used to obtain the total score of all paths by dividing the calculation of this total score into many sub-problems.
To calculate the total score of all paths, i.e., the score at the END point, the column before the END point is calculated first, which splits the problem into multiple sub-problems. For the input sentence sequence $E = \{e_1, e_2, \ldots, e_n\}$, where n is the length of the natural sentence S, the corresponding label sequence is $y = \{y_1, y_2, \ldots, y_n\}$, $y_i \in Y$, where Y is the set of named entity categories. The probability of a label sequence is calculated as:
$$ s(E, y) = \sum_{i} P_{i, y_i} + \sum_{i} A_{y_i, y_{i+1}}, \qquad p(y \mid E) = \frac{\exp\big(s(E, y)\big)}{\sum_{\tilde{y} \in Y_E} \exp\big(s(E, \tilde{y})\big)} \tag{5} $$
where $s(E, y)$ is the score function; $P_{i, y_i}$ is the probability that the i-th position in the label sequence carries label $y_i$; A is the state transition matrix, and $A_{y_i, y_{i+1}}$ is the probability of transitioning from label $y_i$ to label $y_{i+1}$; $p(y \mid E)$ is the conditional probability of y given E; and $Y_E$ is the set of all possible label sequences for the input sentence sequence E.
During the training phase, the probability matrix P and the state transition matrix A are learned by maximum likelihood estimation. The log-likelihood function is:
$$ \log p(y \mid E) = s(E, y) - \log \sum_{\tilde{y} \in Y_E} \exp\big(s(E, \tilde{y})\big) \tag{6} $$
During prediction, the path with the highest score among all paths is selected as the prediction result. After P and A have been learned, the Viterbi algorithm is used in the prediction phase to find the label sequence $y^{*}$ that maximizes the conditional probability, and $y^{*}$ is taken as the label sequence of the input sequence, i.e., the decoded result, as shown in equation (7):
$$ y^{*} = \arg\max_{\tilde{y} \in Y_E} p(\tilde{y} \mid E) \tag{7} $$
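The training-time total score and the prediction-time decoding described in this claim can be sketched with two small dynamic programs. This is a hedged illustration, not the patented implementation: it assumes a path score equal to the sum of the per-position entries of P plus the transition entries of A between neighbouring labels, and it omits any separate START/END transitions. The function names are hypothetical.

```python
import math


def path_score(P, A, y):
    """Score of one label path y: per-position terms from P plus transitions from A."""
    score = sum(P[i][y[i]] for i in range(len(y)))
    score += sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return score


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(P, A):
    """Log of the summed exp-scores of all paths, computed column by column."""
    k = len(P[0])
    alpha = list(P[0])  # log-sum of scores of all partial paths ending in each label
    for i in range(1, len(P)):
        alpha = [
            P[i][j] + _logsumexp([alpha[p] + A[p][j] for p in range(k)])
            for j in range(k)
        ]
    return _logsumexp(alpha)


def viterbi(P, A):
    """Return the highest-scoring label path and its score."""
    k = len(P[0])
    best = list(P[0])  # best score of any partial path ending in each label
    back = []          # back[i-1][j]: best previous label when position i has label j
    for i in range(1, len(P)):
        new_best, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda p: best[p] + A[p][j])
            pointers.append(prev)
            new_best.append(best[prev] + A[prev][j] + P[i][j])
        best = new_best
        back.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example: 3 positions, 2 labels.
P = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
A = [[0.1, 0.6], [0.7, 0.2]]
best_path, best_score = viterbi(P, A)
log_likelihood = path_score(P, A, best_path) - log_partition(P, A)
```

Both routines process one column at a time and reuse the results of the previous column, so the cost grows linearly with sentence length instead of with the number of possible paths.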