A Deep Learning-Based Method and System for Named Entity Recognition in Biomedical Text
By constructing a multi-task model and using data augmentation techniques, the problems of label imbalance and data scarcity in biomedical text named entity recognition were solved, improving the recognition accuracy and generalization ability of genomic variant entities.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2023-03-08
- Publication Date
- 2026-06-30
Abstract
Description
Technical Field
[0001] This invention belongs to the fields of artificial intelligence and natural language processing, and specifically relates to a method and system for biomedical text named entity recognition based on deep learning.
Background Technology
[0002] Named Entity Recognition (NER), also known as proper name recognition, refers to the identification of entities with specific meanings in text, mainly including names of people, places, organizations, and proper nouns. NER is a crucial foundational tool in applications such as information extraction, question answering systems, syntactic analysis, machine translation, and metadata annotation for the Semantic Web, playing a vital role in the practical application of natural language processing technology. Biomedical text NER aims to extract medical entities from medical texts and classify them.
[0003] Existing methods for mining genomic variations in biomedical literature mainly include regular expression matching and machine learning. Regular expression matching requires manually designing a large number of regular expressions to identify genomic variation entities in biomedical texts. A representative tool is MutationFinder, which uses 700 manually designed regular expressions to match protein variation entities. However, this tool can only identify a single entity type, and the regular expressions are extremely complex to construct. The most representative rule-based machine learning method is tmVar, which does not require designing regular expressions and can efficiently identify multiple entity types. However, this method relies heavily on preprocessing and manually designed input features, and requires complex post-processing. Traditional methods almost inevitably suffer from the drawbacks of manually designed rules and features. Deep learning-based models often overcome these problems; however, the following two issues still exist:
[0004] (1) Failure to leverage the characteristics of biomedical text: Unlike entities in general domains, genomic variant entities have diverse naming methods, and most of them contain punctuation marks, numbers, etc., such as p.Pro246HisfsX13 and G>A. Current deep learning-based models, such as DeepVar, are not designed around the characteristics of biomedical text and genomic variants. In particular, during preprocessing, they adopt the label alignment methods used in general domains, which exacerbates label imbalance and label discontinuity. In addition, existing models cannot accurately identify the boundaries of entities that appear infrequently or are long (such as "G>A substitution at nucleotide 893").
[0005] (2) Data scarcity: High-quality datasets related to genomic variation are very scarce, while the performance of deep learning models is strongly correlated with the size of the dataset. With limited training data, the model's performance may not meet expectations.
Summary of the Invention
[0006] To address the aforementioned problems, this invention provides a method and system for biomedical text named entity recognition based on deep learning.
[0007] In a first aspect, the present invention provides a biomedical text named entity recognition method based on deep learning, wherein a multi-task model is constructed, the multi-task model including a BioBERT layer, a stacked BILSTM layer, an attention layer, a NER module, a simple NER module, a sentence recognition module and a token recognition module;
[0008] The biomedical text named entity recognition method based on a multi-task model includes the following steps:
[0009] S1. Obtain biomedical text training data with entity annotations for genomic variants;
[0010] S2. Perform data augmentation on the biomedical text training data to obtain augmented data;
[0011] S3. Perform word segmentation on the augmented data according to the improved word segmentation tagging method to obtain a word segmentation sequence;
[0012] S4. Extract features from the segmented sequence using the BioBERT layer to obtain a word vector sequence;
[0013] S5. Input the word vector sequence into a stacked BILSTM layer to further extract text position information and obtain the feature vector sequence;
[0014] S6. The attention layer uses an improved scoring function to obtain the semantic features of the feature vector sequence;
[0015] S7. The semantic features are fed into the NER module, simple NER module, sentence recognition module and token recognition module respectively, and the task loss is calculated using a comprehensive loss function. The multi-task model is trained by backpropagation.
[0016] S8. Acquire the biomedical text data to be identified in real time, and use the trained multi-task model to obtain the recognition result.
[0017] Furthermore, step S2, the process of data augmentation for the biomedical text training data, includes:
[0018] S21. Perform text-to-genome-variant entity separation processing on each biomedical text in the biomedical text training data with genomic variant entity annotations;
[0019] S22. Sort all the obtained genomic variant entities to obtain a priority sequence of genomic variant entities;
[0020] S23. Preprocess the words in each text to obtain preprocessed text. The preprocessing operations include synonym replacement, deletion and case conversion of words.
[0021] S24. Recombine the preprocessed text with the genomic variant entities in the priority sequence of genomic variant entities to obtain new biomedical text training data;
[0022] S25. The biomedical text training data, together with the new biomedical text training data, constitutes the augmented data.
[0023] Furthermore, the specific process of sorting all genomic variant entities in step S22 includes:
[0024] S221. Obtain the type of each genomic variant entity isolated from the biomedical text training data, and classify all genomic variant entities according to their type;
[0025] S222. Calculate the frequency of occurrence of each type and arrange the genomic variant entities according to the frequency of occurrence of each type; the lower the frequency of occurrence, the higher the priority of the genomic variant entity of the corresponding type.
[0026] S223. For multiple genomic variant entities belonging to the same type, sort them according to the length of the genomic variant entity; the longer the genomic variant entity, the higher its priority.
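The ordering described in steps S221 to S223 can be sketched as a two-level sort. This is a minimal illustration, not the patented implementation: the function name `prioritize_entities`, the type labels, and the sample entity `rs123456` are hypothetical; the other entity strings are taken from examples in this document.

```python
from collections import Counter

def prioritize_entities(entities):
    """Sort (text, type) pairs: rarer type first, then longer entity first."""
    # S221/S222: count how often each entity type occurs.
    type_counts = Counter(entity_type for _, entity_type in entities)
    # S222: lower type frequency means higher priority.
    # S223: within one type, longer entities come first.
    # The type name is part of the key so equally frequent types stay grouped.
    return sorted(
        entities,
        key=lambda e: (type_counts[e[1]], e[1], -len(e[0])),
    )

# Hypothetical sample data (type names are illustrative labels).
entities = [
    ("p.Pro246HisfsX13", "ProteinMutation"),
    ("L490R", "ProteinMutation"),
    ("V561X", "ProteinMutation"),
    ("Ex2+860G>C", "DNAMutation"),
    ("G>A", "DNAMutation"),
    ("rs123456", "SNP"),
]

for text, entity_type in prioritize_entities(entities):
    print(entity_type, text)
```

Because Python's sort is stable, entities of equal length within the same type keep their original relative order.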
[0027] Furthermore, the improved word segmentation tagging method includes: performing word segmentation on any data in the augmented data to obtain multiple tokens corresponding to the data; combining the BIO format to label the first token with the first original tag, and labeling the remaining tokens with the second original tag, finally obtaining the word segmentation sequence corresponding to the data.
[0028] Furthermore, the improved scoring function used in the attention layer is expressed as follows:
[0029]
[0030] Here, the scoring function gives the score of the association between any two vectors in the input sequence, $k_i$ denotes the i-th key vector, $q_n$ denotes the n-th query vector, $D$ denotes the dimension of the feature vector, $\lambda$ is a constant, $L$ denotes the length of the feature vector sequence, $d_n$ denotes the n-th distance vector, $d_i$ denotes the i-th distance vector, and argmax returns the index of the non-zero element of a distance vector. The distance vectors are one-hot encoded.
In a second aspect, based on the method of the first aspect, this invention also proposes a deep learning-based biomedical text named entity recognition system, including a data acquisition module, a data augmentation module, a word segmentation module, a classification module, a multi-task model training module, and a training data storage module, wherein:
[0031] The training data storage module is used to collect and store biomedical text training data with genomic variant entity annotations;
[0032] The data augmentation module is used to augment biomedical text training data to obtain augmented data;
[0033] The word segmentation module is used to segment the augmented data according to the improved word segmentation tagging method to obtain the word segmentation sequence;
[0034] The multi-task model training module is used to train a multi-task model based on the word segmentation sequence, and to calculate the classification loss using a loss function until the model parameters converge.
[0035] The data acquisition module is used to acquire biomedical text data to be identified in real time;
[0036] The classification module is used to store the trained multi-task model and to use the trained multi-task model to identify and classify the biomedical text data to be identified.
[0037] The beneficial effects of this invention are:
[0038] This invention takes into account the scarcity of genomic variant entity data, designs a data augmentation method to effectively solve the data scarcity problem, and proposes an improved word segmentation tagging method based on the characteristics of genomic variant entities to perform word segmentation processing on the augmented data, thereby alleviating the problem of tag sparsity.
[0039] This invention provides a multi-task model that simultaneously learns multiple related tasks, allowing these tasks to share knowledge during the learning process. It leverages the correlation between multiple tasks to improve the model's performance and generalization ability on each task. This enables full utilization of novel data dimensions, greatly helping to address data scarcity issues. Furthermore, it introduces an attention mechanism into the model and employs an improved scoring function to learn location information.
Description of the Drawings
[0040] Figure 1 is a flowchart of the method of the present invention;
[0041] Figure 2 is a structural diagram of the model of the present invention;
[0042] Figure 3 is a flowchart of the data augmentation process according to an embodiment of the present invention;
[0043] Figure 4 is a flowchart illustrating the construction process of a multi-task dataset according to an embodiment of the present invention;
[0044] Figure 5 is a flowchart of the post-segmentation tagging process according to an embodiment of the present invention.
Detailed Implementation
[0045] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0046] This invention provides a deep learning-based method for named entity recognition in biomedical text, in which a multi-task model is constructed. As shown in Figure 2, the multi-task model includes a BioBERT layer, stacked BILSTM layers, an attention layer, a NER module, a simple NER module, a sentence recognition module, and a token recognition module.
[0047] As shown in Figure 1, the biomedical text named entity recognition method based on a multi-task model includes the following steps:
[0048] S1. Obtain biomedical text training data with entity annotations for genomic variants;
[0049] S2. Perform data augmentation on the biomedical text training data to obtain augmented data;
[0050] S3. Perform word segmentation on the augmented data according to the improved word segmentation tagging method to obtain a word segmentation sequence;
[0051] S4. Extract features from the segmented sequence using the BioBERT layer to obtain word vectors;
[0052] S5. Input the word vectors into stacked BILSTM layers to further extract text location information and obtain feature vectors;
[0053] S6. The attention layer uses an improved scoring function to obtain the semantic features of the feature vector;
[0054] S7. The semantic features are fed into the NER module, simple NER module, sentence recognition module and token recognition module respectively, and the task loss is calculated using a comprehensive loss function. The multi-task model is trained by backpropagation.
[0055] S8. Acquire the biomedical text data to be identified in real time, and use the trained multi-task model to obtain the recognition result.
[0056] In one embodiment, considering the scarcity of data related to genomic variations, and the strong correlation between the performance of deep learning models and dataset size, this embodiment designs a simple method to enhance training data to alleviate the scarcity of training data and achieve the expected model performance. Specifically, this includes:
[0057] S21. Perform text-to-genome-variant entity separation processing on each biomedical text in the biomedical text training data with genomic variant entity annotations;
[0058] S22. Sort all the obtained genomic variant entities to obtain a priority sequence of genomic variant entities;
[0059] Specifically, the process of sorting all genomic variant entities in step S22 includes:
[0060] S221. Obtain the type of each genomic variant entity isolated from the biomedical text training data, specifically including protein variants, DNA variants, and SNPs, and classify all genomic variant entities according to their types;
[0061] S222. Calculate the frequency of occurrence of each type and arrange the genomic variant entities according to the frequency of occurrence of each type; the lower the frequency of occurrence, the higher the priority of the genomic variant entity corresponding to that frequency.
[0062] S223. For multiple genomic variant entities belonging to the same type, sort them according to the length of the genomic variant entity; the longer the genomic variant entity, the higher its priority.
[0063] Prioritization is performed to adjust the data types accordingly during training, and data that appears less frequently should be increased accordingly.
[0064] S23. Preprocess the words in each text to obtain preprocessed text. The preprocessing operations include synonym replacement, deletion and case conversion of words.
[0065] S24. Recombine the preprocessed text with genomic variant entities to obtain new biomedical text training data;
[0066] S25. The biomedical text training data, together with the new biomedical text training data, constitutes the augmented data.
[0067] Specifically, as shown in Figure 3, the original biomedical text "The polymorphism at position Ex2+860G>C of the CXCR1 gene" is separated into entity and text, resulting in the text "The polymorphism at position {{placeholder}} of the CXCR1 gene" and the genomic variant entity "Ex2+860G>C". The same separation process is performed on the other original biomedical texts, ultimately yielding a text set and a genomic variant entity set. The genomic variant entities are then sorted in descending order of priority, and each text in the text set undergoes operations such as synonym replacement, random addition/deletion, or case conversion to prevent overfitting to the training data. Finally, the processed texts and the genomic variant entities are recombined to obtain new biomedical text training data.
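The separate, perturb and recombine flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names (`separate`, `perturb`, `recombine`), the perturbation probabilities, and the one-entry synonym table are hypothetical and are not specified by the source; only the example sentence, the placeholder and the entity strings come from this document.

```python
import random

PLACEHOLDER = "{{placeholder}}"

def separate(text, entity):
    """S21: replace the annotated entity with a placeholder."""
    return text.replace(entity, PLACEHOLDER, 1)

def perturb(template, rng, synonyms, p_delete=0.1, p_upper=0.2):
    """S23: synonym replacement, random deletion and case conversion of context words."""
    words = []
    for word in template.split():
        if word == PLACEHOLDER:
            words.append(word)  # the entity slot is never altered or deleted
            continue
        if rng.random() < p_delete:
            continue  # randomly drop a context word
        word = synonyms.get(word.lower(), word)
        if rng.random() < p_upper:
            word = word.upper()
        words.append(word)
    return " ".join(words)

def recombine(template, entity):
    """S24: insert an entity from the priority sequence into the slot."""
    return template.replace(PLACEHOLDER, entity, 1)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = "The polymorphism at position Ex2+860G>C of the CXCR1 gene"
template = separate(original, "Ex2+860G>C")
# Assumed synonym table; a real system would use a proper lexical resource.
new_text = recombine(perturb(template, rng, {"polymorphism": "variation"}), "IVS8 -2A>G")
print(new_text)
```

In a fuller version, the entity passed to `recombine` would be drawn from the priority sequence of step S22, so that rare and long entities are inserted more often.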
[0068] In one embodiment, an improved word segmentation and labeling method is proposed for word segmentation processing of augmented data. The multi-task model proposed in this paper is based on the BERT pre-trained model, which involves a word segmentation process during input, that is, converting a word into one or more tokens in the vocabulary. This means that the tokens generated after word segmentation need to be labeled. Generally, in BERT-based models, if an entity is divided into multiple tokens, only the first token needs to be labeled with the original label, and the remaining tokens are directly labeled as X. However, considering the characteristics of genomic variant entities, the probability of a genomic variant entity being word segmented is very high. If the traditional method is used, the final labels will be quite sparse. At the same time, common labeling formats in the field of named entity recognition include BIO and BIOES. Therefore, this embodiment proposes an improved word segmentation and labeling method that combines the BIO format, including performing word segmentation processing on any data in the augmented data to obtain multiple tokens; combining the BIO format to label the first token with the first original label, and labeling the remaining tokens with the second original label, finally obtaining the word segmentation sequence corresponding to the data.
[0069] Specifically, as shown in Figure 5, for a piece of data "IVS8 -2A>G" in the augmented data, the original label of "IVS8" is BD, and the original label of "-2A>G" is ID. After the data is segmented, the segmented sequence is {"I", "V", "S", "8", "-", "2", "A", ">", "G"}. With the traditional method, only the first token "I" is labeled with the original label BD, and the remaining tokens are directly labeled as X. With the improved segmentation labeling method proposed in this embodiment, the first token "I" is labeled with the original label BD, and the remaining tokens are labeled with the original label ID.
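The improved labeling rule in this example can be sketched as below. The function name `improved_labels` is hypothetical, and the character-level splitter (`list`) is only a stand-in for the BioBERT tokenizer, chosen because the example above splits the entity into single characters.

```python
def improved_labels(words, word_labels, tokenize=list):
    """Give every sub-token a BIO label instead of padding with X.

    `tokenize` is an assumed stand-in for the real sub-word tokenizer.
    """
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        for index, piece in enumerate(tokenize(word)):
            tokens.append(piece)
            if label.startswith("B") and index > 0:
                # Later pieces of a B-labeled word continue the same entity.
                labels.append("I" + label[1:])
            else:
                labels.append(label)
    return tokens, labels

tokens, labels = improved_labels(["IVS8", "-2A>G"], ["BD", "ID"])
for token, label in zip(tokens, labels):
    print(token, label)
```

Tokens of words labeled O keep the label O, so only entity spans become denser.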
[0070] In one embodiment, a series of improvements were made on the basis of BioBERT to construct the multi-task model shown in Figure 2. The model first feeds the word segmentation sequence into a BioBERT layer for feature extraction, so that each word in the sequence receives a corresponding word vector. This word vector sequence is then fed into a stacked BILSTM layer. The stacked BILSTM layer consists of multiple stacked BILSTM modules, with residual connections introduced between every two modules to reduce the risk of vanishing gradients caused by excessive network depth. Specifically, assuming the output of the t-th BILSTM layer is $h_x$, the input $i_{t+1}$ of the (t+1)-th layer of the network is:
[0071] $i_{t+1} = F(h_x) + h_x$
[0072] Where $F$ is the nonlinear activation function tanh.
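The residual rule can be illustrated with plain Python lists. This is only a sketch of the connection pattern: the `layers` argument is a list of hypothetical callables standing in for BILSTM modules, and the function names are not from the source.

```python
import math

def next_layer_input(h):
    """Residual rule from the text: tanh(h) + h, applied element-wise."""
    return [math.tanh(value) + value for value in h]

def run_stack(x, layers):
    """Pass x through the layers, applying the residual rule between consecutive layers."""
    current = x
    for index, layer in enumerate(layers):
        h = layer(current)
        is_last = index == len(layers) - 1
        current = h if is_last else next_layer_input(h)
    return current

# Identity functions stand in for trained BILSTM modules (assumption).
identity = lambda vector: list(vector)
print(run_stack([0.5, -1.0], [identity, identity]))
```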
[0073] While bidirectional LSTM can capture contextual information of word vectors within a BILSTM layer, it doesn't highlight the potential semantic relevance between the word vectors and their context. Therefore, this embodiment adds an attention layer after the BILSTM layers to further extract semantic features from the word vectors. For example, in "Two novel mutations, L490R and V561X", "L490R", "V561X", and "mutations" are strongly related. Therefore, the attention mechanism assigns more weight to "mutations", helping to identify the mutated entities "L490R" and "V561X".
[0074] First, construct the key matrix $K = [k_1, \ldots, k_N]$, the value matrix $V = [v_1, \ldots, v_N]$ and the query matrix $Q = [q_1, \ldots, q_N]$, expressed as:
[0075] $Q = W_q X \in \mathbb{R}^{D_k \times N}$
[0076] $K = W_k X \in \mathbb{R}^{D_k \times N}$
[0077] $V = W_v X \in \mathbb{R}^{D_k \times N}$
[0078] Where $N$ represents the length of the input sequence for the self-attention mechanism, $D_k$ represents the vector dimension, $\mathbb{R}^{D_k \times N}$ represents the vector space, $W_q$, $W_k$ and $W_v$ represent the parameter matrices of the respective linear mappings, $k_n$ represents the n-th key vector, $v_n$ represents the n-th value vector, $q_n$ represents the n-th query vector, $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D_x \times N}$ is the input to the attention layer, namely the sequence of feature vectors, and $x_n$ represents the feature vector of the n-th word.
[0079] Then, the attention score is calculated based on the key matrix K, value matrix V, and query matrix Q. The attention scoring model is as follows:
[0080]
[0081] Here, the score measures the association between two vectors, $k_i$ represents the i-th key vector (with $k_i^{T}$ its transpose), $q_n$ represents the n-th query vector, and $D$ represents the dimension of the feature vector.
[0082] Then, the attention score is transformed using softmax and expressed as:
[0083]
[0084] The input information is summarized using a weighted summation method to obtain the attention value, which is represented as:
[0085]
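The three steps above (score, softmax, weighted sum) can be sketched in plain Python. The formulas themselves are not reproduced in this text, so the sketch assumes the standard scaled dot-product form, which is consistent with the symbols $k_i^{T}$, $q_n$ and $D$ named above; the function name `attention` is hypothetical.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (assumed standard form)."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score each key against the query: k_i^T q_n / sqrt(D).
        scores = [
            sum(k_c * q_c for k_c, q_c in zip(k, q)) / math.sqrt(dim)
            for k in keys
        ]
        # Softmax, shifted by the maximum score for numerical stability.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

out = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [4.0, 0.0]],
)
print(out)
```

With two identical keys the weights are uniform, so the output is the mean of the two value vectors.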
[0086] This embodiment considers that the location information of genomic variant entities in biomedical texts is very important, but traditional attention mechanisms cannot effectively process location information. Therefore, the attention mechanism is improved to enable it to fuse location information. Assuming d is a location vector (distance vector) using One-Hot encoding, the distance between the two vectors is:
[0087]
[0088] Where $L$ represents the length of the input to the attention layer, i.e., the length of the feature vector sequence, $d_n$ represents the n-th distance vector, $d_i$ represents the i-th distance vector, and $\Delta d$ represents the distance between two feature vector positions. The larger $\Delta d$ is, the closer the positions of the two tokens are. Therefore, an improved scoring function is proposed, expressed as:
[0089]
[0090] Where λ represents a constant.
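Since the distance and scoring formulas are not reproduced in this text, the following sketch shows only one possible reading of the definitions. Two assumptions are made and are not confirmed by the source: that $\Delta d = L - |\operatorname{argmax}(d_n) - \operatorname{argmax}(d_i)|$, which grows as two positions get closer, and that $\lambda \Delta d$ is added to a scaled dot-product score. All function names are hypothetical.

```python
import math

def one_hot(position, length):
    """One-hot distance (position) vector of the given length."""
    return [1 if index == position else 0 for index in range(length)]

def argmax(vector):
    """Index of the non-zero entry of a one-hot vector."""
    return max(range(len(vector)), key=vector.__getitem__)

def position_closeness(d_n, d_i):
    # ASSUMPTION: delta_d = L - |argmax(d_n) - argmax(d_i)|; the exact
    # formula is missing from the source text.
    return len(d_n) - abs(argmax(d_n) - argmax(d_i))

def improved_score(k_i, q_n, d_n, d_i, lam=0.1):
    # ASSUMPTION: the position term is added to the scaled dot product;
    # the source may combine the two terms differently.
    dot = sum(k_c * q_c for k_c, q_c in zip(k_i, q_n))
    return dot / math.sqrt(len(k_i)) + lam * position_closeness(d_n, d_i)

near = improved_score([1.0, 0.0], [1.0, 0.0], one_hot(1, 5), one_hot(2, 5))
far = improved_score([1.0, 0.0], [1.0, 0.0], one_hot(1, 5), one_hot(4, 5))
print(near, far)
```

Under these assumptions, two tokens with identical content receive a higher score when they are adjacent than when they are far apart.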
[0091] In one embodiment, a multi-task learning structure is designed to further fuse features. As shown in Figure 2 and Figure 4, it includes four learning tasks:
[0092] Named Entity Recognition (NER module), which is the main task of this invention, is to accurately identify the location and category of genomic variant entities;
[0093] The Simple Named Entity Recognition (Simple NER) task is the same as the main task, except that the number of labels is reduced and only the location of genomic variant entities needs to be identified.
[0094] The sentence classification task (sentence recognition module) is used to determine whether a sentence contains genomic variant entities;
[0095] The Token Classification task uses softmax as a classifier to identify the category of tokens.
[0096] Specifically, as shown in Figure 4, a multi-task learning structure is used for training. Compared with the NER module, the training data of the simple NER module no longer needs to be labeled with specific categories. Data labeled as B-DNA in the NER module is labeled as B in the simple NER module.
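The auxiliary labels for the simple NER and sentence classification tasks can be derived from the main NER labels, as sketched below. The function names and the sample label sequence are hypothetical; only the B-DNA to B mapping comes from the text.

```python
def to_simple_labels(ner_labels):
    """Drop the category suffix: B-DNA -> B, I-DNA -> I, O -> O."""
    return [label.split("-", 1)[0] for label in ner_labels]

def sentence_label(ner_labels):
    """1 if the sentence contains at least one entity token, else 0."""
    return int(any(label != "O" for label in ner_labels))

# Hypothetical label sequence for one sentence.
ner_labels = ["O", "O", "B-DNA", "I-DNA", "O"]
print(to_simple_labels(ner_labels))
print(sentence_label(ner_labels))
```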
[0097] Based on the classifier type, the above four tasks can be divided into two main categories: CRF-based tasks, including named entity recognition and simple named entity recognition tasks; and softmax-based tasks, including sentence classification and token classification tasks. Assuming the losses for each task are L1, L2, L3, and L4 respectively, the final loss is:
[0098] Loss = αL1 + βL2 + γL3 + δL4
[0099] Where α, β, γ, and δ are hyperparameters.
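The comprehensive loss is a weighted sum of the four task losses. A minimal sketch follows; the numeric loss values and weights are made up for illustration, and the function name `combined_loss` is hypothetical.

```python
def combined_loss(losses, weights):
    """Weighted sum: alpha*L1 + beta*L2 + gamma*L3 + delta*L4."""
    if len(losses) != len(weights):
        raise ValueError("one weight is required per task loss")
    return sum(weight * loss for weight, loss in zip(weights, losses))

# Order: NER, simple NER, sentence classification, token classification.
task_losses = [1.0, 2.0, 3.0, 4.0]    # illustrative values only
task_weights = [1.0, 0.5, 0.5, 0.25]  # hypothetical alpha, beta, gamma, delta
print(combined_loss(task_losses, task_weights))
```

Giving the main NER task the largest weight is one common choice, but the source does not state which values were used.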
[0100] This invention also provides a deep learning-based biomedical text named entity recognition system, comprising a data acquisition module, a data augmentation module, a word segmentation module, a classification module, a multi-task model training module, and a training data storage module, wherein:
[0101] The training data storage module is used to collect and store biomedical text training data with genomic variant entity annotations;
[0102] The data augmentation module is used to augment biomedical text training data to obtain augmented data;
[0103] The word segmentation module is used to segment the augmented data according to the improved word segmentation tagging method to obtain the word segmentation sequence;
[0104] The multi-task model training module is used to train a multi-task model based on the word segmentation sequence, and to calculate the classification loss using a loss function until the model parameters converge.
[0105] The data acquisition module is used to acquire biomedical text data to be identified in real time;
[0106] The classification module is used to store the trained multi-task model and to use the trained multi-task model to identify and classify the biomedical text data to be identified.
[0107] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "setting," "connection," "fixing," "rotation," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal connection of two components or the interaction between two components. Unless otherwise explicitly limited, those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0108] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A biomedical text named entity recognition method based on deep learning, characterized in that a multi-task model is constructed, which includes a BioBERT layer, stacked BILSTM layers, an attention layer, a NER module, a simple NER module, a sentence recognition module, and a token recognition module; the biomedical text named entity recognition method based on a multi-task model includes the following steps: S1. Obtain biomedical text training data with entity annotations for genomic variants; S2. Perform data augmentation on the biomedical text training data to obtain augmented data; S3. Perform word segmentation on the augmented data according to the improved word segmentation tagging method to obtain a word segmentation sequence; the improved word segmentation tagging method includes: performing word segmentation on any data in the augmented data to obtain multiple tokens corresponding to the data; combining the BIO format to label the first token with the first original tag, and labeling the remaining tokens with the second original tag, finally obtaining the word segmentation sequence corresponding to the data; S4. Extract features from the segmented sequence using the BioBERT layer to obtain a word vector sequence; S5. Input the word vector sequence into a stacked BILSTM layer to further extract text position information and obtain the feature vector sequence; S6. The attention layer uses an improved scoring function to obtain the semantic features of the feature vector sequence; the improved scoring function used in the attention layer is expressed as follows: wherein the function gives the score of the association between any two vectors in the input sequence, $k_i$ denotes the i-th key vector, $q_n$ denotes the n-th query vector, $D$ denotes the dimension of the feature vector, $\lambda$ is a constant, $L$ denotes the length of the feature vector sequence, $d_n$ denotes the n-th distance vector, $d_i$ denotes the i-th distance vector, and argmax denotes taking the index at which the distance vector is non-zero; S7. The semantic features are fed into the NER module, simple NER module, sentence recognition module and token recognition module respectively, and the task loss is calculated using a comprehensive loss function. The multi-task model is trained by backpropagation. S8. Acquire the biomedical text data to be identified in real time, and use the trained multi-task model to obtain the recognition result.
2. The method for biomedical text named entity recognition based on deep learning according to claim 1, characterized in that, Step S2, the process of data augmentation for the biomedical text training data, includes: S21. Perform text-to-genome-variant entity separation processing on each biomedical text in the biomedical text training data with genomic variant entity annotations; S22. Sort all the obtained genomic variant entities to obtain a priority sequence of genomic variant entities; S23. Preprocess the words in each text to obtain preprocessed text. The preprocessing operations include synonym replacement, deletion and case conversion of words. S24. Recombine the preprocessed text with the genomic variant entities in the priority sequence of genomic variant entities to obtain new biomedical text training data; S25. The biomedical text training data, together with the new biomedical text training data, constitutes the augmented data.
3. The method for named entity recognition of biomedical text based on deep learning according to claim 2, characterized in that, The specific process of sorting all genomic variant entities in step S22 includes: S221. Obtain the type of each genomic variant entity isolated from the biomedical text training data, and classify all genomic variant entities according to their type; S222. Calculate the frequency of occurrence of each type and arrange the genomic variant entities according to the frequency of occurrence of each type; the lower the frequency of occurrence, the higher the priority of the genomic variant entity of the corresponding type. S223. For multiple genomic variant entities belonging to the same type, sort them according to the length of the genomic variant entity; the longer the genomic variant entity, the higher its priority.
4. A biomedical text named entity recognition system based on deep learning, characterized in that it includes a data acquisition module, a data augmentation module, a word segmentation module, a classification module, a multi-task model training module, and a training data storage module, wherein: the training data storage module is used to collect and store biomedical text training data with genomic variant entity annotations; the data augmentation module is used to augment biomedical text training data to obtain augmented data; the word segmentation module is used to segment the augmented data according to the improved word segmentation tagging method to obtain a word segmentation sequence; the improved word segmentation tagging method includes: segmenting any data in the augmented data to obtain multiple tokens corresponding to the data; combining the BIO format to label the first token with the first original tag, and labeling the remaining tokens with the second original tag, finally obtaining the word segmentation sequence corresponding to the data; the multi-task model training module is used to train a multi-task model based on the word segmentation sequence, and to calculate the classification loss using a loss function until the model parameters converge; the multi-task model includes a BioBERT layer, stacked BILSTM layers, an attention layer, a NER module, a simple NER module, a sentence recognition module, and a token recognition module; the processing of the multi-task model training module specifically includes: the word segmentation sequence is processed by the BioBERT layer to extract features, resulting in a word vector sequence; the word vector sequence is input into a stacked BILSTM layer to further extract text position information, resulting in a feature vector sequence; the attention layer uses an improved scoring function to obtain the semantic features of the feature vector sequence; the improved scoring function used in the attention layer is expressed as follows: wherein the function gives the score of the association between any two vectors in the input sequence, $k_i$ denotes the i-th key vector, $q_n$ denotes the n-th query vector, $D$ denotes the dimension of the feature vector, $\lambda$ is a constant, $L$ denotes the length of the feature vector sequence, $d_n$ denotes the n-th distance vector, $d_i$ denotes the i-th distance vector, and argmax denotes taking the index at which the distance vector is non-zero; the semantic features are fed into the NER module, simple NER module, sentence recognition module and token recognition module respectively, and the task loss is calculated using a comprehensive loss function; the multi-task model is trained by backpropagation; the data acquisition module is used to acquire biomedical text data to be identified in real time; the classification module is used to store the trained multi-task model and to use the trained multi-task model to identify and classify the biomedical text data to be identified.
5. A biomedical text named entity recognition system based on deep learning according to claim 4, characterized in that, The data augmentation module performs data augmentation on biomedical text training data, including the following processes: S31. Perform text-to-genome-variant entity separation processing on each biomedical text in the biomedical text training data with genomic variant entity annotations; S32. Sort all the obtained genomic variant entities to obtain a priority sequence of genomic variant entities; S33. Preprocess the words in each text to obtain preprocessed text. The preprocessing operations include synonym replacement, deletion and case conversion of words. S34. Recombine the preprocessed text with the genomic variant entities in the priority sequence of genomic variant entities to obtain new biomedical text training data; S35. The biomedical text training data and the new biomedical text training data together constitute the augmented data.