Quantity extraction method and system based on natural language processing

By employing a natural language processing-based quantity extraction method and utilizing the attention mechanism of the encoder and decoder, combined with unsupervised and supervised training, this method addresses the problem that numbers in existing technologies lack practical meaning. It enables the extraction of quantities of specific targets and types from natural language text, improving the accuracy and efficiency of the extraction.

CN114707491B · Active · Publication Date: 2026-06-26 · LINGXI QUANTUM (BEIJING) MEDICAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
LINGXI QUANTUM (BEIJING) MEDICAL TECH CO LTD
Filing Date
2022-03-15
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

The numbers extracted by existing technologies do not have practical meaning, which is not conducive to subsequent analysis and evaluation, and cannot efficiently extract quantities for specific targets and types.

Method used

We employ a quantity extraction method based on natural language processing. By combining an unsupervised trained original model and a supervised trained intermediate model with the attention mechanism of the encoder and decoder, we utilize question-and-answer input to output quantity information from natural language text.

Benefits of technology

It enables the extraction of quantities of specific targets and types from natural language text, improving the accuracy and efficiency of natural language understanding and quantity extraction, and enabling more efficient completion of quantity extraction tasks.


Abstract

The application relates to the technical field of natural language processing, and provides a quantity extraction method and system based on natural language processing. The method comprises the following steps: obtaining a natural language text comprising a quantity; running a quantity extraction model based on the natural language text to obtain a quantity result; the input of the quantity extraction model comprises a first prefix sentence, a first suffix sentence and the natural language text, and the output comprises a second suffix sentence. The intermediate model obtained by performing unsupervised first training on an original model has better natural language understanding capability, and the quantity extraction model obtained by performing supervised second training on the intermediate model has better quantity extraction capability, so that the problem that quantity extraction cannot be performed on specific to-be-extracted targets and types in the prior art is solved, and the quantity extraction task can be more efficiently completed.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to a method and system for extracting quantities based on natural language processing.

Background Technology

[0002] Data extraction refers to the process of extracting necessary information from original documents for a specific purpose, in order to further store, convert, and analyze it.

[0003] In data extraction tasks, extracting a specific target quantity is a common requirement. Traditional quantity extraction, which extracts quantities from the numbers themselves, is relatively simple and easy to implement, but the numbers obtained by this method do not have practical meaning and are not conducive to subsequent analysis and evaluation.

[0004] Therefore, how to provide an efficient method for extracting quantity meaning has become an urgent technical problem to be solved.

Summary of the Invention

[0005] This invention provides a quantity extraction method and system based on natural language processing to address the shortcomings of existing technologies where the obtained numbers do not have practical meaning, which is not conducive to subsequent analysis and evaluation, and to achieve efficient quantity extraction that can extract the meaning of quantities.

[0006] This invention provides a quantity extraction method based on natural language processing, comprising:

[0007] Obtaining natural language text that includes a quantity;

[0008] Running the quantity extraction model based on the natural language text to obtain a quantity result;

[0009] The input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes a quantity that corresponds one-to-one with the type of the target to be extracted.

[0010] The quantity extraction model is obtained by sequentially performing, on an original model, a first training with a first sample and a second training with a second sample and the corresponding label of the second sample; the original model is a natural language processing model; the first training is unsupervised training; and the second training is supervised training.

[0011] According to the present invention, a method for extracting quantities based on natural language processing is provided, wherein the original model is an attention model that takes a source sequence as input and a target sequence as output, and includes an encoder and a decoder; both the source sequence and the target sequence are sequences of natural language morphemes.

[0012] The encoder can take the source sequence as input and obtain semantic encoding based on preset attention allocation parameters; the decoder can obtain natural language morphemes in the target sequence according to the semantic encoding.

[0013] The attention allocation parameter is a calculated weight for natural language morphemes in the source sequence and / or the target sequence.
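To make the idea of an attention allocation parameter concrete, the following is a minimal sketch that assumes the weights are computed with a softmax over per-morpheme relevance scores and that the semantic encoding is the weighted sum of per-morpheme vectors. The patent text does not fix the formula; the function names and the toy vectors are illustrative assumptions.

```python
import math

def attention_weights(scores):
    """Turn raw relevance scores into weights that sum to 1 (softmax, assumed here)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def semantic_encoding(weights, morpheme_vectors):
    """Weighted sum of the per-morpheme vectors under the given attention weights."""
    dim = len(morpheme_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, morpheme_vectors))
            for d in range(dim)]

# Toy source sequence of three morphemes, each with a 2-dimensional vector (made-up values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights([0.0, 0.0, 0.0])   # equal scores give equal weights
encoding = semantic_encoding(weights, vectors)
```

Different score vectors yield different weights and therefore different semantic encodings of the same source sequence.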

[0014] According to the quantity extraction method based on natural language processing provided by the present invention, the first training includes:

[0015] Replacing natural language morphemes in the first sample with masks, inputting the masked sample into the original model, and training the model to predict the morphemes replaced by the masks;

[0016] And / or, inputting at least two natural language morphemes from the first sample into the original model, and training the model to predict whether the at least two natural language morphemes are adjacent morphemes.
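The two unsupervised objectives of the first training can be illustrated by how their training samples might be built. The sketch below only constructs the samples; it does not train a model. The 15% mask ratio, the `[MASK]` token, and the function names are assumptions borrowed from common practice, not values given in the patent.

```python
import random

MASK = "[MASK]"  # assumed mask token

def make_masked_sample(morphemes, mask_ratio=0.15, rng=None):
    """Replace a random subset of morphemes with MASK; return the masked
    sequence and a {position: original morpheme} dict of prediction targets."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(morphemes) * mask_ratio))
    positions = rng.sample(range(len(morphemes)), n_mask)
    masked = list(morphemes)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets

def make_adjacency_samples(morphemes, rng=None):
    """Build (first, second, is_adjacent) triples: each morpheme paired with its
    immediate successor (True) and, where possible, with a non-neighbor (False)."""
    rng = rng or random.Random(0)
    samples = []
    for i in range(len(morphemes) - 1):
        samples.append((morphemes[i], morphemes[i + 1], True))
        far = [j for j in range(len(morphemes)) if abs(j - i) > 1]
        if far:
            samples.append((morphemes[i], morphemes[rng.choice(far)], False))
    return samples

morphemes = "a total of 120 patients were enrolled".split()
masked, targets = make_masked_sample(morphemes, rng=random.Random(1))
pairs = make_adjacency_samples(morphemes, rng=random.Random(1))
```

Because both kinds of labels are derived from the text itself, no manual annotation is needed, which is what makes this stage unsupervised.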

[0017] According to the quantity extraction method based on natural language processing provided by the present invention:

[0018] The encoder can take the source sequence as input and obtain at least two source sequence semantic codes based on preset attention allocation parameters; the attention allocation parameters corresponding to the at least two source sequence semantic codes are different.

[0019] The decoder is capable of:

[0020] The semantic encoding of the first natural language morpheme of the target sequence is obtained by taking the semantic encoding of the source sequence as input;

[0021] Using the semantic encoding of the source sequence and the set of semantic encodings of the first to (i-1)th natural language morphemes of the target sequence as input, the semantic encoding of the ith natural language morpheme of the target sequence is obtained; i is an integer greater than 1.

[0022] The natural language morphemes of the target sequence are obtained from the morpheme semantic encodings of the natural language morphemes of the target sequence.
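The decoder behavior described above, where the i-th morpheme depends on the source encoding and on morphemes 1 to i-1, corresponds to an autoregressive decoding loop. The following is a minimal sketch of that loop; `toy_next_morpheme` is a hypothetical stand-in for a trained decoder, and the `[SEP]` end token and the dictionary used as "source encoding" are illustrative assumptions.

```python
def greedy_decode(source_encoding, next_morpheme, end_token="[SEP]", max_len=20):
    """Generate the target sequence one morpheme at a time. Each step sees the
    source encoding plus everything generated so far."""
    target = []
    while len(target) < max_len:
        morpheme = next_morpheme(source_encoding, target)
        if morpheme == end_token:
            break
        target.append(morpheme)
    return target

def toy_next_morpheme(source_encoding, generated):
    """Hypothetical decoder step: replays a fixed answer, then emits the end token."""
    answer = source_encoding["answer"]
    if len(generated) < len(answer):
        return answer[len(generated)]
    return "[SEP]"

decoded = greedy_decode({"answer": ["1000", "cases"]}, toy_next_morpheme)
```

In a real model the step function would score every candidate morpheme and return the most likely one; only the loop structure is the point here.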

[0023] According to the present invention, a method for extracting quantities based on natural language processing is provided, wherein the original model after the first training is designated as the intermediate model, and the second training includes:

[0024] Using a second sample including a first prefix statement and a first suffix statement as the source sequence, inputting it into the intermediate model to obtain a target sequence including a second suffix statement, and adjusting the parameters of the intermediate model based on the target sequence and the second label, thereby training the intermediate model into the quantity extraction model;

[0025] The first suffix statement includes the type of the target to be extracted and a quantity mask; the second suffix statement is obtained by replacing the quantity mask in the first suffix statement with the predicted quantity; the second label includes the true quantity value.

[0026] According to the present invention, a method for extracting quantities based on natural language processing is provided, wherein the target sequence further includes a second prefix statement;

[0027] The encoder can take the source sequence and the natural language text as input and obtain at least two source sequence semantic codes based on preset attention allocation parameters; the attention allocation parameters corresponding to the at least two source sequence semantic codes are different.

[0028] The decoder is capable of:

[0029] Using the source sequence semantic encoding as input, the morpheme semantic encoding of the second prefix statement in the target sequence is obtained;

[0030] Using the source sequence semantic encoding as input, the morpheme semantic encoding of the first natural language morpheme of the second suffix statement in the target sequence is obtained;

[0031] Using the semantic encoding of the source sequence and the set of semantic encodings of the first to (j-1)th natural language morphemes of the second suffix statement in the target sequence as input, the semantic encoding of the j-th natural language morpheme of the target sequence is obtained; j is an integer greater than 1.

[0032] The natural language morphemes of the target sequence are obtained from the morpheme semantic encodings of the natural language morphemes of the target sequence.

[0033] The present invention also provides a quantity extraction system based on natural language processing, comprising:

[0034] An acquisition module, used to acquire natural language text including a quantity;

[0035] A quantity module, used to run a quantity extraction model based on the natural language text to obtain a quantity result;

[0036] The input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes a quantity that corresponds one-to-one with the type of the target to be extracted.

[0037] The quantity extraction model is obtained by sequentially performing, on an original model, a first training with a first sample and a second training with a second sample and the corresponding label of the second sample; the original model is a natural language processing model; the first training is unsupervised training; and the second training is supervised training.

[0038] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the above-described natural language processing-based quantity extraction methods.

[0039] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the quantity extraction method based on natural language processing as described above.

[0040] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the above-described natural language processing-based quantity extraction methods.

[0041] The present invention provides a quantity extraction method and system based on natural language processing. Through a natural language processing model, it performs question-and-answer-style quantity extraction on natural language text. Specifically, it uses a first prefix statement to determine the target to be extracted and a first suffix statement to determine the type of the target to be extracted, inputting these into the quantity extraction model. This yields a second suffix statement containing a one-to-one correspondence between the target type and quantity, serving as the quantity extraction result. This solves the problem in existing technologies where quantity extraction cannot be performed on specific targets and types. Furthermore, the intermediate model obtained through unsupervised first training of the original model has better natural language understanding capabilities, and the quantity extraction model obtained through supervised second training of the intermediate model has even better quantity extraction capabilities. This allows for more efficient understanding and encoding of natural language text and more efficient completion of the quantity extraction task.

Attached Figure Description

[0042] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0043] Figure 1 is a flowchart illustrating the quantity extraction method based on natural language processing provided by the present invention;

[0044] Figure 2 is a schematic diagram of the quantity extraction model framework provided in an embodiment of the present invention;

[0045] Figure 3 is a schematic diagram of the medical literature sample size extraction process provided in an embodiment of the present invention;

[0046] Figure 4 is a schematic diagram of the structure of the quantity extraction system based on natural language processing provided by the present invention;

[0047] Figure 5 is a schematic diagram of the structure of the electronic device provided by the present invention.

[0048] Figure label:

[0049] 401: Acquisition module;

[0050] 402: Quantity module;

[0051] 510: Processor;

[0052] 520: Communication interface;

[0053] 530: Memory;

[0054] 540: Communication bus.

Detailed Implementation

[0055] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0056] The quantity extraction method based on natural language processing of this invention is described below with reference to Figures 1-3.

[0057] As shown in Figure 1, this embodiment of the invention provides a quantity extraction method based on natural language processing, including:

[0058] Step 102: Obtain natural language text including the quantity;

[0059] Step 104: Run the quantity extraction model based on the natural language text to obtain the quantity results;

[0060] The input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes a quantity that corresponds one-to-one with the type of the target to be extracted.

[0061] The quantity extraction model is obtained by sequentially performing, on an original model, a first training with a first sample and a second training with a second sample and the corresponding label of the second sample; the original model is a natural language processing model; the first training is unsupervised training; and the second training is supervised training.

[0062] In this embodiment, the first prefix statement is an identifier statement given for the target to be extracted, and the first suffix statement is an identifier statement given for the subdivision type of the target to be extracted. Both serve as input to the quantity extraction model, so that the quantity extraction model can perform targeted quantity extraction in the natural language text.

[0063] In a preferred embodiment, the first prefix statement and the first suffix statement can be understood as the question in a question-answering task of a natural language processing model, and the second suffix statement can be understood as the answer in that task, i.e., the model output.

[0064] For example, in a task to extract the number of houses for sale, the first prefix statement could be "the number of houses currently for sale"; correspondingly, the first suffix statement could be "the number of second-hand houses for sale: X;", "the number of second-hand houses for sale in a certain area: X;", or "the number of second-hand houses for sale: X; the number of new houses for sale: Y;", where X and Y are characters or strings with specific identifiers (such as spaces, set masks, set letters, etc.). In this example, the second suffix statement could be "the number of second-hand houses for sale: 10000;", "the number of second-hand houses for sale in a certain area: 1000", "the number of second-hand houses for sale: 10000; the number of new houses for sale: 11000;", or "X = 10000; Y = 11000".
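The relationship between the first suffix statement (with placeholders X and Y) and the second suffix statement (with quantities filled in) can be sketched as follows. In the actual method the quantities come from the quantity extraction model; here a plain dictionary stands in for the model's predictions, and both function names are hypothetical.

```python
import re

def fill_first_suffix(first_suffix, predicted):
    """Replace single-letter placeholders that precede ';' (e.g. X, Y)
    with the predicted quantities, giving the second suffix statement."""
    return re.sub(r"\b([A-Z])\b(?=\s*;)",
                  lambda m: str(predicted[m.group(1)]),
                  first_suffix)

def parse_second_suffix(second_suffix):
    """Parse 'type: quantity; type: quantity;' into a {type: quantity} dict."""
    result = {}
    for part in second_suffix.split(";"):
        part = part.strip()
        if not part:
            continue
        label, _, value = part.rpartition(":")
        result[label.strip()] = int(value.strip())
    return result

first_suffix = ("the number of second-hand houses for sale: X; "
                "the number of new houses for sale: Y;")
second_suffix = fill_first_suffix(first_suffix, {"X": 10000, "Y": 11000})
quantities = parse_second_suffix(second_suffix)
```

Parsing the second suffix statement back into a dictionary shows the one-to-one correspondence between each type and its quantity.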

[0065] For example, in the task of extracting sample size from medical literature, the first prefix statement can be "What is the sample size of the current study" (or, coded as "[CLS] What is the sample size of the current study [SEP]"), and the first suffix statement can be "X cases" (or, coded as "[SEP] X cases [SEP]").

[0066] It is worth noting that the output value of the quantity extraction model is related to its input value, training samples, and labels (and in some cases, also to the setting of the loss function). Therefore, although the first prefix statement and the first suffix statement can be understood as questions in a question-answering task, their specific forms are not limited to interrogative sentences in natural language, and in some cases, they are not limited to natural language forms either. For example, specific symbol codes can be used as the first prefix statement and the first suffix statement.

[0067] In this embodiment, the first training is an unsupervised training performed on the natural language processing model, i.e. the original model, to improve the model's natural language encoding ability (i.e. the model's natural language "understanding" ability); the original model after the first training is denoted as the intermediate model, and the second training is a supervised training performed on the intermediate model to improve the model's number extraction accuracy and recall rate.

[0068] In a preferred embodiment:

[0069] The first training includes:

[0070] Replacing natural language morphemes in the first sample with masks, inputting the masked sample into the original model, and training the model to predict the morphemes replaced by the masks;

[0071] And / or, inputting at least two natural language morphemes from the first sample into the original model, and training the model to predict whether the at least two natural language morphemes are adjacent morphemes.

[0072] The original model after the first training is designated as the intermediate model. The second training includes:

[0073] Using a second sample including a first prefix statement and a first suffix statement as the source sequence, inputting it into the intermediate model to obtain a target sequence including a second suffix statement, and adjusting the parameters of the intermediate model based on the target sequence and the second label, thereby training the intermediate model into the quantity extraction model;

[0074] The first suffix statement includes the type of the target to be extracted and a quantity mask; the second suffix statement is obtained by replacing the quantity mask in the first suffix statement with the predicted quantity; the second label includes the true quantity value.
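A supervised sample for the second training could be assembled roughly as sketched below. The segment order "[CLS] prefix [SEP] text [SEP] suffix [SEP]", the use of "X" as the quantity mask, and all function names are assumptions made for illustration. The parameter update itself is not shown, only the supervision signal that would drive it.

```python
QUANTITY_MASK = "X"  # assumed placeholder for the quantity to be predicted

def build_source_sequence(first_prefix, text, first_suffix):
    """Concatenate prefix, text and masked suffix into one source sequence
    (the segment order is an assumption)."""
    return f"[CLS] {first_prefix} [SEP] {text} [SEP] {first_suffix} [SEP]"

def expected_second_suffix(first_suffix, true_quantity):
    """Second suffix statement the model should produce: the quantity mask
    replaced by the true quantity taken from the second label."""
    return first_suffix.replace(QUANTITY_MASK, str(true_quantity), 1)

def is_correct(predicted_second_suffix, first_suffix, true_quantity):
    """Compare a model prediction with the labeled target (exact match)."""
    return predicted_second_suffix.strip() == expected_second_suffix(
        first_suffix, true_quantity)

source = build_source_sequence(
    "What is the sample size of the current study",
    "A total of 120 patients were randomized.",
    "X cases",
)
target = expected_second_suffix("X cases", 120)
```

During training, a mismatch between the predicted and expected second suffix statement would be turned into a loss that adjusts the intermediate model's parameters.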

[0075] The beneficial effects of this embodiment are as follows:

[0076] This embodiment uses a natural language processing model to perform question-and-answer-style quantity extraction on natural language text. Specifically, it uses a first prefix statement to determine the target to be extracted and a first suffix statement to determine the type of the target to be extracted, inputting these statements into the quantity extraction model. This results in a second suffix statement containing a one-to-one correspondence between the target type and quantity, which serves as the quantity extraction result. This solves the problem in existing technologies where quantity extraction cannot be performed on specific targets and types. Furthermore, the intermediate model obtained through unsupervised first training of the original model has better natural language understanding capabilities, and the quantity extraction model obtained through supervised second training of the intermediate model has even better quantity extraction capabilities. This allows for more efficient understanding and encoding of natural language text and more efficient completion of quantity extraction tasks.

[0077] According to the above embodiments, in this embodiment:

[0078] The original model is an attention model that takes a source sequence as input and a target sequence as output, and includes an encoder and a decoder; both the source sequence and the target sequence are sequences of natural language morphemes.

[0079] The encoder can take the source sequence as input and obtain semantic encoding based on preset attention allocation parameters; the decoder can obtain natural language morphemes in the target sequence according to the semantic encoding.

[0080] The attention allocation parameter is a calculated weight for natural language morphemes in the source sequence and / or the target sequence.

[0081] In this embodiment, the original model is a bidirectional attention model (Bidirectional LM, where LM is an abbreviation for language model); that is, the output for each element (e.g., natural language morpheme) of the target sequence draws on every natural language morpheme of the source sequence.

[0082] The beneficial effects of this embodiment are as follows:

[0083] The original model, trained using the omnidirectional attention mechanism, ultimately yields a quantity extraction model. This model is able to encode and decode each natural language morpheme in the source sequence (i.e., the first prefix statement, the first suffix statement, and the natural language text) as input, thereby gaining a more holistic understanding of the natural language text as the extraction source and the first prefix and first suffix statements as the extraction targets, resulting in more accurate quantity extraction results.

[0084] According to any of the above embodiments, in this embodiment:

[0085] The encoder can take the source sequence as input and obtain at least two source sequence semantic codes based on preset attention allocation parameters; the attention allocation parameters corresponding to the at least two source sequence semantic codes are different.

[0086] The decoder is capable of:

[0087] The semantic encoding of the first natural language morpheme of the target sequence is obtained by taking the semantic encoding of the source sequence as input;

[0088] Using the semantic encoding of the source sequence and the set of semantic encodings of the first to (i-1)th natural language morphemes of the target sequence as input, the semantic encoding of the ith natural language morpheme of the target sequence is obtained; i is an integer greater than 1.

[0089] The natural language morphemes of the target sequence are obtained from the morpheme semantic encodings of the natural language morphemes of the target sequence.

[0090] In this embodiment, the original model is a context model with a one-way attention mechanism (Left-to-Right LM, a language model from left to right), that is, the output of each element (e.g., a natural language morpheme) of the target sequence comes from all the elements preceding that element.
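The difference between the bidirectional embodiment and this left-to-right embodiment can be expressed as two attention masks. The sketch below uses the common convention that a position may also attend to itself; that convention and the function names are assumptions, not details stated in the patent.

```python
def bidirectional_mask(length):
    """mask[i][j] is True when position i may attend to position j.
    In the bidirectional case every position sees every other position."""
    return [[True for _ in range(length)] for _ in range(length)]

def left_to_right_mask(length):
    """In the left-to-right case position i sees only positions j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]

full = bidirectional_mask(3)
causal = left_to_right_mask(3)
```

The lower-triangular shape of `causal` is what lets the model generate morphemes one after another without looking ahead.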

[0091] The beneficial effects of this embodiment are as follows:

[0092] The original model, trained using a unidirectional attention mechanism, eventually yields a quantity extraction model that can decode the preceding elements, thus obtaining quantity extraction results more efficiently.

[0093] According to any of the above embodiments, in this embodiment:

[0094] The target sequence also includes a second prefix statement;

[0095] The encoder can take the source sequence and the natural language text as input and obtain at least two source sequence semantic codes based on preset attention allocation parameters; the attention allocation parameters corresponding to the at least two source sequence semantic codes are different.

[0096] The decoder is capable of:

[0097] Using the source sequence semantic encoding as input, the morpheme semantic encoding of the second prefix statement in the target sequence is obtained;

[0098] Using the source sequence semantic encoding as input, the morpheme semantic encoding of the first natural language morpheme of the second suffix statement in the target sequence is obtained;

[0099] Using the semantic encoding of the source sequence and the set of semantic encodings of the first to (j-1)th natural language morphemes of the second suffix statement in the target sequence as input, the semantic encoding of the j-th natural language morpheme of the target sequence is obtained; j is an integer greater than 1.

[0100] The natural language morphemes of the target sequence are obtained from the morpheme semantic encodings of the natural language morphemes of the target sequence.

[0101] As shown in Figure 2, in this embodiment the original model is a two-sentence model that combines omnidirectional and unidirectional attention mechanisms; that is, omnidirectional attention is applied to the first prefix statement in the source sequence to obtain the second prefix statement, and unidirectional (context) attention is applied to the first prefix statement and the first suffix statement in the source sequence to obtain the second suffix statement.

[0102] The beneficial effects of this embodiment are as follows:

[0103] By combining the original model with omnidirectional attention and unidirectional attention mechanisms for training, a quantity extraction model is finally obtained, which balances the accuracy and efficiency of the model and can obtain quantity extraction results more efficiently and accurately.

[0104] Based on any of the above embodiments, a relatively complete embodiment will be provided below, taking the sample size extraction task in medical literature as an example.

[0105] First, we will introduce the relevant background of the sample size extraction task in medical literature.

[0106] In recent years, evidence-based medicine has been widely used in assisting medical decision-making and medical research. The core idea of evidence-based medicine is to rigorously, accurately, and wisely apply the best available research evidence, combined with the individual professional skills and clinical experience of clinicians, and taking into account patients' values and preferences to develop treatment plans. Among these methods, systematic reviews are the way to obtain the best possible evidence. The research steps of a systematic review include key stages such as literature search and screening, data extraction, and quantitative pooling.

[0107] In the process of creating a systematic review, data extraction is the most time-consuming and labor-intensive step compared to other stages. For a single clinical study, the first step is to extract key information such as PICOS (participants, intervention, control, outcome measure, and study type), sample size, and effect size. Then, the extracted data is used to quantitatively merge homogeneous studies or qualitatively describe heterogeneous studies. In clinical research statistics, sample size has a significant impact on the entire systematic review process. If the sample size recorded for an included clinical study is inaccurate, it can bias the results and increase the uncertainty of the conclusions. Therefore, the accurate extraction of sample size, as a crucial data point, is a critical issue.

[0108] With the development of Natural Language Processing (NLP), and especially the release of the BERT (Bidirectional Encoder Representations from Transformers) model, new directions have opened up for NLP. Through its self-attention mechanism, this model structure can better understand contextual semantics and represent textual information, and it has repeatedly set new accuracy records across various text tasks. This embodiment combines medical literature sample size extraction with the latest NLP technology, designing a precise and effective method for extracting sample size based on a prefix-tuning structure.

[0109] Sample size is described in various ways in clinical research literature, and some information is not directly mentioned in the text, requiring analysis of the text's meaning to obtain it. Traditional sample size extraction requires two researchers to extract data independently, back to back. This manual extraction method consumes a great deal of researchers' time and energy and cannot efficiently extract data. Deep learning-based extraction models first transform the raw data into sequence-labeled corpora and then perform modeling. This is relatively simple to implement and has relatively stable results, making it the main method in current data extraction. However, extraction models can only extract key information described in the text and cannot handle complex scenarios of sample size extraction. For example, if the text only contains sample sizes for the intervention and control groups, the extraction model can only extract the sample sizes for the intervention and control groups, but cannot obtain the total sample size. Generative models, on the other hand, can generate the total sample size based on textual information. Therefore, this embodiment proposes a method for extracting and generating sample sizes from medical literature based on a prefix-tuning structure, thereby achieving the goal of quickly, accurately, and effectively extracting sample sizes from clinical research.
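The limitation described above can be shown with a small worked example. The sentence and the group sizes (60 and 58) are made up for illustration. A span-based extractor can only return numbers that appear in the text, while the required total (118) appears nowhere and has to be derived.

```python
import re

# Hypothetical abstract sentence: only the two group sizes are stated.
text = ("Patients were randomized to the intervention group (n = 60) "
        "or the control group (n = 58).")

# What a span-based extractor can find: numbers literally present in the text.
group_sizes = [int(n) for n in re.findall(r"\bn\s*=\s*(\d+)", text)]

# What the task actually needs: the total, which is not written anywhere.
total_sample_size = sum(group_sizes)
total_is_in_text = str(total_sample_size) in text
```

A generative model is expected to produce the total directly as its answer, rather than relying on hand-written arithmetic like the sum above.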

[0110] As shown in Figure 3, the specific solution of this embodiment is described below:

[0111] This embodiment relies on abstract data from the company's self-built medical literature database. The training dataset is constructed using a combination of manual labeling and rule extraction. The final sample size is obtained by combining rule extraction with model inference.

[0112] 1. Rule Formulation

[0113] 1) Analysis of different types of literature data showed that documents of the same type often describe sample size in similar forms. Therefore, when constructing the dataset, different processing methods were adopted for different types of data. Here, a sample size dataset was constructed from the abstract data of two types of medical literature: RCT (randomized controlled trial) and Meta (meta-analysis).

[0114] 2) By observing and statistically analyzing the data, certain characteristic words related to the sample size in the abstract are obtained, and a rule dictionary is constructed for the logical formulation of the rule module;
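A rule dictionary of this kind can be sketched as a mapping from rule codes to patterns built around characteristic words. The following is only a minimal illustration: the rule codes (`R_TOTAL`, `R_INTERVENTION`, `R_CONTROL`), the regular expressions, and the function name `apply_rules` are assumptions for demonstration, since the actual dictionary is not disclosed here.

```python
import re

# Hypothetical rule dictionary: rule code -> pattern built around characteristic words.
# These patterns are illustrative assumptions, not the embodiment's actual rules.
RULE_DICTIONARY = {
    "R_TOTAL": re.compile(
        r"\b(?:a total of|total of)\s+(\d[\d,]*)\s+(?:patients|participants|subjects|cases)",
        re.I,
    ),
    "R_INTERVENTION": re.compile(
        r"\b(?:intervention|treatment|experimental) group\s*\(n\s*=\s*(\d[\d,]*)\)",
        re.I,
    ),
    "R_CONTROL": re.compile(r"\bcontrol group\s*\(n\s*=\s*(\d[\d,]*)\)", re.I),
}


def apply_rules(abstract: str) -> dict:
    """Return {rule_code: sample_size} for every rule that matches the abstract."""
    hits = {}
    for code, pattern in RULE_DICTIONARY.items():
        match = pattern.search(abstract)
        if match:
            # Strip thousands separators before converting to an integer.
            hits[code] = int(match.group(1).replace(",", ""))
    return hits


example = (
    "A total of 1,000 patients were randomized to the intervention group (n = 500) "
    "or the control group (n = 500)."
)
print(apply_rules(example))
```

Abstracts for which no rule fires return an empty dictionary; in the workflow described here, such abstracts would be candidates for manual labeling and model inference.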

[0115] 2. Construction of training dataset

[0116] 1) Rule extraction: The rule dictionary described above is used to mark the rule codes of the sample size in each abstract, and 10,000 data points of the RCT and Meta types are selected and included in the training set;

[0117] 2) Manual labeling: 5,000 RCT and 5,000 Meta-type literature abstracts that do not conform to the above rules are randomly drawn from the literature database. Two annotators then label the data independently (back to back), and items on which their labels differ are extracted for secondary verification. After manual labeling is completed, the data is merged with the rule-extracted training data and shuffled.
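The double-annotation and merging steps can be sketched as follows. This is a minimal illustration under assumed data structures: labels are stored as dictionaries keyed by a hypothetical abstract ID, and the function names `split_by_agreement` and `build_training_set` are not from the source.

```python
import random


def split_by_agreement(labels_a: dict, labels_b: dict):
    """Compare two annotators' independent labels, keyed by abstract ID."""
    agreed, disputed = {}, []
    for doc_id, label in labels_a.items():
        if label == labels_b.get(doc_id):
            agreed[doc_id] = label
        else:
            disputed.append(doc_id)  # sent to secondary verification
    return agreed, disputed


def build_training_set(rule_examples: list, manual_examples: list, seed: int = 0) -> list:
    """Merge rule-extracted and manually labeled examples, then shuffle them."""
    merged = list(rule_examples) + list(manual_examples)
    random.Random(seed).shuffle(merged)  # fixed seed keeps the shuffle reproducible
    return merged


# Hypothetical labels: abstract ID -> total sample size.
labels_a = {"doc1": 120, "doc2": 60, "doc3": 45}
labels_b = {"doc1": 120, "doc2": 64, "doc3": 45}
agreed, disputed = split_by_agreement(labels_a, labels_b)
print(agreed, disputed)

training_set = build_training_set(["rule_1", "rule_2"], ["manual_1"])
print(len(training_set))
```

Only the disputed items need a second look, which keeps the verification effort proportional to annotator disagreement rather than to the dataset size.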

[0118] 3. AI Model Algorithm Architecture Construction

[0119] We designed a task that balances understanding and generation capabilities based on the biomedical pre-trained model PubMedBERT and combined it with the UniLM concept. This was further fine-tuned to obtain the biomedical generative model LX-GenePubmedBERT, as detailed below:

[0120] 1) UniLM is a Transformer model that integrates NLU (Natural Language Understanding) and NLG (Natural Language Generation) capabilities. Its core idea is to give the model Seq2Seq (sequence-to-sequence) capability through a special attention mask. If the input is "What is the sample size of the current study?" and the target sentence is "One thousand cases", UniLM concatenates the two into a single sequence: [CLS] What is the sample size of the current study? [SEP] One thousand cases [SEP]. The tokens "[CLS] What is the sample size of the current study? [SEP]" use bidirectional attention, while "One thousand cases [SEP]" uses unidirectional (left-to-right) attention, so the model can predict "One thousand cases [SEP]" token by token and thus has the ability to generate text.

[0121] 2) Because of UniLM's attention mask, the tokens "[CLS] What is the sample size of the current study? [SEP]" attend only to one another and not to "One thousand cases [SEP]". This means that even though "One thousand cases [SEP]" is concatenated afterwards, it does not affect the encoded vectors of the preceding tokens. In other words, those vectors are identical to the encoding obtained when only "[CLS] What is the sample size of the current study? [SEP]" is input. If the [CLS] vector is taken as the sentence vector, it is the sentence vector of "What is the sample size of the current study?", not of the sentence with "One thousand cases" appended. Because of this property, UniLM also randomly adds some [MASK] tokens to the input, so the input part can perform the MLM (Masked Language Model) task while the output part performs the Seq2Seq task. MLM enhances NLU capability, and Seq2Seq enhances NLG capability.
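The attention pattern described in these two paragraphs can be written out as a small matrix. The sketch below uses plain Python lists rather than a tensor library; the function name `seq2seq_attention_mask` and the convention that `mask[i][j] == 1` means "token i may attend to token j" are assumptions made for illustration.

```python
def seq2seq_attention_mask(source_len: int, target_len: int) -> list:
    """Build a UniLM-style Seq2Seq mask; mask[i][j] == 1 lets token i attend to token j."""
    total = source_len + target_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Source tokens (j < source_len) are visible to every position.
            # Target tokens are visible only to themselves and to later target tokens,
            # so no source token ever sees the target segment.
            if j < source_len or j <= i:
                mask[i][j] = 1
    return mask


# 3 source tokens (e.g. [CLS] question [SEP]) followed by 2 target tokens.
for row in seq2seq_attention_mask(3, 2):
    print(row)
```

The first three rows are identical and contain zeros in the target columns, which is why the source encodings are unaffected by whatever is concatenated after them; the last two rows form the lower-triangular pattern that makes left-to-right generation possible.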

[0122] 3) Based on the above UniLM principle, we used our existing medical literature abstract data and constructed sample data to conduct supervised training, thereby obtaining our medical domain generative model LX-GenePubmedBERT;

[0123] 4) Input Template Construction: We referenced the prefix-tuning structure and combined it with our medical domain generative model LX-GenePubmedBERT for sample size generation and extraction. The general process is as follows: The model input is a single text segment S1S2, where S1 and S2 represent the source and target sequences, respectively, giving the input [SOS]S1[EOS]S2[EOS]. [SOS] marks the start of the text, and [EOS] separates the segments and also marks the end of text generation. The fine-tuning process of LX-GenePubmedBERT involves randomly masking a certain proportion of words in S2 so that the model learns to recover the masked words. The final [EOS] can also be masked, allowing the model to predict it. When the model predicts [EOS], it automatically ends the generation task, and text generation is complete. This embodiment applies special processing to S2: a specific input template K (total sample size K1, intervention group K2, control group K3) is added to S2, giving [SOS]S1[EOS]K1X K2Y K3Z[EOS]. During fine-tuning, when X is masked, X can be predicted from S1, K1, and the encoding of the mask itself, and X represents the total sample size. When Y is masked, Y can be predicted from S1, K1, X, K2, and the encoding of the mask itself, and Y represents the sample size of the intervention group. When Z is masked, Z can be predicted from S1, K1, X, K2, Y, K3, and the encoding of the mask itself, and Z represents the sample size of the control group.
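The template construction can be illustrated with plain string handling. The sketch below is an assumption-laden illustration: the English wording of K1, K2, and K3, the single-space separators, and the helper names `build_input` and `parse_output` are all hypothetical, and no model is involved.

```python
import re

SOS, EOS, MASK = "[SOS]", "[EOS]", "[MASK]"
# Template keys K1, K2, K3; the exact wording is an assumption for illustration.
TEMPLATE_KEYS = ("total sample size", "intervention group", "control group")


def build_input(s1: str, values, masked=()) -> str:
    """Build [SOS]S1[EOS]K1 X K2 Y K3 Z[EOS], replacing the chosen slots with [MASK]."""
    slots = [MASK if i in masked else str(v) for i, v in enumerate(values)]
    s2 = " ".join(f"{key} {slot}" for key, slot in zip(TEMPLATE_KEYS, slots))
    return f"{SOS}{s1}{EOS}{s2}{EOS}"


def parse_output(s2: str) -> dict:
    """Recover {template key: quantity} from generated 'K1 X K2 Y K3 Z' text."""
    result = {}
    for key in TEMPLATE_KEYS:
        match = re.search(re.escape(key) + r"\s+(\d+)", s2)
        if match:
            result[key] = int(match.group(1))
    return result


s1 = "Patients were assigned to intervention (n=60) or control (n=58)."
# Fine-tuning example in which X (the total sample size) is masked.
print(build_input(s1, (118, 60, 58), masked={0}))
# Parsing a generated target sequence back into structured quantities.
print(parse_output("total sample size 118 intervention group 60 control group 58"))
```

Because the target segment is attended left to right, a masked X is predicted without access to Y or Z even though they appear in the string; the string builder only marks which slot the model is asked to fill.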

[0124] 4. Result Evaluation

[0125] This embodiment uses F-Measure and Accuracy to evaluate the model's performance.

[0126] The F-Measure formula is as follows, where R represents recall and P represents precision:

[0127] $F = \dfrac{2 \times P \times R}{P + R}$

[0128] The accuracy formula is as follows:

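[0129] $\text{Accuracy} = \dfrac{\text{number of correctly extracted sample sizes}}{\text{total number of sample sizes}}$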

[0130] A higher F-Measure indicates better model performance. Accuracy represents the accuracy of the extracted sample size; a higher accuracy indicates more accurate extraction of the sample size.
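Both metrics are simple to compute. The sketch below assumes the balanced F-Measure (the harmonic mean of precision P and recall R) and an exact-match definition of accuracy; both definitions and the function names are assumptions, and the input numbers are made up for illustration rather than taken from this embodiment's results.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-Measure: harmonic mean of precision P and recall R."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    return 2 * precision * recall / (precision + recall)


def accuracy(predicted: list, gold: list) -> float:
    """Fraction of sample sizes that exactly match the gold annotation."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


print(f_measure(0.5, 0.5))
print(accuracy([100, 60, 58], [100, 60, 50]))
```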

[0131] The beneficial effects of this embodiment are as follows:

[0132] 1) By combining rules and models, key sentences describing sample size were accurately extracted from Chinese and English medical literature abstracts, with rule accuracy reaching 100% and model accuracy reaching 95.6%.

[0133] 2) The training process leverages the self-developed biomedical pre-trained generative model LX-GenePubmedBERT, which is more effective for medical data.

[0134] 3) Through prefix-tuning, an in-context learning approach is adopted in which the input template controls the model so that it generates the desired data.

[0135] The purpose of this embodiment is:

[0136] AI empowers evidence-based medicine by applying cutting-edge NLP technology to sample size extraction in medical literature. This enables rapid, accurate, and effective extraction of sample sizes from clinical studies, improving the research and innovation efficiency of medical researchers and supporting high-quality systematic review studies.

[0137] The quantity extraction device based on natural language processing provided by the present invention is described below. The quantity extraction device based on natural language processing described below and the quantity extraction method based on natural language processing described above can be referred to and correspond to each other.

[0138] As shown in Figure 4, embodiments of the present invention also provide a quantity extraction system based on natural language processing, comprising:

[0139] The acquisition module 401 is used to acquire natural language text including a quantity;

[0140] The quantity module 402 is used to run a quantity extraction model based on the natural language text to obtain quantity results;

[0141] The input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes a quantity that corresponds one-to-one with the type of the target to be extracted.

[0142] The quantity extraction model is obtained by sequentially performing, on an original model, a first training with a first sample and a second training with a second sample and the corresponding label of the second sample; the original model is a natural language processing model; the first training is unsupervised training; and the second training is supervised training.
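The data flow through the two modules can be sketched as follows. Everything here is a hypothetical illustration: the function names, the `[MASK]` placeholder format of the first suffix statement, and in particular `stub_model`, which merely stands in for the trained quantity extraction model by copying numbers from the text in order.

```python
import re
from typing import Callable, Dict, List


def acquisition_module(raw_text: str) -> str:
    """Acquire the natural language text (here: just normalize whitespace)."""
    return " ".join(raw_text.split())


def build_first_suffix(types: List[str], mask: str = "[MASK]") -> str:
    """First suffix statement: each target type followed by a quantity mask."""
    return " ".join(f"{t} {mask}" for t in types)


def stub_model(prefix: str, first_suffix: str, text: str) -> str:
    # Stand-in for the trained model: fills each mask with the next number in the
    # text. A real model would generate the quantities instead of copying them.
    numbers = iter(re.findall(r"\d+", text))
    return re.sub(r"\[MASK\]", lambda _: next(numbers, "0"), first_suffix)


def quantity_module(prefix: str, types: List[str], text: str,
                    model: Callable[[str, str, str], str]) -> Dict[str, int]:
    """Run the model and parse the second suffix statement into {type: quantity}."""
    second_suffix = model(prefix, build_first_suffix(types), text)
    result = {}
    for t in types:
        match = re.search(re.escape(t) + r"\s+(\d+)", second_suffix)
        if match:
            result[t] = int(match.group(1))
    return result


text = acquisition_module(
    "118 patients were enrolled:  60 in the intervention group and 58 in the control group."
)
types = ["total sample size", "intervention group", "control group"]
print(quantity_module("sample size", types, text, stub_model))
```

Passing the model in as a callable keeps the two modules independent of any particular network, so the same parsing logic applies whether the second suffix statement comes from a stub or from a fine-tuned generative model.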

[0143] The beneficial effects of this embodiment are as follows:

[0144] This embodiment uses a natural language processing model to perform question-and-answer-style quantity extraction on natural language text. Specifically, a first prefix statement determines the target to be extracted and a first suffix statement determines the type of the target to be extracted, and both statements are input into the quantity extraction model together with the natural language text. The model outputs a second suffix statement in which each target type corresponds one-to-one with a quantity, and this statement serves as the quantity extraction result. This solves the problem in existing technologies where quantity extraction cannot be performed on specific targets and types. Furthermore, the intermediate model obtained through unsupervised first training of the original model has better natural language understanding capability, and the quantity extraction model obtained through supervised second training of the intermediate model has better quantity extraction capability. This allows for more efficient understanding and encoding of natural language text and more efficient completion of quantity extraction tasks.

[0145] Figure 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530, and a communication bus 540, wherein the processor 510, the communication interface 520, and the memory 530 communicate with each other through the communication bus 540. The processor 510 can call logical instructions in the memory 530 to execute a quantity extraction method based on natural language processing. This method includes: acquiring natural language text containing quantities; running a quantity extraction model based on the natural language text to obtain quantity results; the input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes quantities corresponding one-to-one with the type of the target to be extracted; the quantity extraction model is obtained by sequentially performing, on an original model, a first training with a first sample and a second training with a second sample and labels corresponding to the second sample; the original model is a natural language processing model; the first training is unsupervised training; the second training is supervised training.

[0146] Furthermore, the logical instructions in the aforementioned memory 530 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0147] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the quantity extraction method based on natural language processing provided by the above methods. The method includes: acquiring natural language text containing quantities; running a quantity extraction model based on the natural language text to obtain quantity results; the input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes quantities that correspond one-to-one with the type of the target to be extracted; the quantity extraction model is obtained by sequentially performing a first training with a first sample and a second training with a second sample and a label corresponding to the second sample; the original model is a natural language processing model; the first training is unsupervised training; and the second training is supervised training.

[0148] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the quantity extraction method based on natural language processing provided by the above methods. The method includes: acquiring natural language text containing quantities; running a quantity extraction model based on the natural language text to obtain quantity results; the input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes quantities corresponding one-to-one with the type of the target to be extracted; the quantity extraction model is obtained by sequentially performing a first training with a first sample and a second training with a second sample and labels corresponding to the second sample; the original model is a natural language processing model; the first training is unsupervised training; and the second training is supervised training.

[0149] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0150] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0151] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A quantity extraction method based on natural language processing, characterized by comprising: acquiring natural language text including a quantity; running a quantity extraction model based on the natural language text to obtain a quantity result; wherein the input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes a quantity that corresponds one-to-one with the type of the target to be extracted; the quantity extraction model is obtained by sequentially performing, on an original model, a first training with a first sample and a second training with a second sample and the label corresponding to the second sample; the original model is a natural language processing model; the first training is unsupervised training; the second training is supervised training; the first training includes: replacing natural language morphemes in the first sample with a mask and inputting the masked sample into the original model to predict the natural language morphemes replaced by the mask; and/or inputting at least two natural language morphemes in the first sample into the original model to predict whether the at least two natural language morphemes are adjacent morphemes; the original model after the first training is denoted as the intermediate model;
the second training includes: using a second sample including a first prefix statement and a first suffix statement as the source sequence, inputting it into the intermediate model to obtain a target sequence including a second suffix statement, and adjusting the parameters of the intermediate model based on the target sequence and the second label, thereby obtaining the quantity extraction model; the first suffix statement includes the type of the target to be extracted and a quantity mask; the second suffix statement is obtained by replacing the quantity mask in the first suffix statement with the predicted quantity; the second label includes the ground truth quantity.

2. The quantity extraction method based on natural language processing according to claim 1, characterized in that, The original model is an attention model that takes a source sequence as input and a target sequence as output, and includes an encoder and a decoder; both the source sequence and the target sequence are sequences of natural language morphemes. The encoder can take the source sequence as input and obtain semantic encoding based on preset attention allocation parameters; The decoder is able to obtain natural language morphemes in the target sequence based on the semantic encoding; The attention allocation parameter is a calculated weight for natural language morphemes in the source sequence and / or the target sequence.

3. The quantity extraction method based on natural language processing according to claim 2, characterized in that: The encoder can take the source sequence as input and obtain at least two source sequence semantic codes based on preset attention allocation parameters; the attention allocation parameters corresponding to the at least two source sequence semantic codes are different. The decoder is capable of: The semantic encoding of the first natural language morpheme of the target sequence is obtained by taking the semantic encoding of the source sequence as input; Using the semantic encoding of the source sequence and the set of semantic encodings of the first to (i-1)th natural language morphemes of the target sequence as input, the semantic encoding of the ith natural language morpheme of the target sequence is obtained; i is an integer greater than 1. The natural language morphemes of the target sequence are obtained by encoding the morpheme semantics of the natural language morphemes of the target sequence.

4. The quantity extraction method based on natural language processing according to claim 2, characterized in that, The target sequence also includes a second prefix statement; The encoder is able to take the source sequence and the natural language text as input and obtain semantic codes for at least two source sequences based on preset attention allocation parameters; The attention allocation parameters corresponding to the semantic encodings of the at least two source sequences are different; The decoder is capable of: Using the source sequence semantic encoding as input, the morpheme semantic encoding of the second prefix statement in the target sequence is obtained; Using the source sequence semantic encoding as input, the morpheme semantic encoding of the first natural language morpheme of the second suffix sentence in the target sequence is obtained; Using the semantic encoding of the source sequence and the set of semantic encodings of the first to (j-1)th natural language morphemes of the second suffix statement in the target sequence as input, the semantic encoding of the j-th natural language morpheme of the target sequence is obtained; j is an integer greater than 1. The natural language morphemes of the target sequence are obtained by encoding the morpheme semantics of the natural language morphemes of the target sequence.

5. A quantity extraction system based on natural language processing, characterized by comprising: an acquisition module used to acquire natural language text including a quantity; a quantity module used to run a quantity extraction model based on the natural language text to obtain a quantity result; wherein the input of the quantity extraction model includes a first prefix statement, a first suffix statement, and the natural language text, and the output includes a second suffix statement; the first prefix statement is a character or string set based on the target to be extracted; the first suffix statement is a character or string set based on the type of the target to be extracted; the second suffix statement includes a quantity that corresponds one-to-one with the type of the target to be extracted; the quantity extraction model is obtained by sequentially performing, on an original model, a first training with a first sample and a second training with a second sample and the label corresponding to the second sample; the original model is a natural language processing model; the first training is unsupervised training; the second training is supervised training; the first training includes: replacing natural language morphemes in the first sample with a mask and inputting the masked sample into the original model to predict the natural language morphemes replaced by the mask; and/or inputting at least two natural language morphemes in the first sample into the original model to predict whether the at least two natural language morphemes are adjacent morphemes; the original model after the first training is denoted as the intermediate model;
the second training includes: using a second sample including a first prefix statement and a first suffix statement as the source sequence, inputting it into the intermediate model to obtain a target sequence including a second suffix statement, and adjusting the parameters of the intermediate model based on the target sequence and the second label, thereby obtaining the quantity extraction model; the first suffix statement includes the type of the target to be extracted and a quantity mask; the second suffix statement is obtained by replacing the quantity mask in the first suffix statement with the predicted quantity; the second label includes the ground truth quantity.

6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the quantity extraction method based on natural language processing as described in any one of claims 1 to 4.

7. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the quantity extraction method based on natural language processing as described in any one of claims 1 to 4.

8. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the quantity extraction method based on natural language processing as described in any one of claims 1 to 4.