Method, apparatus, device, and storage medium for rapid text information extraction
By using a pre-trained encoder network and pointer network to generate start and end lists, the problem of slow text information extraction speed in existing technologies is solved, and fast text information extraction is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- KINGDEE SOFTWARE(CHINA) CO LTD
- Filing Date
- 2022-11-02
- Publication Date
- 2026-06-16
AI Technical Summary
Existing text information extraction methods rely on CRF to model the conditional transition probabilities between information extraction labels, resulting in slow model training and inference speeds, making it difficult to quickly extract information from text.
A pre-trained encoder network is used as the encoder, and a first pointer network and a second pointer network are used as the decoding layer. Start and end lists are generated through semantic encoding and decoding, thereby quickly extracting target information from the text.
It improves the training and inference speed of the text information extraction model, enabling rapid extraction of information from text.
Figure CN115905481B_ABST
Description
Technical Field
[0001] This invention relates to the field of natural language processing technology, and in particular to a method, apparatus, device, and storage medium for rapid extraction of text information.
Background Technology
[0002] Text often contains valuable information, and extracting this important information is a popular research area in Natural Language Processing (NLP). Early Named Entity Recognition (NER) systems typically relied on manually defined rules and domain-specific dictionaries. This rule-based and dictionary-based approach has significant limitations, is difficult to extend to other domains, and incurs high maintenance costs. With the development of deep learning, more and more researchers are attempting to use the learning capabilities of deep models to solve the NER problem.
[0003] Existing text information extraction methods primarily combine a BLSTM (Bidirectional Long Short-Term Memory) network with a CRF (Conditional Random Field). These models first use the BLSTM to capture contextual information from the text to be extracted, and then use the CRF to model the conditional transition probabilities between information extraction labels. With the advent of large-scale pre-trained language models, the BLSTM has gradually been replaced by pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language model proposed by Google): BERT serves as the encoder that captures semantic information from the text to be extracted, and a CRF then extracts the information. However, all of these methods rely on a CRF to model the conditional transition probabilities between labels, which significantly slows model training and inference and makes it difficult to extract information from text quickly.
Summary of the Invention
[0004] To overcome the shortcomings of existing technologies, this invention provides a method, apparatus, device, and storage medium for rapid text information extraction, which can improve the training and inference speed of text information extraction models and achieve rapid extraction of information from text.
[0005] To address the aforementioned technical problems, in a first aspect, an embodiment of the present invention provides a method for rapid text information extraction, comprising:
[0006] Obtaining a target text information extraction model;
[0007] Performing semantic encoding processing on the text data to be extracted through the encoder network of the target text information extraction model to obtain the semantic encoding features of the text data to be extracted;
[0008] Performing first semantic decoding processing on the semantic encoding features through the first pointer network of the target text information extraction model to generate a start list containing the semantic categories of all information in the text data to be extracted;
[0009] Performing second semantic decoding processing on the semantic encoding features through the second pointer network of the target text information extraction model to generate an end list containing the semantic categories of all information in the text data to be extracted;
[0010] Extracting target information from the text data to be extracted by combining the start list and the end list.
[0011] Furthermore, obtaining the target text information extraction model specifically comprises:
[0012] All collected corpus data are preprocessed to obtain a sample dataset, which is then divided into a training dataset, a validation dataset, and a test dataset.
[0013] A pre-trained encoder network is used as the encoder, and a first pointer network and a second pointer network are used as the decoding layer to establish a first text information extraction model.
[0014] The first text information extraction model is trained based on the training dataset, and the parameters of the encoder network are fine-tuned to obtain the second text information extraction model.
[0015] When the cumulative number of training iterations in the current training round reaches a preset training iteration threshold, each second text information extraction model in the current training round is validated on the validation dataset to obtain the evaluation index value of each second text information extraction model.
[0016] When the cumulative number of training rounds reaches a preset training round threshold, the second text information extraction model with the largest evaluation index value is selected as the third text information extraction model, and the third text information extraction model is tested according to the test dataset to obtain the evaluation index value of the third text information extraction model.
[0017] When the evaluation index value of the third text information extraction model reaches the preset evaluation index threshold, the third text information extraction model is used as the target text information extraction model.
[0018] Furthermore, the preprocessing of all collected corpus data to obtain a sample dataset is specifically as follows:
[0019] According to a predefined sentence segmentation strategy, each piece of corpus data is split into several sentence segments.
[0020] All sentence segments are deduplicated to obtain several pieces of sample data, and all the sample data are used as the sample dataset.
[0021] Furthermore, the step of dividing the sample dataset into a training dataset, a validation dataset, and a test dataset specifically involves:
[0022] The BIO annotation system is used to annotate each sample data in the sample dataset to obtain the label of each sample data;
[0023] According to a preset data allocation ratio, the sample dataset is divided into the training dataset, the validation dataset, and the test dataset.
[0024] Furthermore, the evaluation index value of the second text information extraction model or the third text information extraction model is the F1 score:
[0025] $$F1 = \frac{2TP}{2TP + FP + FN}$$
[0026] where TP represents the number of pieces of target information correctly predicted by the second or third text information extraction model, FP represents the number of pieces of target information incorrectly predicted by the second or third text information extraction model, and FN represents the number of pieces of target information missed by the second or third text information extraction model.
[0027] Furthermore, the step of extracting target information from the text data to be extracted by combining the start list and the end list specifically involves:
[0028] For the start list, iterate through the semantic category of each piece of information in the text data to be extracted. When the semantic category of the current information belongs to the entity category, search for the matching information of the current information in the end list. Take the information between the position of the current information and the position of the matching information as the target information and extract the target information from the text data to be extracted.
[0029] The matching information is information located after the current information and whose semantic category is the same as that of the current information.
[0030] Furthermore, the semantic categories include non-entity categories and several entity categories.
[0031] In a second aspect, an embodiment of the present invention provides an apparatus for rapid text information extraction, comprising:
[0032] The target model acquisition module is used to acquire the target text information extraction model;
[0033] The semantic encoding processing module is used to perform semantic encoding processing on the text data to be extracted through the encoder network of the target text information extraction model to obtain the semantic encoding features in the text data to be extracted.
[0034] The first semantic decoding processing module is used to perform first semantic decoding processing on the semantic encoding features through the first pointer network of the target text information extraction model, and generate a start list containing semantic categories of all information in the text data to be extracted;
[0035] The second semantic decoding processing module is used to perform second semantic decoding processing on the semantic encoding features through the second pointer network of the target text information extraction model, and generate an end list containing semantic categories of all information in the text data to be extracted;
[0036] The target information extraction module is used to extract target information from the text data to be extracted by combining the start list and the end list.
[0037] In a third aspect, an embodiment of the present invention provides a device for rapid text information extraction, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the memory being coupled to the processor, and the processor implementing the rapid text information extraction method described above when executing the computer program.
[0038] In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to perform the rapid text information extraction method described above.
[0039] In a fifth aspect, an embodiment of the present invention provides a computer program product that, when run on a computer, causes the computer to execute the rapid text information extraction method described above.
[0040] The embodiments of the present invention have the following beneficial effects:
[0041] A target text information extraction model is obtained. The encoder network of the model performs semantic encoding processing on the text data to be extracted to obtain its semantic encoding features. The first pointer network performs first semantic decoding processing on the semantic encoding features to generate a start list containing the semantic categories of all information in the text data to be extracted, and the second pointer network performs second semantic decoding processing on the same features to generate an end list containing the semantic categories of all information in the text data to be extracted. Target information is then extracted from the text data to be extracted by combining the start list and the end list, which completes text information extraction. Compared with the prior art, the embodiments of the present invention use a pre-trained encoder network as the encoder and a first pointer network and a second pointer network as the decoding layer, and perform text information extraction with the resulting target text information extraction model. This improves the training and inference speed of the text information extraction model and achieves rapid extraction of information from text.
Attached Figure Description
[0042] Figure 1 is a flowchart of a method for rapid text information extraction according to the first embodiment of the present invention;
[0043] Figure 2 is a schematic diagram of the target text information extraction model exemplified in the first embodiment of the present invention;
[0044] Figure 3 is a flowchart of obtaining the target text information extraction model in the first embodiment of the present invention;
[0045] Figure 4 is a schematic diagram of an apparatus for rapid text information extraction according to the second embodiment of the present invention.
Detailed Implementation
[0046] The technical solutions of this invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0047] It should be noted that the step numbers in this document are only for the convenience of explaining the specific embodiments and are not intended to limit the order in which the steps are executed. The method provided in this embodiment can be executed by relevant terminal devices, and the following description uses a processor as the execution subject.
[0048] As shown in Figure 1, the first embodiment provides a method for rapid text information extraction, comprising steps S1 to S5:
[0049] S1. Obtain the target text information extraction model;
[0050] S2. Perform semantic encoding processing on the text data to be extracted through the encoder network of the target text information extraction model to obtain the semantic encoding features of the text data to be extracted.
[0051] S3. Perform first semantic decoding processing on the semantic encoding features through the first pointer network of the target text information extraction model to generate a start list containing semantic categories of all information in the text data to be extracted.
[0052] S4. Perform second semantic decoding processing on the semantic encoding features through the second pointer network of the target text information extraction model to generate an end list containing semantic categories of all information in the text data to be extracted.
[0053] S5. Combine the start list and the end list to extract the target information from the text data to be extracted.
[0054] As an example, in step S1, a target text information extraction model is obtained, wherein the target text information extraction model uses a pre-trained encoder network as the encoder and a first pointer network and a second pointer network as the decoding layer. A schematic diagram of the target text information extraction model is shown in Figure 2.
[0055] In step S2, the text data to be extracted is input into the target text information extraction model. The encoder network of the target text information extraction model performs semantic encoding processing on the text data to be extracted to obtain the semantic encoding features in the text data to be extracted.
[0056] In step S3, the semantic encoding features output by the encoder network of the target text information extraction model are input into the first pointer network of the target text information extraction model. The first pointer network performs the first semantic decoding processing on the semantic encoding features to generate a start list containing semantic categories of all information in the text data to be extracted.
[0057] In step S4, the semantic encoding features output by the encoder network of the target text information extraction model are input into the second pointer network of the target text information extraction model. The second pointer network of the target text information extraction model performs second semantic decoding processing on the semantic encoding features to generate an end list containing semantic categories of all information in the text data to be extracted.
[0058] In step S5, the start and end positions of the target information are determined by combining the start list and the end list. Based on the start and end positions of the target information, the target information is extracted from the text data to be extracted, thus completing the text information extraction.
[0059] This embodiment uses a pre-trained encoder network as the encoder and a first pointer network and a second pointer network as the decoding layer to obtain the target text information extraction model. Based on the target text information extraction model, text information extraction is performed, which can improve the training and inference speed of the text information extraction model and realize the rapid extraction of information from the text.
[0060] In a preferred embodiment, after obtaining the target text information extraction model, the method further includes: deploying the target text information extraction model using a preset inference deployment tool.
[0061] As an example, a target text information extraction model is deployed using a preset inference deployment tool to obtain a deployed target text information extraction model, which is then used to extract target information from the text data to be extracted.
[0062] The available inference deployment tools mainly include the mobile platform deployment tool NCNN, OpenVINO, TensorRT, MediaPipe, and the Open Neural Network Exchange (ONNX) format.
[0063] For example, when the preset inference deployment tool is the ONNX format, the torch.onnx.export function can be called directly after the target text information extraction model is obtained to convert it into an ONNX model, and the target information is then extracted from the text data to be extracted based on the ONNX model.
[0064] It can be understood that ONNX is a model IR (Intermediate Representation) format used for conversion between various deep learning training and inference frameworks. ONNX defines a standard format that is independent of environment and platform, which improves the interoperability of AI models and makes it highly open.
[0065] This embodiment uses a preset inference deployment tool to deploy the target text information extraction model, which can further improve the inference speed of the target text information extraction model and better achieve rapid extraction of text information.
[0066] As shown in Figure 3, in a preferred embodiment, obtaining the target text information extraction model specifically includes steps S11 to S16:
[0067] S11. Preprocess all collected corpus data to obtain a sample dataset, and divide the sample dataset into a training dataset, a validation dataset, and a test dataset.
[0068] S12. A first text information extraction model is established by using a pre-trained encoder network as the encoder and a first pointer network and a second pointer network as the decoding layer.
[0069] S13. Train the first text information extraction model based on the training dataset, and fine-tune the parameters of the encoder network to obtain the second text information extraction model.
[0070] S14. When the cumulative number of training iterations in the current training round reaches the preset training iteration threshold, validate each second text information extraction model in the current training round on the validation dataset, and obtain the evaluation index value of each second text information extraction model.
[0071] S15. When the cumulative number of training rounds reaches the preset training round threshold, select the second text information extraction model with the largest evaluation index value as the third text information extraction model, and test the third text information extraction model according to the test dataset to obtain the evaluation index value of the third text information extraction model.
[0072] S16. When the evaluation index value of the third text information extraction model reaches the preset evaluation index threshold, the third text information extraction model shall be used as the target text information extraction model.
[0073] As an example, in step S11, according to the actual information extraction task, corpus data is collected, such as legal documents and financial contracts from multiple fields of the Internet. All collected corpus data is preprocessed to obtain a sample dataset, and the sample dataset is divided into a training dataset, a validation dataset, and a test dataset.
[0074] In step S12, a pre-trained encoder network, such as the BERT model, is used as the encoder, and two pointer networks, the first pointer network and the second pointer network, are used as the decoding layer. The first text information extraction model is established by combining the BERT model, the first pointer network and the second pointer network.
[0075] It is understandable that the BERT model is used to obtain semantic vectors containing rich semantic information in the input data. In this embodiment, a pre-trained Chinese BERT-WWM model can be selected as the encoder to provide features for subsequent information extraction tasks.
[0076] The first pointer network and the second pointer network are used to predict the start positions and end positions, respectively, of all information in the input data. Each pointer network is a linear layer. In the basic form, both pointer networks perform binary classification on the semantic vectors output by the BERT model: the first pointer network predicts 1 if a position in the input data is the start position of a piece of information and 0 otherwise, and the second pointer network predicts 1 if a position is the end position of a piece of information and 0 otherwise. To predict the category of the information while extracting it, the two pointer networks are improved so that the first predicts the start position and category of all information in the input data and the second predicts the end position and category. This turns the binary classification into a multi-class classification: assuming there are n categories of information, each pointer network has n+1 output classes. When the first pointer network predicts 0, the position is not the start position of any information to be extracted, i.e., it is not an entity; when it predicts a value from 1 to n, the position is the start position of information of the corresponding category, i.e., it is an entity of that category. Similarly, when the second pointer network predicts 0, the position is not the end position of any information to be extracted; when it predicts a value from 1 to n, the position is the end position of information of the corresponding category.
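As an illustrative sketch of the multi-class pointer network, the following plain-Python example applies one linear layer per position and takes the highest-scoring class. It is not the implementation of this embodiment: the names `linear` and `predict_categories`, the toy weights, and the 2-dimensional feature vectors are assumptions chosen for readability, whereas a real model would use the BERT semantic vectors and learned weights.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Apply one linear layer: one output score per class (n + 1 classes)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def predict_categories(features: List[Vector], weight: Matrix, bias: Vector) -> List[int]:
    """Return, for each position, the index of the highest-scoring class.

    Class 0 means "not a boundary"; classes 1..n are entity categories.
    """
    result = []
    for x in features:
        scores = linear(x, weight, bias)
        result.append(max(range(len(scores)), key=scores.__getitem__))
    return result


# Toy stand-in for encoder output: 4 positions, hidden size 2 (hypothetical values).
features = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# n = 2 entity categories, so the head has n + 1 = 3 output classes.
start_weight = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
start_bias = [1.0, 0.0, 0.0]

start_list = predict_categories(features, start_weight, start_bias)
print(start_list)
```

The second pointer network would have the same shape with its own weights and would produce the end list in the same way.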
[0077] A first text information extraction model is established by combining the BERT model, the first pointer network, and the second pointer network.
[0078] In step S13, training round conditions are defined according to a preset training round threshold and the preset training iteration threshold corresponding to each training round. Under these conditions, the first text information extraction model is trained on the training dataset in each training round for the number of iterations given by the corresponding threshold, and the parameters of the BERT model are fine-tuned, yielding several second text information extraction models.
[0079] It is understandable that after training the first text information extraction model based on the training dataset each time and fine-tuning the parameters of the BERT model, a second text information extraction model will be obtained.
[0080] In step S14, after each training iteration, it is determined whether the cumulative number of training iterations in the current training round has reached the preset training iteration threshold. If not, the next training iteration proceeds. If so, each second text information extraction model in the current training round is validated on the validation dataset: the validation dataset is input into each second text information extraction model in the current training round to obtain its prediction results, and the prediction results are evaluated to obtain the evaluation index value of each second text information extraction model.
[0081] In step S15, after each training round, it is determined whether the current cumulative training rounds have reached the preset training round threshold. If the current cumulative training rounds have not reached the preset training round threshold, the next training round continues. If the current cumulative training rounds have reached the preset training round threshold, the second text information extraction model with the largest evaluation index value is selected as the third text information extraction model. The third text information extraction model is then tested based on the test dataset, i.e., the test dataset is input into the third text information extraction model to obtain the prediction result of the third text information extraction model. The prediction result of the third text information extraction model is then evaluated to obtain the evaluation index value of the third text information extraction model.
[0082] In step S16, it is determined whether the evaluation index value of the third text information extraction model reaches the preset evaluation index threshold. If the evaluation index value of the third text information extraction model does not reach the preset evaluation index threshold, the model is retrained. If the evaluation index value of the third text information extraction model reaches the preset evaluation index threshold, the third text information extraction model is considered to meet the requirements. The third text information extraction model is used as the target text information extraction model. The text data to be extracted is input into the target text information extraction model. The target information is extracted from the text data to be extracted by combining the start list and end list output by the target text information extraction model.
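The control flow of steps S13 to S16 can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of this embodiment: `select_target_model`, `train_once`, and `evaluate` are hypothetical names, one candidate model is validated per training round, both thresholds are assumed to be positive, and `None` is returned where the embodiment would retrain.

```python
from typing import Callable, List, Optional, Tuple


def select_target_model(
    train_once: Callable[[int], object],
    evaluate: Callable[[object, str], float],
    rounds: int,
    iterations_per_round: int,
    metric_threshold: float,
) -> Optional[object]:
    """Train for `rounds` rounds, validate each round, test the best candidate."""
    candidates: List[Tuple[float, object]] = []
    step = 0
    for _ in range(rounds):
        model = None
        for _ in range(iterations_per_round):
            model = train_once(step)  # one training iteration (S13)
            step += 1
        # Iteration threshold reached: validate this round's model (S14).
        candidates.append((evaluate(model, "validation"), model))
    # Round threshold reached: keep the candidate with the largest metric (S15).
    _, best_model = max(candidates, key=lambda pair: pair[0])
    test_score = evaluate(best_model, "test")
    # Accept only if the test metric reaches the threshold (S16).
    return best_model if test_score >= metric_threshold else None


# Toy stand-ins: a "model" is just the step index at which it was produced,
# and the same made-up score is returned for both validation and test.
scores = {1: 0.5, 3: 0.9, 5: 0.7}
chosen = select_target_model(
    train_once=lambda step: step,
    evaluate=lambda model, split: scores[model],
    rounds=3,
    iterations_per_round=2,
    metric_threshold=0.8,
)
print(chosen)
```

In practice `train_once` would run one optimization step that also fine-tunes the encoder, and `evaluate` would compute the evaluation index on the named dataset.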
[0083] This embodiment establishes a first text information extraction model by using a pre-trained encoder network as the encoder and a first pointer network and a second pointer network as the decoding layer. The first text information extraction model is trained, validated, tested, and evaluated based on the training dataset, validation dataset, and test dataset. Text information is extracted based on the final target text information extraction model, which can improve the training and inference speed of the text information extraction model and achieve rapid extraction of information from text.
[0084] In a preferred embodiment, the preprocessing of all collected corpus data to obtain a sample dataset specifically involves: splitting each piece of corpus data into several sentence segments according to a predefined sentence segmentation strategy; deduplicating all sentence segments to obtain several pieces of sample data; and using all the sample data as the sample dataset.
[0085] As an example, according to a predefined sentence segmentation strategy, such as splitting at punctuation marks such as periods and commas, each piece of corpus data is split into several sentence segments, and all sentence segments are deduplicated to obtain several pieces of sample data. All the sample data are then combined into the sample dataset.
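The segmentation and deduplication described above can be sketched in a few lines. This is a minimal illustration only: the function name `build_sample_dataset` and the delimiter set (Chinese and ASCII periods and commas) are assumptions, and a production strategy would need to avoid splitting inside abbreviations such as company suffixes.

```python
import re
from typing import Iterable, List

# Assumed delimiter set: Chinese and ASCII periods and commas.
_DELIMITERS = re.compile(r"[。，.,]")


def build_sample_dataset(corpus: Iterable[str]) -> List[str]:
    """Split each document at punctuation, then drop duplicate segments."""
    seen = set()
    samples = []
    for document in corpus:
        for segment in _DELIMITERS.split(document):
            segment = segment.strip()
            # Keep only non-empty segments that have not been seen before.
            if segment and segment not in seen:
                seen.add(segment)
                samples.append(segment)
    return samples


corpus = [
    "Party A pays the fee, Party B ships the goods.",
    "Party B ships the goods.",
]
samples = build_sample_dataset(corpus)
print(samples)
```

The order of first appearance is preserved, which keeps the resulting sample dataset reproducible.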
[0086] This embodiment preprocesses all corpus data by employing sentence segmentation and deduplication, which shortens the length and reduces the amount of sample data. This improves the processing efficiency of the subsequent text information extraction model and further enhances the inference speed of the target text information extraction model, enabling faster extraction of text information.
[0087] In a preferred embodiment, dividing the sample dataset into a training dataset, a validation dataset, and a test dataset specifically involves: using the BIO annotation system to annotate each sample data in the sample dataset to obtain the label of each sample data; and dividing the sample dataset into a training dataset, a validation dataset, and a test dataset according to a preset data allocation ratio.
[0088] In a preferred embodiment of this example, the ratio of the data volume of the training dataset, the validation dataset, and the test dataset is 8:1:1.
[0089] As an example, the BIO annotation system is used to annotate each sample data in the sample dataset to obtain the labels for each sample data. For example, if the text is "Party A: Zigong Fire and Rescue Brigade, Party B: Sichuan Deman Automobile Sales and Service Co., Ltd.", the label is: "O O O B-Party A I-Party A I-Party A I-Party A I-Party A I-Party A I-Party A I-Party A I-Party A O O O O B-Party B I ...
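As a hedged illustration of BIO annotation, the sketch below builds a label sequence from entity spans. The function name `bio_labels` and the span format (start index, inclusive end index, category) are assumptions made for this example; the embodiment does not specify how annotations are stored.

```python
from typing import List, Tuple


def bio_labels(length: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Build BIO labels for `length` tokens from (start, end, category) spans.

    `end` is inclusive; tokens outside every span are labeled "O".
    """
    labels = ["O"] * length
    for start, end, category in spans:
        labels[start] = f"B-{category}"
        for i in range(start + 1, end + 1):
            labels[i] = f"I-{category}"
    return labels


# Three prefix tokens, then a nine-token entity of category "Party A".
labels = bio_labels(12, [(3, 11, "Party A")])
print(" ".join(labels))
```

The first token of an entity receives the `B-` prefix, every following token of the same entity receives `I-`, and everything else is `O`.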
[0090] Based on a preset data allocation ratio, such as a training data volume: validation data volume: test data volume = 8:1:1, the entire sample dataset is divided into a training dataset, a validation dataset, and a test dataset.
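The 8:1:1 division can be sketched as follows. The function name `split_dataset`, the seeded shuffle before splitting, and the integer rounding are assumptions of this example; the embodiment only specifies the allocation ratio.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    samples: Sequence[str],
    ratio: Tuple[int, int, int] = (8, 1, 1),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle the samples and cut them into train/validation/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed: shuffle before splitting
    total = sum(ratio)
    train_end = len(shuffled) * ratio[0] // total
    valid_end = train_end + len(shuffled) * ratio[1] // total
    # Any remainder from integer division falls into the test part.
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


all_samples = [f"sample-{i}" for i in range(100)]
train, valid, test = split_dataset(all_samples)
print(len(train), len(valid), len(test))
```

Fixing the seed makes the division repeatable, so the same validation and test data are used across training runs.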
[0091] In a preferred embodiment, the evaluation index value of the second or third text information extraction model is the F1 score:
[0092] $$F1 = \frac{2TP}{2TP + FP + FN}$$
[0093] where TP represents the number of pieces of target information correctly predicted by the second or third text information extraction model, FP represents the number of pieces of target information incorrectly predicted by the second or third text information extraction model, and FN represents the number of pieces of target information missed by the second or third text information extraction model.
[0094] As an example, the prediction results of the second or third text information extraction model are evaluated according to a predefined evaluation index F1, resulting in the evaluation index value of the second or third text information extraction model, i.e.:
[0095] $$F1 = \frac{2TP}{2TP + FP + FN} \qquad (1)$$
[0096] In equation (1), TP represents the number of pieces of target information correctly predicted by the second or third text information extraction model, FP represents the number of pieces of target information incorrectly predicted by the second or third text information extraction model, and FN represents the number of pieces of target information missed by the second or third text information extraction model.
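For illustration, the F1 evaluation index can be computed from TP, FP, and FN as F1 = 2TP / (2TP + FP + FN). The sketch below assumes that a prediction counts as correct only when its start, end, and category all match the annotation exactly; that matching rule and the names `f1_score` and `count_outcomes` are assumptions of this example.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); 0.0 when nothing is predicted or expected."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def count_outcomes(predicted: set, gold: set) -> tuple:
    """Count TP, FP, FN by comparing (start, end, category) spans exactly."""
    tp = len(predicted & gold)
    return tp, len(predicted - gold), len(gold - predicted)


gold = {(3, 11, "Party A"), (16, 30, "Party B")}
predicted = {(3, 11, "Party A"), (16, 29, "Party B")}  # second span ends too early

tp, fp, fn = count_outcomes(predicted, gold)
score = f1_score(tp, fp, fn)
print(tp, fp, fn, score)
```

Here one span is correct, one is wrong, and one is missed, so the score is 2 / (2 + 1 + 1) = 0.5.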
[0097] This embodiment evaluates the prediction results of the second and third text information extraction models by pre-defining the evaluation index F1. This allows for a rapid assessment of the extraction performance of the second and third text information extraction models. Selecting the text information extraction model with the best extraction performance as the target text information extraction model is beneficial for further improving the inference speed of the target text information extraction model and achieving faster text information extraction.
[0098] In a preferred embodiment, the step of extracting target information from the text data to be extracted by combining the start list and the end list specifically involves: for the start list, traversing the semantic category of each piece of information in the text data to be extracted; when the semantic category of the current information belongs to the entity category, searching for matching information of the current information in the end list; and taking the information between the position of the current information and the position of the matching information as target information, and extracting the target information from the text data to be extracted; wherein, the matching information is the information whose position is after the position of the current information and whose semantic category is the same as the semantic category of the current information.
[0099] As an example, after the target text information extraction model is obtained, the text data to be extracted is input into it. The encoder network of the target text information extraction model, such as a BERT model, performs semantic encoding processing on the text data to be extracted and outputs semantic vectors containing rich semantic information as the semantic encoding features. The first pointer network and the second pointer network of the target text information extraction model then perform the corresponding semantic decoding processing on the semantic encoding features: based on the semantic vectors output by the BERT model, each performs binary classification, yielding a start list and an end list respectively. The start list stores the semantic category of every piece of information in the text data to be extracted, and so does the end list. In both lists each semantic category takes a value in {0, 1}, where 0 denotes the non-entity category and 1 denotes the entity category.
[0100] For the start list, iterate through the semantic category of each piece of information in the text data to be extracted and determine whether the semantic category of the current piece of information belongs to the entity category. If it does not, i.e., the value is 0, continue with the semantic category of the next piece of information. If it does, i.e., the value is 1, search the end list, according to the nearest matching principle, for the matching information of the current piece of information, that is, the nearest information whose position is after that of the current piece of information and whose semantic category belongs to the same entity category. The information between the position of the current piece of information and the position of the matching information is then taken as the target information and extracted from the text data to be extracted.
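The traversal described above can be sketched for the binary case as follows. The function name `extract_spans_binary` and the sample lists are hypothetical. The sketch also assumes that the search in the end list begins at the start position itself, so that a single-character entity can be matched; the embodiment only states that the matching information lies after the current information.

```python
def extract_spans_binary(text, start_list, end_list):
    """Pair every start flag with the nearest end flag and return the spans."""
    results = []
    for i, flag in enumerate(start_list):
        if flag != 1:                        # 0 = non-entity, keep traversing
            continue
        for j in range(i, len(end_list)):    # assumed: search from position i onward
            if end_list[j] == 1:             # nearest matching end position
                results.append(text[i:j + 1])
                break
    return results


text = "xxABCxxDEx"
start_list = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
end_list = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
print(extract_spans_binary(text, start_list, end_list))  # ['ABC', 'DE']
```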
[0101] This embodiment inputs the text data to be extracted into the target text information extraction model. The encoder network obtains semantic vectors containing rich semantic information from the text data to be extracted. The first pointer network and the second pointer network perform binary classification based on the semantic vectors output by the encoder network. This can improve the inference speed of the target text information extraction model and achieve rapid extraction of text information.
[0102] In a preferred embodiment, the semantic categories include non-entity categories and several entity categories.
[0103] As an example, after the target text information extraction model is obtained, the text data to be extracted is input into it. The encoder network of the target text information extraction model, such as a BERT model, performs semantic encoding processing on the text data to be extracted and outputs semantic vectors containing rich semantic information as the semantic encoding features. The first pointer network and the second pointer network of the target text information extraction model then perform the corresponding semantic decoding processing on the semantic encoding features: based on the semantic vectors output by the BERT model, each performs multi-classification, yielding a start list and an end list respectively. The start list stores the semantic category of every piece of information in the text data to be extracted, and so does the end list. In both lists each semantic category takes a value in {0, 1, ..., n+1}, where 0 denotes the non-entity category and 1, ..., n+1 denote the entity categories.
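The embodiment does not specify the internal structure of the pointer networks. One possible reading, assumed here purely for illustration, is that each pointer network applies a linear classifier to every semantic vector and takes the highest-scoring category. The function name `pointer_head`, the weights, and the toy vectors below are hypothetical, and a real encoder such as BERT would supply far larger vectors.

```python
def pointer_head(vectors, weights, bias):
    """Classify each token vector and return one category id per token.

    weights: one row of coefficients per category; bias: one value per category.
    """
    labels = []
    for vec in vectors:
        scores = [sum(w * x for w, x in zip(row, vec)) + b
                  for row, b in zip(weights, bias)]
        labels.append(max(range(len(scores)), key=scores.__getitem__))  # argmax
    return labels


# Three toy "semantic vectors" and a head with categories 0 (non-entity), 1 and 2.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.1]]
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bias = [0.5, 0.0, 0.0]
print(pointer_head(vectors, weights, bias))  # [1, 2, 0]
```

Under this reading, running the function once with the parameters of the first pointer network and once with those of the second would produce the start list and the end list.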
[0104] Iterate through the semantic category of each piece of information in the start list and determine whether the semantic category of the current piece of information belongs to an entity category. If it does not, i.e., the value is 0, continue with the semantic category of the next piece of information. If it does, i.e., the value is any value in {1, ..., n+1}, search the end list, according to the nearest matching principle, for the matching information of the current piece of information, that is, the nearest information whose position is after that of the current piece of information and whose semantic category belongs to the same entity category as that of the current piece of information. The information between the position of the current piece of information and the position of the matching information is then taken as the target information and extracted from the text data to be extracted.
[0105] Because the target text information extraction model sometimes produces unreasonable predictions, a constraint is added to the nearest matching principle: between the start position and the end position of a matching pair, the model may only predict 0, meaning that no intermediate position is itself the start or end position of an entity. A matching pair that does not satisfy this condition is treated as an incorrect prediction and discarded. This post-processing improves the extraction performance of the target text information extraction model.
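The nearest matching principle together with this constraint can be sketched for the multi-category case as follows. The function name `extract_entities`, the tuple layout of the result, and the sample lists are hypothetical. As an assumption, the search in the end list begins at the start position itself so that a single-character entity can be matched.

```python
def extract_entities(text, start_list, end_list):
    """Return (category, start, end, span) for each valid start/end pair."""
    entities = []
    for i, category in enumerate(start_list):
        if category == 0:                    # non-entity category
            continue
        # Nearest end position with the same entity category.
        j = next((k for k in range(i, len(end_list))
                  if end_list[k] == category), None)
        if j is None:
            continue
        # Constraint: every position strictly between i and j must be 0 in both lists.
        if any(start_list[k] != 0 or end_list[k] != 0 for k in range(i + 1, j)):
            continue                         # discard the unreasonable pair
        entities.append((category, i, j, text[i:j + 1]))
    return entities


text = "abcdefgh"
start_list = [1, 0, 0, 2, 0, 2, 0, 0]
end_list = [0, 0, 1, 0, 0, 0, 0, 2]
print(extract_entities(text, start_list, end_list))
# [(1, 0, 2, 'abc'), (2, 5, 7, 'fgh')]
```

In the sample, the start at position 3 is discarded because another start of an entity is predicted at position 5, which lies between it and its nearest matching end.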
[0106] This embodiment inputs the text data to be extracted into the target text information extraction model. The encoder network obtains semantic vectors containing rich semantic information from the text data to be extracted. The first pointer network and the second pointer network perform multi-classification based on the semantic vectors output by the encoder network. This can improve the inference speed of the target text information extraction model and achieve rapid extraction of text information. Furthermore, by post-processing the prediction results of the target text information extraction model, the inference speed of the target text information extraction model can be further improved, thus better achieving rapid extraction of text information.
[0107] Based on the same inventive concept as the first embodiment, the second embodiment provides a text information rapid extraction device, as shown in Figure 4, comprising: a target model acquisition module 21, used to acquire a target text information extraction model; a semantic encoding processing module 22, used to perform semantic encoding processing on the text data to be extracted through the encoder network of the target text information extraction model to obtain semantic encoding features in the text data to be extracted; a first semantic decoding processing module 23, used to perform first semantic decoding processing on the semantic encoding features through the first pointer network of the target text information extraction model to generate a start list containing semantic categories of all information in the text data to be extracted; a second semantic decoding processing module 24, used to perform second semantic decoding processing on the semantic encoding features through the second pointer network of the target text information extraction model to generate an end list containing semantic categories of all information in the text data to be extracted; and a target information extraction module 25, used to combine the start list and the end list to extract target information from the text data to be extracted.
[0108] In a preferred embodiment, the target model acquisition module 21 is further configured to deploy the target text information extraction model using a preset inference deployment tool after the target text information extraction model is acquired.
[0109] In a preferred embodiment, the target model acquisition module 21 specifically includes: a data processing unit, used to preprocess all collected corpus data to obtain a sample dataset, and divide the sample dataset into a training dataset, a validation dataset, and a test dataset; a model building unit, used to establish a first text information extraction model by using a pre-trained encoder network as the encoder and a first pointer network and a second pointer network as the decoding layer; a model training unit, used to train the first text information extraction model according to the training dataset, and fine-tune the parameters of the encoder network to obtain a second text information extraction model; a model validation unit, used to validate, when the cumulative number of training iterations in the current training round reaches a preset training iteration threshold, the second text information extraction model in the current training round according to the validation dataset to obtain the evaluation index value of each second text information extraction model; a model testing unit, used to select the second text information extraction model with the largest evaluation index value as the third text information extraction model when the cumulative number of training rounds reaches a preset training round threshold, and test the third text information extraction model according to the test dataset to obtain the evaluation index value of the third text information extraction model; and a model acquisition unit, used to select the third text information extraction model as the target text information extraction model when the evaluation index value of the third text information extraction model reaches a preset evaluation index threshold.
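The selection logic carried out by the model validation, model testing and model acquisition units can be sketched as below. Training itself is omitted. The function name `select_target_model`, the use of callables for validation and testing, and the checkpoint names and scores are hypothetical and serve only to illustrate the control flow.

```python
def select_target_model(candidates, validate, test, metric_threshold):
    """Pick the candidate with the best validation score, then check it on the test set.

    candidates: the second models collected at each validation point.
    validate, test: callables returning an evaluation index value (for example F1).
    Returns the target model, or None if the test score is below the threshold.
    """
    best_model, best_score = None, float("-inf")
    for model in candidates:
        score = validate(model)
        if score > best_score:
            best_model, best_score = model, score
    if best_model is None:
        return None
    # The best candidate plays the role of the third model and is tested once.
    return best_model if test(best_model) >= metric_threshold else None


val_scores = {"ckpt_a": 0.71, "ckpt_b": 0.88, "ckpt_c": 0.80}  # made-up values
target = select_target_model(val_scores, validate=val_scores.get,
                             test=lambda m: 0.85, metric_threshold=0.8)
print(target)  # ckpt_b
```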
[0110] In a preferred embodiment, the data processing unit is specifically used to: perform sentence segmentation processing on each of the corpus data according to a predefined sentence segmentation strategy to obtain a number of word segmentation data; perform deduplication processing on all the word segmentation data to obtain a number of sample data, and use all the sample data as the sample dataset.
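The preprocessing performed by the data processing unit can be sketched as follows. The predefined sentence segmentation strategy is not given in this passage, so splitting on sentence-ending punctuation is an assumption, as are the function name `build_sample_dataset` and the sample corpus. A punctuation-based rule of this kind would also split inside abbreviations such as "Co., Ltd.", which a real strategy would need to handle.

```python
import re

# Assumed strategy: a sentence is a run of characters ending in one of these marks.
SENTENCE_PATTERN = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def build_sample_dataset(corpus):
    """Split each corpus entry into sentences and drop duplicates, keeping order."""
    seen = set()
    samples = []
    for document in corpus:
        for sentence in SENTENCE_PATTERN.findall(document):
            sentence = sentence.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                samples.append(sentence)
    return samples


corpus = ["Party A pays. Party B delivers.", "Party B delivers. Both sign!"]
print(build_sample_dataset(corpus))
# ['Party A pays.', 'Party B delivers.', 'Both sign!']
```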
[0111] In a preferred embodiment, the data processing unit is specifically used to: use the BIO annotation system to annotate each sample data in the sample dataset to obtain the label of each sample data; and divide the sample dataset into the training dataset, the validation dataset and the test dataset according to a preset data allocation ratio.
[0112] In a preferred embodiment, the evaluation metric value of the second or third text information extraction model is:
[0113] $$F1 = \frac{2 \times P \times R}{P + R}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$
[0114] where P is the precision, R is the recall, TP represents the number of target information correctly predicted by the second or third text information extraction model, FP represents the number of target information incorrectly predicted by the second or third text information extraction model, and FN represents the number of target information missed by the second or third text information extraction model.
[0115] In a preferred embodiment, the target information extraction module 25 is specifically used to traverse the semantic category of each piece of information in the text data to be extracted for the start list. When the semantic category of the current information belongs to the entity category, the matching information of the current information is searched from the end list. The information between the position of the current information and the position of the matching information is taken as target information and the target information is extracted from the text data to be extracted. The matching information is the information whose position is after the position of the current information and whose semantic category is the same as that of the current information.
[0116] In a preferred embodiment, the semantic categories include non-entity categories and several entity categories.
[0117] Based on the same inventive concept as the first embodiment, the third embodiment provides a text information rapid extraction device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. The memory is coupled to the processor, and when the processor executes the computer program, it implements the text information rapid extraction method described in the first embodiment and achieves the same beneficial effects.
[0118] Based on the same inventive concept as the first embodiment, the fourth embodiment provides a computer-readable storage medium including a stored computer program, wherein, when the computer program runs, it controls the device where the computer-readable storage medium is located to execute the text information rapid extraction method described in the first embodiment and achieves the same beneficial effects.
[0119] Based on the same inventive concept as the first embodiment, the fifth embodiment provides a computer program product that, when run on a computer, enables the computer to execute the text information rapid extraction method described in the first embodiment and achieves the same beneficial effects.
[0120] In summary, implementing the embodiments of the present invention has the following beneficial effects:
[0121] By acquiring a target text information extraction model; using the encoder network of the target text information extraction model to perform semantic encoding processing on the text data to be extracted, semantic encoding features in the text data to be extracted are obtained; using the first pointer network of the target text information extraction model to perform first semantic decoding processing on the semantic encoding features, a start list containing semantic categories of all information in the text data to be extracted is generated; using the second pointer network of the target text information extraction model to perform second semantic decoding processing on the semantic encoding features, an end list containing semantic categories of all information in the text data to be extracted is generated; combining the start list and the end list, target information is extracted from the text data to be extracted, thus completing text information extraction. In this embodiment of the invention, a pre-trained encoder network is used as the encoder, and the first and second pointer networks are used as decoding layers to acquire the target text information extraction model. Text information extraction based on the target text information extraction model can improve the training and inference speed of the text information extraction model, enabling rapid extraction of information from text.
[0122] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.
[0123] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes described in the above embodiments. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
Claims
1. A method for rapid extraction of text information, characterized in that it comprises: acquiring a target text information extraction model; performing semantic encoding processing on the text data to be extracted through the encoder network of the target text information extraction model to obtain semantic encoding features in the text data to be extracted; performing first semantic decoding processing on the semantic encoding features through the first pointer network of the target text information extraction model to generate a start list containing semantic categories of all information in the text data to be extracted, wherein the semantic categories include a non-entity category and several entity categories; performing second semantic decoding processing on the semantic encoding features through the second pointer network of the target text information extraction model to generate an end list containing semantic categories of all information in the text data to be extracted; and combining the start list and the end list to extract target information from the text data to be extracted; wherein the step of extracting target information from the text data to be extracted by combining the start list and the end list specifically comprises: for the start list, traversing the semantic category of each piece of information in the text data to be extracted; when the semantic category of the current information belongs to an entity category, searching the end list for the matching information of the current information; and taking the information between the position of the current information and the position of the matching information as the target information and extracting the target information from the text data to be extracted; wherein the matching information is information whose position is after that of the current information and whose semantic category is the same as that of the current information.
2. The method for rapid text information extraction as described in claim 1, characterized in that acquiring the target text information extraction model specifically comprises: preprocessing all collected corpus data to obtain a sample dataset, and dividing the sample dataset into a training dataset, a validation dataset, and a test dataset; establishing a first text information extraction model by using a pre-trained encoder network as the encoder and a first pointer network and a second pointer network as the decoding layer; training the first text information extraction model based on the training dataset, and fine-tuning the parameters of the encoder network to obtain a second text information extraction model; when the cumulative number of training iterations in the current training round reaches a preset training iteration threshold, validating the second text information extraction model in the current training round according to the validation dataset to obtain the evaluation index value of each second text information extraction model; when the cumulative number of training rounds reaches a preset training round threshold, selecting the second text information extraction model with the largest evaluation index value as the third text information extraction model, and testing the third text information extraction model according to the test dataset to obtain the evaluation index value of the third text information extraction model; and when the evaluation index value of the third text information extraction model reaches a preset evaluation index threshold, using the third text information extraction model as the target text information extraction model.
3. The method for rapid text information extraction as described in claim 2, characterized in that, The process of preprocessing all collected corpus data to obtain a sample dataset is as follows: According to the predefined sentence segmentation strategy, each of the corpus data is segmented into sentences to obtain several word segmentation data. All the segmented data are deduplicated to obtain several sample data, and all the sample data are used as the sample dataset.
4. The method for rapid text information extraction as described in claim 2, characterized in that, The process of dividing the sample dataset into training, validation, and test datasets is as follows: The BIO annotation system is used to annotate each sample data in the sample dataset to obtain the label of each sample data; According to a preset data allocation ratio, the sample dataset is divided into the training dataset, the validation dataset, and the test dataset.
5. The method for rapid text information extraction as described in claim 2, characterized in that the evaluation metric value of the second text information extraction model or the third text information extraction model is: $$F1 = \frac{2 \times P \times R}{P + R}$$; where $$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$, P is the precision, R is the recall, TP represents the number of target information correctly predicted by the second text information extraction model or the third text information extraction model, FP represents the number of target information incorrectly predicted by the second text information extraction model or the third text information extraction model, and FN represents the number of target information missed by the second text information extraction model or the third text information extraction model.
6. A device for rapid extraction of text information, characterized in that it comprises: a target model acquisition module, used to acquire a target text information extraction model; a semantic encoding processing module, used to perform semantic encoding processing on the text data to be extracted through the encoder network of the target text information extraction model to obtain the semantic encoding features in the text data to be extracted; a first semantic decoding processing module, used to perform first semantic decoding processing on the semantic encoding features through the first pointer network of the target text information extraction model to generate a start list containing semantic categories of all information in the text data to be extracted, wherein the semantic categories include a non-entity category and several entity categories; a second semantic decoding processing module, used to perform second semantic decoding processing on the semantic encoding features through the second pointer network of the target text information extraction model to generate an end list containing semantic categories of all information in the text data to be extracted; and a target information extraction module, used to extract target information from the text data to be extracted by combining the start list and the end list; wherein the target information extraction module is specifically used to: for the start list, traverse the semantic category of each piece of information in the text data to be extracted; when the semantic category of the current information belongs to an entity category, search the end list for the matching information of the current information; and take the information between the position of the current information and the position of the matching information as the target information and extract the target information from the text data to be extracted; wherein the matching information is information whose position is after that of the current information and whose semantic category is the same as that of the current information.
7. A device for rapid extraction of text information, characterized in that the device comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the memory being coupled to the processor, and the processor implementing the text information rapid extraction method as described in any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein, when the computer program is executed, it controls the device containing the computer-readable storage medium to perform the text information rapid extraction method as described in any one of claims 1 to 5.
9. A computer program product, characterized in that, when the computer program product is run on a computer, it causes the computer to perform the text information rapid extraction method as described in any one of claims 1 to 5.