Method and equipment for establishing a text extraction model based on regular expressions
A technology relating to regular expressions and extraction models, applied in special data processing applications, instruments, and electrical digital data processing. It addresses the problems of unstable extraction results, hard-to-predict accuracy, and insufficient extraction accuracy, and it reduces labor cost and time while improving extraction results.
Examples
Embodiment 1
[0037] Referring to Figure 1, a method for establishing a text extraction model based on regular expressions comprises the following steps:
[0038] S1. Write several regular expressions;
[0039] S2. Extract training corpus data from the source corpus according to the regular expressions;
[0040] S3. Divide the extracted corpus into a training set (80%) and a verification set (20%);
[0041] S4. Build a text extraction model;
[0042] S5. Input the training set into the text extraction model, and train the text extraction model;
[0043] S6. Input the verification set into the trained text extraction model, and verify the trained text extraction model.
[0044] The beneficial effect of this embodiment is that writing a small number of regular expressions replaces manual labeling, which effectively reduces the labor cost and time required to build a model.
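Steps S1 to S3 can be illustrated with a minimal sketch. The patent does not specify the patterns, the tokenization, or the label scheme, so everything here is an assumption: the two patterns in `PATTERNS`, whitespace tokenization, BIO tags, and the helper names `label_sentence`, `build_corpus`, and `split_corpus` are all hypothetical.

```python
import random
import re

# S1 (assumed patterns): each label maps to a regex whose group 1 is the value to extract.
PATTERNS = {
    "APP_NO": re.compile(r"application number[:\s]+([A-Z]{2}\d+)"),
    "DATE": re.compile(r"filed on (\d{4}-\d{2}-\d{2})"),
}


def label_sentence(text):
    """S2: tokenize on whitespace and derive BIO tags from regex matches."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            start, end = match.span(1)
            inside = False
            for i, (_, tok_start, tok_end) in enumerate(tokens):
                if tok_start < end and tok_end > start:  # token overlaps the matched span
                    tags[i] = ("I-" if inside else "B-") + label
                    inside = True
    return [tok for tok, _, _ in tokens], tags


def build_corpus(texts):
    """S2: keep only sentences in which at least one regex matched."""
    corpus = []
    for text in texts:
        tokens, tags = label_sentence(text)
        if any(tag != "O" for tag in tags):
            corpus.append((tokens, tags))
    return corpus


def split_corpus(corpus, train_ratio=0.8, seed=0):
    """S3: shuffle, then split into training and verification sets."""
    shuffled = corpus[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

In this sketch the regular expressions act only as an automatic labeler; sentences with no match are dropped rather than treated as negative examples, which is one possible design choice among several.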
Embodiment 2
[0046] Further, the text extraction model is a conditional random field (CRF) model.
[0047] In this embodiment, the open source "python-crfsuite" development kit is used to construct the CRF model.
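As a rough illustration of how a CRF model might be built with "python-crfsuite", the sketch below assumes the training set is a list of `(tokens, tags)` pairs. The feature set in `token_features`, the hyperparameter values, the model file name, and the helper names are assumptions for illustration and are not taken from the patent.

```python
def token_features(tokens, i):
    """Feature dict for token i (an illustrative, assumed feature set)."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_digit": word.isdigit(),
        "is_upper": word.isupper(),
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens):
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_crf(train_set, model_path="extractor.crfsuite"):
    """Train a CRF on (tokens, tags) pairs and return a tagger for the saved model."""
    import pycrfsuite  # provided by the python-crfsuite package

    trainer = pycrfsuite.Trainer(verbose=False)
    for tokens, tags in train_set:
        trainer.append(sentence_features(tokens), tags)
    # Assumed hyperparameters: L1/L2 regularization and an iteration cap.
    trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 100})
    trainer.train(model_path)

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger
```

Once trained, `tagger.tag(sentence_features(tokens))` returns one predicted tag per token, which can then be compared against the verification set.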
[0048] The advantage of this embodiment is that it combines the strengths of regular expressions and CRF models, so key information in the text can be extracted efficiently and accurately. Specifically:
[0049] Because regular expressions work well on templated text, the present invention performs better on text from fields with fixed templates, such as auditing and patents. At the same time, the text extraction model performs the final information extraction, so extraction is not limited to information that follows a strict template.
Embodiment 3
[0051] Further, a threshold is set for the CRF model (90% in this embodiment); if the accuracy of the model is lower than the threshold, return to step S1.
[0052] The improvement of this embodiment is that, when the threshold is not met, a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model. This effectively improves the extraction results of the CRF model, and the rules written in earlier rounds are not discarded.
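The threshold check can be sketched as follows. The patent does not state how accuracy is measured, so token-level accuracy over the verification set is an assumption here, and the function names `token_accuracy` and `needs_more_rules` are hypothetical.

```python
def token_accuracy(gold_sequences, predicted_sequences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = 0
    correct = 0
    for gold, predicted in zip(gold_sequences, predicted_sequences):
        for gold_tag, predicted_tag in zip(gold, predicted):
            total += 1
            correct += gold_tag == predicted_tag
    return correct / total if total else 0.0


def needs_more_rules(gold_sequences, predicted_sequences, threshold=0.90):
    """True when accuracy is below the threshold, i.e. return to step S1."""
    return token_accuracy(gold_sequences, predicted_sequences) < threshold
```

When `needs_more_rules` returns `True`, new regular expressions would be appended to the existing set rather than replacing it, and steps S1 to S6 run again.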