Method for establishing text extraction model based on regular expression, and equipment

A regular-expression and model technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses unstable extraction results, hard-to-predict accuracy, and unsuitability for scenarios with strict accuracy requirements, reducing labor cost and time while improving the extraction effect.

Pending Publication Date: 2021-10-22
FUJIAN YIRONG INFORMATION TECH +2

AI Technical Summary

Problems solved by technology

The accuracy of a CRF model is not determined by the model itself; it depends mainly on whether the labeled corpus used for training is consistent with the target test corpus. A large manually labeled corpus must be prepared in advance, the extraction effect is unstable, and the accuracy is difficult to predict, so the approach is not suitable for scenarios with stricter requirements on extraction accuracy.


Examples

Embodiment 1

[0037] Referring to Figure 1, a method for establishing a text extraction model based on regular expressions comprises the following steps:

[0038] S1. Write several regular expressions;

[0039] S2. Extract a corpus set from a corpus according to the regular expressions;

[0040] S3. Divide the corpus set into a training set (80%) and a verification set (20%);

[0041] S4. Build a text extraction model;

[0042] S5. Input the training set into the text extraction model, and train the text extraction model;

[0043] S6. Input the verification set into the trained text extraction model, and verify the trained text extraction model.

[0044] The beneficial effect of this embodiment is that by writing a small number of regular expressions instead of manual labeling, the labor cost and time required for building a model are effectively reduced.
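
Steps S1 to S3 can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the two patterns, the label names `APP_NO` and `PUB_DATE`, the helper names, and the fixed random seed are all assumptions made for the example. Each named group in a pattern is treated as the label of the text it captures, so sentences that match a rule become labeled examples without manual annotation.

```python
import random
import re

# S1: a few hand-written rules (hypothetical examples).
# Each named group doubles as the label of the captured span.
PATTERNS = [
    re.compile(r"Application No\.\s*(?P<APP_NO>\d{12}\.\d)"),
    re.compile(r"Publication Date:\s*(?P<PUB_DATE>\d{4}-\d{2}-\d{2})"),
]


def label_sentence(sentence):
    """S2: return (sentence, [(start, end, label), ...]), or None if no rule matches."""
    spans = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            for label, value in match.groupdict().items():
                if value is not None:
                    spans.append((match.start(label), match.end(label), label))
    return (sentence, sorted(spans)) if spans else None


def build_corpus_set(corpus):
    """S2: keep only the sentences that at least one rule could label."""
    labeled = (label_sentence(sentence) for sentence in corpus)
    return [example for example in labeled if example is not None]


def split_corpus_set(corpus_set, train_ratio=0.8, seed=0):
    """S3: shuffle, then split into a training set and a verification set."""
    shuffled = corpus_set[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Sentences that no rule matches are simply left out of the corpus set here; whether the original method keeps them as negative examples is not stated in the text.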

Embodiment 2

[0046] Further, the text extraction model is a CRF model.

[0047] In this embodiment, the open source "python-crfsuite" development kit is used to construct the CRF model.
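
A rough sketch of how regex-labeled examples might be fed to python-crfsuite is shown below. The character-level tokenization, the BIO tagging scheme, the feature set, the hyperparameter values, and the model file name are assumptions chosen for illustration; the text only states that the kit is used to construct the CRF model. The `pycrfsuite` import is kept inside `train_crf` so that the helper functions can be used without the third-party package installed.

```python
def char_features(text, i):
    """Features for the character at position i (a deliberately small, assumed set)."""
    ch = text[i]
    return [
        "bias",
        "char=" + ch,
        "is_digit=%s" % ch.isdigit(),
        "prev=" + (text[i - 1] if i > 0 else "<BOS>"),
        "next=" + (text[i + 1] if i < len(text) - 1 else "<EOS>"),
    ]


def to_bio(text, spans):
    """Turn (start, end, label) character spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def train_crf(examples, model_path="extract.crfsuite"):
    """Train on (text, spans) pairs and return a tagger loaded from model_path."""
    import pycrfsuite  # third-party package: python-crfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for text, spans in examples:
        xseq = [char_features(text, i) for i in range(len(text))]
        trainer.append(xseq, to_bio(text, spans))
    # Assumed regularization and iteration settings, not values from the source.
    trainer.set_params({"c1": 1.0, "c2": 1e-3, "max_iterations": 50})
    trainer.train(model_path)

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger
```

After training, `tagger.tag([char_features(text, i) for i in range(len(text))])` returns one predicted tag per character, which can be compared with the regex-derived tags on the verification set.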

[0048] The advance of this embodiment is that it combines the advantages of regular expressions and CRF models, so that key information in text can be extracted efficiently and accurately. Specifically:

[0049] Owing to the characteristics of regular expressions, the present invention works well on text from fields with fixed templates, such as the audit field and the patent field. At the same time, because the text extraction model performs the final information extraction, the method is not limited to information that follows a strict template.

Embodiment 3

[0051] Further, an accuracy threshold is set for the CRF model (90% in this embodiment). If the accuracy of the model is lower than 90%, the method returns to step S1.

[0052] The improvement of this embodiment is that, when the threshold is not met, a small number of regular expressions are added and steps S1 to S6 are repeated to retrain the CRF model. This effectively improves the extraction effect of the CRF model, and the rules written in earlier rounds are not discarded.
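
The retraining loop of this embodiment could look something like the sketch below. The text does not define how accuracy is measured, so per-tag accuracy is an assumption, as are the function names and the `train_and_score` callback, which stands in for steps S2 to S6. New rules are appended to the existing list, reflecting the statement that earlier rules are not discarded.

```python
def tag_accuracy(gold_seqs, pred_seqs):
    """Fraction of tags predicted correctly over all sequences (assumed metric)."""
    total = correct = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        total += len(gold)
        correct += sum(g == p for g, p in zip(gold, pred))
    return correct / total if total else 0.0


def refine_until_threshold(initial_patterns, extra_pattern_batches,
                           train_and_score, threshold=0.90):
    """Repeat S1-S6, adding a batch of rules each round, until accuracy >= threshold.

    train_and_score(patterns) is a hypothetical callback that runs S2-S6
    and returns the accuracy on the verification set.
    """
    patterns = list(initial_patterns)
    batches = iter(extra_pattern_batches)
    while True:
        accuracy = train_and_score(patterns)
        if accuracy >= threshold:
            return patterns, accuracy
        extra = next(batches, None)
        if extra is None:
            # No further rules available: stop and report the last accuracy.
            return patterns, accuracy
        patterns.extend(extra)  # back to S1, keeping the earlier rules
```

The early return when no further rule batches are available is a safeguard added for the sketch so the loop always terminates.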


Abstract

The invention relates to a method for establishing a text extraction model based on regular expressions. The method comprises the following steps: S1, compiling a plurality of regular expressions; S2, extracting a corpus set from a corpus according to the regular expression; S3, segmenting the corpus set into a training set and a verification set; S4, constructing a text extraction model; S5, inputting the training set into a text extraction model, and training the text extraction model; and S6, inputting the verification set into the trained text extraction model, and verifying the trained text extraction model.

Description

Technical field

[0001] The invention relates to a method and equipment for establishing a text extraction model based on regular expressions, and belongs to the field of natural language processing.

Background

[0002] A regular expression describes a rule that strings must satisfy, and is typically used to retrieve and replace text that follows that rule. For example, a regular expression for extracting email addresses is /^(\w)+(\.\w+)*@(\w)+((\.\w{2,3}){1,3})$/, where \w matches any word character (a letter, digit, or underscore) and {2,3} means two or three repetitions. This regular expression can identify email addresses in the format xxxx@xxxx.xxx. Regular expressions are flexible and can match almost any pattern of text. However, applying a regular expression presupposes that the "pattern" or "rule" of the information to be extracted is very clear. Regular expressions are therefore not suitable for extracting key information from text that has no obvious rules.

[0003] In the process of establishing a supe...
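
The email pattern quoted in the background can be tried with Python's `re` module, as in the short sketch below. The function name `is_email` and the sample addresses are illustrative. Note that in Python 3 `\w` is Unicode-aware for `str` patterns, so it also matches letters outside ASCII.

```python
import re

# The pattern from the background section, without the surrounding slashes.
EMAIL = re.compile(r"^(\w)+(\.\w+)*@(\w)+((\.\w{2,3}){1,3})$")


def is_email(text):
    """Return True if the whole string matches the email pattern."""
    return EMAIL.match(text) is not None


print(is_email("xxxx@xxxx.xxx"))             # True
print(is_email("first.last@example.co.uk"))  # True
print(is_email("user@example.info"))         # False: ".info" is longer than {2,3} allows
print(is_email("user-name@example.com"))     # False: "-" is not matched by \w
```

The last two cases show the limitation described above: a rule is only as good as the pattern it encodes, and valid inputs that fall outside the pattern are missed.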

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/205, G06F16/903
CPC: G06F40/205, G06F16/90344
Inventor: 苏江文, 王燕蓉, 陈江海, 张垚, 庄莉, 梁懿
Owner: FUJIAN YIRONG INFORMATION TECH