
Method for identifying and extracting structured information of business license by utilizing named entities

A named entity recognition and structured information extraction technology, applied in neural learning methods, character and pattern recognition, neural architectures, etc. It addresses problems such as field matching that fails easily and irregular fields, with the effects of enhanced generalization ability, a wide range of applications, and improved accuracy.

Pending Publication Date: 2021-04-16
广州市申迪计算机系统有限公司

AI Technical Summary

Problems solved by technology

[0008] First, field matching fails easily. For example, suppose the text recognized by OCR should read "November 3, 2018", but OCR has misrecognized the character for "day". A regular expression can then no longer extract the date. Such cases can be handled by refining the regular expressions, but the possible variations are endless and difficult to cover completely. Whenever the OCR output deviates even slightly, regex-based structured extraction is likely to fail;
[0009] Second, some fields are irregular. For example, the name field on a business license may contain unconventional store names such as "Lugu Lake Watching Time" and "There is no corn juice here", which are difficult to extract by defining regular expressions;
[0011] First, with template-based extraction, a large number of templates is difficult to cover completely. For example, there are at least three types of business license nationwide, so at least three templates must be defined in advance. Moreover, when an image is input, its template type must first be determined, which involves image classification and increases complexity;
[0012] Second, template fields sometimes cannot be found because of inaccurate text recognition or incomplete image input. At present, most template-based structuring relies on a perspective transformation, which requires at least four template fields to be located before extraction can proceed. When the template fields cannot be found, template-based extraction fails.
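
The fragility described in [0008] can be illustrated with a minimal sketch. The date pattern, the sample strings, and the specific confusion of 日 with 曰 are assumptions chosen for illustration; the source does not state which characters were confused or which pattern was used.

```python
import re

# Hypothetical pattern for Chinese-format dates such as 2018年11月3日.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def extract_date(text):
    """Return (year, month, day) as ints, or None if the pattern does not match."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


clean_text = "注册日期 2018年11月3日"
noisy_text = "注册日期 2018年11月3曰"  # assumed OCR error: 日 read as 曰

print(extract_date(clean_text))  # (2018, 11, 3)
print(extract_date(noisy_text))  # None
```

Widening the pattern to accept 曰 would fix this one case, but each new OCR confusion needs another rule, which is the coverage problem the paragraph describes.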



Examples


Embodiment 1

[0049] As shown in Figure 1, this embodiment provides a method for extracting structured information from a business license by using named entity recognition, comprising the following steps:

[0050] S1) Train the named entity model

[0051] S101) Define the entities

[0052] Define the entities to be extracted. In this embodiment, the entities are standardized into eight types: social credit code, name, type, business location, operator, composition form, registration date, and business scope;
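
As a rough sketch of how the eight entity types in [0052] could map to a tag set under the BIO labeling method mentioned in the abstract, the snippet below builds one "B-" and one "I-" label per entity plus a single "O" label. The English identifiers and the helper `build_bio_labels` are hypothetical names, not taken from the patent.

```python
# Hypothetical identifiers for the eight entity types listed in [0052].
ENTITY_TYPES = [
    "social_credit_code",
    "name",
    "type",
    "business_location",
    "operator",
    "composition_form",
    "registration_date",
    "business_scope",
]


def build_bio_labels(entity_types):
    """Return the BIO label list: 'O' plus B-/I- labels for each entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first character of an entity
        labels.append(f"I-{entity}")  # subsequent characters of the same entity
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}

print(len(LABELS))  # 17
```

With eight entity types this gives 17 labels, which would be the output dimension of a per-character classifier under this assumed scheme.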

[0053] S102) Collect data

[0054] Obtain photos of business licenses, and then manually mark the entities. In this embodiment, after a business license is obtained, the social credit code, name, type, business location, operator, composition form, registration date, and business scope are all marked manually;
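
One plausible way to turn such manual marks into character-level training labels, following the BIO labeling method named in the abstract, is sketched below. The span format `(start, end, entity_type)` with an exclusive end, the function name `spans_to_bio`, and the sample text are assumptions for illustration only.

```python
def spans_to_bio(text, spans):
    """Convert marked spans into one BIO tag per character.

    spans: list of (start, end, entity_type) tuples, end exclusive (assumed format).
    """
    tags = ["O"] * len(text)
    for start, end, entity in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        tags[start] = f"B-{entity}"  # first character of the entity
        for index in range(start + 1, end):
            tags[index] = f"I-{entity}"  # remaining characters
    return tags


# Hypothetical example: the field label 名称 followed by a store name.
text = "名称张三小吃店"
spans = [(2, 7, "name")]

for char, tag in zip(text, spans_to_bio(text, spans)):
    print(char, tag)
```

The two field-label characters receive "O", and the five characters of the store name receive "B-name" followed by four "I-name" tags.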

[0055] S103) Generate data

[0056] In this embodiment, data generation can directly input text paragraphs, and at the same time convert s...


Abstract

The invention provides a method for identifying and extracting structured information of a business license by using named entity recognition, comprising the following steps: training a named entity model, and predicting with the model. Model training comprises defining the entities to be extracted, acquiring photos of business licenses and manually marking the entities, generating training data with the BIO labeling method, constructing a model with a BERT + BiLSTM + CRF architecture, and training the model. Prediction comprises text splicing and model prediction, after which the per-character recognition results are organized into entities according to the BIO labeling method. The invention offers high stability and robustness and a wide application range. To enhance the generalization ability of the model and improve extraction accuracy, noise data are introduced during data generation, a pre-trained model obtained through large-scale corpus training is used in the feature extraction layer of the model architecture, and adversarial training is introduced during model training.
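
The final prediction step, organizing per-character results into entities under the BIO scheme, could look roughly like the sketch below. It assumes text splicing means concatenating the OCR text lines, and it uses hand-written tags as a stand-in for the BERT + BiLSTM + CRF output; `decode_bio` and the sample data are hypothetical.

```python
def decode_bio(chars, tags):
    """Group characters into (entity_type, text) pairs according to BIO tags."""
    entities = []
    current_type = None
    current_chars = []

    def flush():
        # Store the entity collected so far, if any.
        if current_type is not None:
            entities.append((current_type, "".join(current_chars)))

    for char, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            current_type = tag[2:]
            current_chars = [char]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current_chars.append(char)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            flush()
            current_type = None
            current_chars = []
    flush()
    return entities


# Assumed text splicing: concatenate the OCR text lines into one string.
ocr_lines = ["名称张三", "小吃店"]
spliced = "".join(ocr_lines)

# Stand-in for model output: one tag per character of the spliced text.
predicted_tags = ["O", "O", "B-name", "I-name", "I-name", "I-name", "I-name"]

print(decode_bio(spliced, predicted_tags))  # [('name', '张三小吃店')]
```

Because the tags are assigned per character over the spliced text, an entity that OCR split across two text boxes is still recovered as a single value in this sketch.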

Description

Technical field

[0001] The invention relates to the technical field of business license information extraction, in particular to a method for extracting structured information of a business license by using named entity recognition.

Background technique

[0002] A business license is a certificate issued by the industrial and commercial administration to industrial and commercial enterprises and self-employed persons to engage in certain production and operation activities.

[0003] In some scenarios, it is necessary to identify the key information in the business license, such as unified social credit code, business address, operator and registration date, etc.

[0004] At present, OCR on the market generally has three major processes. The first step is to detect text boxes, the second step is text recognition, and the third step is to extract structured information. Among them, there are two mainstream technologies for extracting structured information, one is regularized...


Application Information

IPC(8): G06F40/295, G06K9/20, G06K9/62, G06N3/04, G06N3/08
Inventor: 周俊贤, 朱汝维
Owner: 广州市申迪计算机系统有限公司