Method and system for generating training data

A technology for generating training data, applied in digital data processing, natural language data processing, and related fields. It addresses the scarcity of accurately labeled training data sets for document images and the incomplete label information of existing methods, and it achieves high labeling accuracy while avoiding a complicated manual labeling process.

Active Publication Date: 2021-05-25
北京灵伴即时智能科技有限公司

AI Technical Summary

Problems solved by technology

[0003] However, the inventors found through research that prior-art document image analysis and recognition systems still lack sufficient, accurately labeled training data sets, because manually labeling their training samples is complicated; training data sets for Chinese document images are especially scarce. Moreover, prior-art document image synthesis and labeling techniques consider only individual influencing factors in the document image generation process, and the label information they produce is not comprehensive. This severely limits them in practice, and they cannot be applied in a complete document image analysis and recognition system.

Examples

Embodiment Construction

[0064] The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the present invention.

[0065] Training data generation technology for document image analysis and recognition systems comprises document image synthesis technology and document image labeling technology.

[0066] Document image synthesis technology analyzes the various factors that influence real document images and uses computer programs to model how those factors arise, thereby automatically generating simulated document images; the influencing factors in the real document image includ...
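To make the idea of modeling an influencing factor with a program concrete, here is a minimal Python sketch of two such factors. The abstract of this document names a noise adder and a deformation adder; everything else here is an assumption for illustration, not the patent's implementation: the function names `add_salt_pepper`, `shear_rows`, and `shear_bbox`, the choice of salt-and-pepper noise and a row-wise shear, and the representation of an image as a list of rows of grayscale values. The sketch also shows why synthesis and labeling go together: when a deformation moves pixels, the same parameters can move the bounding-box labels.

```python
import random

def add_salt_pepper(img, prob, rng):
    """Hypothetical noise adder: copy img and set each pixel to 0 or 255
    with probability `prob` (salt-and-pepper noise)."""
    out = [row[:] for row in img]
    for row in out:
        for x in range(len(row)):
            if rng.random() < prob:
                row[x] = rng.choice((0, 255))
    return out

def _row_shift(y, max_shift, img_h):
    # Shift grows linearly from 0 (top row) to max_shift (bottom row).
    return (y * max_shift) // max(img_h - 1, 1)

def shear_rows(img, max_shift, background=255):
    """Hypothetical deformation adder: shift each row to the right by an
    amount that grows with its y coordinate; pixels pushed off the right
    edge are dropped and the gap is filled with `background`."""
    h, w = len(img), len(img[0])
    out = []
    for y, row in enumerate(img):
        s = min(_row_shift(y, max_shift, h), w)
        out.append([background] * s + row[: w - s])
    return out

def shear_bbox(bbox, max_shift, img_h):
    """Apply the same shear to an (x, y, w, h) box so that the label still
    encloses the deformed region (clipping at the image edge is ignored)."""
    x, y, w, h = bbox
    s_top = _row_shift(y, max_shift, img_h)
    s_bottom = _row_shift(y + h - 1, max_shift, img_h)
    return (x + s_top, y, w + (s_bottom - s_top), h)
```

Because `prob` and `max_shift` are plain parameters, each factor can be switched off (set to 0) or varied per sample, which is one simple way to read the "parameterized and configurable" property the abstract describes.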


Abstract

The invention discloses a training data generation system comprising a text generator, a layout generator, a text renderer, a chart renderer, a noise adder, a deformation adder, an annotation generator, and a document image sample library. The text renderer selects a text line and renders it into the position area of that text line; the chart renderer renders a chart element into the position area of that chart element; the annotation generator generates layout analysis annotations as well as text localization and recognition annotations. The invention also discloses a method for generating training data. The invention considers the various factors in the document image generation process and makes them modular, parameterized, and configurable, so that it can automatically generate training samples for document image analysis and recognition systems. The synthesized document images are varied in form and realistic in appearance, and the system as a whole is flexible, adjustable, and highly extensible. It can also automatically label information at every level of the document image, providing fully labeled training data.
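The module chain in the abstract can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the names `Region`, `generate_layout`, `generate_text`, `render`, `annotate`, and `synthesize_sample` are hypothetical, the "image" is a character grid rather than pixels, the chart is a filled block, and the noise adder, deformation adder, and sample library are left out. What it does show is the property the abstract emphasizes: because the layout is generated before rendering, the annotations come directly from the generator's own data and need no manual labeling.

```python
import random
from dataclasses import dataclass

@dataclass
class Region:
    kind: str          # "text" or "chart"
    x: int
    y: int
    w: int
    h: int
    text: str = ""     # filled in by the text renderer

def generate_layout(page_w, page_h, n_lines, rng):
    """Hypothetical layout generator: n text-line regions, then one chart region."""
    regions, y = [], 1
    for _ in range(n_lines):
        regions.append(Region("text", 1, y, rng.randint(page_w // 2, page_w - 2), 1))
        y += 2
    regions.append(Region("chart", 1, y, page_w - 2, page_h - y - 1))
    return regions

def generate_text(length, rng, alphabet="abcdefghij "):
    """Hypothetical text generator: a random string of the requested length."""
    return "".join(rng.choice(alphabet) for _ in range(length))

def render(regions, page_w, page_h, rng):
    """Text renderer and chart renderer combined: draw each region into its area."""
    canvas = [[" "] * page_w for _ in range(page_h)]
    for r in regions:
        if r.kind == "text":
            r.text = generate_text(r.w, rng)
            for i, ch in enumerate(r.text):
                canvas[r.y][r.x + i] = ch
        else:
            for dy in range(r.h):
                for dx in range(r.w):
                    canvas[r.y + dy][r.x + dx] = "#"
    return canvas

def annotate(regions):
    """Annotation generator: layout labels plus text location and transcript labels."""
    layout = [{"type": r.kind, "bbox": (r.x, r.y, r.w, r.h)} for r in regions]
    text = [{"bbox": (r.x, r.y, r.w, r.h), "transcript": r.text}
            for r in regions if r.kind == "text"]
    return {"layout": layout, "text": text}

def synthesize_sample(page_w=40, page_h=12, n_lines=3, seed=0):
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    regions = generate_layout(page_w, page_h, n_lines, rng)
    canvas = render(regions, page_w, page_h, rng)
    return canvas, annotate(regions)
```

Calling `synthesize_sample()` returns the rendered grid together with its annotations; printing `"\n".join("".join(row) for row in canvas)` gives a quick visual check that each annotated box lines up with what was drawn.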

Description

Technical field

[0001] The present invention relates to the field of image synthesis and labeling, and in particular to a training data generation method and system.

Background technique

[0002] In the prior art, deep learning methods represented by deep neural networks have been widely applied to various image recognition systems. Among them, a document image analysis and recognition system uses computer vision methods to analyze the physical and logical structure of a document image, and forms a complete description of the document by locating and recognizing the document elements within it, such as text, tables, images, and graphics. Meanwhile, automatic training data generation is a technique often used in machine learning: it expands data by adding various variations to real data, or it directly generates simulation of real data mo...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/103, G06K9/00, G06K9/62
CPC: G06F40/177, G06F40/166, G06V30/413, G06V30/10, G06F18/214
Inventor: 豆浩斌, 陈博, 朱风云
Owner: 北京灵伴即时智能科技有限公司