Complex PDF structure analysis method and device based on neural network

A neural-network-based structure analysis technology, applied in the computer field, that addresses problems such as poor generalization and the difficulty of designing analysis rules, and achieves strong generalization.

Active Publication Date: 2019-12-20
文灵科技(北京)有限公司

AI Technical Summary

Problems solved by technology

[0005] The embodiments of this specification provide a neural-network-based method and device for analyzing complex PDF structure, which address the prior-art problem that, because PDF document structure is not uniform, it is difficult to design analysis rules that can adapt



Examples


Example Embodiment

[0046] Example one

[0047] Figure 1 is a schematic flowchart of a neural-network-based method for analyzing complex PDF structure in an embodiment of the present invention. As shown in Figure 1, the method is applied to a neural-network-based complex PDF structure analysis device. The device includes an input device and a display device. The input device has a document input module, a document processing module, a memory, and a signal input module, and can be connected to a device that generates output signals, such as a printer or scanner. The display device is connected to the input device and may be a display screen or other equipment on which documents from input devices such as the printer or scanner are displayed. The method includes steps S101-S104.

[0048] S101: Obtain feature information of the PDF document;

[0049] Further, the obtaining the characteristic information of the ...

Example Embodiment

[0059] Example two

[0060] Based on the same inventive concept as the neural-network-based complex PDF structure analysis method in the foregoing embodiment, the present invention also provides a neural-network-based complex PDF structure analysis device which, as shown in Figure 3, includes:

[0061] The first obtaining unit 11 is configured to obtain feature information of the PDF document;

[0062] The second obtaining unit 12 is configured to perform coarse-grained division of the feature information of the PDF document according to the maximum entropy model to obtain hierarchical paragraphs of the PDF document;

[0063] The third obtaining unit 13 is configured to convert the hierarchical paragraphs of the PDF document into paragraph word vectors using the two-layer bidirectional language model trained on the corpus, and to compress the paragraph word vectors to obtain paragraph semantic vectors;

[0064] The fourth obtaining unit 14 is configured to input the paragraph se...

Example Embodiment

[0079] Example three

[0080] Based on the same inventive concept as the neural-network-based complex PDF structure analysis method in the first embodiment, the present invention also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the steps of any of the neural-network-based complex PDF structure analysis methods described above.

[0081] In the bus architecture shown in Figure 4 (represented by the bus 300), the bus 300 can include any number of interconnected buses and bridges. The bus 300 links together various circuits, including one or more processors represented by the processor 302 and memory represented by the memory 304. The bus 300 may also link various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are all well known in the art and are therefore not described further here. The bus...


Abstract

Embodiments of the invention provide a neural-network-based method and device for analyzing complex PDF structure. The method comprises: obtaining feature information of a PDF document; performing coarse-grained division of the feature information according to a maximum entropy model to obtain hierarchical paragraphs of the PDF document; converting the hierarchical paragraphs into paragraph word vectors using a two-layer bidirectional language model trained on a large-scale corpus, and compressing the paragraph word vectors to obtain paragraph semantic vectors; and inputting the paragraph semantic vectors into a multi-layer bidirectional long short-term memory network to obtain a hierarchical sequence of all paragraphs of the PDF document. This solves the technical problem of poor generalization caused by non-uniform PDF document structure, and achieves the technical effects of avoiding the limitations of manually designed rule logic, analyzing complex PDF document structure at a high level, and generalizing well.
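Two of the steps named in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation. It assumes that the maximum entropy model takes the usual conditional form (a softmax over weighted features) and that "compressing" word vectors means averaging them; the source specifies neither. The feature names, weights, and labels below are all hypothetical, and the language model and the bidirectional LSTM are omitted.

```python
import math
from typing import Dict, List

# Hypothetical feature weights per coarse label. A real maximum entropy
# model would learn these from annotated PDF lines.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "heading": {"large_font": 2.0, "bold": 1.0, "short_line": 1.0},
    "body": {"large_font": -1.0, "bold": -0.5, "short_line": -0.5},
}


def maxent_probs(features: Dict[str, float]) -> Dict[str, float]:
    """Conditional maximum entropy form: p(y|x) = exp(w_y . f(x)) / Z(x)."""
    scores = {
        label: sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in WEIGHTS.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


def mean_pool(word_vectors: List[List[float]]) -> List[float]:
    """Assumed compression step: average the word vectors per dimension."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


# Coarse-grained labelling of one line from its layout features.
probs = maxent_probs({"large_font": 1.0, "bold": 1.0, "short_line": 1.0})
coarse_label = max(probs, key=probs.get)

# Compress two toy 2-dimensional word vectors into one paragraph vector.
paragraph_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

In a full pipeline the paragraph vectors produced this way would be fed, in document order, to the sequence model that assigns each paragraph its level in the hierarchy.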

Description

Technical field

[0001] The embodiments of this specification relate to the field of computer technology, and in particular to a neural-network-based method and device for analyzing complex PDF structure.

Background

[0002] PDF document parsing is mainly used to establish the structure of a PDF document, and can also prepare for further extraction of entity information from the document. PDF is a common file format. One kind of PDF document has a clear directory structure and text distinction; this form often requires manual entry and typesetting. The other kind, which covers the vast majority of PDF documents, consists of page-by-page scanned images of physical originals. Such documents have neither a directory structure nor a clear text distinction, which hinders reading and further information extraction. The current mainstream method for extracting PDF document structure relies on the document content and hand-designed rules to extract several titles that may be used as directo...


Application Information

IPC(8): G06F17/22, G06F17/27, G06N3/04, G06N3/08
CPC: G06N3/084, G06N3/044
Inventor: 宋永生, 汤铭, 王楠
Owner: 文灵科技(北京)有限公司