PDF table structure identification method based on graph attention mechanism

A table structure recognition technology based on a graph attention mechanism, applied in character and pattern recognition, neural architectures, computer components, etc., that addresses the difficulty of recognizing complex table structures and improves recognition accuracy and recall.

Inactive Publication Date: 2020-02-04
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to solve the problem that existing methods have difficulty accurately identifying the structure of complex tables in PDF format. To improve the accuracy and recall of structure recognition on complex tables, a PDF table structure recognition method based on a graph attention mechanism is proposed.



Examples


Embodiment 1

[0030] As shown in Figure 1, a PDF table structure recognition method based on a graph attention mechanism includes the following steps:

[0031] Step 1. Preprocessing: Get all the cells in the table and their position coordinates.

[0032] Step 1: Extract all the text characters in the document according to the storage format of the PDF, group all characters whose mutual distance is less than a threshold d into one cell, and record the position coordinates and size of each cell. Assuming that there are n cells in total, these n cells are denoted w_1, w_2, ..., w_n, as shown in Figure 1 (Step 1).
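The grouping in Step 1 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: characters are taken to be axis-aligned bounding boxes already extracted from the PDF, the "distance" is assumed to be the gap between two boxes, and the names `Box`, `box_gap`, and `group_into_cells` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box with its text (hypothetical container)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def box_gap(a, b):
    """Euclidean gap between two boxes; 0 if they touch or overlap.
    Assumption: the source does not say which distance it uses."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def group_into_cells(chars, d):
    """Merge characters closer than d (transitively) into cells w_1..w_n."""
    parent = list(range(len(chars)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            if box_gap(chars[i], chars[j]) < d:
                parent[find(i)] = find(j)

    groups = {}
    for i, c in enumerate(chars):
        groups.setdefault(find(i), []).append(c)

    cells = []
    for members in groups.values():
        # Reading order inside a cell: top to bottom, then left to right.
        members.sort(key=lambda c: (c.y0, c.x0))
        cells.append(Box(
            min(c.x0 for c in members),
            min(c.y0 for c in members),
            max(c.x1 for c in members),
            max(c.y1 for c in members),
            "".join(c.text for c in members),
        ))
    cells.sort(key=lambda c: (c.y0, c.x0))
    return cells
```

Each returned `Box` carries the position and size of one cell. The pairwise loop is quadratic in the number of characters, which is acceptable for a sketch; a spatial index would be the natural replacement on large pages.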

[0033] Step 2, graph construction: build an undirected graph for the obtained cells.

[0034] Step 2: Use the K-nearest neighbor method to create an undirected graph for the obtained cells, as shown in Figure 1 (Step 2).

[0035] Step 2.1: Treat each cell as a node in the graph; in Figure 1, the nodes are indicated by circles in the upper...
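The K-nearest-neighbor graph construction of Step 2 might look like the sketch below. It assumes each cell is represented by its center point and that neighbors are ranked by Euclidean distance between centers; the visible text does not specify the metric or the value of K, so both are assumptions, and the function name `knn_undirected_graph` is hypothetical.

```python
import math


def knn_undirected_graph(centers, k):
    """Connect every cell to its k nearest cells and return undirected edges.

    centers: list of (x, y) cell centers, one per node.
    Returns a sorted list of (i, j) pairs with i < j.
    """
    edges = set()
    for i, (xi, yi) in enumerate(centers):
        # Rank all other nodes by distance to node i (index breaks ties).
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: (math.hypot(centers[j][0] - xi, centers[j][1] - yi), j),
        )
        for j in others[:k]:
            # Store each edge once, regardless of which endpoint proposed it.
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)
```

Because an edge is kept when either endpoint selects the other, a node can end up with more than k neighbors. That is the usual behavior when a directed K-nearest-neighbor relation is symmetrized into an undirected graph.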

Embodiment 2

[0069] This embodiment describes the process of identifying table structure on two public table structure recognition datasets, including the procedure used, the parameter design involved, and the experimental results.

[0070] This embodiment involves three stages. First, the edge classification model based on the graph attention mechanism is trained on the public table structure recognition datasets to obtain the parameters of the model. Then, the four steps of the technical solution of the present invention are applied to identify the structure of the tables in the test set. Finally, the identified table structure is compared with the correct result, and the present invention is compared with the existing methods.

[0071] (A) Model training

[0072] Step A: Use the training set to train the edge classification model based on the graph attention mechanism, and obtain the parameters of the model.

[0073] Step A.1: Prepare the dataset.

[0074] In this embodiment, t...
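As a rough illustration of what an edge classification model based on graph attention computes, the sketch below implements one generic single-head graph attention layer followed by a logistic edge score. It is not the model of the invention, whose architecture and parameters are not given in the visible text: the attention form (LeakyReLU over concatenated projected features, softmax over neighbors), the binary "adjacent or not" output, the symmetric edge features, and the names `gat_layer` and `edge_probability` are all assumptions.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, edges, W, a):
    """One single-head graph attention layer on an undirected graph.

    features: per-node input vectors; edges: (i, j) pairs;
    W: projection matrix; a: attention vector of length 2 * output size.
    """
    n = len(features)
    z = [matvec(W, h) for h in features]          # projected node features
    neighbours = {i: {i} for i in range(n)}       # include self-loops
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)

    out = []
    for i in range(n):
        nbrs = sorted(neighbours[i])
        # Unnormalised attention score for every neighbour j of node i.
        scores = [
            leaky_relu(sum(p * q for p, q in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighbourhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of neighbour features.
        out.append([
            sum(al * z[j][k] for al, j in zip(alphas, nbrs))
            for k in range(len(z[i]))
        ])
    return out


def edge_probability(h_i, h_j, w, b=0.0):
    """Logistic score that two cells are adjacent (assumed binary task).

    Uses order-independent features so that (i, j) and (j, i) agree.
    """
    feat = [abs(p - q) for p, q in zip(h_i, h_j)]
    feat += [p + q for p, q in zip(h_i, h_j)]
    logit = sum(wi * f for wi, f in zip(w, feat)) + b
    return 1.0 / (1.0 + math.exp(-logit))
```

In training, `W`, `a`, `w`, and `b` would be the learned parameters, fitted by minimizing a classification loss over labelled edges; an automatic differentiation framework would normally replace this hand-written forward pass.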



Abstract

The invention relates to a PDF table structure recognition method based on a graph attention mechanism, and belongs to the technical field of document analysis in data mining. The method comprises the following steps: 1, preprocessing, in which all cells in a table and their position coordinates are obtained; 2, graph construction, in which an undirected graph is established for the obtained cells; and 3, relationship prediction, in which the edges of the constructed undirected graph are classified and the adjacency relationship between the cells is predicted by using a neural network model. Compared with the prior art, this is the first proposed method for identifying complex table structures in PDF; it achieves the best results on two table structure recognition datasets, and the improvement is particularly pronounced on complex table structures.

Description

Technical field

[0001] The invention relates to a table structure recognition method, in particular to a PDF table structure recognition technology based on a graph attention mechanism, and belongs to the technical field of document analysis in data mining.

Background technique

[0002] Table structure recognition is the task of identifying the internal structure of a table, which is an important step in enabling machines to understand tables. Recognized, machine-understandable tables have many applications, such as question answering systems, dialogue systems, and table-to-text generation.

[0003] There are existing studies on table structure recognition in formats such as text, HTML, and images. PDF is a popular and widely used file format, and table structure recognition on PDF has also attracted extensive attention. Existing methods can be divided into rule-based methods and data-driven methods. The rule-based method mainly determines the table structure b...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/62, G06N3/04
CPC: G06V30/414, G06N3/045, G06F18/2411
Inventor: 毛先领, 迟泽闻, 徐恒达
Owner: BEIJING INSTITUTE OF TECHNOLOGY