Method for identifying data form in document and device thereof

An intermediate data structure and table recognition technique, applied in the fields of electrical digital data processing, special data processing applications, and instruments. It addresses tables without ruling lines going unrecognized, slow processing, and difficult typesetting. It aims to improve the fidelity and readability of the converted document, reduce repeated manual processing, and improve editability.

Active Publication Date: 2011-02-16
WONDERSHARE TECH CO LTD

AI Technical Summary

Problems solved by technology

When converting a PDF file to a document format that is easier to edit, extracting only the raw data content from the PDF yields scattered text and lines. To get a table, the user must manually delete the lines, insert a table, and refill the text into it, which is time-consuming and laborious.
[0004] At the same time, some text content in a PDF is presented in a table-like layout, but there are no table lines to form a real table.
After such text is extracted, it is difficult to maintain the original typesetting without special processing.
[0005] After extracting the data from the PDF document...


Embodiment Construction

[0024] To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the present invention and do not limit it.

[0025] Figure 1 shows a flowchart of the method for identifying data tables in a document provided by an embodiment of the present invention.

[0026] In step S101, extract the text in the PDF document;

[0027] In step S102, the text is divided according to the attributes of the extracted text to obtain a division result;

[0028] In step S103, identify and generate a data table by judging and processing the division result;

[0029] In step S104, the result of recognition is stored in an independent intermediate data structure;

[0030] In step S105, restore the data table in the int...
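Steps S102 to S105 can be illustrated with a minimal Python sketch. It is not the patented implementation: the source does not specify the division or judgment rules, so the sketch assumes that text fragments arrive from S101 with `x`/`y` coordinates, that fragments with similar `y` form one line, and that consecutive lines whose fragments share left edges form a table. The names `Fragment`, `Table`, `divide`, `identify`, and `restore`, and the tolerance values, are hypothetical.

```python
import html
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Extracted text with its position (assumed output of S101)."""
    text: str
    x: float  # left edge
    y: float  # vertical position, increasing down the page


@dataclass
class Table:
    """Format-independent intermediate structure (stands in for S104)."""
    rows: list = field(default_factory=list)


def divide(fragments, y_tol=2.0):
    """S102, assumed rule: group fragments with similar y into lines."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [sorted(line, key=lambda f: f.x) for line in lines]


def identify(lines, x_tol=3.0):
    """S103, assumed rule: the first run of lines with two or more
    fragments whose left edges align is treated as one table."""
    table = Table()
    anchors = None
    for line in lines:
        xs = [f.x for f in line]
        aligned = (
            anchors is not None
            and len(xs) == len(anchors)
            and all(abs(a - b) <= x_tol for a, b in zip(xs, anchors))
        )
        if len(xs) >= 2 and (anchors is None or aligned):
            anchors = anchors or xs
            table.rows.append([f.text for f in line])
        elif table.rows:
            break  # the aligned run has ended
    # A single aligned line is not enough evidence for a table.
    return table if len(table.rows) >= 2 else Table()


def restore(table, target="html"):
    """S105: render the intermediate structure in a target format."""
    if target == "html":
        body = "".join(
            "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
            for row in table.rows
        )
        return f"<table>{body}</table>"
    if target == "tsv":
        return "\n".join("\t".join(row) for row in table.rows)
    raise ValueError(f"unsupported target format: {target}")
```

Keeping `Table` independent of any output format means a new target only needs a new branch in `restore`; the recognition code stays untouched.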


Abstract

The invention belongs to the field of document application and discloses a method and device for identifying data tables in a document. The method includes: extracting the text in a PDF document; dividing the text according to the attributes of the extracted text to obtain a division result; judging and processing the division result to identify and generate a data table; storing the data table in an independent intermediate data structure; and restoring the data table in the intermediate data structure according to the target document format. With the invention, data tables in a PDF are converted accurately, editability after conversion is greatly improved, and manual processing of the converted document is reduced.

Description

Technical field

[0001] The invention belongs to the field of document application, and in particular relates to a method and device for identifying data tables in a document.

Background

[0002] As computers have become widespread, paperless offices have become more common, and users face a large number of documents of various kinds.

[0003] In a Portable Document Format (PDF) document, the table the reader sees is formed by superimposing lines and text. Therefore, when converting a PDF file to a document format that is easier to edit, extracting only the raw data content from the PDF yields scattered text and lines. To get a table, the user must manually delete the lines, insert a table, and refill the text into it, which is time-consuming and labor-intensive.

[0004] At the same time, some text content in a PDF is presented in a table-like layout, but there is ...
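Paragraph [0003] describes a PDF table as lines superimposed on text. The following Python sketch shows one way such a table could be rebuilt from those two layers. It assumes the ruling lines have already been reduced to the y positions of full-width horizontal lines and the x positions of full-height vertical lines, and that each text item carries one reference point. The function name `grid_from_rulings` and the input shapes are hypothetical and do not come from the patent.

```python
from bisect import bisect_right


def grid_from_rulings(h_lines, v_lines, texts):
    """Rebuild a table from ruling lines and positioned text.

    h_lines: y positions of horizontal rulings (assumed full width)
    v_lines: x positions of vertical rulings (assumed full height)
    texts:   iterable of (text, x, y) tuples
    Returns a list of rows, each a list of cell strings.
    """
    ys = sorted(set(h_lines))
    xs = sorted(set(v_lines))
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    if n_rows < 1 or n_cols < 1:
        return []  # fewer than two rulings per axis: no closed cell

    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    for text, x, y in texts:
        # Index of the band between two neighbouring rulings.
        r = bisect_right(ys, y) - 1
        c = bisect_right(xs, x) - 1
        if 0 <= r < n_rows and 0 <= c < n_cols:
            cells[r][c].append(text)
        # Text outside the outer rulings is left out of the table.
    return [[" ".join(cell) for cell in row] for row in cells]
```

Merged cells and partial rulings are outside this sketch; handling them would need the actual line segments rather than bare coordinates.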


Application Information

IPC(8): G06F17/22
Inventor 李譞
Owner WONDERSHARE TECH CO LTD