A method and device for extracting information from PDF files

An information extraction technology for PDF files, applied in the computer field. It addresses two problems: the inability to filter out interfering information such as tables of contents and notes, and the inability to determine the attribution relationship between titles at different levels. It thereby achieves automatic extraction, improved information extraction performance, and high extraction efficiency.

Active Publication Date: 2020-09-29
JINGDONG TECH HLDG CO LTD

AI Technical Summary

Problems solved by technology

[0004] The prior art cannot determine the attribution relationship between titles at different levels, the correspondence between titles and their content fragments, or the correspondence between tables and related text. It also cannot filter out interfering information such as tables of contents and notes.

Embodiment Construction

[0040] Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present invention to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

[0041] Figure 1 is a schematic diagram of the main steps of a method for extracting information from a PDF file according to an embodiment of the present invention. As shown in Figure 1, the information extraction method of this embodiment mainly comprises the following steps:

[0042] Step S101: Obtain location information of a ...

Abstract

The invention discloses an information extraction method and device for a PDF file, and relates to the technical field of computers. A specific embodiment of the method includes: obtaining the position information of text objects from the PDF file and marking that position information on an image, where the text objects include at least one key name and a corresponding key value; classifying the image according to its layout characteristics, so as to determine, based on the image type, the position ranges of the key names and corresponding key values in the PDF file; and establishing an association relationship between the key names according to their levels, so that, by combining this relationship with the position ranges of the key names and corresponding key values, the key names of different levels and their corresponding key values are output. In this method, the positions of the text objects in the PDF file are marked on the image; after the image is classified by layout feature, the positions of the key names and corresponding key values are determined according to the image type, and the association relationship between key names at all levels is established. Combining position and association relationship yields structured key names and key values, which improves information extraction performance.
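The last two stages described in the abstract (associating key names by level, then pairing each key name with its key value by position) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the position ranges and the level of each key name have already been determined by the earlier stages, that the page is a single column read top to bottom, and that y coordinates grow downward. The names `TextObject`, `Node`, `build_tree`, and `flatten` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical bounding box: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


@dataclass
class TextObject:
    text: str
    box: Box
    # Assumption: 0 marks a key value (body content); 1, 2, ... mark key
    # name levels, with 1 as the top level. How levels are derived is not
    # covered by this sketch.
    level: int


@dataclass
class Node:
    name: str
    level: int
    box: Box
    values: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def build_tree(objects: List[TextObject]) -> List[Node]:
    """Link key names by level and attach each key value to the
    nearest preceding key name in reading order."""
    roots: List[Node] = []
    stack: List[Node] = []  # currently open key names, outermost first
    # Reading order: top to bottom, then left to right.
    for obj in sorted(objects, key=lambda o: (o.box[1], o.box[0])):
        if obj.level == 0:
            if stack:  # values that precede any key name are skipped
                stack[-1].values.append(obj.text)
            continue
        node = Node(obj.text, obj.level, obj.box)
        # Close key names at the same or a deeper level.
        while stack and stack[-1].level >= obj.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


def flatten(nodes: List[Node], prefix: Tuple[str, ...] = ()) -> List[Tuple[str, str]]:
    """Return (key name path, key value) rows in document order."""
    rows: List[Tuple[str, str]] = []
    for n in nodes:
        path = prefix + (n.name,)
        rows.append((" / ".join(path), " ".join(n.values)))
        rows.extend(flatten(n.children, path))
    return rows


if __name__ == "__main__":
    page = [
        TextObject("1 Overview", (50, 100, 300, 120), 1),
        TextObject("Intro text.", (50, 130, 500, 150), 0),
        TextObject("1.1 Scope", (50, 160, 300, 180), 2),
        TextObject("Scope text.", (50, 190, 500, 210), 0),
    ]
    for path, value in flatten(build_tree(page)):
        print(path, "->", value)
```

A stack of open key names keeps the association step linear in the number of text objects after sorting. A multi-column layout would need a different reading order, which is presumably one reason the method classifies the page image by layout before determining position ranges.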

Description

Technical field

[0001] The invention relates to the field of computers, in particular to an information extraction method and device for PDF files.

Background

[0002] To help users obtain content of interest from PDF files, the content of a PDF file needs to be structured: the parent and child titles of each title, content fragments, chart content, and other information must be identified and organized in an orderly manner. In the prior art, information extraction from PDF files mainly relies on toolkits that extract plain text and plain tables. Extracting plain text means extracting all text information from the entire PDF file; extracting plain tables means extracting the table-related text information from the entire PDF file.

[0003] In the course of realizing the present invention, the inventor found that the prior art has at least the following problems:

[0004] It is impossible to determine the attrib...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/258, G06K9/00
CPC: G06V30/40, G06V30/413, G06F40/258
Inventor: 郑宇宇
Owner: JINGDONG TECH HLDG CO LTD