A method and device for extracting information from PDF files

An information extraction technology for files, applied in the computer field. It addresses two shortcomings of existing approaches: they cannot filter out interfering information such as tables of contents and notes, and they cannot determine the attribution relationship between titles at different levels. The technology achieves automatic extraction, improves information extraction performance, and offers high extraction efficiency.

Active Publication Date: 2020-09-29

JINGDONG TECH HLDG CO LTD

12 Cites, 0 Cited by


Problems solved by technology

[0004] The prior art cannot determine the attribution relationship between titles at different levels, the correspondence between titles and their content fragments, or the correspondence between tables and related text. It also cannot filter out interfering information such as tables of contents and notes.


Embodiment Construction

[0040] Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present invention to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

[0041] Figure 1 is a schematic diagram of the main steps of a method for extracting information from a PDF file according to an embodiment of the present invention. As shown in Figure 1, the information extraction method for a PDF file according to the embodiment of the present invention mainly comprises the following steps:

[0042] Step S101: Obtain location information of a ...


Abstract

The invention discloses an information extraction method and device for PDF files, and relates to the technical field of computers. A specific embodiment of the method includes: obtaining the position information of text objects from the PDF file and marking the position information on an image, wherein the text objects include at least one key name and its corresponding key value; classifying the image according to its layout features, so as to determine the position range of each key name and its corresponding key value in the PDF file based on the image type; and establishing association relationships between the key names according to their levels, so as to combine those relationships with the position ranges of the key names and corresponding key values and output the key names of different levels together with their corresponding key values. In this method, the positions of the text objects in the PDF file are marked on the image. After the image is classified according to its layout features, the positions of the key names and corresponding key values are determined according to the image type, and association relationships between key names at all levels are established. Combining positions and association relationships yields structured key names and key values, which improves information extraction performance.
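The last step of the abstract, associating key names by level and combining them with their position ranges, can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `KeyEntry` and `build_hierarchy`, the bounding-box convention (y grows downward), and the rule "attach each key name to the nearest preceding key name of a higher level" are all assumptions made for illustration. The sketch also assumes that the levels and position ranges were already produced by the earlier steps.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical position range: (x0, y0, x1, y1), with y growing downward.
BBox = Tuple[float, float, float, float]


@dataclass
class KeyEntry:
    """A key name, its level, its key value, and the position range of both."""
    name: str
    level: int  # assumed convention: 1 = top-level title, 2 = sub-title, ...
    value: str
    bbox: BBox
    children: List["KeyEntry"] = field(default_factory=list)


def build_hierarchy(entries: List[KeyEntry]) -> List[KeyEntry]:
    """Attach each entry to the nearest preceding entry with a smaller level.

    Entries are first put into reading order (top to bottom, then left to
    right) using their position ranges. Returns the top-level entries.
    """
    roots: List[KeyEntry] = []
    stack: List[KeyEntry] = []  # chain of currently open ancestors
    for entry in sorted(entries, key=lambda e: (e.bbox[1], e.bbox[0])):
        # Close every open key name that is not a parent of this one.
        while stack and stack[-1].level >= entry.level:
            stack.pop()
        if stack:
            stack[-1].children.append(entry)
        else:
            roots.append(entry)
        stack.append(entry)
    return roots


def to_dict(entry: KeyEntry) -> dict:
    """Serialize one entry and its descendants as nested dictionaries."""
    return {
        "key": entry.name,
        "value": entry.value,
        "children": [to_dict(child) for child in entry.children],
    }
```

A stack keeps the sketch linear after sorting: each key name is pushed once and popped at most once. A real system would need a more robust reading order for multi-column layouts, which is presumably one reason the method classifies the image by layout type first.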

Description

Technical field

[0001] The invention relates to the field of computers, in particular to an information extraction method and device for PDF files.

Background

[0002] To help users find content of interest in PDF files, the content of a PDF file must be structured: the parent and child titles of each title, the content fragments, the chart content, and other information must be identified and organized in an orderly manner. In the prior art, information extraction from PDF files mainly relies on toolkits that extract plain text and plain tables. Extracting plain text means extracting all text information from the entire PDF file; extracting plain tables means extracting the table-related text information from the entire PDF file.

[0003] In the course of realizing the present invention, the inventor found at least the following problems in the prior art:

[0004] It is impossible to determine the attrib...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F40/258, G06K9/00

CPC: G06V30/40, G06V30/413, G06F40/258

Inventor: 郑宇宇

Owner: JINGDONG TECH HLDG CO LTD
