PDF content extraction method, device and equipment

A PDF content extraction method, applied in the field of PDF content extraction and computer-readable storage media, addresses the poor efficiency of content identification and extraction and the interference caused by page elements unrelated to the content. Its effects are improved identification and extraction efficiency, improved accuracy, and a smaller image to be recognized.

Pending Publication Date: 2021-12-17
四川医枢科技有限责任公司

AI Technical Summary

Problems solved by technology

Many scholars and enterprises have worked on extracting text and tables from PDF files. Current work mainly builds algorithm models and analysis systems based on rule engines and deep learning, focusing on image recognition, analysis, and text recognition of PDF pages. However, efficiency is often poor, and recognition is easily disturbed by page elements that have nothing to do with the page content, which reduces the efficiency of content recognition and content extraction.


Examples


Specific embodiments

[0047] The core of the present invention is to provide a method for extracting PDF content. A schematic flow chart of one specific implementation, referred to as the first specific implementation, is shown in Figure 1 and includes:

[0048] S101: Receive a PDF file to be processed.

[0049] S102: Determine PDF text information according to the PDF file to be processed.

[0050] The PDF text information can be determined from the PDF file to be processed by a machine learning method, for example by training an LSTM structure or a CNN neural network on several sample pages from the same PDF file to learn what the pages have in common, so that information such as headers, footers, and page numbers is excluded automatically and only the text information of the PDF remains. Alternatively, the PDF pages can be cut directly according to preset rules, such as cutting the presets at the upper and lower ends of the PDF page....
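The rule-based alternative in [0050] can be illustrated with a minimal sketch. It assumes the page has already been rendered to a raster image held as a sequence of pixel rows, and that the preset rule is a fixed fraction of the page height trimmed from the top and bottom. The function name crop_margins and the default ratios are hypothetical and do not come from the patent text.

```python
from typing import List, Sequence


def crop_margins(page_rows: Sequence[Sequence[int]],
                 top_ratio: float = 0.08,
                 bottom_ratio: float = 0.08) -> List[Sequence[int]]:
    """Drop a preset fraction of rows from the top and bottom of a page image.

    page_rows: the rendered page, one sequence of pixel values per row.
    top_ratio / bottom_ratio: assumed preset margins (fraction of page height)
    where headers, footers, and page numbers are expected to sit.
    """
    if not (0 <= top_ratio < 1 and 0 <= bottom_ratio < 1):
        raise ValueError("ratios must be in [0, 1)")
    if top_ratio + bottom_ratio >= 1:
        raise ValueError("margins would remove the whole page")
    height = len(page_rows)
    top = int(height * top_ratio)                 # first row that is kept
    bottom = height - int(height * bottom_ratio)  # one past the last kept row
    return list(page_rows[top:bottom])
```

A fixed crop like this is cheap, but it only works when every page uses the same margins; the learned approach described in the same paragraph is intended for files where that does not hold.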

Embodiment approach

[0077] During model training the data is unbalanced, and the current task focuses precisely on the categories with a small amount of data, such as tables and titles. So that the model can better learn the characteristics of these regions, an impact factor is added to the loss function to strengthen the learning of these categories.

[0078] The loss function of this model has three parts: bounding-box loss, classification loss, and confidence loss. The bounding-box loss of Yolov4 is the CIoU loss, and this part does not need to be modified. The confidence loss does not need to be changed either, because higher confidence is always better and there is no difference between categories. What the present invention modifies is the category loss caused by classification.

[0079] Modified category loss function:

[0080]

[0081] where Φ(c) is the impact factor of category c, and [symbol omitted] is the cross-entropy loss b...
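The formula in [0080] is not reproduced in this text, so the following is only a sketch of one common way to apply a per-category impact factor: the cross-entropy of each sample is multiplied by the factor Φ(c) of its true category. The multiplicative form, the use of softmax cross-entropy (YOLO implementations often use per-class binary cross-entropy instead), the function names, and the example factor values are all assumptions rather than the patent's definition.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_class_loss(logits: Sequence[float],
                        target: int,
                        impact: Sequence[float]) -> float:
    """Cross-entropy for one sample, scaled by the impact factor of its class.

    impact[c] plays the role of Phi(c); the multiplicative form is assumed.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    return impact[target] * cross_entropy


# Hypothetical factors: class 0 = body text (common), class 1 = table (rare).
impact_factors = [1.0, 3.0]
loss_text = weighted_class_loss([2.0, 0.5], target=0, impact=impact_factors)
loss_table = weighted_class_loss([2.0, 0.5], target=1, impact=impact_factors)
```

With factors above 1 for rare categories such as tables and titles, a misclassified table contributes more to the total loss than a misclassified text block, which matches the stated intent in [0077] of strengthening learning for the under-represented categories.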


Abstract

The invention discloses a PDF (Portable Document Format) content extraction method. The method comprises the following steps: receiving a to-be-processed PDF file; determining PDF text information according to the PDF file to be processed; and obtaining PDF content extraction information according to the PDF text information. According to the method, the to-be-processed PDF file is preprocessed: format information at the page edges, such as page headers, page footers and page numbers, is removed, and only the PDF text information is left for subsequent recognition. Compared with the prior art, the size of the image to be recognized by a subsequent program is reduced. Meanwhile, the page edge elements which assist reading but do not bear content information are eliminated and only the PDF text information related to the content is left, so that the content identification and extraction efficiency of a subsequent program is greatly improved, and the accuracy of later content identification is improved. The invention also provides a PDF content extraction device and equipment with the above beneficial effects, and a computer readable storage medium.

Description

Technical field

[0001] The present invention relates to the field of PDF recognition, in particular to a PDF content extraction method, device, equipment and computer-readable storage medium.

Background

[0002] PDF (Portable Document Format) is a portable file format that can be migrated between common operating platforms and reliably reproduces every character and color of the file when printed. In daily life, edited documents are often converted into PDF format to facilitate the reliable dissemination of information. Especially in recent decades, the amount of information has been increasing, and a large amount of data has emerged in the form of PDF. Extracting content of interest from massive numbers of PDF files is very tricky: it is very easy to convert common readable files such as HTML and Word into PDF files, but it is very difficult to convert PDF files back into more readable files. Based on the abov...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/62, G06N3/04
CPC: G06N3/044, G06N3/045, G06F18/23213
Inventor 邓川闾磊黄甫毅高阳郄蓓蓓陶鑫鑫
Owner 四川医枢科技有限责任公司