
PDF character extraction method and device, computer device and storage medium

A text extraction technology, applied to PDF text extraction and to computer equipment and storage media, that addresses problems such as an unguaranteed recognition rate and unsatisfactory...

Pending Publication Date: 2019-04-23
GUANGDONG ESHORE TECH

AI Technical Summary

Problems solved by technology

In actual production there are many PDF files that have no explicit table borders but whose text is nevertheless laid out as a table.
At the same time, OCR technology cannot guarantee a 100% recognition rate, and it places high demands on document clarity, fonts, and the skew of scanned images.
Some production scenarios, such as certain financial PDFs, require a 100% recognition rate, which general-purpose OCR technology cannot deliver.



Embodiment Construction

[0053] In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not intended to limit the present application.

[0054] The PDF text extraction method provided by the embodiments of the present invention can be applied in the application environment shown in Figure 1. The server 110 is connected to a computer device 120 through a network, where the computer device 120 may be any computer device, such as a personal computer or a large computer. The server 110 sends a PDF text extraction request to the computer device 120; the computer device 120 obtains the PDF text extraction request and determines the range of the data area in the PDF document according to the PDF...


Abstract

The invention relates to a PDF character extraction method and device, a computer device, and a storage medium. The method comprises: obtaining a PDF character extraction request; determining the range of a data area in the PDF document according to the request; performing column cutting and row cutting on the data area; dividing the data area into a plurality of cells according to the results of the column cutting and row cutting; and extracting the characters in each of the cells. For editable PDF files, the data area of a borderless table is marked with points, segmented, and mapped through a visual front-end tool; the data area is divided into a grid of cells, and characters are extracted from the text of each grid cell region.
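The steps in the abstract (data area, column cutting, row cutting, cells, per-cell extraction) can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Word` type, the `extract_cells` function, and the sample coordinates are all hypothetical, and the sketch assumes that text fragments and their positions have already been read from an editable PDF by some PDF library, with y increasing down the page.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """A text fragment and the coordinates of its centre point (hypothetical input)."""
    text: str
    x: float
    y: float


def extract_cells(words, area, column_cuts, row_cuts):
    """Divide `area` into cells at the cut positions and collect the text in each cell.

    area        -- (x0, y0, x1, y1) bounds of the data area
    column_cuts -- x positions of the vertical cuts inside the area
    row_cuts    -- y positions of the horizontal cuts inside the area
    Returns a list of rows, each a list of cell strings.
    """
    x0, y0, x1, y1 = area
    column_cuts = sorted(column_cuts)
    row_cuts = sorted(row_cuts)
    n_rows, n_cols = len(row_cuts) + 1, len(column_cuts) + 1
    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]

    # Visit words by ascending y, then x, so fragments that share a line
    # are appended to their cell from left to right.
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if not (x0 <= w.x <= x1 and y0 <= w.y <= y1):
            continue  # fragment lies outside the data area
        col = bisect_right(column_cuts, w.x)  # number of column cuts at or left of the word
        row = bisect_right(row_cuts, w.y)     # number of row cuts at or above the word
        cells[row][col].append(w.text)

    return [[" ".join(cell) for cell in row] for row in cells]


# Made-up sample: a borderless two-column table plus a footer outside the data area.
words = [
    Word("Item", 20, 10), Word("Amount", 120, 10),
    Word("Rent", 20, 30), Word("1,200.00", 120, 30),
    Word("Page 1", 20, 200),
]
table = extract_cells(words, area=(0, 0, 200, 50), column_cuts=[100], row_cuts=[20])
```

Because the text is read directly from the editable PDF and only its position is used to pick a cell, no OCR step is involved; in this sketch the cut positions stand in for the points a user would mark in the visual front-end tool.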

Description

Technical field

[0001] The present invention relates to the field of computer application technology, in particular to a PDF text extraction method, device, computer equipment and storage medium.

Background

[0002] PDF is the abbreviation of Portable Document Format, an open electronic file format developed by Adobe. PDF was developed from the PostScript programming language, which is still widely used in professional publishing as the mainstream printer programming language. PDF largely retains the page description method of the PostScript programming language and adopts the character encoding method defined in it.

[0003] In traditional technology, PDF table recognition generally relies on explicit table borders: the document is analyzed with OCR technology, the bordered data area is identified, and text data is extracted according to that bordered area. However,...


Application Information

IPC(8): G06K9/00, G06K9/34
CPC: G06V30/416, G06V30/414, G06V30/158, G06V30/153
Inventor 郑裕濠廖小文詹先余伦强黄瑞延
Owner GUANGDONG ESHORE TECH