PDF (Portable Document Format) document layout detection method and device, equipment and medium

A detection method in document technology, applied in the field of computer vision. It addresses the heavy computation, low efficiency and high computational complexity of existing approaches, and achieves precise positioning and classification, improved accuracy and speed, and an optimized model size and structure.

Pending Publication Date: 2022-06-24
中电科网络安全科技股份有限公司

AI Technical Summary

Problems solved by technology

[0003] Although these electronic documents are easy to use and disseminate, understanding the document layout and extracting information from this format is complicated. PDF documents differ in language (Chinese, English, etc.), typesetting format, document type (scanned, text-based, etc.), font type and font size, and industry, so achieving unified document layout detection is very difficult and challenging, and it is hard to process and detect these documents automatically.

Existing PDF document parsing tools can achieve document layout detection to a certain extent, but they have several shortcomings.

First, most existing algorithms are trained on a specific type of PDF document data, such as English papers or Chinese journals, and cannot be applied to other industries and fields whose content formats differ widely. For example, they cannot perform layout detection on notices, contracts, leave notes, technical reports and similar documents.

Second, existing document layout detection algorithms cannot locate and extract all of the important objects, which fall into six categories: title, text, image, table, list and formula. Most existing algorithms cannot locate and extract formulas and lists, so the range of detected layout types is not rich enough.

Third, for title extraction, existing algorithms cannot locate and extract multi-level titles, including first-level, second-level, third-level and fourth-level titles and various subtitles without obvious prefixes. The function is therefore not complete enough to achieve fine-grained layout detection.

Fourth, existing object-detection-based document layout detection algorithms use the Faster R-CNN framework for training and testing. This framework has a complex model, high computational complexity, a long running time and low efficiency, and cannot meet the requirements of real-time layout detection.

Finally, existing document layout detection algorithms based on natural language processing rely on the semantic information and position information of each character in the document and cannot handle scanned PDF documents. Their application scenarios are limited, the models are complex, and the amount of calculation is large.
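As an illustration of the label space this passage implies, the following minimal Python sketch encodes the six object categories together with fine-grained title levels. The source does not give a label schema, so the names `LayoutClass`, `TITLE_LEVELS`, `build_label_map` and `coarse_class`, and the label strings themselves, are hypothetical.

```python
from enum import Enum


class LayoutClass(Enum):
    """The six coarse object categories named in the text."""
    TITLE = "title"
    TEXT = "text"
    IMAGE = "image"
    TABLE = "table"
    LIST = "list"
    FORMULA = "formula"


# Hypothetical fine-grained title labels: four numbered levels plus
# subtitles without an obvious prefix (label strings are assumptions).
TITLE_LEVELS = (
    "title_level_1",
    "title_level_2",
    "title_level_3",
    "title_level_4",
    "subtitle",
)


def build_label_map():
    """Map every fine-grained label name to an integer class id."""
    others = [c.value for c in LayoutClass if c is not LayoutClass.TITLE]
    names = list(TITLE_LEVELS) + others
    return {name: idx for idx, name in enumerate(names)}


def coarse_class(label):
    """Collapse a fine-grained label back to one of the six categories."""
    if label in TITLE_LEVELS:
        return LayoutClass.TITLE
    return LayoutClass(label)


label_map = build_label_map()
```

Splitting the title category into separate labels is one possible way to let a single detector distinguish title levels; the patent text shown here does not say how this is actually done.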

Examples

Embodiment Construction

[0058] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

[0059] Current methods for PDF document layout detection cannot be applied to other industries and fields whose content formats differ widely, cannot locate and extract the various important objects, cannot locate and extract multi-level titles, and cannot process scanned PDF documents. In addition, the framework model is complex, the computational complexity is high, the running time is long, and the efficiency is low, which cannot m...

Abstract

The invention discloses a PDF (Portable Document Format) document layout detection method, device, equipment and medium, and relates to the technical field of computer vision. The method comprises the following steps: obtaining various historical PDF documents and converting their pages into pictures; labeling the target objects in each picture with preset labeling boxes to obtain labeled pictures and target labeling information; training a target detection point network on the labeled pictures and the target labeling information to obtain a trained model, where the target detection point network is a network that performs target detection based on key points in the picture; and inputting the PDF document to be detected into the trained model to perform layout detection on it. The trained model can therefore detect various kinds of PDF documents and distinguish titles at a finer granularity; because the pages of the historical PDF documents are converted into pictures, scanned PDF documents can also be detected; and because a key-point-based target detection network is used, the precision and speed of the layout detection model are improved.
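The abstract does not name the key-point network or describe its outputs. As a rough sketch of how key-point-based detection can work in general, the Python below assumes a center-point formulation: each labeling box becomes a center key point plus a size, the key point is rendered as a Gaussian peak on a downsampled heatmap, and detections are read back as local maxima. All function names, the stride of 4 and the threshold are assumptions for illustration; in a real system a trained network would predict the heatmap, and the sub-cell offset ignored here would also need handling.

```python
import math


def box_to_keypoint(box, stride=4):
    """Turn an (x_min, y_min, x_max, y_max) pixel box into a centre cell
    on the downsampled grid plus a width and height in grid units."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2 / stride
    cy = (y0 + y1) / 2 / stride
    return int(cx), int(cy), (x1 - x0) / stride, (y1 - y0) / stride


def render_heatmap(width, height, keypoints, sigma=1.0):
    """Render one Gaussian peak per key point (a stand-in for the
    heatmap that a trained network would predict)."""
    heat = [[0.0] * width for _ in range(height)]
    for kx, ky, _, _ in keypoints:
        for y in range(height):
            for x in range(width):
                d2 = (x - kx) ** 2 + (y - ky) ** 2
                heat[y][x] = max(heat[y][x], math.exp(-d2 / (2 * sigma ** 2)))
    return heat


def find_peaks(heat, threshold=0.5):
    """Return (x, y, score) for cells that reach the threshold and are
    the maximum of their 3x3 neighbourhood."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            score = heat[y][x]
            if score < threshold:
                continue
            neighbours = [
                heat[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            if score >= max(neighbours):
                peaks.append((x, y, score))
    return peaks


def keypoint_to_box(x, y, w, h, stride=4):
    """Map a centre cell and size back to a pixel box."""
    cx, cy = x * stride, y * stride
    half_w, half_h = w * stride / 2, h * stride / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Example: one labelled title box on a 256 x 128 pixel page image.
kp = box_to_keypoint((40, 20, 120, 60))
heat = render_heatmap(64, 32, [kp])
peaks = find_peaks(heat)
box = keypoint_to_box(*kp)
```

Reading detections as heatmap peaks avoids the region-proposal stage of two-stage detectors, which is one commonly cited reason key-point detectors can be faster; whether the patented network works exactly this way is not stated in the text shown here.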

Description

Technical field

[0001] The present invention relates to the technical field of computer vision, and in particular to a method, device, equipment and medium for detecting the layout of a PDF document.

Background

[0002] At present, with the development of informatization, more and more offices use electronic documents in PDF (Portable Document Format) to communicate. Graphics and images are encapsulated in a single file, with a high degree of integration, security and reliability. This feature makes PDF an ideal document format for electronic document distribution and digital information dissemination on the network.

[0003] Although these electronic documents are easy to use and disseminate, due to the complexity of understanding the document layout and extracting information using this format, and the different languages of PDF documents (Chinese, English, etc.), different layout formats, different document for...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06V30/414, G06N3/08, G06N3/04, G06K9/62, G06V10/44, G06V10/764, G06V10/82
CPC: G06N3/08, G06N3/045, G06F18/241
Inventor: 祝蕾, 吴杰
Owner: 中电科网络安全科技股份有限公司