Method and system for extracting information from PDF (Portable Document Format) documents in securities and futures scenarios

This technology, applied to information extraction from PDF documents in securities and futures scenarios, addresses the low completeness of information extraction and improves the ability to restore the information contained in the document.

Active Publication Date: 2022-07-29
浙商期货有限公司
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0005] The embodiments of the present application provide an information extraction method and system for PDF documents in securities and futures scenarios, so as to at least solve the problem in the related art that information extracted from such PDF documents is incomplete.



Examples


Embodiment Construction

[0048] In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application. Based on the embodiments provided in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

[0049] Obviously, the accompanying drawings in the following description are only some examples or embodiments of the present application. Those of ordinary skill in the art can also apply the present application to other similar situations according to these drawings without any creative effort. In addition, it will also be appreciated t...



Abstract

The invention relates to an information extraction method and system for PDF documents in securities and futures scenarios. The method comprises the following steps: reading the document objects of the PDF document by traversing it page by page; identifying and classifying the document objects of the current page according to their start and end positions, font forms and character sizes, to obtain text objects and non-text objects; identifying and classifying the non-text objects to obtain table objects and picture objects; and performing further subdivision and identification on the extracted text objects, table objects and picture objects. This addresses the low completeness of information extraction from PDF documents in securities and futures scenarios: the roughly extracted document objects are subdivided and identified further, which improves the accuracy of information extraction and the ability to restore the information in the PDF document.
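The two-stage classification described in the abstract can be sketched as follows. This is a minimal illustration only, not the patented method: the `DocObject` record, the `ruled` hint, and the decision rules (an object with a font and a character size is text; a non-text object bounded by ruling lines is a table; anything else is a picture) are assumptions introduced here, since the visible text does not state the actual rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DocObject:
    """One object read from a page (hypothetical record, not a pdfbox type)."""
    page: int
    x0: float             # start position, horizontal
    y0: float             # start position, vertical
    x1: float             # end position, horizontal
    y1: float             # end position, vertical
    font: str = ""        # font form; empty when the object carries no font
    size: float = 0.0     # character size; 0 when not applicable
    ruled: bool = False   # assumed hint: object is bounded by ruling lines


def classify_page(objects: List[DocObject]) -> Dict[str, List[DocObject]]:
    """Split one page's objects into text, table and picture objects."""
    result: Dict[str, List[DocObject]] = {"text": [], "table": [], "picture": []}
    for obj in objects:
        # Stage 1 (assumed rule): text objects have a font and a size
        # and are not enclosed by ruling lines.
        if obj.font and obj.size > 0 and not obj.ruled:
            result["text"].append(obj)
        # Stage 2 (assumed rule): non-text objects are tables when ruled,
        # pictures otherwise.
        elif obj.ruled:
            result["table"].append(obj)
        else:
            result["picture"].append(obj)
    return result


def extract(objects: List[DocObject]) -> Dict[int, Dict[str, List[DocObject]]]:
    """Traverse the document page by page and classify each page's objects."""
    pages: Dict[int, List[DocObject]] = {}
    for obj in objects:
        pages.setdefault(obj.page, []).append(obj)
    return {page: classify_page(objs) for page, objs in sorted(pages.items())}
```

Each class returned by `extract` would then be passed to its own subdivision step, which the abstract mentions but the visible text does not detail.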

Description

Technical field

[0001] The present application relates to the technical field of data processing, and in particular to a method and system for extracting information from PDF documents in securities and futures scenarios.

Background

[0002] In the field of securities and futures, a large number of research reports and announcements are published as PDF files, which contain general text as well as tables, pictures and other information. How to identify this unstructured data and convert it into structured data is a problem that urgently needs to be solved.

[0003] At present, the structural analysis of PDF documents is mostly based on open source tools such as pdfbox. pdfbox provides the function of traversing and reading, by page number, the basic information of the document objects of a PDF document (including text, pictures, tables, attachments, etc.), such as: providing character content, character encoding, font size, and positi...
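To illustrate why per-character information alone is still unstructured, the sketch below groups characters into text lines using only their content and position. It is a hedged example, not part of the application: the `Char` record, the top-down `y` coordinate, and the `tol` baseline tolerance are assumptions made here, and the code does not call pdfbox.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    """Basic per-character information (hypothetical record)."""
    text: str    # character content
    x: float     # horizontal position
    y: float     # baseline position, assumed to grow down the page
    size: float  # font size


def group_lines(chars: List[Char], tol: float = 2.0) -> List[str]:
    """Group characters whose baselines lie within `tol` into one line."""
    lines: List[Tuple[float, List[Char]]] = []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        # Compare against the baseline of the first character in the line.
        if lines and abs(ch.y - lines[-1][0]) <= tol:
            lines[-1][1].append(ch)
        else:
            lines.append((ch.y, [ch]))
    # Within a line, read characters from left to right.
    return ["".join(c.text for c in sorted(row, key=lambda c: c.x))
            for _, row in lines]
```

A fixed tolerance is the simplest choice; a tolerance scaled by font size would likely behave better on pages that mix headings and body text.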


Application Information

IPC(8): G06V30/412, G06V30/413, G06N3/04, G06N3/08, G06V10/82
CPC: G06N3/08, G06N3/045, Y02D10/00
Inventors: 杨胜利, 吴福文, 康维鹏, 唐逐时
Owner: 浙商期货有限公司