Document parsing method and device

A document parsing technology, applied in the field of document processing, that addresses problems such as disordered parsed content and the difficulty of constructing a complete knowledge structure.

Status: Inactive; Publication Date: 2021-06-25
爱因互动科技发展(北京)有限公司
6 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] In practice, however, many Portable Document Format (PDF) documents have complex layouts, and simple parsing often leaves much of the content out of order.
Moreover, industry documents often contain business-related, multi-level structured knowledge, and it is difficult to construct a complete knowledge structure by extracting only keywords or sentences.


Embodiment Construction

[0032] The drawings are for illustration only and should not be construed as limiting the invention. The technical solutions of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0033] Figure 1 is a flowchart of the document parsing method according to the present invention.

[0034] Those skilled in the art should understand that the documents targeted by the document parsing method of the present invention are generally PDF documents. PDF is the abbreviation of Portable Document Format, a file format developed by Adobe Systems for exchanging files independently of applications, operating systems, and hardware. The steps of parsing a PDF document are explained in detail in the preferred embodiments that follow. Figure 1 shows only the generic document parsing method according to the present invention, in general terms.

[0035] As shown in fig...

Abstract

The disclosure provides a document parsing method and device. The document parsing method (100) according to the present disclosure includes the following steps: performing content parsing on the document to detect text lines (S110); sorting the text lines based on a machine learning model (S120); classifying the sorted text based on a machine learning model (S130); and structuring the document content based on the result of the text classification (S140). The document parsing technology of the present disclosure uses a machine learning model and natural language processing to correct the preliminary parsing results, then classifies the parsed content with machine learning, improving the efficiency and accuracy of the final structure.
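The four steps S110 to S140 can be sketched as a small pipeline. This is a minimal illustration, not the patented implementation: the abstract specifies machine learning models for sorting and classification, whereas the sketch below substitutes a simple geometric rule for S120 and a regular expression for S130 so that it stays self-contained. The names `TextLine`, `Node`, `sort_lines`, `classify`, `build_tree`, `parse_document`, and the `column_split` threshold are all hypothetical, and the detected text lines (the output of S110) are assumed to be supplied by some PDF extraction tool.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextLine:
    """One detected text line (assumed output of S110)."""
    text: str
    page: int
    x: float  # left edge of the line
    y: float  # top edge of the line, growing downward


@dataclass
class Node:
    """A heading with its body text and nested sub-headings (output of S140)."""
    title: str
    body: list = field(default_factory=list)
    children: list = field(default_factory=list)


def sort_lines(lines, column_split=300.0):
    """S120 stand-in (heuristic, not a learned model): order lines by page,
    then left column before right column, then top to bottom."""
    return sorted(lines, key=lambda l: (l.page, l.x >= column_split, l.y))


# Hypothetical heading pattern: "1 Title", "1.1 Title", "1.1.1 Title", ...
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")


def classify(line):
    """S130 stand-in (rule, not a learned model): return the heading depth
    (1, 2, ...) or 0 for body text."""
    m = HEADING.match(line.text)
    return m.group(1).count(".") + 1 if m else 0


def build_tree(lines):
    """S140: nest sorted, classified lines under their headings."""
    root = Node("ROOT")
    stack = [(0, root)]  # (depth, node) path from the root to the current heading
    for line in lines:
        depth = classify(line)
        if depth == 0:
            # Body text attaches to the most recent heading.
            stack[-1][1].body.append(line.text)
            continue
        # Close headings at the same or a deeper level before opening a new one.
        while stack[-1][0] >= depth:
            stack.pop()
        node = Node(line.text)
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


def parse_document(lines):
    """Run S120 to S140 on lines already detected in S110."""
    return build_tree(sort_lines(lines))
```

Given lines from a two-column page in scrambled order, `parse_document` first restores a reading order and then returns a tree in which each heading holds its own body text and sub-headings. In a system following the abstract, `sort_lines` and `classify` would be replaced by trained models while `build_tree` could stay largely the same.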

Description

Technical field

[0001] The present invention relates to document processing based on machine learning, and more particularly to a document parsing method and device.

Background

[0002] Industries such as insurance and law keep large numbers of business documents. Analyzing these unstructured or semi-structured documents to obtain structured data is a common requirement, but many problems arise in practice.

[0003] Existing document knowledge extraction methods usually require the document's content format to be relatively simple, for example by processing only documents in the DOC or DOCX format of Microsoft Office, which sidesteps many parsing problems. Alternatively, they extract only simple content from the document, such as specific keywords or sentences that match certain rules.

[0004] But in fact, a large number of Portable Document Format (PDF) documents have complex document layouts, and simpl...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F16/25
Inventor: 钟翰廷, 韩警, 吴金龙, 王守崑
Owner: 爱因互动科技发展(北京)有限公司