PDF document structured information extraction method and device
An information extraction and document structuring technology, applied in fields such as instruments, character and pattern recognition, and computer parts. It addresses the difficulty of obtaining structured information from PDF documents and avoids manual processing.
Embodiment Construction
[0050] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
[0051] Referring to Figure 1, in a specific embodiment, the PDF document structured information extraction method includes:
[0052] S100: Acquire the original pages of the PDF document.
[0053] S200: Extract, from the original pages, at least one actual page that includes text content or a title.
[0054] S300: Extract the titles of each level, and the text content belonging to each title, from the actual pages.
[0055] S400: Store each title and the text content belonging to it in a structured form.
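Steps S100 to S400 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: pages are assumed to arrive already as lists of text lines (no PDF parsing is shown), an "actual page" is assumed to be any page with at least one non-blank line, titles are assumed to be recognizable by dotted numbering such as "1" or "1.1", and JSON is an assumed storage format. The names `Section`, `acquire_pages`, `select_actual_pages`, `extract_sections`, and `store_structured` are hypothetical.

```python
import json
import re
from dataclasses import dataclass, field, asdict

# Hypothetical heuristic: a title is a line that starts with dotted numbering,
# e.g. "1 Overview" or "2.1 Scope". The source does not specify this rule.
TITLE_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Section:
    title: str
    level: int
    text: list = field(default_factory=list)      # text lines belonging to this title
    children: list = field(default_factory=list)  # lower-level titles


def acquire_pages(raw_pages):
    """S100: take the original pages (assumed to be lists of text lines)."""
    return [list(page) for page in raw_pages]


def select_actual_pages(pages):
    """S200: keep pages with at least one non-blank line (assumed criterion)."""
    return [p for p in pages if any(line.strip() for line in p)]


def extract_sections(pages):
    """S300: find titles of each level and attach the following text to them."""
    root = Section(title="", level=0)
    stack = [root]  # path from the root to the most recent title
    for page in pages:
        for line in page:
            line = line.strip()
            if not line:
                continue
            match = TITLE_RE.match(line)
            if match:
                level = match.group(1).count(".") + 1
                # Climb back up until the top of the stack is a valid parent.
                while stack[-1].level >= level:
                    stack.pop()
                node = Section(title=line, level=level)
                stack[-1].children.append(node)
                stack.append(node)
            else:
                stack[-1].text.append(line)
    return root


def store_structured(root):
    """S400: serialize the hierarchy (JSON is an assumed storage format)."""
    return json.dumps(asdict(root), ensure_ascii=False, indent=2)
```

For example, `store_structured(extract_sections(select_actual_pages(acquire_pages(pages))))` returns a nested JSON document in which each title carries its own text and its lower-level titles. The numbering heuristic also matches ordinary lines that happen to begin with a number, so a real system would need a stronger title test.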
[0056] Structured information means that the information can be decomposed into multiple interrelated components after analysis, and there is a clear hierarchical structure among the components. In this application, the structured information of a PDF docume...