A PDF document content text paragraph aggregation method based on a neural network

This neural-network-based method for aggregating PDF document content text paragraphs addresses the large amount of human resources that the task otherwise requires, and achieves the effects of saving labor costs, facilitating content reuse, and improving efficiency.

Active Publication Date: 2019-06-28
武汉汉王数据技术有限公司

AI Technical Summary

Problems solved by technology

Manually restoring paragraph structure from the lines of text extracted from PDF documents requires a lot of human resources.

Method used

Dozens of features are defined for each line of text and converted into a multi-dimensional vector; a sample data set is generated, a neural network model is trained on it, and the trained model determines whether two input lines of text should be merged into the same paragraph.

Detailed Description of the Embodiments

[0020] In order to facilitate the understanding and implementation of the present invention by those of ordinary skill in the art, the present invention will be further described in detail with reference to the accompanying drawings and embodiments. It should be understood that the implementation examples described here are only used to illustrate and explain the present invention, and are not intended to limit this invention.

[0021] Referring to Figure 1, the neural-network-based method for aggregating text paragraphs of PDF document content provided by the present invention includes the following steps:

[0022] Step 1: For a number of PDF documents, extract the line text information features of each PDF document;

[0023] In this embodiment, the line text information features include line left margin, line right margin, number of characters, line maximum character height, line minimum character height, line maximum character width, line minimum character width, line maximum char...
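The first few features listed above can be sketched as a per-line feature vector. This is a minimal illustration, not the patent's implementation: the `Char` record, the `line_features` function, the `page_width` parameter, and the definition of the right margin as the distance from the line's right edge to the page edge are all assumptions made for the example, and only the seven features named before the truncation are computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical record for one extracted glyph (not from the patent)."""
    text: str
    x: float       # left edge of the glyph box, in page units
    width: float
    height: float


def line_features(chars: List[Char], page_width: float) -> List[float]:
    """Build a feature vector for one line of text.

    Order: left margin, right margin, character count,
    max/min character height, max/min character width.
    """
    if not chars:
        raise ValueError("a line needs at least one character")
    left = min(c.x for c in chars)
    right = max(c.x + c.width for c in chars)
    heights = [c.height for c in chars]
    widths = [c.width for c in chars]
    return [
        left,                 # line left margin
        page_width - right,   # line right margin (assumed definition)
        float(len(chars)),    # number of characters
        max(heights),
        min(heights),
        max(widths),
        min(widths),
    ]


# Usage: a two-character line on a 612-unit-wide page.
line = [Char("H", 72.0, 8.0, 12.0), Char("i", 80.0, 4.0, 12.0)]
features = line_features(line, page_width=612.0)
```

Keeping the vector as a flat list of floats makes it straightforward to concatenate the vectors of two lines into one model input.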

Abstract

The invention discloses a PDF document content text paragraph aggregation method based on a neural network. The method comprises the steps of defining dozens of features for a line of text, converting the features into a multi-dimensional vector, generating a sample data set, designing an algorithm model, training the model continuously, and finally outputting the trained algorithm model. For two input lines of text, the algorithm model accurately determines whether the two lines should be merged into the same paragraph. An application program developed on this neural-network-based artificial intelligence technology automatically aggregates the lines of characters extracted from a PDF into paragraphs, restoring the original sentence and paragraph structure of the text and facilitating reuse of PDF content data. Manual processing cannot match the efficiency of this automatic aggregation; replacing manual work with machines saves labor costs and greatly improves efficiency.
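The aggregation step described in the abstract can be sketched as a single pass over consecutive lines, asking a pairwise decision function whether each line continues the previous one. This is a sketch under stated assumptions: `aggregate` and `toy_should_merge` are hypothetical names, and the punctuation rule is only a stand-in for the trained neural network model, which in the described method would take the two lines' feature vectors as input.

```python
from typing import Callable, List, Optional


def aggregate(lines: List[str],
              should_merge: Callable[[str, str], bool]) -> List[str]:
    """Greedily join consecutive lines into paragraphs.

    `should_merge(prev, cur)` plays the role of the trained model:
    it returns True when `cur` belongs to the same paragraph as `prev`.
    """
    paragraphs: List[str] = []
    prev: Optional[str] = None
    for line in lines:
        if prev is not None and should_merge(prev, line):
            paragraphs[-1] += " " + line   # continue the current paragraph
        else:
            paragraphs.append(line)        # start a new paragraph
        prev = line
    return paragraphs


def toy_should_merge(prev: str, cur: str) -> bool:
    """Stand-in rule (an assumption, not the patent's model):
    merge unless the previous line ends a sentence."""
    return not prev.rstrip().endswith((".", "!", "?"))


# Usage: three extracted lines become two paragraphs.
lines = [
    "PDF is a file format for presenting",
    "documents independently of software.",
    "It is widely used.",
]
paragraphs = aggregate(lines, toy_should_merge)
```

Passing the decision function as a parameter keeps the aggregation loop unchanged when the stand-in rule is replaced by a real classifier.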

Description

Technical field

[0001] The invention belongs to the technical field of artificial intelligence and relates to a method for aggregating text paragraphs of PDF document content, in particular to a method for aggregating text paragraphs of PDF document content based on a neural network.

Background art

[0002] PDF (Portable Document Format) is a file format for presenting documents in a way that is independent of applications, hardware, and operating systems. Because the format does not depend on the operating system platform, a PDF document displays identically on operating systems such as Windows, Unix, and Mac OS. PDF documents can be opened with a variety of tools and browsers and are easy to read, transfer, and store, making PDF one of the most commonly used document formats today.

[0003] Although PDF documents guarantee the same presentation everywhere, re-editing a published PDF document is not easy. When the PDF document is published, because of the need t...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/46
CPC: Y02D10/00
Inventor: 聂昱
Owner: 武汉汉王数据技术有限公司