Document paragraph recognition method and device and electronic equipment

A document paragraph recognition technology, applied in the computer field, that addresses problems such as positional deviation, insufficiently accurate results, and results that do not match manual recognition. Its effect is to improve recognition accuracy and bring the recognition results closer to those of manual recognition.

Active Publication Date: 2020-06-02
BEIJING KINGSOFT OFFICE SOFTWARE INC +2

AI Technical Summary

Problems solved by technology

[0003] However, paragraph areas derived from page parameters and formatting may deviate in position, so multiple paragraphs are easily recognized as one paragraph, or one paragraph as several. As a result, the recognition result is not accurate enough and may not match the result of manual identification.

Examples

Embodiment approach

[0088] As an implementation of the embodiment of the present invention, the device further includes a training unit.

[0089] The training unit includes:

[0090] The sample acquisition module is used to acquire multiple training samples; each training sample includes a document image and the real coordinates of the rectangular area where each paragraph in the document image is located.

[0091] The input module is used to input a preset number of document images into a paragraph recognition model to be trained; the paragraph recognition model to be trained is a preset initial convolutional neural network model.

[0092] The calculation module is used to calculate, with the paragraph recognition model to be trained, the coordinates of the rectangular area where the paragraph is located in each document image.

[0093] The loss value calculation module is used to calculate the loss value by using the calculated coordinates of the rectangular area where the paragraph is located in each input document image...
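The training data and loss step in [0090] to [0093] can be illustrated with a minimal sketch. The source text is truncated and does not name the loss function, so the smooth L1 loss below is an assumption, as are the `(x_min, y_min, x_max, y_max)` rectangle layout, the one-to-one pairing of predicted and real rectangles, and every name used (`TrainingSample`, `smooth_l1`, `rect_loss`, `batch_loss`). The convolutional neural network itself is left out; the sketch only shows how calculated coordinates might be compared with real coordinates.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed rectangle layout: (x_min, y_min, x_max, y_max) in pixels.
Rect = Tuple[float, float, float, float]


@dataclass
class TrainingSample:
    """One training sample as in [0090]: a document image plus real rectangles."""
    image: object           # placeholder for the document image data
    real_rects: List[Rect]  # real coordinates of each paragraph rectangle


def smooth_l1(diff: float, beta: float = 1.0) -> float:
    """Smooth L1 penalty for one coordinate difference (assumed loss)."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def rect_loss(predicted: Sequence[Rect], real: Sequence[Rect]) -> float:
    """Mean smooth L1 loss over all coordinates of paired rectangles."""
    if len(predicted) != len(real):
        raise ValueError("sketch assumes one predicted rectangle per real one")
    if not real:
        return 0.0
    total = sum(
        smooth_l1(p - r)
        for pred_rect, real_rect in zip(predicted, real)
        for p, r in zip(pred_rect, real_rect)
    )
    return total / (4 * len(real))


def batch_loss(
    predictions: Sequence[Sequence[Rect]],
    samples: Sequence[TrainingSample],
) -> float:
    """Average per-image loss over a batch (the preset number of images)."""
    per_image = [rect_loss(p, s.real_rects) for p, s in zip(predictions, samples)]
    return sum(per_image) / len(per_image)
```

In a real training loop this value would drive the parameter update of the model; how the patent performs that update is not visible in the truncated text.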

Abstract

The embodiment of the invention provides a document paragraph recognition method and device and electronic equipment. The method comprises: obtaining a to-be-processed document; generating a to-be-processed document image; inputting the to-be-processed document image into a paragraph recognition model based on a convolutional neural network; and obtaining a paragraph recognition result for the to-be-processed document image. The paragraph recognition model is obtained by training on document image samples and the paragraph positions in those samples. Compared with the prior art, a model representing the relationship between document image features and paragraph positions can therefore be established more accurately, which improves the accuracy of document paragraph recognition, brings the recognition result closer to the result of manual recognition, and facilitates subsequent document editing and typesetting.
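The steps in the abstract can be sketched as a small pipeline. This is only an illustration under stated assumptions: `recognize_paragraphs`, `render_page`, and `model` are hypothetical names, the image is assumed to be a grayscale pixel grid, and the two stand-in functions replace a real PDF renderer and a trained convolutional neural network, neither of which is specified in the text shown here.

```python
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
Image = List[List[int]]                   # assumed grayscale pixel grid


def recognize_paragraphs(
    document: bytes,
    render_page: Callable[[bytes], Image],
    model: Callable[[Image], List[Rect]],
) -> List[Rect]:
    """Generate a document image, then ask the model for paragraph rectangles."""
    image = render_page(document)  # step 2: generate the document image
    return model(image)            # steps 3 and 4: run the model, return its result


def fake_render(document: bytes) -> Image:
    """Stand-in renderer: a blank 4x4 page, regardless of the input bytes."""
    return [[255] * 4 for _ in range(4)]


def fake_model(image: Image) -> List[Rect]:
    """Stand-in model: reports the whole page as a single paragraph."""
    height, width = len(image), len(image[0])
    return [(0.0, 0.0, float(width), float(height))]


rects = recognize_paragraphs(b"%PDF-", fake_render, fake_model)
```

Passing the renderer and the model as callables keeps the sketch independent of any particular PDF or deep learning library.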

Description

Technical field

[0001] The present invention relates to the field of computer technology, in particular to a document paragraph recognition method and device and electronic equipment.

Background technique

[0002] At present, when editing documents in Portable Document Format (PDF), it is often necessary to identify the paragraphs in the document so that the text in them can be typeset more quickly. The usual way to identify paragraphs in a PDF document is to obtain the page parameter information of the document through the PDF software, such as the positions of the header and footer, the left and right margins, and the font and font size of the text objects, and then to combine the indentation and punctuation of the text lines to parse out the position of the text and determine the area of each paragraph.

[0003] However, the area where the paragraphs analyzed by using page parameters and formatting may have positional deviation, it is easy to recognize mu...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/62
CPC: G06V30/414, G06F18/214
Inventor: 邓斌
Owner: BEIJING KINGSOFT OFFICE SOFTWARE INC