Method and device for document segmentation based on text lines

A text-line-based segmentation technology, applied in the field of text processing, that addresses the problems of unsegmented output and unfriendly table output and improves performance.

Active Publication Date: 2017-11-24
科来网络技术股份有限公司

AI Technical Summary

Problems solved by technology

[0003] Most of the existing PDF and HTML text extractions directly output the extracted text line by line without segmentation; or perform segment


Embodiment Construction

[0046] All features disclosed in this specification, or steps in all methods or processes disclosed, may be combined in any manner, except for mutually exclusive features and/or steps.

[0047] Any feature disclosed in this specification, unless specifically stated, can be replaced by other alternative features that are equivalent or have similar purposes. That is, unless expressly stated otherwise, each feature is one example only of a series of equivalent or similar features.

[0048] Related explanations of the present invention:

[0049] 1. "No content" refers to text line units that hold nothing other than characters such as spaces and carriage returns;

[0050] 2. curr_line.gap_top refers to the line spacing between the current text line unit and the adjacent text line unit;

[0051] 3. prev_line.gap_top refers to the line spacing between the previous text line unit and the adjacent text line unit;

[0052] 4. prev_line_start refers to the front-end text line unit;

[0053] 5. prev_line_start.g...
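
The definitions above can be illustrated with a small data structure. This is only a sketch under stated assumptions: the `TextLine` class, its `top` and `bottom` coordinates, and the helper `fill_gap_top` are hypothetical names that do not appear in the specification, and `gap_top` is assumed to measure the distance to the line unit directly above, with y-coordinates growing downward.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextLine:
    """Hypothetical text line unit; field names other than gap_top are assumptions."""
    text: str
    top: float      # y-coordinate of the line's top edge (assumed to grow downward)
    bottom: float   # y-coordinate of the line's bottom edge
    gap_top: Optional[float] = None  # spacing to the line above; None for the first line

    @property
    def no_content(self) -> bool:
        # A "no content" line holds only whitespace such as spaces or carriage returns.
        return self.text.strip() == ""


def fill_gap_top(lines: List[TextLine]) -> None:
    """Set gap_top on each line to its distance from the previous line's bottom edge."""
    prev: Optional[TextLine] = None
    for line in lines:
        if prev is not None:
            line.gap_top = line.top - prev.bottom
        prev = line
```

With the fields filled in, `curr_line.gap_top` and `prev_line.gap_top` can be compared directly when deciding whether two adjacent line units belong together.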


Abstract

The invention relates to the field of text processing. To address problems in the prior art, it provides a method and device for document segmentation based on text lines. Whether text line units are merged into one paragraph is decided by a merge score computed for those units: when the score does not meet the merging requirement, merging of the current paragraph ends and processing of a new paragraph begins. The approach is intended to solve the prior-art problems simply and effectively. The method extracts pages and document data structures, then extracts text line information from the document data structure corresponding to each text line. Every document data structure that contains text lines is traversed across the full text, and the text line information from these structures forms a text line information list, from which context information for the full text and for each page is computed statistically. Finally, based on the n text line unit structure lists in each page, combined with other context information, the text line units in each page are segmented according to a segmentation algorithm.
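
The merge-score idea in the abstract can be sketched in a few lines of Python. This is an illustration, not the patented algorithm: the scoring cues (line gap relative to the median gap, and whether the previous line ends with sentence punctuation), the 1.5 factor, the threshold, and the names `merge_score` and `segment` are all assumptions made for the example. The median gap stands in for the statistically computed context information.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical input shape: (text, gap_top). The first line's gap_top is ignored.
Line = Tuple[str, float]


def merge_score(prev_text: str, gap_top: float, typical_gap: float) -> float:
    """Assumed scoring rule: one point per cue that the line continues the paragraph."""
    score = 0.0
    if gap_top <= 1.5 * typical_gap:          # spacing close to the typical line gap
        score += 1.0
    if not prev_text.rstrip().endswith((".", "!", "?", ":")):  # previous line unfinished
        score += 1.0
    return score


def segment(lines: List[Line], threshold: float = 1.0) -> List[str]:
    """Merge consecutive lines into paragraphs while the merge score meets the threshold."""
    # Context information: the median gap over lines that carry content.
    gaps = [gap for text, gap in lines[1:] if text.strip()]
    typical_gap = median(gaps) if gaps else 0.0

    paragraphs: List[str] = []
    current: List[str] = []
    for text, gap in lines:
        if not text.strip():
            # A no-content line ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and merge_score(current[-1], gap, typical_gap) >= threshold:
            current.append(text.strip())
        else:
            # Score too low (or first line): close the paragraph and start a new one.
            if current:
                paragraphs.append(" ".join(current))
            current = [text.strip()]
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Calling `segment` on a page's lines returns one string per paragraph. A real implementation would likely weigh more cues (indentation, font, line width) and tune the threshold per document.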

Description

Technical field

[0001] The invention relates to the field of text processing, in particular to a text-line-based document segmentation method and device.

Background

[0002] As technology develops, more and more text processing is carried out automatically by machines. Among existing document formats, PDF and PDF-like HTML documents store text as individual lines rather than as merged paragraphs; paragraph breaks are conveyed only through visual styling when a person reads the document. To let a computer automatically merge the text lines in these documents into paragraphs, so that the text can then be processed further paragraph by paragraph, a feasible solution is proposed here.

[0003] Most of the existing PDF and HTML text extractions directly output the extracted text line by line without segmentation; or perform segmentation based on the discovery o...


Application Information

IPC(8): G06F17/21, G06F17/22
CPC: G06F40/10, G06F40/12
Inventor: 林康, 罗鹰, 张鑫阳
Owner: 科来网络技术股份有限公司