Textual Semantics-Based Text Structure Analysis Method

A text structure analysis technique, applied in the field of document semantic information analysis, that aims to provide a general and widely applicable method framework.

Active Publication Date: 2020-06-02

合肥图谱智能科技有限公司

Problems solved by technology

[0005] The purpose of the present invention is to provide a discourse structure analysis method based on text semantics that restores the document structure information of plain text and thereby lays a foundation for text mining tasks.

Examples

Embodiment Construction

[0030] The present invention will be described in detail below in conjunction with embodiments.

[0031] 1. Data acquisition

[0032] 101 Obtain plain text data. Obtain plain text (TXT) data from documents in formats that are not directly machine-readable, such as PDF files and images. Open-source tools can be used to convert the documents to be processed into machine-readable TXT format: for example, use PDFBox to parse PDF documents into TXT documents, or use OCR technology to convert scanned files in JPEG format into TXT documents.

[0033] 2. Text extraction

[0034] 102 Noise content filtering. Filter out content that acts as noise for the structure extraction task, such as blank lines, headers and footers, and table content. Headers and footers can be filtered based on the information repeated on each page, or, for specific types of documents, based on rules. Table content may interfere with judging the hierarchical structure, so tables need to be identified and removed.
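The repetition-based header and footer filtering described in [0034] could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name `filter_noise`, the `min_page_ratio` threshold, and the trick of masking digits so that "Page 1" and "Page 2" count as the same line are all hypothetical choices. Table detection is not covered.

```python
import re
from collections import Counter


def _key(line):
    # Assumption: mask digit runs so running page numbers compare as equal.
    return re.sub(r"\d+", "#", line.strip())


def filter_noise(pages, min_page_ratio=0.6):
    """Drop blank lines and lines that repeat across most pages.

    pages: list of pages, each a list of text lines.
    min_page_ratio: hypothetical threshold; a line seen on at least this
    fraction of pages is treated as a header or footer.
    """
    # Count on how many distinct pages each normalized, non-blank line occurs.
    page_counts = Counter()
    for page in pages:
        page_counts.update({_key(line) for line in page if line.strip()})

    # Require at least two pages so a single-page document loses nothing.
    threshold = max(2, min_page_ratio * len(pages))
    repeated = {key for key, n in page_counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        for line in page:
            text = line.strip()
            if text and _key(text) not in repeated:
                cleaned.append(text)
    return cleaned
```

Counting pages rather than raw occurrences keeps a phrase that is merely repeated within one page from being mistaken for a header. Digit masking can also hide real body lines that differ only in a number, so a rule-based filter per document type, as the paragraph suggests, may be the safer choice for some inputs.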

[0035...

Abstract

The invention discloses a text structure analysis method based on text semantics. The method comprises the following steps: 1, acquiring data; 2, extracting content; 3, recognizing and extracting titles; and 4, establishing a hierarchical structure. The method addresses the technical problems of restoring text structure information for plain text and laying a foundation for text mining tasks.
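Steps 3 and 4 of the abstract (title recognition and hierarchy building) are not detailed in this excerpt, so the following is only a minimal sketch of the general idea, not the patented method. It assumes headings carry dotted numbering such as "1." or "2.1", which stands in for whatever semantic title recognition the invention actually uses. `HEADING_RE` and `build_outline` are hypothetical names.

```python
import re

# Assumption: a heading is a dotted number ("1", "2.1", "2.1.3"), an optional
# trailing dot, whitespace, then the title text.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def build_outline(lines):
    """Return a nested outline of dict nodes built from plain text lines."""
    root = {"title": None, "level": 0, "body": [], "children": []}
    stack = [root]  # path from the root to the most recent heading
    for line in lines:
        match = HEADING_RE.match(line)
        if match is None:
            # Non-heading text belongs to the most recent heading.
            stack[-1]["body"].append(line)
            continue
        # Depth is the number of numbering components: "2.1" is level 2.
        level = match.group(1).count(".") + 1
        node = {"title": match.group(2), "level": level,
                "body": [], "children": []}
        # Pop until the top of the stack is a shallower heading (the parent).
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

A numbering regex will misread body lines that happen to start with a number (for example a sentence beginning with a year), which is one reason a real system would need stronger title recognition than this stand-in.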

Description

Technical field

[0001] The invention relates to a method for analyzing document semantic information.

Background art

[0002] Text structure is a kind of natural document semantic information that helps the reader understand the hierarchy of a document. Document writers usually design the document structure using a combination of visual and semantic means. Visual information includes font style, page layout, etc. Semantic information includes the use of multi-level headings, the distinction between headings and body text, the ordering of paragraphs, etc.

[0003] Text mining technology refers to the use of computer programs to automatically process text content, mining and extracting valuable text information. Text mining is a comprehensive computer technology involving linguistic models, natural language processing technology, machine learning algorithms, etc.

[0004] In terms of the semantic role of the document content, the document chapter structure generally ...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F40/14, G06F40/30

CPC: G06F40/14, G06F40/30

Inventor: 张梦迪, 郑锦光, 段清华, 吴珂皓, 鲍捷, 马新磊

Owner: 合肥图谱智能科技有限公司
