Textual Semantics-Based Text Structure Analysis Method

A text structure analysis technique, applied in the field of document semantic information analysis, that provides a general and widely applicable method framework.

Active Publication Date: 2020-06-02
合肥图谱智能科技有限公司

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a text semantics-based discourse structure analysis method to solve the technical problems of restoring the document structure information of plain text and laying the foundation for text mining tasks.

Examples

Embodiment Construction

[0030] The present invention will be described in detail below in conjunction with embodiments.

[0031] 1. Data acquisition

[0032] Step 101: Obtain plain text data. Obtain plain text (TXT) data from documents in formats that are not directly machine-readable, such as PDF files or images. Open source tools can be used to convert the documents to be processed into machine-readable TXT format. For example, use PDFBox to parse PDF documents into TXT documents, or use OCR technology to convert scanned files in JPEG format into TXT documents.
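The conversion step can be pictured as a dispatch on file type. The following is a minimal sketch, not the patent's implementation: the names `to_plain_text`, `read_txt`, and `converters` are hypothetical, and the actual PDF and OCR converters (for example a wrapper around PDFBox or an OCR engine) are assumed to be supplied by the caller rather than shown here.

```python
from pathlib import Path
from typing import Callable, Dict

# A converter takes a file path and returns its plain-text content.
Converter = Callable[[Path], str]


def read_txt(path: Path) -> str:
    """Pass-through for files that are already plain text."""
    return path.read_text(encoding="utf-8")


def to_plain_text(path: Path, converters: Dict[str, Converter]) -> str:
    """Pick a converter by file extension and return the extracted text.

    `converters` maps lowercase extensions (".pdf", ".jpg", ...) to
    caller-supplied functions; this mapping is an assumption of the sketch.
    """
    suffix = path.suffix.lower()
    if suffix not in converters:
        raise ValueError(f"no converter registered for {suffix!r}")
    return converters[suffix](path)
```

Keeping the converters in a mapping lets a PDF parser and an OCR tool be registered side by side without changing the rest of the pipeline.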

[0033] 2. Text extraction

[0034] Step 102: Noise content filtering. Filter out content that is noise for the structure extraction task, such as blank lines, headers and footers, and table content. Headers and footers can be filtered based on the information that repeats on each page, or, for specific types of documents, based on rules. Table content may interfere with the judgment of the hierarchical structure, so tables need to be identified and removed.
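The repetition-based header and footer filter can be sketched as follows. This is an illustrative heuristic under stated assumptions, not the patent's algorithm: the input is assumed to be already split into pages (each a list of lines), only the first and last `edge` non-blank lines of a page are treated as header or footer candidates, and digits are masked so that page numbers such as "Page 1" and "Page 2" count as the same line. The function name `filter_noise` and the parameters `edge` and `min_ratio` are hypothetical. Table detection is not covered.

```python
import math
import re
from collections import Counter
from typing import List


def _key(line: str) -> str:
    # Mask digit runs so "Page 1" and "Page 2" compare equal.
    return re.sub(r"\d+", "#", line.strip())


def filter_noise(pages: List[List[str]], edge: int = 2,
                 min_ratio: float = 0.6) -> List[str]:
    """Drop blank lines and lines that repeat at page edges."""
    # Remove blank lines and surrounding whitespace first.
    cleaned = [[ln.strip() for ln in page if ln.strip()] for page in pages]

    # Count on how many pages each edge line (digit-masked) appears.
    counts: Counter = Counter()
    for lines in cleaned:
        counts.update({_key(ln) for ln in lines[:edge] + lines[-edge:]})

    # A line is a header/footer if it recurs on enough pages (at least 2).
    threshold = max(2, math.ceil(min_ratio * len(cleaned)))
    repeated = {k for k, n in counts.items() if n >= threshold}

    kept: List[str] = []
    for lines in cleaned:
        last = len(lines) - edge
        for i, ln in enumerate(lines):
            at_edge = i < edge or i >= last
            if at_edge and _key(ln) in repeated:
                continue  # repeated header or footer line
            kept.append(ln)
    return kept
```

Restricting the check to page edges keeps body lines that happen to repeat, such as a recurring sentence, from being discarded.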

[0035...

Abstract

The invention discloses a text structure analysis method based on text semantics. The method comprises the following steps: 1, acquiring data; 2, extracting content; 3, recognizing and extracting titles; and 4, establishing a hierarchical structure. The method addresses the technical problems of restoring document structure information for plain text and laying a foundation for text mining tasks.

Description

Technical field

[0001] The invention relates to a method for analyzing document semantic information.

Background art

[0002] Text structure is a kind of natural document semantic information that helps the reader understand the hierarchy of a document. Document writers usually combine visual and semantic means to design the document structure. Visual information includes font style, page layout, and so on. Semantic information includes the use of multi-level headings, the distinction between headings and body text, the ordering of paragraphs, and so on.

[0003] Text mining refers to the use of computer programs to automatically process text content, mining and extracting valuable text information. Text mining is a comprehensive computer technology involving linguistic models, natural language processing, machine learning algorithms, and so on.

[0004] In terms of the semantic role of the document content, the document chapter structure generally ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/14, G06F40/30
CPC: G06F40/14, G06F40/30
Inventor: 张梦迪, 郑锦光, 段清华, 吴珂皓, 鲍捷, 马新磊
Owner: 合肥图谱智能科技有限公司