
Text structure analysis method based on text semantics

A technology for text structure analysis, applied in semantic analysis, word processing, special data processing applications, etc., offering a general method framework and wide applicability

Active Publication Date: 2017-09-08
合肥图谱智能科技有限公司
2 Cites, 19 Cited by

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a discourse structure analysis method based on text semantics, which solves the technical problems of restoring the document structure information of plain text and laying a foundation for text mining tasks.


Examples


Detailed Description of the Embodiments

[0029] The present invention will be described in detail below in conjunction with the examples.

[0030] 1. Data acquisition

[0031] 101 Plain text data. Obtain plain text (TXT) data from documents in formats that are not directly machine-readable, such as PDF files and images. Documents to be processed can be converted to machine-readable TXT format using open source tools. For example, use PDFBox to parse PDF documents into TXT documents, or use OCR technology to convert scanned files in JPEG format into TXT documents.

[0032] 2. Text extraction

[0033] 102 Noise content filtering. Filter out content that is noise for the structure extraction task, such as blank lines, headers and footers, and table content. Headers and footers can be filtered by detecting information that repeats on every page, or by rules written for specific types of documents. Table content may interfere with judging the hierarchical structure, so tables must be identified and removed.
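
The blank-line and repeated header/footer filtering described in step 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `filter_noise`, the parameters `min_ratio` and `edge_lines`, and the choice to replace digits with `#` so that changing page numbers still count as repeats are all assumptions made for the example. Table identification is not covered.

```python
import re
from collections import Counter


def _key(line):
    # Assumption: mask digits so "Page 1" and "Page 2" compare as equal.
    return re.sub(r"\d+", "#", line)


def filter_noise(pages, min_ratio=0.6, edge_lines=1):
    """Drop blank lines and lines that repeat at page edges.

    pages: list of strings, one string per page of extracted text.
    Returns a flat list of the remaining lines in reading order.
    """
    # Remove blank lines and surrounding whitespace on every page.
    page_lines = [
        [ln.strip() for ln in page.splitlines() if ln.strip()]
        for page in pages
    ]

    # Count on how many pages each (masked) edge line occurs.
    counts = Counter()
    for lines in page_lines:
        edges = lines[:edge_lines] + lines[-edge_lines:]
        counts.update({_key(ln) for ln in edges})

    # A line is treated as header/footer if it repeats on enough pages.
    threshold = max(2, int(min_ratio * len(pages)))
    repeated = {k for k, c in counts.items() if c >= threshold}

    cleaned = []
    for lines in page_lines:
        for i, ln in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and _key(ln) in repeated:
                continue  # skip repeated header or footer line
            cleaned.append(ln)
    return cleaned


pages = [
    "Annual Report 2016\n\n1. Overview\nBody text A.\nPage 1",
    "Annual Report 2016\nBody text B.\n\n2. Details\nPage 2",
    "Annual Report 2016\nBody text C.\nPage 3",
]
print(filter_noise(pages))
```

Only the first and last `edge_lines` lines of each page are candidates for removal, so a sentence that happens to recur in the body text is left alone. The rule-based alternative mentioned in the text would replace the repetition count with patterns specific to a document type.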

[0034] 103 dir...


Abstract

The invention discloses a text structure analysis method based on text semantics. The method comprises the following steps: 1, acquiring data; 2, extracting content; 3, recognizing and extracting titles; and 4, establishing a hierarchical structure. The method solves the technical problems of restoring the structure information of plain text and laying a foundation for text mining tasks.
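
Steps 3 and 4 can be pictured with a small sketch. The visible text does not say how the invention recognizes titles, so the example below swaps in a deliberately simple stand-in: it assumes titles carry dotted numbering such as `1.` or `1.1` and derives the level from the number of components. The names `HEADING`, `Node`, and `build_tree` are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumption: a title starts with dotted numbering, e.g. "1. Overview" or "1.1 Scope".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


@dataclass
class Node:
    title: str
    level: int
    body: list = field(default_factory=list)      # plain text lines under this title
    children: list = field(default_factory=list)  # nested sub-sections


def build_tree(lines):
    """Build a section tree from cleaned text lines using a stack of open sections."""
    root = Node("ROOT", 0)
    stack = [root]
    for line in lines:
        match = HEADING.match(line)
        if match:
            # "1" -> level 1, "1.1" -> level 2, and so on.
            level = match.group(1).count(".") + 1
            node = Node(match.group(2), level)
            # Close sections at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text belongs to the innermost open section.
            stack[-1].body.append(line)
    return root


tree = build_tree([
    "Preface text",
    "1. Overview",
    "Body A",
    "1.1 Scope",
    "Body B",
    "2. Details",
    "Body C",
])
print([child.title for child in tree.children])
```

A numbering pattern alone misfires on body lines that begin with a number (for example a sentence starting with a year), which is one reason a semantics-based title recognizer is needed in practice.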

Description

Technical field

[0001] The invention relates to a method for analyzing document semantic information.

Background art

[0002] Text structure is a kind of natural document semantic information that helps readers understand the document hierarchy. Document writers usually combine visual and semantic means to design document structure. Visual information includes font style, page layout, etc. Semantic information includes the use of multi-level headings, the distinction between headings and body text, and the order of paragraphs.

[0003] Text mining refers to the use of computer programs to automatically process text content in order to mine and extract valuable information. Text mining is a comprehensive computer technology that involves linguistic models, natural language processing, machine learning algorithms, etc.

[0004] From the perspective of the semantic role of document content, document chapter structure generally i...


Application Information

IPC(8): G06F17/22, G06F17/27
CPC: G06F40/14, G06F40/30
Inventor: 张梦迪, 郑锦光, 段清华, 吴珂皓, 鲍捷, 马新磊
Owner: 合肥图谱智能科技有限公司