Text segmentation and topic annotation for document structuring

A text segmentation technique, applicable to natural language analysis and special data processing applications, that addresses two limitations of earlier methods: correlations between distant segments cannot be captured, and explicit segment models cannot be incorporated.

Inactive Publication Date: 2007-01-10
KONINKLIJKE PHILIPS ELECTRONICS NV

AI Technical Summary

Problems solved by technology

[0009] The basic strategy of first assigning sentences to clusters and then determining segment boundaries from cluster changes has several disadvantages. The method cannot be extended to capture longer-range information, such as correlations between more distant segments, because this information only becomes available after cluster assignment is complete.
At the same time, substructures within segments (e.g. typical start phrases) cannot be captured by a sentence-by-sentence cluster assignment.
Furthermore, explicit models for typical segment lengths cannot be incorporated into this method.
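
The two-stage strategy criticised above can be sketched as follows. This is a minimal illustration only, assuming that each sentence has already received a cluster label from some clustering step that is not shown. The function name boundaries_from_clusters and the example labels are hypothetical and do not come from the patent.

```python
from typing import List, Sequence


def boundaries_from_clusters(cluster_ids: Sequence[int]) -> List[int]:
    """Return the indices of sentences that start a new segment.

    A boundary is placed wherever the cluster label changes from one
    sentence to the next, so each decision looks at two adjacent labels.
    """
    boundaries = [0] if cluster_ids else []
    for i in range(1, len(cluster_ids)):
        if cluster_ids[i] != cluster_ids[i - 1]:
            boundaries.append(i)
    return boundaries


# Hypothetical cluster labels for seven consecutive sentences.
labels = [2, 2, 2, 5, 5, 2, 2]
starts = boundaries_from_clusters(labels)
```

Because every boundary is derived from adjacent labels alone, nothing in this procedure can express a preferred segment length or a dependency between distant segments, which is the limitation paragraph [0009] describes.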

Embodiment Construction

[0055] Figure 1 shows a block diagram of a text 100 comprising a plurality of words w_1 ... w_N. The text 100 is divided into a plurality of segments 102. For example, the first segment 102 begins with the first word w_1 104 of the text and ends with the word w_x 106. The next segment 102 begins with the following word of the word stream, w_x+1, and ends with the word w_y. The segment boundaries of the remaining segments 102 are defined in the same manner. A segment 102 is thus defined by its segment boundaries: it is characterized by the position of its first word w_1 104 and the position of its last word w_x 106. Here, the term "word" refers to words, numbers, letters or other types of text symbols.
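
The boundary convention of paragraph [0055] can be illustrated with a short sketch. It assumes that a segmentation is given as the 1-based positions of the last word of each segment (x, y, ... in the text). The function name split_by_end_positions and the sample words are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence, Tuple


def split_by_end_positions(
    words: Sequence[str], end_positions: Sequence[int]
) -> List[Tuple[int, int, List[str]]]:
    """Split a word stream into contiguous segments.

    end_positions holds the 1-based position of the last word of each
    segment, in increasing order. The final entry must equal len(words)
    so that every word belongs to exactly one segment.
    Returns one (first_position, last_position, words) tuple per segment.
    """
    if not end_positions or end_positions[-1] != len(words):
        raise ValueError("last end position must equal the number of words")
    segments = []
    start = 1
    for end in end_positions:
        if end < start:
            raise ValueError("end positions must be strictly increasing")
        segments.append((start, end, list(words[start - 1:end])))
        start = end + 1  # the next segment starts at w_(x+1)
    return segments


# Hypothetical eight-word text with segment ends at positions 4, 6 and 8.
words = "patient reports mild pain no fever plan rest".split()
segments = split_by_end_positions(words, [4, 6, 8])
```

Storing only the end positions is enough here because each segment starts directly after the previous one ends, as the paragraph states.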

[0056] A segment 102, defined as a linked sequence of words 101, is also assigned to a topic 108. Topics 108 are in turn associated with at least one tag 110. Typically, a topic 108 refers to a set of tags 110, 112, 114. Topic 108 represents the semantic meaning of...


Abstract

The invention relates to a method, a computer program product and a computer system for structuring an unstructured text by making use of statistical models trained on annotated training data. Each section into which the text is segmented is further assigned to a topic, which is associated with a set of labels. The statistical models for segmenting the text and for assigning a topic and its associated labels to a section of text explicitly account for: correlations between a section of text and a topic, a topic transition between sections, a topic position within the document and a (topic-dependent) section length. Hence structural information of the training data is exploited in order to perform segmentation and annotation of unknown text.
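
The abstract names four kinds of information that the statistical models take into account. This excerpt does not give the formulas, so the following is only a sketch under the assumption that the four factors are combined as a sum of log-probabilities for a candidate segmentation. Every function name, topic name and number below is invented for illustration and is not the patent's model.

```python
import math
from typing import Callable, List, Tuple

# A candidate structure: one (topic, words_of_section) pair per section.
Candidate = List[Tuple[str, List[str]]]


def score_candidate(
    candidate: Candidate,
    log_p_words: Callable[[List[str], str], float],
    log_p_transition: Callable[[str, str], float],
    log_p_position: Callable[[str, int], float],
    log_p_length: Callable[[int, str], float],
) -> float:
    """Sum the four assumed log-score components over all sections."""
    total = 0.0
    previous = "<start>"
    for index, (topic, words) in enumerate(candidate):
        total += log_p_words(words, topic)           # text/topic correlation
        total += log_p_transition(previous, topic)   # topic transition
        total += log_p_position(topic, index)        # topic position
        total += log_p_length(len(words), topic)     # topic-dependent length
        previous = topic
    return total


# Toy component models (hypothetical values, not trained on any data).
TYPICAL = {"history": {"pain", "fever"}, "plan": {"rest", "fluids"}}
ALLOWED = {("<start>", "history"), ("history", "plan")}
MEAN_LENGTH = {"history": 2, "plan": 2}


def toy_words(words: List[str], topic: str) -> float:
    return sum(math.log(0.5 if w in TYPICAL[topic] else 0.05) for w in words)


def toy_transition(previous: str, topic: str) -> float:
    return math.log(0.8 if (previous, topic) in ALLOWED else 0.1)


def toy_position(topic: str, index: int) -> float:
    if index == 0:
        return math.log(0.9 if topic == "history" else 0.1)
    return math.log(0.5)


def toy_length(length: int, topic: str) -> float:
    return -0.5 * abs(length - MEAN_LENGTH[topic])


models = (toy_words, toy_transition, toy_position, toy_length)
good = [("history", ["pain", "fever"]), ("plan", ["rest", "fluids"])]
bad = [("history", ["pain"]), ("plan", ["fever", "rest", "fluids"])]
good_score = score_candidate(good, *models)
bad_score = score_candidate(bad, *models)
```

With these toy models the first candidate scores higher than the second, because the misplaced boundary is penalised by both the word component and the length component. A search over all boundaries and topics, for example by dynamic programming, would be needed to find the best candidate and is not shown here.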

Description

Technical field

[0001] The present invention relates to the field of generating structured documents from unstructured text by dividing the unstructured text into sections and assigning a semantic topic to each section.

Background art

[0002] Dividing text into segments and assigning each segment a label that represents its content is a fundamental and common task in structuring text documents. Relevant tags or headings make it easy to find the passages of a document that matter to the reader. Based on the tags, readers can quickly and efficiently judge the relevance of a text segment. Unfortunately, a large number of text documents provide only insufficient structure or no structure at all.

[0003] Gathering information from unstructured or weakly structured documents requires extensive reading and/or detailed searching, which is tiring and very time-consuming for the r...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30, G06F40/20
CPC: G06F17/27, G06F17/2765, G06F40/20, G06F40/279
Inventors: J. Peters, C. Meyer, D. Klakow, E. Matusov
Owner: KONINKLIJKE PHILIPS ELECTRONICS NV