Paragraph segmentation method and paragraph segmentation device

A paragraph segmentation technology for documents, applied in fields such as special data processing applications, instruments, and electronic digital data processing, that addresses the difficulty of segmenting paragraphs correctly.

Inactive Publication Date: 2016-09-28
HITACHI LTD
7 Cites 0 Cited by

AI Technical Summary

Problems solved by technology

[0003] However, with existing paragraph segmentation and text segmentation methods, it is difficult to segment paragraphs correctly when a document contains multiple paragraphs whose sentences have similar meanings, that is, sentences with similar feature quantities.
As a result, automatic summarization of documents and automatic keyword extraction for document retrieval cannot be performed efficiently.

Examples

Embodiment 1

[0041] The first embodiment is a paragraph segmentation method, device, and program that use document vectors in similarity calculation and word vectors in similar document retrieval. In this embodiment, a document vector is a vector whose dimensions correspond to all the documents held in the corpus unit of the segmentation device.

[0042] Before describing this embodiment in detail, an example of document vectors and word vectors will be described.

[0043] Figure 6 shows an example of a document vector. In the example of Figure 6, the total number of documents included in the corpus is set to ten. When the documents obtained as the retrieval result are documents 1, 3, 4, and 8, the document vector is represented as document vector 601 shown in Figure 6(a). Similarly, when search scores are obtained as the retrieval result, the document vector can be expressed as document vector 602 shown in Figure 6(b).
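The document vector described in [0043] can be illustrated with a minimal sketch. It assumes that document vector 601 marks each retrieved document with 1 and every other document with 0, and that document vector 602 stores the search score in each retrieved document's dimension; the text does not spell out either encoding. The function names and the score values are hypothetical.

```python
def binary_document_vector(hit_ids, corpus_size):
    """One dimension per corpus document (1-based IDs): 1 if retrieved, else 0.

    Assumption: this 0/1 encoding is how vector 601 is laid out.
    """
    return [1 if doc_id in hit_ids else 0 for doc_id in range(1, corpus_size + 1)]


def scored_document_vector(scores, corpus_size):
    """One dimension per corpus document: its search score, or 0.0 if not retrieved.

    Assumption: this is how vector 602 is laid out.
    """
    return [scores.get(doc_id, 0.0) for doc_id in range(1, corpus_size + 1)]


# Corpus of ten documents; documents 1, 3, 4, and 8 are retrieved.
vec_601 = binary_document_vector({1, 3, 4, 8}, 10)
print(vec_601)  # [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

# Hypothetical search scores for the same four documents.
vec_602 = scored_document_vector({1: 0.9, 3: 0.4, 4: 0.7, 8: 0.2}, 10)
print(vec_602)
```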

[0044] Figure 7 represents an example of a ...

Embodiment 2

[0088] Embodiment 2 is an embodiment of a paragraph segmentation method, device, and program that use word vectors in similarity calculations and also use word vectors in similar document retrieval.

[0089] Figure 4 is a functional block diagram of the paragraph segmentation device of the second embodiment. The hardware structure of this device is the same as that of Embodiment 1 shown in Figure 1A, and it can of course also be implemented by the computer illustrated in Figure 1B, so the illustration of the hardware structure is omitted here.

[0090] The input unit 402, the sentence segmentation unit 403, the paragraph update unit 408, the output unit 409, the sentence storage unit 410, the feature storage unit 412, the paragraph storage unit 413, and the morpheme analysis unit 414 are the same as the corresponding modules of Embodiment 1, so the description covers only the corpus unit 411, the feature quantity calculation unit 404, the simi...

Abstract

PROBLEM TO BE SOLVED: To solve such a problem that in the conventional method, it is difficult to correctly divide a passage when a plurality of passages containing sentences with kindred meaning and similar feature quantity are included in one document.

SOLUTION: A passage division device 100, under control of a control unit 101, divides a document input from an input unit 102 into sentence units at a sentence division unit 103. A feature quantity calculation unit 104, with the divided sentence as a query, performs associative retrieval of a document which is stored beforehand in a corpus unit 111 and acquires a document vector. A similarity calculation unit 105 retrieves two document vectors whose similarity becomes maximum, and when the similarity is equal to or larger than a prescribed threshold, a retrieval query generation unit 106 consolidates the two sentences to generate a query as a common element. The feature quantity calculation unit 104 regenerates a document vector by using this query. A feature quantity update unit 107 updates the feature quantity on the basis of its reliability, and connects corresponding sentences sequentially to make a passage while updating the feature quantity.
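The flow in the abstract can be sketched roughly as follows. This is a toy illustration under several assumptions, not the patented implementation: associative retrieval is replaced by a simple shared-word count, similarity is cosine similarity, only adjacent segments are merged, the "common element" query is taken to be the words shared by the two segments (falling back to their union when nothing is shared), and the reliability-based feature update of unit 107 is omitted. All function names, the corpus, and the threshold value are hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercased set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query_words, corpus):
    """Toy stand-in for associative retrieval: one score per corpus document,
    equal to the number of words the document shares with the query."""
    return [float(len(query_words & tokenize(doc))) for doc in corpus]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(sentences, corpus, threshold):
    """Greedily merge the most similar adjacent pair until none reaches the threshold."""
    # Each segment is (sentences, query words, document vector).
    segs = [([s], tokenize(s), retrieve(tokenize(s), corpus)) for s in sentences]
    while len(segs) > 1:
        sims = [cosine(segs[i][2], segs[i + 1][2]) for i in range(len(segs) - 1)]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break
        left, right = segs[best], segs[best + 1]
        # Assumed reading of "query as a common element": the shared words.
        query = (left[1] & right[1]) or (left[1] | right[1])
        # Regenerate the document vector from the new query and merge the pair.
        segs[best:best + 2] = [(left[0] + right[0], query, retrieve(query, corpus))]
    return [sents for sents, _, _ in segs]


corpus = ["cats and dogs are pets", "stocks and bonds are investments"]
sentences = [
    "Cats are pets.",
    "Dogs are pets.",
    "Stocks are investments.",
    "Bonds are investments.",
]
passages = segment(sentences, corpus, threshold=0.9)
for passage in passages:
    print(passage)
```

With this toy corpus and threshold, the four sentences end up in two passages, one about pets and one about investments. A real system would use a proper retrieval engine and the reliability-weighted update described in the abstract.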

Description

Technical field

[0001] The invention relates to the processing of electronic files, and in particular to paragraph segmentation technology for electronic files.

Background

[0002] In recent years, the digitization of documents and their conversion into databases have advanced, and natural language processing technology has also made great progress. For example, a large amount of research has been carried out on automatic summarization of documents and automatic keyword extraction for document retrieval. However, in many cases, a document targeted by these techniques is assumed either to be already divided into paragraphs, that is, into topics or units of meaning that summarize content, or to contain only a single paragraph. Therefore, for documents containing multiple paragraphs, it is effective to split them into paragraphs in advance. Conventionally, as such a paragraph segmentation method, a text segmentation method described in Patent Document 1 or Patent Document 2 or the like is k...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27
Inventor: 柿下容弓, 服部英春, 村上智一, 今一修
Owner: HITACHI LTD