Text summarization method based on TF-IDF

A TF-IDF-based text summarization technology, applicable to unstructured text data retrieval, text database browsing/visualization, and special data processing applications, that avoids the huge computing resources and long training times required by RNN-based methods.

Active Publication Date: 2019-07-02
BEIJING UNIV OF TECH

AI Technical Summary

Problems solved by technology

Training an RNN takes a long time and requires huge computing resources.


Examples


Embodiment Construction

[0053] The embodiment of the invention is described below with reference to the accompanying drawings. Chinese text summarization is divided into the following main steps:

[0054] S1 Chinese word segmentation

[0055] Chinese word segmentation refers to dividing a continuous sequence of Chinese characters and other regular characters into individual words, following the way Chinese is understood. In implementation, the jieba word segmentation tool can be used to segment the text. The segmented sentence is shown in Figure 2, where the sentence has been split into individual words.
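To illustrate what segmentation produces, here is a minimal sketch. In practice jieba (for example `jieba.lcut(text)`) would do this work; to stay dependency-free, the sketch swaps in a toy forward-maximum-matching segmenter instead. The function name `forward_max_match`, the `max_len` parameter, and the small `VOCAB` dictionary are assumptions made for illustration and are not taken from the patent.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
VOCAB = {"自动", "文本", "摘要", "方法", "自然语言", "处理"}

tokens = forward_max_match("自动文本摘要方法", VOCAB)
print(tokens)
```

A dictionary-driven greedy matcher like this is far cruder than jieba, which also handles ambiguity and unknown words, but it shows the input/output shape that the later steps rely on: a string goes in, a list of word tokens comes out.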

[0056] S2 Stop word removal

[0057] Ordinary Chinese text usually contains punctuation such as periods, commas, and semicolons. Once word segmentation is complete, these punctuation marks are no longer needed. In addition, sentences contain words that have little bearing on the importance of the sentence, such as 的, 了, not only, but als...
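The filtering described in this step can be sketched as follows. The stop-word list `STOP_WORDS` and the helper names are hypothetical; a real system would typically load a much larger list from a file. Punctuation is detected through Unicode general categories rather than an explicit symbol list, which is one possible design choice and not something the source specifies.

```python
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"的", "了", "不仅", "而且"}


def is_punctuation(token):
    """True if every character in the token is a Unicode punctuation mark."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop whitespace, punctuation, and stop words, preserving token order."""
    return [
        t for t in tokens
        if t.strip() and not is_punctuation(t) and t not in stop_words
    ]


tokens = ["我们", "提出", "了", "新", "的", "方法", "，", "效果", "好", "。"]
print(remove_stop_words(tokens))
```

Checking the Unicode category covers both ASCII and full-width Chinese punctuation without having to enumerate every symbol.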


Abstract

The invention discloses a text summarization method based on TF-IDF, comprising the following steps: performing Chinese word segmentation; removing stop words; computing the TF-IDF of the words; computing the TF-IDF of the sentences; calculating the position features of the sentences; calculating the importance of the sentences; selecting key sentences; and outputting the text summary. The TF-IDF values of the keywords contained in a sentence are used as weights, with different weights given to core keywords and general keywords. To prevent differences in sentence length from affecting the result, a sliding window is introduced: the importance of the highest-scoring sliding window within a sentence is taken as the importance of that sentence. Sentences are then ranked by combining features such as sentence length and sentence position, and good results are achieved on multiple corpora.
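The word-level TF-IDF and the sliding-window sentence score described above might be sketched as follows. The excerpt gives no exact formulas, so several details here are assumptions: the smoothed IDF variant, the default window size of 5, and the function names `tf_idf` and `sentence_importance`. The distinction between core and general keywords is left out because the excerpt does not define it.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """TF-IDF per word of one document against a reference corpus.

    TF  = count in document / document length
    IDF = log((1 + N) / (1 + df)) + 1   (a smoothed variant, assumed here)
    """
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each word once per document
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {
        w: (c / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
        for w, c in counts.items()
    }


def sentence_importance(sent_tokens, weights, window=5):
    """Score a sentence as the highest summed weight of any `window`
    consecutive tokens, so a sentence gains nothing from length alone."""
    scores = [weights.get(t, 0.0) for t in sent_tokens]
    if len(scores) <= window:
        return sum(scores)
    current = sum(scores[:window])
    best = current
    for i in range(window, len(scores)):
        # Slide the window one token to the right.
        current += scores[i] - scores[i - window]
        best = max(best, current)
    return best


# Toy data, for illustration only.
corpus = [["文本", "摘要", "方法"], ["文本", "检索"], ["文本", "分类"]]
doc = ["文本", "摘要", "摘要", "方法"]
weights = tf_idf(doc, corpus)
print(sentence_importance(["文本", "摘要", "方法"], weights, window=2))
```

Taking the maximum over fixed-width windows means a long sentence is judged by its densest stretch of keywords rather than by the sum over all its tokens, which is the length-normalizing effect the abstract attributes to the sliding window.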

Description

Technical field

[0001] The invention belongs to the field of automatic text summarization in natural language processing, and in particular relates to an improved extractive text summarization method.

Background

[0002] There are two mainstream approaches to text summarization: extractive and generative.

[0003] 1 Status quo of extractive text summarization

[0004] The extractive approach evaluates the importance of each sentence in the original text by some method and, according to that importance, selects the one or more sentences closest in meaning to the original text as the summary. Research on extractive summarization methods is relatively mature at this stage. Extractive text summarization assumes that an article can express its meaning through its more important sentences, so the summarization task becomes finding the most important sentences i...
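The extractive idea described above, scoring sentences and keeping the most important ones as the summary, can be sketched as follows. The importance scores are taken as given inputs. The linear `position_weight` is a hypothetical stand-in for the position feature, whose actual form the excerpt does not give, and the function names and the parameter `k` are likewise assumptions.

```python
def position_weight(index, n_sentences):
    """Hypothetical position feature: decays linearly from 1.0 for the first
    sentence to 0.5 for the last (an assumed form, not from the source)."""
    if n_sentences <= 1:
        return 1.0
    return 1.0 - 0.5 * index / (n_sentences - 1)


def select_key_sentences(sentences, importance, k=2):
    """Rank sentences by importance * position weight, keep the top k,
    and return them in their original document order."""
    n = len(sentences)
    scored = [(importance[i] * position_weight(i, n), i) for i in range(n)]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    keep = sorted(i for _, i in top)
    return [sentences[i] for i in keep]


# Toy sentences and scores, for illustration only.
sentences = ["S0", "S1", "S2", "S3"]
importance = [0.2, 0.9, 0.1, 0.8]
print(select_key_sentences(sentences, importance, k=2))
```

Returning the selected sentences in their original order, rather than in score order, keeps the summary readable as a condensed version of the source text.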


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/34, G06F17/27
CPC: G06F40/242, G06F40/205, G06F40/289
Inventor: 张涛, 陈才
Owner: BEIJING UNIV OF TECH