Automatic microblog text abstracting method based on unsupervised key bigram extraction

A technology for automatic summarization based on bigrams (two-word units), applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as low summarization accuracy and poor robustness to noise.

Active Publication Date: 2014-12-17
INST OF AUTOMATION CHINESE ACAD OF SCI
Cites: 2; Cited by: 54

AI Technical Summary

Problems solved by technology

[0003] To overcome the shortcoming that existing automatic summarization methods for microblog text are not robust to noise, which lowers the accuracy of the extracted summaries, the present invention provides a microblog text automatic summ...



Examples


Example 1

[0061] Before preprocessing:

[0062] TG Shuge: After the heavy rain in Beijing, there is only one kind of weather. . . . sun exposure. . . sun exposure. . . sun exposure. . . No deadline. . . . Crazy_Crazy second stuff is nothing more than that zm I am here: http://t.cn/zj5UkoJ

[0063] After sentence segmentation:

[0064] After the heavy rain in Beijing, there is only one kind of weather. sun exposure. No deadline.

[0065] After word segmentation and stop-word removal:

[0066] After the torrential rain in Beijing, there is no end to a kind of weather exposure
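The preprocessing steps shown in this example can be approximated with a short sketch. It is not the patent's implementation: the user-name pattern, the sentence terminators, and the `STOP_WORDS` list are all assumptions made for illustration, and the sketch handles only part of the noise in the sample post (links, the leading user name, repeated punctuation, and repeated sentences).

```python
import re

# Hypothetical stop-word list for illustration only; the source does not give one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "there", "only", "one", "after", "no"}


def split_sentences(post):
    """Strip links and the leading user name, then return unique sentences in order."""
    text = re.sub(r"https?://\S+", " ", post)       # drop links first
    text = re.sub(r"^[^:]{1,30}:\s*", "", text)     # assumed "user name: text" layout
    parts = re.split(r"[.!?。！？]+", text)          # repeated terminators collapse here
    seen, sentences = set(), []
    for part in parts:
        sentence = " ".join(part.split())           # normalize whitespace
        key = sentence.lower()
        if sentence and key not in seen:            # skip empties and repeats
            seen.add(key)
            sentences.append(sentence)
    return sentences


def tokenize(sentence):
    """Lowercase, split into words, and remove stop words."""
    words = re.findall(r"[A-Za-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]
```

Applied to a trimmed version of the post above, `split_sentences` keeps one copy of the repeated "sun exposure" sentence, which matches the segmented output in [0064]. For Chinese text, `tokenize` would need a real word segmenter in place of the regular expression.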

Example 2

[0068] Before preprocessing:

[0069] Muyi nj: 【Xiao Jingteng, can you not come during the college entrance examination?】 On June 7th, Xiao Jingteng, who is known as the "Rain God", appeared at the Beijing Airport, and the capital, which had little rain, also had a heavy rain. And today also coincides with the first day of the college entrance examination, so some netizens ridiculed: "The Rain God really deserves his reputation! But, can you not come during the college entrance examination?"

[0070] After sentence segmentation:

[0071] 1: Xiao Jingteng, can you not come during the college entrance examination?

[0072] 2: On June 7th, Xiao Jingteng, who is known as the "Rain God", appeared at the Beijing airport, and the capital, which had very little rain, also had a heavy rain.

[0073] 3: Today also coincides with the first day of the college entrance examination, so some netizens ridiculed: "The God of Rain really lives up to its reputation! But, can you not come ...
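The segmentation in this example treats the bracketed headline as its own sentence and does not break inside a quotation. A minimal sketch of that behavior follows. It assumes the headline is wrapped in full-width brackets 【…】, that quotations use straight double quotes, and that the user name has already been removed; `segment_post` is a hypothetical helper, not a function named in the source.

```python
import re

TERMINATORS = ".!?"


def segment_post(text):
    """Return the bracketed headline (if any) followed by the body sentences."""
    sentences = []
    headline = re.match(r"\s*【(.*?)】", text)
    if headline:
        sentences.append(headline.group(1).strip())
        text = text[headline.end():]

    current, in_quote = [], False
    for ch in text:
        current.append(ch)
        if ch == '"':
            in_quote = not in_quote
            # A closing quote ends the sentence only if the quoted text ended one.
            ends = (not in_quote) and len(current) > 1 and current[-2] in TERMINATORS
        else:
            ends = (not in_quote) and ch in TERMINATORS
        if ends:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []

    tail = "".join(current).strip()   # text left over without a terminator
    if tail:
        sentences.append(tail)
    return sentences
```

This simple scan also splits on decimal points and abbreviations, so a production segmenter would need extra rules. It also keeps the leading "And" that the output above drops.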



Abstract

The invention discloses an automatic microblog text summarization method based on unsupervised key bigram extraction. The method comprises the steps of: preprocessing the microblog posts; normalizing bigrams; extracting key bigrams based on hybrid TF-IDF (term frequency-inverse document frequency), TextRank, and LDA (latent Dirichlet allocation); ranking sentences based on intersection similarity and a mutual information strategy; extracting summary sentences based on a similarity threshold; and generating the summary by combining the summary sentences appropriately. The method uses the bigram as the minimum lexical unit. Because a bigram carries richer textual information than a single word, sentence extraction based on key bigrams is more robust to noise and more accurate than extraction based on keywords. In addition, a similarity threshold is introduced during summary sentence extraction to control redundancy, which gives the summary a higher recall rate. The generated summaries are accurate, concise, and comprehensive; they markedly improve the efficiency and quality with which users acquire knowledge and save users considerable time.
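The pipeline described in the abstract can be illustrated with a simplified sketch. It is an approximation under stated assumptions, not the patented method: bigram weights use a smoothed TF-IDF-style score only (the TextRank and LDA components and the mutual information strategy are omitted), redundancy is measured with Jaccard similarity over key bigram sets as a stand-in for the "intersection similarity" that this excerpt does not define, and all function names and default parameters are hypothetical.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, the basic unit used instead of single words."""
    return list(zip(tokens, tokens[1:]))


def score_bigrams(docs):
    """TF-IDF-style weight per bigram; each doc is the token list of one post."""
    doc_bigrams = [bigrams(d) for d in docs]
    tf = Counter(b for bs in doc_bigrams for b in bs)        # frequency over the collection
    df = Counter(b for bs in doc_bigrams for b in set(bs))   # number of posts containing b
    total = sum(tf.values())
    n_docs = len(docs)
    # (1 + n_docs) keeps the weight positive even when a bigram occurs in every post.
    return {b: (tf[b] / total) * math.log((1 + n_docs) / df[b]) for b in tf}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize(docs, k=10, max_sentences=2, sim_threshold=0.5):
    """Return indices of selected posts, best first, skipping near-duplicates."""
    weights = score_bigrams(docs)
    keys = set(sorted(weights, key=weights.get, reverse=True)[:k])  # top-k key bigrams
    scored = []
    for idx, tokens in enumerate(docs):
        present = set(bigrams(tokens)) & keys
        scored.append((sum(weights[b] for b in present), idx, present))
    scored.sort(key=lambda item: (-item[0], item[1]))

    chosen = []
    for score, idx, present in scored:
        if len(chosen) >= max_sentences:
            break
        if score == 0:
            continue
        # Keep a candidate only if it is not too similar to anything already chosen.
        if all(jaccard(present, prev) <= sim_threshold for _, prev in chosen):
            chosen.append((idx, present))
    return [idx for idx, _ in chosen]
```

In this sketch, a post that repeats an already selected post has similarity 1.0 and is skipped, so the second summary slot goes to a post covering different content. This is how a similarity threshold limits redundancy.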

Description

Technical field

[0001] The invention relates to a method for automatically summarizing short texts in social media such as microblogs, and in particular to a method for automatically summarizing microblog texts based on unsupervised key bigram extraction.

Background

[0002] At present, there are few methods for automatically summarizing the large volume of microblog text generated by social media platforms such as Twitter and Sina Weibo. Most existing summarization methods based on microblog text features score or rank sentences directly with the bag-of-words model, and then extract the top-ranked sentences and combine them into the output summary (see, for example, Inouye, D., Kalita, J.K. "Comparing twitter summarization algorithms for multiple post summaries", Social Computing, 2011, 298-306). However, microblog posts are extremely non-standard, conversation-like short texts, so this approach easily introduces a lot of noise, resulting in a lo...


Application Information

IPC(8): G06F17/27, G06F17/30
Inventor 徐博, 吴玉芳, 张恒, 郝红卫, 刘成林
Owner INST OF AUTOMATION CHINESE ACAD OF SCI