Article topic keyword extraction method and apparatus based on low-rank matrix decomposition

A low-rank matrix decomposition and keyword extraction technology, applied in the field of article topic keyword extraction, which can solve problems such as heavy workload.

Inactive Publication Date: 2016-08-31
BEIJING JIAOTONG UNIV +1

AI Technical Summary

Problems solved by technology

There is some content related to pornography, horror

Examples

Embodiment 1

[0069] This embodiment of the present invention provides a method for extracting article topic keywords based on low-rank matrix decomposition. As shown in Figure 1, the method includes the following steps:

[0070] Step S110: Preprocess the text of the article to be processed by cleaning it, segmenting it into words, and removing stop words, so as to obtain text that is convenient for the subsequent extraction of event keywords. The articles may be news, microblogs, blogs, comments, etc.

[0071] In the text preprocessing stage, the present invention mainly performs the following steps: remove URL links, emoticons, and invalid characters from the article text; since there are no spaces between Chinese words, segment the text into words before keyword extraction (the present invention uses HanLP, an open-source natural language processing toolkit that performs well, to carry out word segmentation); then remove stop words...
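
The preprocessing described in step S110 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the regular expressions, the stop-word list, and the function names `clean`, `whitespace_segment`, and `preprocess` are all assumptions made for this example. The source uses HanLP for segmentation; here a whitespace splitter stands in for it, and a real segmenter would be passed in through the hypothetical `segment` parameter.

```python
import re

# Hypothetical patterns and stop-word list, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")  # bracketed emoticon tags such as "[笑]"
INVALID_RE = re.compile(r"[^\w\s]")            # punctuation and other symbols
STOP_WORDS = {"the", "a", "of", "is", "的", "了", "是"}


def clean(text):
    """Remove URL links, emoticon tags, and invalid characters."""
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = INVALID_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def whitespace_segment(text):
    """Stand-in segmenter; a real pipeline would call a Chinese segmenter here."""
    return text.split()


def preprocess(text, segment=whitespace_segment, stop_words=STOP_WORDS):
    """Clean the text, segment it into words, and drop stop words."""
    tokens = segment(clean(text))
    return [t for t in tokens if t.lower() not in stop_words]


tokens = preprocess("Check the news http://t.cn/abc [笑] low-rank matrix!!")
```

Passing the segmenter as a parameter keeps the cleaning and stop-word logic independent of whichever segmentation toolkit is actually used.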

Embodiment 2

[0097] This embodiment provides a device for extracting article topic keywords based on low-rank matrix decomposition. The specific structure of the device, as shown in Figure 3, includes:

[0098] The data preprocessing module 31 is configured to perform data preprocessing on the article text to be processed, before that text is used to train the tool that represents words as real-valued vectors; the data preprocessing includes cleaning, word segmentation, and removal of stop words.

[0099] The word vectorization file generation module 32 is configured to train the tool that represents words as real-valued vectors on the preprocessed article text, so as to obtain a word vectorization file; the word vectorization file includes a plurality of word vectors, covering both keywords and non-keywords;

[0100] The keyword matrix building module 33 is used to use the keyword extraction algorithm ...
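
According to the abstract, keywords are extracted with an algorithm based on a text graph model and then looked up in the word vectorization file to build a keyword matrix. The sketch below illustrates that idea under stated assumptions: it uses a TextRank-style ranking over a word co-occurrence graph as one possible graph-based algorithm (the source does not name the exact algorithm in this excerpt), and it models the word vectorization file as a plain dictionary. The names `rank_words` and `build_keyword_matrix`, and the parameters `window`, `damping`, and `iters`, are hypothetical.

```python
import numpy as np


def rank_words(tokens, window=2, damping=0.85, iters=50):
    """TextRank-style scores from an undirected word co-occurrence graph."""
    if not tokens:
        return {}
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    adj = np.zeros((n, n))
    # Connect each word to the distinct words in the next `window` positions.
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                adj[index[w], index[v]] += 1.0
                adj[index[v], index[w]] += 1.0
    degree = adj.sum(axis=0)
    degree[degree == 0] = 1.0        # isolated words keep only the base score
    transition = adj / degree        # each column sums to 1 for connected words
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):           # power iteration of the ranking update
        scores = (1 - damping) / n + damping * (transition @ scores)
    return dict(zip(vocab, scores))


def build_keyword_matrix(keywords, vectors):
    """Stack the vectors of the keywords found in the table, one row per keyword."""
    found = [w for w in keywords if w in vectors]
    return found, np.array([vectors[w] for w in found])


scores = rank_words(["matrix", "low", "rank", "matrix", "keyword", "matrix"])
top = sorted(scores, key=scores.get, reverse=True)[:2]
# Toy two-dimensional vectors standing in for a trained word vectorization file.
found, K = build_keyword_matrix(top, {"matrix": [1.0, 0.0], "rank": [0.0, 1.0]})
```

Each row of `K` corresponds to one extracted keyword, so the matrix can be handed to a low-rank decomposition step as a whole.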

Abstract

Embodiments of the present invention provide an article topic keyword extraction method and apparatus based on low-rank matrix decomposition. The method mainly comprises: training on the preprocessed article text with a tool that represents words as real-valued vectors, to obtain a word vectorization file; extracting the keywords of each event of a specific topic in the preprocessed article text with a keyword extraction algorithm based on a text graph model; querying the word vectorization file with the extracted keywords to build a keyword matrix for the specific topic; solving the low-rank decomposition problem of the keyword matrix with an augmented Lagrange multiplier algorithm to obtain a keyword low-rank matrix; and finally generating the keywords of the specific topic in the preprocessed article text. Because the keywords of article topics in microblogs are generated with the low-rank matrix decomposition method, the sparsity problem of article topic keywords in microblogs is effectively solved, and interference from non-keyword data noise is greatly reduced.
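
The low-rank decomposition step can be illustrated with a short sketch. The excerpt does not give the exact optimization problem, so this example assumes the common robust PCA formulation, which splits a matrix M into a low-rank part L and a sparse part S by minimizing ||L||_* + lam * ||S||_1 subject to M = L + S, solved with an inexact augmented Lagrange multiplier iteration. The function names `shrink` and `low_rank_alm`, the default weight `lam`, and the step-size constants are assumptions for illustration, not values from the patent.

```python
import numpy as np


def shrink(x, tau):
    """Soft thresholding: move every entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def low_rank_alm(M, lam=None, tol=1e-7, max_iter=500):
    """Split M into low-rank L and sparse S with an inexact ALM iteration."""
    M = np.asarray(M, dtype=float)
    rows, cols = M.shape
    norm_M = np.linalg.norm(M, "fro")
    if norm_M == 0:
        return M.copy(), np.zeros_like(M)
    if lam is None:
        lam = 1.0 / np.sqrt(max(rows, cols))   # commonly used default weight
    spectral = np.linalg.norm(M, 2)
    Y = M / max(spectral, np.abs(M).max() / lam)  # Lagrange multiplier estimate
    mu, rho = 1.25 / spectral, 1.5                # penalty and its growth factor
    mu_max = mu * 1e7
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value thresholding of M - S + Y / mu.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt
        # S-step: elementwise soft thresholding of the remainder.
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual                     # multiplier update
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(residual, "fro") <= tol * norm_M:
            break
    return L, S


# Synthetic check: a rank-2 matrix with a few large corrupted entries.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
S0 = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05
S0[mask] = rng.choice([-5.0, 5.0], size=int(mask.sum()))
L, S = low_rank_alm(L0 + S0)
```

In the setting of the abstract, `M` would be the keyword matrix and `L` the keyword low-rank matrix, with `S` absorbing the noise contributed by non-keyword data; how the final keywords are read off from `L` is not covered by this sketch.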

Description

Technical field

[0001] The invention relates to the technical field of article keyword extraction, in particular to a method and device for extracting article topic keywords based on low-rank matrix decomposition.

Background

[0002] Now that we have entered the era of Web 3.0, information is growing exponentially, and improving the efficiency of information access has become an increasingly important issue. To organize, compress, and retrieve massive amounts of information effectively, people urgently hope to summarize or index it well with a few words. Emerging media represented by Weibo have become an important channel for people to communicate and share. A keyword extraction system is of great significance for quickly finding topics that users are interested in and for supervising the content of topics.

[0003] Compared with traditional news texts, microblog texts have fewer words, and there are more types of microblog topics, and the ...

Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F16/3335, G06F16/3344, G06F40/30
Inventor: 郎丛妍, 何伟明, 于兆鹏, 冯松鹤, 王涛, 杜雪涛, 张晨
Owner BEIJING JIAOTONG UNIV