News automatic labeling method based on LDA model

An automatic labeling technique applied in the field of text processing. It addresses the problems that existing methods are not comprehensive enough, do not consider context, and do not represent the original information of the data well, with the aim of describing the correlation between topic words better and improving labeling accuracy.

Pending Publication Date: 2019-10-18
TAIYUAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0003] Commonly used methods for automatic keyword labeling include statistical methods such as TF-IDF, which are fast and simple but rely only on word frequency; they are not comprehensive enough and ignore semantic information. On the semantic side there are topic-based methods such as the LDA model, which is very effective for extracting semantic information and reducing the dimensionality of the feature space. There is also TextRank, which requires no training data and is fast, but it ignores semantic correlation and does not consider the relationship between contexts.
Although the LDA model is widely used, it still has shortcomings: it assigns topic labels to all terms, which does not represent the original information of the data well.
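
To make the "word frequency only" limitation concrete, here is a minimal TF-IDF sketch in plain Python. The function name `tfidf` and the toy documents are illustrative assumptions, not part of the patent; the weighting used is the common tf × log(N/df) variant.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Toy corpus (hypothetical data).
docs = [
    ["market", "stocks", "fall", "market"],
    ["team", "wins", "match"],
    ["market", "team", "report"],
]
scores = tfidf(docs)
```

Every score here is a function of counts alone, so two terms with related meanings are treated as unrelated unless they happen to share the same counts. That is the gap the topic-based methods aim to close.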

Examples

Embodiment Construction

[0035] In order to have a clearer understanding of the technical features, purposes and effects of the present invention, the specific implementation manners of the present invention will now be described in detail with reference to the accompanying drawings.

[0036] As shown in Figure 1, which is a schematic diagram of the algorithm of the LDA model-based automatic news labeling method provided by the present invention, the steps of the method include:

[0037] Preprocess the text that needs to be automatically tagged; the preprocessing method includes at least Chinese word segmentation and stop word removal.

[0038] Specifically, several stop-word lists, such as the "HIT Stop Words Thesaurus", the "Baidu Stop Words List", and the "Sichuan University Machine Learning Intelligence Laboratory Stop Words List", are first compiled; the text is then segmented with the jieba ("stutter") word segmenter to obtain the "text-term" matrix.
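
The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical stand-in for the merged stop-word lists, and `str.split` stands in for the segmenter so that the example runs without third-party packages. For Chinese text, a real segmenter (for example jieba's `lcut`) could be passed as the `segment` argument instead.

```python
from collections import Counter

# Hypothetical stop-word set; the source merges several published lists.
STOP_WORDS = {"the", "a", "of", "in", "on"}

def preprocess(texts, segment=str.split, stop_words=STOP_WORDS):
    """Segment each text, drop stop words, and build a text-term count matrix."""
    docs = [
        [tok for tok in segment(text.lower()) if tok not in stop_words]
        for text in texts
    ]
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {term: j for j, term in enumerate(vocab)}
    # One row per text, one column per vocabulary term.
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            matrix[i][index[term]] = count
    return docs, vocab, matrix

# Toy input (hypothetical data).
texts = ["The price of oil rose", "Oil exports rose in the quarter"]
docs, vocab, matrix = preprocess(texts)
```

Each row of `matrix` is the term-count vector of one text, which is the form of input a topic model consumes.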

[0039] The preprocessed text is modeled using the LDA model,...
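
The excerpt is cut off before the modeling details, so the following is only a generic sketch of how an LDA model can be fitted to tokenized documents, using collapsed Gibbs sampling in plain Python. The function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are assumptions for illustration and are not taken from the patent.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (doc_topic, topic_term, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_terms = len(vocab)
    n_dk = [[0] * n_topics for _ in docs]            # topic counts per document
    n_kw = [[0] * n_terms for _ in range(n_topics)]  # term counts per topic
    n_k = [0] * n_topics                             # total tokens per topic
    z = []                                           # topic assigned to each token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)              # random initial assignment
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + n_terms * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    # Smoothed estimates of the document-topic and topic-term distributions.
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_term = [
        [(n_kw[t][v] + beta) / (n_k[t] + n_terms * beta) for v in range(n_terms)]
        for t in range(n_topics)
    ]
    return doc_topic, topic_term, vocab

# Toy corpus (hypothetical data).
docs = [["oil", "price", "oil"], ["match", "team", "team"], ["oil", "export"], ["team", "win"]]
doc_topic, topic_term, vocab = lda_gibbs(docs, n_topics=2)
```

`topic_term` corresponds to the topic-term matrix that the abstract says the method improves; each of its rows is a probability distribution over the vocabulary.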

Abstract

The invention relates to an automatic news labeling method based on an LDA model. The method extracts text data features at the semantic level, which gives a better effect in practical applications. It proposes improvements to the LDA model: pointwise mutual information is used to quantify the relationship between topic words, the co-occurrence relationship between topic words is obtained by calculating their weights, and a threshold is set to select the optimal topic words. Keywords with high accuracy are selected according to how strongly the vocabulary corresponding to each topic represents it, and mutual information is introduced to improve the topic-term matrix, so that the accuracy of the LDA model in automatic labeling of news documents is improved and the correlation between topic words is better described.
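
The abstract does not give the exact weighting formula, so the following is a hedged sketch of one way pointwise mutual information (PMI) and a threshold could be combined to filter candidate topic words. Document-level co-occurrence, the mean-PMI weight, the zero default for pairs that never co-occur, and the names `pmi_table` and `select_terms` are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pmi_table(docs):
    """Document-level PMI for every pair of terms that co-occur at least once."""
    n = len(docs)
    df = {}   # number of documents containing each term
    co = {}   # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc))
        for w in terms:
            df[w] = df.get(w, 0) + 1
        for a, b in combinations(terms, 2):
            co[(a, b)] = co.get((a, b), 0) + 1
    # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ) with probabilities as document fractions.
    return {
        (a, b): math.log(c * n / (df[a] * df[b]))
        for (a, b), c in co.items()
    }

def select_terms(candidates, pmi, threshold):
    """Keep candidates whose mean PMI with the other candidates reaches the threshold."""
    kept = []
    for w in candidates:
        others = [c for c in candidates if c != w]
        # Pairs that never co-occur contribute 0 (a simplifying assumption).
        total = sum(pmi.get(tuple(sorted((w, c))), 0.0) for c in others)
        if total / max(len(others), 1) >= threshold:
            kept.append(w)
    return kept

# Toy corpus and candidate topic words (hypothetical data).
docs = [["oil", "price", "export"], ["oil", "price"], ["team", "match"], ["team", "oil"]]
pmi = pmi_table(docs)
selected = select_terms(["oil", "price", "export", "team"], pmi, threshold=0.0)
```

In this toy run "team" is dropped because it co-occurs with the other candidates less often than chance would predict, while "oil", "price", and "export" are kept.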

Description

Technical field

[0001] The invention relates to the technical field of text processing, and more specifically to an LDA model-based automatic labeling method for news.

Background technique

[0002] With the development of information networks has come information overload: the volume of news text has exploded, and most of the texts are long. If readers can get a general idea of what an article is about before reading it carefully, they can save time, quickly find the news content they care about, and choose which piece of news to read carefully. The task of automatic news labeling is to characterize the text content and then filter out useful information. How to extract the information a text is meant to express more accurately is an important topic of current research, and it has been widely applied to natural language processing tasks such as text classification, clustering, news recommendation, machine translation, and paper indexing. The LDA topic model is a co...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/21, G06F17/27
CPC: G06F40/117, G06F40/258, G06F40/284
Inventor: 谢珺, 郝晓燕, 梁凤梅, 续欣莹, 靳红伟
Owner: TAIYUAN UNIV OF TECH