Theme analysis method and system based on kernel principal component analysis and LDA

A topic analysis technique based on kernel principal component analysis and LDA, applied in the field of text mining. It addresses the lack of algorithms that perform well on longer texts and the lack of a global perspective for analyzing the evolution of topic trends, with the stated aims of more comprehensive and accurate analysis, reduced space complexity, and improved quality.

Active Publication Date: 2021-09-03
SHENZHEN GRADUATE SCHOOL TSINGHUA UNIV

AI Technical Summary

Problems solved by technology

Current topic analysis methods mainly target short texts such as Weibo comments, and algorithms that perform well on longer texts are lacking. Existing literature reviews are written from specific perspectives, such as interdisciplinary fields, and lack both analysis from a global perspective and research on the evolution of topic trends.


Examples

Embodiment 1

[0070] As shown in Figure 1, this embodiment provides a topic analysis method based on kernel principal component analysis and LDA, comprising the following steps:

[0071] 1) Obtain the document corpus D and preprocess each article in it. Preprocessing includes deleting punctuation marks, deleting English characters, word segmentation, and removing stop words.
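
The preprocessing in step 1) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the stop-word list and the whitespace-based `segment` function are hypothetical stand-ins, since the source names neither a stop-word list nor a word-segmentation tool.

```python
import re

# Hypothetical stop-word list for illustration; the source does not specify one.
STOP_WORDS = {"的", "了", "和", "是"}


def segment(text):
    """Hypothetical segmenter that splits on whitespace.

    Real Chinese text needs a dedicated word-segmentation tool,
    which the source does not name.
    """
    return text.split()


def preprocess(article, stop_words=STOP_WORDS, segmenter=segment):
    """Apply the four preprocessing operations of step 1) to one article."""
    # Delete punctuation: anything that is not a word character or whitespace
    # (the underscore counts as a word character, so it is removed explicitly).
    text = re.sub(r"[^\w\s]|_", " ", article)
    # Delete English characters (ASCII letters).
    text = re.sub(r"[A-Za-z]+", " ", text)
    # Segment into words, then drop stop words.
    return [tok for tok in segmenter(text) if tok not in stop_words]
```

Passing the segmenter as a parameter keeps the sketch independent of any particular segmentation library; a real tool could be swapped in without changing the rest of the function.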

[0072] 2) According to the preprocessed document corpus D, establish a KPCA-LDA topic model, specifically:

[0073] 2.1) Extract the vocabulary of each article in the preprocessed document corpus D:

[0074] By scanning the document corpus D, each distinct word in the articles is added to the vocabulary in turn, yielding the vocabulary of the article collection, $w_L = (w_1, \ldots, w_j, \ldots, w_W)$, where $W$ is the vocabulary length and $w_j$ is the $j$-th word of the vocabulary $w_L$.
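
A minimal sketch of this vocabulary scan, assuming each article has already been preprocessed into a list of words. The function name `build_vocabulary` and the auxiliary word-to-index map are illustrative choices, not taken from the source.

```python
def build_vocabulary(corpus):
    """Scan tokenized articles and collect each distinct word once,
    in first-seen order.

    corpus: list of articles, each a list of words.
    Returns (vocabulary, index), where index maps a word to its position
    (0-based here; the text counts positions j from 1).
    """
    vocabulary = []
    index = {}
    for article in corpus:
        for word in article:
            if word not in index:
                index[word] = len(vocabulary)
                vocabulary.append(word)
    return vocabulary, index
```

The vocabulary length W is then `len(vocabulary)`, and the index map gives constant-time lookup of a word's position when later building per-document word counts.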

[0075] 2.2) Generate the document-term matrix of the document corpus D:

[0076] 2.2.1) Suppose there are M ...

Embodiment 2

[0139] This embodiment provides a topic analysis system based on kernel principal component analysis and LDA, including:

[0140] The data acquisition module is used to acquire the document corpus and preprocess each article in the document corpus.

[0141] The model construction module is used to establish a KPCA-LDA topic model according to the preprocessed document corpus.

[0142] The text representation determining module uses the established KPCA-LDA topic model to perform topic analysis on the articles in the document corpus and to determine their text representation.

[0143] The topic generation module is used to train and estimate the parameters of the KPCA-LDA topic model by using the Gibbs sampling algorithm, solve the parameters of the KPCA-LDA topic model, and generate several topics represented by words.
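
To illustrate the kind of computation the topic generation module performs, here is a minimal sketch of collapsed Gibbs sampling for plain LDA. It is not the KPCA-LDA model of this patent, whose details are not given in this excerpt. The function names, the hyperparameter defaults (`alpha`, `beta`, `iters`), and the integer word-id encoding are all assumptions made for the example.

```python
import random


def lda_gibbs(docs, num_words, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative, not KPCA-LDA).

    docs: list of documents, each a list of word ids in range(num_words).
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    K, W = num_topics, num_words
    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * W for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total tokens per topic
    z = []                              # topic assignment of every token

    # Random initial assignment of a topic to each token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional of the token's topic, up to a constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + W * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates from the final counts.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + W * beta) for w in range(W)]
           for k in range(K)]
    return theta, phi


def top_words(phi, vocabulary, n=3):
    """Represent each topic by its n most probable words."""
    return [
        [vocabulary[w] for w in sorted(range(len(row)), key=lambda w: -row[w])[:n]]
        for row in phi
    ]
```

Here `theta` plays the role of the document-topic distribution θ and `phi` the topic-word distribution φ; `top_words` turns each row of `phi` into a short word list, which is one simple way to obtain "topics represented by words".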

[0144] In a preferred embodiment, the model construction module includes:

[0145] A vocabulary extraction unit is u...

Embodiment 3

[0149] This embodiment provides a processing device corresponding to the topic analysis method based on kernel principal component analysis and LDA provided in Embodiment 1. The processing device may be a client device, such as a mobile phone, a notebook computer, a tablet computer, or a desktop computer, and carries out the method of Embodiment 1.

[0150] The processing device includes a processor, a memory, a communication interface and a bus, and the processor, the memory and the communication interface are connected through the bus to complete mutual communication. A computer program that can run on the processor is stored in the memory, and the processor executes the topic analysis method based on kernel principal component analysis and LDA provided in Embodiment 1 when running the computer program.

[0151] In some implementations, the memory may be a high-speed random access memory (RAM: Random Access Memory), and may also include a non-volatile memory (non...

Abstract

The invention relates to a topic analysis method and system based on kernel principal component analysis and LDA. The method comprises the following steps: 1) obtaining a document corpus and preprocessing each article in it; 2) establishing a KPCA-LDA topic model from the preprocessed document corpus; 3) using the established KPCA-LDA topic model to perform topic analysis on the articles in the document corpus and determine their text representation; and 4) training the KPCA-LDA topic model and estimating its parameters with the Gibbs sampling algorithm, solving for the model parameters, and generating several topics represented by words. The method and system can be widely applied in the field of text mining.

Description

Technical field

[0001] The invention relates to a topic analysis method and system based on kernel principal component analysis and LDA, and belongs to the field of text mining.

Background

[0002] A relatively mature body of methods now exists for mining research topics and their evolution from scientific literature. The main research methods fall roughly into four categories: word frequency analysis, co-word analysis, citation analysis, and text mining. With the rapid development of natural language processing and the rapid growth of text data, the topic model, as an efficient tool for analyzing text data, has gradually become one of the core methods in text mining. By extracting topics from a scientific literature corpus, researchers obtained two probability distributions, the topic-word multinomial distribution φ and the document-topic multinomial distribution θ, and proposed a generative probabilistic topic model, namely LDA (Latent Dirichlet Allocation,...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06F40/216, G06N3/04, G06N3/08
CPC: G06F40/216, G06N3/08, G06N3/045, G06F18/2135
Inventor: 李秀, 许菁, 王梦凯
Owner: SHENZHEN GRADUATE SCHOOL TSINGHUA UNIV