Advertising blog post recognition method based on stacked denoising autoencoder

A recognition method based on autoencoder technology, applied in natural language data processing, unstructured text data retrieval, and text database clustering/classification, that addresses problems such as feature redundancy

Active Publication Date: 2021-01-05
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

Existing methods do not perform feature selection when building the model, so the features they use are redundant to some degree.



Examples


Embodiment 1

[0057] Embodiment 1: As shown in Figure 1, the specific steps of the advertising blog post recognition method based on the stacked denoising autoencoder are as follows:

[0058] Step1. First, crawl Weibo to collect a corpus, obtain the training set and test set by manually labeling the corpus, and then preprocess the corpus;

[0059] Step2. Construct microblog text feature vectors to represent blog posts, then feed the feature vectors into a maximum entropy classifier for training and modeling, and obtain an advertising blog post recognition model based on microblog text feature vectors;

[0060] Step3. Construct manually defined feature vectors to represent blog posts, then feed them into a maximum entropy classifier for training and modeling, and obtain an advertising blog post recognition model based on manually defined feature vectors;

[0061] Step4. Construct the combin...
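Steps 2 and 3 both train a maximum entropy classifier on feature vectors. For two classes, maximum entropy classification is equivalent to logistic regression, so the idea can be sketched as below. This is a minimal illustration, not the patent's implementation: the function names, learning rate, epoch count, and toy feature vectors are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_maxent(X, y, lr=0.5, epochs=500):
    """Fit a binary maximum entropy (logistic regression) model by
    batch gradient descent. Hypothetical helper, not from the source."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 (advertisement) or 0 (ordinary post)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else 0


# Toy feature vectors (assumed): label 1 = advertising post.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y = [1, 1, 0, 0]
w, b = train_maxent(X, y)
```

In practice the same training routine would be run once per feature representation, and the resulting models compared on the test set.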

Embodiment 2

[0063] Embodiment 2: As shown in Figures 1-2, this embodiment of the advertising blog post recognition method based on the stacked denoising autoencoder is the same as Embodiment 1, wherein:

[0064] As a preferred solution of the present invention, the specific sub-steps of Step1 are:

[0065] Step1.1. First, write a crawler program and crawl Weibo to obtain a Weibo corpus;

[0066] Step1.2. Filter and deduplicate the crawled Weibo corpus to obtain a corpus without repeated posts, and store it in the database;

[0067] The crawled microblog corpus may contain repeated blog posts, which increase the workload without adding information. They are therefore filtered out and deduplicated, and the resulting non-repetitive corpus is stored in the database to facilitate data management and use.

[0068] Step1.3. Manually mark the corpus in the database to obtain the training set and test...
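The deduplicate-and-store idea of Step1.2 can be sketched as follows. This is an illustration under stated assumptions: the source does not name a database or a duplicate criterion, so the sketch uses Python's built-in `sqlite3` module, a hypothetical `posts` table, and treats two posts as duplicates when they match after whitespace normalization.

```python
import sqlite3


def normalize(text):
    """Collapse runs of whitespace so trivially different copies match."""
    return " ".join(text.split())


def store_unique_posts(conn, posts):
    """Insert posts into a table keyed on the text itself, so repeated
    posts are silently skipped. Table and column names are assumptions."""
    conn.execute("CREATE TABLE IF NOT EXISTS posts (text TEXT PRIMARY KEY)")
    for post in posts:
        cleaned = normalize(post)
        if cleaned:  # drop empty posts
            conn.execute(
                "INSERT OR IGNORE INTO posts (text) VALUES (?)", (cleaned,)
            )
    conn.commit()
    rows = conn.execute("SELECT text FROM posts ORDER BY rowid")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
unique_posts = store_unique_posts(
    conn, ["Buy now!", "Buy  now!", "", "Nice weather today"]
)
```

Letting the primary key enforce uniqueness keeps deduplication and storage in one step; a real crawler would likely also key on a post identifier.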

Embodiment 3

[0070] Embodiment 3: As shown in Figures 1-2, this embodiment of the advertising blog post recognition method based on the stacked denoising autoencoder is the same as Embodiment 2, wherein:

[0071] As a preferred solution of the present invention, the specific sub-steps of Step2 are:

[0072] Step2.1. First, use word2vec to process the microblog text to obtain the text vector of the microblog;

[0073] Sina Weibo raised the character limit for posts from the original 140 characters to 2000 characters, so the set of feature words in a post grows accordingly; it contains many synonyms and is strongly context dependent. To avoid feature word redundancy, the invention first processes the text with word2vec, converting each word in the text into a vector representation by drawing on word2vec's strength in representing semantic information, and then accumulates the corresp...
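The accumulation of word vectors into one text vector can be sketched as below. The source text is cut off, so this shows only a plain element-wise sum; whether the original method also averages or normalizes is not known. The `word_vectors` dictionary is a hypothetical stand-in for a trained word2vec model, and skipping out-of-vocabulary tokens is an assumption.

```python
def text_vector(tokens, word_vectors, dim):
    """Sum the vectors of all known tokens in a post.

    `word_vectors` maps a token to a list of `dim` floats and stands in
    for a trained word2vec model (assumption). Unknown tokens are skipped.
    """
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Toy 2-dimensional vectors (assumed values, for illustration only).
toy_vectors = {"sale": [1.0, 0.0], "cheap": [0.5, 0.5]}
post_vector = text_vector(["sale", "cheap", "unknown"], toy_vectors, 2)
```

Because every post maps to a vector of the same dimension regardless of its length, the result can be fed directly to a classifier.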



Abstract

The invention relates to a method for identifying advertising Weibo posts based on a stacked denoising autoencoder, and belongs to the technical field of natural language processing. The method comprises the following steps: first, Weibo data is crawled, and a training corpus and a test corpus are obtained through manual annotation; second, advertising Weibo posts are analyzed to construct a text feature vector representation and a manually defined feature vector representation of each post, the stacked denoising autoencoder performs feature selection on the two feature vectors to obtain two processed feature vectors, and each processed feature vector is fed into a maximum entropy classifier to obtain, separately, an optimal advertisement identification model based on the text feature vector and one based on the manually defined feature vector; third, the feature vectors of the two optimal models are combined to obtain a combined feature vector, from which an advertisement identification model based on the combined feature vector is obtained; finally, the model with the best classification effect is selected to identify advertising Weibo posts. The method addresses feature redundancy, improves the model's recognition rate, and lowers the difficulty of application.
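A stacked denoising autoencoder of the kind named in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not give layer sizes, noise type, loss, or training schedule, so masking noise, sigmoid units, squared-error loss, per-sample gradient steps, and every name and hyperparameter here are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class DenoisingAutoencoder:
    """One layer: corrupt the input, encode it, reconstruct the clean input."""

    def __init__(self, n_in, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden
        self.V = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_in)]
        self.c = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(self.W, self.b)]

    def decode(self, h):
        return [sigmoid(sum(v * hj for v, hj in zip(row, h)) + ci)
                for row, ci in zip(self.V, self.c)]

    def train_step(self, x, noise=0.3, lr=0.5):
        # Masking noise: zero each input component with probability `noise`.
        corrupted = [0.0 if self.rng.random() < noise else xi for xi in x]
        h = self.encode(corrupted)
        r = self.decode(h)
        # Backpropagate 0.5 * ||r - x||^2 against the CLEAN input x.
        d_out = [(ri - xi) * ri * (1.0 - ri) for ri, xi in zip(r, x)]
        d_hid = [hj * (1.0 - hj) *
                 sum(d_out[i] * self.V[i][j] for i in range(len(x)))
                 for j, hj in enumerate(h)]
        for i, di in enumerate(d_out):
            for j, hj in enumerate(h):
                self.V[i][j] -= lr * di * hj
            self.c[i] -= lr * di
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(corrupted):
                self.W[j][i] -= lr * dj * xi
            self.b[j] -= lr * dj


def train_stack(data, layer_sizes, epochs=30, seed=0):
    """Greedy layer-wise training: each layer learns on the codes
    produced by the layer below it."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        dae = DenoisingAutoencoder(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for x in inputs:
                dae.train_step(x)
        layers.append(dae)
        inputs = [dae.encode(x) for x in inputs]
    return layers


def transform(layers, x):
    """Map a raw feature vector to its compressed representation."""
    for dae in layers:
        x = dae.encode(x)
    return x
```

For example, `transform(train_stack(data, [4, 2]), data[0])` reduces a feature vector to two components; the compressed vectors would then be passed to the maximum entropy classifier in place of the raw features.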

Description

Technical field

[0001] The invention relates to an advertising blog post recognition method based on a stacked denoising autoencoder, and belongs to the technical field of natural language processing and microblog advertisement recognition.

Background

[0002] Advertising blog posts are written by professionals, with scattered content and varied forms, so they are difficult to identify and remove with simple methods such as statistical screening. Advertising blog posts not only affect user experience but also adversely affect Weibo-based research (such as public opinion analysis, opinion leader mining, and topic discovery). At present, the main methods for removing advertising blog posts, in China and abroad, are the following. One approach analyzes advertising blog posts to determine their characteristic features, sums the value of each feature, and sets a threshold to filter out advertising blog posts. Using text data as f...
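The threshold-based prior approach mentioned in the background (sum the feature values, then compare against a threshold) can be sketched as below. The feature names and the threshold value are hypothetical; the source does not list the features that such methods use.

```python
def ad_score(features):
    """Sum the values of all features extracted from one post."""
    return sum(features.values())


def is_advertisement(features, threshold):
    """Flag a post as advertising when its summed score reaches the threshold."""
    return ad_score(features) >= threshold


# Hypothetical binary features for two posts.
promo_post = {"has_url": 1, "has_price": 1, "has_contact": 0}
plain_post = {"has_url": 0, "has_price": 0, "has_contact": 0}
```

With a threshold of 2, `promo_post` is flagged and `plain_post` is not. Every feature counts equally in this scheme, which is one reason redundant features can distort the score.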


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/126, G06F40/30, G06F40/289, G06F16/35
CPC: G06F16/951, G06F40/279
Inventors: 黄青松, 李帅彬, 栾杰, 郎冬冬, 郭勃, 刘骊, 付晓东, 宋莉娜
Owner: KUNMING UNIV OF SCI & TECH