Advertisement Weibo post identification method based on a stacked denoising autoencoder

A recognition method using autoencoder technology, applicable to natural language data processing, special data processing applications, and network data retrieval, that addresses problems such as feature redundancy.

Active Publication Date: 2018-02-09
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

However, none of the existing methods performs feature selection when building the model, so the features used are redundant to some degree.

Examples

Embodiment 1

[0057] Example 1: As shown in Figure 1, the specific steps of the advertising blog post recognition method based on the stacked denoising autoencoder are as follows:

[0058] Step1. Crawl the Weibo corpus, obtain the training set and test set by manually labeling the corpus, and then preprocess the corpus;

[0059] Step2. Construct a Weibo text feature vector to represent each blog post, then train a maximum entropy classifier on these vectors to obtain an advertising blog post recognition model based on the Weibo text feature vector;

[0060] Step3. Construct a manually defined feature vector to represent each blog post, then train a maximum entropy classifier on these vectors to obtain an advertising blog post recognition model based on the manually defined feature vector;

[0061] Step4. Construct a...
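Steps 2 and 3 both train a maximum entropy classifier on a feature vector. For two classes (advertising post or not), a maximum entropy model is equivalent to logistic regression, so the training step can be sketched as below. This is a minimal illustration, not the patent's implementation: the function names, the toy two-dimensional features, the learning rate, and the epoch count are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, labels, epochs=200, lr=0.5):
    """Fit a binary maximum entropy (logistic regression) model with SGD.

    samples: list of equal-length feature vectors (lists of floats)
    labels:  list of 0/1 labels (1 = advertising post, 0 = ordinary post)
    Returns (weights, bias).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = y - p  # gradient of the log-likelihood w.r.t. the score
            weights = [w + lr * err * v for w, v in zip(weights, x)]
            bias += lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (advertising post) if the predicted probability exceeds 0.5."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if sigmoid(score) > 0.5 else 0


# Hypothetical toy features; real inputs would be the vectors built in Step2/Step3.
train_x = [[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]]
train_y = [1, 1, 0, 0]
weights, bias = train_maxent(train_x, train_y)
predictions = [predict(weights, bias, x) for x in train_x]
print(predictions)
```

The same routine would be run once per feature representation, which yields one model per representation for later comparison.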

Embodiment 2

[0063] Example 2: As shown in Figures 1-2, this embodiment of the advertising blog post recognition method based on the stacked denoising autoencoder is the same as Embodiment 1, wherein:

[0064] As a preferred solution of the present invention, the specific steps of Step 1 are:

[0065] Step1.1. First, write a crawler program and crawl Weibo to obtain the Weibo corpus;

[0066] Step1.2. Filter and de-duplicate the crawled Weibo corpus to obtain a non-repetitive Weibo corpus, and store it in the database;

[0067] The present invention takes into account that the crawled microblog corpus may contain repeated blog posts. These duplicates increase the workload and add little value. The corpus is therefore filtered and de-duplicated to obtain a non-repetitive corpus, which is stored in the database for convenient data management and use.
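The de-duplicate-and-store step can be sketched as follows. The source does not say how duplicates are detected or which database is used, so the sketch makes assumptions: posts count as duplicates when they are identical after whitespace normalization, and an in-memory SQLite table (hypothetical name `posts`) stands in for the database.

```python
import hashlib
import sqlite3


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def store_unique_posts(conn, posts):
    """Insert posts, silently skipping any whose normalized text is already stored.

    Returns the number of distinct posts in the table afterwards.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "digest TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )
    for post in posts:
        digest = hashlib.sha1(normalize(post).encode("utf-8")).hexdigest()
        # The primary key on digest makes the database reject repeats.
        conn.execute(
            "INSERT OR IGNORE INTO posts (digest, content) VALUES (?, ?)",
            (digest, post),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


conn = sqlite3.connect(":memory:")
crawled = ["Big sale today!", "Big   sale today! ", "Nice weather in Kunming"]
stored = store_unique_posts(conn, crawled)
print(stored)
```

Letting the primary key enforce uniqueness means repeated crawls can be fed into the same table without a separate filtering pass.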

[0068] Step1.3. Manually label the corpus in the database to obtain ...

Embodiment 3

[0070] Example 3: As shown in Figures 1-2, this embodiment of the advertising blog post recognition method based on the stacked denoising autoencoder is the same as Embodiment 2, wherein:

[0071] As a preferred solution of the present invention, the specific steps of Step 2 are:

[0072] Step2.1. First use word2vec to process the Weibo text to obtain the text vector of each Weibo post;

[0073] The present invention takes into account that Sina Weibo has raised the length limit of a post from 140 to 2000 characters, so the feature words of the text have expanded correspondingly, with many synonyms and heavy dependence on context. To avoid feature word redundancy, the invention first uses word2vec to process the text. Drawing on word2vec's ability to represent semantic information, each word in the text is converted into a vector representation, and then the corresponding dimension in the vector of each word in the blog post is accumulated and d...
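The paragraph above is cut off after the per-dimension accumulation. One common way to turn word vectors into a single post vector is to sum them dimension by dimension and divide by the number of words; the sketch below assumes that reading, which the truncated source does not confirm. The embedding table is a hypothetical stand-in for a trained word2vec model, and `post_vector` is an illustrative name.

```python
def post_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding.

    tokens:     list of words from one segmented blog post
    embeddings: dict mapping word -> list of floats of length dim
    Returns a zero vector if no token is in the vocabulary.
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary words
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Hypothetical 3-dimensional embeddings for illustration only.
toy_embeddings = {"discount": [1.0, 0.0, 2.0], "buy": [3.0, 2.0, 0.0]}
vector = post_vector(["buy", "discount", "unknownword"], toy_embeddings, 3)
print(vector)  # [2.0, 1.0, 1.0]
```

Averaging keeps the post vector at a fixed length regardless of how long the post is, which matters once posts can run to 2000 characters.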

Abstract

The invention relates to a method for identifying advertisement Weibo posts based on a stacked denoising autoencoder, and belongs to the technical field of natural language processing. The method comprises the following steps: first, crawling Weibo data and obtaining a training set corpus and a test set corpus through manual annotation; second, analyzing advertisement Weibo posts to construct a text feature vector representation and a manually defined feature vector representation of each post, using the stacked denoising autoencoder to carry out feature selection on the two feature vectors to obtain two processed feature vectors, and feeding each processed feature vector into a maximum entropy classifier to obtain, separately, an optimal advertisement identification model based on the text feature vector and one based on the manually defined feature vector; third, combining the feature vectors of the two optimal models to obtain a combined feature vector, and obtaining an advertisement identification model based on the combined feature vector; and finally, selecting the model with the best classification effect to identify advertisement Weibo posts. The method addresses the problem of feature redundancy, improves the identification rate of the model, and lowers the difficulty of application.
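The feature selection step relies on a stacked denoising autoencoder. The sketch below shows the general technique under stated assumptions, not the patent's configuration: each layer uses tied weights, sigmoid units, masking noise, and squared reconstruction error, and the layers are trained greedily one after another. The layer sizes, noise rate, learning rate, and toy binary features are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrupt(x, drop_prob, rng):
    """Masking noise: zero each component independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


class DenoisingAutoencoder:
    """One layer with tied weights (the decoder reuses the encoder matrix transposed)."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_in, self.n_hidden = n_in, n_hidden
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden  # hidden bias
        self.c = [0.0] * n_in      # reconstruction bias

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + bj)
                for row, bj in zip(self.W, self.b)]

    def decode(self, h):
        return [sigmoid(sum(self.W[j][i] * h[j] for j in range(self.n_hidden))
                        + self.c[i])
                for i in range(self.n_in)]

    def train_step(self, x, drop_prob, lr, rng):
        x_noisy = corrupt(x, drop_prob, rng)
        h = self.encode(x_noisy)
        y = self.decode(h)
        # The error is measured against the CLEAN input, which is what
        # makes the autoencoder "denoising".
        d_out = [(yi - xi) * yi * (1.0 - yi) for yi, xi in zip(y, x)]
        d_hid = [h[j] * (1.0 - h[j])
                 * sum(self.W[j][i] * d_out[i] for i in range(self.n_in))
                 for j in range(self.n_hidden)]
        for j in range(self.n_hidden):
            for i in range(self.n_in):
                # Tied weights receive gradient from decoder and encoder paths.
                self.W[j][i] -= lr * (d_out[i] * h[j] + d_hid[j] * x_noisy[i])
            self.b[j] -= lr * d_hid[j]
        for i in range(self.n_in):
            self.c[i] -= lr * d_out[i]


def reconstruction_error(dae, data):
    """Mean squared error per component when reconstructing clean inputs."""
    total = 0.0
    for x in data:
        y = dae.decode(dae.encode(x))
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / (len(data) * dae.n_in)


def train_stack(data, hidden_sizes, epochs=300, drop_prob=0.3, lr=0.5, seed=0):
    """Greedy layer-wise training: each layer learns from the previous layer's codes."""
    rng = random.Random(seed)
    layers, current = [], data
    for n_hidden in hidden_sizes:
        dae = DenoisingAutoencoder(len(current[0]), n_hidden, rng)
        for _ in range(epochs):
            for x in current:
                dae.train_step(x, drop_prob, lr, rng)
        layers.append(dae)
        current = [dae.encode(x) for x in current]
    return layers, current


# Hypothetical 6-dimensional feature vectors for five posts.
features = [
    [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
]
layers, codes = train_stack(features, hidden_sizes=[4, 2])
print(len(codes), len(codes[0]))  # 5 2
```

In this sketch the top-layer codes would replace the original feature vectors as input to the classifier. Strictly speaking, an autoencoder learns a compressed representation rather than picking a subset of the original features, so "feature selection" here is best read as dimensionality reduction.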

Description

Technical field

[0001] The invention relates to an advertisement blog post recognition method based on a stacked denoising autoencoder, and belongs to the technical field of natural language processing, specifically microblog advertisement recognition.

Background art

[0002] Advertising blog posts are written by professionals, with scattered content and varied forms, so they are difficult to identify and remove with simple methods such as statistical screening. Advertising blog posts not only degrade the user experience but also adversely affect microblog-based research (such as public opinion analysis, opinion leader mining, and topic discovery). Several methods for removing advertising blog posts currently exist in China and abroad. One analyzes advertising blog posts to determine their characteristics, sums the characteristic values, and sets a threshold to filter out advertising blog posts. Another uses text data as featur...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
CPC: G06F16/951, G06F40/279
Inventor: 黄青松, 李帅彬, 栾杰, 郎冬冬, 郭勃, 刘骊, 付晓东, 宋莉娜
Owner: KUNMING UNIV OF SCI & TECH