Advertising blog post recognition method based on stacked denoising autoencoder
A recognition method based on autoencoder technology, applied in natural language data processing, unstructured text data retrieval, and text database clustering/classification, that addresses problems such as feature redundancy.
Examples
Embodiment 1
[0057] Embodiment 1: As shown in Figure 1, the specific steps of the advertising blog post recognition method based on the stacked denoising autoencoder are as follows:
[0058] Step1. First crawl the Weibo corpus, obtain the training set and test set by manually labeling the corpus, and then preprocess the corpus;
[0059] Step2. Construct microblog text feature vectors to represent blog posts, then feed the feature vectors into a maximum entropy classifier for training and modeling, to obtain an advertising blog post recognition model based on microblog text feature vectors;
[0060] Step3. Construct manually defined feature vectors to represent blog posts, then feed them into a maximum entropy classifier for training and modeling, to obtain an advertising blog post recognition model based on manually defined feature vectors;
[0061] Step4. Construct the combin...
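The training and modeling in Step2 and Step3 can be illustrated with a minimal sketch. For two classes (advertisement or not), a maximum entropy classifier is equivalent to logistic regression, so the sketch below fits one by batch gradient descent in plain Python. The function names, the hyperparameters, and the two toy features (`has_url`, `has_price_word`) are hypothetical and are not taken from the patent; this is not the patent's implementation.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_maxent(features, labels, epochs=500, lr=0.5):
    """Fit a binary maximum entropy (logistic regression) model.

    features: list of equal-length numeric feature vectors
    labels:   list of 0/1 labels (1 = advertising post, an assumption)
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log-loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average the gradient over the batch and take one descent step.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    # Return 1 when the modeled probability of the positive class is >= 0.5.
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

# Hypothetical toy data: [has_url, has_price_word] per post.
toy_features = [[1, 1], [1, 0], [0, 1], [0, 0]]
toy_labels = [1, 1, 0, 0]
toy_weights, toy_bias = train_maxent(toy_features, toy_labels)
```

In practice the feature vectors would come from the constructions described in Step2 and Step3 rather than from hand-picked binary flags, and a library implementation would normally replace this loop.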
Embodiment 2
[0063] Embodiment 2: As shown in Figures 1-2, this embodiment of the advertising blog post recognition method based on the stacked denoising autoencoder is the same as Embodiment 1, wherein:
[0064] As a preferred solution of the present invention, the specific steps of Step1 are:
[0065] Step1.1. First, manually write a crawler program and use it to crawl Weibo to obtain the Weibo corpus;
[0066] Step1.2. Filter and deduplicate the crawled Weibo corpus to obtain a non-repetitive Weibo corpus, and store it in the database;
[0067] The crawled microblog corpus may contain repeated blog posts, which increase the workload without adding meaningful information. The corpus is therefore filtered and deduplicated to obtain a non-repetitive microblog corpus, which is stored in the database to facilitate data management and use.
[0068] Step1.3. Manually mark the corpus in the database to obtain the training set and test...
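The filtering, deduplication, and storage in Step1.2 could look roughly like the sketch below. The patent does not say how duplicates are detected or which database is used; the whitespace normalization, the exact-match comparison, the SQLite backend, and the `posts` table are all assumptions made for illustration.

```python
import sqlite3

def normalize(text):
    # Collapse runs of whitespace so trivially different copies compare equal.
    return " ".join(text.split())

def deduplicate(posts):
    """Return posts with empty entries and exact duplicates removed, order kept."""
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def store_posts(conn, posts):
    # Hypothetical schema: the UNIQUE constraint also guards against
    # duplicates across separate crawl runs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, text TEXT UNIQUE)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO posts (text) VALUES (?)",
        [(p,) for p in posts],
    )
    conn.commit()

def count_posts(conn):
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

A possible usage: open a connection with `sqlite3.connect(":memory:")`, pass the crawled posts through `deduplicate`, then call `store_posts`. Near-duplicate detection (for example reposts with small edits) would need a different comparison than the exact match assumed here.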
Embodiment 3
[0070] Embodiment 3: As shown in Figures 1-2, this embodiment of the advertising blog post recognition method based on the stacked denoising autoencoder is the same as Embodiment 2, wherein:
[0071] As a preferred solution of the present invention, the specific steps of Step2 are:
[0072] Step2.1. First use word2vec to process the microblog text to obtain the text vector of the microblog;
[0073] Sina Weibo has raised the character limit for post text from the original 140 characters to 2000 characters, so the set of feature words in a post grows correspondingly; it contains a large number of synonyms, and the context dependence is strong. To avoid the feature word redundancy problem, the invention first uses word2vec to process the text, converts each word in the text into a vector representation by taking advantage of word2vec's strength in representing semantic information, and then accumulates the corresp...
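The source text is cut off at this point, so the sketch below rests on an assumption: that "accumulates" means summing the word vectors of a post element-wise to form its text vector. The embedding table is a hypothetical stand-in for a trained word2vec model, and the function name and the choice to skip out-of-vocabulary words are illustrative only.

```python
def text_vector(tokens, embeddings, dim):
    """Sum the vectors of all known tokens into one text vector.

    tokens:     list of words from one segmented blog post
    embeddings: dict mapping word -> list of floats (stand-in for word2vec)
    dim:        dimensionality of the word vectors
    """
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            # Assumption: words missing from the vocabulary are ignored.
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total

# Hypothetical 2-dimensional embeddings for illustration only.
toy_embeddings = {
    "sale": [1.0, 0.0],
    "discount": [0.5, 0.5],
    "weather": [0.0, 1.0],
}
```

Because synonyms tend to lie close together in the embedding space, a summed vector of this kind represents them similarly without needing one feature per distinct word, which is the redundancy concern raised in the paragraph above.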