Network data automatic cleaning method and system based on webpage label distribution characteristics

A technology concerning tag distribution features and network data, applied in the field of data cleaning. It addresses slow result feedback, high cost, and inconsistent updates of official-account templates, and it improves both the efficiency and the accuracy of cleaning.

Pending Publication Date: 2021-01-26
北京钛氪新媒体科技有限公司

AI Technical Summary

Problems solved by technology

Existing approaches need a large number of manually configured templates to extract the modules that have to be cleaned. Network news data contains a lot of noise: thumbnails, advertisement pictures, recommended-reading links, promotion links, GIF pictures and the like. Commonly used cleaning strategies rely on regular expressions or pattern matching, which lose information. Official-account templates are also updated inconsistently, so keeping the rules current takes a lot of human effort and problems are fed back very slowly.
As a result, the existing technical solutions are either unsuitable or deliver poor cleaning accuracy, high cost, and slow result feedback.



Examples


Embodiment

[0031] Referring to Figures 1-3, the present invention provides a technical solution: a method for automatically cleaning network data based on the distribution characteristics of webpage tags, comprising the following steps:

[0032] Step 1: Use the offline crawler system to crawl network news data:

[0033] That is, the crawler collection system collects articles and network news data on the list-page principle, which yields the offline news data;

[0034] Step 2: Parse the tree nodes of the crawled offline news data, and extract the information held in each node, such as tag name, attributes, text, and links;
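The tree-node analysis in Step 2 can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module as the parser; the source does not name one. `NodeCollector`, `extract_nodes`, the record fields, and the sample HTML are hypothetical names chosen for illustration, not part of the patented system.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class NodeCollector(HTMLParser):
    """Flatten an HTML document into node records in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # one dict per element, in order of appearance
        self._stack = []  # indices into self.nodes for currently open elements

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        self.nodes.append({
            "tag": tag,
            "attrs": attr_map,
            "text": "",
            # Treat href/src as the node's link (an assumption of this sketch).
            "link": attr_map.get("href") or attr_map.get("src"),
            "depth": len(self._stack),
        })
        if tag not in VOID_TAGS:
            self._stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Pop back to the most recent open element with this tag name.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attach stripped text to the innermost open element.
        text = data.strip()
        if text and self._stack:
            node = self.nodes[self._stack[-1]]
            node["text"] = (node["text"] + " " + text).strip()


def extract_nodes(html):
    """Return the flattened node records for an HTML string."""
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


sample = '<div class="content"><p>Hello <a href="/more">more</a></p><img src="a.gif"></div>'
for node in extract_nodes(sample):
    print(node["depth"], node["tag"], node["attrs"], repr(node["text"]), node["link"])
```

Each record keeps the node's position in document order, which later steps can use to reassemble the cleaned article.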

[0035] Step 3: Following the idea of n-gram2vec, use the current node to predict the information of the other node blocks, and obtain the word embedding information of the tags through training:

[0036] A data model is trained following the idea of n-gram2vec, and the original text with HTML tags is processed by feature ...
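The source text is truncated here, so the exact feature processing is not known. As a rough illustration of "predicting other nodes from the current node", the sketch below builds (center, context) training pairs over a sequence of tag tokens, in the style of a skip-gram setup. The token format, the window size, and the names `node_token` and `context_pairs` are assumptions of this sketch; it stops at pair generation and does not train embeddings.

```python
def node_token(tag, attrs):
    """Build a token such as 'div.content' from a tag name and its first class."""
    css_classes = (attrs.get("class") or "").split()
    return tag if not css_classes else tag + "." + css_classes[0]


def context_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical node sequence from one article page.
tokens = [
    node_token("div", {"class": "content main"}),
    node_token("p", {}),
    node_token("img", {}),
    node_token("p", {}),
]
for center, context in context_pairs(tokens, window=1):
    print(center, "->", context)
```

Pairs like these could feed any embedding trainer that learns to predict the context token from the center token; which trainer the patent uses is not stated in the visible text.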


Abstract

The invention discloses a method and system for automatically cleaning network news data based on webpage tag distribution characteristics. The method comprises the following steps: crawling network news data through an offline crawler system; carrying out tree node analysis on the crawled offline news data and extracting the information in each node, such as tag name, attributes, text, and links; following the idea of n-gram2vec, predicting the information of other node blocks from the current node and obtaining the word embedding information of the tags through training; and constructing an intelligent model discrimination system on the pre-trained word embedding information to decide which of the flattened (tiled) nodes are kept. Specifically, the intelligent model is divided into a text discrimination model and a picture discrimination model according to the type of the article tag, the two types of models are trained with different feature engineering, and finally predictions are made and the results of the two types of models are combined according to the original node order.
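The final routing-and-merging step can be sketched as follows. The two discriminators here are toy rules standing in for the trained text and picture models, which the visible text does not specify; `clean_nodes`, `toy_keep_text`, `toy_keep_picture`, and the node fields are hypothetical names introduced for illustration.

```python
def clean_nodes(nodes, keep_text, keep_picture):
    """Route each node to the text or picture discriminator, then merge kept nodes in original order."""
    decisions = {}
    for i, node in enumerate(nodes):
        if node["tag"] == "img":
            decisions[i] = keep_picture(node)  # picture discrimination model
        else:
            decisions[i] = keep_text(node)     # text discrimination model
    # Sorting by index restores the original node sequence.
    return [nodes[i] for i in sorted(decisions) if decisions[i]]


def toy_keep_text(node):
    """Stand-in rule (assumption): keep text nodes with at least 10 characters."""
    return len(node.get("text", "")) >= 10


def toy_keep_picture(node):
    """Stand-in rule (assumption): drop GIF pictures, keep everything else."""
    return not (node.get("link") or "").endswith(".gif")


nodes = [
    {"tag": "p", "text": "A long enough paragraph."},
    {"tag": "img", "link": "ad.gif"},
    {"tag": "p", "text": "short"},
    {"tag": "img", "link": "photo.jpg"},
]
for node in clean_nodes(nodes, toy_keep_text, toy_keep_picture):
    print(node)
```

Passing the discriminators in as arguments keeps the merge logic independent of how each model is trained.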

Description

Technical field

[0001] The invention relates to the technical field of data cleaning, and specifically to a method and system for automatically cleaning network data based on the distribution characteristics of webpage tags.

Background

[0002] In conventional data collection, the collection business logic is written, tasks are distributed so that a downloader fetches the webpage content, rules are written according to the style of each article, and the required content is cleaned out. A large number of manually configured templates are required to extract the modules that have to be cleaned. Network news data contains a lot of noise: thumbnails, advertisement pictures, recommended-reading links, promotion links, GIF pictures and the like. Commonly used cleaning strategies rely on regular expressions or pattern matching, which lose information. Official-account templates are also updated inconsistently, which requires a lot of human resources, and the feedback of problems is very ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/216, G06K9/62, G06N3/04, G06N3/08, G06F16/951, G06F16/35
CPC: G06F40/216, G06N3/084, G06F16/951, G06F16/35, G06N3/045, G06F18/214
Inventor: 朱俊杰
Owner: 北京钛氪新媒体科技有限公司