Webpage screening method and device thereof

A webpage screening method, applied in the field of information retrieval, addresses the high risk of crawl failures or site bans and the resulting low crawl success rate. It reduces the risk of crawl failures or site bans and improves the crawl success rate while preserving the quality of the crawled webpages.

Active Publication Date: 2013-05-22
人民数据管理(北京)有限公司

AI Technical Summary

Problems solved by technology

[0005] It can be seen that under the premise of ensuring the quality of the webpage, the existing webpage filtering method will br



Examples


Embodiment Construction

[0061] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

[0062] In the prior art, webpage screening methods that preserve webpage quality carry a higher risk of crawl failures or site bans, which ultimately lowers the crawl success rate. To solve this problem, the embodiments of the present invention provide a webpage screening method and device.

[0063] A method for screening webpages provided...


Abstract

The invention discloses a webpage screening method and a webpage screening device. The method comprises: crawling a preset seed webpage; extracting the uniform resource locator (URL) information contained in the seed webpage; calculating the webpage quality score corresponding to each piece of URL information; dividing the URL information into corresponding candidate sets according to preset network address information; and screening out, from each candidate set, a number of pieces of URL information not greater than a preset pressure quota, such that the webpage quality score of each screened-out piece of URL information is not lower than that of any remaining URL information in the same candidate set. The crawl pressure value corresponding to each network address is enforced through the preset pressure quota. The webpages corresponding to the screened-out URL information are taken as the target webpages to crawl. The method lowers the risk of crawl failures or site bans and thereby improves the crawl success rate.
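The screening step in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `screen_urls`, the use of the URL hostname as the "network address", the `quota_by_host` mapping, and the `default_quota` fallback are all assumptions made for illustration, and the quality scores are taken as given rather than computed.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def screen_urls(scored_urls, quota_by_host, default_quota=1):
    """Pick, per host, at most `quota` URLs with the highest quality scores.

    scored_urls:   iterable of (url, quality_score) pairs (scores assumed given).
    quota_by_host: hypothetical mapping of hostname -> pressure quota.
    """
    # Divide the URLs into candidate sets keyed by network address.
    # Assumption: the hostname stands in for the "network address".
    candidates = defaultdict(list)
    for url, score in scored_urls:
        host = urlsplit(url).hostname or ""
        candidates[host].append((url, score))

    selected = []
    for host, items in candidates.items():
        quota = quota_by_host.get(host, default_quota)
        # Sort by descending score so that every kept URL scores at least
        # as high as any URL left behind in the same candidate set.
        items.sort(key=lambda pair: pair[1], reverse=True)
        selected.extend(url for url, _ in items[:max(quota, 0)])
    return selected


if __name__ == "__main__":
    scored = [
        ("http://a.example/low", 0.1),
        ("http://a.example/high", 0.9),
        ("http://a.example/mid", 0.5),
        ("http://b.example/x", 0.3),
    ]
    print(screen_urls(scored, {"a.example": 2}))
```

Capping the selection per host, rather than taking a global top-N by score, is what keeps a single high-scoring site from receiving more requests than its quota allows.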

Description

Technical Field

[0001] The invention relates to the technical field of information retrieval, and in particular to a webpage screening method and device suitable for a web crawler system.

Background Art

[0002] A web crawler, an important component of a search engine, is a program that automatically fetches web pages; it downloads web pages from the Internet for the search engine. To meet the search engine's need to cover valuable information on the Internet quickly and comprehensively, crawlers must crawl a large number of web pages every day.

[0003] Because the Internet contains a large amount of webpage information and the crawling capacity of a web crawler is limited, the existing webpage screening method, in order to screen out webpages of higher quality, comprises: after the web crawler crawls one or several seed webpages, extract the URL information on the seed webpages, calculate the webpage quality score corresponding to...


Application Information

IPC(8): G06F17/30
Inventor: 张恒, 崔世起, 杨青
Owner: 人民数据管理(北京)有限公司