Method, device and system for collecting effective information web pages in website information

A technology for collecting effective-information web pages from website information, applied in the network field. It addresses unstable crawling results and the heavy resource consumption of web crawler systems, reducing interference and improving resource utilization.

Inactive Publication Date: 2013-01-09
BEIJING QIHOO TECH CO LTD +1
Cites: 4. Cited by: 19.

AI Technical Summary

Problems solved by technology

[0013] In view of this, the technical problem to be solved by this application is to provide a device and method for collecting effective-information web pages in website information, so as to solve the problems that crawling results in a web crawler system are unstable and that the web crawler system consumes a large amount of resources.



Examples


Embodiment 2

[0099] For the specific implementation of the device for collecting valid-information web pages in website information described in Embodiment 2, refer to the method described above; it is not described in detail again here.

[0100] As shown in Figure 3, a system for collecting valid-information web pages in website information according to Embodiment 3 of the present invention includes a content management device (CMS, Content Management System) 301, a link library (URLDB) 302, and a web page collection device (Crawler) 303, wherein:

[0101] The content management device 301 is coupled with the link library 302 and is used to pre-configure a list page URL link template and a product page URL link template, wherein the pre-configured product page URL link template includes product attribute information.
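The arrangement of the three components can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation in the patent: the class names, the `configure` and `classify` methods, the `shop.example.com` URLs, the use of regular expressions as "templates", and the idea of carrying product attribute information as a named group are all hypothetical. The sketch also assumes the crawler reads the templates from the link library.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LinkLibrary:
    """Stands in for the link library (URLDB) 302; stores templates per site."""
    list_templates: dict = field(default_factory=dict)
    product_templates: dict = field(default_factory=dict)


class ContentManagement:
    """Stands in for the content management device (CMS) 301."""

    def __init__(self, library: LinkLibrary) -> None:
        self.library = library

    def configure(self, site: str, list_template: str, product_template: str) -> None:
        # Assumption: a template is a regular expression over the URL.
        self.library.list_templates[site] = re.compile(list_template)
        self.library.product_templates[site] = re.compile(product_template)


class Crawler:
    """Stands in for the web page collection device (Crawler) 303."""

    def __init__(self, library: LinkLibrary) -> None:
        self.library = library

    def classify(self, site: str, url: str):
        """Return (kind, attributes) for a URL using the configured templates."""
        if self.library.list_templates[site].fullmatch(url):
            return ("list", {})
        match = self.library.product_templates[site].fullmatch(url)
        if match:
            # Assumption: named groups carry the product attribute information.
            return ("product", match.groupdict())
        return ("other", {})


library = LinkLibrary()
ContentManagement(library).configure(
    "shop",
    r"https://shop\.example\.com/list/\d+\.html",
    r"https://shop\.example\.com/item/(?P<product_id>\d+)\.html",
)
crawler = Crawler(library)
print(crawler.classify("shop", "https://shop.example.com/item/42.html"))
```

Keeping the templates in a shared store means a new site can be supported by configuration alone, without changing crawler code.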

[0102] Wherein, in the URL link tem...


Abstract

The invention discloses a method, a device and a system for collecting effective-information web pages in website information. The method mainly comprises the following steps: identifying list pages in the collected website according to a pre-configured list page website link template, and obtaining all internal website links contained in each list page; matching all the internal website links contained in each list page against a pre-configured commodity page website link template to obtain the website links of the commodity pages, wherein the pre-configured commodity page website link template contains product attribute information; and collecting the website links of all the obtained commodity pages. The method, device and system can solve the problems of unstable crawling results in a web crawler system and heavy resource consumption by the web crawler system.
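The steps in the abstract can be sketched as below. This is a minimal illustration, not the patented implementation: the URL patterns, the `shop.example.com` host, and the function names are hypothetical, regular expressions stand in for the "website link templates", and the pages are assumed to have been fetched already and passed in as a URL-to-HTML mapping.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical templates; the source does not give concrete patterns.
LIST_PAGE_TEMPLATE = re.compile(r"^https://shop\.example\.com/list/\d+\.html$")
PRODUCT_PAGE_TEMPLATE = re.compile(
    r"^https://shop\.example\.com/item/(?P<product_id>\d+)\.html$"
)


class _LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return the de-duplicated absolute links that stay on the page's host."""
    parser = _LinkExtractor()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def collect_product_links(pages):
    """pages maps URL -> HTML. Only list pages are scanned for product links."""
    collected = []
    for url, html in pages.items():
        if not LIST_PAGE_TEMPLATE.match(url):
            continue  # not a list page, so its links are ignored
        for link in internal_links(url, html):
            if PRODUCT_PAGE_TEMPLATE.match(link) and link not in collected:
                collected.append(link)
    return collected
```

Because links are followed only from pages that match the list page template, and kept only when they match the commodity page template, the crawl avoids a full traversal of the site.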

Description

Technical field

[0001] The invention belongs to the field of network technology, and in particular relates to a method, device and system for collecting effective-information web pages in website information.

Background

[0002] A web crawler is a program that automatically obtains web page content and is an important part of a search engine. Search engine optimization is therefore largely aimed at crawlers.

[0003] When a web crawler crawls a website, in order to crawl the entire website as completely as possible, it usually performs a deep traversal of the entire website. In some vertical fields, this leads to problems such as inefficiency and loop crawling. In vertical crawling, it is usually not necessary to crawl all the web pages of a website; it is enough to crawl some key pages and then extract the effective information.

[0004] Especially, for some shopping websites on the Internet, the webpages that are...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 周雷, 高扬
Owner: BEIJING QIHOO TECH CO LTD