Webpage mark extracting method

A webpage identifier and storage structure technology, applied in the field of search engines, that addresses the problems of excessive memory occupation and time-consuming processing, and achieves the effects of reducing excessive memory occupation, increasing processing speed, and reducing the number

Active Publication Date: 2007-07-04
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0011] For each specified URL, a combination of multiple hash functions is used to generate a corresponding decision vector, and that decision vector is used to determine whether the specified URL is in the set of URLs that have already been crawled. However, judging by decision vector causes a certain degree of misjudgment. Reducing the probability of misjudgment requires combining several well-performing hash functions and generating a decision vector with a large number of series. This makes the process of judging whether a URL is in the crawled URL set very time-consuming, and a decision vector with a large number of series also occupies more memory.
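The trade-off described in [0011] can be illustrated with a small sketch. It assumes the decision vector behaves like a Bloom filter bit array, which the text does not state explicitly. The class name, the salted SHA-256 hashing, and the example URLs are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Decision vector of num_bits bits, probed by num_hashes hash functions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str):
        # Assumption: salting SHA-256 with the hash index stands in for
        # a combination of several independent hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # True can be a misjudgment (false positive); False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


def false_positive_rate(num_bits: int, num_hashes: int,
                        num_urls: int = 200, num_probes: int = 2000) -> float:
    """Insert num_urls URLs, then probe with URLs that were never inserted."""
    bf = BloomFilter(num_bits, num_hashes)
    for n in range(num_urls):
        bf.add(f"http://example.com/page/{n}")
    hits = sum(
        bf.might_contain(f"http://example.org/other/{n}") for n in range(num_probes)
    )
    return hits / num_probes
```

Comparing `false_positive_rate(256, 3)` with `false_positive_rate(8192, 3)` shows the misjudgment rate falling as the vector grows, at the cost of more memory and, with more hash functions, more work per lookup.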

Embodiment Construction

[0049] The main realization principle, specific implementation process and beneficial effects of the present invention will be described in detail below with reference to each accompanying drawing.

[0050] Please refer to Fig. 1, which is a flow chart of the main realization principle of the method for crawling webpage identifiers of the present invention. In implementing the method, at least one first storage structure must be set up in advance to store the hash values of a specified number of the most recently crawled webpage identifiers. Preferably, this first storage structure is placed in a storage medium with a higher processing speed and a higher price, such as memory. At the same time, at least one second storage structure must be set up to store the hash values of all webpage identifiers that have been crawled, wherein the set second storag...
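The two storage structures can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the lookup order (first structure, then second), the least-recently-used eviction policy, the SHA-256 based `url_hash`, and the use of a Python `set` as a stand-in for the second storage structure are all hypothetical.

```python
import hashlib
from collections import OrderedDict


def url_hash(url: str) -> int:
    # Assumption: a 64-bit value taken from SHA-256; the source does not
    # name a hash function.
    return int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest()[:8], "big")


class TwoTierSeenStore:
    def __init__(self, recent_capacity: int) -> None:
        self.recent_capacity = recent_capacity
        self.recent = OrderedDict()  # first storage structure: bounded, in memory
        self.all_hashes = set()      # stand-in for the second storage structure
        self.second_lookups = 0      # counts how often the second structure is consulted

    def _remember_recent(self, h: int) -> None:
        self.recent[h] = None
        self.recent.move_to_end(h)
        if len(self.recent) > self.recent_capacity:
            self.recent.popitem(last=False)  # evict the oldest entry

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL was already crawled; otherwise record it."""
        h = url_hash(url)
        if h in self.recent:
            self.recent.move_to_end(h)
            return True
        self.second_lookups += 1
        seen = h in self.all_hashes
        if not seen:
            self.all_hashes.add(h)
        self._remember_recent(h)
        return seen
```

A repeated URL that is still in the first structure is answered without touching the second one, which is the intended saving when recently crawled pages link to each other heavily.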

Abstract

The invention discloses a method for crawling webpage identifiers. The method includes setting up a first storage structure, used to store the hash values of a specified number of the most recently crawled webpage identifiers, and a second storage structure, used to store the hash values of the webpage identifiers that have been crawled. The second storage structure includes an original sub-storage structure and a conflict-avoiding sub-storage structure corresponding to each node in the original sub-storage structure; hash values of webpage identifiers that conflict in the original sub-storage structure are resolved by the conflict-avoiding sub-storage structure. The invention can increase the speed of judging whether a webpage identifier to be crawled is in the set of crawled webpage identifiers, and reduce excessive occupation of memory resources during crawling.
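One way to picture the second storage structure is a fixed array of nodes with a per-node overflow list. This is a sketch under assumptions: the abstract does not say how a hash value is mapped to a node or what form the conflict-avoiding sub-storage structure takes, so the modulo mapping, the Python lists, and the class name are hypothetical.

```python
class SecondStorage:
    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        # Original sub-storage structure: one hash value per node.
        self.primary = [None] * num_nodes
        # Conflict-avoiding sub-storage structure: one overflow list per node.
        self.overflow = [[] for _ in range(num_nodes)]

    def _node(self, h: int) -> int:
        # Assumption: the node index is the hash value modulo the node count.
        return h % self.num_nodes

    def contains(self, h: int) -> bool:
        i = self._node(h)
        return self.primary[i] == h or h in self.overflow[i]

    def add(self, h: int) -> bool:
        """Insert a hash value; return False if it was already stored."""
        if self.contains(h):
            return False
        i = self._node(h)
        if self.primary[i] is None:
            self.primary[i] = h         # node was free
        else:
            self.overflow[i].append(h)  # conflict: store next to the node
        return True
```

Because the full hash value is stored and compared, two different hash values that land on the same node are kept apart rather than mistaken for each other.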

Description

Technical field

[0001] The invention relates to the technical field of search engines in Internet systems, in particular to a method for crawling webpage identifiers.

Background

[0002] Search engine technology has been a very popular network search technology in recent years. The web search, news search, music search, picture search and map search technologies based on it each have great practical and commercial value. The crawler subsystem (Crawler, the subsystem in a search engine system responsible for fetching raw Internet data resources) is a very important part of the search engine system. Its role is to provide the search engine system with the most original source of Internet data, such as web pages, mp3 files, pictures, emails, documents or software resources, and so greatly expand the application of search engine technology in various settings. Among them, a well-designed and well-structured crawler is the prere...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 杨卫
Owner: TENCENT TECH (SHENZHEN) CO LTD