Method and device for resource search and scheduling

A resource search and scheduling method and device, applied in the field of data search. It addresses the reduced collection coverage of the Spider system caused by missed links, and achieves improved collection coverage, avoidance of missed links, and savings in network traffic resources.

Active Publication Date: 2019-04-05
BEIJING QIHOO TECH CO LTD

AI Technical Summary

Problems solved by technology

[0009] Links missed during the scheduling process reduce the collection coverage of the Spider system.



Examples


Embodiment 1

[0081] The flow of the resource search and scheduling method provided by Embodiment 1 of the present invention is shown in Figure 1 and includes the following steps:

[0082] Step S101: Obtain the current body link of the index page to be scheduled.

[0083] Each time the index page is scheduled, the page is parsed, and the body link found on it is extracted and recorded. Specifically, the current body link of the index page to be scheduled is obtained by determining the largest similar block in that page.

[0084] Here, an index page is a web page whose main body consists of links rather than content text. The body link is the set of links that corresponds to the main body of the index page. For example, Figure 2 shows a screenshot of an index page; the body link of the index page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml is as shown in Figure 2, which includes links to the t...

Embodiment 2

[0096] The flow of the resource search and scheduling method provided by Embodiment 2 of the present invention is shown in Figure 3 and includes the following steps:

[0097] Step S201: Obtain the current body link of the index page to be scheduled.

[0098] For the index page to be scheduled, the current body link is obtained by determining the largest similar block in the page. Determining the largest similar block specifically includes:

[0099] Nodes that share the same XML (eXtensible Markup Language) path, XPath for short, are grouped to obtain the similar blocks; the largest similar block in the index page to be scheduled is then determined according to the position and area of the similar blocks.

[0100] Optionally, the position of the similar block is determined according to the width, height, top margin, and left margin of the similar block in t...
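The grouping in [0099] can be illustrated with a minimal sketch. It is not the patented implementation: the `LinkNode` record, the `generalize` helper, the rule of dropping positional indices so that sibling nodes share one path, and the use of bounding-box area as the size measure are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LinkNode:
    """Hypothetical record for one rendered link on the index page."""
    href: str
    xpath: str    # e.g. "/html/body/div[2]/ul/li[3]/a"
    left: float   # left margin of the node
    top: float    # top margin of the node
    width: float
    height: float


def generalize(xpath):
    """Drop positional indices so sibling nodes share one path (assumption)."""
    return re.sub(r"\[\d+\]", "", xpath)


def largest_similar_block(nodes):
    """Group nodes by generalized XPath and return the group whose
    bounding box covers the largest area (empty list if no nodes)."""
    blocks = defaultdict(list)
    for node in nodes:
        blocks[generalize(node.xpath)].append(node)

    def bounding_area(group):
        left = min(n.left for n in group)
        top = min(n.top for n in group)
        right = max(n.left + n.width for n in group)
        bottom = max(n.top + n.height for n in group)
        return (right - left) * (bottom - top)

    return max(blocks.values(), key=bounding_area, default=[])
```

Dropping every positional index is deliberately coarse: two separate lists with the same tag structure would merge into one block, so a real system would likely need a finer grouping rule.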

Embodiment 4

[0139] The flow of the resource search and scheduling method provided by Embodiment 4 of the present invention is shown in Figure 7 and includes the following steps:

[0140] Step S401: Obtain the current body link of the index page to be scheduled.

[0141] Step S402: Extract a time series, in order, from the release time information contained in each link of the obtained current body link.

[0142] In this embodiment, before the obtained current body link is compared with the historical body link of the index page to be scheduled, it is determined whether the links in the current body link are arranged in chronological or reverse chronological order; if not, they are sorted to facilitate the subsequent comparison of body links.

[0143] As shown in Figure 4, in the large box below, the underlined text is the release time corresponding to the body link, and each body link has a release time information correspondi...
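The ordering check described in [0142] might look like the following sketch. The timestamp format, the `(url, release_time)` pair layout, and the choice to sort newest first when no order is found are assumptions for illustration; the source does not specify them.

```python
from datetime import datetime


def is_monotonic(times):
    """True if times are entirely non-decreasing or entirely non-increasing."""
    pairs = list(zip(times, times[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def order_links_by_time(links, fmt="%Y-%m-%d %H:%M"):
    """links: list of (url, release_time_string) pairs.

    Returns the links unchanged if they already follow chronological or
    reverse chronological order; otherwise returns them newest first.
    """
    times = [datetime.strptime(stamp, fmt) for _, stamp in links]
    if is_monotonic(times):
        return list(links)
    ranked = sorted(zip(times, links), key=lambda pair: pair[0], reverse=True)
    return [link for _, link in ranked]
```

Normalizing the order up front means the later comparison against the historical body link can work on sequences rather than unordered sets.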


Abstract

The invention provides a resource search and scheduling method and device. The method comprises the following steps: obtaining the current body link of an index page to be scheduled; comparing the obtained current body link with the historical body link of the index page to be scheduled; if the comparison result indicates that links have been missed, performing page-turning scheduling on the index page to be scheduled until no links are missed, and then executing the subsequent scheduling operation; and if the comparison result indicates that no links have been missed, executing the subsequent scheduling operation directly. Missed links during resource search and scheduling can thus be avoided, and resource collection coverage is improved.
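The compare-then-page-turn loop can be sketched as follows. The omission rule used here (no overlap between the current and historical links means links may have scrolled off the page), the `fetch_page` callable, and the `max_pages` cap are assumptions for illustration; the visible text does not define how an omission is detected.

```python
def has_omission(current, history):
    """Assumed rule: if none of the current links was seen before, links
    published between the two visits may have scrolled off this page."""
    return bool(history) and not (set(current) & set(history))


def schedule_index_page(fetch_page, history, max_pages=10):
    """Collect links not yet in history, turning pages while links are missed.

    fetch_page(n) is a hypothetical callable returning the body link of
    page n of the index page as a list of URLs (1 = first page).
    """
    seen = set(history)
    new_links = []
    for page in range(1, max_pages + 1):
        links = fetch_page(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                new_links.append(link)
        # Stop once a page overlaps with history or no more pages exist.
        if not links or not has_omission(links, history):
            break
    return new_links
```

Page turning happens only when an omission is suspected, which is how deeper pages are fetched without spending traffic on them at every scheduling round.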

Description

Technical field

[0001] The invention relates to the technical field of data search, and in particular to a resource search and scheduling method and device.

Background technique

[0002] In network data search technology, the Spider system sits at the most upstream point of the search engine data flow. It is responsible for collecting resources from the Internet to local storage and providing them for subsequent retrieval, and it is one of the most important data sources of a search engine. The goal of the Spider system is to discover and crawl all valuable web pages on the Internet. To achieve this goal, the first task is to find the links to valuable web pages. Current Spider systems use a scheduling mechanism to discover resource links as quickly and completely as possible.

[0003] For example, when scheduling resource links, the following mechanisms can be set:

[0004] Mechanism 1: Schedule the excavated seeds according to a certain period (for example, 20 times a day), ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/958, G06F16/951
CPC: G06F16/951, G06F16/958
Inventor: 郑燕琴
Owner: BEIJING QIHOO TECH CO LTD