Method and device for resource search and scheduling
A resource search and scheduling method and device, applied in the field of resource search and scheduling. The technology addresses the problem of reduced collection coverage in a Spider system, and achieves the effects of improving collection coverage, avoiding missed links, and saving network traffic resources.
Examples
Embodiment 1
[0081] The flow of the resource search and scheduling method provided by Embodiment 1 of the present invention is shown in Figure 1 and includes the following steps:
[0082] Step S101: Obtain the current subject link of the index page to be scheduled.
[0083] Each time the index page is scheduled, the index page is parsed, and the subject link found in it is extracted and recorded. Specifically, the current subject link of the index page to be scheduled is obtained by determining the largest similar block in that page.
[0084] Here, an index page is a web page whose main body consists of links rather than content text. The subject link is the set of links that make up the main body of the index page. For example, Figure 2 is a screenshot of an index page; the subject link of the index page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml is as shown in Figure 2, which includes links to the t...
Embodiment 2
[0096] The flow of the resource search and scheduling method provided by Embodiment 2 of the present invention is shown in Figure 3 and includes the following steps:
[0097] Step S201: Obtain the current subject link of the index page to be scheduled.
[0098] For the index page to be scheduled, the current subject link is obtained by determining the largest similar block in the page. Determining the largest similar block specifically includes:
[0099] Obtain the nodes that share the same eXtensible Markup Language (XML) path, referred to as XPath, to get the similar blocks; then determine the largest similar block in the index page to be scheduled according to the position and area of each similar block.
[0100] Optionally, the position of the similar block is determined according to the width, height, top margin, and left margin of the similar block in t...
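The grouping of same-XPath nodes into similar blocks can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `LinkPathCollector`, `similar_blocks`, `largest_similar_block`, `SAMPLE_HTML`, and `SAMPLE_GEOMETRY` are hypothetical, the XPath is simplified to an index-free tag path, and the block geometry is assumed to come from a rendering engine that is not shown. Because the position rule is truncated in the text above, the sketch selects by area only.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never receive an end tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Record, for every <a href>, its tag path from the root (an index-free XPath)."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags from the root down
        self.blocks = defaultdict(list)  # tag path -> hrefs found at that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.blocks["/" + "/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def similar_blocks(html):
    """Group links whose nodes share the same tag path into similar blocks."""
    collector = LinkPathCollector()
    collector.feed(html)
    return dict(collector.blocks)


def largest_similar_block(blocks, geometry):
    """Return (path, links) of the block with the largest area.

    geometry maps path -> (left, top, width, height); these values are an
    assumption of this sketch and would come from a layout engine.
    """
    def area(path):
        _left, _top, width, height = geometry[path]
        return width * height

    best = max(blocks, key=area)
    return best, blocks[best]


SAMPLE_HTML = """
<html><body>
<div><a href="/home">Home</a><a href="/about">About</a></div>
<ul>
<li><a href="/n1">News 1</a></li>
<li><a href="/n2">News 2</a></li>
<li><a href="/n3">News 3</a></li>
</ul>
</body></html>
"""

# Hypothetical rendered geometry: a thin navigation bar and a tall news list.
SAMPLE_GEOMETRY = {
    "/html/body/div/a": (0, 0, 960, 40),
    "/html/body/ul/li/a": (0, 60, 640, 900),
}
```

Calling `largest_similar_block(similar_blocks(SAMPLE_HTML), SAMPLE_GEOMETRY)` selects the news list rather than the navigation bar, because the list block covers the larger area.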
Embodiment 4
[0139] The flow of the resource search and scheduling method provided by Embodiment 4 of the present invention is shown in Figure 7 and includes the following steps:
[0140] Step S401: Obtain the current subject link of the index page to be scheduled.
[0141] Step S402: Extract a time series, in order, from the release time information contained in each link of the obtained current subject link.
[0142] In this embodiment, before the obtained current subject link is compared with the historical subject link of the index page to be scheduled, it is determined whether the links in the current subject link are arranged in chronological or reverse chronological order. If they are not, they are rearranged to facilitate the subsequent comparison of subject links.
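The order check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the embodiment: the function names `release_time`, `ordering`, and `arrange_links` are hypothetical, the release time is assumed to appear in the link text as `YYYY-MM-DD HH:MM`, and unordered links are assumed to be rearranged newest first, since the text does not state which order is used.

```python
import re
from datetime import datetime

# Assumed time format in the link text, e.g. "2012-03-01 10:00".
TIME_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2})")


def release_time(text):
    """Extract the release time from a link's text, or None if absent."""
    match = TIME_PATTERN.search(text)
    if match is None:
        return None
    return datetime(*map(int, match.groups()))


def ordering(times):
    """Classify a time series as chronological, reverse, or unordered."""
    pairs = list(zip(times, times[1:]))
    if all(a <= b for a, b in pairs):
        return "chronological"
    if all(a >= b for a, b in pairs):
        return "reverse"
    return "unordered"


def arrange_links(links):
    """Return (order, urls) for a list of (url, text) pairs.

    Links that already follow an ordering rule are left untouched;
    otherwise they are sorted newest first (an assumption of this sketch).
    """
    timed = [(release_time(text), url) for url, text in links]
    if any(t is None for t, _ in timed):
        raise ValueError("every link is assumed to carry a release time")
    order = ordering([t for t, _ in timed])
    if order == "unordered":
        timed.sort(key=lambda pair: pair[0], reverse=True)
        order = "reverse"
    return order, [url for _, url in timed]
```

Leaving an already ordered list untouched keeps the page's own arrangement, so the later comparison with the historical subject link sees the links in the order the site published them.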
[0143] As shown in Figure 4, in the large box at the bottom, the underlined text is the release time corresponding to each subject link, and each subject link has a release time information correspondi...