Method and device for collecting webpage data of direction site based on internet

Relates to web page data and Internet technology, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the problem that data collection from a collection site cannot be guaranteed, and achieves effective data collection.

Active Publication Date: 2011-07-06
NEW FOUNDER HLDG DEV LLC +3

AI Technical Summary

Problems solved by technology

[0004] To solve the problem that prior-art file collection systems cannot guarantee timely and effective data collection from the collection site, an embodiment of the present invention provides a method for collecting web page data from Internet-based directional sites, including:


Examples


Embodiment Construction

[0015] To solve the problem that prior-art collection systems cannot guarantee timely and effective data collection from the collection site, an embodiment of the present invention provides a method for collecting web page data from Internet-based directional sites, in particular the management of URL priorities and of the priority of the collection queue (that is, the queue to be accessed in the collection system). Specifically, the method includes: configuring the collection task, including the starting URL and the collection depth; collecting web page data from the specified starting URL; assigning different priorities to the newly parsed URLs (i.e. the URLs to be collected) according to the URL classification mechanism; and inserting them into the corresponding priority queues. In this embodiment, the URLs to be collected are the URLs that are to be collected and are added to the queue of URLs to be accessed.
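The steps in paragraph [0015] can be sketched in Python as follows. This is only an illustration under stated assumptions: the priority levels, the `classify` rules, and the names `CollectionTask` and `enqueue` are hypothetical, since the text does not spell out the URL classification mechanism.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical priority levels; the source does not enumerate them.
PRIORITIES = ("high", "medium", "low")


@dataclass
class CollectionTask:
    """A collection task: starting URL, collection depth, per-priority queues."""
    start_url: str
    max_depth: int
    # One FIFO queue of (url, depth) pairs per priority level.
    queues: dict = field(
        default_factory=lambda: {p: deque() for p in PRIORITIES}
    )


def classify(url: str) -> str:
    """Hypothetical stand-in for the URL classification mechanism."""
    if "/list" in url:
        return "high"    # assumed: list pages surface new content links
    if url.endswith(".html"):
        return "medium"  # assumed: content pages
    return "low"


def enqueue(task: CollectionTask, url: str, depth: int) -> bool:
    """Insert a newly parsed URL into the queue matching its priority.

    Returns False when the URL lies beyond the configured collection depth.
    """
    if depth > task.max_depth:
        return False
    task.queues[classify(url)].append((url, depth))
    return True


task = CollectionTask(start_url="http://example.com/list/", max_depth=2)
enqueue(task, task.start_url, 0)
enqueue(task, "http://example.com/news/1.html", 1)
enqueue(task, "http://example.com/about", 3)  # rejected: deeper than max_depth
```

Keeping one FIFO queue per priority level preserves discovery order within a level while leaving the choice between levels to a separate scheduling step.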

[0016] When the web page download module requests an available URL fr...


Abstract

The invention discloses a method and a device for collecting web page data from Internet-based directional sites, to solve the prior-art problem that a file collection system cannot collect data from a collection site in a timely and efficient manner. The method comprises the following steps: adding each URL to be collected to the URL queue of the corresponding priority, according to the priority of that URL; determining the weight of each URL queue from the number of URLs in the queue, the priority value of the queue, and a weight factor, where the weight factor is the number of new URL links found on a list page that is refreshed in order to pick up new content-page links; calculating the queue weights and taking a URL from the queue of URLs to be accessed that has the highest weight; accessing that URL; and collecting page data from the accessed URL, so that data is collected in a timely and efficient manner.
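The weighted selection described in the abstract might look like the Python sketch below. The abstract names the three inputs to the weight (URL count, priority value, weight factor) but not how they are combined, so the formula in `weight` is purely an assumption. The names `UrlQueue` and `next_url` are hypothetical, and each per-priority URL container is modeled as a `deque`.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UrlQueue:
    priority_value: int
    # Number of new links found when the list page was last refreshed.
    weight_factor: int = 0
    urls: deque = field(default_factory=deque)

    def weight(self) -> int:
        # ASSUMPTION: the source does not give the formula. Here the
        # priority value scales the backlog and the weight factor is added.
        if not self.urls:
            return 0
        return self.priority_value * len(self.urls) + self.weight_factor


def next_url(queues: list) -> Optional[str]:
    """Recompute weights and pop a URL from the highest-weighted queue."""
    candidates = [q for q in queues if q.urls]
    if not candidates:
        return None
    best = max(candidates, key=lambda q: q.weight())
    return best.urls.popleft()


content_pages = UrlQueue(priority_value=1, urls=deque(["c1", "c2", "c3"]))
list_pages = UrlQueue(priority_value=2, weight_factor=5, urls=deque(["l1"]))
queues = [content_pages, list_pages]

first = next_url(queues)   # list_pages wins: 2 * 1 + 5 = 7 versus 1 * 3 = 3
second = next_url(queues)  # list_pages is now empty, so content_pages is used
```

Because weights are recomputed on every call, a list page that has just yielded many new links can temporarily outrank a longer backlog of lower-priority URLs.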

Description

Technical field

[0001] The invention belongs to the technical field of the computer Internet, and in particular relates to a method and a device for collecting web page data from Internet-based directional sites.

Background art

[0002] The Internet has been developing rapidly, and the amount of information has expanded with it. More and more people search for information through the Internet. Although all kinds of information can be found with public search engines, their results have many shortcomings: they are not timely enough, the detailed text cannot be viewed directly, and so on. As a result, many network collection systems have emerged. These collection systems generally configure the site to be collected and set the initial URL; the collection system then automatically sets the priority of each URL according to its level in the website structure and crawls web pages level by level. This mechanism basically...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 吴新丽, 杨建武, 蓝康泰, 尹小刚
Owner: NEW FOUNDER HLDG DEV LLC