Method for rapidly collecting dynamic script website data

A data collection technology for dynamic script websites, classified under electrical digital data processing and special data processing applications. It addresses the slow speed of existing approaches, which makes them unsuitable for practical use, by reducing the number of events that must be triggered and thereby improving collection speed.

Status: Inactive
Publication Date: 2010-01-13
PEKING UNIV
Cites: 0 · Cited by: 7

AI Technical Summary

Problems solved by technology

Traditional crawlers have inherent limitations when dealing with modern websites. When faced with dynamic script websites, mainstream search engines generally take one of three approaches: ignoring them, hard-coding site-specific handling, or relying on interfaces that some websites reserve for search engines.
Some research institutions have proposed crawling dynamic script websites by simulating user behavior and clicking every element on a page, but this method is very slow and unsuitable for practical use.

Examples

Embodiment Construction

[0012] The present invention will be described in detail below in conjunction with the accompanying drawings and embodiments.

[0013] Whereas the prior art either ignores dynamic script websites or handles them with hard-coded rules, the method of the present invention executes in two parts: training, then crawling. Similarity training on pages determines which events should be triggered on which page elements for each type of page. Crawling begins once training is complete, and it can adopt multiple crawling strategies. In the breadth-first strategy used in this embodiment, the crawler rolls back to the original page each time an event is triggered; only after all the events that need to be triggered on the original page have been triggered are other pages processed.
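The lookup that training makes possible can be illustrated with a minimal sketch. The text available here does not specify the similarity measure or the stored format, so everything below is an assumption for illustration only: the hypothetical TRAINED table, the use of Jaccard similarity over sets of element paths, the 0.5 threshold, and the function names are not taken from the patent.

```python
from __future__ import annotations

# Hypothetical output of a training phase: for each page type, a structural
# signature (a set of element paths) and the (XPath, event) pairs to trigger.
TRAINED = {
    "list": {
        "signature": {"/html/body/ul/li/a", "/html/body/div/a"},
        "triggers": [("//a[@class='next']", "click")],
    },
    "post": {
        "signature": {"/html/body/article/p", "/html/body/article/a"},
        "triggers": [("//a[@id='more-comments']", "click")],
    },
}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(page_paths: set[str], trained: dict, threshold: float = 0.5) -> str | None:
    """Return the most similar trained page type, or None if nothing is close enough."""
    best_type, best_score = None, 0.0
    for name, model in trained.items():
        score = jaccard(page_paths, model["signature"])
        if score > best_score:
            best_type, best_score = name, score
    return best_type if best_score >= threshold else None


def triggers_for(page_paths: set[str], trained: dict) -> list[tuple[str, str]]:
    """Events to trigger on a page, looked up through its inferred type."""
    page_type = classify(page_paths, trained)
    return list(trained[page_type]["triggers"]) if page_type else []
```

A page whose paths mostly overlap the "list" signature is assigned that type and receives only that type's triggers, while a page matching no signature receives none. This is how restricting events by page type avoids clicking every element.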

[0014] As shown in Figure 1, the training steps of the prese...

Abstract

The invention relates to a method for rapidly collecting dynamic script website data, characterized by comprising the following steps:
(i) obtaining and storing an Index page and adding the Index page to a queue to be processed;
(ii) judging whether the queue to be processed is empty; if so, crawling is finished and the crawling process exits; otherwise, randomly selecting a page from the queue to be processed, obtaining the type of the current page through page similarity, and determining, according to the XPath features extracted in the training step, the events on page elements that need to be triggered for that type;
(iii) judging whether the current page has an untriggered event; if not, returning to step (ii); otherwise, triggering the event and judging whether the current page has changed and whether the changed page is a new page; if not, going to step (v); otherwise, continuing with step (iv);
(iv) storing the new page and adding it to the queue to be processed from step (i);
(v) returning to the page state before the event was triggered and going to step (iii).
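The control flow of steps (i) to (v) can be sketched as a short simulation. This is a minimal sketch under stated assumptions, not the patented implementation: the toy SITE dictionary stands in for a real browser, RULES stands in for the XPath features produced by training, page type is read directly from the toy data instead of being inferred by similarity, and all names are hypothetical.

```python
import random

# Hypothetical toy site: each page has a type and a map from (XPath, event)
# to the page that results from triggering it. None means "no change".
SITE = {
    "index": {"type": "list", "events": {
        ("//a[@class='next']", "click"): "list2",
        ("//a[@class='item']", "click"): "post1"}},
    "list2": {"type": "list", "events": {
        ("//a[@class='next']", "click"): None,
        ("//a[@class='item']", "click"): "post2"}},
    "post1": {"type": "post", "events": {
        ("//a[@id='more-comments']", "click"): "post1_comments"}},
    "post2": {"type": "post", "events": {
        ("//a[@id='more-comments']", "click"): None}},
    "post1_comments": {"type": "post", "events": {
        ("//a[@id='more-comments']", "click"): None}},
}

# Hypothetical result of training: which events to trigger per page type.
RULES = {
    "list": [("//a[@class='next']", "click"), ("//a[@class='item']", "click")],
    "post": [("//a[@id='more-comments']", "click")],
}


def crawl(site, rules, index="index", seed=0):
    """Simulate steps (i)-(v); return the stored pages and the trigger count."""
    rng = random.Random(seed)
    stored = {index}   # (i) store the Index page ...
    queue = [index]    # ... and add it to the queue to be processed
    triggered = 0
    while queue:       # (ii) an empty queue means crawling is finished
        current = queue.pop(rng.randrange(len(queue)))  # random selection
        page_type = site[current]["type"]  # stand-in for similarity matching
        pending = list(rules.get(page_type, []))
        while pending:  # (iii) is there an untriggered event?
            event = pending.pop(0)
            triggered += 1
            result = site[current]["events"].get(event)  # trigger the event
            if result is not None and result not in stored:
                stored.add(result)    # (iv) store the new page
                queue.append(result)  # and queue it for processing
            # (v) roll back to the pre-event state; in this toy model
            # `current` is never mutated, so no explicit restore is needed.
    return stored, triggered
```

Calling crawl(SITE, RULES) discovers all five toy pages using seven triggered events. In a real crawler, step (v) would have to restore the browser or DOM state explicitly, which this simulation sidesteps.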

Description

Technical Field

[0001] The invention relates to a network data collection method, in particular to a method for rapidly collecting dynamic script website data.

Background Art

[0002] With the advent of the Web 2.0 era, the Internet increasingly uses dynamic scripts for interaction between the server and the client. Web page content has changed from static content to data generated dynamically from a database. On the one hand, after the main page is downloaded locally, it must interact with the server several more times to obtain all of its data. For example, the read and comment counts of a Sina blog post are obtained by sending a request to the server after the page is loaded. On the other hand, many links to web content no longer use the traditional "<a>" tag but are implemented in JavaScript; for example, page turning on Tencent Forum, NetEase Forum, and similar sites is controlled by JavaScript.

[0003] Crawlers are the first step of s...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 夏冰, 高军, 王腾蛟, 杨冬青
Owner: PEKING UNIV