Dynamic webpage data acquisition method based on WebKit browser engine

A dynamic page data acquisition technique built on a browser engine, applied in the field of computer information technology. It addresses two problems: existing schemes give poor support for targeted data collection, and they are too complex to implement for small-scale real-time data collection needs. The stated effects are enhanced robustness and value as a reference.

Status: Inactive
Publication Date: 2011-10-12
SUN YAT SEN UNIV
4 Cites, 37 Cited by

AI Technical Summary

Problems solved by technology

[0006] However, there are still some problems in the current research. It mainly aims at designing a general method for large-scale web crawlers to crawl dynamic web pages. First, its support for targeted, directional data collection (such as collecting product information from specific forums or commercial websites) is not ideal. Second, most of the schemes are relatively complicated to implement and are not suitable for small-scale real-time data collection needs.

Examples

Embodiment Construction

[0029] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

[0030] This work extends a simple crawler for structured forum data and proposes a scheme for collecting dynamic page data based on the WebKit browser engine. Adopting the Qt framework gives the program good reliability and cross-platform support; separating the interface from the configuration file gives the program good scalability; a timeout waiting mechanism is designed for complex network environments, and the program's robustness has been greatly im...
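The timeout waiting mechanism mentioned in [0030] is not detailed in the text available here. As a minimal sketch of the general idea, assuming a simple polling loop, the following Python function waits for a condition (for example, "the page has finished loading") and gives up after a deadline. The names `wait_until` and `FakePageLoad` are hypothetical and are not taken from the patent; an event-driven Qt program would more likely use a timer on the event loop than `time.sleep`.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool],
               timeout_s: float = 10.0,
               poll_interval_s: float = 0.1) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the condition was met in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


class FakePageLoad:
    """Hypothetical stand-in for a page load: ready after a fixed number of polls."""

    def __init__(self, polls_needed: int) -> None:
        self.polls_needed = polls_needed
        self.polls = 0

    def finished(self) -> bool:
        self.polls += 1
        return self.polls >= self.polls_needed


if __name__ == "__main__":
    page = FakePageLoad(polls_needed=3)
    loaded = wait_until(page.finished, timeout_s=1.0, poll_interval_s=0.01)
    print("loaded" if loaded else "timed out")
```

Returning a boolean rather than raising lets the caller decide whether a timeout means "skip this page" or "retry", which suits a collector that must keep running on an unreliable network.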

Abstract

The invention discloses a dynamic webpage data acquisition method based on the WebKit browser engine. The method comprises the following steps: sending an HTTP request to a server, receiving the original webpage data, and constructing a Document Object Model (DOM) tree, where sending the request, receiving the data, parsing the JavaScript, and constructing the DOM tree are all carried out by WebKit's underlying layer; maintaining a corresponding configuration file for each website, where the configuration file contains the JavaScript code that triggers the relevant events, and that code is passed as a string to the JavaScript execution interface provided by WebKit; having WebKit update the DOM tree according to those events; and calling WebKit's I/O interface to convert the DOM tree into HTML and output it as a string. With this method, the requirement of extensibility is met through configuration files, asynchronous parallel processing between the browser and the server is realized, the load on the server is relieved, and the user experience is improved.
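The per-site configuration step in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the INI-style file format, the `trigger_js` key, the example hostnames, and the `PageEngine` interface are all hypothetical. In Qt's QtWebKit module, `QWebFrame::evaluateJavaScript` and `QWebFrame::toHtml` play roles like the two interface methods below, although the source does not name the calls it uses. A fake engine stands in for WebKit so the sketch runs with the Python standard library alone.

```python
import configparser
from typing import Dict, List, Protocol
from urllib.parse import urlsplit

# Hypothetical per-site configuration: each section is a hostname and
# `trigger_js` holds the JavaScript that fires that site's events.
SITE_CONFIG = """
[forum.example.com]
trigger_js = document.getElementById('more').click();

[shop.example.com]
trigger_js = window.scrollTo(0, document.body.scrollHeight);
"""


class PageEngine(Protocol):
    """Interface assumed of a browser-engine wrapper (not a real WebKit API)."""

    def evaluate_javascript(self, script: str) -> None: ...
    def to_html(self) -> str: ...


def load_site_scripts(config_text: str) -> Dict[str, str]:
    """Parse the configuration into a hostname -> JavaScript string map."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return {host: parser[host]["trigger_js"] for host in parser.sections()}


def collect(url: str, engine: PageEngine, scripts: Dict[str, str]) -> str:
    """Run the site's trigger script, if any, then serialize the DOM to HTML."""
    host = urlsplit(url).hostname or ""
    script = scripts.get(host)
    if script is not None:
        engine.evaluate_javascript(script)  # script is passed as a plain string
    return engine.to_html()


class FakeEngine:
    """Stand-in engine: records scripts and pretends each one adds content."""

    def __init__(self) -> None:
        self.executed: List[str] = []
        self._html = "<html><body><a id='more'>more</a></body></html>"

    def evaluate_javascript(self, script: str) -> None:
        self.executed.append(script)
        self._html = self._html.replace("</body>", "<p>loaded</p></body>")

    def to_html(self) -> str:
        return self._html


if __name__ == "__main__":
    engine = FakeEngine()
    scripts = load_site_scripts(SITE_CONFIG)
    print(collect("http://forum.example.com/thread/1", engine, scripts))
```

Keeping the JavaScript in a configuration file means that supporting a new website only requires a new section, with no change to the collection code, which is the extensibility property the abstract describes.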

Description

Technical field

[0001] The invention relates to the field of computer information technology, and in particular to a method for collecting dynamic page data based on the WebKit browser engine.

Background

[0002] With the rise of Web 2.0, AJAX (Asynchronous JavaScript and XML) technology has become widespread. Asynchronous interaction between the client and the server not only reduces the load on the server but also brings a better user experience. However, the large number of dynamic web pages generated with this technology has created new difficulties for network data acquisition. Traditional web data collection tools, such as web crawlers built for static web pages, capture far less content than is presented on the page. The inability to obtain the useful information in a large number of dynamic web pages makes it difficult to carry out work that takes network data as its main processing object, ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/44, G06F17/30
Inventor: 李飞燕, 陈曦, 杨艾琳
Owner: SUN YAT SEN UNIV