
Dynamic webpage crawler method and system based on substitute work mode

A dynamic webpage crawler technology applied in the fields of search engines and Internet information retrieval. It addresses the low efficiency and simple task scheduling of existing crawler methods and their inability to meet the needs of large-scale crawling tasks, and achieves fast crawling, high delivery efficiency, and improved access efficiency.

Active Publication Date: 2020-09-11
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0004] The technical problem to be solved by the present invention is that existing crawler methods are inefficient, their task scheduling is simple, and they cannot meet the needs of large-scale crawling tasks. The purpose is to provide a dynamic webpage crawler method and system based on an OEM mode that completes large-scale crawling tasks simply and efficiently.



Examples


Embodiment

[0029] This embodiment is a dynamic web crawler system based on an OEM mode. As shown in Figure 1, the system is divided into seven parts: a business interface module, a business scheduling module, business crawlers, a production scheduling module, production crawlers, a storage module and an export module. The main functions of each part are as follows:

[0030] (1) Business interface module: as the business-facing user interface, it receives user input, configures the crawler's business-related parameters, performs business evaluation, and makes the related preparations;

[0031] (2) Business scheduling module: according to the business-related information, it allocates system resources at a specified time and launches several business crawlers, each running as an independent process;

[0032] (3) Business crawler: it uses a simulated browser mode to crawl the original URL of the dynamic webpage, and returns the URLs of the target static data content;

[0033] (4) Production scheduling module: receive crawlin...
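
The business crawler role in item (3) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `business_crawl`, `LinkCollector`, and `fake_render` are hypothetical names, and the simulated browser is represented only by an injected `render` callable that is assumed to return the page's HTML after its scripts have run. Treating every `<a href>` in the rendered page as a static content URL is also an assumption made for brevity.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in rendered HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def business_crawl(page_url: str, render: Callable[[str], str]) -> List[str]:
    """Render a dynamic page and return absolute URLs of its static content.

    `render` is a hypothetical stand-in for the simulated browser: it takes
    a URL and returns the HTML produced after the page's scripts have run.
    """
    collector = LinkCollector()
    collector.feed(render(page_url))
    # Resolve relative links against the URL of the dynamic page.
    return [urljoin(page_url, href) for href in collector.hrefs]


def fake_render(url: str) -> str:
    # Illustration-only stub: no browser and no network access involved.
    return (
        '<ul><li><a href="/data/1.html">one</a></li>'
        '<li><a href="files/2.pdf">two</a></li></ul>'
    )


static_urls = business_crawl("http://example.com/list/", fake_render)
```

Passing the renderer in as a parameter keeps the sketch independent of any particular browser automation tool; a real renderer would replace `fake_render`.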


Abstract

The invention discloses a dynamic webpage crawler method and system based on an OEM mode. The method comprises the steps of: receiving business information, configuring crawler parameters, performing business evaluation, and performing preparatory work; allocating system resources and launching a plurality of business crawlers as independent processes; crawling the original URL of the dynamic webpage in a simulated browser mode and returning the URL of the target static data content; reviewing the validity and non-repetition of the URL, receiving the crawling task after the review, constructing a production task message list, and launching production crawlers on a plurality of threads; crawling the static URL pages in an automatic program mode and returning the target data and attachment files; processing and storing the returned content; and exporting the data. In this method, the business crawler and the production crawler are built separately, and, following the OEM mode, different crawling strategies are applied to the dynamic webpage and to the static content, so that system resources are used to the fullest extent and large-scale, rapid crawling of dynamic webpage data is achieved.
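
The production-side steps of the abstract (URL review, task list, multi-threaded crawl) might look roughly like the sketch below. It is written under stated assumptions rather than taken from the patent: `review_urls` and `production_crawl` are hypothetical names, "valid" is assumed to mean an http or https URL with a host, and `fetch` is an injected callable standing in for the automatic program that downloads a static page.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Set
from urllib.parse import urlsplit


def review_urls(candidates: Iterable[str], seen: Set[str]) -> List[str]:
    """Build the task list: keep URLs that are valid and not seen before.

    Validity here is an assumed rule (http/https scheme plus a host).
    `seen` is updated in place so later batches are also de-duplicated.
    """
    tasks: List[str] = []
    for url in candidates:
        parts = urlsplit(url)
        is_valid = parts.scheme in ("http", "https") and bool(parts.netloc)
        if is_valid and url not in seen:
            seen.add(url)
            tasks.append(url)
    return tasks


def production_crawl(
    tasks: List[str], fetch: Callable[[str], str], workers: int = 4
) -> Dict[str, str]:
    """Fetch every task URL on a pool of threads and map URL to content."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL correctly.
        bodies = list(pool.map(fetch, tasks))
    return dict(zip(tasks, bodies))


seen_urls: Set[str] = set()
task_list = review_urls(
    [
        "http://a.example/1",
        "ftp://a.example/x",      # rejected: scheme is not http(s)
        "http://a.example/1",     # rejected: repeated URL
        "not a url",              # rejected: no scheme or host
        "https://b.example/2",
    ],
    seen_urls,
)
# Stub fetcher for illustration only; it performs no network access.
results = production_crawl(task_list, lambda url: "content of " + url)
```

Threads suit this stage because fetching static pages is I/O-bound, which matches the abstract's choice of multi-threaded production crawlers alongside process-based business crawlers.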

Description

Technical field

[0001] The invention relates to the technical fields of Internet information retrieval and search engines, and in particular to a dynamic webpage crawler method and system based on an OEM mode.

Background technique

[0002] Web crawlers are an important part of Internet search engines. They are mainly used to crawl data from web pages on the Internet and build indexes for search engines. Whether the crawl volume is large enough determines whether the search engine's content is rich, and whether crawling is timely directly affects the overall effectiveness of the search engine. In the context of big data, web crawlers are also widely used to capture network data such as Internet public opinion, commodity transactions, and culture, sports and entertainment, providing massive basic data for further data mining and data analysis.

[0003] The working principle of a general web crawler is to obtain the HTML data of the web page by accessing the URL of the tar...


Application Information

IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566, Y02P90/30
Inventor: 杨杰, 程克非, 吴渝, 李红波, 叶雯静, 刘钟书, 刘洋旗
Owner: CHONGQING UNIV OF POSTS & TELECOMM