Method and device for capturing webpage content

A webpage content capture technology, applied in the field of webpage content crawling, that addresses the high complexity and low efficiency of crawling webpage content, reducing the complexity and improving the efficiency.

Status: Inactive
Publication Date: 2015-08-26
SMART CITY INFORMATION TECH

AI Technical Summary

Problems solved by technology

[0005] Embodiments of the present invention provide a method and device for capturing webpage content, which are used to solve the problems of high comple



Examples


Example Embodiment

[0031] The embodiments address the high complexity and low efficiency of crawling web content that arise in the current process of crawling different types of web content. In the embodiment of the present invention, when a webpage to be crawled is detected, the URL of that webpage is looked up in a preset crawling rule library; when the library contains no crawling rule corresponding to the URL, the content of the webpage is analyzed, and crawling rules are generated for the webpages to be crawled that meet the conditions. With this technical scheme, the content of the webpage to be crawled is analyzed and the corresponding crawling rule is generated automatically from the analysis result. Crawling rules do not need to be set manually, which effectively reduces the complexity of crawling web content and improves the efficiency of we...


Abstract

The invention discloses a method and a device for capturing webpage content, and aims to solve the problems in the prior art that, when different types of webpage content are captured, the capturing complexity is high and the capturing efficiency is low. According to the method provided by the embodiment of the invention, when a webpage to be captured is detected, the URL of the webpage is looked up in a preset capturing rule base; when no capturing rule corresponding to the URL exists in the capturing rule base, the content of the webpage to be captured is analyzed, and a capturing rule is generated for each qualified webpage to be captured. With this technical scheme, the content of the webpage to be captured is analyzed and the corresponding capturing rule is generated automatically from the analysis result, so the capturing rule does not need to be set manually, the webpage content capturing complexity is effectively reduced, and the webpage content capturing efficiency is improved.
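The lookup-then-generate flow described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the patented method: the source does not say how rules are represented, how the page is analyzed, or what makes a page "qualified". Here a rule is assumed to be the `id` of the element holding the most text, a page is assumed to qualify when that element has at least `min_chars` characters, and generated rules are assumed to be stored back into the rule base. The names `rule_for`, `generate_rule`, and `min_chars` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Hypothetical rule base: maps a URL to the id of the element that holds the content.
RuleBase = Dict[str, str]

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class _TextByIdParser(HTMLParser):
    """Counts text characters per element id (nearest enclosing element with an id)."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[Optional[str]] = []  # ids (or None) of the open elements
        self.text_len: Dict[str, int] = {}     # id -> stripped character count

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        n = len(data.strip())
        # Credit the text to the innermost open element that carries an id.
        for elem_id in reversed(self._stack):
            if elem_id:
                self.text_len[elem_id] = self.text_len.get(elem_id, 0) + n
                break


def generate_rule(html: str, min_chars: int = 50) -> Optional[str]:
    """Assumed analysis step: pick the id with the most text, if the page qualifies."""
    parser = _TextByIdParser()
    parser.feed(html)
    if not parser.text_len:
        return None
    best_id = max(parser.text_len, key=parser.text_len.get)
    return best_id if parser.text_len[best_id] >= min_chars else None


def rule_for(url: str, html: str, rules: RuleBase) -> Optional[str]:
    """Look the URL up in the rule base; generate and store a rule when none exists."""
    rule = rules.get(url)
    if rule is None:
        rule = generate_rule(html)
        if rule is not None:
            rules[url] = rule  # assumption: generated rules are kept for later reuse
    return rule
```

Keying the rule base by exact URL mirrors the wording of the abstract; a real system might key by host or URL pattern instead, which the source does not specify.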

Description

Technical field

[0001] The invention relates to the technical field of computer applications, in particular to a method and device for capturing webpage content.

Background technique

[0002] Web crawlers are a fundamental part of search engine technology. Web crawler technology starts from the URL (Uniform Resource Locator) of one or several initial webpages and obtains the URLs on those initial webpages. It then keeps pulling new URLs from each page and putting them in the queue until some stop condition is met. The captured webpage information is then stored in the server of the search engine, thereby speeding up the user's searches.

[0003] At present, in the process of crawling webpages using web crawler technology, the crawling rules are set manually. For different types of webpages, corresponding crawling rules need to be manually set. When there are many types of webpages to be crawled, a lot of manpower is required to set up cra...
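The queue-driven crawl loop described in paragraph [0002] can be sketched as below. This is a minimal illustration, not the invention's implementation: the `fetch` callable, the `max_pages` stop condition, and the function name `crawl` are assumptions introduced here, and the source only says that crawling continues "until some stop condition is met".

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds: List[str], fetch: Callable[[str], str], max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl from the seed URLs; returns fetched pages keyed by URL."""
    queue = deque(seeds)          # URLs waiting to be fetched
    seen = set(seeds)             # URLs already queued, to avoid revisiting
    pages: Dict[str, str] = {}
    # Assumed stop condition: the queue is empty or max_pages pages were fetched.
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = _LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP library; in practice it could wrap whatever downloader the crawler uses.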


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 狄东杰, 孙德山, 姚臻
Owner: SMART CITY INFORMATION TECH