Webpage content crawling method and device

Active Publication Date: 2016-09-21
考拉征信服务有限公司

AI Technical Summary

Problems solved by technology

[0008] The present disclosure provides a method and device for crawling webpage content, addressing the technical problem in the prior art of crawling websites that require a login before relevant information can be queried.

Examples

Embodiment Construction

[0081] Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus repeated descriptions thereof will be omitted.

[0082] Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided in order to give a thorough understanding of embodiments of the present disclosure. However, those skilled in ...

Abstract

The invention provides a webpage content crawling method and device. The method comprises the following steps: scheduling a crawling task; when the crawling task is found to have a proxy setting, obtaining a proxy IP queue; carrying out resource management; capturing data with a data capture engine that uses multi-threaded parallel processing; and having an analysis engine parse the data captured by each thread and persist the parsed data. The method and device handle the crawling of both ordinary webpages and websites that require a login. A crawling component completes the crawling task quickly and correctly; the hierarchical relationship of the target websites can be analyzed to form a clear structure chart of the crawling targets and to establish clear relationships among the crawled data; and anti-monitoring management deals with anti-crawling measures, removing obstacles to finally obtaining the target data.
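The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the task layout (a dict with "urls" and an optional "proxies" list), the round-robin proxy assignment, the title-only parser, the SQLite table, and the offline `fake_fetch` stand-in are all assumptions made for illustration, and the resource management and anti-monitoring steps are left out.

```python
import itertools
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(html):
    """Analysis step (hypothetical): extract only the page title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def crawl(task, fetch, conn, max_workers=4):
    """Run one crawl task: assign proxies, fetch in parallel, parse, persist.

    task  -- dict with "urls" and an optional "proxies" list (assumed shape)
    fetch -- callable (url, proxy) -> HTML string, supplied by the caller
    conn  -- sqlite3 connection used for persistence
    """
    # Only build a proxy rotation when the task has a proxy setting.
    proxies = task.get("proxies") or [None]
    assignments = list(zip(task["urls"], itertools.cycle(proxies)))

    # Data capture: fetch pages in parallel; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(lambda pair: fetch(*pair), assignments))

    # Analysis and persistence run in the calling thread.
    rows = [
        (url, proxy, parse_title(html))
        for (url, proxy), html in zip(assignments, pages)
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT, proxy TEXT, title TEXT)"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    conn.commit()
    return rows


def fake_fetch(url, proxy):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return f"<html><head><title>Page {url}</title></head></html>"
```

To try it, open an in-memory database with `sqlite3.connect(":memory:")` and call `crawl({"urls": [...], "proxies": [...]}, fake_fetch, conn)`. Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP client; a real fetcher would route each request through the proxy it is given.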

Description

Technical field

[0001] The present disclosure generally relates to the technical field of webpages, and in particular to a method and device for crawling webpage content.

Background

[0002] In recent years, with the explosive growth of Web information, effectively obtaining useful information from the Web has become extremely difficult. At present, website crawler technology plays an important role in enriching company data and obtaining multi-source data, and it is also an indispensable tool for data mining. Crawler technology is widely used in the field of search engines, but as the network becomes more and more complex, general-purpose search engines sometimes lose their way in information navigation, so applying crawler technology only to search engines is far from enough.

[0003] For a large-scale web content crawling system, the commonly used strategies for crawling ...

Application Information

IPC(8): G06F17/30
CPC: G06F16/951, G06F16/9535
Inventor: 孔祥旭, 张泽斌, 周勇
Owner: 考拉征信服务有限公司