Vertical intelligent crawler data collecting method based on webpage data capture

A data collection technology based on webpage data capture, applied in the field of data collection. It addresses the high maintenance cost, inconvenient maintenance and function expansion, and low efficiency of traditional vertical crawlers, and makes the crawler easy to extend.

Status: Inactive
Publication date: 2015-02-11
TONGCHENG NETWORK TECH
Cites: 5 · Cited by: 15

AI Technical Summary

Problems solved by technology

[0004] However, traditional vertical crawling programs strongly coupled the parsing and crawling logic to the entire module. This made later maintenance and function expansion inconvenient, raised maintenance cost, lowered efficiency, and left the framework without scalability.


Examples

Embodiment Construction

[0015] As shown in Figure 1, the vertical intelligent crawler data collection method based on webpage data capture includes the following steps:
Step ①: through the start/stop entry configuration module, configure the initial entry address of the crawler data in the start-up module.
Step ②: the crawler control system runs a depth-first algorithm to traverse and crawl webpages according to the configured crawling rules and crawling process.
Step ③: the crawler parses and extracts page data using the rule sequence pairs of the rule configuration system, and stores the extracted two-dimensional structured data.
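
The three steps can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that a "rule sequence pair" is a (field name, regular expression) pair, that pages are served from a hypothetical in-memory dictionary `PAGES` rather than over HTTP, and that the helper names `extract` and `crawl` are invented for illustration.

```python
import re
from typing import Callable, Dict, List, Optional, Set

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES: Dict[str, str] = {
    "/list": '<a href="/item/1">one</a> <a href="/item/2">two</a>',
    "/item/1": '<h1>Hotel A</h1><span class="price">120</span> <a href="/list">back</a>',
    "/item/2": '<h1>Hotel B</h1><span class="price">95</span>',
}

# Assumed shape of the rule sequence pairs: (field name, regex with one group).
RULES = [
    ("name", re.compile(r"<h1>(.*?)</h1>")),
    ("price", re.compile(r'<span class="price">(.*?)</span>')),
]

LINK_RE = re.compile(r'href="([^"]+)"')


def extract(html: str) -> Optional[Dict[str, str]]:
    """Step 3: apply every rule; return one row, or None if any rule misses."""
    row: Dict[str, str] = {}
    for field_name, pattern in RULES:
        match = pattern.search(html)
        if match is None:
            return None
        row[field_name] = match.group(1)
    return row


def crawl(entry: str,
          fetch: Callable[[str], Optional[str]],
          max_depth: int) -> List[Dict[str, str]]:
    """Step 2: depth-first traversal from the entry address configured in step 1."""
    visited: Set[str] = set()
    rows: List[Dict[str, str]] = []

    def visit(url: str, depth: int) -> None:
        if url in visited or depth > max_depth:
            return
        visited.add(url)
        html = fetch(url)
        if html is None:
            return
        row = extract(html)
        if row is not None:
            rows.append({"url": url, **row})
        # Recurse into each link before moving to the next one (depth-first).
        for link in LINK_RE.findall(html):
            visit(link, depth + 1)

    visit(entry, 0)
    return rows


rows = crawl("/list", PAGES.get, max_depth=2)
```

Each dictionary in `rows` is one record of the two-dimensional result. Passing `fetch` as a parameter keeps the traversal and the parsing rules decoupled from how pages are actually downloaded.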

[0016] In a preferred embodiment of the present invention, to facilitate subsequent configuration and use, the configuration module and the start-up module are located on the server, and the initial entry address of the crawler is either imported statically from a specified crawler URL list file or, through the c...
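
The static import of entry addresses from a URL list file could look like the sketch below. The file format is an assumption (one URL per line, blank lines and `#` comments ignored), and `load_entry_urls` is a hypothetical helper name; the source does not specify either.

```python
from typing import Iterable, List


def load_entry_urls(lines: Iterable[str]) -> List[str]:
    """Read entry URLs from an iterable of lines, e.g. an open text file.

    Assumed format: one URL per line; blank lines and lines starting
    with '#' are skipped; duplicates are dropped, first occurrence wins.
    """
    urls: List[str] = []
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls


# A list of strings stands in for the file contents here.
sample = [
    "# seed list",
    "http://example.com/a",
    "",
    "http://example.com/b ",
    "http://example.com/a",
]
entry_urls = load_entry_urls(sample)
```

Because the function accepts any iterable of lines, the same code works with `open(path, encoding="utf-8")` on a real list file.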

Abstract

The invention relates to a vertical intelligent crawler data collection method based on webpage data capture. The method comprises the following steps: first, the initial entry address of the crawler data is configured in a start-up module through a start/stop entry configuration module; then, a crawler control system runs a depth-first algorithm to traverse and capture webpages according to the configured capture rules and capture process; finally, the crawler parses and extracts webpage data using the rule sequence pairs of a rule configuration system and stores the extracted two-dimensional structured data. The method thus meets the generality requirements of a crawler: for a specific business logic, parsing rule configuration, crawl depth and crawl thread settings, and database or index configuration can be added, after which intelligent information capture can be started. An intelligent crawler framework can be formed effectively, data can be archived automatically and stored by category, and database storage can use a distributed key-value model.
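
The configurable items named in the abstract (parsing rules, crawl depth, crawl threads, database configuration) can be pictured as a single configuration object. The sketch below is illustrative only: the class names, field names, defaults, and node names are assumptions and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StorageConfig:
    # Assumed: a key-value backend spread over several nodes.
    backend: str = "key-value"
    nodes: List[str] = field(default_factory=list)


@dataclass
class CrawlerConfig:
    entry_urls: List[str]                                      # initial entry addresses
    parse_rules: Dict[str, str] = field(default_factory=dict)  # field name -> pattern
    max_depth: int = 3                                         # crawl depth limit
    threads: int = 4                                           # number of crawl threads
    storage: StorageConfig = field(default_factory=StorageConfig)

    def problems(self) -> List[str]:
        """Return human-readable configuration problems; empty means usable."""
        found: List[str] = []
        if not self.entry_urls:
            found.append("at least one entry URL is required")
        if self.max_depth < 0:
            found.append("max_depth must be >= 0")
        if self.threads < 1:
            found.append("threads must be >= 1")
        return found


config = CrawlerConfig(
    entry_urls=["http://example.com/list"],
    parse_rules={"name": r"<h1>(.*?)</h1>"},
    max_depth=2,
    threads=8,
    storage=StorageConfig(nodes=["kv-node-1", "kv-node-2"]),
)
issues = config.problems()
bad_issues = CrawlerConfig(entry_urls=[], threads=0).problems()
```

Keeping these settings in data rather than in code is one way to add a new business case without touching the crawl logic, which is the decoupling the abstract describes.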

Description

Technical field

[0001] The invention relates to a data collection method, in particular to a vertical intelligent crawler data collection method based on webpage data capture.

Background art

[0002] A crawler, also known as a spider, is not the name of an insect but a computer program that starts from customized entry URLs on the Internet, continuously extracts links from web pages, and then crawls and extracts deeper, previously unknown links based on those links. Because the program keeps moving forward in this crawling manner, it is called a crawler. A crawler automatically obtains web content and is an important part of a search engine.

[0003] The core technology of vertical search is the intelligent crawler: how to capture targeted or non-targeted web pages and analyze them to obtain formatted data. It is mainly used to accurately extract regular two-dimensional table data,...
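
As an illustration of what extracting regular two-dimensional table data can mean in practice, the following sketch turns an HTML table into a list of rows using Python's standard `html.parser` module. The class name `TableRows` and the sample markup are invented for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


html = (
    "<table>"
    "<tr><th>City</th><th>Price</th></tr>"
    "<tr><td>Suzhou</td><td> 120 </td></tr>"
    "</table>"
)
parser = TableRows()
parser.feed(html)
parser.close()
table = parser.rows
```

The result is a list of rows, each a list of cell strings, which maps directly onto a two-dimensional structure ready for storage.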

Application Information

IPC(8): G06F17/30
CPC: G06F16/951
Inventors: 王专, 张海龙, 马和平, 郭凤林, 王晓钟, 庞绍进, 王祚德, 靳彩娟
Owner: TONGCHENG NETWORK TECH