Universal distributed acquisition system

A general-purpose distributed collection system for efficient, real-time data collection. It addresses the difficulty of using machine performance efficiently and of scaling out to large deployments, with the stated effects of avoiding webpage collection and of using machine performance efficiently and reasonably.

Active Publication Date: 2017-09-08
ANHUI BORYOU INFORMATION TECH

AI Technical Summary

Problems solved by technology

For the large volumes of data in today's network environment, traditional manual data acquisition and single-node crawlers can no longer meet demand. Some distributed collection systems exist, but they have bottlenecks at various stages of the collection process, which makes it difficult to use machine performance efficiently and rationally and to scale out.

Embodiment Construction

[0042] Figure 1 shows a general distributed collection system comprising a seed warehouse, a task scheduling module, a data capture module, and a text page warehouse. The seed warehouse stores the URLs of the required sites and sets the information source category and collection time interval for each. The task scheduling module coordinates the task load of each collection node. The data capture module captures information for the allocated collection tasks, and its work is divided into list page capture and text page capture. The task scheduling module and the data capture module each include a server and a client, and both adopt a distributed communication framework. The text page warehouse stores the parsed text page links and provides the site entry points for text page capture in the data capture module.
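The data flow between the seed warehouse, list page capture, the text page warehouse, and text page capture can be sketched as follows. This is a minimal single-process illustration, not the patented implementation: the names `Seed`, `TextPageWarehouse`, `capture_list_page`, and `capture_text_pages` are hypothetical, the `fetch` and `parse_links` callables are supplied by the caller, and the task scheduling module and the distributed communication framework are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seed:
    """One seed warehouse entry (hypothetical shape): site URL, source category, interval."""
    url: str
    category: str
    interval_seconds: int


class TextPageWarehouse:
    """Stores parsed text page links and hands them to text page capture."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, link: str) -> bool:
        # Assumption: links are deduplicated; the source does not say so.
        if link in self._seen:
            return False
        self._seen.add(link)
        self._queue.append(link)
        return True

    def next_link(self):
        return self._queue.popleft() if self._queue else None


def capture_list_page(seed, fetch, parse_links, warehouse) -> int:
    """List page capture: fetch the seed URL and store the links it contains.

    Returns the number of links newly added to the warehouse.
    """
    html = fetch(seed.url)
    return sum(warehouse.add(link) for link in parse_links(html))


def capture_text_pages(fetch, warehouse) -> dict:
    """Text page capture: drain the warehouse and fetch every stored link."""
    pages = {}
    while (link := warehouse.next_link()) is not None:
        pages[link] = fetch(link)
    return pages
```

In a distributed deployment, each capture function would run on a collection node and receive its work from the task scheduling module rather than being called directly.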

[0043] Figure 2 is a schematic diagram of a dynamic hash task allocation algorithm based on machine performance. It is assumed that A, B, and C are ph...
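The paragraph above is truncated, so the patent's exact allocation algorithm is not available here. As a rough illustration of what hash-based task allocation weighted by machine performance can look like, the sketch below uses a weighted consistent hash ring, which is a swapped-in, well-known technique and not necessarily the patented method. The class `WeightedHashRing`, the integer performance weights, and the `points_per_weight` parameter are all assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class WeightedHashRing:
    """Consistent hash ring where a node's share grows with its weight.

    Assumption: 'weight' is an integer performance score chosen by the
    operator; the source does not define how performance is measured.
    """

    def __init__(self, points_per_weight: int = 50):
        self.points_per_weight = points_per_weight
        self._ring = []  # sorted list of (position, node) pairs

    def set_node(self, node: str, weight: int) -> None:
        """Add a node, or update its weight when its performance changes."""
        self.remove_node(node)
        for i in range(weight * self.points_per_weight):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def assign(self, task_url: str) -> str:
        """Return the node owning the first ring point at or after the task's hash."""
        if not self._ring:
            raise ValueError("no collection nodes registered")
        idx = bisect.bisect_left(self._ring, (_hash(task_url), ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical usage: nodes named A, B, and C with made-up weights.
ring = WeightedHashRing()
ring.set_node("A", 3)
ring.set_node("B", 1)
ring.set_node("C", 1)
owner = ring.assign("http://example.com/list/1")
```

With this design, removing a node or changing its weight only moves the tasks tied to that node's ring points, so the remaining nodes keep most of their existing assignments.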

Abstract

The invention discloses a universal distributed acquisition system comprising a seed storehouse, a task scheduling module, a data capture module, and a regular text page storehouse. The seed storehouse stores the URLs (Uniform Resource Locators) of the required sites and sets an information source category and an acquisition time interval. The task scheduling module coordinates the task load of each acquisition node. The data capture module performs information capture for the allocated acquisition tasks, divided into list page capture and regular text page capture. The task scheduling module and the data capture module each comprise a server and a client, and both adopt a distributed communication framework. The regular text page storehouse stores the parsed regular text page links and provides the site entry points for regular text page capture in the data capture module.

Description

Technical field

[0001] The present invention relates to a distributed, high-concurrency collection system for Internet-wide data, specifically an efficient, real-time data collection system for big data environments, and more particularly to a general distributed collection system.

Background

[0002] In recent years, with the rapid development and spread of computer and information technology, the scale of industrial application systems has expanded rapidly, and the data they generate has grown explosively. Baidu's total data volume has exceeded 1000 PB, and the webpage data it must process every day reaches 10 PB to 100 PB; Taobao's cumulative transaction data volume is as high as 100 PB; Twitter publishes more than 200 million messages every day, and Sina Weibo carries 80 million posts every day; China Mobile's call record data in a single province can reach 0.5 PB to 1 PB per month; the road vehicle...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F9/50
CPC: G06F9/5083, G06F16/27, G06F16/33, G06F16/951, G06F16/986
Inventor: 胡淦周银行杨东董郑江陈焕郑中华
Owner: ANHUI BORYOU INFORMATION TECH