Quick collection system and method for distributed internet data

A collection system and distributed network technology, applied to the rapid collection of distributed Internet data. It addresses the problems of large data collection volumes, many data sources, and low real-time performance, and achieves high real-time performance and strong scalability.

Active Publication Date: 2017-03-08
SOUTHWEST UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0003] In view of this, the present invention provides a fast distributed Internet data collection system and collection method that address the problems of large data collection volumes, many data sources, and low real-time performance.



Examples

Embodiment Construction

[0017] The implementation of the present invention is described in detail below with examples, so that the way the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and reproduced.

[0018] As shown in Figure 1, the distributed Internet data fast collection system of the present invention comprises five layers: a seed website setting node, a hyperlink collection layer, a real-time queue, a webpage downloading and parsing layer, and a webpage data storage layer.

[0019] The seed website setting node is used to set and store the parameters of each data source and its extraction rules. It is a single node and uses a relational database.
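
As a rough illustration of paragraph [0019], the sketch below keeps data-source settings in a relational table using Python's built-in `sqlite3` module. The table name `seed_site`, its columns, the `poll_seconds` parameter, and the JSON encoding of extraction rules are all assumptions made for this example; the source states only that the node stores data-source parameters and extraction rules in a relational database.

```python
import json
import sqlite3

# Hypothetical schema: the source does not specify tables or columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS seed_site (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    list_url      TEXT NOT NULL UNIQUE,  -- hyperlink list page of the data source
    poll_seconds  INTEGER NOT NULL,      -- assumed parameter: re-check interval
    extract_rules TEXT NOT NULL          -- assumed encoding: JSON of field -> pattern
)
"""


def open_seed_store(path=":memory:"):
    """Open (or create) the single-node settings database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_seed(conn, name, list_url, poll_seconds, rules):
    """Register one data source together with its extraction rules."""
    conn.execute(
        "INSERT INTO seed_site (name, list_url, poll_seconds, extract_rules) "
        "VALUES (?, ?, ?, ?)",
        (name, list_url, poll_seconds, json.dumps(rules)),
    )
    conn.commit()


def load_seeds(conn):
    """Return every configured data source as a plain dictionary."""
    rows = conn.execute(
        "SELECT name, list_url, poll_seconds, extract_rules "
        "FROM seed_site ORDER BY id"
    ).fetchall()
    return [
        {"name": n, "list_url": u, "poll_seconds": p, "rules": json.loads(r)}
        for n, u, p, r in rows
    ]
```

The other layers would read this table to learn which list pages to poll and which rules to attach to each extracted hyperlink.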

[0020] The hyperlink collection layer is used to request the hyperlink list webpage of the data source and extract the hyperlinks of the target webpages; the hyperlink collection layer is com...
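
The extraction step in paragraph [0020] might look like the following minimal sketch, which uses only the Python standard library. The class `LinkCollector`, the function `extract_links`, and the `must_contain` substring filter are hypothetical names and a stand-in filtering rule chosen for illustration; the source does not describe how target hyperlinks are selected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href=...> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the list page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(list_page_html, base_url, must_contain=""):
    """Return unique target URLs in page order.

    `must_contain` is an assumed, simplistic filter: only URLs that
    contain this substring are treated as target webpages.
    """
    parser = LinkCollector(base_url)
    parser.feed(list_page_html)
    seen, result = set(), []
    for url in parser.links:
        if must_contain in url and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

In a full system the HTML would come from an HTTP request to the list page, and the returned URLs would be pushed to the real-time queue along with the extraction rules of their data source.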


Abstract

The invention discloses a quick collection system for distributed Internet data. The system comprises five layers: a seed website setting node, a hyperlink collection layer, a real-time queue, a webpage downloading and parsing layer, and a webpage data storage layer. The seed website setting node is used for setting and storing the parameters and extraction rules of each data source. The hyperlink collection layer is used for requesting the hyperlink list webpage of the data source and extracting the hyperlinks of the target webpages. The real-time queue is used for storing and accessing the URL (Uniform Resource Locator) hyperlinks extracted by the hyperlink collection layer, the extraction rules corresponding to those URL hyperlinks, and the URL hyperlinks that have already been accessed. The webpage downloading and parsing layer is used for requesting and parsing the URL hyperlinks in the real-time queue that have not yet been accessed and for extracting the specific data in a formatted way. The webpage data storage layer is used for storing the target data obtained by that formatted extraction. With this system, data collection is carried out through distributed, layered cooperation, so the system can cope with application requirements that involve large data collection volumes, many data sources, and strict real-time requirements.
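
The interaction between the real-time queue and the webpage downloading and parsing layer can be sketched as below. This is a single-process illustration under several assumptions: the queue is an in-memory `queue.Queue` rather than a distributed service, a URL is marked as seen when it is enqueued rather than when it is fetched, extraction rules are regular expressions, and `fetch` is a caller-supplied function. None of these details come from the source, and all names are hypothetical.

```python
import queue
import re
import threading


class RealTimeQueue:
    """In-memory stand-in: pending (url, rules) pairs plus a set of seen URLs."""

    def __init__(self):
        self._pending = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()

    def push(self, url, rules):
        """Enqueue a URL with its rules; return False if it was already seen."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
        self._pending.put((url, rules))
        return True

    def pop(self):
        """Return the next (url, rules) pair, or None when the queue is empty."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


def parse_page(html, rules):
    """Apply each field's regular expression and keep its first capture group."""
    record = {}
    for field, pattern in rules.items():
        match = re.search(pattern, html, re.S)
        record[field] = match.group(1).strip() if match else None
    return record


def worker(rtq, fetch, store):
    """Download and parse queued URLs until the queue is drained."""
    while True:
        item = rtq.pop()
        if item is None:
            return
        url, rules = item
        record = parse_page(fetch(url), rules)
        record["url"] = url
        store.append(record)  # a list stands in for the data storage layer


def run_download_layer(rtq, fetch, store, n_workers=4):
    """Run several workers in parallel and wait for all of them to finish."""
    threads = [
        threading.Thread(target=worker, args=(rtq, fetch, store))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because workers only talk to the queue and the store, more of them can be added without changing the other layers, which is one plausible reading of the scalability the abstract claims.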

Description

Technical field

[0001] The invention belongs to the technical field of Internet big data collection, and in particular relates to a distributed Internet data fast collection system and collection method.

Background

[0002] The rapid development of the Internet has brought society into an information age in which data is abundant and open, and the era of big data has arrived. Data plays an extremely important role in business operations, government decision-making, and the analysis of social dynamics, so collecting data quickly and on a large scale has become a technical focus. However, existing data collection methods leave room for improvement. Traditional Internet data collection relies mainly on web crawlers and targets structured or semi-structured text data. A web crawler is a program or script that automatically traverses and crawls text web pages on the Internet according to certain rules. ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 张晖, 杨春明, 李晓伟, 李波, 赵旭剑
Owner: SOUTHWEST UNIV OF SCI & TECH