Network data collection method

A network data collection method applied in the field of big data. Network data is large in volume and disorderly in content, and existing methods for capturing network information are complicated and time-consuming. The method improves both the speed of collecting and capturing network information and the speed of storing it.

Status: Inactive
Publication Date: 2018-05-04
ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD

AI Technical Summary

Problems solved by technology

[0010] The technical problem addressed by the present invention is that network data is large in volume and disorderly in content, and that existing big data collection techniques capture network information in a complicated and time-consuming way. The purpose is to provide a network data collection method that improves the speed of crawling and storing network information.

Examples

Embodiment

[0029] A network data collection method, comprising:

[0030] Step 1: crawl web page content from the Internet with a web crawler and extract the required attribute content;

[0031] Step 2: supply the crawler, through a URL queue, with the URLs of the websites whose data is to be captured;

[0032] Step 3: process the content captured by the crawler in a data processing module;

[0033] Step 4: store, in a data storage module, the URL information of the websites whose data is to be captured, the data extracted from the web pages by the crawler, and the data produced by the data processing (DP) module.
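The four steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the names `TitleExtractor`, `crawl`, `process`, and `run_pipeline` are hypothetical, the page title stands in for the "required attribute content", whitespace normalization stands in for the data processing module, and an in-memory dictionary stands in for the data storage module. The `fetch` function is injected so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(url, fetch):
    """Step 1: fetch a page and extract the required attribute (assumed: the title)."""
    parser = TitleExtractor()
    parser.feed(fetch(url))
    return {"url": url, "title": parser.title}


def process(record):
    """Step 3: hypothetical processing rule that collapses runs of whitespace."""
    return {**record, "title": " ".join(record["title"].split())}


def run_pipeline(site_urls, fetch):
    """Wire the URL queue, crawler, processing module, and storage together."""
    url_queue = deque(site_urls)  # Step 2: the URL queue feeds the crawler
    # Step 4: the store keeps site URLs, raw extracted data, and processed data
    store = {"urls": list(site_urls), "raw": [], "processed": []}
    while url_queue:
        url = url_queue.popleft()
        raw = crawl(url, fetch)
        store["raw"].append(raw)
        store["processed"].append(process(raw))
    return store
```

In practice `fetch` could wrap `urllib.request.urlopen` and decode the response body; passing it in as a parameter keeps the crawler independent of how pages are retrieved.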

[0034] Step 1 includes:

[0035] Step 11: write the URL information of the websites whose data is to be captured into the URL queue;

[0036] Step 12: the crawler obtains the site URL information of those websites from the URL queue;

[0037] Step 13: the crawler crawls the corresponding web page content from the Internet and extracts the content value of a specific attribute;

...
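Steps 11 to 13 can be sketched as follows. This is an illustrative sketch only: the source does not say which attribute is extracted, so the example assumes the `content` value of a `<meta name="description">` tag, and the names `MetaContentExtractor`, `enqueue_sites`, and `crawl_from_queue` are hypothetical. Steps elided in the source are not reconstructed here.

```python
import queue
from html.parser import HTMLParser


class MetaContentExtractor(HTMLParser):
    """Records the content attribute of the first <meta> tag with a given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if (
            tag == "meta"
            and self.value is None
            and attr_map.get("name") == self.meta_name
        ):
            self.value = attr_map.get("content")


def enqueue_sites(url_queue, site_urls):
    """Step 11: write the site URLs into the URL queue."""
    for url in site_urls:
        url_queue.put(url)


def crawl_from_queue(url_queue, fetch, meta_name="description"):
    """Steps 12 and 13: take URLs from the queue, fetch each page, extract the value."""
    results = []
    while True:
        try:
            url = url_queue.get_nowait()  # Step 12: obtain the next site URL
        except queue.Empty:
            break
        parser = MetaContentExtractor(meta_name)
        parser.feed(fetch(url))  # Step 13: crawl the page and extract the attribute
        results.append((url, parser.value))
    return results
```

`queue.Queue` is thread-safe, so the same queue could feed several crawler threads; the sketch uses a single consumer for clarity.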

Abstract

The invention discloses a network data collection method. The method comprises the following steps: 1, web page content is captured from the Internet by web crawlers, and the required attribute content is extracted; 2, a URL queue supplies the crawlers with the URLs of the websites whose data is to be captured; 3, a data processing module processes the content captured by the crawlers; 4, a data storage module stores the URL information of the websites whose data is to be captured, the data extracted from the web pages by the crawlers, and the data obtained after DP processing. Data is thus obtained from websites by means of web crawlers or a website's public API. Unstructured data can be extracted from web pages and stored in a structured form as a uniform local data file. Collection of files and attachments such as pictures, audio, and video is supported, and attachments can be associated with the text automatically. The speed of collecting and capturing network information is increased, as is the speed of storing the captured information.
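One way to picture the structured local storage and the association between text and attachments is a two-table layout in a single SQLite file. This is a sketch under assumptions: the source does not name a storage engine or schema, so SQLite, the `page` and `attachment` tables, and the helper names `open_store`, `save_page`, and `attachments_for` are all hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local data file and ensure the assumed schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS page (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL,
            body_text TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS attachment (
            id INTEGER PRIMARY KEY,
            page_id INTEGER NOT NULL REFERENCES page(id),
            kind TEXT NOT NULL,        -- e.g. 'image', 'audio', 'video'
            source_url TEXT NOT NULL
        );
        """
    )
    return conn


def save_page(conn, url, body_text, attachments):
    """Store the extracted text and link each (kind, source_url) attachment to it."""
    with conn:  # commit both inserts together, or roll back on error
        cur = conn.execute(
            "INSERT INTO page (url, body_text) VALUES (?, ?)", (url, body_text)
        )
        page_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO attachment (page_id, kind, source_url) VALUES (?, ?, ?)",
            [(page_id, kind, src) for kind, src in attachments],
        )
    return page_id


def attachments_for(conn, url):
    """Return the attachments associated with the page stored under the given URL."""
    rows = conn.execute(
        "SELECT a.kind, a.source_url FROM attachment a "
        "JOIN page p ON p.id = a.page_id WHERE p.url = ? ORDER BY a.id",
        (url,),
    )
    return rows.fetchall()
```

Keeping attachments in their own table keyed by `page_id` means a page's text and its pictures, audio, or video stay linked without duplicating the text for every attachment.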

Description

Technical field

[0001] The invention relates to the field of big data, in particular to a network data collection method.

Background technique

[0002] Similar terms have appeared in the history of data development, including ultra-large-scale data and massive data. "Super large-scale" generally refers to data corresponding to GB (1GB=1024MB), "massive" generally refers to data at the level of TB (1TB=1024GB), and the current "big data" refers to PB (1PB=1024TB), EB (1EB=1024PB), or even data above the ZB (1ZB=1024EB) level. In 2013, Gartner predicted that the data stored in the world would reach 1.2ZB. If these data were recorded on CD-R discs and piled up, the height would be five times the distance from the earth to the moon. Behind the different scales are different technical problems or challenging research problems.

[0003] Big data refers to a collection of data that cannot be captured, managed and processed by conventional software tools within a certain period o...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9566, G06F16/951
Inventor: 石文威
Owner: ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD