A network data acquisition method capable of automatically removing useless information and repeated information

The method, applied in the field of big data, addresses the complicated and time-consuming capture of large, cluttered network data, and improves both the speed of collecting and capturing network information and the speed of storing it.

Inactive Publication Date: 2019-04-30
河南大瑞物联网科技有限公司

AI Technical Summary

Problems solved by technology

[0003] The technical problem to be solved by the present invention is that network data is large in quantity and messy in content, and that existing big data acquisition technology captures network information in a complicated and time-consuming way. The purpose is to provide a network data acquisition method that automatically eliminates useless and duplicate information, improving the speed of capturing and storing network information.



Embodiment Construction

[0020] A network data collection method that automatically eliminates useless and repetitive information, including:

[0021] Step 1: crawl webpage content from the Internet with a web crawler and extract the required content containing the keywords;

[0022] Step 2: supply the crawler, through a URL queue, with the URLs of the websites from which data is to be captured;

[0023] Step 3: process the content captured by the crawler in a data processing module. The data processing module includes a big data cleaning task unit library, a Spark SQL module, a Spark-ETL SDK module, a pipeline configuration module, and a webpage service platform consisting of a web client and a web server. The user adds the required cleaning units and the algorithm tasks to be executed through the web server; the Spark SQL module receives them from the web server and performs the data cleaning function, the cleaning ...
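
Steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `crawl`, `extract_keyword_text`, and the injected `fetch` callable are hypothetical, keyword matching is assumed to be a case-insensitive substring test, and the URL queue is assumed to be a simple in-memory FIFO that skips URLs it has already visited.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


class _TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_keyword_text(html: str, keywords: Iterable[str]) -> List[str]:
    """Step 1 (assumed form): keep only text chunks that mention a keyword."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lowered = [k.lower() for k in keywords]
    return [c for c in parser.chunks if any(k in c.lower() for k in lowered)]


def crawl(
    seed_urls: Iterable[str],
    fetch: Callable[[str], str],
    keywords: Iterable[str],
) -> Dict[str, List[str]]:
    """Step 2 (assumed form): a FIFO URL queue feeds the crawler one URL at a time."""
    keyword_list = list(keywords)
    queue = deque(seed_urls)
    seen = set()
    results: Dict[str, List[str]] = {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # repeated URL: do not fetch it again
        seen.add(url)
        matches = extract_keyword_text(fetch(url), keyword_list)
        if matches:  # pages without any keyword are treated as useless
            results[url] = matches
    return results
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client; in practice it could wrap `urllib.request.urlopen` or a call to a website's public API.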


Abstract

The invention discloses a network data acquisition method that automatically removes useless and repeated information, comprising the following steps: 1, capturing webpage content from the Internet with a web crawler and extracting the required content containing the keywords; 2, supplying the crawler, through a URL queue, with the URLs of the websites from which data is to be captured; 3, processing the content captured by the crawler in a data processing module; 4, storing, in a data storage module, the URL information of the websites from which data is to be captured, the data extracted from webpages by the crawler, and the data processed by the data processing module (DP). The method obtains data from websites through a web crawler or a website's public API. It extracts unstructured data from webpages and stores it in structured form as a unified local data file, supports the collection of pictures, audio, and video, and automatically associates attachments with the body text. This increases the speed at which network information is collected and captured, and also the speed at which the captured information is stored.
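
The storage step, together with the removal of useless and repeated content, might look like the following Python sketch. Everything specific here is an assumption made for illustration and is not stated in the patent: SQLite as the unified local data file, the `page` and `attachment` table names, SHA-256 hashing of whitespace- and case-normalized text to detect repeats, and a minimum length (`min_length`) as the test for useless content.

```python
import hashlib
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical schema: one row per page, attachments linked by page_id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    body TEXT NOT NULL,
    body_hash TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS attachment (
    id INTEGER PRIMARY KEY,
    page_id INTEGER NOT NULL REFERENCES page(id),
    kind TEXT NOT NULL,
    uri TEXT NOT NULL
);
"""

# (url, body text, [(attachment kind, attachment uri), ...])
Record = Tuple[str, str, List[Tuple[str, str]]]


def store_records(
    conn: sqlite3.Connection,
    records: Iterable[Record],
    min_length: int = 20,
) -> int:
    """Store cleaned records and return how many new pages were written."""
    conn.executescript(SCHEMA)
    stored = 0
    for url, body, attachments in records:
        cleaned = " ".join(body.split())  # collapse runs of whitespace
        if len(cleaned) < min_length:
            continue  # assumed "useless" rule: too little text to keep
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        cur = conn.execute(
            "INSERT OR IGNORE INTO page (url, body, body_hash) VALUES (?, ?, ?)",
            (url, cleaned, digest),
        )
        if cur.rowcount == 0:
            continue  # same normalized text already stored: a repeat
        page_id = cur.lastrowid
        # Associate pictures, audio, and video with the text they came from.
        conn.executemany(
            "INSERT INTO attachment (page_id, kind, uri) VALUES (?, ?, ?)",
            [(page_id, kind, uri) for kind, uri in attachments],
        )
        stored += 1
    conn.commit()
    return stored
```

Putting the uniqueness constraint on the hash column lets the database reject repeats, so the sketch needs no separate in-memory set of seen pages and still works across runs against the same file.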

Description

Technical field

[0001] The invention relates to the field of big data, in particular to a network data collection method that automatically eliminates useless and repeated information.

Background

[0002] Network data contains a large number of pictures, videos, data and other content. It is very large in volume and messy in content. Existing big data collection technology captures network information in a complicated and time-consuming way, which leads to very low work efficiency. If enterprises want to obtain a large amount of data as soon as possible, they need a large number of developers, and the cost is very high.

Contents of the invention

[0003] The technical problem to be solved by the present invention is that network data is large in quantity and messy in content, and that existing big data collection technology captures network information in a complicated and time-consuming way. The purpos...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/215, G06F16/25, G06F16/951
Inventor: 韩金花
Owner: 河南大瑞物联网科技有限公司