A network data acquisition method capable of automatically removing useless information and repeated information

A technique for removing useless and repeated information, applied in the field of big data. It addresses time-consuming, complicated capture methods and cluttered content, and improves the speed of collecting, capturing, and storing network information.

Status: Inactive
Publication Date: 2019-04-30

河南大瑞物联网科技有限公司

Cites: 0; Cited by: 1

Problems solved by technology

[0003] The technical problem to be solved by the present invention is that network data is large in volume and messy in content, and that existing big data acquisition techniques capture network information in a complicated and time-consuming way. The purpose is to provide a network data acquisition method that automatically eliminates useless and duplicate information, improving the speed at which network information is captured and stored.

Embodiment Construction

[0020] A network data collection method that automatically eliminates useless and repetitive information, including:

[0021] Step 1, crawling webpage content from the Internet through a web crawler, and extracting required content containing keywords;

[0022] Step 2: Provide the crawler, through the URL queue, with the URLs of the websites from which data is to be captured;

[0023] Step 3: Process the content captured by the crawler through the data processing module. The data processing module includes a big data cleaning task unit library, a Spark SQL module, a Spark-ETL SDK module, a pipeline configuration module, and a webpage service platform consisting of a web client and a web server. The user adds the required cleaning units and the algorithm tasks to be executed through the web server; the Spark SQL module receives these cleaning units and algorithm tasks from the web server and performs the data cleaning function, the cleaning ...
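
Steps 1 to 3 can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the source describes a Spark SQL module and a Spark-ETL SDK module, whereas the sketch uses the Python standard library alone. Every name in it (`TextExtractor`, `UrlQueue`, `drop_useless`, `drop_repeated`, `run_pipeline`, `crawl`), the 10-character minimum length, and the SHA-256 fingerprint used to detect repeats are hypothetical choices made for illustration. The `fetch` argument is supplied by the caller, so the sketch itself performs no network access.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_keyword_text(html, keywords):
    """Step 1: keep only text chunks that mention at least one keyword."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lowered = [k.lower() for k in keywords]
    return [c for c in parser.chunks if any(k in c.lower() for k in lowered)]


class UrlQueue:
    """Step 2: FIFO queue that hands each URL to the crawler at most once."""

    def __init__(self, seeds=()):
        self._pending = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def next_url(self):
        return self._pending.popleft() if self._pending else None


def drop_useless(records):
    """Cleaning unit (assumed rule): discard records shorter than 10 characters."""
    return [r for r in records if len(r.strip()) >= 10]


def drop_repeated(records):
    """Cleaning unit (assumed rule): discard text already seen after normalization."""
    seen, kept = set(), []
    for r in records:
        normalized = " ".join(r.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(r)
    return kept


def run_pipeline(records, units):
    """Step 3: apply the configured cleaning units in order."""
    for unit in units:
        records = unit(records)
    return records


def crawl(queue, fetch, keywords):
    """Drain the URL queue, extracting keyword-bearing text from each page."""
    records = []
    url = queue.next_url()
    while url is not None:
        records.extend(extract_keyword_text(fetch(url), keywords))
        url = queue.next_url()
    return records
```

A caller would pass the list of cleaning units, for example `run_pipeline(records, [drop_useless, drop_repeated])`, which loosely mirrors the idea of a user selecting cleaning units for a configurable pipeline. Keeping each unit as an independent function makes the order and selection of units a matter of configuration rather than code changes.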

Abstract

The invention discloses a network data acquisition method capable of automatically removing useless information and repeated information, which comprises the following steps: 1, capturing webpage content from the Internet through a web crawler and extracting the required content containing keywords; 2, providing the crawler, through a URL (uniform resource locator) queue, with the URLs of the websites from which data is to be captured; 3, processing the content captured by the crawler through a data processing module (DP); 4, storing, through a data storage module, the URL information of the websites from which data is to be captured, the data extracted from the webpages by the crawler, and the data processed by the DP. According to the invention, data is obtained from websites through a web crawler or through the websites' public APIs. With this method, unstructured data can be extracted from webpages and stored in a structured form as a unified local data file. Collection of pictures, audio, and video is supported, and attachments can be automatically associated with the text. The method increases both the speed at which network information is collected and captured and the speed at which the captured information is stored.
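
The storage step (step 4) and the association of attachments with text can be illustrated with a minimal sketch. The source does not name a storage engine or a schema, so everything here is an assumption: SQLite via Python's standard `sqlite3` module stands in for the data storage module, and the table names (`pages`, `attachments`), column names, and helper functions (`open_store`, `save_page`, `attachments_for`) are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a local store; the schema below is an assumed layout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS pages (
            url          TEXT PRIMARY KEY,  -- URL of the captured page
            raw_text     TEXT NOT NULL,     -- data extracted by the crawler
            cleaned_text TEXT               -- data after processing
        );
        CREATE TABLE IF NOT EXISTS attachments (
            id       INTEGER PRIMARY KEY,
            page_url TEXT NOT NULL REFERENCES pages(url),
            kind     TEXT NOT NULL CHECK (kind IN ('picture', 'audio', 'video')),
            location TEXT NOT NULL          -- where the file was saved
        );
        """
    )
    return conn


def save_page(conn, url, raw_text, cleaned_text, attachments=()):
    """Store one page and link its attachments to it in a single transaction."""
    with conn:
        conn.execute(
            "INSERT INTO pages (url, raw_text, cleaned_text) VALUES (?, ?, ?) "
            "ON CONFLICT(url) DO UPDATE SET "
            "raw_text = excluded.raw_text, cleaned_text = excluded.cleaned_text",
            (url, raw_text, cleaned_text),
        )
        # Replace any attachments recorded by an earlier capture of this URL.
        conn.execute("DELETE FROM attachments WHERE page_url = ?", (url,))
        conn.executemany(
            "INSERT INTO attachments (page_url, kind, location) VALUES (?, ?, ?)",
            [(url, kind, location) for kind, location in attachments],
        )


def attachments_for(conn, url):
    """Return (kind, location) pairs associated with a page, in insertion order."""
    rows = conn.execute(
        "SELECT kind, location FROM attachments WHERE page_url = ? ORDER BY id",
        (url,),
    )
    return rows.fetchall()
```

Using the page URL as the primary key means that capturing the same page again updates the existing row instead of storing a repeat, and the foreign key keeps every attachment tied to the text it came from. The `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or later.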

Description

Technical field

[0001] The invention relates to the field of big data, in particular to a network data collection method for automatically eliminating useless information and repeated information.

Background

[0002] Network data contains a large number of pictures, videos, data, and other content. It is very large in volume and messy in content. Existing big data collection techniques capture network information in a complicated and time-consuming way, which leads to very low work efficiency. If enterprises want to obtain a large amount of data as quickly as possible, they need a large number of developers, and the cost is very high.

Summary of the invention

[0003] The technical problem to be solved by the present invention is that network data is large in volume and messy in content, and that existing big data collection techniques capture network information in a complicated and time-consuming way. The purpos...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/215, G06F16/25, G06F16/951

Inventor: 韩金花

Owner: 河南大瑞物联网科技有限公司
