A network data acquisition method capable of automatically removing useless information and repeated information
A technology for removing useless and repeated information, applied in the field of big data. It addresses the problems of time-consuming collection, cluttered content, and complicated capture methods, and improves collection, capture, and storage speed.
Embodiment Construction
[0020] A network data collection method that automatically eliminates useless and repeated information, comprising:
[0021] Step 1: crawl web page content from the Internet with a web crawler and extract the required content that contains the keywords;
[0022] Step 2: supply the crawler, through a URL queue, with the URLs of the web pages whose data is to be captured;
[0023] Step 3: process the content captured by the crawler in a data processing module. The data processing module includes a big data cleaning task unit library, a Spark SQL module, a Spark-ETL SDK module, a pipeline configuration module, and a web service platform consisting of a web client and a web server. The user adds the required cleaning units and the algorithm tasks to be executed through the web server; the Spark SQL module receives the required cleaning units and the algorithm tasks to be executed from the web server and implements the data cleaning function, the cleaning ...
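The three steps above can be illustrated with a minimal single-process sketch. It is not the Spark-based implementation the description refers to: the function name `crawl`, the injected `fetch` callable, the `FAKE_SITE` data, and the choice of a SHA-256 hash over whitespace-normalized text to detect repeated content are all assumptions made for illustration. A URL queue feeds the crawler (Step 2), pages without any keyword are treated as useless and dropped (Step 1), and pages whose normalized text has already been seen are treated as repeated and dropped (a stand-in for the cleaning in Step 3).

```python
from collections import deque
import hashlib
import re
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical fetch signature: url -> (page text, outgoing links).
Fetch = Callable[[str], Tuple[str, List[str]]]


def crawl(seed_urls: Iterable[str], fetch: Fetch,
          keywords: Iterable[str], max_pages: int = 100) -> List[Dict[str, str]]:
    """Crawl from a URL queue, keeping keyword-bearing, non-repeated pages."""
    keywords = [k.lower() for k in keywords]
    queue = deque()          # URL queue that feeds the crawler
    seen_urls = set()        # URLs already queued, so none is fetched twice
    seen_hashes = set()      # fingerprints of content already kept
    kept = []

    def enqueue(url: str) -> None:
        if url not in seen_urls:
            seen_urls.add(url)
            queue.append(url)

    for url in seed_urls:
        enqueue(url)

    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        fetched += 1
        for link in links:
            enqueue(link)

        # Normalize whitespace and case before filtering and fingerprinting.
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        if not any(k in normalized for k in keywords):
            continue  # useless: contains none of the keywords
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # repeated: same normalized content already kept
        seen_hashes.add(digest)
        kept.append({"url": url, "text": text})
    return kept


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "a": ("Big data cleaning with Spark", ["b", "c"]),
    "b": ("Cooking recipes", ["a", "c"]),
    "c": ("big  data cleaning with spark", ["d"]),
    "d": ("Spark SQL tips", []),
}


def fake_fetch(url: str) -> Tuple[str, List[str]]:
    return FAKE_SITE[url]


print([page["url"] for page in crawl(["a"], fake_fetch, ["spark"])])
# prints ['a', 'd']
```

Page "b" is dropped because it contains no keyword, and page "c" is dropped because its normalized text duplicates page "a". In a real deployment `fetch` would perform the HTTP request and link extraction, and the filtering would presumably run as cleaning units in the Spark SQL module rather than inline in the crawler.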