
A distributed Internet data acquisition method and system

A data collection method and system, applied in fields such as network data indexing, network data retrieval, and other database retrieval. It addresses cases that existing approaches cannot handle or handle only with cumbersome processing, such as data that requires secondary AJAX requests, and it avoids repeated crawling.

Active Publication Date: 2021-09-03
张魏
Cites: 4, Cited by: 0

AI Technical Summary

Problems solved by technology

[0012] The above method of configuring collection rules can collect data from most simple websites. More complex websites are cumbersome to handle, for example data that requires login, POST requests, special header information, or secondary AJAX requests.
The method cannot handle situations that require frequent switching of dynamic IP proxies.
It also cannot handle data that needs special processing before it can be recognized, such as data parsed from images or from audio.



Examples


Embodiment 1

[0059] As shown in Figure 2, the present invention is a distributed Internet data acquisition method comprising the following steps:

[0060] S101: Receive a user's request for creating a data collection task and create the data collection task, assign the user-created data collection task to multiple crawler threads, and start the multiple crawler threads.

[0061] First, the user's request to create a data collection task is received, and the task is managed. Management includes creating, editing, and deleting collection tasks, and specifying the task name, the URL deduplication rule, the interval between two URL crawls, the number of worker threads, the number of retries, and so on. Collection tasks are managed in a tree structure.
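The task settings and tree-structured management described in [0061] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class names `CollectionTask` and `TaskCategory`, every field name, and the default values are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class CollectionTask:
    """Settings named in [0061]; field names and defaults are illustrative."""
    name: str
    url_dedup_rule: str = "exact"   # hypothetical rule identifier
    crawl_interval_s: float = 1.0   # pause between two URL fetches
    worker_threads: int = 4
    max_retries: int = 3


@dataclass
class TaskCategory:
    """A node in the task tree: holds tasks and nested sub-categories."""
    name: str
    tasks: Dict[str, CollectionTask] = field(default_factory=dict)
    children: Dict[str, "TaskCategory"] = field(default_factory=dict)

    def subcategory(self, name: str) -> "TaskCategory":
        # Return the child category, creating it on first use.
        return self.children.setdefault(name, TaskCategory(name))

    def create(self, task: CollectionTask) -> None:
        self.tasks[task.name] = task

    def edit(self, task_name: str, **changes) -> None:
        task = self.tasks[task_name]
        for key, value in changes.items():
            if not hasattr(task, key):
                raise AttributeError(f"unknown setting: {key}")
            setattr(task, key, value)

    def delete(self, task_name: str) -> None:
        del self.tasks[task_name]

    def walk(self, prefix: str = "") -> Iterator[str]:
        """Yield 'category/path/task' for every task in this subtree."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        for task_name in self.tasks:
            yield f"{path}/{task_name}"
        for child in self.children.values():
            yield from child.walk(path)


root = TaskCategory("all")
gov = root.subcategory("government")
gov.create(CollectionTask(name="procurement-site"))
gov.edit("procurement-site", worker_threads=8, max_retries=5)
```

Keeping tasks in a dictionary per category makes create, edit, and delete direct lookups, while `walk` gives the flat listing a management page would need.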

[0062] Define the collection task according to the user's instructions, specifying the list of sections to collect, the IP proxy usage strategy, the web page data download mode (supports download based on a browser kernel and HttpCl...

Embodiment 2

[0076] The present invention is a distributed Internet data acquisition system, comprising:

[0077] The control center receives a user's request for creating a data collection task and creates a data collection task, distributes the data collection task created by the user to multiple crawler threads, and starts the multiple crawler threads;

[0078] The collection center: after each crawler thread receives the data collection task and is started, it obtains a URL from the queue of URLs to be crawled, downloads the web page from the website specified by the data collection task, executes the data processing plug-in specified by the task to extract data, and analyzes the extracted data. The plug-in specified by the task is chosen from data processing plug-ins with different functions according to the type of the specified website.
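The crawler-thread loop in [0078] can be sketched with Python's standard `queue` and `threading` modules. This is a simplified sketch under stated assumptions: `PLUGINS`, `fake_download`, `crawler_thread`, and `run_task` are hypothetical names, the downloader returns canned text instead of fetching anything, and each thread simply exits when the queue is empty, whereas a real crawler would keep running while new URLs are still being discovered.

```python
import queue
import threading
from typing import Callable, Dict, List, Tuple

# Hypothetical plug-in registry keyed by website type (an assumption).
Plugin = Callable[[str], dict]
PLUGINS: Dict[str, Plugin] = {
    "html": lambda page: {"title": page.split("|")[0]},
    "json": lambda page: {"length": len(page)},
}


def fake_download(url: str) -> str:
    """Stand-in for the real downloader; returns canned content."""
    return f"page for {url}|body"


def crawler_thread(url_queue: "queue.Queue[str]", site_type: str,
                   results: List[Tuple[str, dict]],
                   lock: threading.Lock) -> None:
    plugin = PLUGINS[site_type]          # plug-in chosen by website type
    while True:
        try:
            url = url_queue.get_nowait()  # take the next URL to crawl
        except queue.Empty:
            return                        # nothing left: this thread stops
        record = plugin(fake_download(url))
        with lock:                        # results list is shared
            results.append((url, record))
        url_queue.task_done()


def run_task(urls: List[str], site_type: str,
             worker_threads: int = 3) -> List[Tuple[str, dict]]:
    url_queue: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        url_queue.put(url)
    results: List[Tuple[str, dict]] = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawler_thread,
                         args=(url_queue, site_type, results, lock))
        for _ in range(worker_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_task(["http://example.com/1", "http://example.com/2"], "html")
```

The order of `results` depends on thread scheduling, so callers that need a stable order should sort by URL.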

[0079] The data center is used to store the extract...

Embodiment 3

[0085] This embodiment takes the collection of bidding data from the "Shandong Provincial Government Website" as an example. After the control center receives the user instruction, it creates the data collection task. Because hundreds of collection tasks run on the computer system, they are classified and managed in a tree structure for ease of management. As shown in Figure 4, the left side shows the collection task category tree, and the right side lists the collection tasks under the selected category. The specific configuration of a collection task can be entered on the page or imported directly from an XML configuration file. The task name list shown in Figure 4 includes "Shandong Provincial Government Procurement Network"; specifying this task name designates "Shandong Government Procurement Network" as the website for the collection task.
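Importing a task configuration from an XML file, as mentioned in [0085], might look like the sketch below. The source does not show the file format, so the element names (`task`, `dedup-rule`, `interval-seconds`, `threads`, `retries`), the sample values, and the function `load_task_config` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout; the patent does not specify one.
SAMPLE_XML = """
<task name="example-procurement-site">
  <dedup-rule>exact</dedup-rule>
  <interval-seconds>2</interval-seconds>
  <threads>4</threads>
  <retries>3</retries>
</task>
"""


def load_task_config(xml_text: str) -> dict:
    """Parse one <task> element into a plain settings dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("name"),
        "dedup_rule": root.findtext("dedup-rule", default="exact"),
        "interval_seconds": float(root.findtext("interval-seconds", default="1")),
        "threads": int(root.findtext("threads", default="1")),
        "retries": int(root.findtext("retries", default="0")),
    }


config = load_task_config(SAMPLE_XML)
```

Missing elements fall back to defaults, so a partial file still yields a usable configuration.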

[0086] Figure 5 Run the rule config...


Abstract

The present invention is a distributed Internet data collection method. It receives a user's request to create a data collection task, creates the task, distributes it to multiple crawler threads, and starts those threads. After each crawler thread receives the task and is started, it obtains a URL from the queue of URLs to be crawled, downloads the web page from the website specified by the task, and executes the data processing plug-in specified by the task to extract data. The plug-in is chosen from data processing plug-ins with different functions according to the type of the specified website. The extracted target data is stored in the designated database for subsequent processing, and the extracted target URLs are pushed to the queue of URLs to be crawled. This enables large-scale crawling of many types of complex websites.
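Pushing extracted URLs back onto the crawl queue while avoiding repeated crawling can be sketched as a queue paired with a "seen" set. The class `UrlFrontier` and its normalization rule (drop the fragment and any trailing slash) are assumptions; the patent lets the user configure the deduplication rule and does not state this one. The sketch is not thread-safe, so shared use across crawler threads would need a lock.

```python
from collections import deque
from typing import Deque, Optional, Set
from urllib.parse import urldefrag


class UrlFrontier:
    """Queue of URLs to crawl that rejects URLs it has already accepted."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self._seen: Set[str] = set()

    @staticmethod
    def _key(url: str) -> str:
        # Assumed dedup rule: ignore the fragment and a trailing slash.
        without_fragment, _fragment = urldefrag(url.strip())
        return without_fragment.rstrip("/")

    def push(self, url: str) -> bool:
        """Queue the URL; return False if an equivalent one was seen."""
        key = self._key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(url)
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


frontier = UrlFrontier()
first = frontier.push("http://example.com/a")
duplicate = frontier.push("http://example.com/a#top")
```

Because the seen set is kept separately from the pending queue, a URL stays rejected even after it has been popped and crawled.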

Description

Technical field

[0001] The invention relates to the technical field of Internet data collection, in particular to a distributed Internet data collection method and system.

Background technique

[0002] With the rapid development of the network, the Internet has become the carrier of a large amount of information, including public opinion information, social events, policy responses, industry information of all kinds, employment information, and so on. This information is the data basis of big data public opinion analysis systems and macroeconomic analysis systems. How to extract and use it effectively has become a huge challenge. The web crawler is a very important part of a data analysis system. It is responsible for collecting web pages and information from the Internet, and these pages are used to build indexes that support search and analysis. It determines whether the content of the entire data analysis system is rich. Whether the...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566
Inventor: 廖尚围刘遥周庚新
Owner: 张魏