A distributed Internet data acquisition method and system
A data collection system and method, applied in network data indexing, network data retrieval, and other database retrieval. It addresses problems such as data from secondary AJAX requests being impossible or cumbersome to process, and it avoids repeated crawling.
Examples
Embodiment 1
[0059] As shown in Figure 2, the present invention is a distributed Internet data acquisition method comprising the following steps:
[0060] S101: Receive a user's request for creating a data collection task and create the data collection task, assign the user-created data collection task to multiple crawler threads, and start the multiple crawler threads.
[0061] First, the user's request to create a data collection task is received and the task is managed. Collection tasks can be created, edited, and deleted, and each task specifies its name, URL deduplication rule, interval between two successive URL crawls, number of worker threads, number of retries, and so on. Collection tasks are managed in a tree structure.
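The task settings and tree-structured management described above can be sketched roughly as follows. This is only an illustration: the class names, field names, default values, and the "exact-url" rule identifier are assumptions, not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CollectionTask:
    """Per-task settings named in the text; field names are illustrative."""
    name: str
    dedup_rule: str = "exact-url"   # hypothetical rule identifier
    crawl_interval_s: float = 1.0   # pause between two URL fetches
    worker_threads: int = 4
    max_retries: int = 3


@dataclass
class TaskCategory:
    """A node of the task tree: holds tasks plus child categories."""
    name: str
    tasks: Dict[str, CollectionTask] = field(default_factory=dict)
    children: Dict[str, "TaskCategory"] = field(default_factory=dict)

    def category(self, path: List[str], create: bool = False) -> Optional["TaskCategory"]:
        """Walk down the tree along `path`, optionally creating nodes."""
        node = self
        for part in path:
            if part not in node.children:
                if not create:
                    return None
                node.children[part] = TaskCategory(part)
            node = node.children[part]
        return node

    def create_task(self, path: List[str], task: CollectionTask) -> None:
        self.category(path, create=True).tasks[task.name] = task

    def delete_task(self, path: List[str], name: str) -> bool:
        node = self.category(path)
        return node is not None and node.tasks.pop(name, None) is not None

    def list_tasks(self, path: List[str]) -> List[str]:
        node = self.category(path)
        return sorted(node.tasks) if node else []
```

Editing a task in this sketch amounts to changing the fields of the stored `CollectionTask`, for example `node.tasks["demo-task"].max_retries = 5`.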
[0062] Define the collection task according to the user's instructions, specifying the list of sections to collect, the IP proxy usage strategy, the web page data download mode (supporting download based on a browser kernel and HttpCl...
Embodiment 2
[0076] The present invention is a distributed Internet data acquisition system comprising:
[0077] The control center, which receives a user's request for creating a data collection task, creates the data collection task, distributes the user-created data collection task to multiple crawler threads, and starts the multiple crawler threads;
[0078] The collection center, in which each crawler thread, after receiving the data collection task and being started, obtains a URL from the queue of URLs to be crawled, downloads the web page from the website specified by the data collection task, executes the data processing plug-in specified by the data collection task to extract data, and analyzes the extracted data. The data processing plug-in specified by the data collection task is selected from plug-ins with different functions according to the type of the specified website.
[0079] The data center is used to store the extract...
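The interaction between the control center and the crawler threads might look roughly like the sketch below. It is a single-process illustration under stated assumptions: the `PLUGINS` registry, the `run_task` function, the exact-match deduplication, and the injected `download` callable are all hypothetical names and choices, not the patent's implementation.

```python
import queue
import threading
from typing import Callable, Dict, Iterable, List

# Hypothetical plug-in registry keyed by website type (an assumption).
PLUGINS: Dict[str, Callable[[str], dict]] = {
    "news": lambda html: {"kind": "news", "length": len(html)},
    "bidding": lambda html: {"kind": "bidding", "length": len(html)},
}


def run_task(urls: Iterable[str], site_type: str,
             download: Callable[[str], str], num_threads: int = 4) -> List[dict]:
    """Start crawler threads that drain a shared queue of URLs to crawl."""
    pending: "queue.Queue[str]" = queue.Queue()
    seen = set()                      # URLs already crawled (exact match)
    seen_lock = threading.Lock()
    results: List[dict] = []
    results_lock = threading.Lock()
    plugin = PLUGINS[site_type]       # plug-in chosen by website type

    for url in urls:
        pending.put(url)

    def worker() -> None:
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return                # queue drained, thread exits
            with seen_lock:
                if url in seen:       # skip URLs that were already taken
                    continue
                seen.add(url)
            record = plugin(download(url))   # download, then extract
            record["url"] = url
            with results_lock:
                results.append(record)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing `download` in as a parameter keeps the sketch runnable without network access; a real crawler would supply a function that fetches the page, and a multi-machine deployment would need the queue and the set of seen URLs to be shared between processes.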
Embodiment 3
[0085] This embodiment takes the collection of bidding data from the "Shandong Provincial Government Website" as an example. After the control center receives the user instruction, it starts to create the data collection task. Since hundreds of collection tasks will run on the computer system, the collection tasks are classified and managed in a tree structure for ease of management. As shown in Figure 4, the left side is the collection task category tree, and the right side is the list of collection tasks under the selected category. The specific configuration of a collection task can be entered on the page or imported directly from an XML configuration file. The task name list shown in Figure 4 includes "Shandong Provincial Government Procurement Network"; specifying this collection task name means that the website designated for this collection task is "Shandong Government Procurement Network".
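Importing a task from an XML configuration file could be sketched as below. The patent text does not show the file format, so the element names, attribute names, and sample values here are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout; the real schema is not given in the text.
SAMPLE = """
<task name="demo-procurement-site" category="government/procurement">
  <dedupRule>exact-url</dedupRule>
  <intervalSeconds>2</intervalSeconds>
  <threads>4</threads>
  <retries>3</retries>
</task>
"""


def load_task(xml_text: str) -> dict:
    """Parse one task definition into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        # Path of the task inside the category tree.
        "category": root.get("category", "").split("/"),
        "dedup_rule": root.findtext("dedupRule", default="exact-url"),
        "interval_s": float(root.findtext("intervalSeconds", default="1")),
        "threads": int(root.findtext("threads", default="1")),
        "retries": int(root.findtext("retries", default="0")),
    }
```

Calling `load_task(SAMPLE)` returns a dictionary whose `category` entry gives the position of the task in the classification tree.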
[0086] Figure 5 Run the rule config...



