Distributed internet data acquisition system and method based on event-driven model
The invention applies data acquisition and event-driven technology in the field of network search. It addresses problems such as unsuitability for distributed deployment, high technical requirements for users, and the high cost of system expansion.
Examples
Embodiment 1
[0080] As shown in Figure 4, a distributed Internet data acquisition system based on an event-driven model includes a console module, a data acquisition engine module, a data storage module, and a log service module.
[0081] The entire system runs on a container orchestration engine.
[0082] The console module configures data collection: it sets the crawler scheduling and parsing rules, triggers events such as starting and stopping a crawler, and completes the configuration of data storage.
[0083] The data acquisition engine module performs data acquisition according to the configuration of the console module: it fetches and parses the relevant web pages from the corresponding website according to the rules configured by the user, and outputs structured data and parsed pages.
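As an illustration of rule-driven parsing as described in [0083], the following is a minimal sketch, not the patented implementation. It assumes that each user-configured rule is a regular expression with one capture group; the names `ParseRule` and `parse_page` and the sample page are hypothetical, and downloading is omitted so the example stays self-contained.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ParseRule:
    """Hypothetical user-configured rule: one regex with one capture group per field."""
    field_name: str
    pattern: str


def parse_page(html: str, rules: List[ParseRule]) -> Dict[str, Optional[str]]:
    """Apply every rule to a downloaded page and return one structured record.

    A field whose rule does not match is set to None so that a later
    verification step can decide what to do with the record.
    """
    record: Dict[str, Optional[str]] = {}
    for rule in rules:
        match = re.search(rule.pattern, html, flags=re.DOTALL)
        record[rule.field_name] = match.group(1).strip() if match else None
    return record


# Assumed sample configuration and page content, for illustration only.
rules = [
    ParseRule("title", r"<title>(.*?)</title>"),
    ParseRule("price", r'<span class="price">(.*?)</span>'),
]
page = '<html><title> Widget </title><span class="price">9.99</span></html>'
record = parse_page(page, rules)
print(record)  # {'title': 'Widget', 'price': '9.99'}
```

Keeping the rules as plain data rather than code is one way to let the console hand them to the engine as configuration; a real deployment would more likely use a proper HTML parser or selector syntax than regular expressions.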
[0084] The data storage module is connected with the data acquisition engine module, and completes the data storage according to the ...
Embodiment 2
[0092] As shown in Figure 5, on the basis of Embodiment 1, the data acquisition engine module includes a scheduling component, a download analysis component, and a data verification component. The scheduling component cooperates with the download analysis component to complete data collection, and the download analysis component cooperates with the data verification component to verify the collected data.
[0093] The scheduling component generates the tasks to be crawled and manages task status. The download analysis component calls various services to download and parse pages efficiently and, depending on the configuration, may continue to download new links. The data verification component performs conformity checks on the data before it enters the database, in order to improve data quality.
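The conformity check performed by the data verification component might look like the following sketch. The source does not specify which checks are applied, so the per-field predicates in `CHECKS` and the helper names `verify_record` and `filter_valid` are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical conformity checks: field name -> predicate that must hold.
CHECKS: Dict[str, Callable[[Any], bool]] = {
    "url": lambda v: isinstance(v, str) and v.startswith(("http://", "https://")),
    "title": lambda v: isinstance(v, str) and len(v.strip()) > 0,
}


def verify_record(record: Dict[str, Any]) -> List[str]:
    """Return the names of fields that are missing or fail their check.

    An empty list means the record passes and may be stored.
    """
    return [
        name
        for name, check in CHECKS.items()
        if name not in record or not check(record[name])
    ]


def filter_valid(records: List[Dict[str, Any]]) -> Tuple[List[dict], List[dict]]:
    """Split records into those that pass verification and those that do not."""
    valid, rejected = [], []
    for rec in records:
        (rejected if verify_record(rec) else valid).append(rec)
    return valid, rejected


# Assumed sample records, for illustration only.
batch = [
    {"url": "https://example.com/a", "title": "Page A"},
    {"url": "ftp://example.com/b", "title": ""},
]
valid, rejected = filter_valid(batch)
print(len(valid), len(rejected))  # 1 1
```

Returning the list of failing fields, rather than a bare boolean, makes it possible to log why a record was rejected before it reaches the database.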
Embodiment 3
[0095] As shown in Figure 6, on the basis of Embodiment 2, the scheduling component includes a crawler scheduling service and a link scheduling service.
[0096] A scheduling event message queue is provided between the crawler scheduling service and the link scheduling service; the link scheduling service is also connected to the crawling event message queue.
[0097] Once the crawler scheduling service determines that the current time matches the project execution time and period configured by the user, it uses the task's unique identification number to retrieve the crawling task meta-information from the corresponding configuration database. This meta-information includes the target website, the data parsing rules and storage fields, the data verification method, the project execution time and period, the database configuration, and other information. The service then packages the meta-information as a data scheduling event and puts it in the scheduling event message queue.
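The flow in [0097] can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the configuration database is replaced by an in-memory dictionary, the message queue by Python's `queue.Queue`, and all field names, the task ID `task-001`, and the functions `is_due` and `schedule_if_due` are hypothetical. As a simplification, the schedule is read from the same record as the rest of the meta-information.

```python
import json
import queue
from datetime import datetime, timedelta
from typing import Optional

# Stand-in for the configuration database, keyed by unique task ID.
# Every field name and value here is an assumption for illustration.
CONFIG_DB = {
    "task-001": {
        "target_website": "https://example.com",
        "parsing_rules": {"title": "<title>(.*?)</title>"},
        "storage_fields": ["title"],
        "verification": "non_empty",
        "start_time": "2024-01-01T00:00:00",
        "period_seconds": 3600,
        "database": {"kind": "hypothetical-store"},
    }
}

# Stand-in for the scheduling event message queue.
scheduling_events: "queue.Queue[str]" = queue.Queue()


def is_due(meta: dict, last_run: Optional[datetime], now: datetime) -> bool:
    """True if `now` is past the start time and a full period has elapsed."""
    if now < datetime.fromisoformat(meta["start_time"]):
        return False
    if last_run is None:
        return True
    return now - last_run >= timedelta(seconds=meta["period_seconds"])


def schedule_if_due(task_id: str, last_run: Optional[datetime], now: datetime) -> bool:
    """Package the task meta-information as a scheduling event when it is due."""
    meta = CONFIG_DB[task_id]
    if not is_due(meta, last_run, now):
        return False
    event = {
        "type": "scheduling",
        "task_id": task_id,
        "created_at": now.isoformat(),
        "meta": meta,
    }
    scheduling_events.put(json.dumps(event))
    return True


now = datetime(2024, 1, 1, 2, 0, 0)
print(schedule_if_due("task-001", last_run=None, now=now))  # True
print(schedule_if_due("task-001", last_run=now - timedelta(minutes=5), now=now))  # False
```

Serializing the event to JSON before enqueueing mirrors what a networked message broker would require, so the consumer on the other side of the queue can be a separate process or container.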
[0098] The l...