Distributed web crawler data extraction system and method based on micro-service architecture

A data extraction and microservice technology, applied in network data retrieval, network data indexing, unstructured text data retrieval, etc. It addresses problems such as the unclear division of functional modules, and achieves a clear division of modules, improved scalability and rapid deployment capability, and higher overall throughput.

Inactive Publication Date: 2020-06-02
NANJING UNIV OF POSTS & TELECOMM
Cites: 0, Cited by: 4

AI Technical Summary

Problems solved by technology

[0004] In traditional web crawler systems, the functional modules are not clearly divided and the functions are tightly coupled, so the system cannot sustain high data throughput or crawling efficiency when facing large volumes of data.
There is also no isolation between functions of the kind found between microservice modules, and no system fuse (circuit breaker) handling, so the collapse of one part of the functional logic can trigger an avalanche across the entire system.
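The fuse handling whose absence is described above can be illustrated with a minimal circuit breaker sketch. This is not the patent's implementation: the class name `CircuitBreaker`, the parameters `max_failures` and `reset_after`, and the injectable `clock` are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical fuse: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling requests onto a broken module.
                raise CircuitOpenError("downstream module is isolated")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each cross-module call in such a breaker means a failing module is rejected quickly rather than dragging its callers down with it, which is the isolation property the paragraph above says traditional crawlers lack.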

Method used



Examples


Embodiment Construction

[0030] The technical solution of the present invention is described in further detail below in conjunction with the accompanying drawings. The present embodiment is implemented on the premise of the technical solution of the present invention, and detailed implementation methods and specific operating procedures are provided, but the scope of protection of the present invention is not limited to the following examples.

[0031] This embodiment proposes a distributed web crawler data extraction system based on a microservice architecture, comprising a data extraction module, a request preprocessing module, a distributed data storage module, and a download module. The data extraction module extracts the specified information from downloaded pages according to the data extraction rules specified by the user. The request preprocessing module is mainly used to deliver the crawler task request to the message queue through the load balancing...
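The two modules described above can be sketched roughly as follows. The source does not specify the rule format or the load balancing policy, so this sketch assumes that rules are regular expressions with one capture group and that requests are spread round-robin over in-process queues. The names `extract` and `RoundRobinDispatcher` are hypothetical.

```python
import itertools
import queue
import re


def extract(page_text, rules):
    """Apply user-supplied rules (field name -> regex with one group)."""
    record = {}
    for field, pattern in rules.items():
        match = re.search(pattern, page_text, flags=re.DOTALL)
        record[field] = match.group(1).strip() if match else None
    return record


class RoundRobinDispatcher:
    """Deliver task requests to queues in turn (assumed balancing policy)."""

    def __init__(self, queues):
        self.queues = list(queues)
        self._next = itertools.cycle(range(len(self.queues)))

    def submit(self, task):
        index = next(self._next)
        self.queues[index].put(task)
        return index


# Request preprocessing: spread three crawl requests over two queues.
queues = [queue.Queue(), queue.Queue()]
dispatcher = RoundRobinDispatcher(queues)
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/c"]:
    dispatcher.submit({"url": url})

# Data extraction: pull the fields named by the user's rules from a page.
page = "<html><title> Demo page </title><h1>Hello</h1></html>"
rules = {
    "title": r"<title>(.*?)</title>",
    "heading": r"<h1>(.*?)</h1>",
    "price": r'<span class="price">(.*?)</span>',  # absent from this page
}
record = extract(page, rules)
```

A rule that matches nothing yields `None` rather than raising, so one missing field does not discard the whole record. A real deployment would likely replace the in-process queues with a message broker.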



Abstract

The invention provides a distributed web crawler data extraction system and method based on a microservice architecture. It applies the microservice architecture concept, currently a leading approach in the industry, and splits the whole crawler system into separate modules, including a data extraction module. Based on the system and a cloud architecture, a user can quickly deploy the distributed crawler system, with support for horizontal expansion and containerized deployment, which greatly improves the scalability and rapid deployment capability of the crawler system.
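Horizontal expansion can be pictured as adding identical workers that consume from one shared task queue. The sketch below is an illustration under stated assumptions, not the patented design: threads stand in for containerized service instances, an in-process `queue.Queue` stands in for the message queue, and `fake_download` and `crawl` are hypothetical names. No network access is made.

```python
import queue
import threading


def fake_download(url):
    """Stand-in for the download module; returns a synthetic page."""
    return f"<html>{url}</html>"


def worker(tasks, results, lock):
    while True:
        url = tasks.get()
        if url is None:  # sentinel: no more work for this worker
            tasks.task_done()
            break
        page = fake_download(url)
        with lock:
            results[url] = page
        tasks.task_done()


def crawl(urls, worker_count):
    """Process all URLs with `worker_count` identical workers."""
    tasks, results, lock = queue.Queue(), {}, threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results
```

Because the workers share nothing except the queue, changing `worker_count` changes capacity without changing the result, which is the property that makes horizontal scaling of such modules straightforward.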

Description

Technical field

[0001] The invention relates to a distributed web crawler data extraction system and method based on a microservice architecture, and belongs to the technical field of big data distribution.

Background technique

[0002] As the Internet becomes increasingly widespread in people's lives, more and more new technologies have emerged, and web crawlers are among the most widely used. According to statistics, nearly 80% of the traffic on today's Internet comes from web crawlers developed by major Internet companies or individual developers. With the development of web page technology, the data on the Internet is also growing explosively. At the same time, people place ever higher requirements on extracting information from web pages, which has given rise to a wide variety of crawler systems. The current crawler systems can be divided into general crawlers, domain-specific vertical...

Claims


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F16/951, G06F16/9538, G06F16/958, G06F16/31, G06F9/50
CPC: G06F9/5038, G06F16/31, G06F16/951, G06F16/9538, G06F16/986
Inventor: 葛又嘉, 章韵
Owner: NANJING UNIV OF POSTS & TELECOMM