Data crawling implementation method based on distributed crawler technology

A data crawling implementation method based on distributed crawler technology, applied in the technical field of data crawling. It addresses problems of general-purpose crawlers such as the inability to understand semantic queries, uniform ("stereotyped") results, and the inability to return different search results to users with different backgrounds.

Inactive Publication Date: 2021-03-12
安徽经邦软件技术有限公司
6 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0003] However, general-purpose crawlers mostly return text-related content (HTML, Word, PDF) and cannot provide multimedia files (music, pictures, videos) or binary files (programs, scripts). They return the same results to users from different backgrounds and cannot understand semantic queries, and bs4 can only parse data in HTML format. In summary, the present invention provides a data crawling implementation method based on distributed crawler technology to solve the above problems.



Embodiment Construction

[0015] The technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0016] The present invention provides a technical solution: a method for implementing data crawling based on distributed crawler technology. The method adopts focused crawling (a web crawler "oriented to specific subject requirements": when a focused crawler fetches web pages, it processes and filters the content so that, as far as possible, only page information relevant to the requirement is fetched) and Scrapy. Regular expressions are used in the fetching process (regular expressions are a special...
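The filtering idea behind focused crawling in [0016] can be illustrated with a minimal Python sketch. It is not the patented implementation: the keyword list, the sample pages, the period pattern, and the function names `is_relevant` and `extract_periods` are all hypothetical, and the sketch uses only the standard-library `re` module rather than Scrapy so that it runs without extra dependencies.

```python
import re

# Hypothetical topic keywords; the source does not list any.
TOPIC_KEYWORDS = ("gdp", "cpi", "population")

# Hypothetical pattern for period labels such as 2020, 2020Q3, or 2020-07.
PERIOD_PATTERN = re.compile(r"\b(\d{4})(?:Q([1-4])|-(0[1-9]|1[0-2]))?\b")


def is_relevant(page_text, keywords=TOPIC_KEYWORDS):
    """Return True if the page mentions at least one topic keyword as a whole word."""
    lowered = page_text.lower()
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
        for keyword in keywords
    )


def extract_periods(page_text):
    """Return every period label found in the text, in order of appearance."""
    return [match.group(0) for match in PERIOD_PATTERN.finditer(page_text)]


# Hypothetical pages: only the first one is on topic.
pages = {
    "https://example.com/a": "Quarterly GDP for 2020Q3 and 2020Q4",
    "https://example.com/b": "Recipe of the day: noodles",
}

# Keep only relevant pages, then use the regular expression to pull out periods.
kept = {
    url: extract_periods(text)
    for url, text in pages.items()
    if is_relevant(text)
}
```

Here the relevance check plays the role of the "processing and screening" step, and the compiled pattern shows how a regular expression checks whether a string matches a given form.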



Abstract

The invention discloses a data crawling implementation method based on distributed crawler technology, relating to the technical field of data crawling. The method comprises the following steps: S1, specifying a URL: starting from a given address, find the address from which the data finally needs to be captured and obtain the corresponding index codes; and S2, initiating a request: splice URLs according to the codes obtained in step S1 and judge whether each URL holds the required data; if it is a data page, call back to the detail page; otherwise, continue searching for data URLs in a loop. The method adopts focused crawling and Scrapy. During capture it uses regular expressions, JSON data, and data at several frequencies (yearly, quarterly, monthly, and so on), and it performs incremental updates and batch insertion into an Oracle database. Content is processed and filtered during web page capture so that, as far as possible, only page information relevant to the requirements is captured, and regular expressions make it easy to check whether a string matches a given pattern.
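The control flow of steps S1 and S2, followed by an incremental batch insert, can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the in-memory `SITE` dictionary stands in for HTTP responses so the example runs offline, the standard-library `sqlite3` module stands in for the Oracle database named in the abstract, and the URL layout, the `codes` and `rows` keys, the `series` table, and the names `fetch`, `parse_detail`, `crawl`, and `store_incrementally` are all hypothetical.

```python
import json
import sqlite3
from collections import deque
from urllib.parse import urljoin

BASE = "https://example.com/api/"  # hypothetical site root

# Hypothetical stand-in for the remote site: URL -> JSON body.
# Index pages list child codes; data pages carry rows.
SITE = {
    BASE + "root": json.dumps({"codes": ["A01", "A02"]}),
    BASE + "A01": json.dumps({"codes": ["A0101"]}),
    BASE + "A02": json.dumps({"rows": [{"period": "2020", "value": 1.5}]}),
    BASE + "A0101": json.dumps({"rows": [{"period": "2020Q1", "value": 0.4}]}),
}


def fetch(url):
    """Return the parsed JSON body for a URL (offline stand-in for an HTTP GET)."""
    return json.loads(SITE[url])


def parse_detail(code, payload):
    """Callback for data pages: flatten rows into (code, period, value) tuples."""
    return [(code, row["period"], row["value"]) for row in payload["rows"]]


def crawl(start_code):
    """S1: start from a given address. S2: splice URLs from index codes and
    keep searching until data pages are reached."""
    records, seen = [], set()
    queue = deque([start_code])
    while queue:
        code = queue.popleft()
        if code in seen:
            continue
        seen.add(code)
        payload = fetch(urljoin(BASE, code))  # splice the URL from the code
        if "rows" in payload:
            # Data page: hand over to the detail callback.
            records.extend(parse_detail(code, payload))
        else:
            # Index page: continue the search with the child codes.
            queue.extend(payload.get("codes", []))
    return records


def store_incrementally(conn, rows):
    """Batch-insert rows, skipping (code, period) keys that are already stored.
    Returns the number of newly inserted rows."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO series (code, period, value) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")  # SQLite used here only as a stand-in for Oracle
conn.execute(
    "CREATE TABLE series (code TEXT, period TEXT, value REAL, "
    "PRIMARY KEY (code, period))"
)

records = crawl("root")
first = store_incrementally(conn, records)        # first run inserts every row
second = store_incrementally(conn, crawl("root"))  # second run finds nothing new
```

The primary key on `(code, period)` is what makes the second run a no-op. With a real Oracle connection, a comparable effect would typically be obtained with a `MERGE` statement or a key lookup before insertion, which the abstract does not specify.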

Description

Technical field

[0001] The invention relates to the technical field of data crawling, in particular to a method for implementing data crawling based on distributed crawler technology.

Background art

[0002] Common types of crawlers in the prior art include general crawlers, focused crawlers, and incremental crawlers; commonly used data parsing methods include regular expressions, bs4, and XPath.

[0003] However, general-purpose crawlers mostly return text-related content (HTML, Word, PDF) and cannot provide multimedia files (music, pictures, videos) or binary files (programs, scripts). They return the same results to users from different backgrounds and cannot understand semantic queries, and bs4 can only parse data in HTML format. In summary, the present invention provides a data crawling implementation method based on distributed crawler technology to solve the above problems.

Contents of the invention

[0004] Aiming at the deficiencies ...
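The parsing routes named in the background section ([0002] and [0003]) can be compared on a small example. This is an illustrative sketch only: the HTML fragment and JSON body are hypothetical, and the standard-library `xml.etree.ElementTree` module, which supports a limited XPath subset and requires well-formed markup, is used as a dependency-free stand-in for a full XPath engine. bs4 is left out for the same reason; the JSON case shows the kind of payload that an HTML-only parser does not cover.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical sample payloads carrying the same data point.
HTML_FRAGMENT = (
    '<table><tr><td class="period">2020</td>'
    '<td class="value">1.5</td></tr></table>'
)
JSON_BODY = '{"rows": [{"period": "2020", "value": 1.5}]}'

# 1. Regular expression: match the cell by its surrounding markup.
match = re.search(r'<td class="value">([^<]+)</td>', HTML_FRAGMENT)
value_by_regex = float(match.group(1))

# 2. XPath-style query: select the cell by element name and attribute.
root = ET.fromstring(HTML_FRAGMENT)
value_by_xpath = float(root.find(".//td[@class='value']").text)

# 3. JSON: no markup parser is involved; the body decodes to dicts and lists.
value_by_json = json.loads(JSON_BODY)["rows"][0]["value"]
```

All three routes yield the same number here. The regular expression is the most brittle against markup changes, while the JSON route depends on the site exposing a structured endpoint.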


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951
CPC: G06F16/951
Inventor: 陈绪龙, 王军凯, 卢文琳
Owner: 安徽经邦软件技术有限公司