Data crawling implementation method based on distributed crawler technology

A data crawling implementation method based on distributed crawler technology, applied in the field of data crawling. It addresses problems of existing crawlers such as uniform results, the inability to understand semantic queries, and the inability to provide different search results for people with different backgrounds.
CN112487268A · Inactive · Publication Date: 2021-03-12 · 安徽经邦软件技术有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: 安徽经邦软件技术有限公司
Publication Date: 2021-03-12
Estimated Expiration: Not applicable · inactive patent

Abstract

The invention discloses a data crawling implementation method based on distributed crawler technology, which relates to the technical field of data crawling and comprises the following steps. S1, specifying a URL: starting from a given address, locating the address that ultimately holds the data to be captured and obtaining the corresponding index code. S2, initiating a request: splicing URLs from the codes obtained in step S1 and judging whether each URL is a page containing the required data; if it is a data page, calling back to the detail page, and otherwise continuing to search for data URLs in a loop. The method adopts focused crawling and scripts. The capture process uses regular expressions, JSON data, and data at several frequencies such as yearly, quarterly, and monthly, and the results are incrementally updated and batch-inserted into an Oracle database. Content is processed and filtered while web pages are captured, so that as far as possible only web page information relevant to the requirements is captured, and it is easy to check whether a string matches a given pattern.
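Steps S1 and S2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the page model (a dictionary with `is_data` and `codes` keys), the `fetch` and `on_detail` callables, and the `max_pages` limit are all hypothetical names introduced here so that the loop can run without network access.

```python
from typing import Any, Callable, Dict, List
from urllib.parse import urljoin

# Hypothetical page model (not from the source): an index page lists further
# index codes under "codes"; a data page is flagged with "is_data".
Page = Dict[str, Any]


def crawl(
    base_url: str,
    start_code: str,
    fetch: Callable[[str], Page],
    on_detail: Callable[[str, Page], None],
    max_pages: int = 1000,
) -> List[str]:
    """Follow index codes from start_code until data pages are reached."""
    pending = [start_code]          # S1: index codes still to be resolved
    seen = set()                    # codes already requested
    visited: List[str] = []
    while pending and len(visited) < max_pages:
        code = pending.pop()
        if code in seen:
            continue
        seen.add(code)
        url = urljoin(base_url, code)   # S2: splice the URL from the code
        page = fetch(url)               # S2: initiate the request
        visited.append(url)
        if page.get("is_data"):
            on_detail(url, page)        # data page: call back to detail handling
        else:
            pending.extend(page.get("codes", []))  # keep searching in a loop
    return visited
```

In practice `fetch` would wrap an HTTP client and a parser; passing it in as a parameter keeps the traversal logic separate from the request code and makes it easy to test against a fake site held in a dictionary.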

Description

Technical field

[0001] The invention relates to the technical field of data crawling, and in particular to a method for implementing data crawling based on distributed crawler technology.

Background art

[0002] Common types of crawlers in the prior art include general-purpose crawlers, focused crawlers, and incremental crawlers. Commonly used data parsing methods include regular expressions, bs4, and XPath.
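Of the parsing methods listed above, regular-expression matching can be sketched as follows, together with JSON parsing of a response body. This is an illustrative sketch only: the period formats (`2020`, `2020Q3`, `2020-03`), the `rows` and `period` keys, and the function names are assumptions, since the source does not specify the data layout.

```python
import json
import re
from typing import Dict, List, Optional

# Hypothetical period formats for yearly, quarterly, and monthly data.
PERIOD_PATTERNS = {
    "year": re.compile(r"\d{4}"),
    "quarter": re.compile(r"\d{4}Q[1-4]"),
    "month": re.compile(r"\d{4}-(0[1-9]|1[0-2])"),
}


def classify_period(text: str) -> Optional[str]:
    """Return the frequency whose pattern matches the whole string, if any."""
    for name, pattern in PERIOD_PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return None


def parse_records(body: str) -> List[Dict[str, object]]:
    """Parse a JSON body and keep only rows with a recognised period."""
    records = []
    for row in json.loads(body)["rows"]:
        frequency = classify_period(row["period"])
        if frequency is not None:
            # Copy the row and tag it with the detected frequency.
            records.append(dict(row, frequency=frequency))
    return records
```

Using `fullmatch` rather than `search` means a string such as `2020-13` is rejected outright instead of being accepted because its first four characters look like a year.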

[0003] However, general-purpose crawlers only provide text-related content (HTML, Word, PDF, etc.) and cannot provide multimedia files (music, pictures, videos) or binary files (programs, scripts). Their results are uniform: they cannot provide different search results for people from different background fields, and they cannot understand semantic queries. In addition, bs4 can only parse data in HTML format. In summary, the present invention provides a data crawling implementation method based on distributed crawler technology to solve the above problems.

Summary of the invention

[0004] Aiming at the deficiencies ...
