Data crawling implementation method based on distributed crawler technology
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- 安徽经邦软件技术有限公司
- Publication Date
- 2021-03-12
- Estimated Expiration
- Not applicable · inactive patent
Abstract
Description
Technical field
[0001] The invention relates to the technical field of data crawling, and in particular to a data crawling implementation method based on distributed crawler technology.
Background art
[0002] Common crawler types in the prior art include general-purpose crawlers, focused crawlers, and incremental crawlers. Commonly used data parsing methods include regular expressions, bs4 (BeautifulSoup), and XPath.
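As an illustration of the parsing methods named above, the following minimal sketch extracts the same links from a small page, once with a regular expression and once with an XPath-style query. It is not taken from the patent: the sample markup, `PAGE`, and both function names are hypothetical. To stay within the Python standard library, `xml.etree.ElementTree` (which supports only a subset of XPath and requires well-formed markup) stands in for a full XPath engine, and bs4 is not shown because it is a third-party package.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not from the patent).
PAGE = """<html><body>
<ul id="news">
  <li><a href="/a/1">First item</a></li>
  <li><a href="/a/2">Second item</a></li>
</ul>
</body></html>"""


def links_by_regex(page: str) -> list[tuple[str, str]]:
    """Extract (href, text) pairs by matching the raw text with a regex."""
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', page)


def links_by_xpath(page: str) -> list[tuple[str, str]]:
    """Extract (href, text) pairs by querying the parsed element tree.

    ElementTree implements only a limited XPath subset and rejects
    malformed markup; real-world HTML usually needs a tolerant parser.
    """
    root = ET.fromstring(page)
    anchors = root.findall(".//ul[@id='news']/li/a")
    return [(a.get("href"), a.text) for a in anchors]


if __name__ == "__main__":
    print(links_by_regex(PAGE))
    print(links_by_xpath(PAGE))
```

The regular expression works on the raw string and depends on the exact attribute layout, while the XPath-style query works on the document structure and is therefore less sensitive to formatting changes in the markup.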
[0003] However, general-purpose crawlers only provide text-based content (HTML, Word, PDF, etc.) and cannot provide multimedia files (music, pictures, videos) or binary files (programs, scripts). They also cannot provide different search results for people from different backgrounds and cannot understand semantic queries, and bs4 can only parse data in HTML format. In summary, the present invention provides a data crawling implementation method based on distributed crawler technology to solve the above problems.
Summary of the invention
[0004] Aiming at the deficiencies ...