Universal low-code crawler method and system for news blog website

A general low-code crawler technology for news and blog websites. It addresses the high development and learning costs, high debugging difficulty, and extra memory consumption of existing approaches. Its effects are improved development and maintenance efficiency, versatility across websites, and improved crawling efficiency.

Pending Publication Date: 2022-05-13
UNIV OF ELECTRONICS SCI & TECH OF CHINA
5 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0009] However, this method is not perfect in practical applications. A mature framework carries high development and learning costs. Because a crawler framework must accommodate the needs of many kinds of crawlers, many functions still have to be written by the developer for a specific field or requirement, so the cost of developing crawler programs for multiple websites remains high. In addition, framework-based development may bring higher memory consumption and greater debugging difficulty.

Examples

Embodiment 1

[0063] This embodiment proposes a general low-code crawler system for news blog websites, developed on Node.js. Its architecture, shown in Figure 2, comprises a configuration loading module, a page resource loading module, a data extraction module, a data storage module, an asynchronous multi-task management module, and a log and progress management module, where:

[0064] As shown in Figure 3, the page resource loading module includes an intelligent URL splicing module, a dynamic page loading module based on puppeteer, and a static page loading module based on axios.
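
The listing does not show how URL splicing or loader selection is implemented. The following is a minimal sketch in JavaScript (chosen because the embodiment targets Node.js), assuming that splicing means resolving relative links against the page they were found on. The function names `spliceUrl` and `chooseLoader` and the `dynamic` configuration flag are hypothetical and do not come from the patent text.

```javascript
// Sketch only: helper names and the `dynamic` flag are assumptions,
// not the patent's actual implementation.

/**
 * Resolve a possibly relative link against the URL of the page it was found on.
 * Returns null for links that cannot be crawled (empty, javascript:, mailto:, ...).
 */
function spliceUrl(href, pageUrl) {
  if (typeof href !== 'string' || href.trim() === '') return null;
  let resolved;
  try {
    // The WHATWG URL class is a global in Node.js and handles "/a", "../a",
    // "//host/a" and absolute links uniformly.
    resolved = new URL(href.trim(), pageUrl);
  } catch {
    return null; // malformed link or malformed base URL
  }
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return null;
  resolved.hash = ''; // a fragment does not change the fetched resource
  return resolved.href;
}

/**
 * Pick which loader a site should use. A dynamic (script-rendered) site would
 * go to the puppeteer-based loader, a static one to the axios-based loader.
 */
function chooseLoader(siteConfig) {
  return siteConfig.dynamic ? 'puppeteer' : 'axios';
}
```

Stripping the fragment is a design choice of this sketch: it keeps two links to the same article from being queued twice.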

[0065] As shown in Figure 4, the data extraction module includes a selector type identification module, a data object generation module, a CSS-based data extraction module, and an XPath-based data extraction module.
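
One plausible way to identify the selector type is a syntactic heuristic: XPath expressions usually begin with `/`, `./`, `../`, or `(`, none of which can open a CSS selector. The sketch below assumes that heuristic. The names `identifySelectorType` and `generateDataObject` are hypothetical, and the actual CSS and XPath extractors are passed in as functions rather than implemented here.

```javascript
// Sketch only: the heuristic and the function names are assumptions,
// not the patent's actual implementation.

/** Classify a selector string as 'xpath' or 'css' from its leading characters. */
function identifySelectorType(selector) {
  if (typeof selector !== 'string' || selector.trim() === '') {
    throw new TypeError('selector must be a non-empty string');
  }
  // "/", "./", "../" and "(" start XPath expressions; a CSS class selector
  // such as ".title" starts with "." followed by a name, never by "/".
  return /^(\/|\.\/|\.\.\/|\()/.test(selector.trim()) ? 'xpath' : 'css';
}

/**
 * Build one data object from a field-to-selector mapping.
 * `extractors` supplies one function per selector type, for example
 * { css: (selector) => ..., xpath: (selector) => ... }.
 */
function generateDataObject(fieldSelectors, extractors) {
  const record = {};
  for (const [field, selector] of Object.entries(fieldSelectors)) {
    const type = identifySelectorType(selector);
    record[field] = extractors[type](selector);
  }
  return record;
}
```

Because the type is inferred per field, a single site configuration could mix CSS and XPath selectors without declaring which is which.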

[0066] As shown in Figure 5, the data storage module includes a data object verification module, a JSON file storage module, a CSV file storage module,...
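
The paragraph above is truncated in the source, so the full list of storage back ends is unknown. As a rough illustration of the three named parts only, the sketch below verifies a data object against a list of required fields and serializes records to JSON and CSV text. All function names and the notion of "required fields" are assumptions, and writing the text to disk is left out.

```javascript
// Sketch only: names and the required-field check are assumptions,
// not the patent's actual implementation.

/** Return the names of required fields that are missing or blank. */
function verifyDataObject(record, requiredFields) {
  return requiredFields.filter((field) => {
    const value = record[field];
    return value === undefined || value === null || String(value).trim() === '';
  });
}

/** Quote a CSV field when it contains a comma, quote, or line break. */
function escapeCsvField(value) {
  const text = value === undefined || value === null ? '' : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

/** Serialize records to CSV text with a header row and CRLF line endings. */
function toCsv(records, columns) {
  const lines = [columns.map(escapeCsvField).join(',')];
  for (const record of records) {
    lines.push(columns.map((column) => escapeCsvField(record[column])).join(','));
  }
  return lines.join('\r\n') + '\r\n';
}

/** Serialize records to pretty-printed JSON text. */
function toJson(records) {
  return JSON.stringify(records, null, 2);
}
```

Article titles and bodies routinely contain commas, quotes, and line breaks, which is why the CSV path quotes fields instead of joining raw values.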

Abstract

The invention discloses a universal low-code crawler method and system for news blog websites, belonging to the technical field of web crawlers. The method comprises the following core steps: creating configuration files for all websites to be crawled; selecting an operation mode and loading the configuration of the target website; extracting the links and category names of all category navigation entries in the navigation bar of the target website; for each category navigation link, extracting all article links from the article list page and adding them to a to-be-crawled list; extracting each item of information from the article page of each article in the to-be-crawled list and storing it persistently; and repeating these steps until all crawling tasks are completed. The capabilities provided by the crawler system mainly include custom function extension, multi-task management, multi-mode persistent storage, compatibility with different types of websites, and log and progress management. With this method, article crawling that meets basic requirements can be completed merely by adding the configuration of each website to be crawled, which greatly improves the efficiency of crawler development and maintenance.
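
The core steps in the abstract can be sketched as a configuration object plus a generic crawl loop. This is an illustrative outline, not the patented implementation: every configuration key, the `crawlSite` function, and the injected `loadPage`, `extractLinks`, `extractFields`, and `store` functions are hypothetical names, and the de-duplication of article links is an added assumption.

```javascript
// Sketch only: all key and function names are assumptions.

// Hypothetical per-site configuration: the only thing a user would write.
const exampleConfig = {
  name: 'example-news',
  homepage: 'https://news.example.com/',
  dynamic: false, // static pages; a dynamic site would set this to true
  selectors: {
    navLinks: 'nav a',        // category links in the navigation bar
    articleLinks: '.list a',  // article links on a list page
    fields: { title: 'h1', content: '//div[@id="content"]' },
  },
};

/**
 * Run the steps from the abstract for one site. Page loading, extraction,
 * and storage are injected so the loop itself stays site-independent.
 * Returns the number of articles stored.
 */
async function crawlSite(config, deps) {
  const { loadPage, extractLinks, extractFields, store } = deps;

  // Extract category links and names from the navigation bar.
  const home = await loadPage(config.homepage);
  const categories = extractLinks(home, config.selectors.navLinks);

  // Collect article links from each category's list page.
  const toCrawl = [];
  const seen = new Set(); // assumption: skip articles listed in two categories
  for (const category of categories) {
    const listPage = await loadPage(category.url);
    for (const link of extractLinks(listPage, config.selectors.articleLinks)) {
      if (seen.has(link.url)) continue;
      seen.add(link.url);
      toCrawl.push({ url: link.url, category: category.name });
    }
  }

  // Extract each article's fields and store the result persistently.
  let stored = 0;
  for (const task of toCrawl) {
    const page = await loadPage(task.url);
    const fields = extractFields(page, config.selectors.fields);
    await store({ ...fields, url: task.url, category: task.category });
    stored += 1;
  }
  return stored;
}
```

Under this design, supporting a new site means adding another configuration object, while the loop stays unchanged, which is the low-code property the abstract describes.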

Description

Technical field

[0001] The invention belongs to the technical field of web crawlers, and in particular relates to a general low-code crawler method and system for news blog websites.

Background technique

[0002] In the Internet era of information explosion, it is no longer feasible to rely on manual data collection, and web crawler programs have become an important means of obtaining various network data resources. Among them, crawling various news/article data (note: news and articles refer to the same concept in the patent description of the present invention) from various news blog websites (especially news websites) is a major data collection method. The data can usually be used for database index construction, news resource integration, data mining or AI model training, etc.

[0003] At present, the mainstream crawling technology direction is divided into the following two types:

[0004] The first is to develop specific crawlers for different websites.

[0005] Dif...

Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F16/951, G06F16/954
CPC: G06F16/951, G06F16/954
Inventors: 杨国武, 谈振伟, 杜佩佩, 孙相鹏, 董广县
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA