Real-time information crawler method and device based on intelligent page analysis, and equipment

Page analysis and information technology, applied in the fields of real-time information crawler methods, devices, computer equipment, and storage media. It addresses problems such as heavy development workload, missed content, and crawler failures, with the effect of improving crawler accuracy and efficiency.

Pending Publication Date: 2022-01-28
宁波深擎信息科技有限公司 +1

AI Technical Summary

Problems solved by technology

Developing a crawler system raises several problems. Web page tag structures lack a unified standard, so parsing code must be written for each page's tag style to extract structured information; this requires a large amount of development work and is difficult to maintain. When the structure of the original web page is revised, the crawler fails. Regulatory requirements also demand traceability: structured information such as title, release time, source, and content must remain highly consistent with the original, so when the title or text of the original web page changes, the change must be reflected in a timely manner.
[0003] However, current crawler technology has several flaws. First, information such as title, source, and publishing time cannot be obtained. Second, the body text of short news articles is not parsed correctly, and the crawler instead captures company profiles, navigation pages, navigation lists, and similar content. Third, high accuracy cannot be achieved: the output must neither contain page advertisements, navigation, and other junk information nor omit any paragraphs. Fourth, web pages that fetch backend data through asynchronous interfaces and render it dynamically cannot be parsed automatically from their HTML text. Finally, changes to the original page (such as retraction or content revision) cannot be handled, which lowers accuracy.


Examples

Embodiment Construction

[0050] In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not intended to limit the present application.

[0051] In one embodiment, as shown in Figure 1, a real-time information crawling method based on intelligent page parsing is provided, including the following steps:

[0052] Step 102: obtain a list of pages to be crawled. The list includes the URL of the real-time information list page and the XPath of the details list page; the URL of each detail page is obtained from these two values.
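Step 102 can be illustrated with a minimal sketch. It assumes the list page has already been downloaded and is well-formed markup, so that the standard-library ElementTree XPath subset can be used; a real crawler would more likely need a tolerant HTML parser. The function name, the sample markup, the example.com URLs, and the XPath expression are all hypothetical and do not come from the patent text.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def extract_detail_urls(list_page_url, list_page_markup, link_xpath):
    """Return absolute detail-page URLs found by an XPath on a list page.

    Assumption: list_page_markup is well-formed (XHTML-like), because
    ElementTree only parses XML and supports a limited XPath subset.
    """
    root = ET.fromstring(list_page_markup)
    urls = []
    for anchor in root.findall(link_xpath):
        href = anchor.get("href")
        if href:
            # Resolve relative links against the list page URL.
            urls.append(urljoin(list_page_url, href))
    return urls


# Hypothetical list page: one news list plus a navigation block to be ignored.
SAMPLE_LIST_PAGE = """<html><body>
<ul class="news-list">
  <li><a href="/news/1001.html">First item</a></li>
  <li><a href="https://example.com/news/1002.html">Second item</a></li>
</ul>
<div class="nav"><a href="/about.html">About</a></div>
</body></html>"""

# Hypothetical per-site configuration entry: list page URL plus XPath.
detail_urls = extract_detail_urls(
    "https://example.com/list/index.html",
    SAMPLE_LIST_PAGE,
    ".//ul[@class='news-list']/li/a",
)
```

Keeping the XPath as per-site configuration, rather than in code, means a site redesign would only require updating one configuration entry in this sketch.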

[0053] The real-time information list page contains various pages, such as a...

Abstract

The invention relates to a real-time information crawler method and device based on intelligent page analysis, and to computer equipment. The method comprises the following steps: parsing the URLs of detail pages according to the XPath configuration of each website, and writing the queue of pages to be crawled into a database table through an SQL query component; crawling the detail page URLs with a crawler pipeline; rendering each page through a preset automatic page analysis algorithm or by calling a headless browser; extracting the HTML document of the detail page to obtain the body text, title, source, and release time of the detail page article; calculating the MD5 of the detail page article and writing it into the database, where the body text, title, source, and release time of the article together with its MD5 form a page snapshot at crawl time; and polling the crawled detail pages according to a preset timed task and the crawl-time page snapshots to obtain a list of changed articles. Adopting the method can improve crawler accuracy.
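The snapshot and polling steps in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the abstract does not say exactly which bytes are hashed, so the sketch hashes the UTF-8 body text, and it keeps snapshots in an in-memory dictionary instead of a database. The names Snapshot, make_snapshot, and find_changed are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Crawl-time page snapshot: metadata plus an MD5 of the article."""
    url: str
    title: str
    source: str
    published_at: str
    md5: str


def make_snapshot(url, title, source, published_at, body):
    # Assumption: the MD5 is computed over the UTF-8 encoded body text.
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    return Snapshot(url, title, source, published_at, digest)


def find_changed(stored, fresh):
    """Return URLs whose fresh snapshot differs from the stored one.

    stored: dict mapping URL to the Snapshot saved at crawl time.
    fresh:  iterable of Snapshots produced by the current polling pass.
    """
    changed = []
    for snap in fresh:
        old = stored.get(snap.url)
        if old is not None and old != snap:
            changed.append(snap.url)
    return changed


# Hypothetical data: one article is edited between crawl and poll.
stored = {
    s.url: s
    for s in [
        make_snapshot("https://example.com/a", "Title A", "Agency", "2022-01-01", "original text"),
        make_snapshot("https://example.com/b", "Title B", "Agency", "2022-01-02", "stable text"),
    ]
}
fresh = [
    make_snapshot("https://example.com/a", "Title A", "Agency", "2022-01-01", "revised text"),
    make_snapshot("https://example.com/b", "Title B", "Agency", "2022-01-02", "stable text"),
]
changed_urls = find_changed(stored, fresh)
```

Comparing the whole snapshot rather than only the MD5 also flags a change to the title, source, or release time when the body is untouched.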

Description

Technical field

[0001] The present application relates to the technical field of data processing, in particular to a real-time information crawler method, device, computer equipment, and storage medium based on intelligent page analysis.

Background

[0002] Official media release real-time information on web pages. Crawler technology is needed to extract structured information such as the article title, body text, release time, and source / author from these pages and provide it to end users through an in-house app. Developing a crawler system raises several problems. Web page tag structures lack a unified standard, so parsing code must be written for each page's tag style to extract structured information; this requires a large amount of development work and is difficult to maintain. When the structure of the original web page is revised, the crawler fails; due to regulatory requirement...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/955, G06F16/84, G06F16/25
CPC: G06F16/951, G06F16/955, G06F16/86, G06F16/254
Inventor: 徐毅
Owner: 宁波深擎信息科技有限公司