
Theme web portal crawler method

A theme portal website crawler technology, applied in the fields of network data indexing, network data retrieval, and special data processing applications. It addresses problems such as increased database load and wasted CPU resources, and makes analysis more convenient.

Pending Publication Date: 2021-05-04
大连海关技术中心

AI Technical Summary

Problems solved by technology

[0004] Current web theme crawler systems download webpages containing the same content multiple times, which wastes a large amount of CPU resources and increases the load of database access. To solve these problems, the present invention proposes an efficient deduplication strategy for two stages, content capture and incremental update, which is superior to traditional methods in terms of performance and scalability.



Examples


Embodiment

[0035] As shown in Figures 1-5, a theme portal website crawler method includes:

[0036] Web page link analysis and extraction: design regular expressions according to the theme website to identify parent page and child page links, and judge whether a link belongs to the theme website; only links within the theme website are processed. If a page is identified as a parent page, extract the subpage links it contains; if it is identified as a subpage, extract the body content of the subpage. The regular expression used to determine whether an extracted link is a parent page or a subpage is: http://www\\.(.*\\.)?agri\\.cn/.*(htm)$; the regular expression used to extract all parent pages and child pages is: http://www\\.(.*\\.)?agri\\.cn/.+, which can also be used to judge whether a link belongs to the theme website;
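The link classification described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation. It assumes that the doubled backslashes in the quoted expressions are string-literal escapes (so the effective patterns use a single backslash), that in-site links ending in "htm" are subpages, and that all other in-site links are parent pages. The function name `classify_link`, its return labels, and the example URLs are hypothetical.

```python
import re

# Assumption: the doubled backslashes in the source are string-literal escapes,
# so the effective patterns use a single backslash before each dot.
SITE_LINK = re.compile(r"http://www\.(.*\.)?agri\.cn/.+")
HTM_LINK = re.compile(r"http://www\.(.*\.)?agri\.cn/.*(htm)$")


def classify_link(url: str) -> str:
    """Return 'external', 'subpage', or 'parent' for a URL (hypothetical labels)."""
    # Links outside the theme website are not processed at all.
    if not SITE_LINK.fullmatch(url):
        return "external"
    # Assumption: in-site links ending in "htm" are subpages whose body is extracted.
    if HTM_LINK.fullmatch(url):
        return "subpage"
    # Assumption: every other in-site link is a parent page to be scanned for links.
    return "parent"


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the three branches.
    for url in (
        "http://www.agri.cn/news/",
        "http://www.agri.cn/news/item_1.htm",
        "http://www.example.com/item_1.htm",
    ):
        print(url, classify_link(url))
```

Checking site membership first means the more specific subpage pattern is only evaluated for links that will actually be processed.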

[0037] Web page content extraction: extract the text content under the subpage link, and store the extrac...
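As a rough illustration of this step, the sketch below extracts visible text from a subpage with Python's standard `html.parser` module and keeps it in class-level state, which is one way to approximate the "static class" mentioned in the abstract. The names `TextExtractor`, `PageStore`, and `extract_text` are hypothetical, and skipping `<script>` and `<style>` content is an assumption; the source does not say how body text is isolated.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


class PageStore:
    """Hypothetical stand-in for a 'static class': state lives on the class itself."""

    texts = {}

    @classmethod
    def put(cls, url, text):
        cls.texts[url] = text


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

A caller would download a subpage, pass its HTML to `extract_text`, and record the result with `PageStore.put(url, text)` until it is written to persistent storage.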


Abstract

The invention relates to the technical field of network information capture, in particular to a theme portal website crawler method. The method comprises the following steps: web page link analysis and extraction, in which a regular expression is designed according to the theme website to identify parent page links and child page links; web page content extraction, in which the text content under each sub-page link is extracted and stored in a static class; data persistence storage, which stores the text content extracted from each sub-page link; and incremental capture, in which updated content in the theme website is captured by re-extracting the links on its home page at each incremental update and processing only the new links. The pages obtained by the crawler program contain almost no duplicates, the required theme content can be obtained accurately, web pages containing the same content are effectively prevented from being downloaded multiple times, waste of CPU resources is avoided, and the load caused by database access is reduced.
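The incremental capture step can be sketched as a filter over the home page links. This is a minimal illustration under stated assumptions: an in-memory set stands in for whatever record of already-captured links the actual system keeps, and `incremental_update`, `seen`, and `process` are hypothetical names that do not appear in the patent.

```python
from typing import Callable, Iterable, List, Set


def incremental_update(
    home_links: Iterable[str],
    seen: Set[str],
    process: Callable[[str], None],
) -> List[str]:
    """Process only links not seen in earlier runs; return the newly processed links."""
    new_links = []
    for link in home_links:
        if link in seen:
            continue  # already captured in an earlier run, so skip the download
        process(link)
        seen.add(link)
        new_links.append(link)
    return new_links
```

Each update run passes in the freshly extracted home page links together with the same `seen` set, so a link that was captured before is never downloaded again. A persistent store would be needed for the set to survive restarts.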

Description

Technical field

[0001] The invention relates to the technical field of network information capture, in particular to a theme portal website crawler method.

Background technique

[0002] In the open environment of the Internet, the explosive growth of shared network information has provided people with a large number of information resources. However, this has also brought huge challenges: there are many types of information, and finding the information one needs has become more and more difficult. Search engines emerged in response. Searching network information through keywords makes it much easier for people to find information effectively and can meet most information needs. However, most search engines are based on horizontal search, and the main disadvantage of this method is that the returned search results have low accuracy and contain a lot of interfering information. With the diversification of information, this search strategy can no longer meet the specific needs of users. ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/955
Inventor: 徐静, 韦婷婷, 包先雨, 黄大亮, 徐天, 赵清月, 李妍
Owner: 大连海关技术中心