Link deduplication method, device, equipment and storage medium based on web crawler

A web crawler link deduplication technique based on link features, applied in the Internet field. It addresses the reduced duplicate-checking accuracy, hash conflicts, and high memory occupancy of existing methods, so as to lower the misjudgment rate, improve crawler performance, and improve user experience.

Active Publication Date: 2022-02-08
SOUTH CENTRAL UNIVERSITY FOR NATIONALITIES
Cites: 7; Cited by: 0

AI Technical Summary

Problems solved by technology

However, as the number of URLs grows, memory occupancy keeps rising, and conflicts, even at low probability, reduce the accuracy of duplicate checking; both effects seriously degrade web crawler performance.
[0004] Although storage deduplication based on a hash algorithm is fast and accurate, it requires a well-designed hash function and a maintained hash table.
In addition, as the number of crawled pages grows, memory consumption becomes too high, which seriously degrades web crawler performance.
[0005] Although Bloom filter-based link deduplication solves the space-complexity problem, it produces some false positives and cannot delete existing elements.
Moreover, the more elements the filter holds, the higher its false positive rate, which seriously degrades web crawler performance.
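
The growth of the false positive rate described in [0005] can be illustrated with the standard textbook approximation for a Bloom filter with m bits, k hash functions, and n inserted elements, p ≈ (1 - e^(-kn/m))^k. The sketch below is not taken from the patent; the function name and the sizing values (1,000,000 bits, 7 hash functions) are assumptions chosen only for illustration.

```python
import math


def bloom_false_positive_rate(n_items, m_bits, k_hashes):
    """Standard approximation p = (1 - e^(-k*n/m))^k for a plain Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Hypothetical sizing: 1,000,000 bits and 7 hash functions (illustrative only).
M_BITS = 1_000_000
K_HASHES = 7

# With the filter size fixed, the estimated rate rises as more URLs are inserted.
rates = [bloom_false_positive_rate(n, M_BITS, K_HASHES)
         for n in (10_000, 100_000, 500_000)]
for n, p in zip((10_000, 100_000, 500_000), rates):
    print(f"n={n}: estimated false positive rate {p:.6g}")
```

With the filter size held fixed, each additional URL sets more bits, so the estimate rises monotonically with n. This is the effect the paragraph above attributes to a growing number of elements.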


Examples

Detailed Description of the Embodiments

[0056] It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0057] Refer to Figure 1, which is a schematic structural diagram of the web crawler-based link deduplication device in the hardware operating environment involved in the embodiments of the present invention.

[0058] As shown in Figure 1, the web crawler-based link deduplication device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 provides connection and communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless...

Abstract

The invention relates to the technical field of the Internet, and discloses a web crawler-based link deduplication method, device, equipment and storage medium. The method includes: when a data capture request for an agricultural product to be analyzed is received, extracting the first URL link of the platform to be accessed from the data capture request; sending an access request to the platform to be accessed according to the first URL link; after receiving the platform's response to the access request, capturing the data information in the page corresponding to the first URL link; parsing the data information to obtain the second URL links embedded in the page, and adding the second URL links to the URL queue to be crawled; and using a counting Bloom filter based on link features, combined with multiple hashes, to jointly deduplicate the second URL links in the URL queue to be crawled. By optimizing the link deduplication mode, the invention improves the performance of the web crawler, so that the crawler can quickly obtain the information people need, and improves user experience.
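
As a rough illustration of the last two steps in the abstract (parsing embedded links and deduplicating them before they enter the crawl queue), the following is a minimal sketch and not the patented implementation. The class and function names, the filter size, the use of MD5 and SHA-1 as base hashes, and the choice of fragment stripping as the only URL normalization are all assumptions. In particular, the patent's "link feature" is not defined in the text available here, so it is not modeled.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class CountingBloomFilter:
    """Bloom filter whose slots are counters, so elements can be removed."""

    def __init__(self, num_counters=1 << 16, num_hashes=4):
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item):
        # Derive k slot indexes from two base hashes (double hashing).
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.md5(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest()[:8], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Decrement only if the item appears present, so counters stay >= 0.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


class LinkExtractor(HTMLParser):
    """Collects absolute, fragment-free URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def enqueue_new_links(page_html, base_url, seen, queue):
    """Append links from the page to the queue unless the filter reports them."""
    parser = LinkExtractor(base_url)
    parser.feed(page_html)
    for url in parser.links:
        if url not in seen:
            seen.add(url)
            queue.append(url)


# Hypothetical page content and base URL, for illustration only.
seen = CountingBloomFilter()
queue = deque()
page = ('<a href="/a">A</a> <a href="/a#top">A again</a> '
        '<a href="http://example.com/b">B</a>')
enqueue_new_links(page, "http://example.com/", seen, queue)
print(list(queue))
```

Because each slot is a counter rather than a single bit, `remove` can undo an earlier `add`, which a plain Bloom filter cannot do. The filter can still report false positives, so a URL may occasionally be skipped even though it was never crawled.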

Description

Technical field

[0001] The present invention relates to the technical field of the Internet, in particular to a web crawler-based link deduplication method, device, equipment and storage medium.

Background

[0002] When crawling web pages, web crawlers inevitably encounter pages that have already been downloaded. To keep repeated crawling from lowering crawler efficiency and wasting server resources, Uniform Resource Locators (URLs) must be filtered and deduplicated. At present, common link deduplication methods include link compression deduplication based on the message-digest algorithm 5 (MD5), storage deduplication based on a hash algorithm, and link deduplication based on a Bloom filter.

[0003] Although the MD5-based link compression deduplication method solves the problem that a Uniform Resource Locator (Uniform Resource Locator,...
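
The MD5-based link compression deduplication named in the background can be sketched as follows. This is an assumed, minimal form of the general technique and is not drawn from the patent text: each URL is reduced to its 16-byte MD5 digest, and only the digests are kept in an in-memory set. The class name `Md5UrlSet` and the example URLs are hypothetical.

```python
import hashlib


def md5_fingerprint(url):
    """Return the 16-byte MD5 digest of a URL, whatever the URL's length."""
    return hashlib.md5(url.encode("utf-8")).digest()


class Md5UrlSet:
    """Stores fixed-size URL digests instead of the full URL strings."""

    def __init__(self):
        self._digests = set()

    def add_if_new(self, url):
        # Returns True if the URL had not been recorded before, else False.
        digest = md5_fingerprint(url)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

    def __len__(self):
        return len(self._digests)


# Hypothetical URLs, for illustration only.
visited = Md5UrlSet()
first = visited.add_if_new("http://example.com/products?page=1")
repeat = visited.add_if_new("http://example.com/products?page=1")
other = visited.add_if_new("http://example.com/products?page=2")
print(first, repeat, other, len(visited))
```

Each stored entry has a fixed size, but the set still grows by one digest per distinct URL, which is consistent with the memory concern raised about this approach as the number of URLs increases.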

Application Information

Patent Type & Authority: Patents (China)
IPC (8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566, G06F16/955
Inventors: 雷建云, 王锦群, 郑禄, 毛腾跃, 孙翀, 马尧, 张蕾
Owner: SOUTH CENTRAL UNIVERSITY FOR NATIONALITIES