Domain name resolution buffering method for web crawler

This domain name resolution and web crawler technology, applied to domain name resolution caching for web crawlers, addresses problems such as low space efficiency, degraded crawler performance, and a mismatch with the application characteristics of distributed crawlers. Its effects are reduced memory usage, improved domain name resolution performance, and improved overall crawler performance.

Pending Publication Date: 2022-03-08
SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

There are two problems with this data structure. First, space efficiency is low. Distributed crawlers need to process a large number of domain names. If the JVM's cache data structure is used to store these first-level domain names, the redundant storage of TLDs alone needs 5 × 3 × 10^8 = 1.5 GB. Because a website usually has multiple sub-domain names, TLD redundancy then requires even more space (assuming that each website has an average of 3 sub-domain names, TLD redundancy requires close to 4.5 GB of memory).
Second, the data in this structure is shared only within a single JVM, which does not match the multi-node sharing and distributed parallelism of distributed crawlers. Concurrent access to the data structure must also be serialized with synchronization locks, so it cannot support high concurrency well, which further hurts crawler performance.
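The space estimate in the first problem can be checked with a short calculation. The sketch below is illustrative only: the class and method names are hypothetical, and the inputs (5 bytes per redundantly stored TLD string and 3 × 10^8 first-level domain names) are assumptions chosen because they reproduce the stated totals of 1.5 GB and close to 4.5 GB.

```java
// Hypothetical helper, not part of the patent: reproduces the TLD-redundancy estimate.
final class TldRedundancyEstimate {

    /** Bytes spent when the TLD suffix is stored once per cached host name. */
    static long redundantBytes(long bytesPerTld, long firstLevelDomains, long hostsPerDomain) {
        return bytesPerTld * firstLevelDomains * hostsPerDomain;
    }

    public static void main(String[] args) {
        long bytesPerTld = 5L;               // assumption: average size of a stored TLD string
        long firstLevelDomains = 300_000_000L; // assumption: 3 x 10^8 first-level domain names

        // One cached entry per first-level domain name.
        System.out.println(redundantBytes(bytesPerTld, firstLevelDomains, 1L) + " bytes");
        // Three sub-domain names per website on average, as in the text.
        System.out.println(redundantBytes(bytesPerTld, firstLevelDomains, 3L) + " bytes");
    }
}
```

Under these assumptions the two cases come to 1.5 × 10^9 and 4.5 × 10^9 bytes, which matches the figures quoted above when 1 GB is taken as 10^9 bytes.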

Embodiment Construction

[0025] To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative work fall within the scope of protection of the present invention.

[0026] Based on an in-depth analysis of the DNS working mechanisms of the operating system and the JVM, the present invention designs and implements DQCache (DistributedQuickCache), an efficient DNS caching mechanism tailored to the parallelism of distributed crawlers and the homogeneity of tasks across nodes. In the double-cache method, the forward cache uses a com...

Abstract

The invention provides a domain name resolution buffering method for web crawlers, belonging to the technical field of data processing. Based on an in-depth analysis of the domain name resolution mechanisms of the operating system and the JVM (Java Virtual Machine), and taking into account characteristics of distributed crawlers such as parallelization and homogeneous tasks across nodes, the method uses a positive and negative double cache to buffer domain name resolution for web crawlers. A specific data structure and expiration strategy are designed for each of the two caches. Experiments show that the method effectively improves domain name resolution performance, reduces the memory footprint of each crawler node and the impact of domain name resolution requests on network bandwidth, and improves the overall performance of the distributed crawler.
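The idea of a positive and negative double cache with separate expiration strategies can be illustrated with a minimal sketch. This is not the patent's DQCache: the specific data structures and expiration strategies are not given in the text available here, so the sketch assumes plain `ConcurrentHashMap` instances, a fixed time-to-live per cache, and lazy eviction on read. The class name, method names, and the injected clock are all hypothetical, and the sketch covers a single JVM only, not multi-node sharing.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch: one cache for successful lookups, one for failed lookups,
// each with its own time-to-live. Lock-free maps avoid a global synchronized lock.
final class DualDnsCacheSketch {

    private static final class Entry {
        final String ip;
        final long expiresAtMillis;
        Entry(String ip, long expiresAtMillis) {
            this.ip = ip;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> positive = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> negative = new ConcurrentHashMap<>();
    private final long positiveTtlMillis;
    private final long negativeTtlMillis;
    private final LongSupplier clock; // injected so expiry can be tested without sleeping

    DualDnsCacheSketch(long positiveTtlMillis, long negativeTtlMillis, LongSupplier clock) {
        this.positiveTtlMillis = positiveTtlMillis;
        this.negativeTtlMillis = negativeTtlMillis;
        this.clock = clock;
    }

    /** Records a successful resolution and clears any earlier failure for the host. */
    void putResolved(String host, String ip) {
        negative.remove(host);
        positive.put(host, new Entry(ip, clock.getAsLong() + positiveTtlMillis));
    }

    /** Records a failed resolution so the crawler can skip repeated lookups for a while. */
    void putFailed(String host) {
        positive.remove(host);
        negative.put(host, clock.getAsLong() + negativeTtlMillis);
    }

    /** Returns the cached IP if a live positive entry exists; expired entries are dropped. */
    Optional<String> lookup(String host) {
        Entry e = positive.get(host);
        if (e == null) {
            return Optional.empty();
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            positive.remove(host, e); // remove only if no newer entry replaced it
            return Optional.empty();
        }
        return Optional.of(e.ip);
    }

    /** Returns true while a failure recorded for the host has not yet expired. */
    boolean isKnownFailure(String host) {
        Long expiresAt = negative.get(host);
        if (expiresAt == null) {
            return false;
        }
        if (clock.getAsLong() >= expiresAt) {
            negative.remove(host, expiresAt);
            return false;
        }
        return true;
    }
}
```

A caller would typically construct it as `new DualDnsCacheSketch(60_000L, 5_000L, System::currentTimeMillis)`, giving failed lookups a shorter lifetime than successful ones so that transient resolver errors are retried sooner. Those TTL values are illustrative, not taken from the patent.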

Description

Technical Field

[0001] The invention relates to the technical field of data processing, in particular to a domain name resolution buffering method for web crawlers.

Background

[0002] Search engines are currently the most efficient way to get information from the Internet. As the basis of search engines, distributed crawlers have been extensively researched and applied. They usually consist of multiple components such as URL analyzers, DNS caches, and rate control. When a crawler fetches a web page, it needs to use DNS (Domain Name System) to convert the domain name of the target host into an IP address. Studies have shown that this step is one of the main performance bottlenecks of crawlers.

[0003] In-depth research has been carried out, both domestically and internationally, on the strategy and performance of single-machine crawlers. Research on distributed crawlers mainly focuses on crawler strategies such as task scheduling and resource allocation. Research on the performa...

Application Information

IPC(8): G06F16/951, G06F16/901, G06F12/0895, G06F12/0891, H04L61/30, H04L61/4511, H04L61/58
CPC: G06F16/951, G06F16/9014, G06F12/0895, G06F12/0891, G06F2212/1016, G06F2212/154
Inventor: 李涛, 孙思清, 孙兴艳
Owner: SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD