Crawler system IO optimization method and device

A crawler system IO optimization method and device, applied in the field of crawler system IO optimization, which address the low IO efficiency and reduced retrieval efficiency of storing results per crawler task.

Inactive Publication Date: 2018-04-20
广州探迹科技有限公司

AI Technical Summary

Problems solved by technology

[0004] The embodiments of the present invention provide a crawler system IO optimization method and device to solve the problem that existing result storage, which uses the crawler task as its unit, has low IO efficiency and degrades retrieval efficiency.



Examples


Embodiment Construction

[0025] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

[0026] Figure 1 shows an exemplary flowchart of a crawler system IO optimization method provided by an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:

[0027] Step 101, the first result processor caches the received first crawler, wherein the first crawler includes at least one crawling result, and the first res...



Abstract

The invention discloses a crawler system IO optimization method and device, relates to the field of software engineering, and aims to solve the problem that existing result storage, carried out with crawler tasks as the unit, has low IO efficiency and degrades retrieval efficiency. The method comprises the following steps: a first result processor caches the received first crawlers; when it determines that the number of cached crawling results exceeds an aggregation threshold, it writes the crawling results into an aggregation file by splicing them end to end and records the position offset of each crawling result; it generates, from the content of the aggregation file, an aggregation path in a big-file object storage system and sends the aggregation file to that path; and it generates, from the aggregation file, an aggregation log comprising each crawling result, the position offset of each crawling result, the aggregation path, and the number of each crawler, and sends the aggregation log to a log processor.
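The steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the class name `ResultAggregator`, the helper `read_result`, the in-memory `store` dictionary standing in for the big-file object storage system, the `logs` list standing in for the log processor, and the use of a SHA-256 digest to derive the aggregation path from the file content are all hypothetical choices. The sketch also records each result's length next to its offset, which the abstract does not mention, so that a single result can be sliced back out.

```python
import hashlib
import io
from dataclasses import dataclass, field


@dataclass
class ResultAggregator:
    """Caches crawl results and flushes them as one aggregation file (sketch)."""
    threshold: int                               # aggregation threshold (result count)
    store: dict = field(default_factory=dict)    # assumed stand-in for the object store
    logs: list = field(default_factory=list)     # assumed stand-in for the log processor
    _pending: list = field(default_factory=list)

    def add(self, crawler_id: str, result: bytes) -> None:
        """Cache one result; flush once the cached count exceeds the threshold."""
        self._pending.append((crawler_id, result))
        if len(self._pending) > self.threshold:
            self.flush()

    def flush(self) -> None:
        """Splice cached results end to end and record each position offset."""
        if not self._pending:
            return
        buf = io.BytesIO()
        entries = []
        for crawler_id, result in self._pending:
            entries.append({
                "crawler_id": crawler_id,
                "offset": buf.tell(),     # where this result starts in the file
                "length": len(result),    # assumption: kept so reads can be bounded
            })
            buf.write(result)
        blob = buf.getvalue()
        # Assumption: derive the aggregation path from a hash of the file content.
        path = "aggregated/" + hashlib.sha256(blob).hexdigest()
        self.store[path] = blob                                 # one write for many results
        self.logs.append({"path": path, "entries": entries})    # the aggregation log
        self._pending.clear()


def read_result(store: dict, path: str, offset: int, length: int) -> bytes:
    """Fetch one result back out of an aggregation file by offset and length."""
    return store[path][offset:offset + length]


agg = ResultAggregator(threshold=2)
for cid, body in [("c1", b"alpha"), ("c2", b"be"), ("c3", b"gamma")]:
    agg.add(cid, body)
log = agg.logs[0]
last = log["entries"][-1]
print(read_result(agg.store, log["path"], last["offset"], last["length"]))
```

With a threshold of 2, the third `add` call triggers a single flush, so three results cost one storage write instead of three, and each result remains individually addressable through the log entry.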

Description

Technical field

[0001] The present invention relates to the field of software engineering, and in particular to a crawler system IO optimization method and device.

Background

[0002] At present, in the field of software engineering, crawl results are mostly stored with the crawler task as the basic unit. For example, in the open source crawler framework scrapy-redis, results are generally abstracted into items and placed in a result queue, and are then written to files one by one or written to a database. The disadvantages of this approach are: disk or network IO operations are particularly frequent; if one file is saved per result, a large number of disk fragments are generated, and once the number of files reaches the million level, folder traversal takes up a lot of memory and consumes a lot of time, and at the same time causes other disk IO operations on the machine to be blocked, causing the system to fre...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/181, G06F16/951
Inventors: 陈开冉, 邓楚健
Owner: 广州探迹科技有限公司