An Object Storage Based Crawler Network Path Tracing Method

A crawler network path tracing method based on object storage, applied to path tracing research in software engineering. It addresses heavy disk I/O load, improves I/O efficiency, reduces coupling, and keeps retrieval efficient.
CN107451261B · Active · Publication Date: 2020-06-09 · 广州探迹科技有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: 广州探迹科技有限公司
Publication Date: 2020-06-09


Abstract

The invention discloses a crawler network path tracing method based on object storage. The method establishes an object storage system and a log recorder. The log recorder generates a result path log, which records an index from the source URL of each crawl result to the corresponding crawler result file on the object storage system. When an external system needs to retrieve the data, it obtains the crawler result file directly from the object storage system through this index. Introducing the object storage system increases file read and write speed. Because the result path log exists, an external system retrieves data from the log rather than searching the database, which avoids read-write conflicts.
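The flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a local directory stands in for the object storage system, the result path log is assumed to be a JSON-lines file, and the names `LocalObjectStore`, `ResultPathLog`, `save_result`, and `fetch_result` are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


class LocalObjectStore:
    """Stand-in for an object storage system (assumption): key -> bytes on disk."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        (self.root / key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class ResultPathLog:
    """Append-only log indexing a source URL to the object key of its result file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def record(self, source_url: str, object_key: str) -> None:
        # One JSON object per line (the log format is an assumption).
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"url": source_url, "key": object_key}) + "\n")

    def lookup(self, source_url: str) -> Optional[str]:
        # Linear scan for simplicity; the most recent entry for a URL wins.
        if not self.path.exists():
            return None
        found = None
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                if entry["url"] == source_url:
                    found = entry["key"]
        return found


def save_result(store: LocalObjectStore, log: ResultPathLog,
                source_url: str, content: bytes) -> str:
    """Crawler side: write the result file, then record its path in the log."""
    key = hashlib.sha256(source_url.encode("utf-8")).hexdigest()
    store.put(key, content)
    log.record(source_url, key)
    return key


def fetch_result(store: LocalObjectStore, log: ResultPathLog,
                 source_url: str) -> Optional[bytes]:
    """External system side: resolve the URL through the log, never the database."""
    key = log.lookup(source_url)
    return None if key is None else store.get(key)
```

In this sketch the external system touches only the log and the object store, so it never scans or updates the crawler's database, which is the decoupling the abstract describes.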

Description

Technical Field

[0001] The invention belongs to the research field of path tracing in software engineering, and in particular relates to a crawler network path tracing method based on object storage.

Background Art

[0002] A web crawler is a program or script that automatically captures information on the World Wide Web according to certain rules. At present, most crawler network path tracing uses the crawler task as its basic unit. For example, the default behavior of the open source crawler framework pyspider is to store results in a database. If an external system needs to retrieve data from that database, there is no convenient retrieval method: it can only scan the database, and it must modify the status of the result data so that already-processed data is excluded from the next round of processing. As a result, the data in the database must be maintained by the two systems together, causing great uncer...

Claims
