Mass small file distributed caching method oriented to AI (Artificial Intelligence) training

A distributed caching technology for massive small files, applied to file systems, file access structures, storage systems, and similar fields. It addresses problems such as slowed data access, long waits, and cache misses, with the effects of improving the data access rate, increasing the cache hit rate, and solving the random-access problem.

Pending Publication Date: 2022-01-07
HANGZHOU DIANZI UNIV
Cites: 0 | Cited by: 2

AI Technical Summary

Problems solved by technology

The prior scheme merges massive small files into data blocks (chunks) for storage and uses intra-group shuffle in place of fully random access for AI training. However, when an AI task accesses a data block for the first time, a cache miss still forces a long wait for disk I/O, which limits the data access rate.


Embodiment Construction

[0018] Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0019] The invention includes the following steps:

[0020] Step 1: Create Local Cache and Alluxio cache.

[0021] Create a key-value store on the client as the Local Cache, and create the Alluxio cache on the distributed storage devices.

[0022] Alluxio cache: Alluxio supports block-level in-memory caching and mainly stores the merged data chunks. Whenever an application accesses a chunk that is not in the Alluxio cache, the chunk is fetched from the underlying storage and loaded into the Alluxio cache for subsequent accesses.

[0023] Local Cache: the Local Cache is a key-value store on the client that mainly stores the small files parsed out of chunks. Whenever a chunk is fetched from the Alluxio cache, the small files in that chunk are parsed and stored in the Local Cache.
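
As a reading aid, the two-tier lookup of Step 1 can be sketched as follows. This is a minimal illustration only: the get_chunk interface and the parse_chunk helper are hypothetical stand-ins, not actual Alluxio client APIs.

    # Illustrative sketch of the Step 1 two-tier read path; the Alluxio
    # client interface shown here is a hypothetical stand-in.
    class TwoTierCache:
        def __init__(self, alluxio_store, parse_chunk):
            self.local_cache = {}           # client key-value store: file name -> bytes
            self.alluxio = alluxio_store    # distributed cache of merged chunks
            self.parse_chunk = parse_chunk  # splits a chunk back into small files

        def read(self, file_name, chunk_id):
            # Local Cache hit: serve the small file directly on the client.
            if file_name in self.local_cache:
                return self.local_cache[file_name]
            # Miss: fetch the whole chunk from the Alluxio tier (which loads it
            # from the underlying storage on its own miss), then parse every
            # small file in the chunk into the Local Cache, so subsequent reads
            # of neighboring files become local hits.
            chunk = self.alluxio.get_chunk(chunk_id)
            for name, data in self.parse_chunk(chunk).items():
                self.local_cache[name] = data
            return self.local_cache[file_name]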

[0024] Step 2: In the data storage stage of AI training, small files in the data set are merged into chunks according to a rule fitting the Batch Size c...
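
The merge rule itself is truncated above, but the abstract states that small files are combined into chunks by a rule fitting the Batch Size. Below is a minimal sketch under one reading of that rule, where each chunk holds a whole number of batches; the batches_per_chunk parameter is an assumed illustration, not taken from the application.

    # Sketch of a Batch-Size-fitting merge: each chunk packs a whole number of
    # batches' worth of small files, so a chunk always serves complete batches.
    # `batches_per_chunk` is an assumed tuning parameter, not from the patent.
    def merge_into_chunks(file_paths, batch_size, batches_per_chunk=4):
        files_per_chunk = batch_size * batches_per_chunk
        chunks = []
        for start in range(0, len(file_paths), files_per_chunk):
            group = file_paths[start:start + files_per_chunk]
            # Store a chunk as one blob plus an index of (name, offset, length)
            # entries, so individual small files can be parsed back out later.
            blob, index, offset = bytearray(), [], 0
            for path in group:
                with open(path, "rb") as f:
                    data = f.read()
                index.append((path, offset, len(data)))
                blob.extend(data)
                offset += len(data)
            chunks.append((bytes(blob), index))
        return chunks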


Abstract

The invention discloses an AI (artificial intelligence) training-oriented distributed caching method for massive small files, which realizes high-performance distributed caching of massive small files. The method comprises the following steps: first, combining small files into chunks according to a rule that fits the Batch Size characteristics of AI training; second, analyzing the cache state of the chunks and carrying out a double-layer shuffle operation on the small-file sequence; and finally, during data reading in AI training, adopting Local Cache short-circuit reads for repeated I/O and starting asynchronous grouped pre-reading at the moment of a Local Cache short-circuit read. By using the cache efficiently, the method solves the random-read problem of massive small files in AI training; in AI-training scenarios, it remarkably improves the data access rate and the cache hit rate and shortens the iteration time of AI training.
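
The double-layer shuffle named in the abstract can be illustrated with a short sketch: the order of chunks is shuffled, then the small-file order within each chunk is shuffled, so training sees a randomized sequence while each chunk is still consumed as one sequential unit. How the method uses the analyzed cache state to steer this shuffle is not visible on this page, so the sketch ignores cache state.

    import random

    # Illustrative double-layer shuffle: layer 1 shuffles across chunks,
    # layer 2 shuffles the small files inside each chunk. The cache-state
    # analysis mentioned in the abstract is omitted in this sketch.
    def double_layer_shuffle(chunks, seed=None):
        rng = random.Random(seed)
        chunk_order = list(chunks)
        rng.shuffle(chunk_order)       # layer 1: randomize chunk order
        sequence = []
        for chunk_files in chunk_order:
            group = list(chunk_files)
            rng.shuffle(group)         # layer 2: randomize order within a chunk
            sequence.extend(group)
        return sequence

Because all files of a chunk are consumed before the next chunk is touched, each chunk needs to be fetched from the distributed cache only once per epoch, which is what lets this scheme replace fully random access.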

Description

Technical field

[0001] The present invention is directed to smart-city scenes such as face recognition, video search, and intelligent storage, and designs a distributed caching method for massive small files oriented to AI training, so as to realize high-performance distributed caching of massive small files.

Background technique

[0002] In recent years, with the rapid development of the global economy, society, science, and technology, the large-scale application of AI technology in the security field has strongly boosted the development of safe and smart cities. At the same time, smart-city scenes such as face recognition and cross-border pedestrian tracking pose challenges for AI technology. The number of files required by an AI training task is usually on the order of millions or even tens of millions; for example, the Google OpenImages dataset contains 9,000,000 images, and the Tencent ML-Images dataset contains nearly 17,690,000 images. Usually requires massive scale of te...


Application Information

Patent Type & Authority: Application (China)
IPC (8): G06F16/172; G06F16/13; G06F12/0871; G06F12/0862
CPC: G06F16/172; G06F16/13; G06F12/0871; G06F12/0862; G06F2212/1044; G06F2212/1021; G06F2212/154
Inventor: 路锦, 曾艳, 赵乃良, 张纪林, 袁俊峰, 万健, 张雪容, 沈鸿辉
Owner: HANGZHOU DIANZI UNIV