Data file processing method and device, electronic equipment and storage medium
By parsing the identifier and storage status of a data file and optimizing the caching strategy for data files, the method addresses slow read speeds for massive numbers of data files, achieving efficient file reading and improved system stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2021-04-07
- Publication Date
- 2026-06-12
Abstract
Description
Technical Field
[0001] This application relates to computer data technology, and more particularly to a data file processing method, apparatus, electronic device, and computer-readable storage medium.
Background Technology
[0002] In the era of big data explosion, there are usually massive amounts of data files. For example, computer vision requires hundreds of millions of data files for model training, e-commerce applications need tens of thousands of product images for product display, and personnel management systems need thousands of personnel photos for check-in, etc.
[0003] Faced with the explosion of data, related technologies store data files on file servers. However, when a file server stores tens of millions of data files, the read speed falls far short of actual requirements when a specific data file needs to be read, and related technologies lack an efficient file reading method.
Summary of the Invention
[0004] This application provides a data file processing method, apparatus, electronic device, and computer-readable storage medium, which can improve the reading efficiency of data files.
[0005] The technical solution of this application embodiment is implemented as follows:
[0006] This application provides a data file processing method, including:
[0007] In response to a read request for a data file, the file read interface is called to parse the read request for the data file and obtain the identifier of the data file;
[0008] Based on the identifier of the data file, traverse the first storage space to determine the storage status of the data file;
[0009] When the storage status of the data file indicates that the data file has been cached, the metadata of the data file is obtained from the first storage space based on the identifier of the data file;
[0010] Based on the metadata of the data file, the file data of the data file is obtained from the second storage space.
[0011] In the above technical solution, after determining the storage status of the data file by traversing the first storage space based on the identifier of the data file, the method further includes:
[0012] When the storage status of the data file indicates that the data file is not cached, the file reading interface is called to retrieve the metadata of the data file from the metadata server based on the identifier of the data file;
[0013] Based on the metadata of the data file, the file reading interface is called to retrieve the file data of the data file from the file server.
[0014] In the above technical solution, after retrieving the metadata of the data file from the metadata server by calling the file reading interface based on the identifier of the data file, the method further includes:
[0015] The metadata of the data file is stored in the first storage space, and
[0016] The file data of the data file is stored in the second storage space.
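The read path described above (cache lookup, fallback to the servers, then populating both storage spaces) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `CachedReader`, `Metadata`, the `file_id` and `data_key` fields, and the use of plain dictionaries in place of the metadata server, the file server, and the two storage spaces are all hypothetical, and a dictionary lookup stands in for traversing the first storage space.

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    """Hypothetical metadata record: file name plus the key locating the file data."""
    name: str
    data_key: str


class CachedReader:
    def __init__(self, metadata_server: dict, file_server: dict):
        # Plain dicts stand in for the remote metadata server and file server.
        self.metadata_server = metadata_server
        self.file_server = file_server
        self.first_space = {}   # identifier -> Metadata (metadata cache)
        self.second_space = {}  # data_key -> bytes (file data cache)
        self.remote_reads = 0   # counts reads that had to go to the file server

    def read(self, request: dict) -> bytes:
        file_id = request["file_id"]          # parse the read request
        meta = self.first_space.get(file_id)  # determine the storage status
        if meta is not None and meta.data_key in self.second_space:
            return self.second_space[meta.data_key]  # cached: serve locally
        # Not cached: read from the servers, then populate both storage spaces.
        meta = self.metadata_server[file_id]
        data = self.file_server[meta.data_key]
        self.remote_reads += 1
        self.first_space[file_id] = meta
        self.second_space[meta.data_key] = data
        return data
```

In this sketch, the first read of an identifier reaches the servers and every later read of the same identifier is served from the two local storage spaces.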
[0017] In the above technical solution, storing the metadata of the data file in the first storage space and storing the file data of the data file in the second storage space includes:
[0018] Traverse the historical logs of the data file to determine the reading frequency of the data file;
[0019] When the reading frequency of the data file is greater than the reading frequency threshold, the metadata of the data file is stored in the first storage space, and the file data of the data file is stored in the second storage space.
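The frequency-based admission rule can be illustrated with a short sketch. It assumes, hypothetically, that the historical log is a list of previously read file identifiers and that the reading frequency is a simple occurrence count; the source does not specify the log format or the time window.

```python
from collections import Counter


def read_frequency(history_log: list[str], file_id: str) -> int:
    """Count how often file_id appears in a log of previously read identifiers."""
    return Counter(history_log)[file_id]


def should_cache(history_log: list[str], file_id: str, threshold: int) -> bool:
    # Cache only when the reading frequency is strictly greater than the threshold.
    return read_frequency(history_log, file_id) > threshold
```

A caller would store the metadata and file data in the first and second storage spaces only when `should_cache` returns `True`, so that rarely read files do not occupy cache space.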
[0020] In the above technical solution, storing the metadata of the data file in the first storage space and storing the file data of the data file in the second storage space includes:
[0021] The data file is subjected to feature extraction processing to obtain the feature information of the data file;
[0022] Based on the feature information of the data file, a prediction process is performed to obtain the cache level of the data file;
[0023] When the cache level of the data file indicates that the data file needs to be cached, the metadata of the data file is stored in the first storage space, and the file data of the data file is stored in the second storage space.
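The feature-extraction and prediction steps might look like the sketch below. The features (extension, size, recent read count), the size and read-count cutoffs, and the rule-based predictor are all assumptions chosen for illustration; the source does not state which features or which prediction model are used, and a trained model could replace `predict_cache_level`.

```python
def extract_features(name: str, size_bytes: int, recent_reads: int) -> dict:
    """Hypothetical features: file extension, size, and recent read count."""
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return {"ext": ext, "size_bytes": size_bytes, "recent_reads": recent_reads}


def predict_cache_level(features: dict) -> int:
    """Stand-in rule-based predictor. Level 0 means 'do not cache'."""
    if features["size_bytes"] > 64 * 1024 * 1024:
        return 0  # assumed: very large files are not worth caching
    if features["recent_reads"] >= 10:
        return 2  # assumed: frequently read files get the highest level
    return 1 if features["recent_reads"] >= 1 else 0


def needs_caching(level: int) -> bool:
    # Any non-zero level indicates that the data file should be cached.
    return level > 0
```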
[0024] The method in the above technical solution further includes:
[0025] When the available storage space of the first storage space is less than the first storage space threshold, or when the set first cache cleanup time arrives, delete the metadata of some of the data files in the first storage space until the available storage space of the first storage space is not less than the first storage space threshold.
[0026] When the available storage space of the second storage space is less than the second storage space threshold, or when the set second cache cleanup time arrives, delete part of the file data of the data files in the second storage space until the available storage space of the second storage space is not less than the second storage space threshold.
[0027] In the above technical solution, deleting part of the file data of the data file in the second storage space includes:
[0028] Based on the duration during which the file data has not been read, the file data of the data files in the second storage space is sorted in descending order, and the file data of the data files that appear first in the descending sort is deleted; or
[0029] Based on the number of times the file data is read, the file data of the data file in the second storage space is sorted in ascending order, and the file data of the data file that is first in the ascending order is deleted.
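The two deletion orders and the stop condition can be sketched as follows. The `Entry` record, the `evict` function, and measuring space by the byte length of the cached data are illustrative assumptions; only the space-threshold trigger is shown, not the scheduled cleanup time.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    data: bytes
    last_read: float  # timestamp of the most recent read
    read_count: int


def evict(space: dict, capacity: int, threshold: int,
          policy: str = "idle", now: Optional[float] = None) -> list:
    """Delete entries until free space is no longer below the threshold.

    policy="idle":  entries unread for the longest time are deleted first.
    policy="count": entries read the fewest times are deleted first.
    Returns the identifiers that were deleted, in deletion order.
    """
    now = time.time() if now is None else now
    if policy == "idle":
        order = sorted(space, key=lambda k: now - space[k].last_read, reverse=True)
    elif policy == "count":
        order = sorted(space, key=lambda k: space[k].read_count)
    else:
        raise ValueError(f"unknown policy: {policy}")
    used = sum(len(e.data) for e in space.values())
    deleted = []
    for key in order:
        if capacity - used >= threshold:
            break  # available space is not less than the threshold
        used -= len(space.pop(key).data)
        deleted.append(key)
    return deleted
```

The "idle" order corresponds to a least-recently-used policy and the "count" order to a least-frequently-used policy; which one suits a workload depends on whether recency or popularity better predicts future reads.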
[0030] This application provides a data file processing apparatus, including:
[0031] The calling module is used to respond to a read request for a data file by calling the file read interface to parse the read request for the data file and obtain the identifier of the data file;
[0032] The determination module is used to traverse the first storage space based on the identifier of the data file to determine the storage status of the data file;
[0033] The first reading module is used to, when the storage status of the data file indicates that the data file has been cached, retrieve the metadata of the data file from the first storage space based on the identifier of the data file; and retrieve the file data of the data file from the second storage space based on the metadata of the data file.
[0034] In the above technical solution, the device further includes:
[0035] The second reading module is used to retrieve the metadata of the data file from the metadata server by calling the file reading interface based on the identifier of the data file when the storage status of the data file indicates that the data file is not cached.
[0036] Based on the metadata of the data file, the file reading interface is called to retrieve the file data of the data file from the file server.
[0037] In the above technical solution, the device further includes:
[0038] A storage module is used to store the metadata of the data file in the first storage space, and
[0039] The file data of the data file is stored in the second storage space.
[0040] In the above technical solution, the storage module is also used to traverse the historical logs of the data file to determine the reading frequency of the data file;
[0041] When the reading frequency of the data file is greater than the reading frequency threshold, the metadata of the data file is stored in the first storage space, and the file data of the data file is stored in the second storage space.
[0042] In the above technical solution, the storage module is further used to perform feature extraction processing on the data file to obtain the feature information of the data file;
[0043] Based on the feature information of the data file, a prediction process is performed to obtain the cache level of the data file;
[0044] When the cache level of the data file indicates that the data file needs to be cached, the metadata of the data file is stored in the first storage space, and the file data of the data file is stored in the second storage space.
[0045] In the above technical solution, the storage module is further used to divide the first storage space into multiple first blocks, wherein each first block corresponds one-to-one with a cache level;
[0046] The second storage space is divided into multiple second blocks, wherein each second block corresponds one-to-one with a cache level;
[0047] The metadata of the data file is stored in the first block corresponding to the cache level of the data file;
[0048] The file data of the data file is stored in the second block corresponding to the cache level of the data file.
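The division of each storage space into per-level blocks can be sketched as follows. `LeveledSpace` and `store` are hypothetical names, and each block is modeled as an in-memory dictionary purely for illustration.

```python
class LeveledSpace:
    """A storage space split into blocks, one block per cache level."""

    def __init__(self, levels: list[int]):
        self.blocks = {level: {} for level in levels}

    def put(self, level: int, key: str, value) -> None:
        self.blocks[level][key] = value

    def get(self, key: str):
        # Scan blocks from the highest level down; return None when absent.
        for level in sorted(self.blocks, reverse=True):
            if key in self.blocks[level]:
                return self.blocks[level][key]
        return None


def store(first: LeveledSpace, second: LeveledSpace,
          level: int, file_id: str, metadata: dict, data: bytes) -> None:
    first.put(level, file_id, metadata)  # metadata goes to the first block of that level
    second.put(level, file_id, data)     # file data goes to the second block of that level
```

One plausible benefit of this layout, which the source does not state explicitly, is that cleanup can target low-level blocks first without scanning the whole storage space.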
[0049] In the above technical solution, the device further includes:
[0050] The first processing module is used to call the file reading interface to pre-obtain the metadata of the data file from the metadata server when the similarity between the historical data file that has been read and the data file is greater than the similarity threshold, and to store the metadata of the data file in the first storage space.
[0051] Based on the metadata of the data file, the file reading interface is called to retrieve the file data of the data file from the file server in advance, and the file data of the data file is stored in the second storage space.
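The similarity-triggered prefetch can be sketched as below. The source does not say how similarity is computed; this sketch assumes, for illustration only, that it is the character-level match ratio of two file names from Python's `difflib.SequenceMatcher`. The `prefetch` function, the `data_key` field, and the dictionary stand-ins for the servers and storage spaces are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: ratio of matching characters in two file names."""
    return SequenceMatcher(None, a, b).ratio()


def prefetch(read_history, candidates, metadata_server, file_server,
             first_space, second_space, threshold=0.8):
    """Pre-load candidates that resemble any file that has already been read."""
    fetched = []
    for file_id in candidates:
        if file_id in first_space:
            continue  # already cached
        if any(similarity(file_id, seen) > threshold for seen in read_history):
            meta = metadata_server[file_id]
            first_space[file_id] = meta
            second_space[file_id] = file_server[meta["data_key"]]
            fetched.append(file_id)
    return fetched
```

With a read history containing `img_0001.jpg`, a candidate named `img_0002.jpg` would be prefetched under this measure, while an unrelated name such as `song.mp3` would not.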
[0052] In the above technical solution, the device further includes:
[0053] An update module is used to update and verify the file data of the data file obtained from the second storage space;
[0054] When it is determined through update verification that the file data of the data file corresponding to the file server has been updated, the updated file data of the data file is obtained from the file server, and the second storage space is updated based on the updated file data of the data file.
[0055] In the above technical solution, the update module is further used to encode the file data of the data file obtained from the second storage space to obtain the corresponding verification code;
[0056] When the verification code obtained from the file server is inconsistent with the encoded verification code, it is determined that the file data of the data file stored in the second storage space needs to be updated.
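The update verification can be sketched as follows. The encoding is assumed here to be a SHA-256 digest; the source only says that the file data is encoded to obtain a verification code. `read_verified` and the dictionary stand-ins for the second storage space, the server-side codes, and the file server are hypothetical.

```python
import hashlib


def verification_code(data: bytes) -> str:
    """Assumed encoding: SHA-256 hex digest of the file data."""
    return hashlib.sha256(data).hexdigest()


def read_verified(file_id, second_space, server_codes, file_server):
    """Return cached data, refreshing it first if the server's code differs."""
    cached = second_space[file_id]
    if verification_code(cached) != server_codes[file_id]:
        cached = file_server[file_id]   # fetch the updated file data
        second_space[file_id] = cached  # update the second storage space
    return cached
```

Comparing short verification codes avoids transferring the full file data from the file server when the cached copy is still current.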
[0057] In the above technical solution, the device further includes:
[0058] The second processing module is used to delete the metadata of some of the data files in the first storage space when the available storage space of the first storage space is less than the first storage space threshold, or when the set first cache cleanup time arrives, until the available storage space of the first storage space is not less than the first storage space threshold.
[0059] When the available storage space of the second storage space is less than the second storage space threshold, or when the set second cache cleanup time arrives, delete part of the file data of the data files in the second storage space until the available storage space of the second storage space is not less than the second storage space threshold.
[0060] In the above technical solution, the second processing module is further configured to sort the file data of the data file in the second storage space in descending order based on the duration during which the file data has not been read, and delete the portion of the file data that appears first in the descending order; or,
[0061] Based on the number of times the file data is read, the file data of the data file in the second storage space is sorted in ascending order, and the file data of the data file that is first in the ascending order is deleted.
[0062] This application provides an electronic device for data file processing, the electronic device comprising:
[0063] Memory, used to store executable instructions;
[0064] The processor, when executing executable instructions stored in the memory, implements the data file processing method provided in the embodiments of this application.
[0065] This application provides a computer-readable storage medium storing executable instructions, which, when executed by a processor, implement the data file processing method provided in this application.
[0066] This application provides a computer program that, when executed by a processor, implements the data file processing method provided in this application.
[0067] The embodiments of this application have the following beneficial effects:
[0068] By calling the file reading interface to read the file data of the data file, the overhead of switching between user mode and kernel mode in the system is reduced, and the system stability is improved. Furthermore, by using the identifier of the data file, it is determined whether the data file has been cached. When the data file has been cached, the file data of the data file is retrieved from the second storage space, thereby improving the reading efficiency of the data file.
Attached Figure Description
[0069] Figure 1 is a schematic diagram illustrating an application scenario of the distributed file system provided in an embodiment of this application;
[0070] Figure 2 is a schematic diagram of the structure of an electronic device for data file processing provided in an embodiment of this application;
[0071] Figures 3-5 are flowcharts illustrating the data file processing method provided in an embodiment of this application;
[0072] Figure 6 is a flowchart illustrating the data file processing method provided in an embodiment of this application;
[0073] Figure 7 is a schematic diagram of the data flow when the cache is not hit, provided in an embodiment of this application;
[0074] Figure 8 is a schematic diagram of the data flow when the cache is hit, provided in an embodiment of this application.
Detailed Implementation
[0075] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0076] In the following description, the terms "first" and "second" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.
[0077] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0078] Before providing a further detailed description of the embodiments of this application, the nouns and terms involved in the embodiments of this application will be explained, and the nouns and terms involved in the embodiments of this application shall be interpreted as follows.
[0079] 1) Client: An application running in a terminal that provides various services, such as a video playback client, a game client, etc.
[0080] 2) Responding to: used to indicate the conditions or states on which the operation is performed depends. When the conditions or states on which it depends are met, one or more operations can be performed in real time or with a set delay. Unless otherwise specified, there is no restriction on the order in which the multiple operations are performed.
[0081] 3) Artificial Intelligence (AI): A comprehensive technology in computer science that studies the design principles and implementation methods of various intelligent machines to enable them to perceive, reason, and make decisions. AI technology is a multidisciplinary field involving a wide range of areas, such as natural language processing and machine learning / deep learning. With technological advancements, AI will be applied in more fields and play an increasingly important role.
[0082] 4) Ceph: A distributed file system designed for object storage and cloud infrastructure, it adds replication and fault tolerance while maintaining POSIX compatibility. The lowest-level storage unit in Ceph is the data file, each containing metadata and file data.
[0083] 5) Ceph File System (CephFS): A POSIX-compliant file system that uses a Ceph storage cluster to store data. Ceph files can be accessed directly through CephFS as if they were local hard drives.
[0084] 6) Metadata Server (MDS): The Ceph metadata server is the module in a Ceph cluster that stores file metadata. When a CephFS client accesses a file within Ceph, it requests the file's metadata information from the MDS. This metadata includes the filename and attribute information of the data file, and the metadata is separated from the data itself.
[0085] 7) File Server (RADOS, Reliable Autonomic Distributed Object Store): A reliable and autonomous distributed object storage system that provides a stable, scalable, and high-performance single logical object storage interface and enables nodes to adapt and self-manage on a multi-storage device cluster for storing file data.
[0086] 8) Linux File System: Files in the Linux file system are collections of data. The file system not only contains the data in the files, but also the structure of the file system. All files, directories, symbolic links, and file protection information that Linux users and programs see are stored in it.
[0087] 9) User Mode: In the design of the Central Processing Unit (CPU), user mode refers to a non-privileged state. In this state, the code executing is hardware-restricted and cannot perform certain operations, such as writing to the memory space of other processes, to prevent security vulnerabilities to the operating system. In the design of the operating system, user mode is similar, referring to a non-privileged execution state. The kernel prohibits code in this state from performing potentially dangerous operations, such as writing to system configuration files, killing other users' processes, or restarting the system.
[0088] 10) Kernel mode: In processor memory protection, it is also known as privileged mode. Kernel mode is the mode in which the operating system kernel runs. Code running in this mode can access system memory and external devices without restriction.
[0089] The ways to switch from user mode to kernel mode include the following. System calls: a user-mode process actively requests a switch to kernel mode in order to use operating system services to complete a task. Exceptions: when the CPU is executing a program running in user mode and an unpredictable exception occurs, execution switches from the currently running process to the kernel routine that handles the exception. Peripheral device interrupts: when a peripheral device completes a user-requested operation, it sends a corresponding interrupt signal to the CPU, at which point the CPU suspends the next instruction it was about to execute and instead executes the handler corresponding to the interrupt signal.
[0090] This application provides a data file processing method, apparatus, electronic device, and computer-readable storage medium that can improve the reading efficiency of data files.
[0091] The data file processing method provided in this application embodiment can be implemented by a terminal or a server alone; or it can be implemented by the terminal and the server in cooperation. For example, the terminal can undertake the data file processing method described below alone, or the terminal can send a read request for the data file to the server, and the server can execute the data file processing method according to the received read request for the data file to obtain the file data of the data file from the second storage space to realize the operation of reading the data file.
[0092] The electronic device for data file processing provided in this application can be various types of terminals or servers. The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smart TV, smartwatch, etc., but is not limited to these. The terminal and server can be directly or indirectly connected via wired or wireless communication, which is not limited herein.
[0093] Taking servers as an example, such as server clusters deployed in the cloud, AI as a Service (AIaaS) is offered to users. The AIaaS platform breaks down several common AI services and provides them as independent or packaged services in the cloud. This service model is similar to an AI-themed marketplace, where all users can access and use one or more AI services provided by the AIaaS platform through application programming interfaces.
[0094] For example, one type of AI cloud service can be a data file processing service, whereby a cloud server encapsulates the data file processing program provided in this application embodiment. A user invokes the data file processing service in the cloud service through a terminal (running a client), causing the cloud-deployed server to invoke the encapsulated data file processing program. In response to a data file read request, the program invokes a file read interface to determine the storage status of the data file. When the storage status indicates that the data file is cached, the program retrieves the file data from a second storage space.
[0095] Referring to Figure 1, Figure 1 is a schematic diagram of an application scenario of the distributed file system 10 provided in an embodiment of this application. The terminal 200 connects to the servers (a metadata server 100-1 and a file server 100-2 are shown as examples) through the network 300. The network 300 can be a wide area network or a local area network, or a combination of the two.
[0096] A terminal (running a client, such as a model training client, music client, video client, education client, etc.) can be used to obtain read requests for data files. For example, when a user opens a model training client running on the terminal and selects a model to be trained, the terminal automatically obtains a read request for the data file in order to train the model using the read data file.
[0097] In some embodiments, after receiving a read request for a data file, the terminal 200 calls the file read interface to determine the storage status of the data file. When the storage status of the data file indicates that the data file has been cached, the terminal 200 retrieves the file data of the data file from its local second storage space in response to the read request for the data file.
[0098] In some embodiments, after receiving a read request for a data file, the terminal 200 calls the file read interface to determine the storage status of the data file. When the storage status of the data file indicates that the data file is not cached, the terminal 200 calls the file read interface to retrieve the metadata of the data file from the metadata server 100-1 and the file data of the data file from the file server 100-2 based on the identifier of the data file. The terminal 200 then stores the metadata of the data file in its local first storage space and stores the file data of the data file in its local second storage space. The terminal 200 responds to the read request for the data file based on the file data of the data file.
[0099] As an application example, for a model training application (training a model using a large amount of training data), when a user opens the model training client running on the terminal and selects the model to be trained, the terminal automatically obtains a read request for the training data file. Based on this read request, the terminal calls the file read interface and determines the storage status of the training data file. If the storage status indicates that the training data file is cached, the file data is retrieved from the terminal's local second storage space to respond to the read request, thus enabling model training based on the retrieved training data file. This avoids retrieving the training data file from the file server, accelerating the model training process. If the storage status indicates that the training data file is not cached, the terminal calls the file read interface based on the training data file's identifier to retrieve the metadata from the metadata server, retrieves the file data from the file server, stores the metadata in the terminal's local first storage space, and stores the file data of the training data file in the terminal's local second storage space. Subsequent reads of the training data file can then be served directly from the terminal's local second storage space, avoiding retrieval from the file server and further accelerating model training.
[0100] As another application example, for a music application (capable of playing a vast library of music), when a user opens the music client running on their terminal and selects the music they want to play, the terminal automatically obtains a read request for the music file. Based on this read request, the terminal calls the file read interface and determines the storage status of the music file. If the storage status indicates that the music file is cached, the terminal retrieves the file data from its local second storage space to respond to the read request. This allows for music playback based on the retrieved music file, avoiding the need to retrieve the music file from a file server and thus speeding up music playback.
[0101] The embodiments of this application can be implemented with the help of cloud technology, which refers to a hosting technology that unifies a series of resources such as hardware, software, and network within a wide area network or local area network to realize the computation, storage, processing, and sharing of data.
[0102] Cloud technology is a collective term for network technology, information technology, integration technology, management platform technology, and application technology applied to the cloud computing business model. It can form resource pools, providing flexible and convenient on-demand access. Cloud computing technology will become a crucial support. Backend services of technical network systems require substantial computing and storage resources, such as video websites, image websites, and many portal websites. With the rapid development and application of the internet industry, every item may have its own identification mark in the future, requiring transmission to backend systems for logical processing. Data at different levels will be processed separately, and various industry data will all require robust system support, which can only be achieved through cloud computing.
[0103] Cloud storage is a new concept that extends and develops from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to as a storage system) refers to a storage system that uses cluster applications, grid technology and distributed storage file systems to bring together a large number of storage devices of various types in the network (storage devices are also called storage nodes) to work together through application software or application interfaces to provide data storage and business access functions to the outside world.
[0104] Currently, the storage method in storage systems is as follows: Logical volumes are created. During creation, physical storage space is allocated to each logical volume. This physical storage space may consist of a single storage device or the disks of several storage devices. Clients store data on a logical volume, which means storing the data on the file system. The file system divides the data into many parts, each part being an object. Each object contains not only the data but also additional information such as a data identifier (ID, ID entity). The file system writes each object to the physical storage space of that logical volume and records the storage location information of each object. Therefore, when a client requests access to data, the file system can allow the client to access the data based on the storage location information of each object.
[0105] The structure of the electronic device for data file processing provided in the embodiments of this application is described below. Referring to Figure 2, Figure 2 is a schematic diagram of the structure of an electronic device 500 for data file processing provided in an embodiment of this application, using a terminal as an example. The electronic device 500 shown in Figure 2 includes at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together via a bus system 540. It is understood that the bus system 540 is used to implement communication between these components. In addition to a data bus, the bus system 540 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are labeled as bus system 540 in Figure 2.
[0106] The processor 510 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.
[0107] Memory 550 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 550 described in this application embodiment is intended to include any suitable type of memory. Memory 550 may optionally include one or more storage devices physically located away from processor 510.
[0108] In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules, and data structures or subsets or supersets thereof, as illustrated below.
[0109] Operating system 551 includes system programs for handling various basic system services and performing hardware-related tasks, such as the framework layer, core library layer, driver layer, etc., for implementing various basic business functions and handling hardware-based tasks;
[0110] The network communication module 552 is used to reach other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, WiFi, and Universal Serial Bus (USB), etc.
[0111] In some embodiments, the data file processing apparatus provided in this application can be implemented in software. The data file processing apparatus provided in this application can be provided in various software embodiments, including various forms such as applications, software, software modules, scripts or code.
[0112] Figure 2 shows a data file processing device 555 stored in memory 550. It may be software in the form of programs and plug-ins, and includes a series of modules: a calling module 5551, a determining module 5552, a first reading module 5553, a second reading module 5554, a storage module 5555, a first processing module 5556, an update module 5557, and a second processing module 5558. These modules are logical and can therefore be arbitrarily combined or further divided according to the functions they implement. The functions of each module will be described below.
[0113] As mentioned above, the data file processing method provided in this application embodiment can be implemented by various types of electronic devices. Referring to Figure 3, Figure 3 is a flowchart illustrating the data file processing method provided in the embodiments of this application. The method is explained below in conjunction with the steps shown in Figure 3.
[0114] In the following steps, the first storage space is used to store metadata, and the second storage space is used to store file data. The first and second storage spaces can be the same or different. For example, memory has a fast read speed but small storage space, so it is suitable for storing metadata, while solid-state drives (SSDs) have large storage space but a slower read speed, so they are suitable for storing file data. Therefore, the first storage space can be memory, and the second storage space can be an SSD. If storage size is not a concern and read speed is a priority, both the first and second storage spaces can be memory; if read speed is not a concern and storage space is a priority, both the first and second storage spaces can be SSDs.
[0115] In the following steps, the file data can be in the form of text, images, audio, video, etc. For example, in a news recommendation scenario, the file data can be news in text form; in a face recognition scenario, the file data can be face images in image form, etc.
[0116] In step 101, in response to a read request for a data file, the file read interface is called to parse the read request for the data file and obtain the identifier of the data file.
[0117] As an example of obtaining a read request for a data file, when a user opens the model training client running on the terminal and selects the model to be trained, the user program on the terminal automatically obtains a read request for the data file in order to train the model by reading the data file.
[0118] After receiving a read request for a data file, the terminal calls the file read interface to directly retrieve the data file based on the read request. This avoids calling the Linux file system, thereby reducing the overhead of switching between user mode and kernel mode, improving the speed of subsequent file data reading, and enhancing the stability of file reading.
[0119] In step 102, the first storage space is traversed based on the identifier of the data file to determine the storage status of the data file.
[0120] The data file identifier is used to uniquely identify the data file. For example, the data file identifier can be the filename, file identity (ID), etc. After a read request for a data file is received, the read request is parsed to obtain the identifier of the data file. Based on the identifier, the first storage space is traversed. If the metadata of the data file is found in the first storage space, the cache is hit, and the data file's storage status is determined to be cached. The file data can then be retrieved directly from the local machine by calling the file read interface, without needing to access the file server over the network, thus improving the speed of data file retrieval. If the metadata of the data file is not found in the first storage space, the cache is missed, and the data file's storage status is determined to be uncached. The file data can then be retrieved from the file server by calling the file read interface, which addresses the problem of insufficient terminal storage space.
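The lookup in step 102 can be sketched as follows. This is a minimal illustration only, assuming the first storage space is modeled as an in-memory dictionary keyed by the data file identifier; the names StorageStatus and storage_status are hypothetical, and a hash lookup stands in for the traversal described above.

```python
from enum import Enum


class StorageStatus(Enum):
    """Hypothetical labels for the two storage statuses in step 102."""
    CACHED = "cached"
    UNCACHED = "uncached"


def storage_status(first_storage_space: dict, file_id: str) -> StorageStatus:
    """Return CACHED when metadata for file_id is present, else UNCACHED.

    Assumption: the first storage space is a dict mapping the data file
    identifier to its metadata.
    """
    if file_id in first_storage_space:
        return StorageStatus.CACHED
    return StorageStatus.UNCACHED
```

A CACHED result would lead to steps 103-104 (local read), and an UNCACHED result to steps 106-109 (fetch from the servers, then cache).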
[0121] Referring to Figure 4, Figure 4 is an optional flowchart illustrating the data file processing method provided in an embodiment of this application. Figure 4 shows that the implementation of Figure 3 also includes steps 106-109: In step 106, when the storage status of the data file indicates that the data file is not cached, the file reading interface is called to obtain the metadata of the data file from the metadata server based on the identifier of the data file; in step 107, the file reading interface is called to obtain the file data of the data file from the file server based on the metadata of the data file; in step 108, the metadata of the data file is stored in the first storage space; in step 109, the file data of the data file is stored in the second storage space.
[0122] For example, when the storage status of a data file indicates that the data file is not cached, it means that the data file is being read for the first time. It is necessary to retrieve the metadata of the data file from the metadata server and the file data of the data file from the file server. The metadata of the data file is cached in memory, and the file data of the data file is cached in SSD. This avoids having to retrieve the metadata of the data file from the metadata server and the file data of the data file from the file server again when reading the data file next time. This avoids interaction with Ceph and greatly shortens the input / output (I / O) time.
[0123] In some embodiments, storing the metadata of a data file in a first storage space and storing the file data of the data file in a second storage space includes: traversing the historical log of the data file to determine the reading frequency of the data file; when the reading frequency of the data file is greater than a reading frequency threshold, storing the metadata of the data file in the first storage space and storing the file data of the data file in the second storage space.
[0124] Continuing the example above, when a data file is not frequently used (i.e., read infrequently), there's no need to cache it locally on the terminal, saving storage space. By analyzing the data file's historical logs, the number of times the data file is read within a set time interval is determined, thus establishing the data file read frequency. If the read frequency exceeds a threshold, the data file is frequently used. In this case, the metadata is stored in the first storage space, and the file data is stored in the second storage space for future reads, avoiding interaction with Ceph and significantly reducing I / O time. Conversely, if the read frequency is less than or equal to the threshold, the data file is not frequently used. Therefore, caching it locally on the terminal is unnecessary, saving storage space. The metadata can still be retrieved from the metadata server, and the file data from the file server.
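The read-frequency check can be illustrated with a short sketch. It assumes the historical log has already been reduced to a list of read timestamps (in seconds) for one data file; the function names, the window length, and the threshold value are hypothetical choices for illustration.

```python
from typing import Iterable


def read_frequency(read_timestamps: Iterable[float], now: float,
                   window_seconds: float) -> int:
    """Count the reads that fall inside the last window_seconds."""
    start = now - window_seconds
    return sum(1 for t in read_timestamps if start <= t <= now)


def should_cache(read_timestamps: Iterable[float], now: float,
                 window_seconds: float, threshold: int) -> bool:
    """Cache only when the read frequency is strictly greater than the threshold."""
    return read_frequency(read_timestamps, now, window_seconds) > threshold
```

With reads at 10, 50, 90, and 95 seconds, a window of 60 seconds ending at 100, and a threshold of 2, three reads fall in the window, so the file would be cached; with a threshold of 3 it would not.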
[0125] In some embodiments, storing the metadata of a data file in a first storage space and storing the file data of the data file in a second storage space includes: performing feature extraction processing on the data file to obtain feature information of the data file; performing prediction processing based on the feature information of the data file to obtain the cache level of the data file; when the cache level of the data file indicates that the data file needs to be cached, storing the metadata of the data file in the first storage space and storing the file data of the data file in the second storage space.
[0126] Continuing with the example above, a neural network model is used to extract features from the data file to obtain its feature information. Based on this feature information, a prediction process is performed to determine the cache level of the data file. When the cache level indicates that the data file needs to be cached, the metadata of the data file is stored in the first storage space, and the file data of the data file is stored in the second storage space. When the cache level indicates that the data file does not need to be cached, there is no need to cache the data file locally on the terminal, saving the terminal's storage space. The metadata of the data file can still be obtained from the metadata server, and the file data of the data file can still be obtained from the file server next time.
[0127] For example, a neural network model can be used to extract features from a data file, resulting in features across multiple modalities. These features are then fused to obtain a multimodal fused feature. Similarly, feature extraction is performed on the target user's historical interaction data to identify their interest features. The multimodal fused feature and the target user's interest features are then compared to obtain a similarity score, which is used as the data file's caching level. The multimodal features include text representation features, audio representation features, and image representation features. The neural network model is not limited to recurrent neural networks (RNNs), convolutional neural networks (CNNs), or deep neural networks (DNNs).
[0128] Continuing with the example above, multiple neural network models are each used to predict a cache level for the data file. Based on the weights of the multiple neural network models, the cache levels predicted by the individual models are weighted and summed to obtain the cache level of the data file. Thus, by using multiple neural network models for prediction, the accuracy of the cache level is increased, avoiding the situation where an error in any single neural network model leads to an incorrect prediction of the data file's cache level.
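The weighted combination of per-model predictions might look like the sketch below. The function name is hypothetical, and dividing by the total weight (so that the result stays on the same scale as the individual predictions) is an assumption; the text only states that the predictions are weighted and summed.

```python
from typing import Sequence


def combined_cache_level(levels: Sequence[float],
                         weights: Sequence[float]) -> float:
    """Weighted sum of per-model cache levels, normalized by the total weight.

    Normalization is an illustrative assumption, not stated in the source.
    """
    if not levels or len(levels) != len(weights):
        raise ValueError("levels and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(level * weight for level, weight in zip(levels, weights))
    return weighted_sum / total_weight
```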
[0129] In some embodiments, before storing the metadata of the data file in the first storage space, the method further includes: dividing the first storage space into multiple first blocks, wherein each first block corresponds to a cache level; dividing the second storage space into multiple second blocks, wherein each second block corresponds to a cache level; accordingly, storing the metadata of the data file in the first storage space includes: storing the metadata of the data file in the first block corresponding to the cache level of the data file; accordingly, storing the file data of the data file in the second storage space includes: storing the file data of the data file in the second block corresponding to the cache level of the data file.
[0130] For example, a data file can have one of multiple cache levels, such as level 1, level 2, level 3, etc. The first storage space is divided into multiple first blocks, one for each cache level, and the second storage space is divided into multiple second blocks, one for each cache level. For instance, if there are 5 cache levels, the first storage space is divided into 5 first blocks, and the second storage space is divided into 5 second blocks. When storing metadata, the data file's metadata is stored in the first block corresponding to the data file's cache level. When storing file data, the file data is stored in the second block corresponding to the data file's cache level. This way, when the cache level of the data file is known, only the block corresponding to that cache level needs to be read, thus speeding up the data file reading process.
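The per-level partitioning can be sketched as below. This is a toy model, assuming both storage spaces are in-memory dictionaries and cache levels are the integers 1 to N; the class and attribute names are hypothetical.

```python
class LeveledCache:
    """Toy cache whose two storage spaces hold one block per cache level."""

    def __init__(self, num_levels: int) -> None:
        levels = range(1, num_levels + 1)
        # First blocks hold metadata; second blocks hold file data.
        self.first_blocks = {level: {} for level in levels}
        self.second_blocks = {level: {} for level in levels}

    def put(self, file_id: str, level: int, metadata: dict, file_data: bytes) -> None:
        """Store metadata and file data in the blocks for the given level."""
        self.first_blocks[level][file_id] = metadata
        self.second_blocks[level][file_id] = file_data

    def get(self, file_id: str, level: int):
        """Look only in the blocks for the given level; None on a miss."""
        metadata = self.first_blocks[level].get(file_id)
        if metadata is None:
            return None
        return metadata, self.second_blocks[level][file_id]
```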
[0131] In some embodiments, before traversing the first storage space based on the identifier of the data file to determine the storage status of the data file, the method further includes: when the similarity between the historical data file that has been read and the data file is greater than a similarity threshold, calling the file reading interface to obtain the metadata of the data file from the metadata server in advance, and storing the metadata of the data file in the first storage space; and based on the metadata of the data file, calling the file reading interface to obtain the file data of the data file from the file server in advance, and storing the file data of the data file in the second storage space.
[0132] For example, to avoid reading data files from the file server only when needed, the data files can be pre-read from the file server and stored in a second storage space. After reading historical data files, the similarity between the read historical data files and the current data file is determined based on the file type or content. For instance, if both the read historical data files and the current data file are face training samples, the similarity is 100%. If the similarity is greater than a similarity threshold, it indicates a high probability that the data file will be read later. Therefore, the file reading interface can be called to pre-retrieve the metadata of the data file from the metadata server and store the metadata in the first storage space. If the similarity is less than or equal to the similarity threshold, it indicates a high probability that the data file will not be read later. Therefore, it is not necessary to pre-retrieve the metadata of the data file from the metadata server.
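The prefetch decision can be sketched as follows, assuming similarity is judged by file type alone (identical types score 1.0, different types 0.0, matching the 100% example above). The function names and type labels are hypothetical; a content-based similarity measure could replace type_similarity.

```python
def type_similarity(type_a: str, type_b: str) -> float:
    """Return 1.0 for identical file types and 0.0 otherwise (an assumption)."""
    return 1.0 if type_a == type_b else 0.0


def select_for_prefetch(history_type: str, candidates: dict,
                        threshold: float) -> list:
    """Return identifiers of candidates whose similarity exceeds the threshold.

    candidates maps a data file identifier to its file type.
    """
    return [
        file_id
        for file_id, file_type in candidates.items()
        if type_similarity(history_type, file_type) > threshold
    ]
```

The selected identifiers would then be fetched in advance: metadata into the first storage space, file data into the second.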
[0133] In step 103, when the storage status of the data file indicates that the data file has been cached, the metadata of the data file is obtained from the first storage space based on the identifier of the data file.
[0134] For example, when the storage status of a data file indicates that the data file has been cached, it means that the metadata of the data file is cached in the first storage space on the local terminal. Therefore, the metadata of the data file can be directly obtained from the first storage space on the local terminal so that the file data of the data file can be obtained from the local terminal based on the metadata of the data file.
[0135] In step 104, the file data of the data file is obtained from the second storage space based on the metadata of the data file.
[0136] For example, metadata includes the address of the data file. After obtaining the metadata of the data file, the file data of the data file is obtained from the second storage space based on the address of the data file in the metadata, and thus the read request for the data file is responded to based on the file data of the data file.
[0137] Referring to Figure 5, Figure 5 is an optional flowchart illustrating the data file processing method provided in an embodiment of this application. Figure 5 shows that the implementation of Figure 4 also includes steps 110-111: In step 110, update verification is performed on the file data of the data file obtained from the second storage space; in step 111, when it is determined through update verification that the file data of the data file on the file server has been updated, the updated file data of the data file is obtained from the file server, and the second storage space is updated based on the updated file data of the data file.
[0138] For example, when file data on the file server is updated, the file data cached in the second storage space expires and needs to be synchronized with the file data on the file server. Therefore, after retrieving file data from the second storage space on the terminal, it is necessary to verify the update of the file data retrieved from the second storage space. If it is determined that the file data in the second storage space has expired and needs to be updated, the updated file data is retrieved from the file server, and the second storage space is updated based on the updated file data.
[0139] In some embodiments, updating and verifying the file data of a data file obtained from a second storage space includes: encoding the file data of the data file obtained from the second storage space to obtain a corresponding verification code; and determining that the file data of the data file stored in the second storage space needs to be updated when the verification code obtained from the file server is inconsistent with the encoded verification code.
[0140] For example, the file data retrieved from the second storage space is encoded to obtain a verification code corresponding to the second storage space. Similarly, the file data in the file server is encoded to obtain a verification code corresponding to the file server. If the verification code on the file server does not match the verification code in the second storage space, it indicates that the file data in the second storage space has expired and needs to be updated. If the verification code on the file server matches the verification code in the second storage space, it indicates that the file data in the second storage space has not expired and does not need to be updated. The encoding algorithm is not limited to ASCII encoding, Base64 encoding, etc.
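The comparison of verification codes can be sketched as below. As a labeled substitution, this sketch uses a SHA-256 digest from Python's standard hashlib module as the verification code rather than the ASCII or Base64 encodings mentioned above; the function names are hypothetical, and the file server is assumed to compute its code with the same algorithm.

```python
import hashlib


def verification_code(file_data: bytes) -> str:
    """Derive a verification code from file data (SHA-256 hex digest here)."""
    return hashlib.sha256(file_data).hexdigest()


def needs_update(cached_file_data: bytes, server_code: str) -> bool:
    """True when the cached copy's code differs from the file server's code."""
    return verification_code(cached_file_data) != server_code
```

Comparing short codes avoids transferring the full file data from the file server just to detect that the cached copy has expired.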
[0141] In some embodiments, since the first storage space is limited, when the available storage space of the first storage space is less than the first storage space threshold, or when the set first cache cleanup time arrives, the metadata of some data files in the first storage space is deleted until the available storage space of the first storage space is not less than the first storage space threshold.
[0142] In some embodiments, since the second storage space is limited, when the available storage space of the second storage space is less than the second storage space threshold, or when the set second cache cleanup time arrives, the file data of some data files in the second storage space is deleted until the available storage space of the second storage space is not less than the second storage space threshold.
[0143] For example, deleting some data files in the second storage space includes: sorting the data files in the second storage space in descending order based on the duration the data files have not been read, and deleting the data files at the top of the descending order. For example, if the second storage space has 100 data files, sorting these 100 data files in descending order based on the duration the data files have not been read, and deleting the first 50 data files in the descending order, until the available storage space in the second storage space is not less than the second storage space threshold.
[0144] For example, deleting some data files in the second storage space includes: sorting the data files in the second storage space in ascending order based on the number of times they have been read, and deleting the data files that appear first in the ascending order. For example, if the second storage space contains 100 files, sorting these 100 files in ascending order based on the number of times they have been read, and deleting the first 50 files in the ascending order, until the available storage space in the second storage space is not less than the second storage space threshold.
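The two deletion orders, and deletion until the free-space threshold is met, can be sketched as follows. The function names are hypothetical, and per-file sizes, last-read times, and read counts are assumed to be tracked in plain dictionaries.

```python
def order_by_idle_time(last_read_at: dict, now: float) -> list:
    """Descending by time since last read: longest-unread files first."""
    return sorted(last_read_at, key=lambda fid: now - last_read_at[fid], reverse=True)


def order_by_read_count(read_counts: dict) -> list:
    """Ascending by number of reads: least-read files first."""
    return sorted(read_counts, key=lambda fid: read_counts[fid])


def evict(order: list, sizes: dict, free_space: int, threshold: int):
    """Delete files in the given order until free space is not below threshold.

    Returns the evicted identifiers and the resulting free space.
    """
    evicted = []
    for file_id in order:
        if free_space >= threshold:
            break
        free_space += sizes[file_id]
        evicted.append(file_id)
    return evicted, free_space
```

The same evict helper works with either ordering, so the choice between the two policies is isolated to which ordering function is called.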
[0145] In step 105, the read request for the data file is responded to based on the file data of the data file.
[0146] For example, if the metadata of the data file is found during the first storage space traversal, the cache is hit, and the storage status of the data file is determined to be cached. Then, the file data can be directly obtained from the local machine by calling the file read interface, and a read request for the data file can be responded to without accessing the file server through the network, thereby improving the speed of obtaining the data file. If the metadata of the data file is not found during the first storage space traversal, the cache is not hit, and the storage status of the data file is determined to be uncached. Then, the file data can be obtained from the file server by calling the file read interface, and a read request for the data file can be responded to, thus solving the problem of insufficient terminal storage space.
[0147] The following will describe an exemplary application of the embodiments of this application in a real-world application scenario.
[0148] In the context of AI and big data, machine learning platforms provide computing power. This often involves massive amounts of data files; for example, computer vision requires hundreds of millions of data files for model training. Typically, because training uses a distributed approach with multiple machines and GPUs, the training files are stored in the cloud using Ceph on machine learning computing platforms.
[0149] In machine learning training, the same dataset (one training sample in the dataset corresponds to one data file) is typically used for multiple rounds of training until the required convergence accuracy is met. Therefore, the same dataset will be accessed multiple times with the same frequency.
[0150] The Ceph MDS multi-node metadata caching technology in related technologies uses metadata caching backup technology to build an MDS cluster in a Ceph cluster. The use of the MDS cluster allows the system to cache the metadata information of hundreds of millions or even more files in memory. Each time a CephFS client requests Ceph MDS, the request can be load-balanced across the various MDS nodes, thereby alleviating the problem of Ceph MDS overload.
[0151] The applicant discovered that while the MDS multi-node metadata caching technology in related technologies improves the performance of metadata caching by increasing the number of server nodes to prevent MDS cluster overload, the following problems exist: 1) There is an upper limit to improving metadata caching performance, and increasing the number of server nodes leads to increased costs; 2) This solution can at most guarantee that the MDS cluster will not be overloaded, but cannot prevent CephFS clients from crashing when requesting to read massive amounts of files.
[0152] In related technologies, local metadata caching on the Ceph client involves executing metadata caching logic within the Ceph client. Whenever Ceph-FUSE (the user-space file system (FUSE) client of the Ceph distributed file system) reads a file from Ceph, the file's metadata is cached locally and not cleared, thereby avoiding repeated reading of the same file's metadata and improving data reading speed.
[0153] The applicant discovered that although the Ceph client-side local metadata caching technology caches file metadata locally through the Ceph client, thus eliminating the need to access the MDS when reading the same file subsequently, the following problems exist: 1) Only the metadata is cached, not the file data; accessing the file data still requires requesting access to RADOS; 2) All data requests still pass through the Linux file system, resulting in significant overhead from switching between user mode and kernel mode.
[0154] To address the aforementioned issues, this application embodiment directly obtains the training data file (file data) from Ceph RADOS during AI training via the CephFS application programming interface (API), and then asynchronously caches it on a local solid-state disk (SSD). When the same training data file needs to be accessed during subsequent training, it can be read directly from the local SSD, thereby accelerating the AI training speed.
[0155] In this embodiment, the speed of initially acquiring training data files is improved by using the CephFS API (file reading interface). When acquiring training data files for the first time during AI training, interaction with CephFS is required via the network. Using the CephFS API at this time can bypass the Linux file system and directly read data from RADOS, reducing the overhead of switching between user mode and kernel mode (when system calls, interrupts, or exceptions occur, user mode switches to kernel mode), thereby improving the speed of initially acquiring training data files. The training data files are cached using a local SSD cache, reducing interaction with CephFS. When the training data files are initially acquired, they are cached locally. When the same training data files need to be accessed in subsequent training processes, they can be read directly from the local SSD without accessing CephFS via the network, thereby improving the speed of acquiring training data files.
[0156] In this embodiment, the Ceph-FUSE client can also be used to interact with Ceph, and the metadata and file data of the requested data file can be cached in the Ceph-FUSE client. In other words, all the technical solutions in this application can be performed in the Ceph-FUSE client.
[0157] As shown in Figure 6, the calling flow of this application embodiment consists of steps 11-13:
[0158] Step 11: The user calls the file reading interface;
[0159] Step 12: Determine if the data file has been cached.
[0160] If the data file is not cached, the following operations are performed:
[0161] Step 12.1A: Obtain the metadata (including filename, attributes, and address) of the data file from MDS using the CephFS API based on the filename.
[0162] Step 12.2A: Read the file data of the data file from RADOS using the CephFS API based on the metadata of the data file;
[0163] Step 12.3A: Cache the metadata of the data file in memory, and store the file data on the local SSD;
[0164] If the data file hits the cache, then perform the following operations:
[0165] Step 12.1B: Retrieve the metadata of the data file from memory based on the filename;
[0166] Step 12.2B: Read the cached file data from the local SSD based on the metadata of the data file;
[0167] Step 13: Return the read file data.
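Steps 11-13 above can be sketched as a read-through cache. This is a simplified illustration, not the CephFS API: fetch_metadata and fetch_data are hypothetical callables standing in for the MDS and RADOS requests, two dictionaries stand in for memory and the local SSD, and caching is done synchronously here even though the embodiment describes asynchronous caching to the SSD.

```python
from typing import Callable


class ReadThroughCache:
    """Toy model of the calling flow: check cache, fetch on miss, return data."""

    def __init__(self, fetch_metadata: Callable[[str], dict],
                 fetch_data: Callable[[dict], bytes]) -> None:
        self._fetch_metadata = fetch_metadata  # stands in for the MDS request
        self._fetch_data = fetch_data          # stands in for the RADOS request
        self.metadata_cache = {}               # stands in for memory
        self.data_cache = {}                   # stands in for the local SSD

    def read(self, filename: str) -> bytes:
        metadata = self.metadata_cache.get(filename)    # step 12
        if metadata is None:
            metadata = self._fetch_metadata(filename)   # step 12.1A
            file_data = self._fetch_data(metadata)      # step 12.2A
            self.metadata_cache[filename] = metadata    # step 12.3A
            self.data_cache[metadata["address"]] = file_data
            return file_data                            # step 13
        # Steps 12.1B and 12.2B: serve from the local caches.
        return self.data_cache[metadata["address"]]     # step 13
```

In this sketch, a second read of the same filename touches neither fetch_metadata nor fetch_data, which corresponds to the cache-hit branch.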
[0168] When a file misses the cache, the data flow is as shown in Figure 7 and consists of steps 21-28 below:
[0169] Step 21: The user program sends a file read request to the cache module;
[0170] Step 22: The caching module sends a file metadata read request to MDS;
[0171] Step 23: MDS returns the metadata of the data file to the cache module;
[0172] Step 24: The caching module stores the metadata of the data file in memory;
[0173] Step 25: The caching module directly sends a file data read request to RADOS based on the metadata of the data file;
[0174] Step 26: RADOS returns file data;
[0175] Step 27: The caching module stores the file data in the SSD;
[0176] Step 28: The cache module returns the file data to the user program.
[0177] When a file hits the cache, the data flow is as shown in Figure 8 and consists of steps 31-34 below:
[0178] Step 31: The user program sends a file read request to the cache module;
[0179] Step 32: The cache module retrieves the metadata of the data file from memory;
[0180] Step 33: The caching module retrieves file data from the SSD;
[0181] Step 34: The cache module returns the file data to the user program.
[0182] In summary, the embodiments of this application have the following beneficial effects:
[0183] 1) Prevent Ceph MDS overload. When the number of files stored in the target path in Ceph reaches millions, requesting MDS to obtain file metadata during each training will cause a large number of MDS requests, leading to MDS overload in the Ceph cluster. The embodiments of this application can limit the occurrence of this situation from the source.
[0184] 2) Improve AI training speed. In AI training scenarios, the same data file is read in each round of training. The frequency of reading these data files is the same throughout the training process. For this kind of file reading with a certain pattern, the embodiments of this application can cache the metadata and file data of the read data file in the first training process. Subsequent training does not require interaction with Ceph, which greatly shortens the I / O time.
[0185] 3) Bypassing the Linux file system to speed up data reading: This application embodiment directly reads data through the CephFS API, without the need to frequently switch between user mode and kernel mode, which can speed up the data reading speed of the first training.
[0186] The exemplary application and implementation of the terminal provided in the embodiments of this application have been used to describe the data file processing method provided in the embodiments of this application. The following describes the scheme for the cooperation of various modules in the data file processing device 555 provided in the embodiments of this application to realize data file processing.
[0187] The calling module 5551 is used to respond to a read request for a data file by calling a file read interface to parse the read request for the data file and obtain the identifier of the data file; the determining module 5552 is used to traverse the first storage space based on the identifier of the data file to determine the storage status of the data file; the first reading module 5553 is used to, when the storage status of the data file indicates that the data file is cached, obtain the metadata of the data file from the first storage space based on the identifier of the data file; and obtain the file data of the data file from the second storage space based on the metadata of the data file.
[0188] In some embodiments, the data file processing device 555 further includes: a second reading module 5554, configured to, when the storage status of the data file indicates that the data file is not cached, call the file reading interface to obtain the metadata of the data file from the metadata server based on the identifier of the data file; and call the file reading interface to obtain the file data of the data file from the file server based on the metadata of the data file.
[0189] In some embodiments, the data file processing device 555 further includes a storage module 5555, configured to store the metadata of the data file in the first storage space and store the file data of the data file in the second storage space.
[0190] In some embodiments, the storage module 5555 is further configured to traverse the historical logs of the data file to determine the reading frequency of the data file; when the reading frequency of the data file is greater than the reading frequency threshold, the metadata of the data file is stored in the first storage space, and the file data of the data file is stored in the second storage space.
[0191] In some embodiments, the storage module 5555 is further configured to perform feature extraction processing on the data file to obtain feature information of the data file; perform prediction processing based on the feature information of the data file to obtain the cache level of the data file; when the cache level of the data file indicates that the data file needs to be cached, store the metadata of the data file in the first storage space and store the file data of the data file in the second storage space.
[0192] In some embodiments, the storage module 5555 is further configured to divide the first storage space into a plurality of first blocks, wherein the first blocks correspond one-to-one with the cache level; divide the second storage space into a plurality of second blocks, wherein the second blocks correspond one-to-one with the cache level; store the metadata of the data file in the first block corresponding to the cache level of the data file; and store the file data of the data file in the second block corresponding to the cache level of the data file.
[0193] In some embodiments, the data file processing device 555 further includes: a first processing module 5556, configured to, when the similarity between a read historical data file and the data file is greater than a similarity threshold, invoke the file reading interface to pre-obtain the metadata of the data file from the metadata server and store the metadata of the data file in the first storage space; and, based on the metadata of the data file, invoke the file reading interface to pre-obtain the file data of the data file from the file server and store the file data of the data file in the second storage space.
[0194] In some embodiments, the data file processing device 555 further includes: an update module 5557, configured to perform update verification on the file data of the data file obtained from the second storage space; and, when the update verification determines that the corresponding file data of the data file on the file server has been updated, obtain the updated file data of the data file from the file server and update the second storage space based on the updated file data of the data file.
[0195] In some embodiments, the update module 5557 is further configured to encode the file data of the data file obtained from the second storage space to obtain a corresponding verification code; and, when the verification code obtained from the file server is inconsistent with the verification code obtained by encoding, determine that the file data of the data file stored in the second storage space needs to be updated.
[0196] In some embodiments, the data file processing device 555 further includes: a second processing module 5558, configured to, when the available storage space of the first storage space is less than a first storage space threshold, or when a set first cache cleanup time is reached, delete the metadata of some of the data files in the first storage space until the available storage space of the first storage space is not less than the first storage space threshold; and, when the available storage space of the second storage space is less than a second storage space threshold, or when a set second cache cleanup time is reached, delete the file data of some of the data files in the second storage space until the available storage space of the second storage space is not less than the second storage space threshold.
[0197] In some embodiments, the second processing module 5558 is further configured to sort the file data of the data files in the second storage space in descending order of the duration for which the file data has not been read, and delete the file data that ranks first in the descending order; or, sort the file data of the data files in the second storage space in ascending order of the number of times the file data has been read, and delete the file data that ranks first in the ascending order.
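A minimal sketch of the verification-code comparison described above follows. The application does not name the encoding, so a SHA-256 digest from Python's standard `hashlib` module is assumed here; `get_server_code` and `fetch_from_server` are hypothetical callables standing in for requests to the file server.

```python
import hashlib


def verification_code(file_data: bytes) -> str:
    """Encode file data into a verification code (SHA-256 is an assumption)."""
    return hashlib.sha256(file_data).hexdigest()


def needs_update(cached_data: bytes, server_code: str) -> bool:
    """True when the cached copy no longer matches the server's code."""
    return verification_code(cached_data) != server_code


def read_with_verification(file_id, second_space,
                           get_server_code, fetch_from_server):
    """Return cached file data, refreshing the cache first if it is stale."""
    data = second_space[file_id]
    if needs_update(data, get_server_code(file_id)):
        data = fetch_from_server(file_id)
        second_space[file_id] = data  # update the second storage space
    return data
```

Comparing codes rather than full contents means only a short digest has to be obtained from the file server when the cached copy is still current.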
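The threshold-driven cleanup and the two orderings described above (longest unread first, or least read first) might be sketched as follows. The entry fields `size`, `last_read`, and `reads`, the `policy` parameter, and the function name `evict` are assumptions introduced for this example.

```python
def evict(entries, capacity, threshold, policy="lru"):
    """Delete cached entries until available space is not less than threshold.

    entries maps file_id -> {"size": int, "last_read": float, "reads": int}.
    """
    def available():
        return capacity - sum(e["size"] for e in entries.values())

    if policy == "lru":
        # The oldest last-read time is the longest unread duration.
        order = sorted(entries, key=lambda f: entries[f]["last_read"])
    else:
        # The fewest reads come first.
        order = sorted(entries, key=lambda f: entries[f]["reads"])

    evicted = []
    for file_id in order:
        if available() >= threshold:
            break
        del entries[file_id]
        evicted.append(file_id)
    return evicted
```

The same routine could be run either when available space drops below the threshold or when a scheduled cleanup time is reached; the scheduling itself is outside this sketch.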
[0198] This application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. The processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the electronic device to perform the data file processing method described in this application.
[0199] This application provides a computer-readable storage medium storing executable instructions. When these executable instructions are executed by a processor, they cause the processor to execute the data file processing method provided in this application, for example, the data file processing method shown in Figures 3-5.
[0200] In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one or any combination of the above-mentioned memories.
[0201] In some embodiments, executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
[0202] As an example, executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple collaborating files (e.g., files that store one or more modules, subroutines, or code sections).
[0203] As an example, executable instructions can be deployed to execute on a single computing device, or on multiple computing devices located in one location, or on multiple computing devices distributed across multiple locations and interconnected via a communication network.
[0204] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of this application are included within the scope of protection of this application.
Claims
1. A data file processing method, characterized in that, The method includes: In response to a read request for a training data file, a file read interface is invoked to parse the read request for the training data file and obtain the identifier of the training data file. The file read interface is used to bypass the Linux file system. Based on the identifier of the training data file, the first storage space is traversed to determine the storage status of the training data file. The first storage space is memory-based. When the identifier of the training data file is found during traversal of the first storage space, the storage status of the training data file is determined to be cached. When the storage status of the training data file indicates that the training data file has been cached, the metadata of the training data file is obtained from the first storage space based on the identifier of the training data file. Based on the metadata of the training data file, the file data of the training data file is obtained from the second storage space, wherein the training data file is used for model training, and the second storage space is based on a solid-state drive; When the storage status of the training data file indicates that the training data file is not cached, it is determined that the training data file is being read for the first time. Based on the identifier of the training data file, the file read interface is invoked to obtain the metadata of the training data file from the metadata server, and the metadata of the training data file is stored in the first storage space. Based on the metadata of the training data file, the file read interface is invoked to obtain the file data of the training data file from the file server, and the file data of the training data file is stored in the second storage space.
2. The method according to claim 1, characterized in that, Before storing the metadata of the training data file in the first storage space, the method further includes: Traverse the historical logs of the training data file to determine the reading frequency of the training data file; When the reading frequency of the training data file exceeds the reading frequency threshold, the process of storing the metadata in the first storage space is initiated.
3. The method according to claim 1, characterized in that, Before storing the metadata of the training data file in the first storage space, the method further includes: The first storage space is divided into multiple first blocks, wherein each first block corresponds one-to-one with a cache level. The second storage space is divided into multiple second blocks, wherein each second block corresponds one-to-one with a cache level; The step of storing the metadata of the training data file into the first storage space includes: The metadata of the training data file is stored in the first block corresponding to the cache level of the training data file; The step of storing the file data of the training data file in the second storage space includes: The file data of the training data file is stored in the second block corresponding to the cache level of the training data file.
4. The method according to claim 1, characterized in that, Before determining the storage status of the training data file by traversing the first storage space based on the identifier of the training data file, the method further includes: When the similarity between the read historical training data file and the training data file is greater than the similarity threshold, the file reading interface is called to obtain the metadata of the training data file from the metadata server in advance, and the metadata of the training data file is stored in the first storage space; Based on the metadata of the training data file, the file reading interface is called to obtain the file data of the training data file from the file server in advance, and the file data of the training data file is stored in the second storage space.
5. The method according to claim 1, characterized in that, The method further includes: Update verification is performed on the file data of the training data file obtained from the second storage space; When it is determined through the update verification that the corresponding file data of the training data file on the file server has been updated, the updated file data of the training data file is obtained from the file server, and the second storage space is updated based on the updated file data of the training data file.
6. The method according to claim 5, characterized in that, The step of performing update verification on the file data of the training data file obtained from the second storage space includes: The file data of the training data file obtained from the second storage space is encoded to obtain the corresponding verification code; When the verification code obtained from the file server is inconsistent with the verification code obtained by encoding, it is determined that the file data of the training data file stored in the second storage space needs to be updated.
7. A data file processing apparatus, characterized in that, The apparatus includes: The calling module is used to respond to a read request for a training data file by invoking the file read interface to parse the read request for the training data file and obtain the identifier of the training data file. The file read interface is used to bypass the Linux file system. The determination module is used to traverse the first storage space based on the identifier of the training data file to determine the storage status of the training data file, wherein the first storage space is memory-based, and when the identifier of the training data file is found during traversal of the first storage space, the storage status of the training data file is determined to be cached. The first reading module is used to retrieve the metadata of the training data file from the first storage space based on the identifier of the training data file when the storage status of the training data file indicates that the training data file has been cached. Based on the metadata of the training data file, the file data of the training data file is obtained from the second storage space, wherein the training data file is used for model training, and the second storage space is based on a solid-state drive; The second reading module is used to determine that the training data file is being read for the first time when the storage status of the training data file indicates that the training data file is not cached, and based on the identifier of the training data file, to invoke the file read interface to obtain the metadata of the training data file from the metadata server and store it in the first storage space. Based on the metadata of the training data file, the file read interface is invoked to obtain the file data of the training data file from the file server and store it in the second storage space.
8. The apparatus according to claim 7, further comprising: The storage module is used to traverse the historical logs of the training data file to determine the reading frequency of the training data file; When the reading frequency of the training data file exceeds the reading frequency threshold, the process of storing the metadata in the first storage space is initiated.
9. The apparatus according to claim 8, characterized in that, The storage module is further configured to divide the first storage space into multiple first blocks, wherein each first block corresponds one-to-one with a cache level. The second storage space is divided into multiple second blocks, wherein each second block corresponds one-to-one with a cache level; The metadata of the training data file is stored in the first block corresponding to the cache level of the training data file; The file data of the training data file is stored in the second block corresponding to the cache level of the training data file.
10. The apparatus according to claim 7, further comprising: The first processing module is used to call the file reading interface to pre-obtain the metadata of the training data file from the metadata server when the similarity between the read historical training data file and the training data file is greater than the similarity threshold, and to store the metadata of the training data file in the first storage space. Based on the metadata of the training data file, the file reading interface is called to obtain the file data of the training data file from the file server in advance, and the file data of the training data file is stored in the second storage space.
11. The apparatus according to claim 7, further comprising: An update module is used to perform update verification on the file data of the training data file obtained from the second storage space; When it is determined through the update verification that the corresponding file data of the training data file on the file server has been updated, the updated file data of the training data file is obtained from the file server, and the second storage space is updated based on the updated file data of the training data file.
12. The apparatus according to claim 11, characterized in that, The update module is further configured to encode the file data of the training data file obtained from the second storage space to obtain the corresponding verification code; When the verification code obtained from the file server is inconsistent with the verification code obtained by encoding, it is determined that the file data of the training data file stored in the second storage space needs to be updated.
13. An electronic device, characterized in that, The electronic device includes: A memory, used to store executable instructions; A processor, used to implement the data file processing method according to any one of claims 1 to 6 when executing the executable instructions stored in the memory.
14. A computer-readable storage medium, characterized in that, It stores executable instructions for causing a processor to perform the data file processing method according to any one of claims 1 to 6.
15. A computer program product comprising computer instructions, characterized in that, When the computer instructions are executed by a processor, they implement the data file processing method according to any one of claims 1 to 6.