A metadata storage method and device, computer equipment and storage medium

By using unordered indexes and linked list structures to store metadata in a distributed file system, the problems of metadata operation latency and high maintenance costs are solved, enabling fast querying and efficient directory traversal, thus improving system performance.

CN120909997B · Active · Publication Date: 2026-06-30 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2025-07-23
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing distributed file systems suffer from problems in metadata management, such as poor metadata operation latency and throughput, high cost of maintaining ordered indexes, and the need for complex logging or transaction mechanisms for frequent updates. They are particularly inadequate in high-speed network and persistent memory environments.

Method used

Metadata is stored using an unordered index structure and linked lists. The storage location is determined by a hash table, and the storage location information of the metadata is linked within the metadata. This decouples the metadata index from the directory semantics, optimizes metadata operations using a preset memory layout, and optimizes directory traversal by combining multi-threaded parallel processing and caching.

Benefits of technology

It enables fast querying even when the metadata scale increases, reduces the complexity of metadata operations, improves directory traversal efficiency, and reduces metadata operation latency and maintenance costs.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a metadata storage method, apparatus, computer device, and storage medium. The method includes: storing any metadata of a file system in a metadata server, wherein the metadata server is used to store metadata in the file system based on an unordered index structure; determining a second node in the hierarchical directory tree associated with the first node based on the location information of the first node corresponding to the metadata in the hierarchical directory tree, wherein the association between the second node and the first node includes the second node being a child node or a sibling node of the first node; obtaining the storage location information of the associated metadata corresponding to the second node in the metadata server; and linking the storage location information of the associated metadata corresponding to the second node in the metadata server within the metadata based on the association relationship between the first node and the second node.

Description

Technical Field

[0001] This application relates to the field of computer technology, and more specifically to a metadata storage method, apparatus, computer device, and storage medium.

Background Technology

[0002] A distributed file system (DFS) is a file system that stores data across multiple network nodes and coordinates their management. It is widely used in industry for data management and sharing. A DFS typically consists of two parts: metadata management and data management. Metadata management maintains the file system's structural information, such as filenames, directory structure, and permissions, while data management stores the actual file content. Currently, data centers and high-performance computing clusters widely use DFSs to manage large-scale data, and file metadata management is one of the key performance bottlenecks.

[0003] Existing distributed file systems generally use ordered key-value storage systems (such as RocksDB) to store and manage metadata, and rely on the range query operations provided by ordered indexes to realize the directory traversal semantics in the file system. In this way, the directory semantics and the storage index of metadata are tightly coupled. As the size of the file system metadata grows, the maintenance of ordered indexes becomes more difficult, and the latency and throughput performance of metadata operations are poor. Frequent metadata updates also require complex and expensive logging or transaction mechanisms.

Summary of the Invention

[0004] In view of this, this application provides a metadata storage method, apparatus, computer device, and storage medium.

[0005] Specifically, this application is implemented through the following technical solution:

[0006] In a first aspect, embodiments of this application provide a metadata storage method applied to a distributed file system, the method comprising:

[0007] For any metadata of the file system,

[0008] The metadata is stored in a metadata server, wherein the metadata server is used to store the metadata in the file system based on an unordered index structure;

[0009] Based on the location information of the first node corresponding to the metadata in the hierarchical directory tree, the second node associated with the first node in the hierarchical directory tree is determined. The hierarchical directory tree is used to describe the logical structure of storing data in the file system based on ordered key values. The association between the second node and the first node includes the second node being a child node or a sibling node of the first node.

[0010] Obtain the storage location information of the associated metadata corresponding to the second node on the metadata server; and,

[0011] Based on the association between the first node and the second node, the storage location information of the associated metadata corresponding to the second node in the metadata server is linked within the metadata.

[0012] Optionally, the unordered index structure includes a hash table;

[0013] The step of storing the metadata on a metadata server includes:

[0014] Based on the data type of the metadata, determine the key value corresponding to the metadata;

[0015] Perform a hash operation on the key value corresponding to the metadata to determine the storage location of the metadata in the hash table;

[0016] The metadata is stored in the specified storage location.

[0017] Optionally, the metadata includes file metadata and directory metadata, and the directory metadata includes directory access metadata and directory content metadata;

[0018] The key corresponding to the file metadata is the name of the file metadata and the identifier of the parent node of the corresponding node of the file metadata;

[0019] The key corresponding to the directory access metadata is the name of the directory access metadata and the identifier of the parent node of the corresponding node of the directory access metadata;

[0020] The key corresponding to the directory content metadata is the identifier of the node corresponding to the directory content metadata.

[0021] Optionally, storing the metadata according to the storage location includes:

[0022] In the corresponding storage location, the metadata is stored according to a preset memory layout;

[0023] In the preset memory layout, data fields related to the same metadata operation are arranged in the same aligned persistent memory block; data in the same aligned persistent memory block is processed in the same CPU atomic operation.

[0024] Optionally, the data fields of the metadata include file content modification time and file status change time;

[0025] The method further includes:

[0026] In response to receiving a target metadata operation request for any metadata, if the metadata fields to be updated by the target metadata operation request include both the file content modification time and the file status change time, only the file content modification time of that metadata is updated;

[0027] Upon receiving a request to obtain the file status change time of that metadata, the maximum of the updated file content modification time and the file status change time currently recorded in that metadata is taken as the file status change time returned for the request.

[0028] Optionally, the file system metadata includes file metadata and directory metadata, the file metadata corresponds to a first memory layout, and the directory metadata corresponds to a second memory layout, the first memory layout and the second memory layout are different;

[0029] In the first memory layout, the file content modification time and the file byte count are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two fields of the file group identifier are arranged in the same aligned persistent memory block; the upper two fields of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block.

[0030] In the second memory layout, the file content modification time and the head pointer pointing to the child node linked list are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two fields of the file group identifier are arranged in the same aligned persistent memory block; the upper two fields of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block.

[0031] Optionally, the method further includes:

[0032] Receive a directory traversal request sent by a client, and return the directory traversal result to the client based on at least one of the following methods:

[0033] While reading the linked list nodes, the partially read results are returned to the client.

[0034] The linked list is split into multiple segments, and the split segments are read in parallel by multiple threads. After merging the read results, the merged directory traversal result is returned to the client.

[0035] Determine whether the directory traversal result corresponding to the directory traversal request is cached in the dynamic random access memory (DRAM). If so, read the directory traversal result corresponding to the directory traversal request from the dynamic random access memory (DRAM) and return the read directory traversal result to the client. The DRAM caches historical directory traversal results within a preset time range before the current time.

[0036] The compressed directory traversal results are returned to the client.
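The segmented, multi-threaded reading option listed above can be illustrated with a minimal Python sketch. The `nodes` dictionary, the location numbers, and the way segment boundaries are chosen are assumptions made for illustration only; the text does not prescribe them at this point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

# Hypothetical store: location -> (name, location of next sibling or None).
nodes: Dict[int, Tuple[str, Optional[int]]] = {
    10: ("a", 11), 11: ("b", 12), 12: ("c", 13), 13: ("d", None),
}


def read_segment(start: int, stop: Optional[int]) -> List[str]:
    """Read names from `start` up to, but not including, `stop`."""
    names, ptr = [], start
    while ptr is not None and ptr != stop:
        name, ptr = nodes[ptr]
        names.append(name)
    return names


def parallel_listing(segments: List[Tuple[int, Optional[int]]]) -> List[str]:
    """Read each segment in its own thread, then merge in segment order."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        parts = pool.map(lambda seg: read_segment(*seg), segments)
        return [name for part in parts for name in part]


# Two segments: locations 10..11 and 12..end of list (assumed split point).
listing = parallel_listing([(10, 12), (12, None)])
```

Because `pool.map` yields results in the order of its inputs, the merged listing keeps the original linked-list order even though the segments are read concurrently.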

[0037] In a second aspect, embodiments of this application provide a metadata storage device, the device comprising:

[0038] A storage module is used to store any metadata of the file system in a metadata server, wherein the metadata server is used to store the metadata in the file system based on an unordered index structure;

[0039] The determination module is used to determine the second node associated with the first node in the hierarchical directory tree based on the location information of the first node corresponding to the metadata in the hierarchical directory tree. The hierarchical directory tree is used to describe the logical structure of storing data in the file system based on ordered key values. The association between the second node and the first node includes the second node being a child node or a sibling node of the first node.

[0040] The processing module is configured to obtain the storage location information of the associated metadata corresponding to the second node in the metadata server; and, based on the association relationship between the first node and the second node, link the storage location information of the associated metadata corresponding to the second node in the metadata server within the metadata.

[0041] This application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described metadata storage method.

[0042] This specification provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above-described metadata storage method when executing the program.

[0043] In the metadata storage method provided in this application, for any metadata in the file system, the metadata in the file system can be stored based on an unordered index structure. Under the unordered index structure, even if the size of the metadata increases, fast querying of the metadata can be achieved, reducing the complexity of metadata operations. In addition, within the metadata, the storage location information of the associated metadata of the second node associated with the first node of the metadata is linked in the form of a linked list. In this way, when determining the semantics of the directory, it can be done by traversing the linked list within the directory, thereby achieving the maintenance of the logical relationship between the metadata.

Brief Description of the Drawings

[0044] Figure 1 is a flowchart illustrating a metadata storage method according to an exemplary embodiment of this application;

[0045] Figure 2 is a schematic diagram of a hierarchical directory tree in a metadata storage method according to an exemplary embodiment of this application;

[0046] Figure 3 is a schematic diagram of the hash mapping process in a metadata storage method according to an exemplary embodiment of this application;

[0047] Figure 4 is a schematic diagram of the first memory layout in a metadata storage method according to an exemplary embodiment of this application;

[0048] Figure 5 is a schematic diagram of the second memory layout in a metadata storage method according to an exemplary embodiment of this application;

[0049] Figure 6 is a schematic diagram illustrating the file creation process in a metadata storage method according to an exemplary embodiment of this application;

[0050] Figure 7 is a schematic diagram illustrating the optimization process of directory traversal in a metadata storage method according to an exemplary embodiment of this application;

[0051] Figure 8 is a structural diagram of a metadata storage device illustrated in an exemplary embodiment of this application;

[0052] Figure 9 is a schematic diagram of the structure of a computer device provided in this application.

Detailed Description

[0053] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0054] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

[0055] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

[0056] Existing distributed file systems generally employ ordered key-value stores (such as RocksDB) for metadata storage and management, relying on range queries provided by ordered indexes to achieve directory traversal semantics within the file system. This approach, by tightly coupling directory semantics with the metadata storage index, restricts metadata management to an ordered key-value index structure. Furthermore, determining directory semantics relies solely on range queries provided by the ordered indexes for directory traversal. This leads to the following problems with existing technologies under high-speed networks and persistent memory storage:

[0057] (1) The latency and throughput of metadata operations are poor, especially in high-speed network and persistent memory hardware environments.

[0058] For example, querying the metadata of a certain file may require going through layers of directory queries before finally finding the specific file metadata, and these layered directory queries cause a significant time delay.

[0059] (2) As the size of file system metadata grows, the maintenance cost of ordered indexes increases significantly, and system scalability is limited.

[0060] Specifically, ordered key-value indexes require all metadata keys to be arranged in an ordered manner, and when the metadata is large, sorting the keys will incur significant maintenance costs.

[0061] (3) Frequent metadata updates require complex and costly logging or transaction mechanisms.

[0062] Based on this, this application provides a metadata storage method, apparatus, computer device, and storage medium. For any metadata in a file system, the metadata in the file system can be stored based on an unordered index structure. Under the unordered index structure, even if the size of the metadata increases, fast querying of the metadata can be achieved, reducing the complexity of metadata operations. In addition, within the metadata, the storage location information of the associated metadata of the second node associated with the first node of the metadata is linked in the form of a linked list. In this way, when determining the semantics of the directory, it can be done by traversing the linked list within the directory, thereby achieving the maintenance of the logical relationship between the metadata.

[0063] The metadata storage method provided in this application is described below with reference to specific embodiments. This method is applied to a distributed file system. Referring to Figure 1, which is a flowchart of a metadata storage method provided in this application, the method includes the following steps:

[0064] S101. For any metadata of the file system, store the metadata in a metadata server, wherein the metadata server is used to store the metadata in the file system based on an unordered index structure.

[0065] S102. Based on the location information of the first node corresponding to the metadata in the hierarchical directory tree, determine the second node in the hierarchical directory tree that is associated with the first node, wherein the hierarchical directory tree is used to describe the logical structure of storing data of the file system based on ordered key values, and the association between the second node and the first node includes the second node being a child node or a sibling node of the first node.

[0066] S103. Obtain the storage location information of the associated metadata corresponding to the second node in the metadata server; and, based on the association relationship between the first node and the second node, link the storage location information of the associated metadata corresponding to the second node in the metadata server within the metadata.

[0067] The following is a detailed explanation of the steps described above.

[0068] For S101,

[0069] The method provided in this application is implemented by a metadata server. All metadata of the file system can be distributed to multiple metadata servers, and metadata can be stored in accordance with the method provided in this application within each metadata server.

[0070] In one possible implementation, the hierarchical directory tree of the file system can be broken up. Each node in the hierarchical directory tree can be a piece of metadata. The broken-up metadata can include multiple file metadata and directory metadata. The file metadata corresponds to the file nodes in the hierarchical directory tree, and the directory metadata corresponds to the directory nodes (including subdirectory nodes and root directory nodes) in the hierarchical directory tree. The broken-up metadata can then be distributed to at least one metadata server.

[0071] The hierarchical directory tree is used to describe the logical structure of storing data in the file system based on ordered key-value pairs. In one possible implementation, the file system can currently store metadata in an ordered key-value pair manner, and then the hierarchical directory tree of the file system can be determined based on the current storage method of the file system.

[0072] An example of the hierarchical directory tree of the file system is shown in Figure 2. It includes root directory nodes, subdirectory nodes, and file nodes. File nodes can store specific data information. Root directory nodes and subdirectory nodes can contain other subdirectory nodes and file nodes.

[0073] Alternatively, if the fragmented metadata is distributed across multiple metadata servers, the key value of each metadata (which can be the first key value) can be determined first. Then, the metadata server corresponding to each metadata can be determined through a consistent hashing algorithm and the key value of each metadata. Finally, the metadata can be distributed to the corresponding metadata servers.
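As a rough illustration of choosing a metadata server from the first key value, the following Python sketch uses a textbook consistent hash ring. The class name `ConsistentHashRing`, the server names, the number of virtual nodes, and the use of SHA-256 are all assumptions; the text only states that a consistent hashing algorithm is applied to the key value.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer (the hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring that maps a first key value to a metadata server."""

    def __init__(self, servers, vnodes=64):
        # Each server is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, first_key: int) -> str:
        # Walk clockwise to the first ring point at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, _hash(str(first_key)))
        return self._ring[idx % len(self._ring)][1]


servers = ["mds-0", "mds-1", "mds-2"]
ring = ConsistentHashRing(servers)
# Two lookups with the same first key value (e.g. the same parent
# directory identifier) always resolve to the same server.
server_of_f1 = ring.server_for(1)
server_of_f2 = ring.server_for(1)
```

With a ring of this kind, adding or removing a server only remaps the keys adjacent to that server's points, which is the usual reason for preferring consistent hashing over a plain modulo.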

[0074] In one possible implementation, the fragmented metadata includes multiple file metadata and directory metadata, and each directory metadata may include directory access metadata and directory content metadata. The first key value of different types of metadata can be different. For example, for file metadata, its first key value can be the identifier of the parent node of the node corresponding to the file metadata; the first key value of directory access metadata can also be the identifier of the parent node of the node corresponding to the directory access metadata; and the first key value of directory content metadata can be the identifier of the node corresponding to the directory content metadata.

[0075] Optionally, file metadata can be used to describe the basic attributes, data location, and access information of a file, such as file identification information, file attribute information (e.g., file type, permission information, creation time, modification time, access time, etc.), and data location information; directory metadata is used to maintain the organizational structure of files / subdirectories within a directory, and may include, for example, directory identification information, directory name, and directory attribute information (e.g., permission information, creation time, modification time, access time, etc.).

[0076] In one possible implementation, the unordered index structure may include a hash table. When the metadata server stores metadata in the file system based on the unordered index structure, it may first determine the storage location of the metadata in the hash table, and then store the metadata according to the storage location.

[0077] For example, one can first determine the key value corresponding to the metadata (which can be the second key value) based on the data type of the metadata, and then perform a hash operation on the key value corresponding to the metadata to determine the storage location of the metadata in the hash table.

[0078] Different types of metadata can have different keys. For example, the key corresponding to the file metadata can be the name of the file metadata and the identifier of the parent node of the corresponding node of the file metadata; the key corresponding to the directory access metadata can be the name of the directory access metadata and the identifier of the parent node of the corresponding node of the directory access metadata; and the key corresponding to the directory content metadata can be the identifier of the node corresponding to the directory content metadata.

[0079] If the node corresponding to the metadata of a certain directory is the root node, then the metadata of that directory does not need to be distinguished as directory access metadata and directory content metadata, and its corresponding key can be the identifier of the root node.

[0080] For example, suppose the hierarchical directory tree is as shown on the left of Figure 3. It includes the root directory node Dir, whose identifier is, for example, 1, and three file nodes under it, namely F1, F2, and F3. After the hierarchical directory tree is broken up, there are four pieces of metadata: the Dir directory metadata and the F1, F2, and F3 file metadata. When these are stored in the hash table, the Dir directory metadata belongs to the root node and has no parent node, so the root node's identifier "1" is hashed (h(1) in the figure denotes hashing 1); if the resulting position in the hash table is "0", the Dir directory metadata is stored at position "0". Correspondingly, for the F1, F2, and F3 file metadata, the identifier of the parent node of the corresponding node (i.e., the identifier "1" of Dir) and the file name are hashed together, yielding storage position "2" for F3, "4" for F1, and "6" for F2.
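The key derivation in this example can be sketched in Python as follows. A built-in `dict` stands in for the hash table, so the slot numbers from Figure 3 are not reproduced; the `Metadata` class, the `kind` strings, and the node identifiers 2 to 4 for the files are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

ROOT_ID = 1  # identifier of the root directory "Dir" in the example


@dataclass
class Metadata:
    node_id: int
    name: str
    parent_id: Optional[int]  # None for the root directory
    kind: str                 # "file", "dir_access" or "dir_content"


def second_key(meta: Metadata) -> Union[int, Tuple[int, str]]:
    """Derive the hash-table key from the metadata type."""
    if meta.parent_id is None or meta.kind == "dir_content":
        # Root directory or directory content metadata: own identifier.
        return meta.node_id
    # File metadata or directory access metadata: parent identifier + name.
    return (meta.parent_id, meta.name)


table = {}  # Python dict used as a stand-in for the unordered index

for meta in (
    Metadata(1, "Dir", None, "dir_content"),
    Metadata(2, "F1", ROOT_ID, "file"),
    Metadata(3, "F2", ROOT_ID, "file"),
    Metadata(4, "F3", ROOT_ID, "file"),
):
    table[second_key(meta)] = meta

# A lookup needs only the parent identifier and the name: one hash probe.
f2 = table[(ROOT_ID, "F2")]
```

The point of the sketch is that a lookup touches a single slot regardless of how many entries the table holds, which is the O(1) behaviour the description attributes to non-traversal operations.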

[0081] It should be noted that a hash operation is performed on the second key value when determining the storage location of each metadata in the hash table, and a hash operation is also performed on the first key value when distributing the metadata to the metadata servers. These are two separate hash operations, and their calculation methods can differ.

[0082] After the aforementioned metadata is stored, each storage location in the hash table has a corresponding storage location in persistent memory (PM). Metadata storage can actually be understood as storing it in the PM storage location corresponding to the hash table location.

[0083] In this way, the metadata in the file system can be stored in an unordered index. This makes the time complexity of metadata operations other than directory traversal (such as getting file status, reading, etc.) O(1), reducing the time complexity of metadata operations.

[0084] For S102 and S103,

[0085] Directory semantics refers to the content contained within a directory. With the ordered key-value indexing method of the related technologies, directory semantics can be determined by directory traversal based on range queries. In this application, however, the metadata is stored using an unordered index, so directory semantics cannot be determined through range queries on the index.

[0086] Based on this, when storing metadata, this application can determine the second node associated with the first node in the hierarchical directory tree based on the location information of the first node corresponding to that metadata, and link the storage location information of the associated metadata corresponding to the second node in the metadata server within the metadata. That is, this application uses a linked list to link other metadata within the metadata, thereby maintaining the logical relationships between metadata.

[0087] Here, the association between the second node and the first node includes the second node being a child node or sibling node of the first node. Linking the storage location information of the associated metadata within the metadata can mean referring to that storage location through a pointer. The storage location of the associated metadata in the metadata server is the same as its storage location in PM.

[0088] For example, for file metadata, the second node associated with the first node can be a sibling node. The file metadata can include a sibling pointer "sibling_ptr" data field, which can point to the associated metadata of the second node. For directory metadata, the second node associated with the first node can include sibling nodes and child nodes. The directory metadata can include a sibling pointer "sibling_ptr" and a child node pointer "child_ptr" data field, where the sibling pointer can point to the metadata of the sibling node and the child node pointer can point to the metadata of the child node.

[0089] In implementation, for each directory metadata, a linked list corresponding to that directory metadata can be maintained. This linked list can be a singly linked list, with the head pointer being the pointer to the child nodes of the directory metadata. The head pointer points to the storage location information of the metadata corresponding to the child nodes of the directory metadata on the metadata server. If the directory metadata has multiple child nodes, the head pointer of the linked list points to the storage location information of the metadata corresponding to the first created child node under that directory metadata on the metadata server.

[0090] For example, if the child nodes under directory A include file 1, file 2 and subdirectory 1, and they were created in the order file 1, file 2, subdirectory 1, then the linked list of directory A is structured as follows: the child node pointer of directory A points to the storage location information of the metadata of file 1, the sibling node pointer of file 1 points to the storage location information of the metadata of file 2, and the sibling node pointer of file 2 points to the storage location information of the metadata of subdirectory 1.

[0091] In this way, when determining the directory semantics of directory A, we can find file 1 by using the head pointer of the linked list of directory A, then find file 2 by using the sibling pointer of file 1, and finally find subdirectory 1 by using the sibling pointer of file 2, thus constructing the complete directory semantics of directory A: file 1, file 2, subdirectory 1.
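The directory A example can be reproduced with a short Python sketch. The `store` dictionary stands in for persistent memory, the integer locations are arbitrary, and appending by walking to the tail is an assumption made for simplicity; the text specifies only that the head pointer refers to the first-created child.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class Entry:
    name: str
    sibling_ptr: Optional[int] = None  # location of the next sibling
    child_ptr: Optional[int] = None    # directories only: head of child list


store: Dict[int, Entry] = {}  # hypothetical location -> metadata map


def create(location: int, name: str, parent: Optional[int] = None) -> None:
    """Store a new entry and append it to the parent's child list."""
    store[location] = Entry(name)
    if parent is None:
        return
    directory = store[parent]
    if directory.child_ptr is None:
        directory.child_ptr = location  # first-created child becomes the head
        return
    tail = store[directory.child_ptr]
    while tail.sibling_ptr is not None:
        tail = store[tail.sibling_ptr]
    tail.sibling_ptr = location


def list_dir(location: int) -> Iterator[str]:
    """Follow child_ptr once, then sibling_ptr until the list ends."""
    ptr = store[location].child_ptr
    while ptr is not None:
        yield store[ptr].name
        ptr = store[ptr].sibling_ptr


create(0, "directory A")
create(4, "file 1", parent=0)
create(6, "file 2", parent=0)
create(2, "subdirectory 1", parent=0)
listing = list(list_dir(0))
```

Note that the listing order follows creation order, not the numeric order of the storage locations, which is what decouples directory semantics from the index.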

[0092] By storing metadata using an unordered index and maintaining the directory semantics of metadata using a linked list, the decoupling between metadata index and directory semantics can be achieved, reducing the complexity of metadata operations.

[0093] In one possible implementation, when storing metadata according to corresponding storage locations, the metadata can be stored in the corresponding locations according to a preset memory layout. In this preset memory layout, data fields related to the same metadata operation are arranged in the same aligned persistent memory block. The length of the persistent memory block is a preset length, typically 16 bytes, and will be used as an example below. Data in the same aligned persistent memory block is processed in the same CPU atomic operation.

[0094] Here, different types of metadata can have different preset memory layouts. For example, file metadata corresponds to the first memory layout, and directory metadata corresponds to the second memory layout. The first memory layout and the second memory layout are different. In the first memory layout, the file content modification time and the file byte count are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two fields of the file group identifier are arranged in the same aligned persistent memory block; the upper two fields of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block. In the second memory layout, the file content modification time and the head pointer pointing to the child node linked list are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two fields of the file group identifier are arranged in the same aligned persistent memory block; the upper two fields of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block.
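To make the idea of co-locating fields concrete, the following Python sketch packs the file content modification time and the file byte count into one 16-byte block. The field widths (two unsigned 64-bit integers), the byte order, and the 64-byte record size are assumptions, since the text does not give them. Python cannot issue a 16-byte atomic store, so the sketch shows only the layout, not the atomicity.

```python
import struct

BLOCK = 16  # aligned persistent memory block length used in the example

# Assumed field widths: two unsigned 64-bit little-endian integers.
MTIME_SIZE_BLOCK = struct.Struct("<QQ")  # mtime, size


def write_block(buf: bytearray, block_index: int, mtime: int, size: int) -> None:
    """Overwrite one aligned 16-byte block with both fields at once."""
    offset = block_index * BLOCK
    buf[offset:offset + BLOCK] = MTIME_SIZE_BLOCK.pack(mtime, size)


def read_block(buf: bytearray, block_index: int) -> tuple:
    """Return (mtime, size) from the given aligned block."""
    return MTIME_SIZE_BLOCK.unpack_from(buf, block_index * BLOCK)


record = bytearray(4 * BLOCK)  # a hypothetical 64-byte file metadata record
write_block(record, 0, mtime=1_700_000_000, size=4096)
mtime, size = read_block(record, 0)
```

A write to a file typically changes both its modification time and its size, so placing the two fields in the same block means one block-sized update covers the whole operation.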

[0095] For example, the first memory layout can be as shown in Figure 4, and the second memory layout can be as shown in Figure 5. The meanings of the fields in these two memory layouts are given in the table below:

[0096]
ID: A globally unique identifier, also known as the inode number.
dev: Identifies the device where the file resides.
mode: Identifies the file type and access permissions.
gid_l: The lower two bytes of the file's group ID.
uid: File owner (user ID).
ctime: File status (i.e., file metadata information) change time.
gid_u: The upper two bytes of the file's group ID.
status: File status information (normal status, deleted status, etc.).
bdtime: Creation time.
sibling_ptr: Pointer to the metadata of the next sibling node.
parent_ptr: Pointer to the metadata of the parent node.
mtime: File content modification time.
atime: Last access time of the file content.

[0097] The metadata fields specific to the first memory layout corresponding to the file metadata include the following:

[0098]
size: File size in bytes.
blksize: File block size (in bytes).
nlink: Number of hard links pointing to the file.
rdev: If the file is a device file, the rdev field contains the device's primary and secondary IDs.

[0099] The metadata fields specific to the second memory layout corresponding to the directory metadata include the following:

[0100]
child_ptr: Head pointer of the child node linked list.
ll_len: Length of the child node linked list.
ll_mid: The middle node of the child node linked list.

[0101] In the table above, `ctime` records the last modification time of the file metadata (such as permissions, owner, and size), and `mtime` records the last modification time of the file content. The two are ordered: `ctime` increases monotonically, and since a metadata modification (which updates `ctime`) accompanies or follows a content modification (which updates `mtime`), `ctime` is always greater than or equal to `mtime`. Based on this property, when a metadata operation needs to update both `ctime` and `mtime`, it is not necessary to write `ctime` explicitly; only `mtime` needs to be updated. Any subsequent access can infer `ctime` dynamically as `max(ctime, mtime)`.

[0102] In one possible implementation, in response to receiving a target metadata operation request for any metadata, if the metadata update fields corresponding to the target metadata operation request include the file content modification time (mtime) and the file status change time (ctime), only the file content modification time of that metadata is updated; then, after a request to obtain the file status change time of that metadata is received, the maximum of the updated file content modification time and the file status change time currently recorded in that metadata is taken as the file status change time returned for the request.
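The update rule above can be illustrated with a minimal Python sketch. The `Meta` class, the function names, and the integer timestamps are hypothetical and serve only to show the `max(ctime, mtime)` inference; they are not taken from this application.

```python
from dataclasses import dataclass


@dataclass
class Meta:
    mtime: int = 0  # file content modification time
    ctime: int = 0  # file status change time as last written explicitly


def update_mtime_and_ctime(meta: Meta, now: int) -> None:
    # The operation logically changes both mtime and ctime,
    # but only mtime is written.
    meta.mtime = now


def effective_ctime(meta: Meta) -> int:
    # ctime is inferred on read as the larger of the two recorded values.
    return max(meta.ctime, meta.mtime)


m = Meta(mtime=100, ctime=120)
update_mtime_and_ctime(m, 150)
```

After the update, the stored `ctime` is still 120, while `effective_ctime(m)` reports the newer value carried by `mtime`.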

[0103] Based on the aforementioned preset memory layout and the update settings for ctime and mtime, the method provided in this application can guarantee crash consistency.

[0104] Current central processing units (CPUs) support atomic memory operations on aligned 16-byte data. The metadata is stored in persistent memory (PM) after being laid out as described above. The persistence granularity of PM is one cache line, and an aligned 16-byte block falls within a single cache line, so aligned 16-byte data can be both updated and persisted atomically.

[0105] Specifically, all single-point metadata operations can be completed through a single CPU atomic write operation to update the data. For example, a single-point metadata operation may include the following:

[0106] Write and truncate operations:

[0107] Write operations are divided into two types: "overwrite" and "append". Overwrite operations only need to update the file's modification time (mtime); append operations need to update both the modification time (mtime) and the file size (size). Truncate operations need to update the modification time (mtime), state change time (ctime), and file size (size), similar to append operations.

[0108] For overwrite operations, only an 8-byte atomic write is needed to update `mtime`. Append operations require updating both `mtime` and the file size (`size`). Since, in the memory layout of Figure 4, `mtime` and `size` reside in the same aligned 16-byte persistent memory block, this can be performed in a single atomic operation. Truncation operations are similar to append operations: they also only write `mtime` and `size`, since `ctime` is inferred from `mtime` as described above.
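As a rough illustration of the append case, the following Python sketch packs `mtime` and `size` into one 16-byte block and writes it at an aligned offset. The 8-byte width of `size`, the field order, the little-endian encoding, and the helper names are assumptions. The `bytearray` only stands in for persistent memory, and Python gives no hardware atomicity, so `write_block` merely marks where a single aligned 16-byte atomic store would occur.

```python
import struct

BLOCK = 16  # assumed length of an aligned persistent memory block, in bytes


def write_block(pm: bytearray, offset: int, payload: bytes) -> None:
    # Stand-in for one aligned 16-byte atomic store (not atomic in Python).
    if offset % BLOCK != 0 or len(payload) != BLOCK:
        raise ValueError("payload must be exactly one aligned 16-byte block")
    pm[offset:offset + BLOCK] = payload


def append_update(pm: bytearray, offset: int, now: int, new_size: int) -> None:
    # mtime (8 bytes) and size (8 bytes) share one block, so one write suffices.
    write_block(pm, offset, struct.pack("<QQ", now, new_size))


def read_mtime_size(pm: bytearray, offset: int):
    return struct.unpack_from("<QQ", pm, offset)


pm = bytearray(64)
append_update(pm, 16, now=1700, new_size=4096)
```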

[0109] Read operations and directory traversal operations (Read and readdir):

[0110] A read operation refers to reading the contents of a file, which only requires updating the access time (atime).

[0111] The directory traversal operation (Readdir) lists the files and subdirectories under a directory. Similarly, it only requires updating the access time (atime), which can be completed in a single atomic operation.

[0112] Change owner and change permissions (Chown and chmod):

[0113] The change of owner operation (Chown) involves modifying the file's user ID (uid) and group ID (gid), while the change of permissions operation (Chmod) involves modifying the file's permissions (mode). Both operations also update the state change time (ctime). These fields total 18 bytes (4 bytes for uid, 4 bytes for gid, 2 bytes for mode, and 8 bytes for ctime), exceeding the size of a persistent memory block.

[0114] To support simultaneous updates of the aforementioned fields via a single 16-byte write operation, `gid` can be split into its high two bytes (`gid_u`) and low two bytes (`gid_l`). The low two bytes (`gid_l`) and the other required fields (`uid`, `mode`, `ctime`) are then placed in one aligned 16-byte persistent memory block (4 + 2 + 8 + 2 = 16 bytes). This way, during `Chown` and `Chmod` operations, when `gid` is less than 2^16, these fields can be updated through a single 16-byte atomic write operation.
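The split of `gid` can be sketched in Python as follows. The order of the fields inside the block and the little-endian encoding are assumptions chosen for illustration, and the function names are hypothetical; only the field widths (2-byte `mode`, 4-byte `uid`, 8-byte `ctime`, 2-byte `gid_l`) follow the description above.

```python
import struct

OWNER_BLOCK = "<HHIQ"  # assumed order: mode, gid_l, uid, ctime (2 + 2 + 4 + 8 bytes)


def split_gid(gid: int):
    # Returns (gid_l, gid_u): the low and high two bytes of a 4-byte gid.
    return gid & 0xFFFF, (gid >> 16) & 0xFFFF


def pack_owner_block(mode: int, uid: int, ctime: int, gid: int):
    gid_l, gid_u = split_gid(gid)
    block = struct.pack(OWNER_BLOCK, mode, gid_l, uid, ctime)
    # When gid < 2**16, gid_u is zero and the 16-byte block alone carries the update.
    return block, gid_u


block, gid_u = pack_owner_block(mode=0o644, uid=1000, ctime=1_700_000_000, gid=1000)
```

If `gid_u` is non-zero, the block holding `gid_u` would need a separate write, which falls outside the single-write case described above.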

[0115] In addition, multi-point metadata operations can also be completed with a few CPU atomic operations. Specifically, multi-point metadata operations can include the following:

[0116] File creation operation (Create): Creates a new file in a specified directory, requiring updates to both the file's metadata and the parent directory's metadata. The process of creating file metadata is shown in Figure 6 and includes the following steps:

[0117] (1) Insert file metadata into hash table (atomic write operation 1).

[0118] Create a new file metadata F2 (containing the filename, parent pointer, sibling pointer, etc.) and insert it into the hash table. Its second key value is (1 / F2). At this time, the file status is marked as creating (StatusCreating, SC). In the diagram, "P" represents the parent pointer, and the parent pointer of F2 points to the metadata marked "1". "S" represents the sibling pointer, and the sibling pointer of F2 points to F1.

[0119] (2) Update the parent directory metadata (atomic write operation 2).

[0120] The parent directory metadata (i.e., the directory metadata marked "1") needs to update `mtime` and the `child_ptr` pointer. The `child_ptr` pointer needs to point to the metadata of the new file (i.e., adding the new file F2 to the directory's linked list). Because `mtime` and `child_ptr` are in the same aligned persistent memory block in the second memory layout shown in Figure 5, the parent directory metadata can be updated in a single atomic write operation.

[0121] (3) Update file status (atomic write operation 3).

[0122] A single atomic write operation can update the file status of F2 from SC to Normal (SN), indicating that the file creation is complete.
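The three steps of the file creation operation can be sketched in Python as below. A `dict` stands in for the hash table, dictionary keys stand in for pointers, and the names `create`, `SC`, and `SN` as well as the record fields are illustrative assumptions. Each commented step marks where one atomic write would occur, although Python itself does not make these writes atomic.

```python
SC, SN = "creating", "normal"


def create(table: dict, parent_id: int, name: str, now: int):
    parent = table[parent_id]
    key = (parent_id, name)
    # Step 1: insert the file metadata into the hash table, marked as creating.
    # The sibling pointer refers to the previous head of the child linked list.
    table[key] = {"name": name, "parent_ptr": parent_id,
                  "sibling_ptr": parent["child_ptr"], "status": SC}
    # Step 2: update mtime and child_ptr of the parent directory together.
    parent["mtime"], parent["child_ptr"] = now, key
    # Step 3: mark the file as normal; creation is complete.
    table[key]["status"] = SN
    return key


table = {1: {"mtime": 0, "child_ptr": None}}
f1 = create(table, 1, "F1", now=10)
f2 = create(table, 1, "F2", now=20)
```

After both calls, the directory's `child_ptr` refers to F2 and F2's sibling pointer refers to F1, matching the arrangement described for Figure 6.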

[0123] The `Mkdir` operation creates a new subdirectory under a specified directory. In addition to updating the directory access metadata and parent directory metadata, it also inserts the directory content metadata. Specifically, this may include the following steps:

[0124] (1) Insert directory access metadata (same as step (1) of creating a file).

[0125] Access metadata for subdirectories (including name, parent directory pointer, etc.) is inserted into a hash table and marked with a status of SC.

[0126] (2) Update the parent directory metadata (same as step (2) of the file creation operation):

[0127] The parent directory's mtime and child_ptr are updated via an atomic write, adding the child directory to the parent directory's linked list.

[0128] (3) Insert directory content metadata.

[0129] The content metadata of the subdirectories (used to maintain a linked list of the subdirectories themselves) is inserted into a hash table (the key being the subdirectory's own ID). Since the content metadata can be generated from the access metadata, it can be rebuilt even in the event of a crash, without the need for distributed transactions.

[0130] (4) Update directory status (same as step 3 of create):

[0131] The status of the subdirectory access metadata is updated from SC to SN, indicating that creation is complete.

[0132] Deleting files and directories (Delete and rmdir):

[0133] Delete file: Removes the specified file, marks the file status as "deleting", and updates the metadata of the parent directory. This may include the following steps:

[0134] (1) Mark the file as "deleting" (atomic write operation 1):

[0135] A single 16-byte atomic write marks the file status as StatusDeleting and updates the bdtime (creation-deletion timestamp) to record the deletion initiation time.

[0136] (2) Update the parent directory metadata (atomic write operation 2):

[0137] The parent directory's mtime is updated via an 8-byte atomic write, which serves as the commit point. If the system crashes after this point, the file status is still StatusDeleting but the parent directory has already been updated, so the file is subsequently treated as "should be deleted" and cleaned up.

[0138] (3) Mark the file as "deleted" (atomic write operation 3):

[0139] Update the file status from StatusDeleting to StatusDeleted (deleted). The physical deletion of the file (removal from the hash table and linked list) is deferred (e.g., updating the linked list pointer during directory traversal, skipping deleted nodes), reducing real-time operation overhead.
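A minimal Python sketch of the deletion steps and of the deferred physical deletion follows. The `dict` standing in for the hash table, the use of keys as pointers, and the function names are assumptions for illustration. In this sketch, traversal unlinks only nodes already marked as deleted and lists only nodes in normal status.

```python
NORMAL, DELETING, DELETED = "normal", "deleting", "deleted"


def delete(table: dict, parent_id: int, name: str, now: int) -> None:
    node = table[(parent_id, name)]
    node["status"], node["bdtime"] = DELETING, now  # step 1
    table[parent_id]["mtime"] = now                 # step 2 (commit point)
    node["status"] = DELETED                        # step 3; node stays linked


def readdir(table: dict, dir_id: int) -> list:
    names, prev = [], None
    key = table[dir_id]["child_ptr"]
    while key is not None:
        node = table[key]
        nxt = node["sibling_ptr"]
        if node["status"] == DELETED:
            # Deferred physical deletion: bypass the node, then drop it.
            if prev is None:
                table[dir_id]["child_ptr"] = nxt
            else:
                table[prev]["sibling_ptr"] = nxt
            del table[key]
        else:
            if node["status"] == NORMAL:
                names.append(node["name"])
            prev = key
        key = nxt
    return names


table = {
    1: {"mtime": 0, "child_ptr": (1, "F2")},
    (1, "F2"): {"name": "F2", "sibling_ptr": (1, "F1"), "status": NORMAL},
    (1, "F1"): {"name": "F1", "sibling_ptr": None, "status": NORMAL},
}
delete(table, 1, "F2", now=42)
listing = readdir(table, 1)
```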

[0140] Delete directory (rmdir): Removes the specified subdirectory. It first checks if the directory is empty before performing steps similar to file deletion, and additionally removes the directory's metadata. Specifically, it may include the following steps:

[0141] (1) Check if the directory is empty:

[0142] The child_ptr (head pointer of the linked list) of the subdirectory's content metadata is accessed to check whether the linked list is empty (has no child nodes); if it is not empty, the deletion fails.

[0143] (2) Mark directory access metadata as "deleting" (atomic write operation 1):

[0144] Similar to step 1 of delete, mark the status of directory access metadata as StatusDeleting and update bdtime.

[0145] (3) Update the parent directory metadata (atomic write operation 2):

[0146] Similar to step 2 of the delete operation, update the mtime in the parent directory.

[0147] (4) Clean up directory metadata and mark it as "deleted" (atomic write operation 3):

[0148] Specifically, the directory access metadata status can be updated to StatusDeleted, and the directory content metadata can be deleted from the hash table (since the directory is empty, the content metadata is unnecessary).

[0149] In the above operations, both single-point and multi-point metadata operations can be completed with a small number of atomic operations by utilizing the memory layout and the update mechanism for mtime.

[0150] In one possible implementation, after receiving a directory traversal request from the client, the directory traversal result can also be returned to the client based on at least one of the following methods:

[0151] While reading the linked list nodes, the partially read results are returned to the client.

[0152] The linked list is split into multiple segments, and the split segments are read in parallel by multiple threads. After merging the read results, the merged directory traversal result is returned to the client.

[0153] Determine whether the directory traversal result corresponding to the directory traversal request is cached in the dynamic random access memory (DRAM). If so, read the directory traversal result corresponding to the directory traversal request from the dynamic random access memory (DRAM) and return the read directory traversal result to the client. The DRAM caches historical directory traversal results within a preset time range before the current time.

[0154] The compressed directory traversal results are returned to the client.

[0155] For example, the optimization process for directory traversal is shown in Figure 7 and includes:

[0156] 1. Basic Traversal: Unoptimized serial reading process:

[0157] After the client sends a directory traversal request (readdir) via Remote Procedure Call (RPC), the server serially reads the linked list nodes from persistent memory (PM) (accessing the metadata of each file in sequence along the sibling_ptr), and returns the results to the client all at once after all the nodes have been read.

[0158] However, PM reading and network transmission are sequential. If the directory contains a large number of files (such as millions), the total time = PM reading time + network transmission time, resulting in high latency.

[0159] 2. Pipeline Optimization (+Pipeline): Overlapping PM reading and network transmission processes:

[0160] While the server is reading the first node in PM, it starts returning the read results to the client via the network; when reading the second node, it continues to transmit the results of the first node, and so on, allowing "PM reading" and "network transmission" to proceed in parallel (pipeline overlap), which can hide some of the PM reading delay.
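The overlap between PM reading and network transmission can be sketched with a reader thread and a bounded queue in Python. The batch size, the queue depth, the `send` callback standing in for the network transmission, and the in-memory `nodes` dictionary standing in for PM are all assumptions made for this illustration.

```python
import queue
import threading


def pipelined_readdir(nodes: dict, head, send, batch_size: int = 2) -> None:
    q = queue.Queue(maxsize=4)

    def reader():
        # Walks the sibling list and hands over results batch by batch.
        batch, key = [], head
        while key is not None:
            batch.append(nodes[key]["name"])
            key = nodes[key]["sibling_ptr"]
            if len(batch) == batch_size:
                q.put(batch)
                batch = []
        if batch:
            q.put(batch)
        q.put(None)  # end-of-list marker

    t = threading.Thread(target=reader)
    t.start()
    # Batches are sent while the reader keeps walking the list.
    while (batch := q.get()) is not None:
        send(batch)
    t.join()


nodes = {i: {"name": f"f{i}", "sibling_ptr": i + 1 if i < 4 else None}
         for i in range(5)}
sent = []
pipelined_readdir(nodes, 0, sent.append)
```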

[0161] 3. Parallel Read Optimization (+Parallel): The process of splitting the linked list into multiple segments for simultaneous reading is as follows:

[0162] For directories containing a large number of files, the server splits the linked list into multiple segments (e.g., two segments) based on its length (ll_len) and midpoint pointer (ll_mid). Multiple threads then read the nodes of different segments in parallel, and the results are merged and returned to the client after completion. This solves the problem of "serial reading node by node" in long linked lists, reducing traversal time.
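A two-segment version of the parallel read can be sketched in Python as follows. The use of `ThreadPoolExecutor`, the dictionary-based nodes, and the function names are assumptions. The sketch splits the list once at `ll_mid` and concatenates the two partial results in list order.

```python
from concurrent.futures import ThreadPoolExecutor


def read_segment(nodes: dict, start, stop) -> list:
    # Reads names from `start` up to, but not including, `stop`.
    out, key = [], start
    while key is not None and key != stop:
        out.append(nodes[key]["name"])
        key = nodes[key]["sibling_ptr"]
    return out


def parallel_readdir(nodes: dict, directory: dict) -> list:
    head, mid = directory["child_ptr"], directory["ll_mid"]
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(read_segment, nodes, head, mid)
        second = pool.submit(read_segment, nodes, mid, None)
        # Merge the two segments in their original order.
        return first.result() + second.result()


nodes = {i: {"name": f"f{i}", "sibling_ptr": i + 1 if i < 5 else None}
         for i in range(6)}
directory = {"child_ptr": 0, "ll_len": 6, "ll_mid": 3}
names = parallel_readdir(nodes, directory)
```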

[0163] 4. DRAM Cache Optimization (+Cache): Reusing the traversal results of hot directories:

[0164] The server caches the most recently visited directory list nodes (i.e., those within a preset time range before the current moment) in DRAM. When a client traverses the same directory again, the server reads the cached results directly from DRAM without needing to access PM again.

[0165] In this way, for frequently accessed targets, the traversal time can be close to the DRAM access speed, avoiding repeated consumption of PM bandwidth.
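The time-bounded cache can be sketched in Python as below. The class name, the `load` callback that stands in for reading from PM, and the injectable clock are hypothetical. The sketch does not cover invalidation when a directory is modified, which would need to be handled separately.

```python
import time


class ReaddirCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # dir_id -> (time cached, traversal result)

    def get(self, dir_id, load):
        now = self.clock()
        hit = self.entries.get(dir_id)
        if hit is not None and now - hit[0] <= self.ttl:
            return hit[1]  # served from the in-memory cache
        result = load(dir_id)  # fall back to reading the linked list
        self.entries[dir_id] = (now, result)
        return result


calls = []


def load_from_pm(dir_id):
    calls.append(dir_id)
    return ["a", "b"]


now = [0.0]
cache = ReaddirCache(5.0, clock=lambda: now[0])
cache.get(1, load_from_pm)
cache.get(1, load_from_pm)  # within the time range: no second load
now[0] = 10.0
result = cache.get(1, load_from_pm)  # outside the time range: reloaded
```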

[0166] 5. Data Compression Optimization (+Compress): Reducing the network transmission volume:

[0167] Before returning the traversal results to the client, the server first compresses the directory entries (such as filenames, inode numbers, and other information with high repetition) using a compression algorithm, and then transmits the compressed data. The client receives the data, decompresses it, and uses it.

[0168] This can significantly reduce the amount of data transmitted over the network and lower network latency.
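As an illustration of the compression step, the following Python sketch serializes directory entries as JSON and compresses them with zlib. This application does not name a compression algorithm or a wire format, so both choices, as well as the function names and the sample entries, are assumptions.

```python
import json
import zlib


def compress_entries(entries: list) -> bytes:
    # Server side: serialize the directory entries, then compress them.
    return zlib.compress(json.dumps(entries).encode("utf-8"))


def decompress_entries(payload: bytes) -> list:
    # Client side: decompress, then parse the directory entries.
    return json.loads(zlib.decompress(payload).decode("utf-8"))


entries = [{"name": f"file_{i:06d}.log", "ino": 1000 + i} for i in range(1000)]
raw = json.dumps(entries).encode("utf-8")
payload = compress_entries(entries)
restored = decompress_entries(payload)
```

Entries with repetitive names and consecutive inode numbers, as in this sample, compress well, so `payload` is considerably smaller than `raw`.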

[0169] Corresponding to the embodiments of the aforementioned metadata storage method, this application also provides embodiments of a metadata storage device. Figure 8 is a schematic diagram of the metadata storage device provided in this application, which specifically includes:

[0170] Storage module 801 is used to store any metadata of the file system in a metadata server, wherein the metadata server is used to store the metadata in the file system based on an unordered index structure;

[0171] The determining module 802 is used to determine the second node associated with the first node in the hierarchical directory tree based on the location information of the first node corresponding to the metadata in the hierarchical directory tree, wherein the hierarchical directory tree is used to describe the logical structure of storing data in the file system based on ordered key values, and the association between the second node and the first node includes the second node being a child node or a sibling node of the first node.

[0172] The processing module 803 is used to obtain the storage location information of the associated metadata corresponding to the second node in the metadata server; and, based on the association relationship between the first node and the second node, to link the storage location information of the associated metadata corresponding to the second node in the metadata server within the metadata.

[0173] Optionally, the unordered index structure includes a hash table;

[0174] The storage module 801, when storing the metadata on the metadata server, is used for:

[0175] Based on the data type of the metadata, determine the key value corresponding to the metadata;

[0176] Perform a hash operation on the key value corresponding to the metadata to determine the storage location of the metadata in the hash table;

[0177] The metadata is stored in the specified storage location.

[0178] Optionally, the metadata of the file system includes file metadata and directory metadata, and the directory metadata includes directory access metadata and directory content metadata;

[0179] The key corresponding to the file metadata is the name of the file metadata and the identifier of the parent node of the corresponding node of the file metadata;

[0180] The key corresponding to the directory access metadata is the name of the directory access metadata and the identifier of the parent node of the corresponding node of the directory access metadata;

[0181] The key corresponding to the directory content metadata is the identifier of the node corresponding to the directory content metadata.
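The key selection and the hash-based placement can be sketched in Python as follows. The tuple encoding of the keys, the use of SHA-256 as the hash function, the bucket count, and the function names are assumptions for illustration; this application does not specify them.

```python
import hashlib


def metadata_key(kind: str, name=None, parent_id=None, node_id=None):
    if kind in ("file", "dir_access"):
        # Keyed by the parent node's identifier and the entry's name.
        return (parent_id, name)
    if kind == "dir_content":
        # Keyed by the directory node's own identifier.
        return (node_id,)
    raise ValueError(f"unknown metadata kind: {kind}")


def storage_slot(key, num_buckets: int) -> int:
    # Maps a key to a bucket index of an (assumed) fixed-size hash table.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets


file_key = metadata_key("file", name="F2", parent_id=1)
content_key = metadata_key("dir_content", node_id=1)
slot = storage_slot(file_key, 1024)
```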

[0182] Optionally, when storing the metadata according to the storage location, the storage module 801 is used to:

[0183] In the corresponding storage location, the metadata is stored according to a preset memory layout;

[0184] In the preset memory layout, data fields related to the same metadata operation are arranged in the same aligned persistent memory block; data in the same aligned persistent memory block is processed in the same CPU atomic operation.

[0185] Optionally, the data fields of the metadata include file content modification time and file status change time;

[0186] The processing module 803 is further configured to:

[0187] In response to receiving a target metadata operation request for any metadata, if the metadata update fields corresponding to the target metadata operation request include the file content modification time and the file status change time, only the file content modification time of that metadata is updated;

[0188] Upon receiving a request to obtain the file status change time of that metadata, the maximum of the updated file content modification time and the file status change time currently recorded in that metadata is taken as the file status change time corresponding to the request.

[0189] Optionally, the metadata includes file metadata and directory metadata, the file metadata corresponds to a first memory layout, and the directory metadata corresponds to a second memory layout, the first memory layout and the second memory layout are different;

[0190] In the first memory layout, the file content modification time and the file byte count are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two bytes of the file group identifier are arranged in the same aligned persistent memory block; the upper two bytes of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block.

[0191] In the second memory layout, the file content modification time and the head pointer of the child node linked list are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two bytes of the file group identifier are arranged in the same aligned persistent memory block; the upper two bytes of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block.

[0192] Optionally, the device further includes: a return module 804, for:

[0193] Receive a directory traversal request sent by a client, and return the directory traversal result to the client based on at least one of the following methods:

[0194] While reading the linked list nodes, the partially read results are returned to the client.

[0195] The linked list is split into multiple segments, and the split segments are read in parallel by multiple threads. After merging the read results, the merged directory traversal result is returned to the client.

[0196] Determine whether the directory traversal result corresponding to the directory traversal request is cached in the dynamic random access memory (DRAM). If so, read the directory traversal result corresponding to the directory traversal request from the dynamic random access memory (DRAM) and return the read directory traversal result to the client. The DRAM caches historical directory traversal results within a preset time range before the current time.

[0197] The compressed directory traversal results are returned to the client.

[0198] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0199] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this application according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0200] This application also provides a computer-readable storage medium storing a computer program that can be used to execute the metadata storage method described in the above embodiments.

[0201] This application also provides a computer device. Figure 9 is a schematic structural diagram of the computer device provided in this application. At the hardware level, the computer device includes a processor, an internal bus, a network interface, memory, and non-volatile memory, and may also include other hardware required for business operations. The processor reads the corresponding computer program from the non-volatile memory into memory and then runs it to implement the metadata storage method described in the above embodiments. Of course, in addition to software implementation, this specification does not exclude other implementation methods, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the processing flow is not limited to logic units, but can also be hardware or logic devices.

[0202] The embodiments of the subject matter and functional operation described in this specification can be implemented in the following ways: digital electronic circuits, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or combinations thereof. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by a data processing apparatus or for controlling the operation of a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on artificially generated propagation signals, such as machine-generated electrical, optical, or electromagnetic signals, which are generated to encode information and transmit it to a suitable receiving device for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or combinations thereof.

[0203] The processing and logic flow described in this specification can be executed by one or more programmable computers that execute one or more computer programs to perform corresponding functions by operating on input data and generating output. The processing and logic flow can also be executed by dedicated logic circuitry—such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits), and the device can also be implemented as dedicated logic circuitry.

[0204] Suitable computers for executing computer programs include, for example, general-purpose and / or special-purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit receives instructions and data from read-only memory and / or random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include one or more mass storage devices for storing data, such as disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such mass storage devices to receive data from or transfer data to them, or both. However, a computer is not required to have such devices. Furthermore, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.

[0205] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, such as semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. Processors and memory may be supplemented by or incorporated into dedicated logic circuitry.

[0206] While this specification contains numerous specific implementation details, these should not be construed as limiting the scope of any invention or the scope of the claims, but rather are primarily intended to describe features of specific embodiments of a particular invention. Certain features described in the various embodiments herein may also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment may also be implemented separately in various embodiments or in any suitable sub-combination. Furthermore, while features may function in certain combinations as described above and even initially claimed in this way, one or more features from a claimed combination may be removed from that combination in some cases, and a claimed combination may refer to a sub-combination or a variation thereof.

[0207] Similarly, although the operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring these operations to be performed in the specific order shown or sequentially, or requiring all illustrated operations to be performed to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0208] Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve the desired result. Furthermore, the processes depicted in the drawings are not necessarily shown in a specific order or sequence to achieve the desired result. In some implementations, multitasking and parallel processing may be advantageous.

[0209] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A metadata storage method applied to a distributed file system, characterized in that the method includes:
for any metadata of the file system, storing the metadata in a metadata server, wherein the metadata server is used to store metadata in the file system based on an unordered index structure; the unordered index structure includes a hash table; storing the metadata in the metadata server includes: determining the key-value pair corresponding to the metadata based on the data type of the metadata; performing a hash operation on the key-value pair corresponding to the metadata to determine the storage location of the metadata in the hash table; storing the metadata in the corresponding storage location according to a preset memory layout; wherein, in the preset memory layout, data fields related to the same metadata operation are arranged in the same aligned persistent memory block; data in the same aligned persistent memory block is processed in the same CPU atomic operation;
determining, based on the location information of the first node corresponding to the metadata in the hierarchical directory tree, the second node associated with the first node in the hierarchical directory tree, wherein the hierarchical directory tree is used to describe the logical structure of storing data in the file system based on ordered key values, and the association between the second node and the first node includes the second node being a child node or a sibling node of the first node;
obtaining the storage location information of the associated metadata corresponding to the second node on the metadata server; and
linking, based on the association between the first node and the second node, the storage location information of the associated metadata corresponding to the second node in the metadata server within the metadata.

2. The method according to claim 1, characterized in that the metadata of the file system includes file metadata and directory metadata, and the directory metadata includes directory access metadata and directory content metadata;
the key corresponding to the file metadata is the name of the file metadata and the identifier of the parent node of the corresponding node of the file metadata;
the key corresponding to the directory access metadata is the name of the directory access metadata and the identifier of the parent node of the corresponding node of the directory access metadata;
the key corresponding to the directory content metadata is the identifier of the node corresponding to the directory content metadata.

3. The method according to claim 2, characterized in that the data fields of the metadata include a file content modification time and a file status change time; and the method further comprises: in response to receiving a target metadata operation request for any metadata, if the metadata fields to be updated by the target metadata operation request include both the file content modification time and the file status change time, updating only the file content modification time of the metadata; and, in response to receiving a request to obtain the file status change time of the metadata, taking the larger of the updated file content modification time and the file status change time currently recorded in the metadata as the file status change time returned for the request.
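The timestamp rule of claim 3 can be sketched as follows. The names `Times`, `apply_update`, and `effective_ctime` are hypothetical, and timestamps are plain integers for simplicity.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Times:
    mtime: int  # file content modification time
    ctime: int  # file status change time as recorded in the metadata


def apply_update(meta: Times, fields: Set[str], now: int) -> None:
    """Apply an operation that would update the given timestamp fields."""
    if "mtime" in fields and "ctime" in fields:
        meta.mtime = now  # only one field is written; recorded ctime stays as it was
        return
    if "mtime" in fields:
        meta.mtime = now
    if "ctime" in fields:
        meta.ctime = now


def effective_ctime(meta: Times) -> int:
    """Answer a status-change-time query from the two recorded values."""
    return max(meta.mtime, meta.ctime)
```

When an operation would change both timestamps, only the modification time is written, and the status change time is reconstructed on read as the larger of the two recorded values.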

4. The method according to claim 1, characterized in that the metadata of the file system includes file metadata and directory metadata; the file metadata corresponds to a first memory layout, the directory metadata corresponds to a second memory layout, and the first memory layout and the second memory layout are different; in the first memory layout, the file content modification time and the number of file bytes are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two fields of the file group identifier are arranged in the same aligned persistent memory block; and the upper two fields of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block; in the second memory layout, the file content modification time and the head pointer pointing to the child node linked list are arranged in the same aligned persistent memory block; the mode, the file user identifier, the file status change time, and the lower two fields of the file group identifier are arranged in the same aligned persistent memory block; and the upper two fields of the file group identifier, the file status information, and the creation time are arranged in the same aligned persistent memory block.
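The two layouts of claim 4 can be pictured as three fixed-size blocks per record. The sketch below rests on assumptions the claims do not state: a 16-byte block size, the individual field widths, and the reading of the lower and upper parts of the file group identifier as the low and high 16 bits of a 32-bit value. All names are hypothetical. Python's `struct` module only models the byte layout and does not demonstrate CPU atomicity.

```python
import struct

BLOCK = 16  # assumed size in bytes of one aligned persistent-memory block

# One struct format per block; "<" means little-endian with no implicit padding.
B0_FILE = "<qQ"  # modification time (8) + number of file bytes (8)
B0_DIR = "<qQ"   # modification time (8) + head pointer of child linked list (8)
B1 = "<HIqH"     # mode (2) + user id (4) + status change time (8) + low half of group id (2)
B2 = "<HH4xq"    # high half of group id (2) + status info (2) + padding (4) + creation time (8)

# Every block must fill exactly one aligned unit under these assumed widths.
assert all(struct.calcsize(fmt) == BLOCK for fmt in (B0_FILE, B0_DIR, B1, B2))


def pack_common(mode: int, uid: int, ctime: int, gid: int,
                status: int, created: int) -> bytes:
    """Pack blocks 1 and 2, which both layouts share."""
    low, high = gid & 0xFFFF, (gid >> 16) & 0xFFFF
    return struct.pack(B1, mode, uid, ctime, low) + struct.pack(B2, high, status, created)


def pack_file(mtime: int, size: int, **common: int) -> bytearray:
    """First memory layout: file metadata."""
    return bytearray(struct.pack(B0_FILE, mtime, size) + pack_common(**common))


def pack_dir(mtime: int, child_head: int, **common: int) -> bytearray:
    """Second memory layout: directory metadata."""
    return bytearray(struct.pack(B0_DIR, mtime, child_head) + pack_common(**common))


def unpack_gid(record: bytearray) -> int:
    """Reassemble the group identifier from its two halves."""
    low = struct.unpack_from(B1, record, BLOCK)[3]
    high = struct.unpack_from(B2, record, 2 * BLOCK)[0]
    return (high << 16) | low


def on_write(record: bytearray, mtime: int, size: int) -> None:
    """A data write changes modification time and size; both sit in block 0."""
    struct.pack_into(B0_FILE, record, 0, mtime, size)
```

The point of the grouping is that an operation such as a data write only touches fields in one block, so under the claimed scheme it could be completed with a single atomic store to that block instead of a logged multi-field update.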

5. The method according to claim 4, characterized in that the method further comprises: receiving a directory traversal request sent by a client, and returning the directory traversal result to the client in at least one of the following ways: returning partially read results to the client while the linked list nodes are being read; splitting the linked list into multiple segments, reading the segments in parallel with multiple threads, merging the read results, and returning the merged directory traversal result to the client; determining whether the directory traversal result corresponding to the directory traversal request is cached in dynamic random access memory (DRAM) and, if so, reading that directory traversal result from the DRAM and returning it to the client, wherein the DRAM caches historical directory traversal results from within a preset time range before the current time; and returning a compressed directory traversal result to the client.
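The four traversal options of claim 5 can be sketched as follows. Several details are assumptions: the child linked list is modeled as a dict from slot number to `(name, next slot)`, the segment start slots and lengths are taken as already known, the cache is a plain in-process dict with a time-to-live, and zlib is used as the compressor. All names are hypothetical.

```python
import time
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List, Optional, Tuple

Node = Tuple[str, Optional[int]]  # (entry name, slot of the next sibling)


def stream(nodes: Dict[int, Node], head: Optional[int],
           batch: int = 2) -> Iterator[List[str]]:
    """Yield partial results while the list is still being read."""
    chunk: List[str] = []
    cursor = head
    while cursor is not None:
        name, cursor = nodes[cursor]
        chunk.append(name)
        if len(chunk) == batch:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def read_segment(nodes: Dict[int, Node], start: int, count: int) -> List[str]:
    """Read up to `count` entries beginning at slot `start`."""
    names: List[str] = []
    cursor: Optional[int] = start
    for _ in range(count):
        if cursor is None:
            break
        name, cursor = nodes[cursor]
        names.append(name)
    return names


def parallel_read(nodes: Dict[int, Node],
                  segments: List[Tuple[int, int]]) -> List[str]:
    """Read (start, count) segments on several threads, then merge in order."""
    with ThreadPoolExecutor(max_workers=len(segments) or 1) as pool:
        parts = pool.map(lambda seg: read_segment(nodes, *seg), segments)
        return [name for part in parts for name in part]


class ListingCache:
    """Keeps recent traversal results for a preset time range."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.entries: Dict[int, Tuple[float, List[str]]] = {}

    def put(self, dir_id: int, names: List[str], now: Optional[float] = None) -> None:
        stamp = time.monotonic() if now is None else now
        self.entries[dir_id] = (stamp, names)

    def get(self, dir_id: int, now: Optional[float] = None) -> Optional[List[str]]:
        current = time.monotonic() if now is None else now
        hit = self.entries.get(dir_id)
        if hit is not None and current - hit[0] <= self.ttl:
            return hit[1]
        return None


def compress_listing(names: List[str]) -> bytes:
    """Compress a traversal result before returning it to the client."""
    return zlib.compress("\n".join(names).encode("utf-8"))
```

Because `ThreadPoolExecutor.map` returns results in the order of its inputs, the merged listing keeps the linked-list order even though segments are read concurrently.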

6. A metadata storage device applied to a distributed file system, characterized in that the device comprises: a storage module configured to store any metadata of the file system in a metadata server, wherein the metadata server is used to store metadata of the file system based on an unordered index structure, and the unordered index structure includes a hash table; the storage module being specifically configured to: determine the key-value pair corresponding to the metadata based on the data type of the metadata; perform a hash operation on the key-value pair corresponding to the metadata to determine the storage location of the metadata in the hash table; and store the metadata in the corresponding storage location according to a preset memory layout, wherein, in the preset memory layout, data fields related to the same metadata operation are arranged in the same aligned persistent memory block, and data in the same aligned persistent memory block is processed in the same CPU atomic operation; a determination module configured to determine, based on the location information in a hierarchical directory tree of a first node corresponding to the metadata, a second node associated with the first node in the hierarchical directory tree, wherein the hierarchical directory tree is used to describe the logical structure in which the file system stores data based on ordered key values, and the association between the second node and the first node includes the second node being a child node or a sibling node of the first node; and a processing module configured to obtain the storage location information, in the metadata server, of the associated metadata corresponding to the second node, and to link, within the metadata and based on the association between the first node and the second node, the storage location information of the associated metadata corresponding to the second node in the metadata server.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, the steps of the method according to any one of claims 1-5 are implemented.

8. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor performs the steps of the method according to any one of claims 1-5.