A method and apparatus for file system access pattern emulation and parameter tuning

By recording file system access patterns at the customer's site and simulating and testing them at the R&D center, the problem of confidentiality leakage in file system parameter optimization was solved, and remote and efficient file system performance optimization was achieved.

CN117312236B · Active · Publication Date: 2026-06-23 · NANJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF POSTS & TELECOMM
Filing Date
2023-10-10
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately acquire and simulate on-site business load characteristics to optimize file system parameters without disclosing customer business secrets, resulting in the file system's performance not being fully utilized after deployment.

Method used

The file system access pattern characteristics are recorded at the customer's site and sent to the R&D center for simulation. Multiple simulated services are generated for performance testing. The service that is closest to the on-site performance is selected for parameter tuning, thereby achieving remote optimization of file system parameters.

Benefits of technology

It enables precise characterization of business workloads, optimization of file system performance, and provision of higher-quality products and services without leaking customer data.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN117312236B_ABST
Patent Text Reader

Abstract

The application belongs to the technical field of storage systems and discloses a method and device for file system access mode simulation and parameter optimization. In the customer field environment, a real business accesses the file system; the file system records the characteristics of the access mode and sends them to the research and development center environment, where restoration and simulation are completed, so that research and development experts can accurately acquire and characterize the on-site business load characteristics and complete the optimization of the file system. The optimization is completed without leaking the customer's business secrets, so that system performance is fully exploited, targeted optimization and improvement can be carried out on the file system according to the business load characteristics, and higher-quality products and services are provided.

Description

Technical Field

[0001] This invention belongs to the field of storage system technology, specifically relating to a method and apparatus for simulating file system access modes and optimizing parameters.

Background Technology

[0002] Applications (computer programs designed to perform one or more specific tasks, including but not limited to Android, Apple, Windows, Linux, and MapReduce parallel computing programs, also referred to as business applications) rely heavily on file read and write access, and the file system is crucial for supporting these operations. The performance of file system access is directly related to its parameter settings. To achieve better read and write performance, file system experts often need to perform targeted parameter tuning based on the application's characteristics. Likewise, development experts need to understand the specific business characteristics of the application environment to better optimize and improve the application after deployment. However, file system development and application are at different stages and are inherently separated by physical location: development is typically done in the R&D center, while the file system is usually used on-site at the customer's site. The file system needs targeted tuning based on the on-site business access patterns to fully utilize its capabilities (read and write performance). Currently, the analysis and tuning of existing systems are usually done by R&D experts on-site, which is time-consuming, labor-intensive, and inefficient. While some technical tasks can be performed remotely by R&D experts logging into the customer's on-site host, this approach faces objective difficulties due to business confidentiality concerns. First, in scenarios involving national security and user privacy, allowing R&D experts direct access to application file read/write operations could lead to leaks and potentially serious, incalculable consequences. Furthermore, the varying abilities and experience of R&D experts make it difficult to accurately grasp business load characteristics and provide optimal parameter tuning results based solely on human experience. Second, once the file system is deployed on-site, the R&D center struggles to accurately acquire and characterize the on-site business load characteristics for targeted optimization and improvement of the file system.

Summary of the Invention

[0003] To address the aforementioned technical issues, this invention provides a method and apparatus for simulating and optimizing file system access patterns. Without disclosing customer business confidentiality, it records only the application's access patterns to the file system and performs restoration and simulation at the R&D center. This allows R&D experts to accurately acquire and characterize the business load characteristics of the simulated environment, better optimize the file system to fully leverage system performance, and perform targeted optimization and improvement of the file system based on business load characteristics, providing higher quality products and services.

[0004] The method for simulating file system access modes and optimizing parameters according to the present invention includes the following steps:

[0005] Step 1: In the customer's on-site environment, perform real business access to the file system, including reading files, writing files, deleting files, reading directories, creating directories, deleting directories, and modifying file names;

[0006] Step 2: The file system records the characteristics of access patterns hourly, including data status, access status, and performance status.

[0007] Step 3: Send the characteristics of the recorded access patterns to the R&D center environment, and generate N simulated services based on these characteristics;

[0008] Step 4: Execute the simulated services one by one and record their performance. Then, compare the performance of the N simulated services with the performance of the access modes recorded in Step 2, and select the service whose performance is closest to that of the service sent from the actual site as the final simulated service.

[0009] Step 5: R&D experts optimize the parameters of the R&D center's environment file system for simulated business operations;

[0010] Step 6: Send the new parameters to the file system of the customer's field environment and reset the various parameters of the file system.

[0011] Furthermore, the data details of the access pattern characteristics include:

[0012] Directory information: directory tree level, total number of directories, number of files in the largest directory, number of empty directories;

[0013] File details: total number of files, maximum file size, minimum file size, number of empty files, size distribution percentage, average file size;

[0014] Other factors: number of nodes, number of hard drives, number of replicas, overall storage space utilization, and node data balance.

[0015] Furthermore, the access status of the access pattern characteristics includes:

[0016] The top ten most accessed files by size and access frequency, LRU queue length and space occupied, number of file reads, number of file writes, number of file deletions, number of directory creations, number of directory reads, number of directory deletions, number of other operations, and the size of the longest sequentially accessed data.

[0017] Furthermore, the performance status of the access pattern characteristics includes: average latency, bandwidth, and IOPS during the statistical period, which is measured in hours.

[0018] Furthermore, in step 4, the basis for determining the simulated service whose performance most closely matches that sent from the field is as follows:

[0019]

[0020] Where a is the proportion of the latency ratio in the overall optimization objective, b is the proportion of the bandwidth ratio in the overall optimization objective, and c is the proportion of the IOPS ratio in the overall optimization objective; a+b+c=1, which means that the overall optimization objective is composed of these three proportions. The three terms weighted by a, b, and c are the latency ratio, the bandwidth ratio, and the IOPS ratio, respectively. F represents the performance recorded on-site by the customer, and k represents the k-th generated service, with k ranging from 1 to N.
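The formula itself is not reproduced in this text, so the following is only a minimal sketch of one plausible form: a weighted sum of relative deviations between the k-th simulated service and the field record F. The names `Perf`, `proximity_index`, and `closest_service`, the relative-deviation form of each ratio, and the equal default weights are all assumptions made for illustration and are not the patent's definition.

```python
from typing import NamedTuple


class Perf(NamedTuple):
    """Hourly performance summary (hypothetical container)."""
    latency_ms: float
    bandwidth_gbps: float
    iops: float


def proximity_index(field: Perf, sim: Perf,
                    a: float = 1 / 3, b: float = 1 / 3, c: float = 1 / 3) -> float:
    """Weighted relative deviation of a simulated service from the field record.

    ASSUMPTION: each ratio is |sim - field| / field. The patent's actual
    ratio definitions are not available in this text.
    """
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights a, b, c must sum to 1")
    dev_latency = abs(sim.latency_ms - field.latency_ms) / field.latency_ms
    dev_bandwidth = abs(sim.bandwidth_gbps - field.bandwidth_gbps) / field.bandwidth_gbps
    dev_iops = abs(sim.iops - field.iops) / field.iops
    return a * dev_latency + b * dev_bandwidth + c * dev_iops


def closest_service(field: Perf, sims: list[Perf]) -> int:
    """Return k (1-based) of the simulated service with the smallest index."""
    scores = [proximity_index(field, s) for s in sims]
    return scores.index(min(scores)) + 1
```

Because the ratio form is assumed, values computed this way should not be expected to reproduce the indices reported in the worked example of the description.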

[0021] An apparatus for simulating file system access modes and tuning parameters, used to implement the above method, comprising:

[0022] The actual business modules in the customer's on-site environment are the objects to be simulated and optimized.

[0023] The customer site environment file system module is used to record the characteristics of the actual business modules in the customer site environment, and reset the various parameters of the file system based on the parameter tuning results.

[0024] The R&D center environment simulation business module simulates business based on the recorded characteristics of real business in the customer's on-site environment, making it as close as possible to the real business in the customer's on-site environment.

[0025] The R&D center environment file system module is a mirror image of the customer's on-site environment file system. After R&D experts make targeted adjustments to various parameters of the file system for the R&D center environment simulation business, they send the adjustment results to the customer's on-site environment file system through the transmission module.

[0026] The transmission module sends the recorded characteristics of the actual business in the customer's field environment to the R&D center environment; and sends the parameter adjustment results of the R&D center experts to the customer's field environment file system module.

[0027] The beneficial effects of this invention are as follows: The system and method described in this invention achieve accurate characterization of the user's on-site application access patterns without exposing (or copying) user on-site data or recording complete access logs (detailed file and directory access times and operations), and minimize the amount of data transmitted remotely (from the customer's on-site environment to the R&D center environment). Based on the data and access situation, this invention generates multiple services, performs performance tests on each of them, and selects the service whose performance is closest to that sent from the site as the final simulated service. This helps to fully utilize the performance of the on-site file system and allows for targeted optimization and improvement of the file system based on business load characteristics, providing better products and services.

Attached Figure Description

[0028] Figure 1 is a structural diagram of the device described in this invention;

[0029] Figure 2 is a flowchart of the method described in this invention;

[0030] Figure 3 is a schematic diagram of the directory tree structure for the first group of simulated services;

[0031] Figure 4 is a schematic diagram of the directory tree structure for the fourth group of simulated services;

[0032] Figure 5 is a schematic diagram illustrating the access mode recording and recovery described in this invention.

Detailed Implementation

[0033] To make the content of this invention easier to understand, the invention will be further described in detail below with reference to specific embodiments and accompanying drawings.

[0034] The application environment of this invention comprises two parts: a customer site environment and a research and development center environment. The hardware and software configurations of these two environments are identical, especially the file system; data can be transferred between the two environments via network (or email); research and development personnel operate in the research and development center environment.

[0035] The present invention provides a method for simulating file system access modes and optimizing parameters which, as shown in Figure 2, includes the following steps:

[0036] Step 1: In the customer's on-site environment, perform real business access to the file system, including file system operations such as reading files, writing files, deleting files, reading directories, creating directories, deleting directories, and modifying file names;

[0037] Step 2: During Step 1, the file system records the characteristics of the access patterns hourly. These access pattern characteristics include:

[0038] 1) Data situation:

[0039] Directory information: directory tree level, total number of directories, number of files in the largest directory, number of empty directories;

[0040] File details: total number of files, maximum file size, minimum file size, number of empty files, size distribution percentage (4KB and below, 4KB to 1MB, 1MB to 4MB, 4MB to 1GB, 1GB and above), average file size;

[0041] Other factors (optional): number of nodes, number of hard disks, number of replicas, overall storage space utilization, and node data balance (variance of storage space utilization for each node);

[0042] 2) Access status:

[0043] The top ten most accessed files by size and access frequency, LRU queue length and space occupied, number of file reads, number of file writes, number of file deletions, number of directory creations, number of directory reads, number of directory deletions, number of other operations, and the longest sequential access data size (MB);

[0044] Access statistics do not refer to "which specific files and which sections were accessed", but rather to the statistics and summarization of accesses (information abstraction) to reflect the characteristics of the accesses;

[0045] 3) Performance:

[0046] The average latency, bandwidth, and IOPS during the statistical period are calculated in hours.
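The data-status features listed above are aggregate statistics only, which a recorder could derive by walking the directory tree without reading file contents. The following is a minimal sketch under stated assumptions: the function names, the dictionary keys, counting the root as a directory and as level 1, and the handling of sizes that fall exactly on a bucket boundary are all hypothetical choices, not taken from the patent.

```python
import os
from statistics import mean

KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3
# Upper bounds of the size buckets named in the text. ASSUMPTION: a size
# exactly on a boundary is counted in the lower bucket.
BUCKETS = [("<=4KB", 4 * KB), ("4KB-1MB", MB), ("1MB-4MB", 4 * MB),
           ("4MB-1GB", GB), (">=1GB", float("inf"))]


def size_distribution(sizes):
    """Return the percentage of files that fall into each size bucket."""
    counts = {name: 0 for name, _ in BUCKETS}
    for size in sizes:
        for name, upper in BUCKETS:
            if size <= upper:
                counts[name] += 1
                break
    total = len(sizes) or 1  # avoid division by zero for an empty tree
    return {name: 100.0 * n / total for name, n in counts.items()}


def summarize_tree(root):
    """Collect content-free summary statistics for the tree under `root`."""
    root = os.path.abspath(root)
    total_dirs = empty_dirs = max_files_in_dir = max_depth = 0
    sizes = []
    for path, dirnames, filenames in os.walk(root):
        total_dirs += 1  # ASSUMPTION: the root itself is counted
        depth = 0 if path == root else os.path.relpath(path, root).count(os.sep) + 1
        max_depth = max(max_depth, depth)
        if not dirnames and not filenames:
            empty_dirs += 1
        max_files_in_dir = max(max_files_in_dir, len(filenames))
        sizes.extend(os.path.getsize(os.path.join(path, f)) for f in filenames)
    return {
        "tree_levels": max_depth + 1,  # ASSUMPTION: root is level 1
        "total_dirs": total_dirs,
        "max_files_in_dir": max_files_in_dir,
        "empty_dirs": empty_dirs,
        "total_files": len(sizes),
        "max_file_size": max(sizes, default=0),
        "min_file_size": min(sizes, default=0),
        "empty_files": sum(1 for s in sizes if s == 0),
        "size_distribution": size_distribution(sizes),
        "avg_file_size": mean(sizes) if sizes else 0,
    }
```

Only counts, sizes, and percentages leave this function; no file name or file content is included in the result, which matches the information-abstraction intent described in paragraph [0044].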

[0047] Step 3: Send the recorded access pattern characteristics to the R&D center environment via the transmission module, and generate N simulated services based on them, as shown in Figure 5;

[0048] Step 4: Execute the simulated business operations one by one, performing read, write, and delete accesses to the file system, and record the performance. Then, compare the performance of the N simulated business operations with the performance of the access patterns recorded in Step 2, using the following formula as the basis for calculation:

[0049]

[0050] The service whose performance is closest to that of the service sent from the field will be used as the final simulated service.

[0051] Step 5: R&D experts optimize the file system parameters for the simulated business, including but not limited to: number of replicas, page cache size, upper limit of dirty data cache threshold, snapshot switch and snapshot mode, SSD size, cache replacement method, capacity balancing threshold, background data migration speed limit, load balancing threshold, hard disk SMART frequency, node heartbeat detection frequency, asynchronous update time range threshold, size of a single object, upper limit of the number of threads in the thread pool, timeout retry threshold, storage dedicated CPU threshold, network dedicated CPU threshold, number of files in the directory threshold, directory level threshold, and filename length threshold.

[0052] Step 6: Send the new parameters to the customer's site environment via the transmission module and reset the parameters of the file system.

[0053] The present invention provides a device for simulating and optimizing file system access modes which, as shown in Figure 1, includes:

[0054] The customer's actual business module in the field environment. This module and its data are not allowed to be leaked. Only its characteristics can be recorded (that is, the information itself is confidential and cannot be transmitted to the outside, but the characteristics of the file load are allowed to be recorded).

[0055] The customer's on-site environment file system module receives the parameter tuning results through the transmission module and resets various parameters of the file system, including but not limited to: number of replicas, page cache size, upper limit of cache dirty data threshold, snapshot switch and snapshot mode, SSD size, cache replacement method, capacity balancing threshold, background data migration rate limit, load balancing threshold, hard disk SMART frequency, node heartbeat detection frequency, asynchronous update time range threshold, size of a single object, upper limit of the number of threads in the thread pool, timeout retry threshold, storage dedicated CPU threshold, network dedicated CPU threshold, number of files in the directory threshold, directory level threshold, and filename length threshold.

[0056] The R&D center environment simulation service module simulates (generates) services based on the recorded characteristics of real services in the customer's on-site environment, making them as similar as possible to the real services in the customer's on-site environment. Information compression (characterization) inevitably leads to information loss. Therefore, the services simulated in the R&D center environment can hardly be completely identical to the real services in the customer's on-site environment; they can only be as similar as possible. To compensate for information loss, the R&D center environment will simulate multiple services (i.e., service 1, service 2, ..., service N), and perform performance tests on each of them. The service whose performance is closest to that sent from the on-site environment will be used as the final simulated service.

[0057] The R&D center environment file system module is a mirror image of the customer's on-site environment file system. After R&D experts make targeted adjustments to various parameters of the file system for the R&D center environment simulation business, they send the adjustment results to the customer's on-site environment file system through the transmission module.

[0058] The transmission module sends the recorded characteristics of the actual business in the customer's field environment to the R&D center environment; and sends the parameter tuning results of the R&D center experts to the file system of the customer's field environment.

[0059] The invention will now be described with reference to examples.

[0060] In the system section:

[0061] The client's on-site environment is a government big data scenario involving a large amount of private personal data, such as marital status, income, ID card number, hotel check-in records, and flight/train ticket travel records. Therefore, it is required that the data not leave the client's on-site environment.

[0062] The file system deployed in the customer's on-site environment is the CephFS distributed file system, version v13.2.5. It is configured with 10 nodes, each node is configured with 4 1TB Seagate hard drives (model ST1000DM010) and 3 replicas.

[0063] Between 10:00 AM and 11:00 AM, the access to the file system by applications in this government big data scenario was recorded (characterized and described), including:

[0064] 1) The data record is as follows: 3-13-14-1-28-1024-2-0-11-50-29-10-0-43177. The above numbers mean: at 10:00 AM, the file system has a directory tree level of 3, a total number of directories of 13, a maximum number of files in the largest directory of 14, a number of empty directories of 1, a total number of files of 28, a maximum file size of 1024MB, a minimum file size of 2KB, a number of empty files of 0, a file size distribution of 4KB and below (11%), 4KB to 1MB (50%), 1MB to 4MB (29%), 4MB to 1GB (10%), and 1GB and above (0%), with an average file size of 43177KB.
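To illustrate how the R&D center could turn this hyphen-separated record back into named features, here is a minimal decoding sketch. The field order follows the explanation above; the field names and the function `parse_data_record` are hypothetical, since the patent only gives the record string and its meaning.

```python
# Field names are illustrative; the order follows paragraph [0064].
DATA_FIELDS = [
    "tree_levels", "total_dirs", "max_files_in_dir", "empty_dirs",
    "total_files", "max_file_size_mb", "min_file_size_kb", "empty_files",
    "pct_le_4kb", "pct_4kb_1mb", "pct_1mb_4mb", "pct_4mb_1gb", "pct_ge_1gb",
    "avg_file_size_kb",
]


def parse_data_record(record: str) -> dict[str, int]:
    """Split a data-status record into named integer fields."""
    values = [int(v) for v in record.split("-")]
    if len(values) != len(DATA_FIELDS):
        raise ValueError(
            f"expected {len(DATA_FIELDS)} fields, got {len(values)}")
    return dict(zip(DATA_FIELDS, values))


features = parse_data_record("3-13-14-1-28-1024-2-0-11-50-29-10-0-43177")
# The five size-distribution percentages should add up to 100.
pct_total = sum(v for k, v in features.items() if k.startswith("pct_"))
```

A fixed field order keeps the record compact, at the cost that both sites must agree on the schema in advance.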

[0065] 2) Access records are as follows:

[0066] 5-99-125-54-444-52-55055-32-2-22-1024-15-87878-14-14-11-145-10-154-8-1024-10-2154-9954-0-0-19566-0-1-10.

[0067] The numbers above represent the following: The file with the highest number of accesses is 5KB (accessed 99 times); the second most accessed file is 125KB (accessed 54 times); the third most accessed file is 444KB (accessed 52 times); the fourth most accessed file is 55055KB (accessed 32 times); the fifth most accessed file is 2KB (accessed 22 times); the sixth most accessed file is 1024KB (accessed 15 times); the seventh most accessed file is 87878KB (accessed 14 times); the eighth most accessed file is 14KB (accessed 11 times); the ninth most accessed file is 145KB (accessed 10 times); and the tenth most accessed file is 154KB (accessed 8 times). The LRU queue length is 1024 and the occupied space is 10MB. The number of file reads is 215, file writes are 995, file deletions are 0, directory creations are 0, directory reads are 196, directory deletions are 0, other operations are 1, and the longest sequentially accessed data size is 10MB.

[0068] 3) The performance record is: 101-1.1-9903; the above numbers mean: average latency is 101ms, average bandwidth is 1.1GB / s, and average IOPS is 9903.
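The performance record can be encoded and decoded in the same hyphen-separated style. The sketch below assumes latency in milliseconds, bandwidth in GB/s, and an integer IOPS count, as stated above; the function names are hypothetical, and the format only works for non-negative values because the hyphen doubles as the separator.

```python
def encode_perf(latency_ms, bandwidth_gbps, iops) -> str:
    """Join the three hourly averages into one record string."""
    return f"{latency_ms}-{bandwidth_gbps}-{iops}"


def decode_perf(record: str) -> tuple[float, float, int]:
    """Split a performance record into (latency_ms, bandwidth_gbps, iops)."""
    parts = record.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}")
    return float(parts[0]), float(parts[1]), int(parts[2])
```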

[0069] The aforementioned record was sent to the research and development center. Since the sent content was merely a string of numbers and unrelated to any privacy data, it did not constitute a privacy breach.

[0070] At the R&D center, four sets of simulated business scenarios were generated:

[0071] The first group of simulated services has the directory tree structure shown in Figure 3, where the number in each box represents the number of files in that directory. It can be seen that this directory tree is balanced.

[0072] After analysis and experimentation, the sizes of the 28 files in this group are shown in Table 1.

[0073] Table 1 File Size

[0074]

[0075]

[0076] The size distribution of the 28 files shown in Table 1 conforms to the following: 11% are 4KB or less, 50% are 4KB to 1MB, 29% are 1MB to 4MB, 10% are 4MB to 1GB, and 0% are 1GB or more. The average file size is 43177KB. The content of the files is all zeros.
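Since the simulated files contain only zeros, they can be produced from a list of sizes alone. The following is a minimal sketch of that step; the helper names, the file naming scheme, and the chunk size are assumptions, and the sizes passed in would be whatever list the R&D center derives from the recorded distribution (the values of Table 1 are not reproduced in this text).

```python
import os


def create_zero_file(path: str, size_bytes: int, chunk: int = 1 << 20) -> None:
    """Write `size_bytes` zero bytes to `path` in chunks of at most `chunk`."""
    zeros = bytes(chunk)
    with open(path, "wb") as fh:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            fh.write(zeros[:n])
            remaining -= n


def materialize(sizes: list[int], directory: str) -> list[str]:
    """Create one zero-filled file per entry in `sizes` and return the paths."""
    paths = []
    for i, size in enumerate(sizes):
        path = os.path.join(directory, f"file_{i:03d}.bin")  # hypothetical naming
        create_zero_file(path, size)
        paths.append(path)
    return paths
```

Writing explicit zero bytes, rather than extending the file with a truncate call, avoids depending on platform-specific behavior for the contents of the extended region.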

[0077] These files are randomly distributed in the directory tree of Figure 3.

[0078] The files in the above directory structure were accessed according to the access record. The average latency for the first group was 109ms, the average bandwidth was 1.3GB/s, and the average IOPS was 10003. The performance proximity index calculated using Formula 1 was 0.13.

[0079] The second group of simulated services: The directory tree structure remains as shown in Figure 3, and the file sizes remain as shown in Table 1. The difference is that these files are distributed in the directories in ascending order of size, meaning that the deeper the directory level, the larger the file size.

[0080] The files in the above directory structure were accessed according to the access record. The average latency for the second group was 199ms, the average bandwidth was 3.2GB/s, and the average IOPS was 9903. The performance proximity index calculated using Formula 1 was 0.55.

[0081] The third group of simulated services: The directory tree structure remains as shown in Figure 3, and the file sizes remain as shown in Table 1. The difference is that these files are distributed in the directories in descending order of size, meaning that the deeper the directory level, the smaller the file size.

[0082] The files in the above directory structure were accessed according to the access record. The average latency for the third group was 159ms, the average bandwidth was 2.2GB/s, and the average IOPS was 503. The performance proximity index calculated using Formula 1 was 2.01.
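The first three groups differ only in how the same files are placed into the same directory tree: randomly, larger files deeper, or smaller files deeper. A minimal sketch of these three placement strategies follows. The function name `place_files`, the `(path, depth, capacity)` description of each directory, and the fixed random seed are assumptions for illustration; the capacities would correspond to the per-directory file counts shown in the boxes of Figure 3.

```python
import random


def place_files(sizes, dirs, strategy, seed=0):
    """Assign file sizes to directories.

    dirs: list of (path, depth, capacity) tuples (hypothetical layout input).
    strategy: "random", "ascending" (deeper = larger), or
              "descending" (deeper = smaller).
    Returns {path: [sizes placed in that directory]}.
    """
    if sum(cap for _, _, cap in dirs) != len(sizes):
        raise ValueError("directory capacities must add up to the file count")
    # One slot per file, ordered from the shallowest to the deepest directory.
    slots = [path
             for path, _, cap in sorted(dirs, key=lambda d: d[1])
             for _ in range(cap)]
    if strategy == "random":
        ordered = list(sizes)
        random.Random(seed).shuffle(ordered)
    elif strategy == "ascending":
        ordered = sorted(sizes)
    elif strategy == "descending":
        ordered = sorted(sizes, reverse=True)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    layout = {path: [] for path, _, _ in dirs}
    for path, size in zip(slots, ordered):
        layout[path].append(size)
    return layout
```

Keeping the tree and the file sizes fixed while varying only the placement isolates the effect of layout on latency, bandwidth, and IOPS, which is what distinguishes the three groups.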

[0083] The fourth group of simulated services: The directory tree structure is as shown in Figure 4, and the file sizes remain as shown in Table 2. These files are randomly distributed in the directory tree of Figure 4.

[0084] The files in the above directory structure were accessed according to the access record. The fourth group recorded an average latency of 79ms, an average bandwidth of 2.2GB/s, and an average IOPS of 8803. The performance proximity index calculated using Formula 1 was 0.33.

[0085] Comparing the performance proximity indices of the four groups, the first group has the smallest index; therefore, the first group is selected as the final simulated service.
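The selection in this example amounts to taking the group with the minimum index. As a small worked illustration using the four index values reported above (the function name `select_final` is hypothetical):

```python
def select_final(indices: dict[int, float]) -> int:
    """Return the group number with the smallest performance proximity index."""
    return min(indices, key=indices.get)


# Index values as reported for groups 1 to 4 in paragraphs [0078] to [0084].
reported = {1: 0.13, 2: 0.55, 3: 2.01, 4: 0.33}
final_group = select_final(reported)
```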

[0086] The R&D experts optimized the file system access parameters for this first set of services. The adjusted parameters were: page cache size reduced to 512MB, dirty data cache threshold increased to 10MB, background data migration speed limited to 2.5GB / s, and hard drive SMART frequency reduced to once per day. Other parameters remained unchanged.

[0087] The adjusted parameters were sent to the customer's site, and the configuration of the customer's on-site file system was adjusted accordingly.

[0088] As can be seen, no data leakage related to customer site documents occurred during the above process. R&D experts can accurately characterize and simulate customer site business characteristics solely within the R&D center, enabling precise optimization and tuning. All parameter tuning can be completed at the R&D center, ensuring reliable and accurate results that enhance file system performance. Furthermore, targeted optimizations and improvements can be made to the file system based on business load characteristics, providing better products and services.

[0089] The above description is merely a preferred embodiment of the present invention and is not intended to further limit the present invention. All equivalent changes made based on the description and drawings of the present invention are within the protection scope of the present invention.

Claims

1. A method for simulating file system access modes and optimizing parameters, characterized in that it includes the following steps: Step 1: In the customer's on-site environment, perform real business access to the file system, including reading files, writing files, deleting files, reading directories, creating directories, deleting directories, and modifying file names; Step 2: The file system records the characteristics of access patterns hourly, including data status, access status, and performance status; Step 3: Send the characteristics of the recorded access patterns to the R&D center environment, and generate N simulated services based on these characteristics; Step 4: Execute the simulated services one by one and record their performance, then compare the performance of the N simulated services with the performance of the access modes recorded in Step 2, and select the service whose performance is closest to that of the service sent from the actual site as the final simulated service; Step 5: R&D experts optimize the parameters of the R&D center's environment file system for simulated business operations; Step 6: Send the new parameters to the file system of the customer's field environment and reset the various parameters of the file system; the data details of the access pattern characteristics include: directory information: directory tree level, total number of directories, number of files in the largest directory, number of empty directories; file details: total number of files, maximum file size, minimum file size, number of empty files, size distribution percentage, average file size; other factors: number of nodes, number of hard drives, number of replicas, overall storage space utilization, and node data balance.

2. The method for simulating file system access modes and optimizing parameters according to claim 1, characterized in that the access status of the access pattern characteristics includes: the top ten most accessed files by size and access frequency, LRU queue length and space occupied, number of file reads, number of file writes, number of file deletions, number of directory creations, number of directory reads, number of directory deletions, number of other operations, and the size of the longest sequentially accessed data.

3. The method for simulating file system access modes and optimizing parameters according to claim 1, characterized in that the performance status of the access pattern characteristics includes: average latency, bandwidth, and IOPS during the statistical period, which is measured in hours.

4. The method for simulating and optimizing file system access modes according to claim 1, characterized in that, in step 4, the simulated service whose performance most closely matches that of the service sent from the field is determined by a formula, where a is the proportion of the latency ratio in the overall optimization objective, b is the proportion of the bandwidth ratio in the overall optimization objective, and c is the proportion of the IOPS ratio in the overall optimization objective; a+b+c=1, which means that the overall optimization objective is composed of these three proportions; the three terms weighted by a, b, and c are the latency ratio, the bandwidth ratio, and the IOPS ratio, respectively; F represents the performance recorded on-site by the customer, and k represents the k-th generated service, with k ranging from 1 to N.

5. An apparatus for simulating and optimizing file system access modes, characterized in that the apparatus is used to implement the method for simulating and optimizing file system access modes according to any one of claims 1-4, and comprises: an actual business module in the customer's on-site environment, which is the object to be simulated and optimized; a customer site environment file system module, which is used to record the characteristics of the actual business module in the customer site environment and to reset the various parameters of the file system based on the parameter tuning results; an R&D center environment simulation business module, which simulates business based on the recorded characteristics of the real business in the customer's on-site environment, making it as close as possible to the real business in the customer's on-site environment; an R&D center environment file system module, which is a mirror image of the customer's on-site environment file system, wherein after R&D experts make targeted adjustments to various parameters of the file system for the R&D center environment simulation business, the adjustment results are sent to the customer's on-site environment file system through the transmission module; and a transmission module, which sends the recorded characteristics of the actual business in the customer's field environment to the R&D center environment, and sends the parameter adjustment results of the R&D center experts to the customer's field environment file system module.