NUMA-aware optimization system and method for NVM key-value databases

By creating a proxy thread pool on each NUMA node for the NVM key-value database and combining it with monitoring and scheduling strategies, the invention addresses the performance degradation caused by cross-node (remote) access and high-concurrency access to an NVM database under the NUMA architecture, balancing system performance against resource overhead and improving system availability.

CN117742613B · Active · Publication Date: 2026-06-23 · SHANGHAI JIAOTONG UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2023-12-29
Publication Date: 2026-06-23

Application Information

Patent Timeline

29 Dec 2023

Application

23 Jun 2026

Publication

CN117742613B

IPC: G06F3/06; G06F9/50; G06F9/54

AI Tagging

Application Domain

Input/output to record carriers; Resource allocation

Technology Topics

Data pack; Data operations

AI Technical Summary

Technical Problem

Existing NVM key-value databases suffer performance degradation under the NUMA architecture because of cross-node (remote) access and high-concurrency access, and existing solutions fail to address communication latency and excessive resource consumption in high-load scenarios.

Method used

A proxy module creates a proxy thread pool for each NUMA node, and a consistent hashing algorithm routes each request to the node that owns its data so that it is served as a local access. Combined with the monitoring and scheduling modules, the proxy threads are dynamically scheduled to optimize system performance and resource utilization; the scheduling strategies include load trend prediction, a queue watermark strategy, and a thread spin-wait strategy.

Benefits of technology

It addresses the performance degradation caused by cross-node (remote) access and high-concurrency access to an NVM database under the NUMA architecture, balances system performance against resource overhead, and improves system availability.

Abstract

The application provides a NUMA-aware optimization system and method for an NVM key-value database, comprising: a proxy module, which creates a proxy thread pool for each NUMA node, the thread pool converting data operation requests sent from any node to the local node into local accesses through a proxy mode; a monitoring module, which monitors the running state of the system and collects running data, the running data comprising running data at the proxy thread level, at the proxy node level, and at the system level; and a scheduling module, which schedules the proxy threads in the proxy module. The application proposes a node multi-proxy mode in which a certain number of proxy threads convert any node's request for a node's local NVM into a local access, thereby solving the performance degradation caused by cross-node access and high-concurrency access to an NVM database under the NUMA architecture.


Description

Technical Field

[0001] This invention relates to the fields of non-volatile memory, NUMA awareness, databases, data structure design, and algorithm design. Specifically, it relates to a NUMA-aware optimization system and method for NVM key-value databases.

Background Technology

[0002] Key-value databases are a type of non-relational database. Unlike relational databases, they store, retrieve, and manage data as key-value pairs. Traditional key-value databases are designed for block storage devices, and data structures optimized for such devices have been proposed, such as the Log-Structured Merge Tree (LSM-tree). An LSM-tree turns a deletion into another kind of write and places it in the same batch as ordinary writes. It exploits the high sequential write performance of block devices by appending data in batches to the end of the storage file, thereby improving performance. However, because block storage devices have low I/O performance, key-value databases built on them remain limited by the performance bottleneck of the storage device.

[0003] Non-volatile memory (NVM) is a new type of storage medium that lies between DRAM and block storage devices. NVM offers read / write speeds close to DRAM while retaining the persistence of block devices. Furthermore, NVM has a larger capacity than DRAM. Therefore, the emergence of NVM provides a new approach to improving the performance of key-value databases. Because NVM has new characteristics such as byte addressing compared to traditional block devices like HDDs and SSDs, key-value databases need to design new data structures to utilize NVM hardware. Several works in this field have been proposed, such as Viper, FlatStore, and MatrixKV.

[0004] Despite its new features, NVM still has shortcomings in read/write bandwidth, remote access, and concurrency. First, NVM still lags behind DRAM in read/write performance: its sequential read and write bandwidth is 63% and 78% lower than that of DRAM, respectively, and its random read and write bandwidth is 85% and 95% lower, respectively. Second, NVM suffers from reduced remote access performance in Non-Uniform Memory Access (NUMA) architectures. Under NUMA, accessing NVM on the node where the thread resides is called "local access," and accessing NVM on other nodes is called "remote access." With two nodes, the read and write bandwidth of local NVM access are 1.2-1.79 times and 1.68-2.53 times those of remote NVM access, respectively. Finally, NVM performance declines under concurrent reads and writes. For single-node NVM, the write bandwidth is fully utilized when the number of concurrent threads reaches 4-6; beyond that point, write performance decreases as more threads are added. The optimal number of concurrent threads for reads is somewhat higher than for writes.

[0005] With the explosive growth of data volume, expanding database capacity requires deploying NVM-based key-value databases across multiple nodes of a NUMA server. This makes it necessary to consider the performance of NVM within a NUMA architecture. Several NUMA-aware NVM key-value database solutions have already been proposed, such as ListDB and PMDB.

[0006] These solutions can be categorized into two types: the first type converts all remote accesses to local accesses using a proxy thread, and the second type eliminates the impact of remote accesses by designing a NUMA-aware data structure to maintain a global data structure. However, these solutions all have certain shortcomings. First, the proxy approach uses only one thread to handle request proxying. In high-load scenarios, the single-threaded solution incurs significant communication latency overhead, and a large number of blocked requests lead to performance degradation. Second, even with a NUMA-aware data structure, performance degradation cannot be avoided when dealing with purely remote accesses.

[0007] In addition, neither design scheme takes into account the impact of high-concurrency access scenarios on NVM performance, nor does it control the concurrency level of NVM.

[0008] Chinese patent document CN111262753A discloses a method, system, terminal, and storage medium for automatically configuring the number of NUMA nodes. The method includes: obtaining CPU information of the current test environment using the `lscpu` command; extracting the number of NUMA nodes from the CPU information using the `grep` command; assigning the number of NUMA nodes to the running parameter `GROUP_COUNT` using a script; and executing the SPEC jbb2015 test based on the `GROUP_COUNT` number. However, this patent document only spares testers the tedious process of manual configuration and still fails to solve the aforementioned problem.

Summary of the Invention

[0009] To address the shortcomings of existing technologies, the purpose of this invention is to provide a NUMA-aware optimization system and method for NVM key-value databases.

[0010] A NUMA-aware optimization system for NVM key-value databases provided by the present invention includes:

[0011] Proxy module: Creates proxy thread pools for each NUMA node; the thread pools convert data operation requests sent from each node to the local node into local access requests through a proxy method;

[0012] Monitoring module: Monitors the system's operating status and collects operating data; the operating data includes operating data at the agent thread level, operating data at the agent node level, and operating data at the system level;

[0013] Scheduling module: Schedules the proxy threads in the proxy module.

[0014] Preferably, the proxy module includes a request sharder based on a consistent hashing algorithm, which forwards requests to the corresponding NUMA nodes, and a proxy thread method that converts requests into locally executed node proxies.

[0015] Preferably, the monitoring module includes a monitoring thread that executes monitoring logic and periodically collects data; a DelegateWorkerData class for storing the state of the agent thread; a NodeStatus class for storing the state of the node; and a DBStatus class for storing the state of the system.

[0016] Preferably, the system records runtime data based on the DelegateWorkerData, NodeStatus, and DBStatus classes and provides a sampling interface; the monitoring module calls the sampling interface to complete data collection.

[0017] Preferably, the scheduling module includes a mechanism for controlling the switching on and off of agent threads and a scheduling strategy; the scheduling strategy adjusts the number of node agent threads and node switching when the system load changes, balancing system performance and resource overhead; the scheduling strategy includes a load trend prediction strategy, a queue water level strategy, and a thread spin-wait strategy.

[0018] Preferably, the load trend prediction strategy includes collecting system monitoring data $S_0, S_1, \ldots, S_n$ from the previous n periods and focusing on an indicator T with values $T_0, T_1, \ldots, T_n$; calculating the average indicator $T_{avg} = \mathrm{Avg}(T_0, T_1, T_2, \ldots, T_n)$; and calculating the change of the average indicator over the historical period relative to the indicator in the first period, $T_{speedup} = (T_{avg} - T_0) / T_0$, which yields the trend of the monitored indicator over the past periods; if $T_{speedup}$ is negative, the indicator is in a downward trend; if $T_{speedup}$ is positive, the indicator is in an upward trend; the number of proxy threads for the node is dynamically adjusted based on the magnitude of change $|T_{speedup}|$;

[0019] The queue watermark strategy includes dividing the queue into watermarks of different heights to adjust node threads; at the beginning of each cycle, checking the remaining message count in the current node's message queue based on the strategy; the message queue includes a highest watermark, a second-highest watermark, a second-lowest watermark, and a lowest watermark; when the remaining request count in all queues of a node is below the second-lowest watermark, one thread is shut down; when the remaining request count in all queues of a node is below the lowest watermark, two threads are shut down; when the remaining request count in all queues of a node is between the highest and lowest watermarks, the system makes no adjustment; when the remaining request count in all queues of a node is between the second-highest and highest watermarks, one thread is added; when the remaining request count in all queues of a node is above the highest watermark, two threads are added.

[0020] The thread spin-wait strategy includes the client calculating the spin time before sending to the proxy thread; the spin time is derived from the ratio of the base key value size to the key value size in the request, and the base spin time is scaled according to this ratio to obtain the request spin time; the proxy thread internally checks whether the current timeout has occurred, and if it has, the thread is closed, otherwise the request is retrieved and executed; after retrieving the request, the stop time is updated according to the spin time in the request, and it is determined whether the current message queue is full, and if it is full, a new thread is woken up.

[0021] Preferably, the node proxy includes a process where, when an operation request is sent to a designated node, the corresponding node proxy receives the request and places it in a buffer message queue, and the proxy thread retrieves the request and performs operation processing; the node proxy includes a proxy thread pool, a circular buffer, and proxy threads.

[0022] Preferably, the proxy thread pool includes maintaining a set of proxy threads for each node during system operation. Upon receiving a request from a specified node, a set of proxy threads corresponding to that node is selected, and special types of requests are diverted to bypass the proxy based on a request separation strategy. For requests that have been proxied, a proxy thread is randomly selected to send the request to a circular buffer bound to the proxy thread. The circular buffer includes a circular queue constructed using an array, and a polling method based on spin locks is used as the communication mechanism between the circular buffer and the proxy threads. The proxy threads include proxy threads that cyclically retrieve requests from the buffer and parse the requests to perform local proxy operations.

[0023] Preferably, the separation strategy of the proxy thread pool includes a read-write separation strategy and a key-value size separation strategy; when the request type is a read operation, the request will bypass the proxy and directly operate the underlying database engine and NVM part; when the request type is a write operation, some requests are processed through the proxy by means of the key-value size separation strategy; in the write operation, requests smaller than the preset key-value pair bypass the proxy and directly operate the underlying database engine and NVM part for processing, while requests larger than the preset key-value pair are processed through the proxy.

[0024] A NUMA-aware optimization method for NVM key-value databases provided by the present invention includes:

[0025] Step S1: Create a proxy thread pool for each NUMA node; the thread pool converts data operation requests sent from each node to the local node into local access through a proxy method.

[0026] Step S2: Monitor the system's operating status and collect operating data; the operating data includes operating data at the agent thread level, operating data at the agent node level, and operating data at the system level;

[0027] Step S3: Schedule the proxy threads in the proxy module.

[0028] Preferably, the proxy module includes a request sharder based on a consistent hashing algorithm, which forwards requests to the corresponding NUMA nodes, and a proxy thread method that converts requests into locally executed node proxies.

[0029] Preferably, the monitoring module includes a monitoring thread that executes monitoring logic and periodically collects data; a DelegateWorkerData class for storing the state of the agent thread; a NodeStatus class for storing the state of the node; and a DBStatus class for storing the state of the system.

[0030] Preferably, the system records runtime data based on the DelegateWorkerData, NodeStatus, and DBStatus classes and provides a sampling interface; the monitoring module calls the sampling interface to complete data collection.

[0031] Preferably, the scheduling module includes a mechanism for controlling the switching on and off of agent threads and a scheduling strategy; the scheduling strategy adjusts the number of node agent threads and node switching when the system load changes, balancing system performance and resource overhead; the scheduling strategy includes a load trend prediction strategy, a queue water level strategy, and a thread spin-wait strategy.

[0032] Preferably, the load trend prediction strategy includes statistically analyzing system monitoring data $S_0, S_1, \ldots, S_n$ over the previous n periods and focusing on an indicator T with values $T_0, T_1, \ldots, T_n$; calculating the average indicator $T_{avg} = \mathrm{Avg}(T_0, T_1, T_2, \ldots, T_n)$; and calculating the change of the average indicator over the historical period relative to the indicator in the first period, $T_{speedup} = (T_{avg} - T_0) / T_0$, which yields the trend of the monitored indicator over the past periods; if $T_{speedup}$ is negative, the indicator is in a downward trend; if $T_{speedup}$ is positive, the indicator is in an upward trend; the number of proxy threads for the node is dynamically adjusted based on the magnitude of change $|T_{speedup}|$;

[0033] The queue watermark strategy includes dividing the queue into watermarks of different heights to adjust node threads; at the beginning of each cycle, checking the remaining message count in the current node's message queue based on the strategy; the message queue includes a highest watermark, a second-highest watermark, a second-lowest watermark, and a lowest watermark; when the remaining request count in all queues of a node is below the second-lowest watermark, one thread is shut down; when the remaining request count in all queues of a node is below the lowest watermark, two threads are shut down; when the remaining request count in all queues of a node is between the highest and lowest watermarks, the system makes no adjustment; when the remaining request count in all queues of a node is between the second-highest and highest watermarks, one thread is added; when the remaining request count in all queues of a node is above the highest watermark, two threads are added.

[0034] The thread spin-wait strategy includes the client calculating the spin time before sending to the proxy thread; the spin time is derived from the ratio of the base key value size to the key value size in the request, and the base spin time is scaled according to this ratio to obtain the request spin time; the proxy thread internally checks whether the current timeout has occurred, and if it has, the thread is closed, otherwise the request is retrieved and executed; after retrieving the request, the stop time is updated according to the spin time in the request, and it is determined whether the current message queue is full, and if it is full, a new thread is woken up.

[0035] Preferably, the node proxy includes a process where, when an operation request is sent to a designated node, the corresponding node proxy receives the request and places it in a buffer message queue, and the proxy thread retrieves the request and performs operation processing; the node proxy includes a proxy thread pool, a circular buffer, and proxy threads.

[0036] Preferably, the proxy thread pool includes maintaining a set of proxy threads for each node during system operation. Upon receiving a request from a specified node, a set of proxy threads corresponding to that node is selected, and special types of requests are diverted to bypass the proxy based on a request separation strategy. For requests that have been proxied, a proxy thread is randomly selected to send the request to a circular buffer bound to the proxy thread. The circular buffer includes a circular queue constructed using an array, and a polling method based on spin locks is used as the communication mechanism between the circular buffer and the proxy threads. The proxy threads include proxy threads that cyclically retrieve requests from the buffer and parse the requests to perform local proxy operations.

[0037] Preferably, the separation strategy of the proxy thread pool includes a read-write separation strategy and a key-value size separation strategy; when the request type is a read operation, the request will bypass the proxy and directly operate the underlying database engine and NVM part; when the request type is a write operation, some requests are processed through the proxy by means of the key-value size separation strategy; in the write operation, requests smaller than the preset key-value pair bypass the proxy and directly operate the underlying database engine and NVM part for processing, while requests larger than the preset key-value pair are processed through the proxy.

[0038] Compared with the prior art, the present invention has the following beneficial effects:

[0039] 1. This invention proposes a node multi-proxy approach in which a certain number of proxy threads convert any node's request for a node's local NVM into a local access, thus solving the performance degradation caused by cross-node access and high-concurrency access to an NVM database under the NUMA architecture.

[0040] 2. This invention solves the problem of excessive system resource consumption by proxy threads by proposing a scheduling strategy based on the relationship between proxy threads and system load, enabling the system to balance performance and resource overhead.

[0041] 3. This invention solves the problem of excessive overhead for some special types of requests during the proxy process by proposing a separation strategy for different types of requests and requests in different locations, thereby improving the availability of the system.

[0042] Other beneficial effects of the present invention are explained in detail through the specific technical features and technical solutions introduced in the specific embodiments. Those skilled in the art should be able to understand the beneficial technical effects that these features and solutions bring.

Attached Figure Description

[0043] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0044] Figure 1 is the system framework diagram provided by this invention.

[0045] Figure 2 is a diagram of the data flow path and structural components of the node agent in an embodiment of the invention.

[0046] Figure 3 is a diagram showing the core components and attributes of the monitoring data in an embodiment of the invention.

[0047] Figure 4 is a schematic diagram of the operation and interaction of the scheduling part in an embodiment of the present invention.

Detailed Implementation

[0048] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.

[0049] Figure 1 shows the overall framework of the NUMA-aware optimization system for NVM key-value databases proposed in this invention. The overall framework consists of three parts: agent design, monitoring design, and scheduling design.

[0050] The agent design section describes the organizational structure of the sharder and the node agents, the monitoring section describes the types of monitoring data and the running members, and the scheduling section describes the relevant strategies. In Figure 1, the upper black arrows indicate the possible paths of database operation requests, while the lower dashed arrows indicate the order of adjustments during overall system operation.

[0051] The design of the agent component is described as follows:

[0052] A proxy thread pool is created for each NUMA node. A sharder using a consistent hashing algorithm forwards requests to the corresponding node proxy on the NUMA node. In the node proxy, the proxy thread pool selects proxy threads from the pool to convert data operation requests sent from each node to the local node into local access.
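The sharding step can be illustrated with a minimal consistent-hashing sketch. The patent names the algorithm but not its parameters, so the class name `ConsistentHashSharder`, the use of virtual nodes, the `vnodes_per_node` value, and the BLAKE2b hash are all assumptions made for illustration only.

```python
import bisect
import hashlib


class ConsistentHashSharder:
    """Maps keys to NUMA node ids with a hash ring (illustrative sketch)."""

    def __init__(self, node_ids, vnodes_per_node=64):
        # vnodes_per_node is an assumed value, not taken from the patent.
        ring = []
        for node_id in node_ids:
            for v in range(vnodes_per_node):
                ring.append((self._hash(f"node-{node_id}-vnode-{v}"), node_id))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(text):
        # Any stable hash works; BLAKE2b is used because it ships with Python.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def node_for(self, key):
        """Return the node owning `key`: the first ring point clockwise of its hash."""
        idx = bisect.bisect_right(self._hashes, self._hash(key))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]


sharder = ConsistentHashSharder(node_ids=[0, 1])
owner = sharder.node_for("user:42")  # request is forwarded to this node's proxy
```

A useful property of this scheme is that adding a node only moves keys onto the new node; keys never move between the existing nodes.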

[0053] The design of the monitoring section is described as follows:

[0054] The system monitors its operational status, including operational data at the agent thread level, agent node level, and system level. The collected data is used for dynamic scheduling of agent threads. The monitoring section includes a monitoring thread that executes the monitoring logic and collects data periodically. Agent thread-level status is stored in the `DelegateWorkerData` class, agent node-level status in the `NodeStatus` class, and system-level status in the `DBStatus` class.

[0055] The design of the scheduling part is described as follows:

[0056] Based on the operational data collected by the monitoring unit, the agent threads in the agent unit are scheduled. The scheduling strategies include: load trend prediction strategy, queue capacity watermark strategy, and agent thread spin-wait strategy. The scheduling unit ensures that system performance is maintained while minimizing system resource overhead during system operation. Based on the results of the scheduling strategies, the scheduling module makes different adjustments to the node agent threads, including: enabling node agents, disabling node agents, increasing the number of node agent threads, and decreasing the number of node agent threads.

[0057] As shown in Figure 2, the node proxy is further subdivided into three parts: the proxy thread pool, the circular buffer, and the proxy thread. When an operation request is sent to a specified node, the corresponding node proxy receives the request and puts it into the buffer message queue, and the proxy thread retrieves the request and performs the operation.

[0058] The proxy thread pool maintains a set of proxy threads for each node during system operation. Upon receiving a request for a specified node, it selects the set of proxy threads corresponding to that node and applies a request separation strategy that diverts special types of requests so that they bypass the proxy. For requests that do go through the proxy, a proxy thread is selected at random and the request is sent to the circular buffer bound to that proxy thread. The separation strategies include read-write separation and key-value size separation. For read operations, the request bypasses the proxy and operates directly on the underlying database engine and NVM. For write operations, the key-value size separation strategy decides which requests go through the proxy: requests whose key-value pairs are smaller than a preset size bypass the proxy and operate directly on the underlying database engine and NVM, while requests whose key-value pairs are larger are processed through the proxy.
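The separation strategy described above can be sketched as a small routing function. The threshold value `KV_SIZE_THRESHOLD`, the `Op` enum, and the function names are hypothetical; the patent only says that a preset key-value size exists, and it does not say how a request exactly equal to that size is treated (this sketch sends it through the proxy).

```python
import random
from enum import Enum


class Op(Enum):
    READ = "read"
    WRITE = "write"


KV_SIZE_THRESHOLD = 512  # bytes; assumed value, the patent gives no number


def should_bypass_proxy(op, kv_size, threshold=KV_SIZE_THRESHOLD):
    """Read-write separation first, then key-value size separation for writes."""
    if op is Op.READ:
        return True  # reads go straight to the database engine and NVM
    return kv_size < threshold  # small writes bypass, larger writes are proxied


def route(op, kv_size, proxy_thread_ids, rng=random):
    """Return None for a bypassed request, else the id of a randomly chosen proxy thread."""
    if should_bypass_proxy(op, kv_size):
        return None
    return rng.choice(proxy_thread_ids)


target = route(Op.WRITE, 4096, proxy_thread_ids=[0, 1, 2, 3])
```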

[0059] Circular Buffer: The circular buffer uses an array to implement a circular queue, and uses a polling method based on spinlocks as the communication mechanism between the circular buffer and the agent thread.
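As an illustration, an array-backed circular queue guarded by a spin lock might look like the following. This is a sketch under stated assumptions rather than the patented implementation: the class names `SpinLock` and `RingBuffer` are hypothetical, the spin lock is emulated with a non-blocking `threading.Lock` acquire, and `None` is reserved to mean "buffer empty".

```python
import threading


class SpinLock:
    """Busy-waiting lock emulated on top of a non-blocking acquire."""

    def __init__(self):
        self._lock = threading.Lock()

    def __enter__(self):
        while not self._lock.acquire(blocking=False):
            pass  # spin until the lock becomes free
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


class RingBuffer:
    """Fixed-capacity circular queue stored in a preallocated list."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0   # index of the next item to pop
        self._tail = 0   # index of the next free slot
        self._count = 0
        self._lock = SpinLock()

    def try_push(self, item):
        """Store `item`; return False if the buffer is full."""
        with self._lock:
            if self._count == self._capacity:
                return False
            self._slots[self._tail] = item
            self._tail = (self._tail + 1) % self._capacity
            self._count += 1
            return True

    def try_pop(self):
        """Return the oldest item, or None if the buffer is empty."""
        with self._lock:
            if self._count == 0:
                return None
            item = self._slots[self._head]
            self._slots[self._head] = None
            self._head = (self._head + 1) % self._capacity
            self._count -= 1
            return item


buf = RingBuffer(capacity=2)
```

A proxy thread would poll `try_pop()` in a loop, which matches the polling style of communication described above.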

[0060] Proxy Thread: The proxy thread continuously retrieves requests from the buffer and parses them to perform local proxy operations. First, the thread binds itself to a core of the specified node. Then, the thread enters a loop. The main steps in the loop are as follows. First, the thread checks whether the stop flag is set. If a stop signal has been received and the number of requests sent to the bound circular buffer equals the number of requests received from it, the thread has finished processing all requests; it records the end time and terminates the loop. Otherwise, it retrieves a request from the circular buffer. If there is no request, the loop restarts. If a request exists and the task type is a write operation, the write interface of the database engine is called to perform the actual data write; if the task type is a read operation, the read interface of the database engine is called to perform the actual data read.
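The loop described above can be sketched as follows. Everything named here is hypothetical: `DictEngine` stands in for the real database engine, a `deque` stands in for the bound circular buffer, and core binding is only indicated by a comment.

```python
import threading
import time
from collections import deque


class DictEngine:
    """In-memory stand-in for the database engine (assumption for the sketch)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class ProxyWorker(threading.Thread):
    def __init__(self, engine):
        super().__init__(daemon=True)
        self.engine = engine
        self._buffer = deque()      # stand-in for the bound circular buffer
        self._lock = threading.Lock()
        self.sent = 0               # requests handed to this worker
        self.received = 0           # requests this worker has taken out
        self.stop_flag = threading.Event()
        self.results = []           # values returned by proxied reads
        self.end_time = None

    def submit(self, op, key, value=None):
        with self._lock:
            self._buffer.append((op, key, value))
            self.sent += 1

    def run(self):
        # A real worker would first pin itself to a core of its NUMA node
        # (on Linux, for example, via os.sched_setaffinity); omitted here.
        while True:
            with self._lock:
                if self.stop_flag.is_set() and self.sent == self.received:
                    break  # stop requested and every request was taken
                request = self._buffer.popleft() if self._buffer else None
                if request is not None:
                    self.received += 1
            if request is None:
                continue  # nothing queued, poll again
            op, key, value = request
            if op == "write":
                self.engine.put(key, value)
            elif op == "read":
                self.results.append(self.engine.get(key))
        self.end_time = time.monotonic()


worker = ProxyWorker(DictEngine())
worker.start()
worker.submit("write", "k", "v")
worker.submit("read", "k")
worker.stop_flag.set()
worker.join(timeout=5)
```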

[0061] As shown in Figure 3, the DelegateWorkerData, NodeStatus, and DBStatus classes are used to record runtime data in the optimization system.

[0062] The DelegateWorkerData class includes:

[0063] read_cnt records the number of read operations that passed through the proxy;

[0064] write_cnt records the number of write operations that passed through the proxy;

[0065] read_bytes records the amount of data read through the proxy;

[0066] write_bytes records the amount of data written through the proxy;

[0067] write_samples records the number of latency samples taken by the current agent thread;

[0068] write_latency records the cumulative latency sampled by the current agent thread;

[0069] The NodeStatus class includes:

[0070] node_id records the current node number;

[0071] latency records the average latency of all requests on the current node;

[0072] write_throughput records the write throughput of the current node through the proxy;

[0073] read_throughput records the read throughput of the current node through the proxy;

[0074] read_request_down and write_request_down record the number of read and write operations that the current node has completed through the proxy;

[0075] elapse_time stores the elapsed time of the recording period;

[0076] cur_workers records the number of active agent threads on the current node within the period;

[0077] read_QPS and write_QPS record the node's read and write QPS;

[0078] bypass_read_cnt and bypass_write_cnt record the number of bypass read / write requests;

[0079] bypass_read_bytes and bypass_write_bytes record the amount of data requested for bypass read and write operations;

[0080] read_req and write_req record the number of all read and write requests sent to the current node;

[0081] The DBStatus class includes:

[0082] node_status records the node status;

[0083] worker_data records the thread state;

[0084] cpu_utilization records CPU core utilization;

[0085] The three classes provide sampling interfaces, and the monitoring part can complete data collection by calling the interfaces.
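The three classes can be pictured as plain data records. The field names below come from the lists above; the field types, the `sample()` and `update()` methods, and the way node-level values are derived from worker counters are assumptions made for illustration, since the text does not specify the sampling interface.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class DelegateWorkerData:
    read_cnt: int = 0
    write_cnt: int = 0
    read_bytes: int = 0
    write_bytes: int = 0
    write_samples: int = 0
    write_latency: float = 0.0  # cumulative sampled latency

    def sample(self) -> "DelegateWorkerData":
        """Hypothetical sampling interface: return a copy of the counters."""
        return replace(self)


@dataclass
class NodeStatus:
    node_id: int
    latency: float = 0.0
    write_throughput: float = 0.0
    read_throughput: float = 0.0
    read_request_down: int = 0
    write_request_down: int = 0
    elapse_time: float = 0.0
    cur_workers: int = 0
    read_QPS: float = 0.0
    write_QPS: float = 0.0
    bypass_read_cnt: int = 0
    bypass_write_cnt: int = 0
    bypass_read_bytes: int = 0
    bypass_write_bytes: int = 0
    read_req: int = 0
    write_req: int = 0

    def update(self, workers: List[DelegateWorkerData], elapse_time: float) -> None:
        """Assumed aggregation of one period's worker counters."""
        self.elapse_time = elapse_time
        self.cur_workers = len(workers)
        self.read_request_down = sum(w.read_cnt for w in workers)
        self.write_request_down = sum(w.write_cnt for w in workers)
        samples = sum(w.write_samples for w in workers)
        # Average latency is approximated from the sampled write latencies.
        self.latency = sum(w.write_latency for w in workers) / samples if samples else 0.0
        if elapse_time > 0:
            self.read_throughput = sum(w.read_bytes for w in workers) / elapse_time
            self.write_throughput = sum(w.write_bytes for w in workers) / elapse_time
            self.read_QPS = self.read_request_down / elapse_time
            self.write_QPS = self.write_request_down / elapse_time


@dataclass
class DBStatus:
    node_status: List[NodeStatus] = field(default_factory=list)
    worker_data: List[List[DelegateWorkerData]] = field(default_factory=list)
    cpu_utilization: float = 0.0


w = DelegateWorkerData(read_cnt=10, write_cnt=20, read_bytes=1000,
                       write_bytes=4000, write_samples=4, write_latency=8.0)
node = NodeStatus(node_id=0)
node.update([w], elapse_time=2.0)
db = DBStatus(node_status=[node], worker_data=[[w.sample()]], cpu_utilization=0.35)
```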

[0086] As shown in Figure 4, the scheduling part provides three scheduling strategies: the load trend prediction strategy, the queue watermark strategy, and the thread spin-wait strategy.

[0087] Specifically, the load trend prediction strategy works as follows. First, statistically analyze the system monitoring data $S_0, S_1, \ldots, S_n$ over the previous n periods and focus on an indicator T with values $T_0, T_1, \ldots, T_n$. Then calculate the average indicator $T_{avg} = \mathrm{Avg}(T_0, T_1, T_2, \ldots, T_n)$. Next, calculate the change of the average indicator over the historical period relative to the indicator in the first period, $T_{speedup} = (T_{avg} - T_0) / T_0$. The value of this change shows the trend of the key indicator over the past few periods: if $T_{speedup}$ is negative, the indicator is in a downward trend; if $T_{speedup}$ is positive, the indicator is in an upward trend. Finally, based on the magnitude of the change $|T_{speedup}|$, dynamically adjust the number of proxy threads for each node.
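A small worked sketch of this calculation follows. The formula for $T_{speedup}$ is taken from the text; the mapping from $|T_{speedup}|$ to a thread-count change (`thread_delta`, with its `step` and `max_delta` parameters) is purely hypothetical, because the text does not specify it, and the sketch assumes an indicator such as request rate for which an upward trend calls for more threads.

```python
def trend_speedup(samples):
    """Return T_speedup = (T_avg - T_0) / T_0 for samples T_0..T_n."""
    if not samples:
        raise ValueError("need at least one sample")
    t0 = samples[0]
    if t0 == 0:
        raise ValueError("T_0 must be non-zero")
    t_avg = sum(samples) / len(samples)
    return (t_avg - t0) / t0


def thread_delta(speedup, step=0.25, max_delta=4):
    """Hypothetical mapping: one thread per `step` of |T_speedup|, capped at max_delta."""
    magnitude = min(int(abs(speedup) / step), max_delta)
    return magnitude if speedup > 0 else -magnitude


rising = trend_speedup([100, 120, 140, 160])   # T_avg = 130, so (130 - 100) / 100
falling = trend_speedup([200, 150, 100, 50])   # T_avg = 125, so (125 - 200) / 200
```

With these inputs, `rising` is 0.3 (upward trend) and `falling` is -0.375 (downward trend); under the assumed mapping they translate into adding one thread and removing one thread, respectively.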

[0088] Queue watermark strategy: the strategy adjusts the number of threads on each node by dividing the message queue into watermark bands. At the beginning of each cycle, the strategy checks the number of requests remaining in the current node's message queues. The message queue has two primary watermarks (high and low) and two secondary watermarks (secondary high and secondary low). When the number of requests remaining in all queues on a node falls below the secondary low watermark, one thread is shut down; when it falls below the low watermark, two threads are shut down. When it lies in the middle band between the low and high watermarks, that is, between the secondary low and secondary high watermarks, no adjustment is made. When it is between the secondary high watermark and the high watermark, one thread is added; when it is above the high watermark, two threads are added.
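A compact way to read the watermark rules is as a function from the pending request count to a thread-count delta. The sketch below assumes the ordering low < secondary low < secondary high < high and treats the band between the two secondary watermarks as the no-adjustment band, since otherwise the bands would overlap; the concrete threshold values and the handling of exact boundary values are illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Watermarks:
    low: int
    secondary_low: int
    secondary_high: int
    high: int


def thread_delta(pending: int, wm: Watermarks) -> int:
    """Return how many proxy threads to add (positive) or shut down (negative)."""
    if pending < wm.low:
        return -2          # below the low watermark: shut down two threads
    if pending < wm.secondary_low:
        return -1          # below the secondary low watermark: shut down one
    if pending > wm.high:
        return 2           # above the high watermark: add two threads
    if pending > wm.secondary_high:
        return 1           # between secondary high and high: add one
    return 0               # middle band: no adjustment
```

With hypothetical thresholds `Watermarks(low=10, secondary_low=20, secondary_high=80, high=90)`, a backlog of 85 requests adds one thread, while a backlog of 5 shuts down two.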

[0089] Thread spin-wait strategy: before sending a request to a proxy thread, the client calculates the request's spin time. The spin time depends on the ratio between the base key-value size and the key-value size in the request: the base spin time is scaled by this ratio to obtain the request's spin time. Inside its loop, the proxy thread first checks whether its current timeout has expired. If it has, the thread is closed; otherwise, the thread retrieves a request and executes it. After retrieving the request, the thread first updates its stop time according to the spin time carried in the request, and then checks whether the message queue is full. If the queue is full, a new thread is woken up.
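The spin-wait loop can be traced with a small single-threaded simulation on a logical clock. Several details here are assumptions: the constants `BASE_KV_SIZE` and `BASE_SPIN_NS`, the direction of the scaling (larger key-value pairs spin longer), sampling queue fullness just before the request is dequeued, and counting wake-ups instead of starting real threads.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

BASE_KV_SIZE = 64     # bytes; hypothetical reference key-value size
BASE_SPIN_NS = 1000   # hypothetical base spin time


def make_request(key: str, kv_size: int) -> Dict[str, object]:
    """Client side: attach the spin time to the request before sending it."""
    # Assumption: larger key-value pairs get a proportionally longer spin time.
    spin_ns = BASE_SPIN_NS * kv_size // BASE_KV_SIZE
    return {"key": key, "spin_ns": spin_ns}


def proxy_loop(pending: Deque[Dict[str, object]], capacity: int,
               now_ns: int = 0, tick_ns: int = 100) -> Tuple[List[str], int, int]:
    """Run one simulated proxy thread until its spin deadline passes.

    Returns (executed keys, wake-ups requested, time at which the thread closed).
    """
    stop_ns = now_ns + BASE_SPIN_NS
    executed: List[str] = []
    wakeups = 0
    while True:
        if now_ns >= stop_ns:              # timeout reached: the thread closes
            return executed, wakeups, now_ns
        if pending:
            was_full = len(pending) >= capacity   # sampled before dequeuing (assumption)
            request = pending.popleft()
            stop_ns = now_ns + int(request["spin_ns"])  # extend the deadline
            if was_full:
                wakeups += 1               # stand-in for waking another proxy thread
            executed.append(str(request["key"]))
        now_ns += tick_ns                  # logical clock tick
```

Each retrieved request pushes the stop time forward, so a thread stays alive while work keeps arriving and closes itself once it has spun for a full spin interval without receiving anything.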

[0090] The scheduling part includes four basic types of adjustment for node proxy threads:

[0091] (1) The node proxy is off: if the number of requests in the previous period exceeds a certain threshold, the node proxy is turned on.

[0092] (2) The node proxy is on: if no requests occurred in the previous period, the node proxy is turned off.

[0093] (3) The node proxy is on and the system load increases: the number of proxy threads is increased.

[0094] (4) The node proxy is on and the system load decreases: the number of proxy threads is reduced.
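The four adjustment types amount to a small decision table. The sketch below is one possible encoding: the `Action` names, the `enable_threshold` default, and the use of a signed `load_delta` to represent a rising or falling load are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    ENABLE_PROXY = "enable_proxy"
    DISABLE_PROXY = "disable_proxy"
    ADD_THREADS = "add_threads"
    REMOVE_THREADS = "remove_threads"
    NONE = "none"


def decide(proxy_on: bool, prev_requests: int, load_delta: float,
           enable_threshold: int = 1000) -> Action:
    """Pick one of the four basic adjustments for a node."""
    if not proxy_on:
        # (1) proxy off: turn it on once traffic exceeds the threshold
        return Action.ENABLE_PROXY if prev_requests > enable_threshold else Action.NONE
    if prev_requests == 0:
        return Action.DISABLE_PROXY      # (2) proxy on, idle period: turn it off
    if load_delta > 0:
        return Action.ADD_THREADS        # (3) load rising: more proxy threads
    if load_delta < 0:
        return Action.REMOVE_THREADS     # (4) load falling: fewer proxy threads
    return Action.NONE
```

How many threads to add or remove in cases (3) and (4) would come from whichever of the three scheduling strategies is in use.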

[0095] This invention also provides a NUMA-aware optimization method for NVM key-value databases, comprising:

[0096] Step S1: Create a proxy thread pool for each NUMA node; the thread pool converts data operation requests sent from each node to the local node into local access through a proxy method.

[0097] Step S2: Monitor the system's operating status and collect operating data; the operating data includes operating data at the proxy thread level, operating data at the node proxy level, and operating data at the system level;

[0098] Step S3: Schedule the proxy threads in the proxy module.
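Step S1 can be pictured with a toy model in which each node owns its data and only that node's proxy threads touch it. This is a sketch under strong simplifications: a Python dict stands in for the NVM-backed engine, no real NUMA binding is performed, a blocking `queue.Queue` replaces the per-thread buffer, and `owner()` uses a plain CRC32 modulo as a stand-in for request sharding. All class and function names are hypothetical.

```python
import queue
import threading
import zlib
from typing import Dict, Optional


class NodeProxy:
    """Owns one node's data; its proxy threads apply every request locally."""

    def __init__(self, node_id: int, workers: int = 2) -> None:
        self.node_id = node_id
        self.store: Dict[str, str] = {}          # stand-in for the node-local engine
        self.requests: queue.Queue = queue.Queue()
        self.lock = threading.Lock()
        self.threads = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(workers)]
        for t in self.threads:
            t.start()

    def _run(self) -> None:
        while True:
            item = self.requests.get()
            if item is None:                      # shutdown signal
                return
            op, key, value, reply = item
            with self.lock:
                if op == "put":
                    self.store[key] = value
                    reply.put(None)
                else:
                    reply.put(self.store.get(key))

    def submit(self, op: str, key: str, value: Optional[str] = None) -> Optional[str]:
        """Called from any node; the work itself runs on this node's proxy thread."""
        reply: queue.Queue = queue.Queue(maxsize=1)
        self.requests.put((op, key, value, reply))
        return reply.get()

    def close(self) -> None:
        for _ in self.threads:
            self.requests.put(None)
        for t in self.threads:
            t.join()


def owner(key: str, node_count: int) -> int:
    """Map a key to the node that stores it (simple modulo stand-in)."""
    return zlib.crc32(key.encode()) % node_count
```

A caller on any node routes a request with `proxies[owner(key, len(proxies))].submit("put", key, value)`, so the data access itself always happens on the owning node's threads.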

[0099] Those skilled in the art will understand that, besides implementing the system and its various devices, modules, and units provided by this invention in the form of purely computer-readable program code, the same functions can be achieved entirely through logical programming of the method steps, making the system and its various devices, modules, and units of this invention function in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, the system and its various devices, modules, and units provided by this invention can be considered as a hardware component, and the devices, modules, and units included therein for implementing various functions can also be considered as structures within the hardware component; alternatively, the devices, modules, and units for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.

[0100] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.

Claims

1. A NUMA-aware optimization system for NVM key-value databases, characterized in that it comprises:
a proxy module, which creates a proxy thread pool for each NUMA node; the thread pool converts data operation requests sent from each node to the local node into local access requests through a proxy;
a monitoring module, which monitors the system's operating status and collects operating data; the operating data includes operating data at the proxy thread level, operating data at the node proxy level, and operating data at the system level;
a scheduling module, which schedules the proxy threads in the proxy module;
the scheduling module includes a mechanism for switching proxy threads on and off and a scheduling strategy; the scheduling strategy adjusts the number of node proxy threads and the switching on and off of node proxies when the system load changes, balancing system performance and resource overhead; the scheduling strategy includes a load trend prediction strategy, a queue watermark strategy, and a thread spin-wait strategy;
the load trend prediction strategy includes: statistically analyzing the system monitoring data $S_0, S_1, \ldots, S_n$ of the previous $n$ periods and the indicator of interest $T_0, T_1, \ldots, T_n$; calculating the average indicator $T_{avg} = \mathrm{Avg}(T_0, T_1, T_2, \ldots, T_n)$; calculating the change of the average indicator over the historical periods relative to the indicator in the first period, $T_{speedup} = (T_{avg} - T_0) / T_0$, thereby obtaining the trend of the key indicator over past periods; if $T_{speedup}$ is negative, the indicator is in a downward trend; if $T_{speedup}$ is positive, the indicator is in an upward trend; and dynamically adjusting the number of proxy threads for each node based on the magnitude of the change $|T_{speedup}|$;
the queue watermark strategy includes dividing the queue into watermarks of different heights to adjust node threads; at the beginning of each cycle, checking the remaining message count in the current node's message queue based on the strategy; the message queue includes a highest watermark, a second-highest watermark, a second-lowest watermark, and a lowest watermark; when the remaining request count in all queues of a node is below the second-lowest watermark, one thread is shut down; when the remaining request count in all queues of a node is below the lowest watermark, two threads are shut down; when the remaining request count in all queues of a node is between the highest and lowest watermarks, the system makes no adjustment; when the remaining request count in all queues of a node is between the second-highest and highest watermarks, one thread is added; when the remaining request count in all queues of a node is above the highest watermark, two threads are added;
the thread spin-wait strategy includes the client calculating the spin time before sending a request to the proxy thread; the spin time is derived from the ratio of the base key-value size to the key-value size in the request, and the base spin time is scaled according to this ratio to obtain the spin time of the request; the proxy thread loop checks whether a timeout has occurred; if a timeout has occurred, the thread is closed; otherwise, the request is retrieved and executed; after retrieving the request, the stop time is updated based on the spin time within the request; the thread then checks whether the message queue is full, and if it is full, a new thread is woken up.

2. The NUMA-aware optimization system for NVM key-value databases according to claim 1, characterized in that the proxy module includes a request sharder based on a consistent hashing algorithm, which forwards requests to the corresponding NUMA nodes, and a node proxy that converts requests into local execution by means of proxy threads.

3. The NUMA-aware optimization system for NVM key-value databases according to claim 1, characterized in that the monitoring module includes a monitoring thread that executes monitoring logic and periodically collects data; a DelegateWorkerData class for storing the state of the proxy thread; a NodeStatus class for storing the state of the node; and a DBStatus class for storing the state of the system.

4. The NUMA-aware optimization system for NVM key-value databases according to claim 3, characterized in that the system uses the DelegateWorkerData, NodeStatus, and DBStatus classes to record runtime data and provide sampling interfaces; the monitoring module calls the sampling interfaces to complete data collection.

5. The NUMA-aware optimization system for NVM key-value databases according to claim 2, characterized in that, when an operation request is sent to a designated node, the corresponding node proxy receives the request and places it in a buffer message queue, and a proxy thread retrieves the request and processes the operation; the node proxy includes a proxy thread pool, a circular buffer, and proxy threads.

6. The NUMA-aware optimization system for NVM key-value databases according to claim 5, characterized in that the proxy thread pool maintains a set of proxy threads for each node during system operation; upon receiving a request for a specified node, it selects the set of proxy threads corresponding to that node and uses a request separation strategy to let special types of requests bypass the proxy; for requests that are proxied, it randomly selects a proxy thread and sends the request to the circular buffer bound to that proxy thread; the circular buffer uses an array to form a circular queue and uses a polling method based on spin locks as the communication mechanism between the circular buffer and the proxy threads; the proxy threads cyclically retrieve requests from the buffer and parse the requests to perform local proxy operations.

7. The NUMA-aware optimization system for NVM key-value databases according to claim 6, characterized in that the separation strategy of the proxy thread pool includes a read-write separation strategy and a key-value size separation strategy; when the request is a read operation, it bypasses the proxy and operates directly on the underlying database engine and NVM; when the request is a write operation, the key-value size separation strategy routes some requests through the proxy: write requests whose key-value pairs are smaller than a preset key-value size bypass the proxy and operate directly on the underlying database engine and NVM, while write requests whose key-value pairs are larger than the preset key-value size are processed through the proxy.

8. A NUMA-aware optimization method for NVM key-value databases, characterized in that it comprises:
Step S1: creating a proxy thread pool for each NUMA node; the thread pool converts data operation requests sent from each node to the local node into local access requests through a proxy;
Step S2: monitoring the system's operating status and collecting operating data; the operating data includes operating data at the proxy thread level, operating data at the node proxy level, and operating data at the system level;
Step S3: scheduling the proxy threads in the proxy module;
Step S3 includes a control mechanism for switching proxy threads on and off and a scheduling strategy; the scheduling strategy adjusts the number of node proxy threads and the switching on and off of node proxies when the system load changes, balancing system performance and resource overhead; the scheduling strategy includes a load trend prediction strategy, a queue watermark strategy, and a thread spin-wait strategy;
the load trend prediction strategy includes: statistically analyzing the system monitoring data $S_0, S_1, \ldots, S_n$ of the previous $n$ periods and the indicator of interest $T_0, T_1, \ldots, T_n$; calculating the average indicator $T_{avg} = \mathrm{Avg}(T_0, T_1, T_2, \ldots, T_n)$; calculating the change of the average indicator over the historical periods relative to the indicator in the first period, $T_{speedup} = (T_{avg} - T_0) / T_0$, thereby obtaining the trend of the key indicator over past periods; if $T_{speedup}$ is negative, the indicator is in a downward trend; if $T_{speedup}$ is positive, the indicator is in an upward trend; and dynamically adjusting the number of proxy threads for each node based on the magnitude of the change $|T_{speedup}|$;
the queue watermark strategy includes dividing the queue into watermarks of different heights to adjust node threads; at the beginning of each cycle, checking the remaining message count in the current node's message queue based on the strategy; the message queue includes a highest watermark, a second-highest watermark, a second-lowest watermark, and a lowest watermark; when the remaining request count in all queues of a node is below the second-lowest watermark, one thread is shut down; when the remaining request count in all queues of a node is below the lowest watermark, two threads are shut down; when the remaining request count in all queues of a node is between the highest and lowest watermarks, the system makes no adjustment; when the remaining request count in all queues of a node is between the second-highest and highest watermarks, one thread is added; when the remaining request count in all queues of a node is above the highest watermark, two threads are added;
the thread spin-wait strategy includes the client calculating the spin time before sending a request to the proxy thread; the spin time is derived from the ratio of the base key-value size to the key-value size in the request, and the base spin time is scaled according to this ratio to obtain the spin time of the request; the proxy thread loop checks whether a timeout has occurred; if a timeout has occurred, the thread is closed; otherwise, the request is retrieved and executed; after retrieving the request, the stop time is updated based on the spin time within the request; the thread then checks whether the message queue is full, and if it is full, a new thread is woken up.

Citation Information

Patent Citations

CN111262753A
CN111078426A
CN116627978A
