A NUMA resource binding relationship monitoring tool based on consistent hashing

A NUMA resource binding relationship monitoring tool based on consistent hashing addresses the difficulty of process binding operations and the poor real-time behavior of performance monitoring. It enables efficient performance monitoring and fault repair, and reduces the complexity of user operations and maintenance costs.

CN115658423B · Active · Publication Date: 2026-06-26 · SHANDONG YUNHAI GUOCHUANG CLOUD COMPUTING EQUIP IND INNOVATION CENT CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANDONG YUNHAI GUOCHUANG CLOUD COMPUTING EQUIP IND INNOVATION CENT CO LTD
Filing Date
2022-09-27
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies suffer from complex and costly process binding operations, low performance after binding, poor real-time performance monitoring, lack of a unified performance monitoring panel, imperfect fault diagnosis and repair mechanisms, and lack of a unified load balancing and performance optimization framework.

Method used

The invention provides a monitoring tool for NUMA resource binding relationships based on consistent hashing, comprising a web front end and back end, IO processes, a monitoring service cluster, device drivers, and NUMA resources. It features a unified performance monitoring panel and fault repair mechanism, automatically determines the optimal resources to bind, and provides a one-click repair function.

Benefits of technology

It reduces the difficulty of process binding operations, improves performance after binding, enables real-time performance monitoring and fault repair, reduces user operation steps and operation and maintenance costs, and has a unified load balancing and performance optimization framework.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a NUMA resource binding relationship monitoring tool based on consistent hashing, which is divided by function into a web front end and back end, IO processes, a monitoring service cluster, device drivers, and NUMA resources. The web front end and back end display the NUMA resource binding relationships, performance optimization suggestions, and a one-click repair button. The IO processes transmit or store data through IO devices. The monitoring service cluster monitors the binding relationships among IO processes, IO devices, CPUs, and memory, and ranks their performance. The device drivers provide driver interfaces for data transmission and performance testing of the IO devices of the NUMA nodes. The NUMA resources comprise the hardware resources managed by the NUMA nodes, including CPUs, memory, and IO devices. The application addresses problems in the prior art such as the difficulty of process binding operations, low performance after process binding, and the poor real-time behavior of performance monitoring.

Description

Technical Field

[0001] This invention relates to the field of server technology, and more specifically to a NUMA resource binding relationship monitoring tool based on consistent hashing.

Background Technology

[0002] Consistent hashing is a special hashing algorithm used to solve distributed caching problems. When multiple servers share the load of data transmission or data processing, adding or removing a server under consistent hashing changes the mapping between service requests and the servers that handle them as little as possible, preserving the monotonicity of request handling as far as possible. This achieves load balancing while avoiding cache invalidation, thereby improving system performance.
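The behavior described above can be illustrated with a minimal hash ring in Python. This is a generic sketch of consistent hashing with virtual nodes, not the algorithm of the present invention; the class name `HashRing`, the `replicas` parameter, and the key and node names are assumptions made for illustration only.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual points placed per node
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.replicas):
            p = _hash(f"{node}#{i}")
            if p in self._owner:   # extremely unlikely collision: keep existing owner
                continue
            bisect.insort(self._points, p)
            self._owner[p] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

When a node is added, the only keys that change owner are those that move to the new node; all other mappings stay as they were, which is the monotonicity property mentioned above.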

[0003] With the development of information technologies such as artificial intelligence, the Internet of Things, and cloud computing, more and more smart hardware has emerged, and the amount of business data related to smart hardware is also increasing exponentially. The requirements for data transmission performance of data center servers are also getting higher and higher. In order to improve the data transmission performance of the system, researchers have optimized many data transmission systems using the NUMA mechanism, thereby improving system performance by leveraging the performance advantages of CPU access to local NUMA resources such as CPU, memory, and I / O devices.

[0004] Currently, the common method for optimizing system data transfer performance based on the NUMA mechanism is to bind the access process of the IO device to the NUMA node with the corresponding IO device and low CPU and memory utilization. This reduces the CPU cache invalidation problem caused by the system process scheduling algorithm, the memory read and write performance degradation caused by cross-NUMA node memory access, and the device read and write performance degradation caused by cross-NUMA node access to the IO device, thereby improving the data transfer performance of the IO process.

[0005] This approach can effectively optimize data transmission performance based on the NUMA mechanism, but the following problems still exist:

[0006] Process binding is a complex operation. It requires using operating system commands to query NUMA nodes with I / O devices and low CPU and memory usage. Furthermore, it necessitates using operating system commands or programming based on the operating system interface to bind processes to specific CPUs and memory locations. This makes process binding challenging for users unfamiliar with operating systems or programming skills.
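To give a sense of the manual work involved, the following Python sketch shows one way a user on Linux might bind a process to the CPUs of a NUMA node by hand. It assumes the Linux sysfs layout `/sys/devices/system/node/node<N>/cpulist` and uses `os.sched_setaffinity`; the helper names are hypothetical. It sets CPU affinity only; binding memory would additionally require a NUMA memory policy (for example via `numactl`), which is not shown.

```python
import os


def parse_cpulist(text: str) -> set:
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def node_cpus(node: int) -> set:
    """Read the CPU ids of one NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def bind_to_node(pid: int, node: int) -> None:
    """Restrict a process to the CPUs of one NUMA node (CPU affinity only)."""
    os.sched_setaffinity(pid, node_cpus(node))
```

Even this small example presumes that the user already knows which node hosts the IO device and which node is lightly loaded, which is the difficulty the paragraph above describes.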

[0007] Even after process binding, performance remains low. Due to the lack of a NUMA resource performance evaluation mechanism, users are often unsure which CPU or memory location to bind the I / O process to achieve optimal performance. They typically simply bind the I / O process to certain CPUs or memory locations and modify the process CPU scheduling algorithm to fully utilize the CPU. Since the CPU, memory, and I / O devices may not be on the same NUMA node and the average utilization of CPU, memory, and I / O devices is uncertain, this may result in low performance even after process binding.

[0008] The real-time performance monitoring is poor. Due to the lack of a unified performance monitoring panel, users cannot observe the performance of the IO process in real time. Problems are usually only discovered when the IO process fails. When problems are discovered, users can only manually run the performance test program provided by the device manufacturer to determine whether the currently bound CPU, memory, and other resources are optimally bound.

[0009] The lack of a unified performance monitoring panel makes it difficult for users to visually observe the resource utilization ranking of each NUMA node, including CPU, memory, and I / O devices, as well as process performance ranking. This hinders their ability to determine which NUMA node a process should be bound to, or whether other processes can be shut down / moved from the current NUMA node to improve the performance of the current process.

[0010] There is a lack of a unified troubleshooting mechanism. Because of this lack, when performance issues arise in the I / O process, users can only manually execute test commands according to the device manufacturer's instructions to determine the cause of the performance problem. This results in numerous performance testing steps and extensive communication and collaboration with the manufacturer, negatively impacting the user experience.

[0011] There is a lack of a unified fault repair mechanism. Due to the lack of a unified fault repair mechanism, I / O process performance issues cannot be automatically repaired. When users locate the I / O process performance problem as being caused by the device itself, it is usually necessary to contact the equipment manufacturer to send engineers to the customer's site to help debug and resolve the I / O process performance problem, resulting in high equipment maintenance costs.

[0012] The lack of a unified load balancing framework means that when a system has multiple devices with similar functions and performance (such as multiple network cards for transmitting network data), but the IO process only uses one of these devices, users typically bind the IO process to a NUMA node with that type of device to improve performance. Later, when performance issues arise due to high device wear and tear, the process is manually bound to another NUMA node. The IO device cannot be automatically switched.

[0013] There is a lack of a unified performance optimization framework. Because of this lack, automatic suggestions for I / O process optimization and one-click performance problem fixing are not available. Users need to learn the performance testing programs and product manuals provided by all device manufacturers and manually perform various tests to optimize process performance. This learning curve is steep when there are many I / O devices.

Summary of the Invention

[0014] In view of this, the purpose of this invention is to propose a NUMA resource binding relationship monitoring tool based on consistent hashing, which solves problems existing in the prior art, such as the difficulty of process binding operations, low performance after process binding, and the poor real-time behavior of performance monitoring.

[0015] Based on the above objective, in one aspect, the present invention provides a NUMA resource binding relationship monitoring tool based on consistent hashing, which is divided by function into a web front end and back end, IO processes, a monitoring service cluster, device drivers, and NUMA resources;

[0016] The web front end and back end display the NUMA resource binding relationships, performance optimization suggestions, and a one-click repair button;

[0017] The IO processes transfer or store data through IO devices;

[0018] The monitoring service cluster monitors the binding relationships among IO processes, IO devices, CPUs, and memory, and ranks their performance;

[0019] The device drivers provide driver interfaces for data transmission and performance testing of the IO devices of the NUMA nodes;

[0020] The NUMA resources comprise the hardware resources managed by the NUMA nodes, including CPUs, memory, and IO devices.

[0021] As a further aspect of the present invention, the web page displays the NUMA resource binding relationships of the IO processes, including:

[0022] Inter-node CPU / memory / IO device performance ranking: displays the sorted results of the remaining resources across NUMA nodes;

[0023] Intra-node process CPU / memory / IO device usage ranking: displays the sorted CPU, memory, and IO device usage of all processes bound to a NUMA node;

[0024] Optimization suggestions: displays performance optimization suggestions related to the NUMA resource binding relationships;

[0025] One-click repair button: displays the one-click repair button associated with each optimization suggestion.
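The inter-node ranking shown on the web page can be sketched as a simple descending sort. The record type `NodeStatus`, its field names, and the units are assumptions for illustration; the source does not specify how the values are represented.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical per-node snapshot of remaining resources."""
    node_id: int
    idle_cpu: float       # sum of idle fractions over the node's CPUs
    free_mem_mb: int      # remaining memory
    io_link_gbps: float   # maximum IO link performance


def rank_nodes(nodes, key: str):
    """Return node ids ordered from most to least of the given resource."""
    ordered = sorted(nodes, key=lambda n: getattr(n, key), reverse=True)
    return [n.node_id for n in ordered]
```

For example, `rank_nodes(nodes, "free_mem_mb")` would place the node with the most remaining memory first; the same pattern applies to ranking the processes bound to a single node by their usage.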

[0026] As a further aspect of the present invention, the web server forwards data requests from the web page to the web backend; the web backend provides data to the web page or processes requests from the web page, including: NUMA status query, optimization suggestion query, and one-click repair execution.

[0027] As a further aspect of the present invention, the NUMA status query includes:

[0028] Process performance query: queries the CPU and memory performance of all processes bound to this NUMA node;

[0029] CPU performance query: queries the sum of the unused capacity of all CPUs in this NUMA node;

[0030] Memory performance query: queries the remaining memory space of this NUMA node;

[0031] Device performance query: queries the controller, IO device, and maximum link performance of this NUMA node.
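As a rough illustration of the CPU and memory queries, the sketch below computes the unused CPU capacity of a node and parses per-node memory figures. It assumes a Linux host, where `/sys/devices/system/node/node<N>/meminfo` contains lines of the form `Node 0 MemFree: 1234 kB`; the function names and the per-CPU utilization input are hypothetical, and the source does not state how these values are actually obtained.

```python
def parse_node_meminfo(text: str) -> dict:
    """Parse lines like 'Node 0 MemFree:  1234 kB' into {'MemFree': 1234}."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: ['Node', '<id>', '<Name>:', '<value>', ...]
        if len(parts) >= 4 and parts[0] == "Node" and parts[2].endswith(":"):
            fields[parts[2][:-1]] = int(parts[3])
    return fields


def idle_cpu_sum(utilization: dict, cpus) -> float:
    """Sum the unused share (1 - utilization) over the CPUs of one node."""
    return sum(1.0 - utilization[c] for c in cpus)
```

Here `utilization` maps a CPU id to a busy fraction between 0 and 1, so a node with two CPUs at 25% and 75% load has an unused capacity of 1.0 CPU.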

[0032] As a further aspect of the present invention, the optimization suggestion query is used to query performance optimization suggestions related to the NUMA resource binding relationships; the one-click repair execution is used to send a one-click performance optimization request to the monitoring service cluster.

[0033] As a further aspect of the present invention, the NUMA resource binding relationship monitoring tool based on consistent hashing also includes data tables, including a NUMA configuration table, a process status table, a CPU status table, a memory status table, a controller status table, a device status table, and an optimization suggestion table.
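One possible layout for these seven tables is sketched below using SQLite from the Python standard library. Only the table purposes come from the text above; every table name, column name, and type here is an assumption for illustration, since the actual definitions are given in Figure 6 and are not reproduced in this text.

```python
import sqlite3

# Hypothetical schema: one table per data table named in the description.
SCHEMA = """
CREATE TABLE numa_config       (node_id INTEGER PRIMARY KEY, cpu_list TEXT, mem_total_kb INTEGER);
CREATE TABLE process_status    (pid INTEGER PRIMARY KEY, node_id INTEGER, cpu_pct REAL, mem_kb INTEGER);
CREATE TABLE cpu_status        (cpu_id INTEGER PRIMARY KEY, node_id INTEGER, idle_pct REAL);
CREATE TABLE memory_status     (node_id INTEGER PRIMARY KEY, free_kb INTEGER);
CREATE TABLE controller_status (controller_id TEXT PRIMARY KEY, node_id INTEGER, max_gbps REAL);
CREATE TABLE device_status     (device_id TEXT PRIMARY KEY, controller_id TEXT, node_id INTEGER, state TEXT);
CREATE TABLE optimization_suggestion (
    id INTEGER PRIMARY KEY AUTOINCREMENT, pid INTEGER, action TEXT, detail TEXT);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables in a new database and return the connection."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db
```

With such a layout, the web backend's status and suggestion queries would reduce to `SELECT` statements against these tables.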

[0034] As a further aspect of the present invention, the IO process is a business process that uses IO devices to transmit data, and its lifecycle includes startup, running, and exit;

[0035] Startup: the process initialization flow, which includes:

[0036] Manual resource binding, which manually binds the process to specific NUMA resources; automatic resource binding, which invokes the NUMA optimization service to bind the process to the optimal NUMA resources; and registration for descriptor update / online reset notifications;

[0037] Running: the process execution flow, in which the process performs read, write, or control operations on the IO device through the device descriptor pointer;

[0038] Exit: the process exit flow, which is the reverse of the startup flow.

[0039] As a further aspect of the present invention, a monitoring service cluster, used to monitor the NUMA resource binding relationship of system IO processes, includes:

[0040] Device management services: device descriptor acquisition, descriptor pointer mapping, device descriptor release;

[0041] NUMA query services: controller performance query, device performance query, link performance query, CPU performance sorting, memory performance sorting, interrupt frequency sorting, process binding query, process performance query, and process performance sorting.

[0042] NUMA optimization services include: optimal binding acquisition, process binding execution, optimization suggestion generation, descriptor pointer update, and online device reset.

[0043] Device descriptor set: Contains a set of I / O device descriptors for each type in the system.

[0044] As a further aspect of the present invention, the NUMA resource binding relationship monitoring tool based on consistent hashing also includes system process / NUMA node CPU / memory / IO device / interrupt status data, that is, the status data on processes and on the CPUs, memory, IO devices, and interrupts of each NUMA node, which the operating system maintains and exposes through the file system.

[0045] As a further aspect of the present invention, the device driver includes:

[0046] Device initialization: initializes the IO device;

[0047] Descriptor creation: creates an IO device descriptor, which can be used to call the other interfaces of the device driver;

[0048] Device read: reads data from the device's internal storage space;

[0049] Device write: writes data into the device's internal storage space;

[0050] Device control: performs control operations on the device;

[0051] Interrupt handling: processes hardware interrupt signals generated by the device;

[0052] Performance testing: tests the maximum performance of the device driver and the performance of each process using the device.
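The driver interfaces listed above can be pictured as an abstract interface with a toy in-memory implementation. This is a user-space sketch for illustration only, not a real kernel driver; all class and method names are assumptions, and interrupt handling and performance testing are omitted for brevity.

```python
from abc import ABC, abstractmethod


class DeviceDriver(ABC):
    """Hypothetical interface mirroring the driver functions listed above."""

    @abstractmethod
    def initialize(self) -> None: ...

    @abstractmethod
    def create_descriptor(self) -> int: ...

    @abstractmethod
    def read(self, desc: int, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, desc: int, offset: int, data: bytes) -> int: ...

    @abstractmethod
    def control(self, desc: int, command: str) -> None: ...


class InMemoryDevice(DeviceDriver):
    """Toy device backed by a bytearray; stands in for real hardware."""

    def __init__(self, size: int = 64):
        self.size = size
        self.mem = bytearray(size)
        self._descriptors = set()

    def initialize(self) -> None:
        self.mem = bytearray(self.size)
        self._descriptors.clear()

    def create_descriptor(self) -> int:
        desc = len(self._descriptors)
        self._descriptors.add(desc)
        return desc

    def _check(self, desc: int, offset: int, length: int) -> None:
        if desc not in self._descriptors:
            raise ValueError("unknown descriptor")
        if offset < 0 or length < 0 or offset + length > self.size:
            raise ValueError("access outside device memory")

    def read(self, desc: int, offset: int, length: int) -> bytes:
        self._check(desc, offset, length)
        return bytes(self.mem[offset:offset + length])

    def write(self, desc: int, offset: int, data: bytes) -> int:
        self._check(desc, offset, len(data))
        self.mem[offset:offset + len(data)] = data
        return len(data)

    def control(self, desc: int, command: str) -> None:
        self._check(desc, 0, 0)
        if command != "reset":          # only a reset command is modeled here
            raise ValueError(f"unsupported command: {command}")
        self.mem = bytearray(self.size)
```

The descriptor returned by `create_descriptor` is the handle that every other call requires, which matches the role of the IO device descriptor in the description.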

[0053] As a further aspect of the present invention, the NUMA resources include multiple NUMA nodes, interrupt controllers, and the like, and each NUMA node includes several CPUs, memory, various types of IO controllers, and various types of IO devices.

[0054] As a further aspect of the present invention, the execution of the numa resource binding relationship monitoring tool based on consistent hashing includes the following steps:

[0055] After the device driver is loaded, the I / O device is initialized through the initialization interface.

[0056] After the device management service starts, it calls the device driver's "descriptor creation" interface to obtain all device descriptors and writes the descriptors into different device descriptor sets according to the IO device type.

[0057] During the startup phase, the IO process manually binds the CPU and memory through the "Manual Resource Binding" module, or calls the device management service through the "Automatic Resource Binding" module. The device management service further calls the "Optimal Binding Acquisition" and "Process Binding Execution" modules of the system optimization service to obtain and bind the optimal resources. Then, the device descriptor is returned to the IO process through the "Descriptor Pointer Mapping" module of the device management service.

[0058] During the runtime phase, the IO process uses the device descriptor pointer to call the read and write interfaces of the IO device driver to transfer data;

[0059] The device driver read / write interface drives the I / O device to perform read / write operations;

[0060] I / O devices perform read and write operations via DMA between device memory and system memory;

[0061] When an I / O device read / write operation is completed, an interrupt notification is sent to the CPU via the interrupt controller.

[0062] During the read / write process of I / O devices, other driver modules in the system record CPU and memory usage and CPU interrupt frequency;

[0063] The numa query service queries the system's process / CPU / memory / IO device / interrupt status data, obtains the system's IO controller, CPU, memory, and IO device performance, and sorts the CPU, memory, and IO device performance of each numa node and the performance of processes bound to the same numa node in descending order.

[0064] The NUMA query service stores the mapping relationship between NUMA nodes and CPU, memory, and I / O devices, as well as the performance values and performance sorting results of CPU, memory, and I / O devices in the database;

[0065] The web page sends a "Get NUMA Status" request to the web backend. The web backend queries the NUMA status data table and returns the mapping relationship between NUMA nodes and CPU, memory, and I / O devices, as well as the performance values and performance order of CPU, memory, and I / O devices to the web frontend, and displays them on the web page.

[0066] The numa optimization service periodically queries the status of IO devices and the optimal binding resources for IO processes. Based on the status of IO devices and the optimal binding resources for IO processes, it generates system optimization suggestions and writes them into the optimization suggestion data table.

[0067] The web page sends a "Get optimization suggestions" request to the web backend. The web backend then queries the optimization suggestion table in the database and returns the optimization suggestions to the web frontend and displays them on the web page.

[0068] When a user clicks the "One-Click Fix" button after each optimization suggestion on the web page, the web page sends a "numa optimization" request to the web backend through the web server.

[0069] Based on the numa optimization suggestions, the web backend calls the "descriptor pointer update" or "device online reset" module of the numa optimization service to optimize the IO process and IO device performance;

[0070] The "Online Device Reset" module of the numa optimization service calls the device control interface of the device driver to reset the device;

[0071] The device driver software drives the I / O device to perform a reset operation;

[0072] After the I / O device reset is complete, an interrupt signal is sent to notify the device driver software that the reset is complete.

[0073] The device driver software notifies the NUMA optimization service that the device reset is complete.

[0074] The numa optimization service deletes optimization suggestions from the database and returns an optimization completion notification to the web backend.

[0075] The web backend returns an optimization completion status to the web page, notifying the web page to display that the numa optimization is complete.
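The final steps of this flow, from the user clicking the one-click fix button to the completion notification, can be sketched as follows. The class `OptimizationService`, the `Suggestion` record, the action strings, and the returned status text are all assumptions for illustration; the handlers stand in for the "descriptor pointer update" and "device online reset" modules.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Suggestion:
    """Hypothetical row of the optimization suggestion table."""
    suggestion_id: int
    action: str    # "descriptor_pointer_update" or "device_online_reset"
    target: str    # identifier of the affected process or device


class OptimizationService:
    """Sketch of the optimization service handling a one-click fix."""

    def __init__(self, handlers: Dict[str, Callable[[str], None]]):
        self.handlers = handlers
        self.suggestions: Dict[int, Suggestion] = {}

    def add_suggestion(self, s: Suggestion) -> None:
        self.suggestions[s.suggestion_id] = s

    def one_click_fix(self, suggestion_id: int) -> str:
        s = self.suggestions[suggestion_id]
        self.handlers[s.action](s.target)     # run the update or the reset
        del self.suggestions[suggestion_id]   # suggestion is removed once applied
        return "optimization complete"
```

In this sketch the suggestion is deleted only after its handler returns, so a handler that raises leaves the suggestion in place for a later retry; whether the actual tool behaves this way is not stated in the source.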

[0076] Compared to traditional implementations, the main advantages of this invention are:

[0077] Process binding is relatively easy. Since it doesn't require using operating system commands to query NUMA nodes with I / O devices that have low CPU and memory usage and high I / O device link performance, and it doesn't require using operating system commands or programming based on operating system interfaces to bind processes to specific CPUs or memory locations, the process binding operation is relatively simple.

[0078] High performance is achieved after process binding. Due to the NUMA resource performance evaluation mechanism, the system automatically determines the optimal binding resources for IO processes, including CPU, memory, and IO devices. Since the CPU, memory, and IO devices are on the same NUMA node and their average utilization is low, high performance is achieved after the process is bound to the optimal resources.

[0079] The performance monitoring offers high real-time performance. With a unified performance monitoring panel, users can observe the data transfer performance of the I / O process in real time, identify performance issues, and fix them with a one-click repair button.

[0080] It features a unified performance monitoring panel. This panel allows users to intuitively observe the CPU, memory, and I / O device usage rankings, as well as process performance rankings, across all NUMA nodes. This facilitates determining which NUMA node a process should be bound to, or whether other processes can be shut down / moved from the current NUMA node to improve the performance of the current process.

[0081] It features a unified troubleshooting mechanism. Because of this unified mechanism, when performance issues arise in the I / O process, users can resolve them with a single click, eliminating the need to manually execute test commands as instructed by the device manufacturer to determine the cause. This reduces user steps and provides a better user experience.

[0082] It features a unified fault repair mechanism. Because of this unified mechanism, the system can automatically repair I / O process performance issues. Users do not need to follow the equipment manufacturer's instructions to find the cause of I / O process performance problems, nor does the equipment manufacturer need to send engineers to the customer's site to resolve I / O process performance issues. Therefore, the equipment's maintenance costs are lower.

[0083] It features a unified load balancing framework. Because of this framework, when the system has multiple devices with similar functionality and performance, but the IO process only uses one of these devices, users do not need to select which NUMA node to bind the IO process to improve performance, nor do they need to manually bind it to other NUMA nodes later when performance issues arise due to device wear and tear. The system can automatically switch the binding relationship between the IO device and the IO process.

[0084] It features a unified performance optimization framework. Because of this framework, it can automatically provide IO process optimization suggestions and one-click performance problem fixing. Users do not need to learn all the performance testing programs and product manuals provided by device manufacturers to perform various tests to optimize process performance. This reduces the learning curve when there are many IO devices.

[0085] These or other aspects of this application will become more apparent from the following description of embodiments. It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not intended to limit the application.

Attached Figure Description

[0086] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other embodiments can be obtained based on these drawings without creative effort.

[0087] In the drawings:

[0088] Figure 1 is a block diagram of the modules of the NUMA resource binding relationship monitoring tool based on consistent hashing of the present invention;

[0089] Figure 2 is a flowchart of obtaining the optimal resource binding for an IO process in the NUMA resource binding relationship monitoring tool based on consistent hashing of the present invention;

[0090] Figure 3 is a flowchart of the execution of the consistent hashing algorithm in the NUMA resource binding relationship monitoring tool based on consistent hashing of the present invention;

[0091] Figure 4 is a flowchart of generating system optimization suggestions in the NUMA resource binding relationship monitoring tool based on consistent hashing of the present invention;

[0092] Figure 5 is a flowchart of one-click performance repair in the NUMA resource binding relationship monitoring tool based on consistent hashing of the present invention;

[0093] Figure 6 is a schematic diagram of the system data table definitions in the NUMA resource binding relationship monitoring tool based on consistent hashing of the present invention.

Detailed Implementation

[0094] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be further described in detail below with reference to specific examples and the accompanying drawings.

[0095] It should be noted that all uses of "first" and "second" in the embodiments of the present invention are for the purpose of distinguishing two different entities or different parameters with the same name. Therefore, "first" and "second" are merely for convenience of expression and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, such as other steps or units inherent in a process, method, system, product, or device that includes a series of steps or units.

[0096] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be further described in detail below with reference to specific examples and the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the scope of this application.

[0097] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0098] The flowchart shown in the attached diagram is for illustrative purposes only and does not necessarily include all content and operations / steps, nor does it necessarily have to be performed in the order described. For example, some operations / steps can be broken down, combined, or partially merged, so the actual execution order may change depending on the actual situation.

[0099] The following detailed description of some embodiments of this application is provided in conjunction with the accompanying drawings. Unless otherwise specified, the following embodiments and features can be combined with each other.

[0100] While methods for optimizing system data transmission performance based on the NUMA mechanism can effectively improve data transmission performance, the following problems still exist:

[0101] Process binding is a complex operation. It requires using operating system commands to query NUMA nodes with I / O devices and low CPU and memory usage. Furthermore, it necessitates using operating system commands or programming based on the operating system interface to bind processes to specific CPUs and memory locations. This makes process binding challenging for users unfamiliar with operating systems or programming skills.

[0102] Even after process binding, performance remains low. Due to the lack of a NUMA resource performance evaluation mechanism, users are often unsure which CPU or memory location to bind the I / O process to achieve optimal performance. They typically simply bind the I / O process to certain CPUs or memory locations and modify the process CPU scheduling algorithm to fully utilize the CPU. Since the CPU, memory, and I / O devices may not be on the same NUMA node and the average utilization of CPU, memory, and I / O devices is uncertain, this may result in low performance even after process binding.

[0103] The real-time performance monitoring is poor. Due to the lack of a unified performance monitoring panel, users cannot observe the performance of the IO process in real time. Problems are usually only discovered when the IO process fails. When problems are discovered, users can only manually run the performance test program provided by the device manufacturer to determine whether the currently bound CPU, memory, and other resources are optimally bound.

[0104] The lack of a unified performance monitoring panel makes it difficult for users to visually observe the resource utilization ranking of each NUMA node, including CPU, memory, and I / O devices, as well as process performance ranking. This hinders their ability to determine which NUMA node a process should be bound to, or whether other processes can be shut down / moved from the current NUMA node to improve the performance of the current process.

[0105] There is a lack of a unified troubleshooting mechanism. Because of this lack, when performance issues arise in the I / O process, users can only manually execute test commands according to the device manufacturer's instructions to determine the cause of the performance problem. This results in numerous performance testing steps and extensive communication and collaboration with the manufacturer, negatively impacting the user experience.

[0106] There is a lack of a unified fault repair mechanism. Due to the lack of a unified fault repair mechanism, I / O process performance issues cannot be automatically repaired. When users locate the I / O process performance problem as being caused by the device itself, it is usually necessary to contact the equipment manufacturer to send engineers to the customer's site to help debug and resolve the I / O process performance problem, resulting in high equipment maintenance costs.

[0107] The lack of a unified load balancing framework means that when a system has multiple devices with similar functions and performance (such as multiple network cards for transmitting network data), but the IO process only uses one of these devices, users typically bind the IO process to a NUMA node with that type of device to improve performance. Later, when performance issues arise due to high device wear and tear, the process is manually bound to another NUMA node. The IO device cannot be automatically switched.

[0108] There is a lack of a unified performance optimization framework. Because of this lack, automatic suggestions for I / O process optimization and one-click performance problem fixing are not available. Users need to learn the performance testing programs and product manuals provided by all device manufacturers and manually perform various tests to optimize process performance. This learning curve is steep when there are many I / O devices.

[0109] The purpose of this invention is to provide a numa resource binding relationship monitoring tool based on consistent hashing, which solves the problems existing in the prior art, such as the high difficulty of process binding operation, the low performance after binding processes, and the poor real-time performance monitoring.

[0110] As shown in Figure 1, this invention provides a monitoring tool for NUMA resource binding relationships based on consistent hashing, which is divided by function into a web front-end and back-end, IO processes, a monitoring service cluster, device drivers, and NUMA resources.

[0111] The web front-end and back-end are used to display numa resource binding relationships, performance optimization suggestions, and one-click repair buttons; IO processes are used to transfer or store data through IO devices; the monitoring service cluster is used to monitor the binding relationships of IO processes, IO devices, CPU, and memory, and to rank their performance; device drivers are used to provide driver interfaces for data transmission and performance testing of numa node IO devices; numa resources include hardware resources such as CPU, memory, and IO devices managed by numa nodes.

[0112] The complete steps of the method of this invention are as follows. The module block diagram of the consistent hashing-based NUMA resource binding relationship monitoring tool and its execution flow are shown in Figure 1.

[0113] In this embodiment, the web page is used to display the numa resource binding relationship of the IO process, including:

[0114] Inter-node CPU/memory/IO device performance ranking: displays the sorted results of the remaining resources (CPU, memory, and IO devices) across NUMA nodes;

[0115] Intra-node process CPU/memory/IO device usage ranking: displays the sorted results of the CPU, memory, and IO device usage of all processes bound to a NUMA node;

[0116] Optimization suggestions: displays performance optimization suggestions related to NUMA resource binding relationships;

[0117] One-click repair button: displays the one-click repair button associated with each optimization suggestion.

[0118] In this embodiment, the web server is used to forward data requests from the web page to the web backend; the web backend is used to provide data to the web page or process requests from the web page, including: NUMA status query, optimization suggestion query, and one-click repair execution.

[0119] In this embodiment, the NUMA status query includes:

[0120] Process performance query: queries the CPU and memory performance of all processes bound to this NUMA node;

[0121] CPU performance query: queries the total unused CPU capacity across all CPUs in this NUMA node;

[0122] Memory performance query: queries the remaining memory space of this NUMA node;

[0123] Device performance query: queries the maximum performance of the controllers, I/O devices, and links of this NUMA node.

[0124] In this embodiment, the optimization suggestion query is used to query performance optimization suggestions related to the binding relationship of numa resources; the one-click repair execution is used to send a one-click performance optimization request to the monitoring service cluster.

[0125] In this embodiment, the NUMA resource binding relationship monitoring tool based on consistent hashing also includes data tables, including a NUMA configuration table, process status table, CPU status table, memory status table, controller status table, device status table, and optimization suggestion table. Detailed definitions of each data table are shown in Figure 6.

[0126] In this embodiment, the IO process is a business process that uses IO devices to transmit data, including starting, running, and exiting;

[0127] Startup: the process initialization procedure, which includes:

[0128] Manually bind a process to a specific NUMA resource; invoke the NUMA optimization service to automatically bind a process to the optimal NUMA resource; register descriptor update / online reset notifications, and register the process's descriptor update / online reset notification callback interface in the NUMA optimization service;

[0129] The process execution flow is as follows: it performs read, write, or control operations on the I / O device through the device descriptor pointer.

[0130] Exit: The process exit procedure is the reverse of the startup procedure.
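The manual binding step in the initialization procedure can be illustrated on Linux with a minimal sketch. It assumes the CPUs of a NUMA node are given in the kernel's cpulist notation (for example `0-3,8-11`) and uses Python's `os.sched_setaffinity`. The helper names `parse_cpulist` and `bind_to_cpus` are hypothetical, and memory placement, which the tool also binds, is not handled here.

```python
from __future__ import annotations

import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def bind_to_cpus(pid: int, cpus: set[int]) -> set[int]:
    """Pin a process to the given CPUs and return the resulting affinity.

    Only CPUs the process is currently allowed to use are requested, so the
    call does not fail on machines that lack some of the listed CPUs.
    """
    allowed = os.sched_getaffinity(pid)
    target = (cpus & allowed) or allowed  # fall back to the current mask
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)
```

Calling `bind_to_cpus(0, parse_cpulist("0-3"))` pins the calling process to whichever of CPUs 0 to 3 it is permitted to use. The automatic path described above would obtain the CPU list from the NUMA optimization service instead of from the user.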

[0131] In this embodiment, the monitoring service cluster, used to monitor the NUMA resource binding relationship of system IO processes, includes:

[0132] Device Management Service: Device Descriptor Acquisition, Descriptor Pointer Mapping, Device Descriptor Release; among them, Device Descriptor Acquisition acquires the device descriptors of all I / O devices when the service starts; Descriptor Pointer Mapping maps device descriptors to descriptor pointers of processes; and Device Descriptor Release releases the device descriptors of all I / O devices when the service exits.

[0133] NUMA query services include: Controller performance query (querying the maximum performance of the I / O controller); Device performance query (querying the maximum performance of I / O devices); Link performance query (querying the maximum performance of the link between the I / O controller and I / O devices); CPU performance sorting (sorting each NUMA by average CPU utilization from smallest to largest); Memory performance sorting (sorting each NUMA by remaining memory capacity from largest to smallest); Interrupt frequency sorting (sorting each NUMA by interrupt frequency from smallest to largest); Process binding query (querying the NUMA resources currently bound to an I / O process, including CPU and memory); Process performance query (querying the average performance of an I / O process); and Process performance sorting (sorting the average performance of I / O processes from largest to smallest).

[0134] NUMA optimization services include: Optimal binding acquisition (acquires the optimal binding resource for an I / O process); Process binding execution (binds the I / O process to the optimal binding resource); Optimization suggestion generation (generates optimization suggestions for I / O devices and I / O processes); Descriptor pointer update (updates the current device descriptor pointer for the I / O process); and Online device reset (performs an online reset for the I / O device).

[0135] Device descriptor set: Contains a set of I / O device descriptors for each type in the system.
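As an illustration of how the device management service and the device descriptor set might fit together, the following minimal sketch keeps one descriptor set per device type and maps descriptors to processes. The class, field, and method names are assumptions made for this example and are not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceDescriptor:
    device_id: str     # hypothetical identifier, e.g. "nic0"
    device_type: str   # hypothetical type label, e.g. "nic" or "nvme"
    numa_node: int


class DeviceManagementService:
    def __init__(self):
        self.descriptor_sets = defaultdict(dict)  # type -> {device_id: descriptor}
        self.process_pointers = {}                # process name -> descriptor

    def acquire(self, descriptors):
        """Device descriptor acquisition: group descriptors by device type."""
        for descriptor in descriptors:
            self.descriptor_sets[descriptor.device_type][descriptor.device_id] = descriptor

    def map_pointer(self, process_name, device_type, device_id):
        """Descriptor pointer mapping: hand a descriptor to a process."""
        descriptor = self.descriptor_sets[device_type][device_id]
        self.process_pointers[process_name] = descriptor
        return descriptor

    def release_all(self):
        """Device descriptor release: drop every descriptor on service exit."""
        self.descriptor_sets.clear()
        self.process_pointers.clear()
```

In a real deployment the descriptors would come from the device driver's descriptor creation interface rather than being constructed by hand.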

[0136] In this embodiment, the consistent hashing-based NUMA resource binding relationship monitoring tool also includes system status data for processes and for NUMA node CPUs, memory, IO devices, and interrupts. This data is maintained by the operating system and exposed through a file system such as Linux sysfs.
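On Linux, per-node memory status is exposed in files such as `/sys/devices/system/node/node0/meminfo`, with lines of the form `Node 0 MemFree: 12345 kB`. The sketch below shows one way a query service could parse that text and rank nodes by remaining memory, largest first. It works on strings so that it runs without NUMA hardware; the function names and sample values are illustrative only.

```python
import re

# Line format used by /sys/devices/system/node/node<N>/meminfo on Linux.
_MEMINFO_LINE = re.compile(r"^Node\s+(\d+)\s+(\w+):\s+(\d+)(?:\s+kB)?$")


def parse_node_meminfo(text):
    """Return {field: value} for one node's meminfo text (kB where given)."""
    fields = {}
    for line in text.splitlines():
        match = _MEMINFO_LINE.match(line.strip())
        if match:  # fields with unusual names, e.g. 'Active(anon)', are skipped
            fields[match.group(2)] = int(match.group(3))
    return fields


def rank_nodes_by_free_memory(meminfo_by_node):
    """Node ids ordered from most to least free memory."""
    free = {node: parse_node_meminfo(text).get("MemFree", 0)
            for node, text in meminfo_by_node.items()}
    return sorted(free, key=lambda node: free[node], reverse=True)


# Illustrative sample text, not measurements from a real system.
samples = {
    0: "Node 0 MemTotal:       32768000 kB\nNode 0 MemFree:         4000000 kB\n",
    1: "Node 1 MemTotal:       32768000 kB\nNode 1 MemFree:        12000000 kB\n",
}
print(rank_nodes_by_free_memory(samples))  # [1, 0]
```

CPU utilization and interrupt frequency could be ranked the same way from their own status files, with ascending rather than descending order.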

[0137] In this embodiment, the device driver includes:

[0138] Device initialization, initializing I / O devices;

[0139] Descriptor creation: Creates an I / O device descriptor, which can be used to call other interfaces of the device driver;

[0140] Device read: Reading data from the internal storage space of a device;

[0141] Device write: writing data into the device's internal storage space;

[0142] Device control: performs control operations on the device;

[0143] Interrupt handling: processes hardware interrupt signals generated by the device;

[0144] Performance testing: tests the maximum performance of the device driver and the performance of each process using the device.
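The driver functions listed above can be summarized as an interface. The following is a minimal sketch under the assumption that each function maps to one method; the method names and the in-memory toy driver are hypothetical and only stand in for real device storage and hardware interrupts.

```python
import time
from abc import ABC, abstractmethod


class DeviceDriver(ABC):
    """Interface mirroring the listed driver functions (names assumed)."""

    @abstractmethod
    def initialize(self): ...

    @abstractmethod
    def create_descriptor(self): ...

    @abstractmethod
    def read(self, descriptor, size): ...

    @abstractmethod
    def write(self, descriptor, data): ...

    @abstractmethod
    def control(self, descriptor, command): ...

    @abstractmethod
    def handle_interrupt(self, event): ...

    @abstractmethod
    def performance_test(self): ...


class InMemoryDriver(DeviceDriver):
    """Toy driver backed by a bytearray, standing in for device storage."""

    def __init__(self):
        self.storage = bytearray()
        self.ready = False
        self.interrupts = []

    def initialize(self):
        self.ready = True

    def create_descriptor(self):
        if not self.ready:
            raise RuntimeError("device not initialized")
        return object()  # opaque handle used to call the other interfaces

    def read(self, descriptor, size):
        data = bytes(self.storage[:size])
        del self.storage[:size]
        return data

    def write(self, descriptor, data):
        self.storage.extend(data)
        return len(data)

    def control(self, descriptor, command):
        if command == "reset":
            self.storage.clear()
            self.handle_interrupt("reset-complete")
            return True
        return False

    def handle_interrupt(self, event):
        self.interrupts.append(event)

    def performance_test(self):
        payload = bytes(1024)
        start = time.perf_counter()
        written = self.write(None, payload)
        self.read(None, written)
        return {"bytes": written, "seconds": time.perf_counter() - start}
```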

[0145] In this embodiment, the numa resources include multiple numa nodes and interrupt controllers, etc. Each numa node includes several CPUs, memory, various types of IO controllers, and various types of IO devices.

[0146] In some embodiments, when the consistent hashing-based numa resource binding relationship monitoring tool is executed, the following steps are included:

[0147] After the device driver is loaded, the I / O device is initialized through the initialization interface.

[0148] After the device management service starts, it calls the device driver's "descriptor creation" interface to obtain all device descriptors and writes the descriptors into different device descriptor sets according to the IO device type.

[0149] During the startup phase, the IO process either binds the CPU and memory manually through the "Manual Resource Binding" module, or calls the device management service through the "Automatic Resource Binding" module. In the latter case, the device management service calls the "Optimal Binding Acquisition" and "Process Binding Execution" modules of the NUMA optimization service to obtain and bind the optimal resources. The device descriptor is then returned to the IO process through the "Descriptor Pointer Mapping" module of the device management service.

[0150] During the runtime phase, the IO process uses the device descriptor pointer to call the read and write interfaces of the IO device driver to transfer data;

[0151] The device driver read / write interface drives the I / O device to perform read / write operations;

[0152] I / O devices perform read and write operations via DMA between device memory and system memory;

[0153] When an I / O device read / write operation is completed, an interrupt notification is sent to the CPU via the interrupt controller.

[0154] During the read / write process of I / O devices, other driver modules in the system record CPU and memory usage and CPU interrupt frequency;

[0155] The numa query service queries the system's process / CPU / memory / IO device / interrupt status data, obtains the system's IO controller, CPU, memory, and IO device performance, and sorts the CPU, memory, and IO device performance of each numa node and the performance of processes bound to the same numa node in descending order.

[0156] The NUMA query service stores in the database the mapping relationship between NUMA nodes and CPU, memory, and I/O devices, as well as the performance values and performance sorting results of the CPU, memory, and I/O devices;

[0157] The web page sends a "Get Numa Status" request to the web backend. The web backend queries the NUMA status data table and returns to the web frontend the mapping relationship between NUMA nodes and CPU, memory, and I/O devices, as well as the performance values and performance order of the CPU, memory, and I/O devices, which are then displayed on the web page.

[0158] The numa optimization service periodically queries the status of IO devices and the optimal binding resources for IO processes. Based on the status of IO devices and the optimal binding resources for IO processes, it generates system optimization suggestions and writes them into the optimization suggestion data table.

[0159] The web page sends a "Get optimization suggestions" request to the web backend. The web backend then queries the optimization suggestion table in the database and returns the optimization suggestions to the web frontend and displays them on the web page.

[0160] When a user clicks the "One-Click Fix" button after each optimization suggestion on the web page, the web page sends a "numa optimization" request to the web backend through the web server.

[0161] Based on the numa optimization suggestions, the web backend calls the "descriptor pointer update" or "device online reset" module of the numa optimization service to optimize the IO process and IO device performance;

[0162] The "Online Device Reset" module of the numa optimization service calls the device control interface of the device driver to reset the device;

[0163] The device driver software drives the I / O device to perform a reset operation;

[0164] After the I / O device reset is complete, an interrupt signal is sent to notify the device driver software that the reset is complete.

[0165] The device driver software notifies the NUMA optimization service that the device reset is complete.

[0166] The numa optimization service deletes optimization suggestions from the database and returns an optimization completion notification to the web backend.

[0167] The web backend returns an optimization completion status to the web page, notifying the web page to display that the numa optimization is complete.
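The one-click repair part of the flow above can be sketched as a small dispatcher: the backend looks up the suggestion, calls either the descriptor pointer update or the online device reset, and the suggestion is deleted once the repair completes. The sketch is in-memory only; the class, the action strings, and the status values are assumptions for illustration, and the driver call and interrupt wait are reduced to a log entry.

```python
class NumaOptimizationService:
    """In-memory stand-in for the NUMA optimization service."""

    def __init__(self):
        self.suggestions = {}   # suggestion id -> {"action": ..., "target": ...}
        self.log = []           # actions performed, kept for inspection

    def add_suggestion(self, suggestion_id, action, target):
        self.suggestions[suggestion_id] = {"action": action, "target": target}

    def update_descriptor_pointer(self, process_name):
        self.log.append(("descriptor_pointer_update", process_name))

    def reset_device_online(self, device_id):
        # A real service would call the driver's device control interface and
        # wait for the reset-complete notification before returning.
        self.log.append(("device_online_reset", device_id))

    def apply(self, suggestion_id):
        """Run the repair for one suggestion, then delete the suggestion."""
        suggestion = self.suggestions.get(suggestion_id)
        if suggestion is None:
            return {"status": "not_found"}
        handlers = {
            "descriptor_pointer_update": self.update_descriptor_pointer,
            "device_online_reset": self.reset_device_online,
        }
        handler = handlers.get(suggestion["action"])
        if handler is None:
            return {"status": "unsupported"}
        handler(suggestion["target"])
        del self.suggestions[suggestion_id]
        return {"status": "completed"}


def handle_one_click_fix(service, suggestion_id):
    """Web-backend entry point for a "numa optimization" request."""
    return service.apply(suggestion_id)
```

Keeping the suggestion in the table until the handler returns means a failed repair leaves the suggestion visible on the web page, which seems consistent with the deletion step occurring only after the reset completes.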

[0168] The process for obtaining the optimal resource binding for an IO process is shown in Figure 2. The execution flow of the consistent hashing algorithm used in Figure 2 is shown in Figure 3. With reference to the flowchart in Figure 3, the steps of the consistent hashing algorithm are described as follows:

[0169] 1) During the device management service startup phase, a hash ring with a maximum key value of 2^32-1 is generated based on the system's IO device list. Each IO device generates n virtual hash key values (e.g., devices B-01 and B-02 in Figure 3).

[0170] 2) Upon receiving an "automatic resource binding" request from an IO process, a hash key value is generated from the process name (e.g., process 1 in Figure 3).

[0171] 3) Determine between which two I/O devices the hash key value of the I/O process (e.g., process 1 in Figure 3) lies, and bind the I/O process to the I/O device whose hash key value is greater than its own (e.g., bind process 1 to device B in Figure 3).

[0172] 4) After a device (e.g., device B in Figure 3) is damaged or removed by hot-plugging, the I/O process (e.g., process 1 in Figure 3) is bound to the next I/O device whose hash key value is greater than its own (e.g., device C in Figure 3).

[0173] 5) After the device (e.g., device B in Figure 3) is reset and recovers, or is hot-plugged back in, the I/O process (e.g., process 1 in Figure 3) is rebound to the I/O device whose hash key value is greater than its own (e.g., device B in Figure 3).
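The five steps can be sketched as a small hash ring. This is a minimal illustration, not the patented implementation: the use of MD5 to derive ring keys, the virtual-key naming scheme, and the class and method names are all assumptions.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # keys fall in 0 .. 2^32 - 1


def ring_key(name):
    """Map a string to a ring position (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class DeviceRing:
    def __init__(self, devices, virtual_nodes=100):
        self.virtual_nodes = virtual_nodes
        self._keys = []     # sorted ring positions
        self._owner = {}    # ring position -> device name
        for device in devices:
            self.add_device(device)

    def add_device(self, device):
        """Steps 1 and 5: place n virtual keys for a device on the ring."""
        for i in range(self.virtual_nodes):
            key = ring_key(f"{device}-{i:02d}")
            if key in self._owner:
                continue  # rare collision: keep the first owner
            bisect.insort(self._keys, key)
            self._owner[key] = device

    def remove_device(self, device):
        """Step 4: drop every virtual key owned by a failed device."""
        self._keys = [k for k in self._keys if self._owner[k] != device]
        self._owner = {k: d for k, d in self._owner.items() if d != device}

    def bind(self, process_name):
        """Steps 2 and 3: first device key above the process key, wrapping."""
        if not self._keys:
            raise LookupError("no I/O devices on the ring")
        index = bisect.bisect_right(self._keys, ring_key(process_name))
        return self._owner[self._keys[index % len(self._keys)]]
```

A caller would build `DeviceRing(["A", "B", "C", "D"])`, call `bind("process-1")` on an automatic binding request, call `remove_device("B")` when device B fails, and call `add_device("B")` after it recovers. Because a device's virtual keys depend only on its name, re-adding it restores the bindings it held before removal.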

[0174] Compared with other load balancing algorithms, consistent hashing has the following advantage. Suppose multiple I/O processes (e.g., processes 1, 2, and 3 in Figure 3) restart repeatedly and therefore request and cancel automatic binding multiple times; they are bound to different I/O devices (e.g., devices B, C, and D in Figure 3). When a certain I/O device (e.g., device B in Figure 3) is hot-plugged, only the device binding relationship of the I/O processes originally bound to that device (e.g., process 1 in Figure 3) is affected. The device binding relationships of the I/O processes originally bound to other devices (e.g., processes 2 and 3, bound to devices C and D in Figure 3) are not affected. This results in fewer process/CPU cache invalidations and thus higher performance.
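The claimed advantage can be checked with a short experiment that compares a consistent hash ring with a plain modulo scheme when device B is removed. The modulo baseline is an assumption chosen for contrast; the patent does not name the algorithms it compares against. The hash function and helper names are likewise illustrative.

```python
import bisect
import hashlib


def h32(name):
    """32-bit hash of a string (MD5 chosen only for determinism)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")


def make_ring_binder(devices, virtual_nodes=100):
    """Consistent hashing: first virtual device key above the process key."""
    ring = sorted((h32(f"{d}-{i}"), d) for d in devices for i in range(virtual_nodes))
    keys = [key for key, _ in ring]

    def bind(process):
        return ring[bisect.bisect_right(keys, h32(process)) % len(ring)][1]

    return bind


def make_modulo_binder(devices):
    """Baseline: hash(process) modulo the number of devices."""
    def bind(process):
        return devices[h32(process) % len(devices)]

    return bind


def moved(make_binder, processes, before, after):
    """Processes whose device changes when the device list changes."""
    old, new = make_binder(before), make_binder(after)
    return [p for p in processes if old(p) != new(p)]


processes = [f"process-{i}" for i in range(300)]
before = ["B", "C", "D"]
after = ["C", "D"]  # device B is unplugged

ring_before = make_ring_binder(before)
ring_moved = moved(make_ring_binder, processes, before, after)
mod_moved = moved(make_modulo_binder, processes, before, after)

# With the ring, every rebound process was previously bound to B.
print(all(ring_before(p) == "B" for p in ring_moved))  # True
print(len(ring_moved), len(mod_moved))
```

With the ring, the set of rebound processes is exactly the set that was on device B. The modulo baseline typically rebinds more, because changing the device count also changes the result for many processes that were on C or D.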

[0175] In an embodiment of the present invention, the process for generating system optimization suggestions is shown in Figure 4, the process for performing a one-click repair is shown in Figure 5, and the system data tables are defined as shown in Figure 6. The technical solution of this invention provides a NUMA resource binding relationship monitoring tool based on consistent hashing, which can bring the following beneficial effects:

[0176] Process binding is relatively easy. Since it doesn't require using operating system commands to query NUMA nodes with I / O devices that have low CPU and memory usage and high I / O device link performance, and it doesn't require using operating system commands or programming based on operating system interfaces to bind processes to specific CPUs or memory locations, the process binding operation is relatively simple.

[0177] High performance is achieved after process binding. Due to the NUMA resource performance evaluation mechanism, the system automatically determines the optimal binding resources for IO processes, including CPU, memory, and IO devices. Since the CPU, memory, and IO devices are on the same NUMA node and their average utilization is low, high performance is achieved after the process is bound to the optimal resources.

[0178] The performance monitoring offers high real-time performance. With a unified performance monitoring panel, users can observe the data transfer performance of the I / O process in real time, identify performance issues, and fix them with a one-click repair button.

[0179] It features a unified performance monitoring panel. This panel allows users to intuitively observe the CPU, memory, and I / O device usage rankings, as well as process performance rankings, across all NUMA nodes. This facilitates determining which NUMA node a process should be bound to, or whether other processes can be shut down / moved from the current NUMA node to improve the performance of the current process.

[0180] It features a unified troubleshooting mechanism. Because of this unified mechanism, when performance issues arise in the I / O process, users can resolve them with a single click, eliminating the need to manually execute test commands as instructed by the device manufacturer to determine the cause. This reduces user steps and provides a better user experience.

[0181] It features a unified fault repair mechanism. Because of this unified mechanism, the system can automatically repair I / O process performance issues. Users do not need to follow the equipment manufacturer's instructions to find the cause of I / O process performance problems, nor does the equipment manufacturer need to send engineers to the customer's site to resolve I / O process performance issues. Therefore, the equipment's maintenance costs are lower.

[0182] It features a unified load balancing framework. Because of this framework, when the system has multiple devices with similar functionality and performance, but the IO process only uses one of these devices, users do not need to select which NUMA node to bind the IO process to improve performance, nor do they need to manually bind it to other NUMA nodes later when performance issues arise due to device wear and tear. The system can automatically switch the binding relationship between the IO device and the IO process.

[0183] It features a unified performance optimization framework. Because of this framework, it can automatically provide IO process optimization suggestions and one-click performance problem fixing. Users do not need to learn all the performance testing programs and product manuals provided by device manufacturers to perform various tests to optimize process performance. This reduces the learning curve when there are many IO devices.

[0184] It should be noted that this invention, through a unified NUMA resource binding relationship monitoring panel, can sort the CPU, memory, and IO device usage rates and IO process performance of each NUMA node in real time, making it convenient for users to judge the usage of NUMA node CPU, memory, and IO devices by each IO process and to make reasonable judgments on NUMA resource binding relationships.

[0185] This invention utilizes a unified NUMA resource binding relationship optimization framework to automatically provide optimal resource binding configurations for IO processes based on the status of IO devices and IO processes. It also generates and displays performance optimization suggestions for each IO device and IO process, and provides one-click repair for performance issues of IO devices and IO processes, thereby reducing the operation and maintenance costs of IO devices.

[0186] This invention achieves load balancing through a unified NUMA resource binding relationship and a consistent hashing algorithm. This not only binds different IO processes to different IO devices, thereby reducing device wear and mitigating performance degradation, but also reduces changes in the binding relationship between IO processes and IO devices, thereby reducing IO cache invalidation and improving performance.

[0187] This invention proposes a monitoring system for NUMA resource binding relationships based on consistent hashing. The method defines a web front-end, web back-end, performance data table, business processes, monitoring services, device drivers, and NUMA resources, etc. This is only for understanding the specific implementation of the invention and is not intended to limit the invention. Any optimizations made without departing from the spirit and scope of this invention, particularly regarding the design and optimization of the web front-end / back-end, performance data table, NUMA resource binding, performance optimization suggestion generation, one-click performance problem repair, and load balancing processes, are all within the protection scope of this invention.

[0188] The above are exemplary embodiments disclosed in this invention. However, it should be noted that various changes and modifications can be made without departing from the scope of the embodiments of this invention as defined by the claims. The functions, steps, and / or actions of the methods according to the disclosed embodiments described herein do not need to be performed in any particular order. Furthermore, although the elements disclosed in the embodiments of this invention may be described or claimed individually, they may be understood as multiple unless explicitly limited to a singular number.

[0189] It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that, as used herein, "and / or" refers to any and all possible combinations of one or more of the associatedly listed items. The embodiment numbers disclosed above are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0190] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the invention (including the claims) is limited to these examples. Within the framework of the invention, technical features of the above embodiments or different embodiments can be combined, and many other variations of different aspects of the invention exist, which are not provided in the details for the sake of brevity. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the invention should be included within the protection scope of the invention.

Claims

1. A NUMA resource binding relationship monitoring tool based on consistent hashing, characterized in that the NUMA resource binding relationship monitoring tool is divided by function into a web front-end and back-end, IO processes, a monitoring service cluster, device drivers, and NUMA resources; the web front-end and back-end are used to display the NUMA resource binding relationships, performance optimization suggestions, and a one-click repair button; the I/O processes are used to transfer or store data through I/O devices; the monitoring service cluster is used to monitor the binding relationships of IO processes, IO devices, CPU, and memory, and to sort their performance; the device drivers provide driver interfaces for data transmission and performance testing of NUMA node I/O devices; the NUMA resources include hardware resources such as CPU, memory, and I/O devices, with NUMA nodes as the management unit; a device management service provides device descriptor acquisition, descriptor pointer mapping, and device descriptor release; the device management service is used to bind multiple IO processes to different IO devices through a consistent hashing algorithm, such that when a certain IO device is hot-plugged, only the device binding relationship of the IO processes originally bound to that IO device is affected; the web page sends a "get numa status" request to the web backend, and the web backend queries the NUMA status data table and returns to the web frontend the mapping relationship between NUMA nodes and CPU, memory, and I/O devices, as well as the performance values and performance order of the CPU, memory, and I/O devices, which are displayed on the web page; when a user clicks the one-click repair button after an optimization suggestion on the web page, the web page is triggered to send a "numa optimization" request to the web backend through the web server; and the web backend, based on the NUMA optimization suggestions, calls the "descriptor pointer update" or "device online reset" module of the NUMA optimization service to optimize the IO process and IO device performance.

2. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 1, characterized in that the web page is used to display the NUMA resource binding relationships of the IO process, including: inter-node CPU/memory/IO device performance ranking, which displays the sorted results of the remaining resources across NUMA nodes; intra-node process CPU/memory/IO device usage ranking, which displays the sorted results of the CPU, memory, and IO device usage of all processes bound to a NUMA node; optimization suggestions, which display performance optimization suggestions related to NUMA resource binding relationships; and a one-click repair button, which is displayed for each optimization suggestion.

3. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 1, characterized in that the web server is used to forward data requests from the web page to the web backend; and the web backend is used to provide data to the web page or process requests from the web page, including: NUMA status query, optimization suggestion query, and one-click repair execution.

4. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 3, characterized in that the NUMA status query includes: a process performance query, which queries the CPU and memory performance of all processes bound to this NUMA node; a CPU performance query, which queries the total unused CPU capacity across all CPUs in this NUMA node; a memory performance query, which queries the remaining memory space of this NUMA node; and a device performance query, which queries the maximum performance of the controllers, I/O devices, and links of this NUMA node.

5. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 4, characterized in that the optimization suggestion query is used to retrieve performance optimization suggestions related to NUMA resource binding relationships; and the one-click repair execution is used to send a one-click performance optimization request to the monitoring service cluster.

6. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 1, characterized in that the NUMA resource binding relationship monitoring tool based on consistent hashing also includes data tables, including a NUMA configuration table, process status table, CPU status table, memory status table, controller status table, device status table, and optimization suggestion table.

7. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 6, characterized in that the IO processes are business processes that use IO devices to transfer data, with startup, running, and exit phases: startup is the process initialization procedure, which includes manually binding the process to a specific NUMA resource, calling the NUMA optimization service to automatically bind the process to the optimal NUMA resource, and registering descriptor update / online reset notifications; running is the process execution flow, which performs read, write, or control operations on the I/O device through the device descriptor pointer; and exit is the process exit procedure, which is the reverse of the startup procedure.

8. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 1, characterized in that the monitoring service cluster is used to monitor the NUMA resource binding relationships of system IO processes, including: NUMA query services, comprising controller performance query, device performance query, link performance query, CPU performance sorting, memory performance sorting, interrupt frequency sorting, process binding query, process performance query, and process performance sorting; NUMA optimization services, comprising optimal binding acquisition, process binding execution, optimization suggestion generation, descriptor pointer update, and online device reset; and a device descriptor set, which contains a set of I/O device descriptors for each device type in the system.

9. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 1, characterized in that the consistent hashing-based NUMA resource binding relationship monitoring tool also includes system status data for processes and for NUMA node CPUs, memory, IO devices, and interrupts, which is maintained by the operating system and exposed through the file system.

10. The NUMA resource binding relationship monitoring tool based on consistent hashing according to claim 1, characterized in that the device driver includes: device initialization, which initializes I/O devices; descriptor creation, which creates an I/O device descriptor that can be used to call other interfaces of the device driver; device read, which reads data from the internal storage space of a device; device write, which writes data into the device's internal storage space; device control, which performs control operations on the device; interrupt handling, which processes hardware interrupt signals generated by the device; and performance testing, which tests the maximum performance of the device driver and the performance of each process using the device.