Method and system for CPU scheduling for hybrid memory architecture
By identifying process types and resource usage in a hybrid memory architecture and applying process migration and scheduling strategies, the method addresses CPU resource contention in a hybrid memory architecture, thereby improving CPU resource utilization and process performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA NORMAL UNIV
- Filing Date
- 2022-12-14
- Publication Date
- 2026-06-23
AI Technical Summary
In hybrid memory architectures, the existing Linux process scheduler cannot effectively handle the competition for CPU resources and the problem of underutilization, especially when frequent access to persistent memory and frequent computation processes share the same CPU core, leading to decreased computing performance and resource contention.
The method periodically collects CPU resource usage data, analyzes the busyness of physical cores under each NUMA node, identifies CPU-intensive, PM-intensive, and I/O-intensive processes, applies process migration and scheduling strategies, and optimizes L3 cache resource allocation to reduce resource contention.
The method improves CPU resource utilization and enhances the read and write performance of processes under the hybrid memory architecture, especially in the presence of PM-intensive processes, where read and write bandwidth increased by 24%.
Figure CN116225686B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of CPU processing technology, and in particular to a CPU scheduling method and system for hybrid memory architectures.
Background Technology
[0002] With the development of technologies such as cloud computing and big data, the explosive growth in data volume has placed higher demands on server storage performance. To avoid impacting user experience, servers need to handle large numbers of data requests from clients with low read / write latency. Therefore, in traditional memory architectures, servers typically use Dynamic Random Access Memory (DRAM) to reduce read / write latency to external storage. This is because the operating system sets up page caches on DRAM, which caches data on external storage, thus alleviating the performance bottleneck of reading and writing to external storage. However, in the era of explosive data growth, developers need to pay significantly more for configuring more DRAM for servers. Therefore, using DRAM as a cache for external storage to address the speed difference between the CPU and external storage is not currently the optimal solution. The emergence of Persistent Memory (PM) precisely addresses these shortcomings.
[0003] In a hybrid memory architecture, when persistent memory (PM) is configured as a high-speed disk, it lacks support for Direct Memory Access (DMA) technology because the PM resides on the memory bus. DMA technology enables direct data transfer between memory and devices residing on external storage buses such as PCI without consuming CPU resources. Therefore, processes accessing the PM require CPU resources to transfer data between the PM and DRAM. In contrast, ordinary disks can use DMA technology to transfer data between DRAM and external storage, allowing data transfer and CPU computation to occur in parallel. Consequently, the hybrid memory architecture disrupts the traditional memory architecture's division of labor where DMA is used for input / output and the CPU is used for other tasks. This leads to a competition for CPU resources between processes frequently reading and writing to the PM and processes frequently using the CPU for computation.
[0004] Furthermore, the existing Linux process scheduler cannot effectively handle the "idle core" problem that occurs in hybrid memory architectures. For example, when there are idle CPU cores in the current system, there are still some CPU cores running more than two processes. This situation indicates that processes in Linux are not being scheduled properly, and the CPU resources of the idle cores are not being fully utilized.
[0005] Therefore, in a hybrid memory architecture, when processes that frequently access the PM and processes that frequently perform computational operations run on the same CPU core and there are idle cores in the current system, these two types of processes will not only compete with each other, but also result in the CPU's computing performance not being fully utilized.
Summary of the Invention
[0006] This invention aims to address at least one of the technical problems existing in the prior art. To this end, this invention proposes a CPU scheduling method and system for hybrid memory architectures, capable of rationally scheduling processes within a hybrid memory architecture.
[0007] On one hand, embodiments of the present invention provide a CPU scheduling method for hybrid memory architectures, comprising the following steps:
[0008] Periodically collect CPU resource usage data;
[0009] The busy level of each physical core under the NUMA node is analyzed based on the CPU resource usage.
[0010] The task scheduling queue attributes of the busy cores in the physical cores are analyzed. The task scheduling queue attributes include CPU-intensive processes, PM-intensive processes, and I / O-intensive processes.
[0011] The process migration strategy and process scheduling strategy are executed based on the task scheduling queue attributes.
[0012] In some embodiments, the periodic collection of CPU resource usage includes:
[0013] Use the perf stat tool to monitor CPU resources and collect CPU usage information;
[0014] Based on the CPU resources and CPU usage information, analyze the core utilization rate and total CPU utilization rate of each physical core during the acquisition cycle;
[0015] The cycle duration is then modified after calculating the new acquisition cycle based on the total CPU utilization.
[0016] When the total CPU utilization exceeds a first threshold, a program is started to monitor the busy level of each physical core under the NUMA node.
[0017] In some embodiments, analyzing the busy level of each physical core under a NUMA node includes:
[0018] Store the system CPU busy status in the NUMA information table;
[0019] The NUMA information table includes a core sequence number, which represents the logical sequence number of the physical core under the NUMA node. For each core, a table value of 0 indicates that the physical core is in a non-busy state, and a table value of 1 indicates that the physical core is in a busy state.
[0020] In some embodiments, analyzing the task scheduling queue attributes of busy cores in the physical core includes:
[0021] A runtime stubbing mechanism is used for the busy core, setting the LD_PRELOAD environment variable based on the dynamic linker and performing secondary encapsulation on the standard library's read and write functions;
[0022] Add file access paths to the source code of the read and write functions in the standard library to intercept calls to the read and write standard library functions in the process at runtime;
[0023] When the file path being read or written is a path that accesses PM, it indicates that the process corresponding to the read / write is a PM-intensive process, and the PID number of the process corresponding to the read / write is stored in the PM_pid_list linked list; when the file path being read or written is a path that accesses ordinary external storage, it indicates that the process corresponding to the read / write is an I / O-intensive process, and the PID number of the process corresponding to the read / write is stored in the IO_pid_list linked list.
[0024] When perf top detects that a process's CPU utilization is greater than the second threshold and it is not a PM-intensive process, the detected process is determined to be a CPU-intensive process, and the PID of the detected process is stored in the CPU_pid_list linked list.
[0025] In some embodiments, executing the process migration strategy based on the task scheduling queue attributes includes:
[0026] Determine whether there are any idle cores under the local NUMA node where the busy core is located;
[0027] If there is only one idle core under the local NUMA node, the idle core is stored in a linked list using local CPU cache and local memory, and a process with the same attribute is randomly selected from the request queue of the busy core as the target process, and the target process is migrated to the idle core.
[0028] If there are multiple idle cores under the local NUMA node, control the multiple idle cores to start the migration function at the same time, and use the CAS lock-free mechanism to compete for the request queue.
[0029] If no free cores exist under the local NUMA node, search for free cores across NUMA nodes in the global table and schedule CPU-intensive processes across nodes.
[0030] In some embodiments, executing a process scheduling policy based on the task scheduling queue attributes includes:
[0031] Retrieve the non-busy cores across NUMA nodes from the global table, returning those cross-NUMA nodes whose request queue attribute linked lists in the global table contain no I / O-intensive or PM-intensive (memory-access-intensive) processes.
[0032] Processes with the same attribute are selected and scheduled to idle cores across NUMA nodes. The performance of accessing memory nodes is iteratively calculated, sensitive processes are marked, and the sensitive processes are kept in the local node in the next scheduling.
[0033] If there are no free cores across NUMA nodes, reallocate the three-level cache resources in the local NUMA node.
[0034] In some embodiments, the replanning of the three-level cache resources includes:
[0035] Track and sample process accesses to the L3 cache using the Intel PEBS Precise Event Sampling Tool;
[0036] Predict the CPU L3 cache miss rate of the process during access and allocate L3 cache resources to the process.
[0037] In some embodiments, the step of tracking and sampling process accesses to the L3 cache using the Intel PEBS Precise Event Sampling Tool includes:
[0038] Configure IHK / McKernel and start the PEBS driver, specify the L2 cache miss event as the precise event of the Intel PEBS, use the Intel PEBS to sample the internal execution state of the CPU, and periodically store snapshots of the CPU in main memory;
[0039] All load and store instructions are tracked and cached in the kernel buffer; when the last record written to the kernel buffer in Intel PEBS exceeds the threshold configured in the kernel buffer, a CPU interrupt handler is triggered; wherein, the interrupt handler processes Intel PEBS data, writes the data in the kernel buffer to the user buffer, and clears the kernel buffer;
[0040] When the number of occurrences of the precise event reaches a preset threshold, a performance monitor interrupt will be generated. The proportion of addresses accessed by preset instructions in the user-configured PEBS buffer will be calculated to obtain the number of hits and misses in the second-level cache, and the number of accesses in the third-level cache will be obtained.
[0041] In some embodiments, allocating three-level cache resources to the process includes:
[0042] Compare the L3 cache usage of I / O-intensive processes and PM-intensive processes under the local NUMA node, and use Intel CAT to perform isolation partitioning of the L3 cache based on the comparison results;
[0043] Share some Level 3 cache resources and iteratively search for the optimal Level 3 cache solution.
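The partitioning step can be illustrated with a minimal sketch. It assumes that Intel CAT is driven by contiguous capacity bitmasks over the L3 cache ways, and it splits the ways between the I / O-intensive and PM-intensive groups in proportion to their measured L3 usage, with an optional overlapping region for the shared portion. The function names, the way count, and the proportional rule are illustrative assumptions, not the patent's algorithm.

```python
from typing import Tuple


def contiguous_mask(first_way: int, n_ways: int) -> int:
    """Bitmask with n_ways consecutive bits set, starting at bit first_way."""
    return ((1 << n_ways) - 1) << first_way


def partition_l3(total_ways: int, io_usage: float, pm_usage: float,
                 shared_ways: int = 0) -> Tuple[int, int]:
    """Return (io_mask, pm_mask) for the two process groups.

    Assumed layout, from bit 0 upward: [I/O exclusive][shared][PM exclusive].
    Each group gets at least one exclusive way.
    """
    exclusive = total_ways - shared_ways
    if shared_ways < 0 or exclusive < 2:
        raise ValueError("need at least two exclusive ways")
    total_usage = io_usage + pm_usage
    io_share = 0.5 if total_usage <= 0 else io_usage / total_usage
    # Hypothetical rule: exclusive ways proportional to observed L3 usage.
    io_ways = min(exclusive - 1, max(1, round(exclusive * io_share)))
    pm_ways = exclusive - io_ways
    io_mask = contiguous_mask(0, io_ways + shared_ways)
    pm_mask = contiguous_mask(io_ways, shared_ways + pm_ways)
    return io_mask, pm_mask
```

An iterative search for a better allocation could call `partition_l3` with different `shared_ways` values and keep the setting with the lowest measured miss rate; that search loop is left out here because the source does not specify it.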
[0044] On the other hand, embodiments of the present invention provide a CPU scheduling system for hybrid memory architectures, comprising:
[0045] The first module is used to periodically collect CPU resource usage data.
[0046] The second module is used to analyze the busy level of each physical core under the NUMA node based on the CPU resource usage level;
[0047] The third module is used to analyze the task scheduling queue attributes of busy cores in the physical cores. The task scheduling queue attributes include CPU-intensive processes, PM-intensive processes, and I / O-intensive processes.
[0048] The fourth module is used to execute process migration strategies and process scheduling strategies based on the attributes of the task scheduling queue.
[0049] The CPU scheduling method for hybrid memory architecture provided in this embodiment of the invention has the following beneficial effects:
[0050] This embodiment periodically collects CPU resource usage data during hybrid architecture applications, analyzes the busyness of each physical core under the NUMA node based on CPU resource usage data, analyzes the task scheduling queue attributes of busy cores, and then executes process migration and process scheduling strategies based on the task scheduling queue attributes. This enables reasonable process scheduling and improves CPU resource utilization.
[0051] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Attached Figure Description
[0052] The present invention will be further described below with reference to the accompanying drawings and embodiments, wherein:
[0053] Figure 1 is a flowchart illustrating a CPU scheduling method for a hybrid memory architecture according to an embodiment of the present invention;
[0054] Figure 2 is an application flowchart of a CPU scheduling method for a hybrid memory architecture according to an embodiment of the present invention;
[0055] Figure 3 is a schematic diagram of an application system for a CPU scheduling method for a hybrid memory architecture according to an embodiment of the present invention.
Detailed Implementation
[0056] Embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0057] In the description of this invention, "several" means one or more, "multiple" means two or more, "greater than," "less than," and "exceeding" are understood to exclude the stated number, while "above," "below," and "within" are understood to include the stated number. The use of "first" and "second" in the description is merely for distinguishing technical features and should not be construed as indicating or implying relative importance, or implicitly indicating the number of indicated technical features, or implicitly indicating the order of the indicated technical features.
[0058] In the description of this invention, the terms "one embodiment," "some embodiments," "illustrative embodiment," "example," "specific example," or "some examples," etc., refer to specific features or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0059] In related technologies, PM devices have two configuration modes: Memory Mode and App Direct Mode. Memory Mode is based on a memory management system, treating the PM as a large-capacity dynamic random access main memory, with DRAM serving as a high-speed cache. App Direct Mode is based on a block storage device management system, treating the PM as a high-speed disk and using a byte-level access interface to read and write data stored in the PM. Typically, to achieve higher read / write performance in scenarios with frequent access to large amounts of data, developers configure the PM as a high-speed disk on the server and use advanced interfaces for traditional external storage access to read and write data in the PM.
[0060] When a hybrid memory architecture is applied in this scenario, configuring the persistent memory (PM) as a high-speed disk can lead to inefficient use of CPU resources. This is because the PM resides on the memory bus, resulting in a lack of support for Direct Memory Access (DMA) technology. DMA technology enables direct data transfer between memory and devices residing on external storage buses such as PCI without consuming CPU resources. Therefore, processes accessing the PM require CPU resources to transfer data between the PM and DRAM. In contrast, a regular disk can use DMA technology to transfer data between DRAM and external storage, allowing data transfer and CPU computation to occur in parallel. Consequently, the hybrid memory architecture disrupts the traditional memory architecture's division of labor where DMA is used for input / output and the CPU is used for other tasks. This leads to a competition for CPU resources between processes frequently reading and writing to the PM and processes frequently using the CPU for computation.
[0061] Furthermore, the existing Linux process scheduler cannot adequately handle the "idle core" problem that arises in hybrid memory architectures. This means that even when there are idle CPU cores in the system, some CPU cores still have more than two processes running on them. This indicates not only that processes in Linux are not being scheduled properly, but also that the CPU resources of the idle cores are not being fully utilized.
[0062] When processes that frequently access the persistent memory (PM) and processes that frequently perform computational operations run on the same CPU core, and there are idle cores available in the system, these two types of processes will not only compete for resources but also result in the CPU's computing performance not being fully utilized. Therefore, it is necessary to design a reasonable technical solution to distinguish and identify the process type on each CPU core and to design a strategy for the appropriate scheduling of the identified processes. However, existing process scheduling strategies do not take into account the impact of processes accessing the PM on CPU resources and do not propose a scheme to identify processes accessing the PM. This can easily lead to unfavorable resource contention conflicts between the two types of processes within the existing scheduling framework.
[0063] Furthermore, a gap in read / write speed remains between PM and DRAM, and the CPU L3 cache is allocated on a first-come, first-served basis. As a result, processes that read and write different devices compete for CPU L3 cache resources and receive unfair shares of them. This is especially true in extreme cases, such as when running "stress-ng" or other "noisy neighbor" processes, which consume most of the L3 cache on the CPU they run on and thereby impair other processes on the same CPU. It is therefore necessary to monitor the CPU L3 cache usage of processes so that excessive usage does not reduce the efficiency of other processes: (1) When disk data is read and written repeatedly and sequentially, processes accessing ordinary disks benefit from the DRAM page cache, preferentially occupy more CPU L3 cache resources, and may evict PM data from the CPU L3 cache, slowing down processes that read and write PM. (2) When disk data is read and written non-repetitively and randomly, a process accessing an ordinary disk that misses the page cache needs more time than a process accessing PM to fetch data from external storage. Processes that read and write PM therefore preferentially occupy more CPU L3 cache resources and may evict ordinary disk data from the CPU L3 cache, slowing down processes that read and write ordinary disks. In summary, the existing CPU L3 cache allocation strategy does not account for the introduction of PM: the differing read / write performance of external storage devices affects the CPU L3 cache miss rate of processes as described above. Therefore, a reasonable CPU L3 cache allocation strategy needs to be adopted for different types of processes to reduce the number of CPU L3 cache swaps in and out, thereby ensuring reasonable I / O performance of the process.
[0064] Based on this, embodiments of the present invention provide a CPU scheduling method and system for hybrid memory architectures. By identifying the types of processes in the CPU request queue, determining the busy level of CPU computing resources, and providing reasonable strategies for scheduling CPU computing resources for different types of processes under a hybrid memory architecture, the method is designed to avoid a significant decrease in the L3 cache hit rate caused by competition for the CPU L3 cache by different types of processes.
[0065] The embodiments of this application will be described in detail below with reference to the accompanying drawings:
[0066] Referring to Figure 1, this invention provides a CPU scheduling method for hybrid memory architectures, including but not limited to the following steps:
[0067] Step S110: Periodically collect CPU resource usage.
[0068] Step S120: Analyze the busy level of each physical core under the NUMA node based on the CPU resource usage.
[0069] Step S130: Analyze the task scheduling queue attributes of busy cores in the physical cores. The task scheduling queue attributes include CPU-intensive processes, PM-intensive processes, and I / O-intensive processes.
[0070] Step S140: Execute the process migration strategy and process scheduling strategy according to the task scheduling queue attributes.
[0071] In this embodiment, a PM-intensive process refers to a process that "frequently consumes CPU resources to read and write data to novel persistent memory." Under traditional memory architectures, processes can be categorized into three types based on their utilization of CPU and I / O resources: CPU-intensive processes, ordinary processes, and I / O-intensive processes. Because PM-intensive and CPU-intensive processes compete for CPU resources, they significantly reduce task processing efficiency. Therefore, it is necessary to assess CPU resource usage under a NUMA architecture, identify processes that frequently read and write to the PM from the CPU-intensive process category, and reconsider the CPU resource scheduling scheme for each type of process.
[0072] This embodiment is described in detail below with reference to Figure 2 and Figure 3.
[0073] Steps S110-S130 together form the overall CPU resource utilization monitoring design. Within this design, the step of periodically collecting CPU resource usage data includes, but is not limited to, the following steps:
[0074] Use the perf stat tool to monitor CPU resources and collect CPU usage information;
[0075] The collected samples are handed over to the analysis subroutine for processing. The subroutine analyzes the core utilization rate and total CPU utilization rate of each physical core during the collection cycle based on CPU resources and CPU usage information.
[0076] The new acquisition cycle is calculated based on the total CPU utilization, and the cycle duration of the next acquisition is modified by iteratively inputting the parameters into Algorithm 1 in Table 1.
[0077] When the total CPU utilization exceeds the first threshold, a program is started to monitor the busy level of each physical core under the NUMA node.
[0078] Table 1
[0079]
[0080]
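Since Table 1 is not reproduced here, the collection cycle can only be illustrated under assumptions. The sketch below derives per-core and total utilization from two samples of (busy ticks, total ticks) per core, then adjusts the collection period with a hypothetical halve-or-double rule. The threshold value, the period bounds, and the adjustment rule are assumptions for illustration and do not represent Algorithm 1.

```python
from typing import Dict, Tuple

FIRST_THRESHOLD = 0.70   # hypothetical "first threshold" on total CPU utilization
MIN_PERIOD_S = 0.5       # hypothetical lower bound on the collection period
MAX_PERIOD_S = 8.0       # hypothetical upper bound on the collection period

Sample = Dict[int, Tuple[int, int]]  # core number -> (busy_ticks, total_ticks)


def analyze_cycle(prev: Sample, curr: Sample) -> Tuple[Dict[int, float], float]:
    """Return per-core utilization and total utilization for one collection cycle."""
    per_core = {}
    busy_sum = total_sum = 0
    for core, (busy, total) in curr.items():
        d_busy = busy - prev[core][0]
        d_total = total - prev[core][1]
        per_core[core] = d_busy / d_total if d_total > 0 else 0.0
        busy_sum += d_busy
        total_sum += d_total
    return per_core, (busy_sum / total_sum if total_sum > 0 else 0.0)


def next_period(period_s: float, total_util: float) -> float:
    """Assumed rule: sample faster under load, slower when the system is quiet."""
    if total_util > FIRST_THRESHOLD:
        return max(MIN_PERIOD_S, period_s / 2)
    return min(MAX_PERIOD_S, period_s * 2)


def should_analyze_cores(total_util: float) -> bool:
    """The per-core busy analysis starts only when the first threshold is exceeded."""
    return total_util > FIRST_THRESHOLD
```

In a real deployment the tick samples would come from the monitoring tool's output; the parsing step is omitted because its format is not given in the source.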
[0081] When analyzing the workload of each physical core under a NUMA node, if the system CPU resources are under unbalanced load, one or more physical cores will inevitably be busy. To achieve load balancing, it is necessary to analyze the location of the busy core and the NUMA node where it resides. Therefore, when the total utilization rate exceeds the first threshold, a subroutine to analyze the workload of each physical core under the NUMA node is immediately triggered. This subroutine stores the system CPU resource workload in a global NUMA information table, as shown in Table 2. The core number represents the logical sequence number of the physical core under the NUMA node. In the table, "0" indicates that the physical core is in a non-busy state, and "1" indicates that the physical core is in a busy state.
[0082] Table 2
[0083]
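Table 2 is not reproduced here, but the structure it describes can be sketched as a mapping from each NUMA node to the busy flag of every core under it. The per-core busy threshold and the function names below are assumptions for illustration.

```python
from typing import Dict, List

BUSY_THRESHOLD = 0.80  # hypothetical per-core utilization above which a core is busy

NumaTable = Dict[int, Dict[int, int]]  # node -> {core number: 0 non-busy / 1 busy}


def build_numa_table(topology: Dict[int, List[int]],
                     core_util: Dict[int, float]) -> NumaTable:
    """Build the NUMA information table from per-core utilization."""
    return {
        node: {core: int(core_util.get(core, 0.0) > BUSY_THRESHOLD) for core in cores}
        for node, cores in topology.items()
    }


def busy_cores(table: NumaTable, node: int) -> List[int]:
    """Core numbers flagged 1 (busy) under the given node."""
    return [core for core, flag in table[node].items() if flag == 1]


def non_busy_cores(table: NumaTable, node: int) -> List[int]:
    """Core numbers flagged 0 (non-busy) under the given node."""
    return [core for core, flag in table[node].items() if flag == 0]
```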
[0084] When analyzing the task scheduling queue attributes of busy cores in the physical cores, note that under the first-touch policy of the Linux memory management system, a process should run stably under the same NUMA node so that it obtains data from the local memory node and avoids remote memory access over high-speed links. Therefore, achieving CPU resource load balancing requires analyzing not only the busy cores of the NUMA node but also the task attributes in the request queue under the NUMA node: CPU-intensive processes, PM-intensive processes, and I / O-intensive processes. Tasks with the same attributes are then scheduled together, and it is determined whether each task should be scheduled under the local NUMA node or across NUMA nodes.
[0085] To address the issue of severe CPU resource contention and negative impacts on parallel program execution when CPU-intensive and PM-intensive processes run on the same physical core, it is necessary to analyze system CPU resources and implement appropriate process scheduling strategies.
[0086] To uniformly manage the task attributes of the request queues in each physical core of the CPU resources, this embodiment designs a global table. For convenient unified task scheduling, this global table stores the three types of task process IDs in each physical core using three linked lists, as shown in Table 3. The descriptions of the fields in Table 3 are shown in Table 4.
[0087] Table 3
[0088]
[0089] Table 4
[0090]
[0091] In this embodiment, analyzing the task scheduling queue attributes of busy cores in the physical cores may include, but is not limited to, the following steps:
[0092] Runtime stubbing is used on busy cores, setting the LD_PRELOAD environment variable based on the dynamic linker and performing secondary encapsulation of the standard library's read / write functions;
[0093] File access path analysis is added to the source code of the standard library read / write functions to intercept calls to these standard library functions at runtime.
[0094] When the frequently read / written file path is a path that accesses PM, it indicates that the process corresponding to the read / write is a PM-intensive process, and the PID of the process corresponding to the read / write is stored in the PM_pid_list linked list; when the frequently read / written file path is a path that accesses ordinary external storage, it indicates that the process corresponding to the read / write is an I / O-intensive process, and the PID of the process corresponding to the read / write is stored in the IO_pid_list linked list.
[0095] In addition, it is necessary to determine whether the process is a CPU-intensive process. When the CPU utilization of a process detected by perf top is greater than the second threshold and it is not a PM-intensive process, that is, when the CPU utilization is very high and it is not a PM-intensive process, the detected process is determined to be a CPU-intensive process, and the PID number of the detected process is stored in the CPU_pid_list linked list.
[0096] Specifically, the task scheduling queue attributes of busy cores in the physical cores can be analyzed using the algorithm shown in Table 5:
[0097] Table 5
[0098]
[0099]
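Table 5 is not reproduced here. The classification rules described in the preceding paragraphs can be sketched as follows, assuming the read / write wrapper reports each access as a (PID, file path) pair. The PM mount prefix and the threshold value are hypothetical, Python lists stand in for the linked lists, and the sketch records a process on its first access rather than measuring access frequency.

```python
import posixpath
from typing import List

PM_MOUNT_PREFIXES = ("/mnt/pmem",)  # hypothetical mount point of the PM file system
SECOND_THRESHOLD = 0.80             # hypothetical "second threshold" on CPU utilization


class CoreQueueAttributes:
    """Task attributes of one core's request queue (lists stand in for linked lists)."""

    def __init__(self) -> None:
        self.PM_pid_list: List[int] = []
        self.IO_pid_list: List[int] = []
        self.CPU_pid_list: List[int] = []


def is_pm_path(path: str) -> bool:
    """True if the normalized path lies under an assumed PM mount point."""
    real = posixpath.normpath(path)
    return any(real == p or real.startswith(p + "/") for p in PM_MOUNT_PREFIXES)


def record_io(attrs: CoreQueueAttributes, pid: int, path: str) -> None:
    """Called by the assumed read/write wrapper with the accessed file path."""
    target = attrs.PM_pid_list if is_pm_path(path) else attrs.IO_pid_list
    if pid not in target:
        target.append(pid)


def record_cpu_usage(attrs: CoreQueueAttributes, pid: int, cpu_util: float) -> None:
    """High CPU utilization marks a process CPU-intensive unless it is PM-intensive."""
    if cpu_util > SECOND_THRESHOLD and pid not in attrs.PM_pid_list:
        if pid not in attrs.CPU_pid_list:
            attrs.CPU_pid_list.append(pid)
```

The PM-intensive check matters because a process copying data to or from PM also shows high CPU utilization; without it, such a process would be misclassified as CPU-intensive.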
[0100] In hybrid memory architectures, when CPU-intensive processes and PM-intensive processes run on the same core, severe CPU contention occurs, leading to decreased process performance and reduced PM read / write throughput. Therefore, external methods are needed to achieve CPU resource load balancing. Furthermore, due to the first-touch policy in Linux memory management and the influence of CPU cache affinity, it is necessary to estimate the performance degradation of cross-NUMA node scheduling and minimize remote memory accesses. To address these issues, this embodiment proposes process migration and scheduling strategies.
[0101] Specifically, the process migration strategy includes, but is not limited to, the following steps:
[0102] First, it is necessary to determine whether there are any idle cores under the local NUMA node where the busy core is located, based on the physical core information in the NUMA information table updated in the latest collection cycle;
[0103] If only one idle core exists under the local NUMA node, processes can still use the local CPU cache and local memory. Therefore, the idle core, which uses the local CPU cache and local memory, is stored in a linked list. At the same time, the migration function is started: processes with the same attribute (CPU-intensive or PM-intensive) are randomly selected from the request queue of the busy core as target processes, and this series of target processes is migrated to the idle core. If multiple idle cores exist under the local node, the migration function is started on all of them simultaneously to improve CPU utilization. Each idle core first migrates a series of processes with the same attribute to itself; the method then returns to steps S110 and S120 to iteratively analyze whether the cores are busy and to obtain the new task scheduling queue attributes. To speed up migration, a CAS (compare-and-swap) lock-free mechanism is used to compete for the request queue.
[0104] If no free cores exist under the local NUMA node, a cross-node free core needs to be searched in the global table. Furthermore, since persistent memory modules are inserted into the CPU socket and bound to a specific NUMA node, PM-intensive processes can achieve the best read / write performance when accessing the local NUMA node. To avoid affecting the memory node affinity of PM-intensive processes, only processes marked as CPU-intensive are scheduled across nodes.
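The target selection described in the steps above can be sketched as a small planning function. It prefers an idle core on the local NUMA node and falls back to another node only for CPU-intensive processes. The table layout, the dictionary keys, and the function names are assumptions for illustration; the actual algorithm in Table 6 is not reproduced here.

```python
import random
from typing import Dict, List, Optional, Tuple

NumaTable = Dict[int, Dict[int, int]]  # node -> {core number: 0 idle / 1 busy}


def find_idle_core(table: NumaTable, local_node: int) -> Optional[Tuple[int, bool]]:
    """Return (core, is_cross_node), preferring idle cores on the local node."""
    for core, busy in table[local_node].items():
        if not busy:
            return core, False
    for node, cores in table.items():
        if node != local_node:
            for core, busy in cores.items():
                if not busy:
                    return core, True
    return None


def plan_migration(table: NumaTable, local_node: int,
                   queue: Dict[str, List[int]],
                   rng: random.Random) -> Optional[Tuple[int, int]]:
    """Return (pid, target_core) for one migration, or None if nothing can move."""
    found = find_idle_core(table, local_node)
    if found is None:
        return None
    core, cross_node = found
    # Across nodes only CPU-intensive processes move, preserving PM locality.
    allowed = ["CPU_pid_list"] if cross_node else ["CPU_pid_list", "PM_pid_list"]
    candidates = [name for name in allowed if queue.get(name)]
    if not candidates:
        return None
    attribute = rng.choice(candidates)
    return rng.choice(queue[attribute]), core
```

On Linux, a plan of this kind could be applied by pinning the chosen PID with `os.sched_setaffinity(pid, {core})`; whether the patent's implementation uses CPU affinity is not stated in the source.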
[0105] The process migration strategy in this embodiment can be executed using Algorithm 3 as shown in Table 6:
[0106] Table 6
[0107]
[0108]
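The competition of several idle cores for one request queue, mentioned in the migration strategy, follows the usual compare-and-swap retry pattern. The sketch below only emulates it: Python exposes no user-level CAS instruction, so a lock inside `compare_and_swap` stands in for the atomicity that hardware provides, and threads stand in for idle cores. All names are hypothetical.

```python
import threading
from typing import List, Optional


class AtomicInt:
    """Emulated atomic integer; the lock stands in for a hardware CAS instruction."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Set the value to new only if it still equals expected."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def claim_next(queue: List[int], head: AtomicInt) -> Optional[int]:
    """Claim one PID from the shared queue, retrying when another core wins."""
    while True:
        index = head.load()
        if index >= len(queue):
            return None
        if head.compare_and_swap(index, index + 1):
            return queue[index]


def run_idle_cores(queue: List[int], n_cores: int) -> List[List[int]]:
    """Let n_cores workers drain the queue; return the PIDs each one claimed."""
    head = AtomicInt(0)
    claimed: List[List[int]] = [[] for _ in range(n_cores)]

    def worker(slot: int) -> None:
        while True:
            pid = claim_next(queue, head)
            if pid is None:
                return
            claimed[slot].append(pid)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Each PID is claimed by exactly one worker because advancing the head index succeeds for only one contender per position; losers reload the index and retry.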
[0109] Specifically, when I / O-intensive processes and PM-intensive processes run on the same core, due to resource contention, both types of processes require CPU cache and memory resources. Frequent page replacements can severely impair process read / write performance, necessitating appropriate process scheduling strategies to improve resource utilization. The process scheduling strategy in this embodiment includes, but is not limited to, the following steps:
[0110] Retrieve the non-busy cores across NUMA nodes from the global table, returning those cross-NUMA nodes whose request queue attribute linked lists in the global table contain no I / O-intensive or PM-intensive (memory-access-intensive) processes.
[0111] Since both I / O-intensive and PM-intensive processes are affected by the first-touch policy and CPU affinity in the memory management system, cross-node scheduling may reduce their read / write performance during the initial stage. This embodiment employs a process scheduling strategy that randomly selects processes with the same attribute and schedules them to idle cores across nodes, then iteratively calculates the performance of accessing memory nodes. This method also identifies memory-sensitive processes, aiming to keep them on the local node during subsequent scheduling.
[0112] If no idle core exists on other NUMA nodes, the L3 cache resources are re-planned on the local node to minimize resource contention.
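The first two scheduling steps can be sketched as follows. This is only an illustration under stated assumptions: the layout of the global table (`idle_cores`, `queue_attrs`), the use of bandwidth as the performance measure, and the 10% `tolerance` are hypothetical, since the source does not specify how performance is compared or when a process counts as sensitive.

```python
def pick_remote_nodes(global_table, local_node):
    """Return remote nodes that have an idle core and whose request queues
    hold no I/O-intensive or PM-intensive processes.

    global_table: {node_id: {"idle_cores": [...], "queue_attrs": [...]}}
    (hypothetical layout).
    """
    memory_heavy = {"io", "pm"}
    return [
        node
        for node, info in sorted(global_table.items())
        if node != local_node
        and info["idle_cores"]
        and not (memory_heavy & set(info["queue_attrs"]))
    ]


def mark_sensitive(bw_local, bw_remote, tolerance=0.10):
    """Mark pids whose measured bandwidth fell by more than `tolerance`
    after being scheduled across nodes (assumed sensitivity criterion)."""
    sensitive = set()
    for pid, before in bw_local.items():
        after = bw_remote.get(pid)
        if after is not None and before > 0 and after < before * (1 - tolerance):
            sensitive.add(pid)
    return sensitive
```

Processes returned by `mark_sensitive` would be the ones kept on the local node in the next scheduling round.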
[0113] In this embodiment, replanning the L3 cache resources includes designing a CPU L3 resource monitor, predicting the CPU L3 cache miss rate of a process, and allocating L3 cache resources to the process.
[0114] Specifically, the CPU L3 resource monitor uses Intel PEBS precise event sampling to track and sample each process's accesses to the L3 cache, and records the L3 cache hits and misses of the corresponding process.
[0115] This embodiment uses Intel PEBS (Processor Event Based Sampling) to track and sample process accesses to the L3 cache. PEBS is a sampling facility of Intel microarchitectures that extends the hardware performance monitoring counters. The tracking and sampling procedure in this embodiment includes, but is not limited to, the following steps:
[0116] Configure IHK/McKernel and start the PEBS driver, so that kernel-level features for heterogeneous memory architectures can be managed quickly. Intel PEBS requires a precise event to be specified. In this embodiment, the precise event is the L2 cache miss event, because the number of L2 cache misses equals the number of L3 cache access requests. Intel PEBS then samples the CPU's internal execution state (including recently accessed virtual addresses) and periodically stores CPU snapshots in main memory.
[0117] McKernel tracks all load and store instructions and buffers them in the kernel buffer. The user then defines a user-space Intel PEBS buffer, into which each PEBS record stores architectural register and status information. When the last record written by Intel PEBS to the kernel buffer exceeds the threshold configured for that buffer, a CPU interrupt is triggered. The interrupt handler processes the Intel PEBS data, copies the data from the kernel buffer to the user buffer, and then clears the kernel buffer so that the CPU can continue storing records.
[0118] The user buffer contains the linear addresses of the memory references that trigger L2 cache misses and therefore L3 cache accesses. A performance monitoring interrupt is generated when the PEBS buffer overflows, that is, when the number of precise events reaches a preset threshold. The numbers of L2 cache hits and misses are obtained from the proportion of addresses accessed by the L2_MISS_LOADS instruction in the user-configured PEBS buffer, and the number of L3 cache accesses is derived from them.
[0119] Because an access to the L3 cache is not necessarily an L3 cache hit, an inference model based on a histogram of address reuse times is needed to obtain the L3 cache hits and misses of a process. This embodiment builds a monitoring list from the addresses accessed via L2_MISS_LOADS and monitors only a fixed number of addresses over a period of time to collect address reuse counts. When a monitored address is reused for the first time, its reuse time is recorded and the address is marked as sampled; later reuses of that address are not recorded. This design ensures uniform sampling throughout the tracing process. At the end of sampling, the proportion of reused addresses approximates the cache hit rate, and the proportion of non-reused addresses approximates the cache miss rate.
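A minimal sketch of the reuse-based estimate described above follows. It assumes the trace is simply a sequence of sampled addresses, that reuse time is measured as the number of sampled accesses between the first sighting and the first reuse, and that the first `max_monitored` distinct addresses form the monitoring list; none of these details are given in the source.

```python
from collections import Counter


def sample_reuse(trace, max_monitored):
    """Estimate hit and miss rates from first reuses of monitored addresses.

    Returns (hit_rate, miss_rate, histogram of reuse times).
    """
    first_seen = {}   # monitored address -> index of its first access
    reuse_time = {}   # address -> reuse time, recorded once per address
    for t, addr in enumerate(trace):
        if addr in reuse_time:
            continue  # already marked as sampled; later reuses are ignored
        if addr in first_seen:
            reuse_time[addr] = t - first_seen[addr]
        elif len(first_seen) < max_monitored:
            first_seen[addr] = t  # start monitoring this address
    if not first_seen:
        raise ValueError("trace is empty or max_monitored is 0")
    hit_rate = len(reuse_time) / len(first_seen)
    histogram = Counter(reuse_time.values())
    return hit_rate, 1.0 - hit_rate, histogram
```

For example, with the trace `[1, 2, 3, 1, 2, 1, 4]` and three monitored addresses, two of the three monitored addresses are reused, so the estimated hit rate is 2/3.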
[0120] In this embodiment, predicting the CPU L3 cache miss rate of a process and allocating L3 cache resources to the process involves two aspects: using the LRU principle to predict the L3 cache occupancy of traditional processes and of processes that access PM, and designing an L3 cache allocation strategy.
[0121] When the LRU principle is used to predict the L3 cache occupancy of traditional processes and of processes that access PM, allocating L3 cache to PM-intensive processes solely on the basis of L3 cache access frequency is unreasonable, because persistent memory has higher read/write latency than DRAM. This invention therefore takes PM latency into account: it weights the LRU prediction by the speed at which data is loaded from PM into the L3 cache, which yields the heuristic equations shown in formulas (1), (2), and (3):
[0122]
[0123]
[0124] $p_1 : \cdots : p_n : p_{PM} = T_1 \cdot S_{Normal} : \cdots : T_n \cdot S_{Normal} : T_{PM} \cdot S_{PM}$  Formula (3)
[0125] Specifically, formula (1) gives the speed at which data is loaded from ordinary external storage into the L3 cache, formula (2) gives the speed at which data is loaded from PM into the L3 cache, and in formula (3) $T_n$ is the number of L3 cache accesses of process n, as obtained in the preceding section. Formula (3) thus incorporates the speed at which each storage medium loads data into the L3 cache into the LRU calculation.
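As a worked illustration of formula (3), the sketch below turns access counts and load speeds into normalized shares. It assumes that $S_{Normal}$ and $S_{PM}$ are the speeds given by formulas (1) and (2), which are not reproduced here, and that the ratio $p$ can be read as each process's predicted share of the L3 cache. The function and parameter names are hypothetical.

```python
def l3_shares(accesses, speed_normal, speed_pm, pm_pids):
    """Weight each process's L3 access count by the load speed of its medium
    and normalize, following the ratio form of formula (3).

    accesses: {pid: predicted number of L3 cache accesses (T)}
    pm_pids:  set of pids treated as PM-intensive
    """
    weights = {
        pid: count * (speed_pm if pid in pm_pids else speed_normal)
        for pid, count in accesses.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no L3 accesses to weight")
    return {pid: w / total for pid, w in weights.items()}
```

With equal access counts and a PM load speed three times the ordinary speed, `l3_shares({1: 100, 2: 100}, 1.0, 3.0, {2})` gives process 1 a share of 0.25 and process 2 a share of 0.75.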
[0126] In this embodiment, L3 cache resources are allocated to processes through an L3 cache allocation strategy. The strategy targets scenarios in which a cache-hungry application process (an I/O-intensive or PM-intensive process) exhausts the L3 cache or consumes a large share of memory bandwidth, so that the read/write performance of other processes cannot be guaranteed. Existing solutions use cgroups to bind CPU and memory resources to processes, but this coarse-grained allocation cannot control a resource as sensitive and scarce as the processor cache. This embodiment uses Intel Cache Allocation Technology (CAT) to allocate L3 cache resources. The allocation process includes, but is not limited to, the following steps:
[0127] Compare the L3 cache usage of I/O-intensive processes and PM-intensive processes on the local node, then use Intel CAT to partition the L3 cache in isolation according to the comparison, so that each type of process has its own L3 cache ways (LLC ways).
[0128] A process can either own dedicated cache ways or share cache ways with other processes. To find a local optimum within a finite time, the dedicated solution from the first step is compared with shared solutions. The shared solutions are explored with an iterative search: each iteration shares a further portion of the cache ways and compares the resulting L3 cache miss rate, and the search returns the locally optimal solution found within the time limit.
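The two allocation steps can be sketched as a bounded greedy search over way bitmasks. This is not the patent's Algorithm 4. The proportional initial split, the one-way-per-iteration sharing rule, and the `miss_rate` callback (which stands in for a real measurement after applying the masks) are assumptions. Masks are kept contiguous because Intel CAT capacity bitmasks must be contiguous, and programming the masks into hardware is not shown.

```python
def dedicated_masks(total_ways, io_usage, pm_usage):
    """Split the L3 ways in proportion to measured usage; each class gets
    at least one way. I/O takes the low ways, PM the high ways."""
    io_ways = round(total_ways * io_usage / (io_usage + pm_usage))
    io_ways = min(max(io_ways, 1), total_ways - 1)
    io_mask = (1 << io_ways) - 1
    pm_mask = ((1 << total_ways) - 1) & ~io_mask
    return io_mask, pm_mask


def search_shared(total_ways, io_mask, pm_mask, miss_rate, max_steps):
    """Greedy search: in each step, try sharing one more boundary way in
    either direction and keep the variant with the lowest miss rate.
    Stops at a local optimum or after max_steps iterations."""
    full = (1 << total_ways) - 1
    best = miss_rate(io_mask, pm_mask)
    for _ in range(max_steps):
        candidates = [
            (((io_mask << 1) | io_mask) & full, pm_mask),  # I/O reaches one way higher
            (io_mask, (pm_mask >> 1) | pm_mask),           # PM reaches one way lower
        ]
        cost, new_io, new_pm = min(
            (miss_rate(i, p), i, p) for i, p in candidates
        )
        if cost >= best:
            break  # no candidate improves the miss rate: local optimum
        best, io_mask, pm_mask = cost, new_io, new_pm
    return io_mask, pm_mask, best
```

The `max_steps` bound is what keeps the search within a finite time; a larger bound explores more sharing at the cost of more measurement rounds.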
[0129] Specifically, the allocation process in this embodiment can be implemented using Algorithm 4 as shown in Table 7:
[0130] Table 7
[0131]
[0132] In summary, this embodiment adopts a process identification scheme to improve the utilization efficiency of CPU resources. It identifies PM-intensive processes, CPU-intensive processes, and I / O-intensive processes in the CPU's task request queue, determines whether there are idle CPUs in local and remote Non-Uniform Memory Access (NUMA) nodes, records the process types in the task request queues of each CPU, and then adopts a reasonable scheduling strategy to effectively schedule the processes.
[0133] Furthermore, persistent memory (PM) has higher data throughput than ordinary external storage, and the operating system's kernel page cache does not cache PM data. When data is read from or written to PM and ordinary external storage for the first time under the same NUMA node, the first-touch policy and PM's faster data transfer cause PM data to occupy a larger share of the CPU's L3 cache, which significantly degrades the performance of processes that read and write external storage, especially for large data volumes. When the same data is accessed repeatedly, however, PM gains nothing from the kernel page cache, and CPU L3 cache capacity is occupied by data from external storage, so PM's read and write performance drops. Together these factors make L3 cache sharing inefficient.
[0134] Therefore, to balance the load on CPU computing resources and to alleviate the contention for CPU L3 cache resources between PM-intensive and I/O-intensive processes under the same NUMA node, this embodiment proposes a CPU scheduling method and system for hybrid memory architectures.
[0135] Compared with existing technologies, this embodiment provides a CPU scheduling method and system for hybrid memory architectures that targets the new class of persistent memory. Based on the heterogeneous memory architecture, it reconsiders the scheduling strategy for CPU computing resources and the CPU L3 cache. It uses only a small amount of space to record CPU usage and process types under the NUMA architecture, and it records and predicts each process's CPU L3 cache utilization with a faster method. With limited CPU computing resources and CPU L3 cache resources, it achieves fast resource allocation and thereby improves the read and write performance of processes.
[0136] Furthermore, this embodiment was tested on an Intel Xeon Gold 5218 server that supports PM and is equipped with a 128 GB Intel Optane Persistent Memory device. The test environment ran Ubuntu Server 16.04.6 x86_64 with Linux kernel version 5.1.0. The process migration and scheduling strategy and the LLC analysis, prediction, and allocation strategy designed in this embodiment were implemented on top of the PM file system NOVA (version 5.1). The strategy was tested on four mixed workloads: CPU-intensive + I/O-intensive processes, CPU-intensive + PM-intensive processes, I/O-intensive + PM-intensive processes, and CPU-intensive + I/O-intensive + PM-intensive processes. The results show that this embodiment significantly improves performance for mixed workloads that include PM-intensive processes: average read/write bandwidth improves by 24% for CPU-intensive + PM-intensive processes, by 18% for I/O-intensive + PM-intensive processes, and by 22% for CPU-intensive + I/O-intensive + PM-intensive processes.
[0137] This invention provides a CPU scheduling system for hybrid memory architectures, comprising:
[0138] The first module is used to periodically collect CPU resource usage data.
[0139] The second module is used to analyze the busy level of each physical core under the NUMA node based on the CPU resource usage.
[0140] The third module is used to analyze the task scheduling queue attributes of busy cores in the physical cores. The task scheduling queue attributes include CPU-intensive processes, PM-intensive processes, and I / O-intensive processes.
[0141] The fourth module is used to execute process migration and process scheduling strategies based on the task scheduling queue attributes.
[0142] The content of the method embodiments of the present invention is applicable to the system embodiments. The specific functions implemented in the system embodiments are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above methods.
[0143] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments, and various changes can be made within the scope of knowledge possessed by those skilled in the art without departing from the spirit of the present invention. Furthermore, the embodiments of the present invention and the features thereof can be combined with each other unless otherwise specified.
Claims
1. A CPU scheduling method for hybrid memory architectures, characterized in that it includes the following steps: periodically collecting CPU resource usage data; analyzing the busy level of each physical core under the NUMA node based on the CPU resource usage; analyzing the task scheduling queue attributes of busy cores among the physical cores, wherein the attributes include CPU-intensive processes, PM-intensive processes, and I/O-intensive processes, and PM-intensive processes are processes that frequently occupy CPU resources to read and write data to new persistent memory; and executing a process migration strategy and a process scheduling strategy based on the task scheduling queue attributes; wherein the step of executing the process migration strategy based on the task scheduling queue attributes includes: determining whether any idle core exists under the local NUMA node where the busy core is located; if only one idle core exists under the local NUMA node, storing the idle core in a linked list using local CPU cache and local memory, randomly selecting a process with the same attribute from the request queue of the busy core as the target process, and migrating the target process to the idle core; if multiple idle cores exist under the local NUMA node, controlling the multiple idle cores to start the migration function simultaneously and to compete for the request queue using a lock-free CAS mechanism; and if no idle core exists under the local NUMA node, searching the global table for idle cores on other NUMA nodes and scheduling CPU-intensive processes across nodes; and wherein the step of executing the process scheduling strategy based on the task scheduling queue attributes includes: retrieving the non-busy cores on other NUMA nodes from the global table, and returning those remote NUMA nodes whose request-queue attribute linked list in the global table contains no memory-access-intensive (I/O-intensive or PM-intensive) processes; selecting processes with the same attribute and scheduling them to idle cores on other NUMA nodes; iteratively calculating the performance of accessing memory nodes, marking sensitive processes, and keeping the sensitive processes on the local node in the next scheduling; and if no idle core exists on other NUMA nodes, re-planning the L3 cache resources in the local NUMA node.
2. The CPU scheduling method for hybrid memory architectures according to claim 1, characterized in that the periodic collection of CPU resource usage data includes: using the perf stat tool to monitor CPU resources and collect CPU usage information; analyzing, based on the CPU resources and CPU usage information, the utilization of each physical core and the total CPU utilization during the collection period; calculating a new collection period from the total CPU utilization and updating the period length accordingly; and, when the total CPU utilization exceeds a first threshold, starting a program that monitors the busy level of each physical core under the NUMA node.
3. The CPU scheduling method for hybrid memory architectures according to claim 1, characterized in that the analysis of the busy level of each physical core under the NUMA node includes: storing the busy status of the system's CPUs in a NUMA information table, wherein the NUMA information table includes a core sequence number that represents the logical sequence number of the physical core under the NUMA node, a logical sequence number of 0 indicating that the physical core is non-busy and a logical sequence number of 1 indicating that the physical core is busy.
4. The CPU scheduling method for hybrid memory architectures according to claim 1, characterized in that the analysis of the task scheduling queue attributes of busy cores among the physical cores includes: applying a runtime stubbing mechanism to the busy core, in which the LD_PRELOAD environment variable of the dynamic linker is set and the standard library's read and write functions are wrapped; adding file access paths to the source code of the wrapped read and write functions, so that the process's calls to the standard library's read and write functions are intercepted at runtime; when the file path being read or written is a path that accesses PM, determining that the corresponding process is a PM-intensive process and storing its PID in the PM_pid_list linked list; when the file path being read or written is a path that accesses ordinary external storage, determining that the corresponding process is an I/O-intensive process and storing its PID in the IO_pid_list linked list; and when perf top detects that a process's CPU utilization is greater than a second threshold and the process is not a PM-intensive process, determining that the detected process is a CPU-intensive process and storing its PID in the CPU_pid_list linked list.
5. The CPU scheduling method for hybrid memory architectures according to claim 1, characterized in that the re-planning of the L3 cache resources includes: tracking and sampling process accesses to the L3 cache using the Intel PEBS precise event sampling tool; and predicting the CPU L3 cache miss rate of the process during access and allocating L3 cache resources to the process.
6. The CPU scheduling method for hybrid memory architectures according to claim 5, characterized in that the tracking and sampling of process accesses to the L3 cache using the Intel PEBS precise event sampling tool includes: configuring IHK/McKernel and starting the PEBS driver, specifying the L2 cache miss event as the precise event of Intel PEBS, sampling the internal execution state of the CPU with Intel PEBS, and periodically storing CPU snapshots in main memory; tracking all load and store instructions and buffering them in the kernel buffer, wherein a CPU interrupt is triggered when the last record written by Intel PEBS to the kernel buffer exceeds the threshold configured for the kernel buffer, and the interrupt handler processes the Intel PEBS data, writes the data in the kernel buffer to the user buffer, and clears the kernel buffer; and generating a performance monitoring interrupt when the number of occurrences of the precise event reaches a preset threshold, calculating the proportion of addresses accessed by preset instructions in the user-configured PEBS buffer to obtain the numbers of L2 cache hits and misses, and deriving the number of L3 cache accesses.
7. The CPU scheduling method for hybrid memory architectures according to claim 6, characterized in that the allocation of L3 cache resources to the process includes: comparing the L3 cache usage of I/O-intensive processes and PM-intensive processes under the local NUMA node, and using Intel CAT to partition the L3 cache in isolation based on the comparison; and sharing a portion of the L3 cache resources and iteratively searching for the optimal L3 cache allocation.
8. A CPU scheduling system for hybrid memory architectures, characterized in that it includes: a first module for periodically collecting CPU resource usage data; a second module for analyzing the busy level of each physical core under the NUMA node based on the CPU resource usage; a third module for analyzing the task scheduling queue attributes of busy cores among the physical cores, wherein the attributes include CPU-intensive processes, PM-intensive processes, and I/O-intensive processes, and PM-intensive processes are processes that frequently occupy CPU resources to read and write data to new persistent memory; and a fourth module for executing a process migration strategy and a process scheduling strategy based on the task scheduling queue attributes; wherein executing the process migration strategy based on the task scheduling queue attributes includes: determining whether any idle core exists under the local NUMA node where the busy core is located; if only one idle core exists under the local NUMA node, storing the idle core in a linked list using local CPU cache and local memory, randomly selecting a process with the same attribute from the request queue of the busy core as the target process, and migrating the target process to the idle core; if multiple idle cores exist under the local NUMA node, controlling the multiple idle cores to start the migration function simultaneously and to compete for the request queue using a lock-free CAS mechanism; and if no idle core exists under the local NUMA node, searching the global table for idle cores on other NUMA nodes and scheduling CPU-intensive processes across nodes; and wherein executing the process scheduling strategy based on the task scheduling queue attributes includes: retrieving the non-busy cores on other NUMA nodes from the global table, and returning those remote NUMA nodes whose request-queue attribute linked list in the global table contains no memory-access-intensive (I/O-intensive or PM-intensive) processes; selecting processes with the same attribute and scheduling them to idle cores on other NUMA nodes; iteratively calculating the performance of accessing memory nodes, marking sensitive processes, and keeping the sensitive processes on the local node in the next scheduling; and if no idle core exists on other NUMA nodes, re-planning the L3 cache resources in the local NUMA node.