An NVMeoF compute node virtualization method based on mediated pass-through

Introducing an NVMeoF middleware layer into the cloud platform solves the problem of how virtual machines connect to and interact with NVMeoF on the host machine. The result is an efficient and secure virtualization method that avoids the poor performance and high cost of existing technologies and presents a transparent virtual NVMe device interface.

CN116418857B · Active · Publication Date: 2026-06-26 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2023-04-13
Publication Date: 2026-06-26


Abstract

The application discloses an NVMeoF compute node virtualization method based on mediated pass-through, relating to the field of cloud computing technology. The method reuses the high-concurrency implementation in the virtual machine NVMe driver without modifying that driver, which gives better transparency and higher performance. By introducing an NVMeoF intermediate layer that passes an NVMe device through to the virtual machine, NVMeoF discovery and connection are performed by the cloud service provider. Cloud tenants can use the unmodified NVMe driver in the virtual machine to complete their requests without manually configuring NVMeoF discovery and connection, which is convenient and prevents malicious cloud tenants from directly accessing the cloud service provider's storage nodes. The NVMeoF intermediate layer is implemented in software, so the network card does not need an NVMe hardware interface. It also runs in the host operating system, so no on-chip operating system in a smart network card is needed, which lowers cost and eases maintenance.

Description

Technical Field

[0001] This invention relates to the field of cloud computing technology, and in particular to an NVMeoF compute node virtualization method based on mediated pass-through.

Background Technology

[0002] With the continuous advancement of cloud computing technology, cloud platform service providers keep deploying higher-performance hardware and adopting more advanced network protocols. The performance of high-performance remote storage devices often bears directly on the interests of both cloud platform service providers and cloud tenants. Cloud platform servers can use high-performance remote storage devices through various storage protocols that allow network access. The NVMeoF specification aims to enable access to remote NVMe-compatible devices. A detailed description of NVMeoF is provided in the NVMe Base Specification, version 1.4. NVMeoF-compatible devices provide compute nodes with high-performance NVMe devices accessible over a network.

[0003] Figure 1 is an example of an NVMeoF command sent through different transport layers, such as Fibre Channel, InfiniBand, RoCE, iWARP, or Transmission Control Protocol (TCP). NVMeoF supports transmitting NVMe commands over the network to remote devices on storage nodes. Combining RDMA with NVMeoF can use any RDMA technology, including InfiniBand, RoCE, and iWARP. The NVMe and RDMA protocols support queue-level switching: the underlying data remains unchanged, and an RDMA-compatible network card can copy network data directly into buffers in the NVMe protocol stack via direct memory access (DMA). It can likewise copy buffer data in the NVMe protocol stack directly to the network via DMA. RDMA can bypass the operating system to read or write the contents of the connection buffer directly. The network card implements a DMA engine and creates a channel from its RDMA engine to application memory over the bus. The send and receive queues used to transmit work requests are called queue pairs.

[0004] Virtualization aims to divide a host machine into several virtual machines with specific functions by logically abstracting components and isolating resources. Virtualization technology makes it possible to run multiple architectures and operating systems on a single host machine. For cloud platform service providers, virtualization also hides the attributes of the host machine: fatal errors in a virtual machine do not harm the host machine, which ensures isolation and security between the host and its virtual machines. In addition, a single workload on the host machine often cannot fully exploit high-performance configurations such as NVMeoF connections; virtualization improves the utilization of these configurations by sharing them. With the widespread use of virtualization technology, virtual machine monitors (hypervisors) have emerged. A hypervisor is computer software, firmware, or hardware that creates and runs virtual machines.

[0005] Several solutions have emerged to fully utilize NVMeoF connectivity on compute nodes, but all have some shortcomings:

[0006] Software solutions can provide storage devices within virtual machines. Examples include QEMU-emulated NVMe devices and the vhost NVMe target of the Storage Performance Development Kit (SPDK). However, software solutions suffer from higher latency and lower performance because NVMeoF packets are assembled by software running on the processor. For software solutions that access NVMe devices on remote storage nodes, the layered copying of data between memory allocated to the virtual device and memory allocated to the physical device adds further latency.

[0007] Figure 2 shows a pass-through solution in which NVMeoF transactions are processed in user space within a user virtual machine. The Linux kernel and Windows Server allow NVMeoF connections to be set up within the virtual machine. In this solution, the virtual machine connects to a remote NVMeoF storage node through an RDMA-compatible network interface card (NIC) driven by an NVMeoF driver running inside the virtual machine. The cloud tenant must execute discovery and connection commands against the remote storage node. The command format is as follows:

[0008] # Discovery: nvme discover -t rdma -a $storage_node_address

[0009] # Connection: nvme connect -t rdma -a $storage_node_address -n $storage_node_nvme_qualified_name

[0010] Having cloud tenants execute discovery and connection commands against remote storage nodes can cause configuration errors, increases the burden on users, and poses security risks. Cloud tenants may misunderstand the NVMe qualified name (NQN) used within the storage node and mistype it, causing connection establishment to fail. Furthermore, exposing storage node addresses to cloud tenants is a security risk: the storage node address, and potentially neighboring addresses, become reachable by unauthorized access requests from cloud tenants.
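
The two commands above can also be assembled by provider-side tooling, so that a tenant never types the address or the NVMe qualified name. The following is a minimal sketch under stated assumptions: it only builds the argument lists using the `-t`, `-a`, and `-n` options shown above and does not run anything. The function names, the address check, and the `nqn.` prefix check are illustrative choices, not part of the patent.

```python
import ipaddress


def build_discover_argv(storage_addr: str) -> list[str]:
    """Build the argument list for `nvme discover` over RDMA (not executed here)."""
    # Assumption: the storage node is addressed by IP; raises ValueError otherwise.
    ipaddress.ip_address(storage_addr)
    return ["nvme", "discover", "-t", "rdma", "-a", storage_addr]


def build_connect_argv(storage_addr: str, subsystem_nqn: str) -> list[str]:
    """Build the argument list for `nvme connect` over RDMA (not executed here)."""
    ipaddress.ip_address(storage_addr)
    # Assumption: a simple prefix check stands in for full NQN validation.
    if not subsystem_nqn.startswith("nqn."):
        raise ValueError(f"not an NVMe qualified name: {subsystem_nqn!r}")
    return ["nvme", "connect", "-t", "rdma", "-a", storage_addr, "-n", subsystem_nqn]


if __name__ == "__main__":
    # Hypothetical address (documentation range) and NQN, for illustration only.
    print(build_discover_argv("192.0.2.10"))
    print(build_connect_argv("192.0.2.10", "nqn.2023-04.com.example:pool0"))
```

Validating the inputs before the commands are built catches the mistyped-name failure described above on the provider side rather than at connection time.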

[0011] Figure 3 shows an example of a smart network interface card (NIC) with an NVMe hardware interface. The smart NIC itself handles NVMeoF connection setup and configuration and, in the other direction, presents an NVMe hardware interface backed by the NVMeoF connection for use by virtual machines. However, smart NICs are significantly more expensive than regular NICs or RDMA-compatible NICs.

[0012] Therefore, those skilled in the art are dedicated to developing an NVMeoF compute node virtualization method based on mediated pass-through. This method fully utilizes high-performance configurations such as NVMeoF connectivity on compute nodes to leverage their performance advantages.

Summary of the Invention

[0013] In view of the above-mentioned deficiencies of the prior art, the technical problem to be solved by the present invention is how NVMe devices in virtual machines can connect to and interact with NVMeoF in the host machine, and how to define a standard method for NVMeoF virtualization in cloud platforms.

[0014] To achieve the above objectives, this invention provides an NVMeoF compute node virtualization method based on mediated pass-through, comprising the following steps:

[0015] Step 1: The virtual machine interacts with the NVMe driver to trigger the transmission of NVMe management commands;

[0016] Step 2: The virtual machine NVMe driver adds an entry containing the NVMe management command to the management submission queue;

[0017] Step 3: The virtual machine NVMe driver updates the corresponding submission queue tail doorbell, as defined in Section 3.1.24 of the NVMe Base Specification, which causes the virtual machine to exit to the NVMeoF intermediate layer;

[0018] Step 4: The NVMeoF intermediate layer copies the new management submission queue entry, as defined in Section 4.2 of the NVMe Base Specification, or a pointer to that entry, to the RDMA send queue and rings the RDMA doorbell;

[0019] Step 5: The RDMA device sends the command to the remote NVMeoF storage node using the RDMA protocol; the remote NVMeoF storage node receives and processes the command;

[0020] Step 6: The remote NVMeoF storage node sends a response to the RDMA device;

[0021] Step 7: The RDMA device copies the response, via direct memory access, directly to the response buffer in the completion queue accessible to the virtual machine NVMe driver;

[0022] Step 8: The RDMA device injects an interrupt into the virtual machine NVMe driver to notify the virtual machine NVMe driver that a response from the remote NVMeoF storage node is ready;

[0023] Step 9: The virtual machine NVMe driver checks the NVMe completion queue, and the virtual machine processes the responses in the completion queue.

[0024] Furthermore, step 3 emulates the PCIe registers, including the submission queue tail doorbell.
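
To emulate the submission queue tail doorbell, the intermediate layer must first recognize which register a trapped guest write targets. The sketch below decodes a BAR0 offset using the doorbell layout of the NVMe Base Specification, where the submission queue y tail doorbell sits at 0x1000 + (2y) × (4 << CAP.DSTRD) and the completion queue y head doorbell at 0x1000 + (2y + 1) × (4 << CAP.DSTRD). The function name and return shape are hypothetical and only illustrate the arithmetic.

```python
DOORBELL_BASE = 0x1000  # first doorbell register offset within BAR0


def decode_doorbell(offset: int, dstrd: int = 0):
    """Map a BAR0 offset to (queue_id, kind), or None if it is not a doorbell.

    kind is "sq_tail" for a submission queue tail doorbell and "cq_head" for a
    completion queue head doorbell. Queue 0 is the management (admin) queue.
    dstrd is the doorbell stride field CAP.DSTRD.
    """
    stride = 4 << dstrd
    if offset < DOORBELL_BASE or (offset - DOORBELL_BASE) % stride != 0:
        return None
    index = (offset - DOORBELL_BASE) // stride
    # Even slots are submission queue tails, odd slots are completion queue heads.
    return index // 2, ("sq_tail" if index % 2 == 0 else "cq_head")


if __name__ == "__main__":
    for off in (0x1000, 0x1004, 0x1008, 0x14):
        print(hex(off), decode_doorbell(off))
```

A write that decodes to `(0, "sq_tail")` corresponds to step 3 for the management queue; any other queue identifier belongs to an I/O queue.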

[0025] Furthermore, the virtual machine NVMe driver interacts with VFIO PCIe.

[0026] Furthermore, the NVMeoF intermediate layer provides an NVMe PCIe device model for interaction with VFIO PCIe.

[0027] Furthermore, the NVMe PCIe device model simulates PCIe registers, base address registers, and interrupts.

[0028] Furthermore, step 3 emulates the complete register set of the NVMe PCIe device through the vfio-mdev interface in the kernel-space mdev mediated pass-through framework.

[0029] Furthermore, the NVMeoF intermediate layer receives commands and their associated storage addresses in place of the network card.

[0030] Furthermore, the NVMeoF intermediate layer communicates with the NVMe driver in place of the network card.

[0031] Furthermore, the NVMeoF intermediate layer performs the translation of NVMe commands into NVMeoF commands.

[0032] Furthermore, the discovery and connection commands for the virtual machine's remote storage node are executed by the cloud platform service provider.

[0033] In a preferred embodiment of the present invention, the NVMe Base Specification defines PCIe-based high-performance NVMe storage, and the NVMeoF Specification defines NVMeoF for remote high-performance NVMe storage. However, both focus on standalone systems and pay little attention to the virtualization characteristics required by cloud platforms. Several solutions have emerged to virtualize NVMeoF in cloud platforms, but each has drawbacks: software solutions perform poorly, solutions using pass-through network interface cards (NICs) are inconvenient to use and pose security risks, and smart NIC solutions are expensive. The present invention aims to solve the technical problem of how NVMe devices in virtual machines connect to and interact with NVMeoF in the host machine, defining a standard method for NVMeoF virtualization in cloud platforms.

[0034] Software solutions suffer from high latency and low performance, requiring customized modifications to the virtual machine NVMe driver. This invention, however, generates NVMeoF packets entirely in hardware. Data does not need to be copied to the memory allocated to the physical device or the operating system. The virtual machine NVMe driver is reused. NVMe and RDMA protocols support queue-level switching, allowing underlying data to remain unchanged. Network data can be directly copied to the buffer in the NVMe protocol stack by an RDMA-compatible network card via direct memory access. Buffer data in the NVMe protocol stack can also be directly copied to the network by an RDMA-compatible network card via direct memory access. RDMA can bypass the operating system to directly read or write the contents of the connection buffer. A virtual NVMe device is provided for the virtual machine NVMe driver, resulting in low latency and high performance. Cloud tenants can use it transparently without modifying the virtual machine NVMe driver.

[0035] The pass-through solution requires cloud tenants to execute discovery and connection commands against remote storage nodes. This can lead to configuration errors, user burden, and security risks. Cloud tenants may not understand the NVMe qualified name used within the storage node and may mistype it, causing connection establishment to fail. Furthermore, exposing the storage node address to cloud tenants is a security risk: the storage node address, and potentially neighboring addresses, become reachable by unauthorized access requests from cloud tenants. In this invention, by contrast, the cloud platform service provider executes the remote storage node discovery and connection commands. An NVMeoF middleware layer is introduced above the NVMeoF connection in the compute node host machine, providing a virtual NVMe device interface for the NVMe driver of the compute node virtual machine to interact with. The NVMeoF middleware layer encapsulates the NVMe commands submitted by the compute node virtual machine into NVMeoF commands for the NVMeoF connection. It also decapsulates the NVMeoF responses received on the NVMeoF connection into NVMe responses for the virtual machine's NVMe driver. This avoids the configuration errors, user burden, and security risks that may result from cloud tenants executing the commands themselves.

[0036] Smart network interface cards (NICs) with NVMe hardware interfaces are significantly more expensive than ordinary NICs or RDMA-compatible NICs, requiring the maintenance of an on-chip operating system, resulting in high maintenance costs. This invention eliminates the need for smart NICs with NVMe hardware interfaces and avoids the need for on-chip operating system maintenance. The virtual NVMe device interface is implemented in software by the NVMeoF middleware layer. The encapsulation and decapsulation between NVMe commands and responses and NVMeoF commands and responses are implemented in software by the NVMeoF middleware layer, without requiring an on-chip operating system. This leads to higher economic efficiency and easier maintenance.

[0037] Compared with the prior art, the present invention has the following obvious substantive features and significant advantages:

[0038] 1. This invention solves the technical problem of how NVMe devices in virtual machines can connect and interact with NVMeoF in the host machine, and defines a standard method for NVMeoF virtualization in cloud platforms.

[0039] 2. Compared to software solutions, this technical solution reuses the high-concurrency implementation in the virtual machine NVMe driver without modifying that driver, offering better transparency and higher performance. Compared to pass-through NIC solutions, this technical solution introduces an NVMeoF middleware layer that passes NVMe devices through to the virtual machine, and NVMeoF discovery and connection are performed by the cloud service provider. Cloud tenants can use the unmodified NVMe driver in the virtual machine to complete their requests without manually configuring NVMeoF discovery and connection, which is convenient and protects the cloud service provider's storage nodes from direct access by malicious cloud tenants. Compared to smart NIC solutions, the NVMeoF middleware layer introduced in this technical solution is implemented in software, so the NIC does not need an NVMe hardware interface. Furthermore, the NVMeoF middleware layer operates within the host operating system, so no on-chip operating system in a smart NIC is needed, resulting in lower cost and easier maintenance.

[0040] The following will further explain the concept, specific structure, and technical effects of the present invention in conjunction with the accompanying drawings, so as to fully understand the purpose, features, and effects of the present invention.

Brief Description of the Drawings

[0041] Figure 1 is an example of an NVMeoF command;

[0042] Figure 2 shows the pass-through solution;

[0043] Figure 3 is an example of a smart network interface card with an NVMe hardware interface;

[0044] Figure 4 is a schematic diagram of the overall design of a preferred embodiment of the present invention;

[0045] Figure 5 is an example management command timing diagram of a preferred embodiment of the present invention;

[0046] Figure 6 is an example I/O read command timing diagram of a preferred embodiment of the present invention.

Detailed Description

[0047] The following description, with reference to the accompanying drawings, illustrates several preferred embodiments of the present invention to make its technical content clearer and easier to understand. The present invention can be embodied in many different forms, and the scope of protection of the present invention is not limited to the embodiments mentioned herein.

[0048] In the accompanying drawings, components with the same structure are indicated by the same numerical designation, and components with similar structures or functions are indicated by similar numerical designations. The dimensions and thicknesses of each component shown in the drawings are arbitrary, and the present invention does not limit the dimensions and thicknesses of each component. To make the illustrations clearer, the thickness of some components has been appropriately exaggerated in the drawings.

[0049] The current NVMe Base Specification defines high-performance NVMe storage based on PCIe, while the NVMeoF Specification defines NVMeoF for remote high-performance NVMe storage. However, both focus on standalone systems and pay little attention to the virtualization features required by cloud platforms. Several solutions have emerged to virtualize NVMeoF in cloud platforms, but each has drawbacks: software solutions perform poorly, solutions using pass-through network interface cards (NICs) are inconvenient to use and pose security risks, and smart NIC solutions are expensive. This invention aims to solve the technical problem of how NVMe drivers in virtual machines can interact correctly with the NVMeoF connection on the host machine when the host machine has no NVMe device but only an NVMeoF connection. It defines a standard method for NVMeoF virtualization in cloud platforms.

[0050] Figure 4 shows one embodiment of the overall technical solution. The compute node communicates with the target device of the remote storage node through its network interface card (NIC) to complete NVMeoF transactions. The cloud administrator specifies the storage pool address and the NVMe qualified name within the storage pool, and uses NVMeoF commands to set up remote storage connections for the virtual machine. The virtual machine or application issues storage or memory access requests containing NVMe management or I/O commands through the NVMe driver, and the NVMe driver issues NVMe commands to the kernel-level PCIe interface via VFIO. In NVMeoF, the PCIe device on the compute node is the NIC, not an NVMe device. NVMe devices are centralized in the storage node, and a NIC without an NVMe hardware interface cannot process NVMe commands directly. The NVMeoF middleware layer therefore acts as an intermediary, receiving commands and their associated storage addresses in place of the NIC. When the virtual machine or application finishes adding an entry containing a command to the submission queue, it writes the queue's doorbell register, which triggers the NVMeoF middleware layer to intercept the entry. This invention allocates queues for virtual machines and passes their memory addresses through the management submission queue and management completion queue registers defined in Sections 3.1.9 and 3.1.10 of the NVMe Base Specification. The NVMeoF middleware layer communicates with the NVMe driver in place of the NIC: it receives NVMe commands containing management, read, or write operations from the NVMe driver and acts as an intermediary between the NVMe driver and RDMA.
By copying a new NVMe submission queue entry, as defined in Section 4.2 of the NVMe Base Specification, to the RDMA send queue, or by placing in the RDMA send queue a reference to the content that the submission entry points to, and then ringing the RDMA doorbell, the middleware layer triggers the RDMA-enabled NIC to transmit the NVMe command over the network to the storage node target device, in data packets that follow any available protocol, including Ethernet. For the response indicating that command execution has completed, the NVMeoF middleware layer either copies the NVMe response contained in the RDMA response from the RDMA receive queue to the NVMe completion queue, or places a reference to it there, for the virtual machine to access. The translation of NVMe commands to NVMeoF is performed by the NVMeoF middleware layer, not the NIC, so a smart NIC with an NVMe hardware interface is not required; an RDMA-enabled NIC is sufficient. The NVMeoF middleware layer operates in kernel space, a memory region reserved for the privileged operating system kernel, kernel extensions, and some device drivers. Performing the translation in kernel space avoids exposing storage addresses to cloud tenants, ensuring security. Furthermore, the storage node address and the NVMe qualified name within the storage node are specified by the cloud administrator, which prevents cloud tenants from specifying an incorrect storage address or NVMe qualified name. In contrast, user-space memory regions can be read or written by applications; in this embodiment they correspond to the memory regions that VFIO protects for the CPU. The cloud administrator uses Block Multi-Queue (blk-mq) NVMe to create virtual storage devices for the compute nodes. Block Multi-Queue uses the NVMeoF application interface to access the NVMeoF stack.
Block Multi-Queue NVMe provides an internal Linux kernel application interface between the block layer and the virtual storage devices that are created after a connection to the remote storage node target device is established. The NIC physical function enables RDMA for virtual machine access. The NVMeoF intermediate layer does not require a smart NIC with an NVMe hardware interface. The cloud administrator sends NVMeoF commands from the compute node to set up a remote storage connection for the virtual machine, specifying the storage node address and the NVMe qualified name within the storage node to complete discovery of and connection to the remote storage node target device; the command format is described in the background section. Using this remote storage connection, the cloud administrator creates a virtual NVMe device instance for the NVMeoF intermediate layer through the kernel-space mdev mediated pass-through framework running on the compute node. The hypervisor associates the created virtual NVMe device instance with the virtual machine. The virtual machine communicates with the virtual NVMe device instance through its driver to issue NVMe management or I/O commands that read data from or write data to the NVMe device of the remote storage node.
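
The submission queue entry that the intermediate layer copies to the RDMA send queue is a fixed 64-byte structure in the NVMe Base Specification. The sketch below packs such an entry and appends it to a list that stands in for the RDMA send queue. The helper names and the use of a Python list are assumptions for illustration; a real implementation would post the entry through an RDMA verbs interface.

```python
import struct

# Little-endian layout of a 64-byte submission queue entry:
# CDW0, NSID, CDW2-3, MPTR, DPTR (16 bytes), CDW10..CDW15.
SQE_FORMAT = "<IIQQ16s6I"
SQE_SIZE = struct.calcsize(SQE_FORMAT)  # 64 bytes


def pack_sqe(opcode: int, cid: int, nsid: int = 0, dptr: bytes = b"", cdw10: int = 0) -> bytes:
    """Pack a submission queue entry; unused fields are left as zero."""
    # CDW0 carries the opcode in bits 7:0 and the command identifier in bits 31:16.
    cdw0 = (opcode & 0xFF) | ((cid & 0xFFFF) << 16)
    return struct.pack(SQE_FORMAT, cdw0, nsid, 0, 0, dptr, cdw10, 0, 0, 0, 0, 0)


def copy_to_send_queue(sqe: bytes, send_queue: list) -> None:
    """Copy one entry into the (simulated) RDMA send queue."""
    if len(sqe) != SQE_SIZE:
        raise ValueError("a submission queue entry must be exactly 64 bytes")
    send_queue.append(bytes(sqe))


if __name__ == "__main__":
    queue = []
    # Opcode 0x06 is the Identify management command; CDW10 = 1 selects the controller.
    copy_to_send_queue(pack_sqe(opcode=0x06, cid=1, cdw10=1), queue)
    print(len(queue), len(queue[0]))
```

Because the entry is copied unchanged, the guest's unmodified NVMe driver and the remote target see the same command bytes, which is what makes the pass-through transparent.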

[0051] Figure 5 is an example sequence diagram of management command transmission in the overall technical solution:

[0052] 1. The virtual machine interacts with the NVMe driver to trigger the transmission of NVMe management commands.

[0053] 2. The virtual machine NVMe driver adds an entry containing the NVMe management command to the management submission queue.

[0054] 3. The virtual machine NVMe driver updates the corresponding submission queue tail doorbell, as defined in Section 3.1.24 of the NVMe Base Specification, which causes the virtual machine to exit to the NVMeoF intermediate layer.

[0055] 4. The NVMeoF intermediate layer copies the new management submission queue entry, as defined in Section 4.2 of the NVMe Base Specification, or a pointer to that entry, to the RDMA send queue and rings the RDMA doorbell.

[0056] 5. RDMA devices use the RDMA protocol to send commands to remote NVMeoF storage nodes via a network or other wired or wireless means. The remote NVMeoF storage node receives and processes these commands.

[0057] 6. The remote NVMeoF storage node sends a response to the RDMA device.

[0058] 7. RDMA devices copy responses directly to a response buffer in the completion queue accessible to the virtual machine NVMe driver via direct memory access.

[0059] 8. The RDMA device injects an interrupt into the virtual machine NVMe driver to notify the virtual machine NVMe driver that a response from the remote NVMeoF storage node is ready.

[0060] 9. The virtual machine NVMe driver checks the NVMe completion queue, and the virtual machine processes the responses in the completion queue.
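
The nine steps above can be traced with a small in-memory simulation. This is a sketch under simplifying assumptions: the RDMA transport and the remote storage node are collapsed into a single callable, commands are plain dictionaries, and the interrupt is a counter. All class and function names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class GuestAdminQueues:
    depth: int = 4
    sq: list = field(default_factory=list)    # management submission queue slots
    cq: deque = field(default_factory=deque)  # management completion queue
    sq_tail: int = 0                          # tail index maintained by the guest driver
    interrupts: int = 0                       # number of interrupts injected so far

    def __post_init__(self):
        self.sq = [None] * self.depth


class IntermediateLayer:
    def __init__(self, remote_handler):
        self.remote_handler = remote_handler  # stands in for RDMA plus the storage node
        self.consumed = 0                     # next submission slot not yet forwarded

    def on_tail_doorbell(self, guest: GuestAdminQueues, new_tail: int) -> None:
        """Runs when the guest's doorbell write exits to the host (step 3)."""
        while self.consumed != new_tail:
            command = guest.sq[self.consumed]        # step 4: take the new entry
            response = self.remote_handler(command)  # steps 5-6: send and receive
            guest.cq.append(response)                # step 7: place the response
            guest.interrupts += 1                    # step 8: notify the guest driver
            self.consumed = (self.consumed + 1) % guest.depth


def submit(guest: GuestAdminQueues, layer: IntermediateLayer, command: dict) -> None:
    """Steps 1-3: enqueue a command, advance the tail, ring the doorbell."""
    guest.sq[guest.sq_tail] = command
    guest.sq_tail = (guest.sq_tail + 1) % guest.depth
    layer.on_tail_doorbell(guest, guest.sq_tail)


if __name__ == "__main__":
    guest = GuestAdminQueues()
    layer = IntermediateLayer(lambda cmd: {"cid": cmd["cid"], "status": 0})
    submit(guest, layer, {"opcode": 0x06, "cid": 7})
    print(guest.cq.popleft(), guest.interrupts)  # step 9: the driver reads the completion
```

The guest-side code touches only its own queues and tail index, mirroring the point that the virtual machine NVMe driver needs no modification.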

[0061] For the overall technical solution to work, steps 3 and 4 must be carried out correctly. For step 3, this invention needs to emulate the PCIe registers, including the submission queue tail doorbell. This is because the compute node host machine has only one set of PCIe registers, including the submission queue tail doorbell. Assigning this set of PCIe registers directly to a virtual NVMe device would yield only one virtual NVMe device serving a single cloud tenant, and would fail to achieve the goal of sharing among multiple cloud tenants to improve utilization. The NVMe driver in the virtual machine interacts with VFIO PCIe, so this invention provides an NVMe PCIe device model in the NVMeoF intermediate layer to interact correctly with VFIO PCIe. Providing the NVMe PCIe device model requires emulating PCIe registers, base address registers, and interrupts. This invention emulates the complete register set of the NVMe PCIe device through the vfio-mdev interface in the kernel-space mdev mediated pass-through framework. However, in this technical solution the compute node itself has no NVMe PCIe device; its PCIe device is an RDMA device, while the NVMe PCIe device resides in the storage node. Emulating the NVMe device's PCIe registers from the RDMA device's PCIe registers is not feasible. Therefore, this invention obtains the state of the remote NVMe PCIe device registers through the RDMA device and maintains multiple sets of virtual NVMe PCIe registers in the compute node for the virtual machines using copy-on-write. With copy-on-write, read operations on a register can share the same register address; a copy is made only when a register is written, so that multiple register addresses are maintained.
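
A minimal sketch of this copy-on-write register handling follows, assuming registers are modeled as a name-to-value mapping. The class name and the register name `sq_tail_doorbell` are hypothetical; the patent does not prescribe a data structure.

```python
class CowRegisterFile:
    """Per-virtual-device view of a shared register set, copied only on write."""

    def __init__(self, shared: dict):
        self._shared = shared  # register state obtained from the remote device; never written here
        self._private = {}     # private copies, created on first write

    def read(self, name: str) -> int:
        # Reads fall through to the shared register until this device has written it.
        if name in self._private:
            return self._private[name]
        return self._shared[name]

    def write(self, name: str, value: int) -> None:
        if name not in self._shared:
            raise KeyError(name)
        # The first write allocates a private copy; the shared register is untouched.
        self._private[name] = value

    def private_count(self) -> int:
        return len(self._private)


if __name__ == "__main__":
    physical = {"sq_tail_doorbell": 3}
    dev_a, dev_b = CowRegisterFile(physical), CowRegisterFile(physical)
    dev_a.write("sq_tail_doorbell", 4)
    print(dev_a.read("sq_tail_doorbell"), dev_b.read("sq_tail_doorbell"))
```

Only the device that writes pays for a private copy, so a device that never writes a register consumes no extra memory for it.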
For example, suppose the submission queue tail doorbell registers of virtual devices A and B and the submission queue tail doorbell register of the storage node physical device C all hold the value 3. Reads of the submission queue tail doorbell registers of virtual devices A and B can then be served directly by reading the submission queue tail doorbell register of physical device C. Writes are different: if virtual device A sets its submission queue tail doorbell register to 4, writing directly to the register of physical device C would also change the value seen by virtual device B, causing errors in the virtual machine NVMe driver that uses virtual device B. Therefore, this invention allocates a memory space, copies the submission queue tail doorbell register of virtual device A into it, and then writes the value 4 there, so that the register of virtual device B is unaffected. Subsequent operations on the submission queue tail doorbell register of virtual device A use this newly allocated memory space. The effect of this approach is that NVMe PCIe virtual devices are provided to the virtual machine. The virtual machine can use these devices directly with the unmodified native NVMe driver, ensuring transparency, and can take advantage of the high-concurrency implementation in the native NVMe driver for good performance. Moreover, registers are copied only on write, which saves memory. Similarly, this invention maintains attributes of each virtual device, such as its starting offset and size. The starting offset is the offset of the virtual device's starting address within the actual physical device of the storage node, and the size is the virtual device's capacity.
This invention partitions the actual physical device of the storage node tightly: the starting offset of the first virtual device is 0, and from the second virtual device onward, the starting offset of each virtual device is the sum of the starting offset and the size of the previous virtual device. This ensures both 100% resource utilization and that the virtual devices work correctly without interfering with one another. For step 4, this invention also needs to emulate the management queues, that is, the management submission queue and the management completion queue. This is because the compute node host machine has only one pair of management queues. Assigning this pair directly to a virtual NVMe device would lead to the same problem as with the PCIe registers. To address this, this invention maintains a pair of virtual management queues in the compute node for each virtual NVMe device. Before copying a submission queue entry to RDMA, the NVMeoF middleware layer adds a unique identifier to the entry to identify the virtual management queue it came from. After the storage node finishes execution and sends back a response, the NVMeoF middleware layer uses this identifier to determine which virtual management queue submitted the command, and then copies the response to the corresponding virtual management completion queue in memory.
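
The tight partitioning of the physical device described in this paragraph is a running sum over the virtual device sizes. The sketch below computes each virtual device's starting offset; the function name and the choice of abstract units (for example, blocks) are illustrative assumptions.

```python
from itertools import accumulate


def tight_partition(sizes: list[int], capacity: int) -> list[tuple[int, int]]:
    """Return (starting_offset, size) for each virtual device, packed back to back."""
    if any(size <= 0 for size in sizes):
        raise ValueError("every virtual device needs a positive size")
    if sum(sizes) > capacity:
        raise ValueError("virtual devices exceed the physical device capacity")
    # The first offset is 0; each later offset is the previous offset plus the previous size.
    offsets = [0, *accumulate(sizes)][:-1]
    return list(zip(offsets, sizes))


if __name__ == "__main__":
    # Three virtual devices that exactly fill a physical device of 175 units.
    print(tight_partition([100, 50, 25], capacity=175))
```

Because each range starts exactly where the previous one ends, the ranges never overlap, and utilization reaches 100% when the sizes sum to the physical capacity.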

[0062] Once these two steps are implemented in the NVMeoF middleware layer, the I/O queues can be partitioned in the same way as the virtual device attributes. Since the compute node host and the storage node have multiple pairs of I/O queues, this invention maintains the starting offset and the number of I/O queues for each virtual device. The starting offset is the offset of the virtual device's I/O queues within the I/O queues of the actual physical device in the storage node, and the number is the count of I/O queues assigned to the virtual device. This invention partitions the I/O queues of the actual physical device in the storage node tightly: the starting offset of the first virtual device is 0, and from the second virtual device onward, the starting offset of each virtual device's I/O queues is the sum of the starting offset and the number of I/O queues of the previous virtual device. This ensures both 100% resource utilization and that each virtual device works correctly without interference.

Figure 6 is an example timing diagram of the overall technical solution transmitting an NVMe read command from the NVMe submission queue to the RDMA-enabled network interface card and receiving the RDMA write command response; that is, transmitting an I/O read command:

[0063] 1. The virtual machine interacts with the NVMe driver to trigger the transmission of an NVMe I/O read command.

[0064] 2. The virtual machine NVMe driver adds an entry containing the NVMe I/O read command to the I/O submission queue.

[0065] 3. The virtual machine NVMe driver updates the corresponding I/O submission queue tail doorbell, defined in Section 3.1.24 of the NVMe Base Specification, which triggers the virtual machine to exit to the NVMeoF intermediate layer.

[0066] 4. The NVMeoF intermediate layer copies the new I/O submission queue entry, defined in Section 4.2 of the NVMe Base Specification, to the RDMA send queue and rings the RDMA doorbell.

[0067] 5. The RDMA device uses RDMA to send the I/O read command to the remote NVMeoF storage node via a network or other wired or wireless means. The remote NVMeoF storage node receives this command and processes it accordingly.

[0068] 6. For a valid read operation, the remote NVMeoF storage node sends a response to the RDMA device. The RDMA device receives an RDMA write command with data from the remote NVMeoF storage node.

[0069] 7. RDMA devices directly copy data associated with RDMA writes to a memory region accessible to the virtual machine via direct memory access.

[0070] 8. The remote NVMeoF storage node sends a response to the RDMA device, indicating that all data has been written via the RDMA write command.

[0071] 9. The NVMeoF intermediate layer copies the response to the NVMe completion queue.

[0072] 10. The RDMA device injects an interrupt into the virtual machine NVMe driver to notify the virtual machine NVMe driver that a response from the remote NVMeoF storage node is ready.

[0073] 11. The virtual machine NVMe driver checks the NVMe completion queue, and the virtual machine processes the responses in the completion queue.
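The eleven steps above can be traced with a small single-process simulation. This is only a sketch under strong assumptions: there is no real RDMA, VM exit, or interrupt here; `StorageNodeStub`, `Middleware`, and `guest_read` are hypothetical names; queues are plain in-memory deques; and the interrupt of step 10, which the text attributes to the RDMA device, is modeled as a simple counter.

```python
from collections import deque


class StorageNodeStub:
    """Stands in for the remote NVMeoF storage node (steps 5 to 8)."""

    def __init__(self, blocks):
        self.blocks = blocks  # assumed layout: logical block address -> data

    def handle_read(self, cmd, guest_memory):
        # Steps 6-7: the data arrives as an RDMA write, modeled here as a
        # direct store into memory that the virtual machine can access.
        guest_memory[cmd["buf"]] = self.blocks[cmd["lba"]]
        # Step 8: the response reports that all data has been written.
        return {"cid": cmd["cid"], "status": 0}


class Middleware:
    """Stands in for the NVMeoF intermediate layer (steps 3, 4, 9 and 10)."""

    def __init__(self, target, guest_memory):
        self.target = target
        self.guest_memory = guest_memory
        self.sq = deque()    # guest-visible I/O submission queue
        self.cq = deque()    # guest-visible I/O completion queue
        self.interrupts = 0  # number of interrupts injected into the guest

    def ring_sq_tail_doorbell(self):
        # Step 3: the doorbell write traps into the intermediate layer.
        while self.sq:
            cmd = self.sq.popleft()  # step 4: copy the entry toward RDMA
            rsp = self.target.handle_read(cmd, self.guest_memory)  # steps 5-8
            self.cq.append(rsp)      # step 9: copy the response to the CQ
            self.interrupts += 1     # step 10: notify the guest driver


def guest_read(mw, lba, buf, cid=1):
    """Guest driver side: steps 1-3 submit the read, step 11 reaps it."""
    mw.sq.append({"opcode": "read", "cid": cid, "lba": lba, "buf": buf})
    mw.ring_sq_tail_doorbell()
    rsp = mw.cq.popleft()
    if rsp["cid"] != cid or rsp["status"] != 0:
        raise RuntimeError("read failed")
    return mw.guest_memory[buf]
```

The point of the sketch is the division of labor: the guest only touches its own submission queue, doorbell, and completion queue, exactly as it would with a local NVMe device, while everything involving the remote storage node happens inside the intermediate layer.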

[0074] Similarly, before copying a submission queue entry to the RDMA send queue, the NVMeoF middleware layer adds a unique identifier to the entry to identify the virtual I/O queue it came from. After the storage node completes execution and responds, the NVMeoF middleware layer uses this identifier to determine which virtual I/O queue submitted the command, and then copies the response to the corresponding virtual I/O completion queue in memory.
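The identifier-based routing described above, which applies to both the virtual admin queues and the virtual I/O queues, might look like the following sketch. The class `VirtualQueueRouter` and the field name `"vq"` are hypothetical; how the identifier is actually encoded in the submission queue entry is not stated in the text, so carrying it as an extra dictionary field is an assumption made purely for illustration.

```python
import itertools
from collections import deque


class VirtualQueueRouter:
    """Tags outgoing entries per virtual queue and routes responses back."""

    def __init__(self):
        self._ids = itertools.count()
        self._cq_by_id = {}  # identifier -> virtual completion queue

    def register(self):
        """Create a virtual queue pair and return its unique identifier."""
        qid = next(self._ids)
        self._cq_by_id[qid] = deque()
        return qid

    def tag(self, qid, entry):
        """Return a tagged copy of the entry, ready for the RDMA send queue."""
        if qid not in self._cq_by_id:
            raise KeyError(f"unknown virtual queue {qid}")
        return {**entry, "vq": qid}  # the guest's own entry is left untouched

    def route(self, response):
        """Copy a response into the completion queue of the submitting queue."""
        cq = self._cq_by_id[response["vq"]]
        # Strip the identifier so the guest sees an ordinary completion.
        cq.append({k: v for k, v in response.items() if k != "vq"})

    def completions(self, qid):
        """Snapshot of the responses waiting in one virtual completion queue."""
        return list(self._cq_by_id[qid])
```

Because the identifier travels with the command and comes back with the response, two virtual machines can use the same command identifier at the same time, and responses can arrive in any order, without a completion being delivered to the wrong virtual queue.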

[0075] Compared to existing software-based solutions, this technical solution reuses the high-concurrency implementation in the virtual machine NVMe driver without modifying that driver, offering better transparency and higher performance. Compared to pass-through NIC solutions, this technical solution introduces an NVMeoF middleware layer that transparently passes NVMe devices through to the virtual machine. NVMeoF discovery and connection are performed by the cloud service provider, so cloud tenants can use the unmodified NVMe driver in the virtual machine to complete all requests without manually configuring NVMeoF discovery and connection. This is convenient to use and prevents malicious cloud tenants from directly accessing the cloud service provider's storage nodes. Compared to smart NIC solutions, the NVMeoF middleware layer introduced in this technical solution is implemented in software, so the NIC does not need an NVMe hardware interface. Furthermore, the NVMeoF middleware layer runs in the host operating system rather than in an on-chip operating system on a smart NIC, which lowers cost and simplifies maintenance.

[0076] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.

Claims

1. A virtualization method for NVMeoF computing nodes based on intermediate pass-through, characterized in that it includes the following steps:
Step 1: The virtual machine interacts with the NVMe driver to trigger the transmission of NVMe admin (management) commands;
Step 2: The virtual machine NVMe driver adds an entry containing the NVMe admin command to the admin submission queue;
Step 3: The virtual machine NVMe driver updates the submission queue tail doorbell, which triggers the virtual machine to exit to the NVMeoF intermediate layer located in the host kernel space. The NVMeoF intermediate layer simulates an independent NVMe PCIe virtual device register set for each virtual machine through the vfio-mdev interface in the mdev intermediate pass-through architecture, and maintains multiple sets of virtual NVMe PCIe registers for each virtual machine through copy-on-write technology: a read of a register shares the value of the physical register of the storage node, while for a write the register is copied to an independent memory space and then modified;
Step 4: The NVMeoF intermediate layer copies the new admin submission queue entry, or a pointer to the entry, to the RDMA send queue and rings the RDMA doorbell; before copying, the NVMeoF intermediate layer adds a unique identifier to the entry to distinguish different virtual admin queues;
Step 5: The RDMA device uses the RDMA protocol to send the command to the remote NVMeoF storage node;
Step 6: The remote NVMeoF storage node sends a response to the RDMA device;
Step 7: The RDMA device copies the response, via direct memory access, directly to the response buffer in the completion queue accessible to the virtual machine NVMe driver; the NVMeoF intermediate layer routes the response to the corresponding virtual completion queue based on the unique identifier;
Step 8: The RDMA device injects an interrupt into the virtual machine NVMe driver;
Step 9: The virtual machine NVMe driver checks the NVMe completion queue, and the virtual machine processes the responses in the completion queue;
wherein the discovery and connection commands for the remote NVMeoF storage nodes are executed by the cloud platform service provider, rather than by the cloud tenant within the virtual machine.

2. The NVMeoF compute node virtualization method based on intermediate pass-through as described in claim 1, characterized in that Step 3 completes the simulation of the PCIe registers, including the submission queue tail doorbell.

3. The NVMeoF compute node virtualization method based on intermediate pass-through as described in claim 1, characterized in that the virtual machine NVMe driver interacts with VFIO PCIe.

4. The NVMeoF compute node virtualization method based on intermediate pass-through as described in claim 1, characterized in that the NVMeoF intermediate layer provides an NVMe PCIe device model for interaction with VFIO PCIe.

5. The NVMeoF compute node virtualization method based on intermediate pass-through as described in claim 4, characterized in that the NVMe PCIe device model simulates PCIe registers, base address registers, and interrupts.

6. The NVMeoF compute node virtualization method based on intermediate pass-through as described in claim 1, characterized in that the NVMeoF intermediate layer replaces the network card in receiving commands and storing addresses.

7. The NVMeoF compute node virtualization method based on intermediate pass-through as described in claim 1, characterized in that the NVMeoF intermediate layer replaces the network card and NVMe driver for communication.

8. The NVMeoF compute node virtualization method based on intermediate pass-through as described in claim 1, characterized in that the NVMeoF intermediate layer completes the translation of NVMe commands into NVMeoF commands.