A method and apparatus for performing FPGA tasks in a confidential computing architecture
By introducing a transfer task module and a granule protection table (GPT) into the Arm confidential computing architecture, an isolated execution environment for stub tasks and real tasks is created, solving the security problem of FPGA accelerators in the Arm confidential computing architecture and realizing confidential computing protection for FPGA tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTHERN UNIVERSITY OF SCIENCE AND TECHNOLOGY
- Filing Date
- 2024-06-21
- Publication Date
- 2026-06-16

Figure CN118779100B_ABST
Abstract
Description
Technical Field
[0001] One or more embodiments of this specification relate to confidential computing architectures, and more particularly to a method and apparatus for performing FPGA tasks in a confidential computing architecture.
Background Technology
[0002] As computing technology develops across industries and cloud and end-user traffic grows, people store vast amounts of data on a wide range of computing devices, and concerns about device and data security grow with them. To address these concerns, architecture vendors have proposed their own solutions, such as Arm's TrustZone technology, AMD's Secure Encrypted Virtualization (SEV) technology, and Intel's Software Guard Extensions (SGX) technology. These solutions provide users with a secure and trusted execution environment for confidentially storing and processing data, protecting it from untrusted kernels and conventional applications. Taking Arm's TrustZone technology as an example, it treats the runtime environment of the traditional kernel and applications as a non-secure world and creates an isolated secure world, defining a security layer with the highest privileges for world switching. The non-secure world cannot directly access the secure world; access to specific resources requires firmware verification through the security layer.
[0003] While the Arm confidential computing architecture effectively protects user data, it still has shortcomings, one of which is that it does not support confidential computing on dedicated accelerators such as FPGAs. Accelerating tasks with FPGAs under this architecture therefore poses a significant security challenge, and improvement is needed.
Summary of the Invention
[0004] This specification describes one or more embodiments of a method and apparatus for executing FPGA tasks in a confidential computing architecture, which can provide a confidential computing environment for the execution of FPGA tasks based on the hardware characteristics of existing confidential computing architectures and support confidential computing on FPGAs.
[0005] According to a first aspect, a method for performing FPGA tasks in a confidential computing architecture is provided, the confidential computing architecture comprising a secure world, a domain world, a non-secure world, and a root world; the method comprising:
[0006] FPGA software in the non-secure world configures a stub data structure for the FPGA task in a non-secure world segment of memory based on the FPGA's data requirements, the stub data structure including a data buffer.
[0007] A root monitor in the root world configures a real data structure corresponding to the stub data structure in a first segment of the memory corresponding to a first domain. The real data structure includes a domain buffer corresponding to the data buffer and a transfer descriptor. The domain buffer is used to store confidential data to be processed. The transfer descriptor is used to describe the data to be transferred via direct memory access (DMA).
[0008] The root monitor sets a granule protection table (GPT), which includes a first GPT table for the CPU and a second GPT table for the FPGA, such that in the first GPT table, the FPGA memory-mapped I/O (MMIO) region belongs to the root world, and in the second GPT table, only the first segment is an accessible segment.
[0009] The root monitor interacts with the FPGA via FPGA MMIO, enabling the FPGA to read the confidential data via DMA transfer based on the transfer descriptor and execute FPGA tasks.
[0010] According to a second aspect, a root monitor is provided in a confidential computing architecture, the confidential computing architecture including a secure world, a domain world, a non-secure world, and a root world; the root monitor is located in the root world and includes a transfer task module and an FPGA protection module, wherein:
[0011] The transfer task module is configured to, in response to FPGA software in the non-secure world configuring a stub data structure for the FPGA task in a non-secure world segment of memory based on the FPGA's data requirements, configure a real data structure corresponding to the stub data structure in a first segment of memory corresponding to a first domain. The stub data structure includes a data buffer; the real data structure includes a domain buffer corresponding to the data buffer and a transfer descriptor; the domain buffer stores confidential data to be processed; and the transfer descriptor describes the data to be transferred via direct memory access (DMA).
[0012] The FPGA protection module is configured to set a granule protection table (GPT), which includes a first GPT table for the CPU and a second GPT table for the FPGA. In the first GPT table, the FPGA memory-mapped I/O (MMIO) region belongs to the root world, and in the second GPT table, only the first segment is an accessible segment. The module interacts with the FPGA via the FPGA MMIO, enabling the FPGA to read the confidential data via DMA transfer based on the transfer descriptor and execute FPGA tasks.
[0013] According to a third aspect, a computing device is provided, including a memory and a plurality of processors, the computing device forming a confidential computing architecture, the confidential computing architecture including a secure world, a domain world, a non-secure world, and a root world; the root world including a root monitor as described in the second aspect.
[0014] In the embodiments provided in this specification, confidential FPGA computation compatible with the Arm Confidential Compute Architecture (CCA) is implemented through transfer tasks. According to the transfer task mechanism, stub tasks without real data are created by the FPGA software in the non-secure world and scheduled and managed as usual. After a stub task is submitted, the root monitor creates a data structure for the real FPGA task, containing real data, in the domain world segment. This includes a domain buffer for storing confidential data and a transfer descriptor describing the data to be transferred via DMA. The root monitor also provides an isolated execution environment for the real FPGA task by configuring the GPT tables. This allows the FPGA to read confidential data via DMA transfer based on the transfer descriptor and execute the FPGA task. Thus, confidential FPGA computation is implemented in the Arm CCA.
Brief Description of the Drawings
[0015] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0016] Figure 1 is a schematic diagram of the Arm confidential computing architecture;
[0017] Figure 2 illustrates access control of the physical address space by the worlds of a confidential computing architecture;
[0018] Figure 3 is a schematic diagram of FPGA task execution in a confidential computing architecture according to one embodiment;
[0019] Figure 4 illustrates a method for performing FPGA tasks in a confidential computing architecture according to one embodiment;
[0020] Figure 5 is a schematic diagram of a plurality of transfer descriptors according to one embodiment;
[0021] Figure 6 shows the GPT tables maintained by the root monitor in an example scenario.
Detailed Implementation
[0022] The solution provided in this specification will now be described with reference to the accompanying drawings.
[0023] To ensure data security, ARM provides TrustZone technology. In this technology, the traditional kernel and application runtime environment is treated as a non-secure world, while an isolated secure world is created outside of it, and a security layer with the highest privileges is defined for world switching.
[0024] Specifically, in the Armv8-A architecture, CPU cores define four exception levels, EL0 to EL3, based on privilege. EL0 is the application level, EL1 is used for the operating system kernel, EL2 for the hypervisor, and EL3 for the secure monitor. These four levels also represent the privilege levels of the runtime environment. In TrustZone technology, the CPU security state is divided into a non-secure (Normal) state and a secure state. EL0 and EL1 can operate in either state; for example, an untrusted operating system (OS) can run at non-secure EL1, while a trusted OS can run at secure EL1. EL2 can also be used in the secure state. EL3, the secure monitor, always resides in the secure world and is used to switch between security states.
[0025] In this architecture, the insecure world cannot directly access the secure world; access to specific resources requires verification by a security layer monitor. Sensitive or confidential data, as well as high-privilege software applications, run in the secure world, thus providing a Trusted Execution Environment (TEE) for this confidential data.
[0026] Building upon the TrustZone infrastructure, Arm more recently released the Arm Confidential Compute Architecture (CCA). CCA is part of the Armv9-A architecture and introduces the Realm Management Extension (RME) to the existing TrustZone architecture. RME adds a Realm world and a Root world to the existing non-secure and secure worlds of TrustZone, and provides the hardware-layer support needed to isolate these worlds from one another.
[0027] Figure 1 illustrates the Arm confidential computing architecture. As shown in Figure 1, in the Arm Confidential Compute Architecture (CCA), the runtime environment is divided into four worlds: the secure world, the non-secure (Normal) world, the domain world (the Realm world in Arm terminology), and the root world. The root world runs the root monitor, which has the highest privileges and is responsible for isolation and communication between worlds. The domain world provides a protected confidential computing environment for virtual machines, called a confidential domain. The Realm Management Monitor (RMM) runs in the domain world and is responsible for managing the execution of domain virtual machines and their interaction with the non-secure world. Users can place virtual machines into a confidential domain as domain virtual machines to isolate them from unauthorized access by external software. Specifically, users can create virtual machines through the virtual machine manager in the non-secure world and transfer them to the domain world through the RMM, making them domain virtual machines. The RMM is responsible for security checks and protection related to the confidential domain. Domain virtual machines are isolated from each other using virtualization technology, and the RMM manages the accessible address spaces of the different domain virtual machines. Domain virtual machines do not need to trust the non-secure world or the secure world; they only need to trust the RMM and the root monitor.
[0028] Correspondingly, Arm CCA also divides the physical address space (PAS) of memory among the four worlds. Figure 2 illustrates how the security state of each world controls access to the physical address space. As shown in Figure 2, the root world has the highest access privileges and can access the address spaces of all four worlds. The non-secure world has the lowest access privileges and can access only the non-secure address space. The secure world and the domain world can each access the non-secure address space as well as their own address space.
[0029] In the Arm confidential computing architecture, address space access control across different worlds is achieved by constructing a Granule Protection Table (GPT) and performing Granule Protection Checks (GPCs) based on this GPT. Specifically, the confidential computing architecture (CCA) maintains the GPT in memory, recording the security state of each segment of physical memory at a fine granularity. Typically, the granularity of these records is the memory page (4 KB). Thus, the GPT records the security state and access permissions of each memory page. When memory pages are migrated or reassigned between worlds, the corresponding GPT entries can be updated dynamically.
[0030] When the processor accesses memory, the aforementioned RME component in the hardware layer performs a granule protection check (GPC). During the check, the current security state of the CPU is obtained, and the security state of the requested memory page is retrieved by reading the GPT; the two are then checked for a match. If the GPC fails (e.g., if a host OS in the non-secure world requests access to memory in the domain world), a granule protection fault is raised, rejecting the memory access and ensuring isolation between worlds. Through this isolation mechanism, the Arm confidential computing architecture provides an isolated confidential computing environment for the domain virtual machines within the domain world.
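The check described in the two preceding paragraphs can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the RME hardware behaviour: the names `World`, `GranuleProtectionTable`, and `granule_protection_check` are hypothetical, pages default to the non-secure world, and the access rules are taken from the Figure 2 description above.

```python
from enum import Enum

PAGE_SIZE = 4096  # 4 KB granule, as described in the text


class World(Enum):
    NON_SECURE = "non-secure"
    SECURE = "secure"
    REALM = "realm"  # the "domain world" of this specification
    ROOT = "root"


# Address spaces each world may access, following the Figure 2 description.
ALLOWED = {
    World.ROOT: {World.NON_SECURE, World.SECURE, World.REALM, World.ROOT},
    World.SECURE: {World.NON_SECURE, World.SECURE},
    World.REALM: {World.NON_SECURE, World.REALM},
    World.NON_SECURE: {World.NON_SECURE},
}


class GranuleProtectionFault(Exception):
    """Raised when a requester may not access the target page."""


class GranuleProtectionTable:
    """Hypothetical in-memory GPT: one entry per 4 KB page."""

    def __init__(self, default=World.NON_SECURE):
        self._default = default  # assumption: unassigned pages are non-secure
        self._entries = {}       # page index -> World

    def assign(self, addr, length, world):
        """Mark every page overlapping [addr, addr + length) as `world`."""
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self._entries[page] = world

    def lookup(self, addr):
        return self._entries.get(addr // PAGE_SIZE, self._default)


def granule_protection_check(gpt, requester, addr):
    """Return the page's world if access is allowed, else raise a fault."""
    target = gpt.lookup(addr)
    if target not in ALLOWED[requester]:
        raise GranuleProtectionFault(
            f"{requester.value} access to {target.value} page at {addr:#x}"
        )
    return target
```

Reassigning a page between worlds corresponds to calling `assign` again on the same range, which mirrors the dynamic GPT updates mentioned above.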
[0031] On the other hand, many tasks require accelerated execution via dedicated accelerators, with FPGAs being one type of hardware for this purpose. FPGA (Field Programmable Gate Array) is a programmable hardware circuit structure that can be used to construct circuit logic and execute specific computational instructions using languages such as Verilog. FPGAs have demonstrated their acceleration performance in areas such as neural network acceleration and industrial control, and are increasingly gaining attention from cloud providers, leading to expanded applications.
[0032] However, existing Arm computing frameworks struggle to provide effective confidentiality protection for FPGA computing tasks. This is partly because Arm computing frameworks treat FPGAs as untrusted, ordinary peripherals. Although FPGAs have their own on-chip memory, they still need to interact with host memory. During this interaction, the FPGA shares host memory with the CPU and numerous untrusted peripherals, making it vulnerable to attacks. Furthermore, according to the workflow of most current ARM devices, FPGA task execution and scheduling are managed by FPGA software (such as FPGA drivers and related programming libraries). This FPGA software exists in an insecure environment, making it susceptible to attacks.
[0033] Specifically, FPGA software manages the FPGA's computing environment and interacts with the FPGA hardware. In a typical workflow, to prepare the execution environment, the FPGA software allocates physical memory according to task requirements, loads the data to be processed into host memory, and uses the FPGA's memory-mapped I/O (MMIO) to direct the FPGA hardware to access the data in host memory via direct memory access (DMA) and execute the FPGA task.
[0034] Suppose a powerful adversary controls the entire software stack in both the insecure and secure worlds, including the FPGA software, the untrusted operating system, the Hypervisor virtual machine manager, and the software at the same level in the secure world. This adversary wants to spy on or even tamper with confidential data in FPGA tasks, including task input data, intermediate data, or execution results. The adversary could potentially access host memory to read the confidential data stored therein, or control peripherals capable of DMA to read the aforementioned memory data, thereby launching an attack.
[0035] Several confidentiality protection schemes have been designed for GPU accelerators. However, because the hardware designs differ, GPU and FPGA accelerators have different workflows, so confidential computing schemes for GPUs cannot be migrated directly to FPGA-accelerated systems. A recent study on Arm CCA proposed a confidential accelerated computing system compatible with both GPUs and FPGAs. While this system ensures FPGA computing security, it does not fit the system design of Arm CCA well: it relies on a relatively large accelerator software stack to manage accelerated computing, which significantly enlarges the trusted computing base of the confidential virtual machine and requires substantial modifications to the FPGA software stack. Furthermore, some researchers have used non-Arm secure hardware, or traditional Arm secure hardware (such as the StrongBox system, which relies on Arm TrustZone, and the Cronus system, which relies on Arm secure virtualization), to build confidential accelerated computing systems. However, such systems are not directly compatible with Arm CCA-based system designs, and traditional Arm secure hardware is no longer fully trusted within Arm CCA, which reduces system security. In addition, some researchers have proposed defense mechanisms built inside FPGA accelerators, such as encryption engines and remote attestation modules inside the FPGA. However, such designs require modifications to the accelerator hardware, which reduces hardware compatibility.
[0036] Therefore, overall, providing confidential computation protection for task execution on FPGA accelerators for the Arm CCA architecture remains a challenge.
[0037] In view of this, an embodiment of this specification proposes a solution that, based on the hardware characteristics of the Arm Confidential Computing Architecture (CCA), provides security protection for FPGA computing tasks without affecting the original functional design of the Arm CCA architecture, thereby supporting confidential computing on the FPGA.
[0038] Figure 3 illustrates an FPGA task running in a confidential computing architecture according to one embodiment. The system architecture shown in Figure 3 conforms to the Arm Confidential Compute Architecture (CCA). A host runs in the non-secure world and contains the FPGA driver (e.g., XDMA) and other peripheral drivers. The hypervisor creates and manages several confidential computing domains (realms). In the domain (Realm) world newly introduced by CCA, a Realm Management Monitor (RMM) is deployed to achieve memory isolation between different domains. The root world deploys a root monitor with the highest privileges, which manages isolation and switching between worlds and provides security mechanisms such as key management and remote attestation. The root monitor can be implemented as secure firmware.
[0039] In the CCA architecture described above, the Realm Management Monitor (RMM) in the domain world and the root monitor in the root world are considered fully trusted. This is because these components have a small memory footprint and code base, so they expose a small attack surface and are less vulnerable to attack. All other components, including the software in the secure world, are considered untrusted.
[0040] In the embodiments of this specification, to achieve confidential FPGA computation, several additional components are introduced into the root monitor in the root world: a transfer task module, an FPGA protection module, and a domain security module. The transfer task module creates, in domain memory, the real task corresponding to each transfer task that the FPGA software creates in non-secure memory without real data. The FPGA protection module protects the FPGA operating environment from attacks. The domain security module is used for domain construction, confidential data transmission, and FPGA hardware authentication. Furthermore, in this embodiment, the FPGA software (including the FPGA driver and related function libraries) still runs on the host side in the non-secure world, but requires minor modifications to assist in completing the transfer tasks.
[0041] Transfer tasks are a mechanism introduced to make FPGA workflows compatible with the Arm confidential computing architecture. The core idea is to allow FPGA software on the host side to create and manage, in the non-secure world, transfer tasks (also known as stub tasks) that stand in for the FPGA tasks of the domains. Managing these tasks includes, for example, allocating memory, creating buffers, scheduling, and submitting tasks. For example, the stub tasks in Figure 3 correspond to FPGA task 1 in domain 1 and FPGA task 2 in domain 2. Their data structures are similar to those of ordinary FPGA tasks, for example including the required data buffers. However, the data buffers of these stub tasks do not contain the actual data to be processed; at most, they hold descriptions of these data buffers. Correspondingly, the data structures of the real FPGA tasks are constructed in the corresponding domains. FPGA software can submit these stub tasks as usual. However, unlike the usual flow, upon submission the root monitor in the root world replaces the stub tasks with the real FPGA tasks in the corresponding domains. Real FPGA tasks have data structures similar to those of stub tasks, but they are filled with the actual confidential data to be processed. Furthermore, based on the execution characteristics of the FPGA, the data descriptors required for FPGA direct memory access (DMA) are created and maintained in the corresponding domains (the transfer descriptors in the domains in Figure 3). Thus, under the control of the root monitor, the FPGA reads the confidential data stored in the corresponding domain based on the transfer descriptors and executes the computational logic within the FPGA. During this process, the FPGA protection module provides the corresponding GPT tables to ensure the isolation of the execution environment. Therefore, the solution in this embodiment allows the non-secure world to schedule and manage FPGA tasks from different domains without accessing the actual confidential data, which is consistent with the design philosophy of the Arm confidential computing architecture.
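The GPT-based isolation mentioned above is stated in the first aspect as two views: in the CPU's table the FPGA MMIO region belongs to the root world, and in the FPGA's table only the first domain's segment is accessible. The following Python sketch models just those two rules. It is a simplified illustration under assumptions: `Region`, `build_gpt_views`, `cpu_may_access`, and `fpga_dma_allowed` are hypothetical names, worlds are plain strings, and real GPT entries are per-page rather than per-region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int

    def contains(self, addr, length=1):
        """True if [addr, addr + length) lies entirely inside the region."""
        return self.base <= addr and addr + length <= self.base + self.size


def build_gpt_views(fpga_mmio, first_segment):
    """Return (cpu_view, fpga_view) as the root monitor would set them."""
    cpu_view = {"root_only": [fpga_mmio]}        # first GPT table (CPU)
    fpga_view = {"accessible": [first_segment]}  # second GPT table (FPGA)
    return cpu_view, fpga_view


def cpu_may_access(cpu_view, world, addr):
    """Only the root world may touch the FPGA MMIO region.

    Addresses outside the MMIO region are governed by ordinary GPT
    entries that this sketch does not model, so they are not rejected here.
    """
    if any(r.contains(addr) for r in cpu_view["root_only"]):
        return world == "root"
    return True


def fpga_dma_allowed(fpga_view, addr, length):
    """The FPGA may DMA only within the first domain's segment."""
    return any(r.contains(addr, length) for r in fpga_view["accessible"])
```

Under this model, a compromised driver in the non-secure world can neither program the FPGA through MMIO nor steer FPGA DMA toward memory outside the first segment.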
[0042] The following section describes, using a single FPGA task as an example, the process of scheduling and executing FPGA computing tasks and providing them with a confidential and isolated environment through a task transfer mechanism.
[0043] Figure 4 illustrates a method for performing FPGA tasks in a confidential computing architecture according to one embodiment. It can be understood that the method of Figure 4 is executed on the confidential computing architecture shown in Figure 3.
[0044] First, during the initialization or preparation phase, a user can request a domain for the FPGA and transmit the real data to be processed by the FPGA task into it via an encrypted channel. The domain can be created by the domain security module shown in Figure 3. Specifically, upon user request, the domain security module creates a virtual machine via the hypervisor, interacts with the Realm Management Monitor (RMM), and deploys the virtual machine into the domain world as a confidential domain. For ease of description (and to distinguish it from other domains when necessary), the domain requested by the user is referred to below as the first domain. After the first domain is created, the user can negotiate a key with it to establish a secure channel. Specifically, the user can exchange keys with the first domain using the Diffie-Hellman (DH) protocol, an elliptic-curve DH protocol, or various other protocols to negotiate an encryption key. Based on the negotiated key, a secure encrypted channel can then be established. Through this channel, the first domain can receive confidential data transmitted by the user and store it within the domain world.
[0045] Unlike GPUs, which instruct computational logic through task code, FPGA execution logic is embedded in the hardware circuitry by the designer through programming. In other words, once designed, an FPGA has fixed execution logic. Typically, a completed FPGA design includes a corresponding description file, which records the FPGA's functional logic, exposed interfaces, and the corresponding input / output data. This description file allows the determination of the FPGA's task data requirements.
[0046] In one embodiment, after creating the first domain, a description file corresponding to the FPGA is recorded in the first domain. When an FPGA task needs to be executed, the first domain can provide data requirement information to the FPGA software on the host side in the insecure world based on the description file, including the target data to be transmitted, data length, data type, etc. To prevent the FPGA software from tampering with the data requirement after an attack, in one embodiment, the first domain also provides signature information, that is, signs the data requirement information and appends the signature to the data requirement information. In other embodiments, the FPGA software can also obtain the data requirement information of the FPGA task through other means, such as being provided to the FPGA software by the user, or reading the description file from a designated storage location according to user instructions to obtain the data requirement information.
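The sign-then-verify flow for the data requirement information can be sketched as follows. The specification does not name a signature scheme, so this Python example substitutes HMAC-SHA256 from the standard library as a stand-in for a real digital signature; the shared key, the JSON encoding, and the field names (`buffers`, `name`, `length`, `dtype`) are all assumptions made for illustration.

```python
import hashlib
import hmac
import json


def _encode(requirements):
    # Canonical JSON so that signer and verifier hash identical bytes.
    return json.dumps(requirements, sort_keys=True, separators=(",", ":")).encode()


def sign_requirements(requirements, key):
    """First domain: append an authentication tag to the requirement info."""
    tag = hmac.new(key, _encode(requirements), hashlib.sha256).hexdigest()
    return {"requirements": requirements, "signature": tag}


def verify_requirements(signed, key):
    """Verifier: recompute the tag and compare in constant time."""
    expected = hmac.new(key, _encode(signed["requirements"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])
```

With an asymmetric scheme the non-secure FPGA software would hold no signing secret at all; with the HMAC stand-in shown here, the key must stay out of the non-secure world for the check to be meaningful.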
[0047] Based on the initialization phase described above, the first domain receives confidential / real data provided by the user and stores it in the protected domain world. The host receives the data requirement information for the FPGA task and stores it in the unprotected area corresponding to the insecure world.
[0048] On this basis, the FPGA software on the host side in the non-secure world can create transfer tasks based on the data requirement information. As mentioned earlier, the FPGA software mainly includes FPGA driver software, such as the Xilinx XDMA driver, and related function libraries, such as user-level runtime libraries. This FPGA software is modified to create stub tasks in the non-secure world according to the transfer task mechanism.
[0049] Specifically, as shown in step S41 of Figure 4, the FPGA software in the non-secure world configures the stub data structure of the FPGA task, including the data buffer, in the non-secure world segment of memory according to the FPGA's data requirement information.
[0050] Specifically, the FPGA software can create a transfer task and configure the stub data structure accordingly. As mentioned earlier, the data requirement information indicates the data that the FPGA task needs to process, including the number and size of the data buffers and the data with which they are to be filled. Based on this data requirement information, the FPGA software can allocate corresponding memory space in the non-secure world segment of memory and create several stub data buffers that meet the requirements of the data requirement information.
[0051] In some embodiments, the data requirement information may instruct the creation of multiple data buffers, for example, one for storing input data and another for storing execution results. Optionally, the data requirement information may also instruct the creation of data buffers for storing intermediate results. The FPGA software allocates these data buffers as stub data buffers according to the data requirement information. However, unlike conventional processing, the FPGA software only creates the stub data structure based on the data requirement information and does not populate the stub data buffers with actual data.
[0052] Since these stub data buffers do not need to store actual data, in one embodiment they can be created according only to the number of buffers and the data types given in the data requirement information, rather than according to the size of the required data. Furthermore, the created stub data buffers can store only data description information.
[0053] It's important to note that FPGA hardware interacts with host memory via DMA during task execution. In this process, the FPGA typically uses descriptors for DMA operations. A descriptor is a data structure that contains information required for a DMA transfer, such as source address, destination address, and transfer size. Therefore, when creating a stub task, the FPGA software usually creates or loads descriptors in the aforementioned non-secure section based on the driver file. However, as mentioned above, the stub data structure does not store actual data. Therefore, the descriptor loaded here is only used for alignment and conformity to the FPGA software's regular processing flow; it does not reflect the actual data storage information. Thus, the descriptor loaded in this case can also be called a stub descriptor.
[0054] Optionally, some FPGAs may require the creation of page tables needed to execute the task, i.e., FPGA page tables. In such cases, the FPGA software also generates FPGA page tables for executing the transfer task based on the allocated memory; these can be called stub page tables. This page table records the mapping between virtual addresses and physical memory addresses during FPGA task execution. Initially, the FPGA memory-mapped input / output (MMIO) is configured to point to this stub page table. Specifically, through the FPGA memory-mapped input / output, the register in the FPGA hardware that stores the base address of the page table is mapped to the memory address storing the stub page table, thus pointing to the stub page table.
[0055] Thus, the FPGA software creates the transfer task and configures it with a stub data structure. In a specific example, within a memory segment of the insecure world, the FPGA software allocates data buffer 1 and data buffer 2, where data buffers 1 and 2 can store only the corresponding data descriptions, without storing the actual data. Furthermore, the FPGA software loads stub descriptors and generates a stub page table.
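The stub data structure described above can be sketched as a small Python model. This is a minimal illustration under assumptions, not the modified driver: `BufferRequirement`, `StubBuffer`, `StubTask`, and `create_stub_task` are hypothetical names, and the placeholder descriptor fields are invented for the example. The point it shows is that each stub buffer carries only a description and never a payload.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BufferRequirement:
    """One entry of the data requirement information (assumed fields)."""
    name: str
    length: int
    dtype: str


@dataclass
class StubBuffer:
    """Stub data buffer: holds a description only, never the real data."""
    description: BufferRequirement


@dataclass
class StubTask:
    buffers: list
    stub_descriptors: list                                # placeholders for the driver's usual flow
    stub_page_table: dict = field(default_factory=dict)   # optional, empty in this sketch


def create_stub_task(requirements):
    """Non-secure FPGA software: build the empty-shell task."""
    buffers = [StubBuffer(description=req) for req in requirements]
    # Stub descriptors keep the regular processing flow intact but do not
    # point at real data, so the address fields are left as zero.
    stub_descriptors = [{"src": 0, "dst": 0, "len": 0} for _ in buffers]
    return StubTask(buffers=buffers, stub_descriptors=stub_descriptors)
```

For the example in the text, `create_stub_task` would be called with two requirements (data buffer 1 and data buffer 2), and the resulting task can be queued and scheduled like any other.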
[0056] As can be seen, the process of creating a transfer task is similar to that of creating a regular task, except that no real data is populated into its data buffers. Therefore, the created transfer task is an "empty shell" task without real data, but it has the same data structure as a real task and can be managed and scheduled.
[0057] Therefore, after creating the aforementioned transfer task, the FPGA software inserts it into the FPGA task queue as usual, arranges the execution order of the tasks, and submits the transfer task to the FPGA hardware via the root monitor. Specifically, when submitting the task, the FPGA software can issue the SMC (Secure Monitor Call) instruction, which is trapped by the root world firmware. After the call, the CPU enters the root world to execute code.
[0058] Once the root monitor traps the above instruction, it creates the real task in the domain world, that is, it executes step S42 of Figure 4. In this step, the root monitor configures the real data structure, i.e., the domain data structure, corresponding to the stub data structure in the first segment of memory corresponding to the first domain. The domain data structure includes a domain buffer corresponding to the data buffer and a newly created transfer descriptor. The domain buffer is used to store the actual confidential data, and the transfer descriptor is used to describe the storage information of the actual data to be transferred via DMA. The root monitor stores the confidential data to be processed in the aforementioned domain buffer, or temporarily leaves it empty (e.g., in the case of a domain buffer used to store results).
[0059] Specifically, the root monitor can copy the data requirement information and verify its correctness through signature verification. After verifying its correctness, it can create the data structure of the actual FPGA task in the first segment corresponding to the first domain, based on the data requirement information. This includes a domain buffer corresponding to the data buffer in the stub data structure. The root monitor stores the actual confidential data in the corresponding domain buffer according to the data requirement information. Furthermore, the root monitor reconstructs the descriptor, called the transport descriptor, based on the storage information of the actual data in the domain buffer. In one embodiment, the transport descriptor constructed in the first domain includes at least the following information: the starting address (Src_adr), destination address (Dst_adr), and length (Len) of the data to be transmitted.
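The verify-then-create order described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (`DataRequirement`), the byte encoding, and the key are hypothetical, and an HMAC from the Python standard library stands in for the signature scheme, which the specification does not name.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical key; an HMAC is used here only as a stand-in for the
# (unspecified) signature scheme of the first domain.
DOMAIN_KEY = b"example-domain-key"


@dataclass(frozen=True)
class DataRequirement:
    name: str       # illustrative label, e.g. "input" or "result"
    length: int     # number of bytes the FPGA task needs for this buffer
    direction: str  # "H2C" (host to card) or "C2H" (card to host)


def encode(reqs):
    """Canonical byte encoding of the requirement list (illustrative format)."""
    return ";".join(f"{r.name}:{r.length}:{r.direction}" for r in reqs).encode()


def sign(reqs, key=DOMAIN_KEY):
    """Produce the tag that accompanies the data requirement information."""
    return hmac.new(key, encode(reqs), hashlib.sha256).digest()


def create_domain_buffers(reqs, tag, key=DOMAIN_KEY):
    """Verify the tag first, then allocate one zeroed domain buffer per requirement."""
    if not hmac.compare_digest(sign(reqs, key), tag):
        raise ValueError("data requirement information failed verification")
    return {r.name: bytearray(r.length) for r in reqs}


reqs = [DataRequirement("input", 0x3000, "H2C"),
        DataRequirement("result", 0x2000, "C2H")]
tag = sign(reqs)
buffers = create_domain_buffers(reqs, tag)
```

Because the buffers are sized only after verification succeeds, a tampered requirement list is rejected before any memory in the first segment is allocated.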
[0060] Furthermore, as those skilled in the art know, FPGA hardware can establish a data channel with the host to enable Direct Memory Access (DMA). Different FPGA hardware can establish such a data channel through different specific implementations. In terms of data transmission direction, the data channel can be divided into C2H channels (Card to Host, a channel from the FPGA to the host) and H2C channels (Host to Card, a channel from the host to the FPGA). Accordingly, in one embodiment, the transfer descriptor created in the first domain may further include data channel information (a channel ID) corresponding to the domain buffer.
[0061] In one embodiment, the memory required to store the data description information is small (i.e., the buffer in the stub data structure can be small), while the actual memory required for a single DMA request may be large, making it impossible to import all data into the FPGA in a single transfer. Therefore, in one embodiment, the root monitor calculates the total number of transfers required based on the total data requirement length in the data requirement information and the data length that can be imported into the FPGA in a single transfer. It then constructs multiple transfer descriptors in the first domain segment and connects them in a linked list. Specifically, the transfer descriptor includes an Nxt_adr attribute field to store the starting address of the next transfer descriptor. Optionally, it may also include an Nxt_adj attribute field to store the remaining number of transfers. When constructing multiple transfer descriptors, the root monitor writes the starting address of the next transfer descriptor into the Nxt_adr attribute of the current transfer descriptor and fills the remaining number of transfers into the Nxt_adj attribute, thus forming a linked list structure.
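The calculation of the number of transfers and the linking of descriptors can be sketched as follows. The field names follow the description above (Src_adr, Dst_adr, Len, Nxt_adr, Nxt_adj, channel ID); the descriptor size `DESC_SIZE`, the contiguous placement of descriptors starting at `desc_base`, and the example addresses are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List

DESC_SIZE = 0x20  # assumed size of one descriptor in memory


@dataclass
class TransferDescriptor:
    src_adr: int  # starting address of the data to transfer
    dst_adr: int  # destination address
    length: int   # bytes moved by this descriptor
    nxt_adr: int  # address of the next descriptor, 0 for the last one
    nxt_adj: int  # number of transfers remaining after this one
    channel: str  # "H2C" or "C2H"


def build_chain(src: int, dst: int, total_len: int, max_len: int,
                desc_base: int, channel: str) -> List[TransferDescriptor]:
    """Split one DMA request into a linked list of descriptors."""
    if total_len <= 0 or max_len <= 0:
        raise ValueError("lengths must be positive")
    count = -(-total_len // max_len)  # ceiling division: total transfers needed
    chain = []
    for i in range(count):
        offset = i * max_len
        is_last = i == count - 1
        chain.append(TransferDescriptor(
            src_adr=src + offset,
            dst_adr=dst + offset,
            length=min(max_len, total_len - offset),
            nxt_adr=0 if is_last else desc_base + (i + 1) * DESC_SIZE,
            nxt_adj=count - 1 - i,
            channel=channel,
        ))
    return chain


# 0x3000 bytes with at most 0x1000 bytes per transfer gives three descriptors.
chain = build_chain(src=0x10000, dst=0x0, total_len=0x3000,
                    max_len=0x1000, desc_base=0x8000, channel="H2C")
```

When the total length is not a multiple of the per-transfer limit, the last descriptor simply carries the remainder.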
[0062] Figure 5 is a schematic diagram illustrating multiple transfer descriptors according to one embodiment. As shown in Figure 5, assume that two buffers (i.e., domain buffers) are created in the first domain segment, where buf1 is used to store input data with a size of 0x3000, and buf2 is used to store output data with a size of 0x2000. The FPGA also has on-chip hardware memory M.
[0063] In one example, to execute an FPGA task, two DMA requests are required. The first request uses the H2C data channel to load data from buf1 into the FPGA. The second request uses the C2H data channel to store data from the FPGA's on-chip memory into buf2.
[0064] Assuming the data in buf1 is too long to be imported into the FPGA in a single transfer, three descriptors, desc0-1, desc0-2, and desc0-3, are prepared for the H2C transmission, as shown in Figure 5.
[0065] Specifically, in desc0-1, the starting address src_adr points to address A of buf1, the destination address dst_adr points to address B of the FPGA hardware memory M, the data length size can be 0x1000, and the nxt_adr field points to the address of desc0-2.
[0066] In desc0-2, the starting address src_adr points to address A+0x1000 of buf1, the destination address dst_adr points to address B+0x1000 of the FPGA hardware memory M, the data length size is 0x1000, and the nxt_adr field points to the address of desc0-3.
[0067] In desc0-3, the starting address src_adr points to address A+0x2000 of buf1, the destination address dst_adr points to address B+0x2000 of the FPGA hardware memory M, the data length size is 0x1000, and the nxt_adr field is set to 0, indicating that there is no subsequent transmission.
[0068] Thus, by using the three serially connected descriptors desc0-1, desc0-2, and desc0-3, data in buf1 can be loaded into the FPGA for a single H2C DMA transfer request.
[0069] Similarly, for C2H transmissions, two descriptors, desc1-1 and desc1-2, can be prepared.
[0070] In desc1-1, the starting address src_adr points to address C of the FPGA hardware memory M, the destination address dst_adr points to address D of buf2, the data length size is 0x1000, and the nxt_adr field is set to the address of desc1-2.
[0071] In desc1-2, the starting address src_adr points to address C+0x1000 of the FPGA hardware memory M, the destination address dst_adr points to address D+0x1000 of buf2, the data length size is 0x1000, and the nxt_adr field is set to 0, indicating that there is no subsequent transmission.
[0072] Thus, by using two concatenated descriptors, desc1-1 and desc1-2, data from the FPGA can be stored in buf2 for a single C2H DMA transfer request.
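The Figure 5 example can be traced with a small simulation in which a software loop plays the role of the DMA engine and follows the nxt_adr links. The concrete values chosen for A, B, C, and D, the descriptor addresses, and the use of byte arrays as host and on-chip memory are assumptions; no FPGA computation is modeled, so the C2H transfer simply reads back bytes that the H2C transfer loaded.

```python
def run_chain(descs, head, src_mem, dst_mem):
    """Follow nxt_adr links from `head`, copying `size` bytes per descriptor."""
    addr, moved = head, 0
    while addr != 0:
        d = descs[addr]
        src, dst, size = d["src_adr"], d["dst_adr"], d["size"]
        dst_mem[dst:dst + size] = src_mem[src:src + size]
        moved += size
        addr = d["nxt_adr"]
    return moved


# Assumed layout: buf1 at A (0x3000 bytes) and buf2 at D (0x2000 bytes) in host memory.
A, D = 0x0000, 0x4000
B = C = 0x0000  # offsets inside FPGA on-chip memory M
host = bytearray(0x6000)
fpga_m = bytearray(0x4000)
host[A:A + 0x3000] = bytes(i % 251 for i in range(0x3000))

# Descriptor tables keyed by the (hypothetical) address of each descriptor.
h2c = {
    0x100: {"src_adr": A,          "dst_adr": B,          "size": 0x1000, "nxt_adr": 0x120},  # desc0-1
    0x120: {"src_adr": A + 0x1000, "dst_adr": B + 0x1000, "size": 0x1000, "nxt_adr": 0x140},  # desc0-2
    0x140: {"src_adr": A + 0x2000, "dst_adr": B + 0x2000, "size": 0x1000, "nxt_adr": 0},      # desc0-3
}
c2h = {
    0x200: {"src_adr": C,          "dst_adr": D,          "size": 0x1000, "nxt_adr": 0x220},  # desc1-1
    0x220: {"src_adr": C + 0x1000, "dst_adr": D + 0x1000, "size": 0x1000, "nxt_adr": 0},      # desc1-2
}

loaded = run_chain(h2c, 0x100, host, fpga_m)   # one H2C request, three descriptors
stored = run_chain(c2h, 0x200, fpga_m, host)   # one C2H request, two descriptors
```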
[0073] In addition, it's noted that some FPGA accelerated computing tasks require the transmission of multiple data types, necessitating frequent calls to the SMC instruction. However, the FPGA typically waits for all data to be transmitted before starting computation. Therefore, in one embodiment, multiple data requests for the same FPGA can be batch-recorded and then merged and sent to the root world's transmission task module. In other words, after creating multiple stub tasks / stub data structures for the multiple data requests for the same FPGA, the SMC instruction is called uniformly. This allows the root monitor to construct the actual data structure and transmission descriptor uniformly for these multiple data requests, thereby reducing the number of SMC calls and world switches, and lowering the performance burden.
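The effect of batching can be illustrated with a toy model that only counts SMC calls. The class and function names are hypothetical, and the real root monitor's work per call (building real data structures and descriptors) is reduced to a placeholder.

```python
class RootMonitorModel:
    """Toy stand-in for the root monitor that counts SMC entries."""

    def __init__(self):
        self.smc_calls = 0
        self.real_tasks = []

    def smc(self, stub_tasks):
        # One SMC call handles every stub task passed in this batch.
        self.smc_calls += 1
        self.real_tasks.extend(f"real:{name}" for name in stub_tasks)


def submit_individually(monitor, stubs):
    """One SMC call per data request."""
    for stub in stubs:
        monitor.smc([stub])


def submit_batched(monitor, stubs):
    """Record all requests for the same FPGA first, then issue a single SMC call."""
    monitor.smc(list(stubs))


stubs = ["weights", "inputs", "results"]
one_by_one, batched = RootMonitorModel(), RootMonitorModel()
submit_individually(one_by_one, stubs)
submit_batched(batched, stubs)
```

Both monitors end up with the same set of real tasks; only the number of calls into the root world differs.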
[0074] Additionally, as mentioned earlier, in some embodiments, the FPGA requires the creation of page tables to perform computational tasks. For this purpose, the FPGA software generates stub page tables in non-secure memory areas. Correspondingly, the root monitor needs to create the actual FPGA page table based on the stub page table and store it in the first domain area. To do this, the root monitor can first validate the page table entries recorded in the stub page table, for example, checking for duplicate or illegal mappings. If the validation passes, the root monitor constructs the actual FPGA page table by copying or replaying the page table entries. It is important to note that since the data buffer in the stub data structure does not store actual data and does not participate in actual FPGA computation, the entries related to the data buffer in the actual FPGA page table are modified to point to the domain buffer in the actual data structure.
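The validate-then-rewrite step can be sketched as follows. The page table is modeled as a flat list of (virtual page, physical page) pairs, and what counts as an "illegal" mapping (here, any physical page that is neither a stub buffer page nor in an explicit allow list) is an assumption; real FPGA page table formats differ.

```python
def build_real_page_table(stub_entries, stub_to_domain, legal_pages):
    """Validate stub page table entries, then redirect buffer pages to the domain buffer.

    stub_entries:   list of (virtual_page, physical_page) pairs from the stub page table
    stub_to_domain: maps each stub data buffer page to its domain buffer page
    legal_pages:    other physical pages the task is allowed to map (assumed policy)
    """
    seen_virtual = set()
    for vpage, ppage in stub_entries:
        if vpage in seen_virtual:
            raise ValueError(f"duplicate mapping for virtual page {vpage:#x}")
        if ppage not in stub_to_domain and ppage not in legal_pages:
            raise ValueError(f"illegal mapping to physical page {ppage:#x}")
        seen_virtual.add(vpage)
    # Replay the entries; stub buffer pages are replaced by domain buffer pages.
    return {vpage: stub_to_domain.get(ppage, ppage) for vpage, ppage in stub_entries}


stub_entries = [(0x10, 0x100), (0x11, 0x101), (0x20, 0x300)]
stub_to_domain = {0x100: 0x900, 0x101: 0x901}
real_pt = build_real_page_table(stub_entries, stub_to_domain, legal_pages={0x300})
```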
[0075] Continuing the previous example, during the real task creation phase, real data buffer 1 and data buffer 2, corresponding to the two data buffers in the stub task, are created within the protected domain segment. Real confidential data is stored in data buffer 1, while data buffer 2 is temporarily left empty to store result data. Additionally, the root monitor generates a real FPGA page table, which is stored within this first domain segment.
[0076] When a real FPGA task needs to be executed, the FPGA protection module in the root monitor provides a protected execution environment for the real FPGA task, isolating this execution environment from other software (including untrusted software stacks and other "domains"). Specifically, the memory interaction between the FPGA and the host mainly includes two aspects: (1) DMA data interaction, and (2) FPGA MMIO instruction interaction (i.e., issuing task execution operations by writing to FPGA registers). For both aspects, the root monitor isolates the FPGA runtime environment through the memory protection mechanism based on the Granule Protection Check (GPC) provided in the Arm confidential computing architecture.
[0077] Therefore, according to the GPC mechanism, the root monitor executes step S43 to set up a first GPT table for the CPU and a second GPT table for the FPGA, such that in the first GPT table, the FPGA memory-mapped MMIO belongs to the root world, and in the second GPT table, only the first segment is an accessible segment.
[0078] As previously mentioned, the Confidential Computing Architecture (CCA) maintains a Granule Protection Table (GPT) in memory, which records the security status of each segment of physical memory at a fine-grained level for GPC checks, thereby implementing memory isolation. According to the scheme in this embodiment, the root monitor can maintain multiple versions of the GPT table, allowing different objects to have different access permissions.
[0079] Specifically, the root monitor maintains at least a first GPT table and a second GPT table. The first GPT table is used when the CPU and other peripherals access memory. In the first GPT table, the first segment belongs to the first domain and has domain world permissions. Applications from the non-secure world and from other domains have no right to access this first segment, thereby achieving domain isolation. Furthermore, to protect the FPGA memory-mapped MMIO segment and prevent untrusted software stacks from accessing it, this segment can be set to belong to the root world in the first GPT table. Given the permissions of the different worlds shown in Figure 2, applications that request memory access through the CPU or other peripherals, including software in the secure world, cannot access either the first segment or the FPGA memory-mapped MMIO segment.
[0080] The second GPT table is the GPT table for the FPGA of the first domain. This GPT table can be generated and initialized when the first domain is created, with only the first segment corresponding to the first domain set as an accessible segment. When the first domain performs confidential FPGA computation, the root monitor writes the address of the second GPT table to the SMMU (memory access control hardware that supports GPC) controlling the FPGA, causing the SMMU to perform GPC checks according to the second GPT table and thereby control memory access from the FPGA.
[0081] Furthermore, in one embodiment, to simplify the division of permission states, in each domain's FPGA GPT table the memory of that domain can be marked as "insecure" and all other memory can be marked as "root," so that the FPGA hardware can access only the memory of its corresponding domain. That is, for the FPGA of the first domain, the corresponding second GPT table is set so that the first segment of the first domain is in the "insecure" state and can be accessed, while all other segments are in the "root" state and cannot be accessed. In this way, the FPGA's GPT table requires only two states ("insecure" and "root," representing the areas the FPGA can and cannot access, respectively) and does not need to identify memory in other states (such as the secure and domain states), which simplifies the FPGA GPT configuration process and reduces performance overhead.
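The two-state FPGA GPT table can be sketched as a list with one entry per memory granule. The 4 KB granule size, the flat list representation, and the example addresses are assumptions for illustration; they are not the in-memory GPT format defined by the architecture.

```python
GRANULE = 0x1000  # assumed granule size of 4 KB


def build_fpga_gpt(total_granules, domain_start, domain_len):
    """Mark the domain's segment "insecure" (accessible) and everything else "root"."""
    first = domain_start // GRANULE
    last = (domain_start + domain_len - 1) // GRANULE
    return ["insecure" if first <= g <= last else "root"
            for g in range(total_granules)]


def fpga_access_allowed(gpt, addr):
    """Model of the GPC check applied to a DMA access issued by the FPGA."""
    granule = addr // GRANULE
    return 0 <= granule < len(gpt) and gpt[granule] == "insecure"


# The first domain owns 0x4000..0x6FFF in a 16-granule physical memory.
gpt_r1 = build_fpga_gpt(total_granules=16, domain_start=0x4000, domain_len=0x3000)
```

Only two distinct states appear in the resulting table, which is the simplification the paragraph above describes.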
[0082] Figure 6 shows the GPT tables maintained by the root monitor in an example scenario. As shown in Figure 6, in this example scenario, the domain world includes at least domain R1 and domain R2. Assume that users of the two domains respectively request the execution of FPGA task 1 based on domain R1 and the execution of FPGA task 2 based on domain R2, corresponding to Figure 3. The FPGA task described in connection with Figure 4 as being executed in the first domain is assumed to correspond to FPGA task 1 executed in domain R1.
[0083] To provide isolated execution environments for FPGA task 1 and FPGA task 2 respectively, the root monitor must maintain at least the four GPT tables shown in Figure 6.
[0084] In the GPT table for the CPU, the memory segments corresponding to domains R1 and R2 still belong to the domain world segments as usual. Furthermore, the FPGA memory-mapped MMIO segments are set to belong to the root world.
[0085] The GPT table for untrusted peripherals is generally similar to that for the CPU, with identical settings for domains R1 and R2. The difference lies in the memory access restrictions: peripherals have corresponding memory access limitations. Certain memory segments accessible to the CPU (such as the small segment at the very beginning of the diagram) are set to belong to the root world in the peripheral's GPT table, and the peripheral has no access rights to them.
[0086] The FPGA GPT table for Domain 1 is the GPT table applicable when the FPGA executes FPGA Task 1 corresponding to Domain 1, i.e., the aforementioned second GPT table. In this table, Domain R1 is set to belong to the non-secure world and is accessible; all other segments are set to belong to the root world and are inaccessible. This means that when the FPGA executes FPGA Task 1 corresponding to Domain 1, it can only access the memory data of Domain R1 and has no right to access data in any other segment.
[0087] The FPGA GPT table for Domain 2 is the GPT table applicable when the FPGA executes FPGA Task 2 corresponding to Domain 2. In this table, Domain R2 is set to belong to the non-secure world and is accessible; all other segments are set to belong to the root world and are inaccessible. This means that when the FPGA executes FPGA Task 2 corresponding to Domain 2, it can only access the memory data of Domain R2 and has no right to access data in any other segment.
[0088] When hardware (CPU, FPGA, or peripheral) requests memory access, the RME in the hardware layer performs a GPC check based on the applicable GPT table to control memory access.
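The selection of the applicable GPT table among the four tables of Figure 6 can be modeled as follows. The segment names, the dictionary representation, and the rule that an access passes only when its state equals the state recorded for the segment are simplifying assumptions; the actual check is performed in hardware.

```python
# Segment names are hypothetical labels for the regions drawn in Figure 6.
CPU_GPT = {"low": "insecure", "normal": "insecure",
           "R1": "domain", "R2": "domain", "fpga_mmio": "root"}

# Same as the CPU table, except that the small low segment is closed to peripherals.
PERIPHERAL_GPT = {**CPU_GPT, "low": "root"}

# One two-state table per domain for the FPGA.
FPGA_GPT = {
    dom: {seg: ("insecure" if seg == dom else "root") for seg in CPU_GPT}
    for dom in ("R1", "R2")
}


def select_gpt(requester, active_domain=None):
    """Pick the GPT table that applies to the hardware issuing the access."""
    if requester == "cpu":
        return CPU_GPT
    if requester == "peripheral":
        return PERIPHERAL_GPT
    if requester == "fpga":
        return FPGA_GPT[active_domain]
    raise ValueError(f"unknown requester: {requester}")


def gpc_check(requester, segment, access_state, active_domain=None):
    """Simplified check: the access state must match the segment's recorded state."""
    return select_gpt(requester, active_domain)[segment] == access_state
```

For example, `gpc_check("fpga", "R2", "insecure", active_domain="R1")` evaluates to `False` in this model, reflecting that the FPGA cannot reach domain R2 while it runs FPGA task 1.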
[0089] As can be seen from the two FPGA GPT tables for Domain 1 and Domain 2, memory isolation is also implemented between FPGA tasks in different domains to ensure the safety of the execution environment. It is understandable that if more FPGA tasks need to be executed based on more domains, more GPT tables need to be maintained.
[0090] With the above settings, step S44 can be executed, whereby the root monitor interacts with the FPGA through FPGA memory mapping MMIO, enabling the FPGA to read the confidential data based on the descriptor and execute FPGA tasks.
[0091] Specifically, under the memory access control of GPC based on the aforementioned GPT table, the root world will replace the original software stack to interact with the FPGA MMIO, executing FPGA transfer tasks. This involves providing the source address of the transfer task (including the address of the descriptor) to specific registers, writing control instructions, and realizing data transmission and result reception. Specifically, when an FPGA page table is required, the base address of the FPGA page table is first pointed to the actual FPGA page table stored in the first domain. Thus, the stub page table is replaced with the actual FPGA page table, and memory access is performed based on the actual FPGA page table. Some FPGA tasks can also be executed without an FPGA page table. In this case, the root monitor interacts with the FPGA via the FPGA MMIO, enabling the FPGA to read confidential data based on the descriptor and execute the FPGA task.
[0092] To ensure the confidentiality of data during FPGA execution, after the FPGA task has been executed as described above, the FPGA is instructed to completely clear the most recently used memory in its on-chip memory before the isolation protection of the execution environment is removed (for example, before the permissions of the FPGA MMIO are restored to normal).
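The ordering constraint, scrub first and only then remove protection, can be sketched with a mock device. The register names `DESC_ADDR` and `CTRL`, the `MockFpga` class, and the choice of "insecure" as the restored MMIO state are hypothetical; a direct write into the mock's on-chip memory stands in for the DMA transfer and computation.

```python
class MockFpga:
    """Stand-in for the FPGA: a register file plus on-chip memory."""

    def __init__(self, size):
        self.onchip = bytearray(size)
        self.regs = {}

    def write_reg(self, name, value):
        self.regs[name] = value

    def scrub(self):
        # Overwrite the whole on-chip memory with zeros.
        self.onchip[:] = bytes(len(self.onchip))


def run_confidential_task(fpga, cpu_gpt, desc_addr, payload):
    """Protect MMIO, run the task, scrub on-chip memory, then restore MMIO permissions."""
    log = []
    cpu_gpt["fpga_mmio"] = "root"           # untrusted software can no longer reach MMIO
    log.append("protect_mmio")
    try:
        if len(payload) > len(fpga.onchip):
            raise ValueError("payload does not fit in on-chip memory")
        fpga.write_reg("DESC_ADDR", desc_addr)  # hypothetical register: first descriptor
        fpga.write_reg("CTRL", 1)               # hypothetical register: start bit
        log.append("start")
        fpga.onchip[:len(payload)] = payload    # stands in for DMA and computation
        log.append("done")
    finally:
        fpga.scrub()                        # always clear before lifting isolation
        log.append("scrub")
        cpu_gpt["fpga_mmio"] = "insecure"   # assumed "normal" state after the task
        log.append("restore_mmio")
    return log


fpga = MockFpga(0x100)
cpu_gpt = {"fpga_mmio": "insecure"}
log = run_confidential_task(fpga, cpu_gpt, desc_addr=0x8000, payload=b"secret")
```

Placing the scrub in the `finally` block means the on-chip memory is cleared even if the task fails partway through.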
[0093] To recap the process above, confidential FPGA computation compatible with the Arm Confidential Computing Architecture (CCA) is achieved through transfer tasks. According to the transfer task mechanism, the FPGA software in the insecure world creates stub tasks that do not contain real data and schedules and manages these tasks as usual. After the stub task is submitted, the root monitor creates a real FPGA task containing real data, generates a real descriptor, and provides a protected execution environment by setting the GPT table. Then, the root monitor interacts with the FPGA through the protected FPGA MMIO, enabling the FPGA hardware to read data from the descriptor and execute the real FPGA task within the protected execution environment. Thus, confidential FPGA computation is implemented within the Arm Confidential Computing Architecture (CCA).
[0094] On the other hand, corresponding to the above-described method and process, embodiments of this specification also disclose a root monitor in a confidential computing architecture, the confidential computing architecture including a secure world, a domain world, an insecure world, and a root world; the root monitor is located in the root world. The root monitor may include a transport task module and an FPGA protection module.
[0095] The transfer task module is configured such that, in response to the insecure world FPGA software, it configures a stub data structure for the FPGA task in the insecure world segment of memory based on the FPGA's data requirements. Meanwhile, in a first segment of memory corresponding to a first domain, it configures a real data structure corresponding to the stub data structure. The stub data structure includes a data buffer, and the real data structure includes a domain buffer corresponding to the data buffer, and a transfer descriptor. The domain buffer stores confidential data to be processed, and the transfer descriptor describes the storage information of the data to be transferred via Direct Memory Access (DMA).
[0096] The FPGA protection module is configured to set a first GPT table for the CPU and a second GPT table for the FPGA, such that in the first GPT table, the FPGA memory-mapped MMIO belongs to the root world, and in the second GPT table, only the first segment is an accessible segment; and to interact with the FPGA through the FPGA memory-mapped MMIO, enabling the FPGA to read the confidential data via DMA transfer based on the transfer descriptor and execute FPGA tasks.
[0097] For specific examples of the execution process of the transfer task module and the FPGA protection module, refer to the preceding description in connection with Figure 4 and Figure 5; the description is not repeated here.
[0098] In a typical embodiment, the root monitor is implemented as secure firmware.
[0099] According to another embodiment, a computing device is also provided, including a memory and a plurality of processors, the computing device forming a confidential computing architecture, the confidential computing architecture including a secure world, a domain world, an insecure world and a root world; the root world including the aforementioned root monitor.
[0100] Those skilled in the art will recognize that, in one or more of the examples above, the functions described in this invention can be implemented using hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
[0101] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solution of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for performing FPGA tasks in a confidential computing architecture, the confidential computing architecture comprising a secure world, a domain world, an insecure world, and a root world; the method comprising: FPGA software in the insecure world configures a stub data structure of an FPGA task in an insecure world segment of memory based on the FPGA's data requirement information, the stub data structure including a data buffer; a root monitor in the root world configures a real data structure corresponding to the stub data structure in a first segment of the memory corresponding to a first domain, the real data structure including a domain buffer corresponding to the data buffer and a transfer descriptor, wherein the domain buffer is used to store confidential data to be processed, the transfer descriptor is used to describe the data to be transferred via Direct Memory Access (DMA), and the first domain is located in the domain world; the root monitor sets granule protection tables (GPT), including a first GPT table for the CPU and a second GPT table for the FPGA, such that in the first GPT table, the first segment has domain world permissions and the FPGA memory-mapped MMIO belongs to the root world, and in the second GPT table, only the first segment is an accessible segment; and the root monitor interacts with the FPGA via the FPGA MMIO, enabling the FPGA to read the confidential data via DMA transfer based on the transfer descriptor and execute the FPGA task.
2. The method according to claim 1, wherein, before configuring the stub data structure of the FPGA task, the method further comprises: the first domain provides the data requirement information and its signature to the FPGA software; and configuring the real data structure corresponding to the stub data structure comprises: after verifying that the signature is correct, the root monitor configures the real data structure.
3. The method according to claim 1, further comprising, before configuring the stub data structure of the FPGA task: the first domain receives the confidential data provided by the user through a secure channel.
4. The method according to claim 3, further comprising: the first domain and the user negotiate a key through a key negotiation protocol; and the secure channel is constructed based on the key.
5. The method according to claim 1, wherein the transfer descriptor includes the following information: the starting address of the data to be transferred, the destination address, the data length, and the data channel information corresponding to the domain buffer; the data channel is selected from a C2H channel from the FPGA to the host and an H2C channel from the host to the FPGA.
6. The method according to claim 1, wherein the transfer descriptor includes multiple descriptors connected in a linked list; a single descriptor corresponds to a single data import into the FPGA; each descriptor has a target field that indicates the address of the next descriptor; and in the linked list, the target field of the current descriptor is filled with the starting address of the next descriptor stored in the first segment.
7. The method according to claim 1, wherein the data requirement information includes multiple data requirements for the same FPGA; and before configuring the real data structure corresponding to the stub data structure, the method further comprises: after creating stub data structures for the multiple data requirements, the FPGA software uniformly calls the SMC instruction to switch to the root world.
8. The method according to claim 1, wherein the domain buffer includes an input data buffer and a result data buffer; the input data buffer stores the confidential data, and the result data buffer is used to store the execution results of the FPGA task.
9. The method according to claim 1, wherein configuring the stub data structure of the FPGA task comprises: generating a stub page table based on the data buffer; and configuring the real data structure corresponding to the stub data structure comprises: generating a real FPGA page table based on the stub page table and the domain buffer.
10. The method according to claim 9, wherein, before the root monitor interacts with the FPGA via the FPGA MMIO, the method further comprises: the root monitor modifies the FPGA MMIO to point to the real FPGA page table.
11. The method according to claim 1, wherein, in the second GPT table, the first segment is set to belong to the insecure world, and all other memory segments are set to belong to the root world.
12. The method according to claim 1, further comprising: after executing the FPGA task, instructing the FPGA to completely clear the most recently used memory from its on-chip memory.
13. A root monitor in a confidential computing architecture, the confidential computing architecture comprising a secure world, a domain world, an insecure world, and a root world; the root monitor is located in the root world and includes a transfer task module and an FPGA protection module, wherein: the transfer task module is configured such that, in response to the FPGA software in the insecure world configuring the stub data structure of the FPGA task in the insecure world segment of memory according to the data requirement information of the FPGA, it configures, in the first segment of memory corresponding to the first domain, the real data structure corresponding to the stub data structure, wherein the stub data structure includes a data buffer, the real data structure includes a domain buffer corresponding to the data buffer and a transfer descriptor, the domain buffer is used to store confidential data to be processed, the transfer descriptor is used to describe the data to be transferred via Direct Memory Access (DMA), and the first domain is located in the domain world; and the FPGA protection module is configured to set granule protection tables (GPT), including a first GPT table for the CPU and a second GPT table for the FPGA, such that in the first GPT table, the first segment has domain world permissions and the FPGA memory-mapped MMIO belongs to the root world, and in the second GPT table, only the first segment is an accessible segment; and to interact with the FPGA via the FPGA MMIO, enabling the FPGA to read the confidential data via DMA transfer based on the transfer descriptor and execute FPGA tasks.
14. A computing device comprising a memory and a plurality of processors, the computing device forming a confidential computing architecture, the confidential computing architecture comprising a secure world, a domain world, an insecure world and a root world; the root world comprising the root monitor of claim 13.