Heterogeneous computing platform and computing method for artificial intelligence
By integrating a CPU, a hardware zero-copy interconnect module, a GPU, and an FPGA on the same board, efficient data transmission between computing units in a heterogeneous computing platform is achieved, solving the problem of low computing efficiency in existing technologies and improving resource utilization and overall processing efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2026-04-13
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies that combine CPUs with a single acceleration chip result in low computational efficiency and low utilization of computing resources. Furthermore, the lack of efficient interconnect architecture design leads to idle high parallel computing power of GPUs and slow feedback of customized processing results from FPGAs, severely restricting the overall collaborative efficiency of heterogeneous architectures.
The platform integrates a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) on the same board. The hardware zero-copy interconnect module enables efficient data transmission between computing units, improves collaborative efficiency, and avoids idle computing unit resources.
By combining the high parallel computing efficiency of the GPU with the low latency and high real-time performance of the FPGA, the platform improves the utilization rate of computing resources in each computing unit and enhances the overall processing efficiency of the heterogeneous computing platform system.
Figure CN122285291A_ABST
Description
Technical Field
[0001] This invention relates to the field of heterogeneous computing platform technology, and in particular to a heterogeneous computing platform and computing method for artificial intelligence (AI).
Background Technology
[0002] In the field of high-performance embedded computing, traditional single-board computers rely on the single core computing power of the central processing unit (CPU), which cannot support the computational needs of complex tasks.
[0003] In related technologies, attempts have been made to employ a combination of a CPU integrated on a single board and a single accelerator chip, either a Field-Programmable Gate Array (FPGA) or a Graphics Processing Unit (GPU), to perform tasks. However, such solutions suffer from several drawbacks. First, the high overhead of general interconnect protocols results in a low effective bandwidth ratio, and the data transmission rate cannot match the chip's computing power. Second, the lack of efficient interconnect architecture design leads to significant data transmission latency, causing the GPU's high parallel computing power to be idle due to untimely data supply. Furthermore, the customized processing results of the FPGA cannot be quickly fed back to other units, severely restricting the overall collaborative efficiency of the heterogeneous architecture. Therefore, existing CPU-plus-single-accelerator-chip solutions result in low computational processing efficiency and low utilization of computing resources.
Summary of the Invention
[0004] This invention provides a heterogeneous computing platform and computing method for artificial intelligence (AI), which addresses the shortcomings of existing technologies that combine CPUs with a single acceleration chip, resulting in low computing efficiency and low utilization of computing resources. It achieves the high efficiency of parallel computing of GPUs and the low latency and high real-time performance of FPGAs, while effectively improving the efficiency of data transmission between computing units through a hardware zero-copy interconnect module. This enhances the collaborative efficiency between units, avoids idle computing unit resources, and significantly improves the utilization of computing resources in each unit, thereby improving the overall processing efficiency of the heterogeneous computing platform system.
[0005] This invention provides a heterogeneous computing platform for artificial intelligence (AI), comprising: a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board, wherein the CPU is connected to the GPU and the FPGA respectively through the hardware zero-copy interconnect module; The CPU is used to decompose the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and send the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA. The GPU is used to execute the AI computing power parallel subtask based on the first task data in the second memory address of the GPU, obtain a first result, and write the first result to the second memory address. The FPGA is used to execute the customized AI acceleration subtask based on the second task data in the third memory address in the FPGA, obtain a second result, and write the second result into the third memory address; The hardware zero-copy interconnect module is used to read the first result from the second memory address and the second result from the third memory address based on cross-unit memory mapping information, and write the read first result and second result into the first memory address in the CPU; wherein, the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA; The CPU is further configured to fuse the first result and the second result to obtain a first fusion result.
[0006] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the hardware zero-copy interconnect module is further configured to read the first task data from the first memory address and write the first task data to the second memory address based on the cross-unit memory mapping information; and to read the second task data from the first memory address and write the second task data to the third memory address.
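The role of the cross-unit memory mapping information can be illustrated with a small software model. The following is a minimal sketch, not the hardware design: the `Region` and `ZeroCopyInterconnect` names, the byte layout, and the sample payloads are all assumptions made for illustration. Python byte arrays stand in for the first, second, and third memory addresses, and the sketch only models the address mapping; a real zero-copy module would move data in hardware without CPU involvement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A window in one unit's memory (hypothetical representation)."""
    unit: str      # "cpu", "gpu", or "fpga"
    offset: int
    length: int


class ZeroCopyInterconnect:
    """Software stand-in for the hardware module; the API is an assumption."""

    def __init__(self, memories, mapping):
        self.memories = memories  # unit name -> bytearray
        self.mapping = mapping    # source Region -> destination Region

    def transfer(self, src):
        """Move the bytes of `src` to the region it is mapped to."""
        dst = self.mapping[src]
        if src.length != dst.length:
            raise ValueError("mapped regions must have equal length")
        src_view = memoryview(self.memories[src.unit])
        dst_view = memoryview(self.memories[dst.unit])
        # One direct move between mapped windows, with no staging buffer.
        dst_view[dst.offset:dst.offset + dst.length] = (
            src_view[src.offset:src.offset + src.length]
        )
        return dst


# Simulated first (CPU), second (GPU), and third (FPGA) memory areas.
memories = {"cpu": bytearray(16), "gpu": bytearray(16), "fpga": bytearray(16)}
cpu_task1, cpu_task2 = Region("cpu", 0, 4), Region("cpu", 4, 4)
gpu_in, fpga_in = Region("gpu", 0, 4), Region("fpga", 0, 4)
mapping = {cpu_task1: gpu_in, cpu_task2: fpga_in}

memories["cpu"][0:8] = b"IMG0SIG0"  # first and second task data (made up)
link = ZeroCopyInterconnect(memories, mapping)
link.transfer(cpu_task1)  # first task data -> second memory address
link.transfer(cpu_task2)  # second task data -> third memory address
```

Returning results to the CPU would use the same table with GPU and FPGA regions as sources and CPU regions as destinations.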
[0007] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the CPU is further configured to, during the process of the CPU executing the general computing subtask, the GPU executing the AI computing power parallel subtask, and the FPGA executing the customized AI acceleration subtask, collect the operating status data of each computing unit in the CPU, GPU, and FPGA, as well as the link status of the hardware zero-copy interconnect module, through an intelligent platform management interface bus; the CPU is further configured to, in response to detecting an operating abnormality based on the operating status data of each computing unit, dynamically adjust the operating status of each computing unit; the CPU is further configured to, in response to detecting an abnormal transmission status of the hardware zero-copy interconnect module based on the link status of the hardware zero-copy interconnect module, adjust the transmission status of the hardware zero-copy interconnect module.
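The monitoring behaviour described above can be sketched as a pure decision function. This is only an illustration under stated assumptions: the text does not specify which status fields are collected, which thresholds apply, or which adjustments are made, so `UnitStatus`, `MAX_TEMP_C`, `MIN_LINK_RATE`, and the action names `"throttle"` and `"retrain_link"` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UnitStatus:
    """Operating status of one computing unit (fields are assumptions)."""
    name: str
    temperature_c: float


# Hypothetical thresholds; the source does not give concrete values.
MAX_TEMP_C = 85.0
MIN_LINK_RATE = 0.9  # fraction of nominal interconnect throughput


def plan_adjustments(units, link_rate):
    """Return (target, action) pairs for every detected abnormality."""
    actions = []
    for unit in units:
        if unit.temperature_c > MAX_TEMP_C:
            # Abnormal operating status: adjust that computing unit.
            actions.append((unit.name, "throttle"))
    if link_rate < MIN_LINK_RATE:
        # Abnormal transmission status: adjust the interconnect module.
        actions.append(("interconnect", "retrain_link"))
    return actions


sample = [UnitStatus("gpu", 91.0), UnitStatus("fpga", 60.0)]
healthy_link = plan_adjustments(sample, link_rate=0.95)
degraded_link = plan_adjustments(sample, link_rate=0.50)
```

Keeping the decision separate from the status collection means the same function could be driven by readings from any management bus.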
[0008] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the CPU is further configured to send a first direct-connection collaboration instruction to the hardware zero-copy interconnect module when the second result includes data that needs to be processed collaboratively by the GPU; the hardware zero-copy interconnect module is further configured to, in response to the first direct-connection collaboration instruction, read the second result from the third memory address based on the cross-unit memory mapping information, and write the second result into the second memory address; the GPU is further configured to execute the AI computing power parallel subtask based on the second result and the first task data to obtain the first result.
[0009] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the FPGA is further configured to acquire timestamp data and write it to the third memory address, execute the customized AI acceleration subtask based on the timestamp data and the second task data, and obtain the second result, wherein the second result is timestamped; the hardware zero-copy interconnect module is further configured to read the timestamp data from the third memory address based on the cross-unit memory mapping information, and write the timestamp data to the second memory address; the GPU is further configured to execute the AI computing power parallel subtask based on the timestamp data and the first task data, and obtain the first result, wherein the first result is timestamped.
[0010] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the CPU is further configured to: verify the first result and the second result respectively based on a unit verification strategy; if the verification of the first result fails, send a reprocessing instruction to the GPU; if the verification of the second result fails, send a reprocessing instruction to the FPGA; once both the first result and the second result have been successfully verified, perform a matching detection on the first result and the second result to obtain a detection result; and, based on the detection result, fuse the first result and the second result to obtain the first fusion result.
[0011] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the CPU is further configured to, when the detection result indicates that the first result and the second result match, write the first result and the second result into the same result area in the first memory address, and perform fusion processing on the first result and the second result in the same result area to obtain the first fusion result.
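The verify, reprocess, match, and fuse sequence of the two preceding paragraphs can be sketched as follows. This is a hedged illustration rather than the claimed implementation: the callables, the `max_retries` limit (the text simply repeats until verification succeeds), the use of equal timestamps as the matching test, and the sample results are all assumptions.

```python
def verify_and_fuse(run_gpu, run_fpga, verify, matches, fuse, max_retries=3):
    """Verify both results, reprocess failures, then match and fuse.

    All five callables are hypothetical stand-ins for platform behaviour.
    """
    first, second = run_gpu(), run_fpga()
    retries = 0
    while True:
        ok_first, ok_second = verify(first), verify(second)
        if ok_first and ok_second:
            break
        if retries == max_retries:  # retry cap is an assumption
            raise RuntimeError("results failed verification")
        retries += 1
        if not ok_first:
            first = run_gpu()    # reprocessing instruction to the GPU
        if not ok_second:
            second = run_fpga()  # reprocessing instruction to the FPGA
    if not matches(first, second):
        return None              # detection result: no match, no fusion
    return fuse(first, second)


# Made-up results: the GPU's first attempt fails verification.
gpu_outputs = iter([None, {"ts": 7, "boxes": [(1, 2, 3, 4)]}])
fpga_outputs = iter([{"ts": 7, "echo": 0.42}])

fused = verify_and_fuse(
    run_gpu=lambda: next(gpu_outputs),
    run_fpga=lambda: next(fpga_outputs),
    verify=lambda r: r is not None,
    matches=lambda a, b: a["ts"] == b["ts"],  # assumed: equal timestamps
    fuse=lambda a, b: {**a, **b},
)
```

Here only the failed unit is asked to reprocess, so a valid FPGA result is not recomputed while the GPU retries.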
[0012] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the hardware zero-copy interconnect module is further configured to write the first result and the second result into the first memory address when the second result does not include real-time control signals.
[0013] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the CPU is further configured to send a direct transmission instruction to the hardware zero-copy interconnect module; the hardware zero-copy interconnect module is further configured to, in response to the direct transmission instruction, read the first fusion result from the first memory address based on the cross-unit memory mapping information, and write the first fusion result to the third memory address; the FPGA is further configured to transmit the first fusion result to a control device through a specified serial port, so that the control device executes corresponding control instructions based on real-time signals.
[0014] According to the heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the CPU is further configured to output the first fusion result to a display via a multimedia display interface, or send the first fusion result to a host computer mounted on the board via an Ethernet port, or write the first fusion result to a solid-state drive.
[0015] According to a heterogeneous computing platform for artificial intelligence (AI) provided by the present invention, the hardware zero-copy interconnect module is further configured to read the first result from the second memory address and write the first result to the third memory address when the second result includes real-time control signals; the FPGA is further configured to perform fusion processing on the first result and the second result in the third memory address to obtain a second fusion result, and output the second fusion result to a control device so that the control device performs real-time control based on the second fusion result.
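The two result paths described above (fusion on the CPU when the second result carries no real-time control signals, fusion on the FPGA when it does) amount to a routing decision. The sketch below shows one way to express it; the `FusionSite` enum and `plan_result_transfers` function are hypothetical names, and the memory areas are represented only by descriptive strings.

```python
from enum import Enum


class FusionSite(Enum):
    """Where the results are fused (names are assumptions)."""
    CPU = "first memory address"
    FPGA = "third memory address"


GPU_MEMORY = "second memory address"


def plan_result_transfers(second_result_has_realtime_control):
    """Return the fusion site and the (source, destination) moves needed."""
    if second_result_has_realtime_control:
        # The FPGA already holds the second result, so only the first
        # result moves; the FPGA then produces the second fusion result.
        return FusionSite.FPGA, [(GPU_MEMORY, FusionSite.FPGA.value)]
    # Otherwise both results go to the CPU, which produces the first
    # fusion result.
    return FusionSite.CPU, [
        (GPU_MEMORY, FusionSite.CPU.value),
        (FusionSite.FPGA.value, FusionSite.CPU.value),
    ]


realtime_site, realtime_moves = plan_result_transfers(True)
normal_site, normal_moves = plan_result_transfers(False)
```

In the real-time case a single transfer suffices, which is consistent with the low-latency goal stated for that path.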
[0016] This invention also provides a computing method applied to the aforementioned heterogeneous computing platform for artificial intelligence (AI). The heterogeneous computing platform for AI includes a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board. The CPU is connected to the GPU and the FPGA respectively via the hardware zero-copy interconnect module. The method includes: The CPU breaks down the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and sends the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA. The GPU executes the AI computing power parallel subtask based on the first task data in the second memory address of the GPU to obtain a first result, and writes the first result to the second memory address. The customized AI acceleration subtask is executed by the FPGA based on the second task data in the third memory address of the FPGA, a second result is obtained, and the second result is written to the third memory address; The hardware zero-copy interconnect module reads the first result from the second memory address and the second result from the third memory address based on cross-unit memory mapping information, and writes the read first result and second result into the first memory address in the CPU; wherein, the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA; The CPU performs a fusion process on the first result and the second result to obtain a first fusion result.
[0017] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the calculation method described above.
[0018] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the computation method as described above.
[0019] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the calculation method as described above.
[0020] The heterogeneous computing platform and method for artificial intelligence (AI) provided by this invention utilizes a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board to form a heterogeneous computing platform. The CPU decomposes the task to be processed into general computing subtasks, AI computing power parallel subtasks, and customized AI acceleration subtasks. The AI computing power parallel subtasks are sent to the GPU, and the customized AI acceleration subtasks are sent to the FPGA. The GPU executes the AI computing power parallel subtasks based on the first task data in a second memory address in the GPU, obtains a first result, and writes the first result to the second memory address. The customized AI acceleration subtask is executed using the FPGA based on the second task data in the third memory address of the FPGA, to obtain a second result, which is then written to the third memory address. A hardware zero-copy interconnect module, based on cross-unit memory mapping information, reads the first result from the second memory address and the second result from the third memory address, and writes the read first and second results to the first memory address of the CPU. The cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA. The CPU then fuses the first and second results to obtain a first fused result. 
Thus, while maintaining the high efficiency of parallel computing of the GPU and the low latency and high real-time performance of the FPGA, the hardware zero-copy interconnect module effectively improves the efficiency of data transmission between computing units, enhances the collaborative efficiency between units, avoids idle computing unit resources, and greatly improves the utilization rate of computing resources in each computing unit, thereby improving the overall processing efficiency of the heterogeneous computing platform system.
Brief Description of the Drawings
[0021] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0022] Figure 1 is a first structural schematic diagram of the heterogeneous computing platform for artificial intelligence (AI) provided by the present invention.
[0023] Figure 2 is a second structural schematic diagram of the heterogeneous computing platform for AI provided by the present invention.
[0024] Figure 3 is a schematic diagram illustrating the principle of computation implemented by the heterogeneous computing platform for AI provided by the present invention.
[0025] Figure 4 is a flowchart illustrating the computing method provided by the present invention.
[0026] Figure 5 is a schematic diagram of the structure of the electronic device provided by the present invention.
[0027] Reference numerals: 100: Heterogeneous computing platform for artificial intelligence (AI); 110: CPU; 120: Hardware zero-copy interconnect module; 130: GPU; 140: FPGA.
Detailed Description of the Embodiments
[0028] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0029] The embodiments of this invention can be applied to electronic devices such as terminal devices, computer systems, and servers, and can operate together with a wide range of other general-purpose or special-purpose computing system environments or configurations. Well-known examples of terminal devices, computing systems, environments, and / or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, etc.
[0030] Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Typically, program modules can include routines, programs, object programs, components, logic, data structures, etc., which perform specific tasks or implement specific abstract data types. Computer systems / servers can be implemented in distributed cloud computing environments, where tasks are executed by remote processing devices linked through communication networks. In distributed cloud computing environments, program modules can reside on local or remote computing system storage media, including storage devices.
[0031] As the demands for computing power, real-time responsiveness, and deployment flexibility continue to rise in scenarios such as industrial control and intelligent signal processing, heterogeneous computing architecture has become a core technology direction. Currently, the core requirements in this field are concentrated in three dimensions: "parallel processing of multimodal tasks," "deployment in confined spaces," and "low-latency data interaction." For example, devices need to achieve artificial intelligence (AI) target detection and real-time signal analysis within a standard 6U size; industrial control scenarios require microsecond-level closed-loop response to ensure precise equipment control; and control devices need to integrate multiple chips, multiple interfaces, and multiple memory configurations within a limited space to avoid complex wiring and volume redundancy.
[0032] Currently, there are two main technical approaches to address the aforementioned shortcomings. One is distributed computing frameworks. Distributed computing frameworks such as Spark and Hadoop are essentially "offline batch processing" or "near real-time computing" architectures, relying heavily on task scheduling and data regrouping between cluster nodes. Even Spark stream processing, optimized for real-time scenarios, still requires millisecond-level minimum batch processing intervals, making it difficult to overcome physical latency limitations. Target scenarios, however, require microsecond-level response capabilities to ensure timely signal processing and precise control commands. The latency levels of existing distributed frameworks are completely insufficient to meet these requirements, making them unsuitable for high-real-time scenarios. Furthermore, distributed computing frameworks rely on "CPU cluster parallelism" as their core computing power, enabling efficient processing of single-type tasks such as statistical computation of structured data and simple machine learning. They lack the parallel floating-point computing capabilities of GPUs and the customized hardware acceleration capabilities of FPGAs. The target scenario needs to simultaneously support multimodal tasks such as "AI inference (relying on GPU computing power) + customized signal processing (relying on FPGA computing power)," for example, "GPU extracts target features + FPGA correlates echo signals" in the system, and "FPGA synchronizes timestamps of multiple cameras + GPU fuses multiple frames of images" in video surveillance. Existing distributed frameworks cannot achieve parallel processing of such complex tasks due to the lack of computing power types.
[0033] The other approach is highly integrated embedded computing modules, with the Raspberry Pi series as a typical example. These modules are only the size of a credit card and integrate a CPU, memory, and multiple communication interfaces. They can run operating systems such as Linux and offer advantages such as low cost and ease of secondary development. They are widely used in IoT device development and small smart device control, enabling the rapid construction of simple, small computing systems. Their core feature is small-scale deployment through high integration, but due to hardware limitations, they do not integrate independent GPU and FPGA chips and can only handle basic serial tasks. The computing power of these modules is only sufficient for simple scenarios such as IoT device control and basic computation on small smart devices. Compared to the parallel computing requirements of "AI inference + signal processing" in the target scenario, the computing power gap is two orders of magnitude, making it impossible to support the smooth operation of high-density computing tasks such as YOLOv5 (You Only Look Once, version 5) object detection and complex signal filtering.
[0034] In terms of heterogeneous computing hardware architecture, some existing technologies attempt to use a combination of CPU and a single accelerator chip (FPGA or GPU). FPGAs, with their hardware programmability, have advantages in customized signal processing; GPUs, relying on massive computing cores, excel in floating-point operations of deep learning models. However, these frameworks, primarily based on "CPU cluster parallelism," can only handle statistical computations of structured data and simple machine learning tasks. They cannot meet the processing needs of complex computing tasks, such as the parallel computation requirements of multimodal data. Furthermore, in the CPU-plus-external-GPU solution, the transmission latency between the CPU and GPU is too high, resulting in idle CPU computing power and low utilization of computing resources. Therefore, the existing CPU-plus-single-accelerator-chip solution leads to low computing efficiency and low utilization of computing resources.
[0035] Furthermore, existing heterogeneous solutions combining CPUs with a single accelerator chip not only suffer from a mismatch in computing power types but also face performance limitations in data interaction between the CPU and the accelerator chip. On the one hand, the high overhead of general interconnect protocols results in a low effective bandwidth ratio, and the data transmission rate cannot match the chip's computing power. On the other hand, the lack of efficient interconnect architecture design leads to widespread data transmission latency, causing the GPU's high parallel computing power to be idle due to untimely data supply, and the customized processing results of the FPGA cannot be quickly fed back to other units, severely restricting the overall collaborative efficiency of the heterogeneous architecture. Therefore, existing solutions combining CPUs with a single accelerator chip result in low computational processing efficiency and low utilization of computing resources.
[0036] Moreover, in order to achieve configurations with multiple chips, multiple interfaces, and multiple memory, traditional high-performance computing devices often adopt a multi-board combination solution of CPU board, GPU board, FPGA board and interface expansion board. The overall size far exceeds the standard 6U size limited by the scenario, making it impossible to deploy. Even if some devices adopt a single-board design, due to insufficient integration, additional acceleration modules or interface modules need to be connected, which not only increases the overall size of the system, but also increases the wiring complexity and failure risk, making it difficult to meet the high integration deployment requirements in a small space.
[0037] Based on the aforementioned problems, this invention provides a heterogeneous computing platform for artificial intelligence (AI). While being compatible with the high efficiency of parallel computing of GPUs and the low latency and high real-time performance of FPGAs, it effectively improves the efficiency of data transmission between computing units through a hardware zero-copy interconnect module, enhances the collaborative efficiency between units, avoids idle computing unit resources, greatly improves the utilization rate of computing resources of each computing unit, and thus improves the overall processing efficiency of the heterogeneous computing platform system.
[0038] The heterogeneous computing platform for artificial intelligence (AI) of this invention is described below with reference to Figures 1 and 2.
[0039] Figure 1 is a first structural schematic diagram of the heterogeneous computing platform for artificial intelligence (AI) provided by the present invention. As shown in Figure 1, the heterogeneous computing platform 100 for artificial intelligence (AI) includes: a central processing unit (CPU) 110, a hardware zero-copy interconnect module 120, a graphics processing unit (GPU) 130, and a field-programmable gate array (FPGA) 140 integrated on the same board. The CPU 110 is connected to the GPU 130 and the FPGA 140 respectively through the hardware zero-copy interconnect module 120. The CPU 110 is used to decompose the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and send the AI computing power parallel subtask to the GPU 130 and the customized AI acceleration subtask to the FPGA 140. The GPU 130 is used to execute the AI computing power parallel subtask based on the first task data in the second memory address of the GPU, obtain a first result, and write the first result to the second memory address. The FPGA 140 is used to execute the customized AI acceleration subtask based on the second task data in the third memory address in the FPGA, obtain a second result, and write the second result into the third memory address. The hardware zero-copy interconnect module 120 is used to read the first result from the second memory address and the second result from the third memory address based on cross-unit memory mapping information, and write the read first result and second result into the first memory address in the CPU, wherein the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA. The CPU 110 is further configured to fuse the first result and the second result to obtain a first fusion result.
[0040] In this embodiment of the invention, the board is mounted on a hardware device, which can be an electronic device with computing capabilities. The hardware device can perform calculations through the board plugged into it. The board can be a single board integrating a CPU, GPU, FPGA, and a hardware zero-copy interconnect module.
[0041] In this embodiment of the invention, the CPU can decompose the task to be processed into multiple subtasks with different computational characteristics. Based on the computational characteristics of each subtask and the hardware characteristics of the GPU and FPGA computing units, subtasks with different computational characteristics are assigned to the two types of units. The high-density parallelism of the GPU and the low-latency customization of the FPGA together support a "complementary computing power division of labor", allowing each unit to focus on its core strengths and reducing cross-unit computing power redundancy. The GPU offers high parallelism and high computing power. Accordingly, among the subtasks, the AI computing power parallel subtasks can be allocated to the GPU, which executes them using its computing power and its dedicated memory (i.e., the second memory address). The FPGA offers high real-time performance and strong adaptability. Accordingly, the customized AI acceleration subtasks can be allocated to the FPGA, which executes them using its hardware reconfigurability and its dedicated memory (i.e., the third memory address).
[0042] For example, basic subtasks with low logical complexity and no parallel requirements can be executed by the CPU itself as general-purpose computing subtasks. High-performance tasks requiring large-scale parallel computation can be assigned to the GPU as AI computing power parallel subtasks. Tasks with high real-time requirements (e.g., microsecond-level latency) and fixed logic can be assigned to the FPGA as customized AI acceleration subtasks.
[0043] After the GPU and FPGA complete their tasks, they transfer the corresponding results to the CPU's dedicated memory, which is the first memory address, through a hardware zero-copy interconnect module.
[0044] In one possible embodiment, when decomposing the task to be processed into AI computing power parallel subtasks, customized AI acceleration subtasks, and general computing subtasks, the CPU is specifically configured to: break the task to be processed down into multiple subtasks; extract computational characteristics from the task attribute data of the multiple subtasks to obtain the computational characteristic data of each subtask; and, based on the computational characteristic data of the multiple subtasks, perform attribute matching against each predefined subtask in the subtask matching rules to classify the multiple subtasks into AI computing power parallel subtasks, customized AI acceleration subtasks, and general computing subtasks. The subtask matching rules include multiple predefined subtasks with different computational characteristics and at least one computational characteristic indicator associated with each predefined subtask. The multiple predefined subtasks include AI computing power parallel subtasks, customized AI acceleration subtasks, and general computing subtasks.
[0045] For example, the CPU can match the computational characteristic indicators of each predefined subtask in the subtask matching rules with the data of each dimension in the computational characteristic data.
[0046] For example, basic tasks with low logical complexity and no parallel requirements can be treated as general-purpose computing subtasks and executed by the CPU itself. High-performance tasks requiring large-scale parallel computation can be treated as AI computing power parallel subtasks and assigned to the GPU for execution. Tasks with high real-time requirements (microsecond-level latency) and fixed logic can be treated as customized AI acceleration subtasks and assigned to the FPGA for execution.
[0047] For example, the subtask matching rule can take the form of a "subtask type - computational characteristic indicator" matching rule table, as shown in Table 1 below:

Table 1

| Subtask type | Computational characteristic indicators |
| --- | --- |
| Customized AI acceleration subtasks (e.g., real-time signal processing) | The data type is device signal; the real-time requirement is high; the link identifier is a real-time signal link; the associated device is vehicle-mounted or airborne equipment; the output requirement is real-time feedback. |
| AI computing power parallel subtasks (e.g., AI inference, large-scale computing) | The data type is image data or model file; the computational complexity is high; the data volume is ≥1GB; the real-time requirement is medium or low; the output requirement is display or Ethernet transmission. |
| General computing subtasks (e.g., task allocation and scheduling for GPUs and FPGAs) | The data type is control instructions; the computational complexity is low; the real-time requirement is medium or low; the output requirement is local storage. |

The CPU can determine the subtask type by matching against the predefined subtasks according to a preset matching rule. If the data in each dimension of the computational characteristic data of the subtask to be matched matches three or more computational characteristic indicators associated with any predefined subtask in the matching rule table, then that predefined subtask is the subtask type corresponding to the subtask to be matched.
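The "three or more matching indicators" rule can be illustrated with a minimal Python sketch. The dictionary keys, the value spellings, and the function name `classify` are illustrative assumptions and do not appear in the original; the sketch also assumes that, when more than one predefined subtask reaches three matches, the first one in table order wins, which the text does not specify.

```python
from typing import Optional

# Rule table transcribed from Table 1. Each subtask type maps a dimension
# name to the set of values that count as a match for that dimension.
# Key names and value spellings are illustrative assumptions.
RULES = {
    "customized_ai_acceleration": {
        "data_type": {"device_signal"},
        "realtime": {"high"},
        "link": {"realtime_signal_link"},
        "device": {"vehicle_or_airborne"},
        "output": {"realtime_feedback"},
    },
    "ai_parallel": {
        "data_type": {"image", "model_file"},
        "complexity": {"high"},
        "data_volume_ge_1gb": {True},
        "realtime": {"medium", "low"},
        "output": {"display", "ethernet"},
    },
    "general": {
        "data_type": {"control_instruction"},
        "complexity": {"low"},
        "realtime": {"medium", "low"},
        "output": {"local_storage"},
    },
}

MIN_MATCHES = 3  # "three or more" indicators, per the matching rule


def classify(features: dict) -> Optional[str]:
    """Return the first subtask type with at least MIN_MATCHES hits, else None."""
    for subtask_type, indicators in RULES.items():
        hits = sum(
            1 for dim, allowed in indicators.items()
            if features.get(dim) in allowed
        )
        if hits >= MIN_MATCHES:
            return subtask_type
    return None
```

Under these assumptions, a subtask described by a device-signal data type, a high real-time requirement, and a real-time feedback output would be classified as a customized AI acceleration subtask and routed to the FPGA.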
[0048] Based on the embodiments of the present invention, by matching the computational characteristic data of the multiple subtasks with the subtask matching rules, the multiple subtasks can be effectively and accurately classified, thereby achieving a refined decomposition of the tasks to be processed and thus achieving accurate task allocation for GPUs and FPGAs.
[0049] In this embodiment of the invention, the CPU can finely decompose the task to be processed into multiple subtasks according to the execution steps or process of the task. The CPU extracts computational characteristics of at least one dimension from the task attribute data of each subtask to obtain computational characteristic data for each subtask. This computational characteristic data may include at least one of the following dimensions: data type, task real-time requirements, computational complexity, output requirements, and associated device type. The data type may include, but is not limited to, image data, device signals, control instructions, model files, etc. For example, the data type can be extracted from the data identifier bits in the task header of the task to be processed.
[0050] Real-time requirements for tasks may include, but are not limited to: high real-time (processing time ≤ 10 microseconds (μs)), medium real-time (processing time between 10 μs and 1 millisecond (ms)), and low real-time (processing time ≥ 1 ms). For example, the real-time requirements of a task can be extracted from the latency threshold field of the task to be processed.
[0051] Computational complexity can include, but is not limited to: high complexity (e.g., data size ≥ 1 gigabyte (GB) or containing convolution operations or matrix operations), medium complexity (e.g., data size between 100 megabytes (MB) and 1 GB), and low complexity (e.g., data size < 100 MB). For example, computational complexity can be obtained by performing data volume statistics and operation type analysis on task-related data, i.e., task data.
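The thresholds for the real-time requirement and computational complexity dimensions can be sketched as two small Python functions. The function names are hypothetical; the sketch assumes the medium real-time band spans 10 μs to 1 ms, and that MB and GB are binary units (the text does not say which convention applies).

```python
MB = 1024 ** 2  # assumption: binary units; the text does not state the convention
GB = 1024 ** 3


def realtime_level(latency_threshold_us: float) -> str:
    """Map a latency threshold in microseconds to a real-time level."""
    if latency_threshold_us <= 10:
        return "high"
    if latency_threshold_us < 1000:  # assumed medium band: 10 us to 1 ms
        return "medium"
    return "low"


def complexity_level(data_bytes: int, has_conv_or_matrix_ops: bool = False) -> str:
    """Map data volume and operation type to a complexity level."""
    if data_bytes >= GB or has_conv_or_matrix_ops:
        return "high"
    if data_bytes >= 100 * MB:
        return "medium"
    return "low"
```

For instance, a 50 MB payload that contains convolution operations is still rated high complexity, because the operation type alone satisfies the high-complexity condition.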
[0052] In this embodiment of the invention, after obtaining the fusion result, the CPU can output the fusion result. Output requirements include, but are not limited to, multimedia display, Ethernet backhaul, local storage, or multipath output. Associated device types may include, but are not limited to, industrial cameras, vehicle / airborne devices, and host computers.
[0053] For example, the associated device type can be obtained by parsing the interface protocol frame format. For instance, an industrial camera transmits high-resolution image data via the Low-Voltage Differential Signaling (LVDS) protocol, so if the interface uses the LVDS protocol, the associated devices include the industrial camera.
[0054] If the interface uses the Recommended Standard 422 (RS422) serial communication protocol, the associated devices include vehicle / airborne equipment. If the interface uses the Ethernet protocol, the associated devices include the host computer, i.e., hardware devices.
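The protocol-to-device association described above amounts to a lookup, sketched below in Python. The sketch assumes the protocol has already been identified from the frame format; the identifier strings and the function name `associated_device` are hypothetical.

```python
# Protocol-to-device association taken from the description above.
# The string identifiers are illustrative assumptions.
PROTOCOL_TO_DEVICE = {
    "LVDS": "industrial_camera",
    "RS422": "vehicle_or_airborne_equipment",
    "ETHERNET": "host_computer",
}


def associated_device(protocol: str) -> str:
    """Return the associated device type for an already-identified protocol."""
    return PROTOCOL_TO_DEVICE.get(protocol.strip().upper(), "unknown")
```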
[0055] In one possible embodiment, the AI computing power parallel subtask may include at least one of a deep learning inference task or a large-scale data computation task; the first task data may include at least one of a preprocessed image or model parameters. The customized AI acceleration subtask may include at least one of a real-time signal processing task or an interface adaptation and data preprocessing task; the second task data may include at least one of a real-time digitized signal or a filtering configuration parameter.
[0056] In one possible example, the task to be processed may include tasks for artificial intelligence (AI) computation or AI-related tasks. For instance, a deep learning inference task may include tasks that utilize AI network models for inference, and a large-scale data processing task may also include tasks related to AI computation, such as image encoding / decoding and Fourier transform of data during AI network model inference. For example, an interface adaptation and data preprocessing task may include receiving raw images from an industrial camera via a 32×LVDS interface, performing noise reduction, cropping, and other preprocessing on the raw images so that the preprocessed images can be used to perform AI model inference tasks. For example, a real-time signal processing task may include filtering noisy signals.
[0057] For example, the GPU obtains the subtask association data corresponding to the AI computing power parallel subtask through the hardware zero-copy interconnect module. The subtask association data corresponding to the GPU can also be called the first task data. The GPU executes the AI computing power parallel subtask based on the corresponding subtask association data.
[0058] In one possible example, at least one of the preprocessed image or model parameters included in the task-associated data can be used as the first task data required for the GPU to execute the task; at least one of the real-time digitized signal or filter configuration parameters included in the task-associated data can be used as the second task data required for the FPGA to execute the task.
[0059] For deep learning inference tasks, the GPU utilizes a compatible CUDA 10.02 or later development environment to call deep learning frameworks, as well as deep learning acceleration libraries (ixDNN) and computer vision acceleration libraries (CV-CUDA) to efficiently run convolutional neural networks (object detection models). During computation, the GPU only relies on its own dedicated memory to store intermediate data and does not consume CPU or FPGA memory resources. The subtask-related data for deep learning inference tasks may include at least one of preprocessed images or model parameters.
[0060] For example, the first result may include the inference results of the AI model, including but not limited to the target coordinates, categories, and confidence levels output by YOLOv5.
[0061] For large-scale data processing tasks, GPUs handle tasks requiring massive amounts of repetitive computation, such as "image encoding and decoding accelerated by the image encoding and decoding library (ixJPEG)," "Fourier transform accelerated by the Fast Fourier Transform library (ixFFT)," and "linear algebra operations accelerated by the Basic Linear Algebra Subroutine Library (ixBLAS)." These tasks would take several seconds to execute on a CPU, but GPUs can compress them to milliseconds through parallel cores. The associated data for subtasks corresponding to large-scale data processing tasks may include at least one of the following: preprocessed images or model parameters.
[0062] For example, the first result may include the result of data processing, which may include, but is not limited to, image encoding / decoding results, Fourier transform results, etc.
[0063] Furthermore, the GPU performs parallel computations based on the received data. The GPU calls the ixDNN acceleration library to complete AI computing power parallel subtasks based on the data in 16GB of dedicated memory, such as performing object detection (e.g., identifying faulty parts of a device). Intermediate computation results are temporarily stored in the GPU's own dedicated memory and do not consume CPU resources.
[0064] Then, the GPU transmits the results back via the hardware zero-copy interconnect module. After completing the calculation, the GPU writes the first result to the GPU's second memory address and transmits it back to the CPU's first dedicated DDR4 memory via the hardware zero-copy interconnect module. For example, the results such as target coordinates, confidence level, and detection timestamp are transmitted back.
[0065] For example, the FPGA obtains the corresponding subtask-related data through the hardware zero-copy interconnect module. The subtask-related data of the FPGA can also be called the second task data. The FPGA executes customized AI acceleration subtasks based on the corresponding subtask-related data.
[0066] For real-time signal processing tasks, fixed algorithms are implemented directly through the FPGA hardware logic, avoiding the instruction overhead of software execution. Latency can be controlled at the microsecond level, far lower than that of CPUs (milliseconds) and GPUs (tens of milliseconds). The subtask-related data corresponding to the real-time signal processing task may include at least one of the following: real-time digitized signal or filter configuration parameters.
[0067] For example, the second result may include real-time signal processing results, which include equipment signal data at the time of the fault.
[0068] For interface adaptation and data preprocessing tasks, the FPGA acts as a "bridge between external devices and the CPU / GPU," adapting to dedicated interface protocols and performing data preprocessing to reduce the basic workload of the CPU / GPU. Specifically, the FPGA receives raw images from industrial cameras via a 32×LVDS interface, performs noise reduction and cropping preprocessing, and then transmits them to the GPU; the FPGA receives device signals via two RS422 serial ports, performs filtering and timing calibration, and then transmits them to the CPU, avoiding wasting computing power on the GPU processing raw data. The subtask-related data for interface adaptation and data preprocessing tasks may include at least one of the following: real-time digitized signals or filter configuration parameters.
[0069] For example, the second result may include data preprocessing results, such as the noise-reduced and cropped preprocessed image and the RS422-filtered equipment rotation speed data.
[0070] Furthermore, the FPGA performs customized calculations based on the read data. The FPGA implements RS422 signal filtering and B-code time synchronization through hardware logic. For example, the FPGA can filter noisy RS422 signals, outputting a denoised standardized digital signal. The filtered data, such as equipment speed and temperature, is then bound to a timestamp and temporarily stored in the FPGA's third dedicated memory. For instance, for RS422 signal filtering, the FPGA can process the input noisy raw differential signal based on configurable parameters, outputting a denoised standardized digital signal and status feedback, which must meet the accuracy and real-time requirements of subsequent data processing. This input-output process achieves microsecond-level filtering delay through the parallel processing capabilities of the FPGA hardware logic, perfectly meeting the "high reliability, low latency" processing requirements of Phytium single-board computers for RS422 signals in industrial control, aerospace, and other scenarios.
[0071] The FPGA transmits results back via a hardware zero-copy interconnect module. For example, the FPGA transmits timestamped signal data back to the CPU's first dedicated memory via the hardware zero-copy interconnect module. The target detection results from the GPU are stored in the same "result area" memory address in the first dedicated memory, awaiting fusion by the CPU.
[0072] In this embodiment of the invention, the hardware zero-copy interconnect module enables zero-copy data transmission between the CPU and both the GPU and FPGA computing units. Specifically, the hardware zero-copy interconnect module records the memory addresses of each computing unit in the CPU, GPU, and FPGA, and uses these recorded memory addresses to implement zero-copy data transmission.
[0073] The CPU can register its own 16GB of DDR4 memory as the first memory address, set read and write permissions, and allow the GPU and FPGA to access it through the hardware zero-copy interconnect module.
[0074] Specifically, the CPU can send a memory interconnect configuration instruction to the GPU, triggering the GPU driver to register 16GB of DDR6 memory as the second memory address. For example, the address range of the second memory address can be 0xC0000000-0xFFFFFFFF; the GPU generates an MR (Memory Region) handle and synchronizes it to the CPU.
[0075] The CPU can send memory interconnect configuration instructions to the FPGA, driving the FPGA to register 2×4GB of DDR3 as a third memory address. For example, the address range of the third memory address can be 0x40000000-0x7FFFFFFF.
[0076] The hardware zero-copy interconnect module records the memory addresses of the CPU, GPU, and FPGA to complete cross-cell address mapping. In one possible example, this pre-configured cross-cell memory mapping information may include the memory addresses of the CPU, GPU, and FPGA.
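The registration and cross-unit address mapping steps can be sketched as a small Python registry. This is a simplified model under stated assumptions, not the module's actual design: the class and method names are hypothetical, the CPU address range is invented for illustration, the GPU range is taken as 0xC0000000-0xFFFFFFFF (assuming an eight-digit upper bound), and the integer returned by `register` merely stands in for an MR handle.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MemoryRegion:
    owner: str   # "CPU", "GPU" or "FPGA"
    start: int   # first address, inclusive
    end: int     # last address, inclusive

    def contains(self, addr: int) -> bool:
        return self.start <= addr <= self.end


class MappingTable:
    """Records registered regions and resolves an address to (owner, offset)."""

    def __init__(self) -> None:
        self._regions: List[MemoryRegion] = []

    def register(self, region: MemoryRegion) -> int:
        # Reject regions that overlap one already recorded.
        for existing in self._regions:
            if region.start <= existing.end and existing.start <= region.end:
                raise ValueError(f"{region.owner} overlaps {existing.owner}")
        self._regions.append(region)
        return len(self._regions) - 1  # stand-in for an MR handle

    def resolve(self, addr: int) -> Tuple[str, int]:
        for region in self._regions:
            if region.contains(addr):
                return region.owner, addr - region.start
        raise LookupError(f"address {addr:#x} is not mapped")


table = MappingTable()
table.register(MemoryRegion("CPU", 0x00000000, 0x3FFFFFFF))   # hypothetical range
table.register(MemoryRegion("FPGA", 0x40000000, 0x7FFFFFFF))  # range from the text
table.register(MemoryRegion("GPU", 0xC0000000, 0xFFFFFFFF))   # upper bound assumed
```

With the table populated this way, `table.resolve(0x40000010)` identifies the FPGA region and an offset of 0x10 within it, while an address in the unregistered gap raises `LookupError`.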
[0077] The hardware zero-copy interconnect module can realize data transmission between the CPU and GPU through the P1 interface, and data transmission between the CPU and FPGA through the P2 / P3 interface.
[0078] In addition, the CPU is configured with the following transmission rules: 70% of the bandwidth for AI inference data transmission is allocated to the P1 interface (in addition, the GPU can also handle image processing for ordinary algorithms). 50% of the bandwidth for real-time signal transmission is allocated to the P2 / P3 interfaces to avoid bandwidth contention by high-priority tasks.
[0079] It should be noted that the CPU's first memory address can include processing memory for storing data related to task execution and shared memory for storing data shared with other units. Without the hardware zero-copy interconnect module, if the CPU sends data directly to the GPU or FPGA, the CPU must first read the data to be sent from its processing memory and write it to its shared memory; that is, the CPU performs a data copy internally. The CPU then sends the data from its shared memory to the GPU. Similarly, the GPU's second memory address can include processing memory for storing data related to task execution and shared memory for storing data shared with other units. The GPU must write the data received from the CPU to its shared memory and then transfer it from the shared memory to its processing memory before it can use the data to execute the corresponding task.
[0080] In this embodiment of the invention, a hardware zero-copy interconnect module is designed. The cross-unit memory mapping information recorded by this module includes the memory of each computing unit to which access is authorized, which can be the processing memory of each computing unit. Based on the cross-unit memory mapping information, the hardware zero-copy interconnect module can directly read the data to be sent from the CPU's processing memory and directly write it into the GPU's processing memory; similarly, it can directly read the data to be sent from the CPU's processing memory and directly write it into the FPGA's processing memory. This eliminates the need for the CPU to copy data from its own processing memory to its own shared memory, i.e., the data transfer is zero-copy, thus achieving fast and efficient interconnection between units.
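The difference between the staged path and the mapped path can be made concrete by counting buffer copies in a toy Python model. The buffers are plain `bytearray` objects standing in for processing and shared memory, and both function names are hypothetical; the sketch only illustrates the bookkeeping and says nothing about the real hardware data path.

```python
def staged_transfer(cpu_processing: bytearray, gpu_processing: bytearray) -> int:
    """Conventional path: returns the number of buffer copies performed."""
    copies = 0
    cpu_shared = bytearray(cpu_processing)          # CPU: processing -> shared
    copies += 1
    gpu_shared = bytearray(cpu_shared)              # link: CPU shared -> GPU shared
    copies += 1
    gpu_processing[:len(gpu_shared)] = gpu_shared   # GPU: shared -> processing
    copies += 1
    return copies


def zero_copy_transfer(cpu_processing: bytearray, gpu_processing: bytearray) -> int:
    """Mapped path: one direct write with no intermediate staging buffer."""
    view = memoryview(cpu_processing)               # a view of the source, not a copy
    gpu_processing[:len(view)] = view               # processing -> processing
    return 1
```

Both functions leave the same bytes in the destination buffer; the staged path touches the data three times, while the mapped path performs a single transfer.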
[0081] In one possible example, the hardware zero-copy interconnect module can be a module implemented based on Remote Direct Memory Access (RDMA) technology.
[0082] In real-world scenarios, the heterogeneous computing system platform of the present invention, which uses "hardware-level protocol processing + direct memory access", eliminates the need for CPU relay and software parsing. Through the hardware zero-copy interconnect module, the transmission latency between the CPU and GPU is reduced to ≤1μs, and the latency between the CPU and FPGA is reduced to ≤0.5μs, fully covering the latency threshold of real-time scenarios and meeting the low-latency requirements of multi-computing unit collaboration.
[0083] In one possible example, the pre-configured cross-cell memory mapping information may include the address of the processing memory in the respective memory addresses of the CPU, GPU, and FPGA.
[0084] Specifically, the hardware zero-copy interconnect module supports both Write and Read operations. When the CPU needs to transfer data to the GPU or FPGA, a Write operation can be used. For example, the hardware zero-copy interconnect module can directly write the subtask data from the CPU's first memory address to the corresponding GPU or FPGA's processing memory without requiring a response from the GPU or FPGA. When the GPU or FPGA needs to send the task results back to the CPU after executing the task, the hardware zero-copy interconnect module can directly read the task results from the GPU or FPGA's processing memory and send them directly to the CPU's processing memory.
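The Write and Read operations can be modeled with a short Python class. This is a sketch under assumptions: the class name `Interconnect`, the method signatures, and the use of byte offsets are all hypothetical, and each unit's processing memory is represented by a `bytearray`.

```python
class Interconnect:
    """Toy model of the Write and Read operations over per-unit memories."""

    def __init__(self, memories: dict) -> None:
        self.mem = memories  # e.g. {"CPU": bytearray, "GPU": ..., "FPGA": ...}

    def _check(self, unit: str, offset: int, length: int) -> None:
        if offset < 0 or length < 0 or offset + length > len(self.mem[unit]):
            raise ValueError(f"range outside {unit} memory")

    def write(self, target: str, offset: int, cpu_offset: int, length: int) -> None:
        """Push CPU data into the target's memory; no reply is awaited."""
        self._check("CPU", cpu_offset, length)
        self._check(target, offset, length)
        src = memoryview(self.mem["CPU"])[cpu_offset:cpu_offset + length]
        self.mem[target][offset:offset + length] = src

    def read(self, source: str, offset: int, cpu_offset: int, length: int) -> None:
        """Pull a result from the source's memory into CPU memory."""
        self._check(source, offset, length)
        self._check("CPU", cpu_offset, length)
        src = memoryview(self.mem[source])[offset:offset + length]
        self.mem["CPU"][cpu_offset:cpu_offset + length] = src
```

In this model, task dispatch corresponds to a `write` call targeting the GPU or FPGA, and result return corresponds to a `read` call whose destination is always CPU memory.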
[0085] In this embodiment of the invention, the CPU can further fuse the results from the GPU and FPGA.
[0086] For example, in intelligent detection and real-time signal analysis scenarios, the CPU fuses the GPU's device fault detection results with the FPGA's fault signal data at the time of the fault according to timestamps. Specifically, the CPU reads the GPU target detection results and FPGA signal data from 16GB of DDR4 memory and performs correlation and fusion according to timestamps: binding the "gear wear result detected by the GPU at 10:05:30" with the "abnormal rotational speed (1500rpm→800rpm) signal filtered by the FPGA at the same time" to generate a "fault-signal correlation report," which is temporarily stored in 16GB of DDR4 memory.
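The timestamp-based fusion can be sketched in Python as a join between the two result sets. The record types, field names, and the function `fuse` are hypothetical, and the sketch assumes that "the same time" means exactly equal timestamp strings; a real system might instead match within a tolerance window.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Detection:        # GPU side (first result)
    timestamp: str
    label: str
    confidence: float


@dataclass
class SignalSample:     # FPGA side (second result)
    timestamp: str
    rpm_before: int
    rpm_after: int


def fuse(detections: List[Detection], signals: List[SignalSample]) -> List[Dict]:
    """Bind each detection to the signal sample carrying the same timestamp."""
    by_time = {s.timestamp: s for s in signals}
    report = []
    for det in detections:
        sig = by_time.get(det.timestamp)
        if sig is not None:  # detections without a matching signal are skipped
            report.append({
                "timestamp": det.timestamp,
                "fault": det.label,
                "confidence": det.confidence,
                "rpm_change": (sig.rpm_before, sig.rpm_after),
            })
    return report
```

Applied to the example above, a "gear wear" detection stamped 10:05:30 and a speed sample stamped 10:05:30 with a change from 1500 rpm to 800 rpm would yield one entry in the fault-signal correlation report.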
[0087] Based on embodiments of the present invention, a heterogeneous computing platform is constructed by integrating a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) on the same board. The CPU decomposes the task to be processed into general computing subtasks, AI computing power parallel subtasks, and customized AI acceleration subtasks. The AI computing power parallel subtasks are sent to the GPU, and the customized AI acceleration subtasks are sent to the FPGA. The GPU executes the AI computing power parallel subtasks based on the first task data in the second memory address of the GPU, obtains a first result, and writes the first result to the second memory address. The FPGA executes the customized AI acceleration subtasks based on the second task data in the third memory address of the FPGA, obtains a second result, and writes the second result to the third memory address. Based on the cross-unit memory mapping information, the hardware zero-copy interconnect module reads the first result from the second memory address and the second result from the third memory address, and writes the read first and second results to the first memory address of the CPU. The cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA. The CPU fuses the first and second results to obtain a first fused result. 
Thus, while maintaining the high efficiency of parallel computing of the GPU and the low latency and high real-time performance of the FPGA, the hardware zero-copy interconnect module effectively improves the efficiency of data transmission between computing units, enhances the collaborative efficiency between units, avoids idle computing unit resources, greatly improves the utilization rate of computing resources in each computing unit, and ultimately improves the overall processing efficiency of the heterogeneous computing platform system.
[0088] Furthermore, the hardware zero-copy interconnect module is also used to read the first task data from the first memory address and write the first task data to the second memory address based on the cross-unit memory mapping information; and to read the second task data from the first memory address and write the second task data to the third memory address.
[0089] In one possible embodiment, the CPU can obtain the task-related data, i.e., task data, of the task to be processed by any of the following: If the task-related data includes real-time signal data, the FPGA is used to receive real-time signal data sent by external devices. The FPGA sends the real-time signal data to the CPU through a Gigabit Transceiver High-speed (GTH) link. The CPU receives the real-time signal data from the FPGA through the GTH link and writes the real-time signal data to the CPU's first memory address. If the task-related data includes non-real-time data, the CPU receives the non-real-time data through the Ethernet port and writes the non-real-time data to the CPU's first memory address. If the task-related data includes general peripheral data, the CPU receives the general peripheral data through the Universal Serial Bus (USB) interface and stores the general peripheral data to the solid-state drive of the hardware device.
[0090] If the task-related data includes real-time signal data, the low-latency characteristics of the FPGA and its links can be utilized for transmission. For example, RS422 device operation signals can be directly connected to the FPGA via the onboard "2 RS422 serial ports"; LVDS industrial camera images can be directly connected to the FPGA via the onboard "32×LVDS interface". The FPGA first temporarily stores the raw data in its own third dedicated memory. The FPGA can preprocess the real-time signal data, such as filtering and noise reduction. The preprocessed data is then transmitted to the CPU's first dedicated memory via the GTH link of the P2 / P3 interface with a single-channel rate ≥10 gigabits per second (Gbps). Alternatively, if the task-related data includes real-time signal data, the FPGA can also send the real-time signal data to the CPU via a hardware zero-copy interconnect module.
[0091] Based on this, by utilizing the low latency characteristics (microsecond level) of the GTH link, it is ensured that real-time signals are not lost or delayed; the latency of real-time signal data is reduced, and the transmission efficiency of real-time signal data is improved.
[0092] If the task-related data includes non-real-time data, such as YOLOv5 model files and task configuration instructions issued by hardware devices, the data is received through the CPU's "1×1000Base-T Gigabit Ethernet port" and written directly to the CPU's first dedicated memory without going through the FPGA, thus avoiding resource waste. Here, 1000Base-T indicates 1000 Mbps baseband Ethernet transmission over twisted-pair cable. If the task-related data includes general peripheral data, since general peripheral data is externally accessed non-real-time data, such as historical image datasets stored on a USB drive, the CPU can receive the general peripheral data via the three USB ports on the front panel and temporarily store it on the hardware device's M.2 2TB SSD. The data is then read into the CPU's first dedicated memory when the task starts.
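The three ingestion paths described above can be summarized as a routing table, sketched here in Python. The category keys, the tuple layout, and the function name `ingest_route` are illustrative assumptions.

```python
# Ingestion paths as (entry point, first storage location).
# Category keys are illustrative assumptions.
INGEST_ROUTES = {
    "realtime_signal": ("FPGA, then GTH link", "CPU first memory"),
    "non_realtime": ("Ethernet port", "CPU first memory"),
    "general_peripheral": ("USB interface", "SSD"),
}


def ingest_route(kind: str) -> tuple:
    """Return (entry point, first storage location) for a data category."""
    try:
        return INGEST_ROUTES[kind]
    except KeyError:
        raise ValueError(f"unknown data category: {kind}") from None
```

General peripheral data is the only category that lands on the SSD first; per the description, it is read into the CPU's first dedicated memory only when the task starts.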
[0093] In one possible embodiment, the CPU can perform general preprocessing on the data temporarily stored in the first dedicated memory to ensure that the data format matches the processing requirements of the corresponding GPU / FPGA.
[0094] Specifically, for image data, the CPU can convert the image data into a format that the GPU can recognize, and it can also compress the data to improve transmission efficiency. For example, the CPU converts the raw LVDS image (such as RAW format) into an RGB / BGR format that the GPU can recognize, scales it to a resolution of 640×640 to match the YOLOv5 input, and transmits the preprocessed data to the GPU's 16GB dedicated memory through the SRIO link (rate ≥10Gbps) of the P1 interface, while also transmitting configuration parameters such as "image size and number of channels". For signal data, the CPU can convert the signal data into a format that the FPGA can process, and can also configure the associated parameters. For example, the CPU can convert the raw RS422 signal (such as an analog voltage value) into a digital signal (such as 16-bit integer data) that the FPGA can process, configure parameters such as "filter algorithm coefficients and sampling frequency", and transmit the data and parameters synchronously to the FPGA's third dedicated memory through the GTH link of the P3 interface.
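The conversion of a voltage value into 16-bit integer data can be illustrated with a simple quantization sketch in Python. The function name, the symmetric ±5.0 V full-scale range, and the clipping behavior are assumptions made for illustration; the text does not specify the conversion actually used.

```python
from typing import Iterable, List

INT16_MAX = 32767


def quantize_int16(voltages: Iterable[float], full_scale_volts: float = 5.0) -> List[int]:
    """Clip each value to +/- full scale and map it linearly to a signed 16-bit code."""
    if full_scale_volts <= 0:
        raise ValueError("full_scale_volts must be positive")
    codes = []
    for v in voltages:
        clipped = max(-full_scale_volts, min(full_scale_volts, v))
        codes.append(int(round(clipped / full_scale_volts * INT16_MAX)))
    return codes
```

Values beyond the assumed full-scale range saturate at the largest positive or negative code instead of wrapping around.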
[0095] The CPU can parse and process task instructions. For example, it can parse the task type (such as target detection / signal analysis) and output requirements (such as High-Definition Multimedia Interface (HDMI) display / Ethernet backhaul) of the total task issued by the host computer, and generate CPU scheduling instructions and GPU / FPGA task lists, which are temporarily stored in the instruction area of the first dedicated memory.
[0096] Furthermore, the CPU is also used to collect the operating status data of each computing unit in the CPU, GPU and FPGA, as well as the link status of the hardware zero-copy interconnect module, through the intelligent platform management interface bus during the process of the CPU executing the general computing subtask, the GPU executing the AI computing power parallel subtask, and the FPGA executing the customized AI acceleration subtask. The CPU is also configured to dynamically adjust the operating status of each computing unit in response to detecting an operating abnormality based on the operating status data of each computing unit. The CPU is further configured to adjust the transmission state of the hardware zero-copy interconnect module in response to detecting an abnormal transmission state of the hardware zero-copy interconnect module based on the link state of the hardware zero-copy interconnect module.
[0097] Among them, the running status data can be the status of the computing unit during the execution of tasks, such as CPU load rate, GPU computing power utilization rate, FPGA logic resource utilization rate, temperature of each computing unit, etc.
[0098] The CPU can implement the above by executing general computing subtasks. Specifically, by executing general computing subtasks, the CPU can collect the operating status data of each computing unit and perform status detection; if any computing unit is running abnormally, the CPU can reschedule tasks among the computing units. In addition, by executing general computing subtasks, the CPU can collect the transmission status data of the hardware zero-copy interconnect module and make adaptive dynamic adjustments.
[0099] For example, GPU operating status data may include, but is not limited to, GPU computing load rate, unit temperature, etc. For instance, if the GPU computing load is detected to exceed 80%, it is determined that the operating status is abnormal. In this case, the general subtasks of the CPU with low priority are suspended, and the CPU allocates new general subtasks of AI computing power parallel subtasks to the GPU. It is also possible to prioritize the allocation of the P1 interface (6×4SRIO / GTH) bandwidth of the VPX bus to the GPU to ensure data transmission efficiency.
[0100] For example, the FPGA's operational status data may include, but is not limited to, FPGA processing latency, logic resource utilization, etc. For instance, if the FPGA processing latency is detected to exceed a threshold, the operational status is determined to be abnormal. In this case, the FPGA's clock frequency (which determines the FPGA's execution speed; lowering it slows execution) is adjusted via the X100 chipset, or the number of signal channels it processes in parallel is reduced (e.g., from 8 channels to 6 channels), to ensure that real-time performance meets the requirements.
[0101] For example, CPU operating status data may include, but is not limited to, CPU load rate and unit temperature. For instance, if the CPU load rate exceeds a threshold, indicating an abnormal operating status, non-core subtasks within the CPU's general computing subtasks can be migrated to the GPU's idle computing power for processing. These non-core subtasks may include result formatting subtasks, data statistics subtasks, etc.
[0102] For example, the interconnection transmission status data of the hardware zero-copy interconnect module may include, but is not limited to, the transmission delay, bandwidth utilization, and number of data verification failures of the P1, P2, or P3 interfaces. For instance, if the transmission delay of the P1 interface exceeds 1μs, the CPU can pause low-priority general computing subtasks to free up the bandwidth of the P1 interface.
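The monitoring rules in the preceding paragraphs can be collected into one rule-evaluation sketch in Python. The 80% GPU load limit and the 1 μs P1 delay limit come from the text; the FPGA latency and CPU load thresholds are not given, so they are required parameters here. The status keys, function name, and action strings are hypothetical.

```python
from typing import Dict, List


def plan_adjustments(status: Dict[str, float],
                     fpga_latency_limit_us: float,
                     cpu_load_limit: float) -> List[str]:
    """Return the adjustments triggered by one snapshot of status data."""
    actions = []
    if status["gpu_load"] > 0.80:                        # limit stated in the text
        actions.append("suspend low-priority CPU general subtasks; "
                       "prioritize P1 bandwidth for the GPU")
    if status["fpga_latency_us"] > fpga_latency_limit_us:  # caller-supplied limit
        actions.append("adjust FPGA clock frequency or reduce parallel signal channels")
    if status["cpu_load"] > cpu_load_limit:              # caller-supplied limit
        actions.append("migrate non-core general subtasks to idle GPU capacity")
    if status["p1_delay_us"] > 1.0:                      # limit stated in the text
        actions.append("pause low-priority general subtasks to free P1 bandwidth")
    return actions
```

A snapshot in which every value is within its limit returns an empty list, meaning no adjustment is needed.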
[0103] Based on the embodiments of the present invention, when the CPU detects an abnormality in the operating status of a computing unit based on the operating status data of each computing unit, it can quickly and dynamically adjust that computing unit so that its computing power returns to normal, ensuring efficient and stable execution of computing tasks. Furthermore, the CPU can also adjust the transmission status of the hardware zero-copy interconnect module in real time, ensuring efficient interconnection between units and thereby improving the reliability and efficiency of the heterogeneous computing platform.
[0104] In one possible embodiment, the hardware zero-copy interconnect module, when writing the read first result and second result to the first memory address, is specifically used for: performing data integrity verification on the read first result and second result; in response to the first result failing verification, sending a retransmission command to the GPU; in response to the second result failing verification, sending a retransmission command to the FPGA; and in response to both the first result and the second result passing verification, writing the read first result and second result to the first memory address.
[0105] For example, the hardware zero-copy interconnect module can support data verification operations, enabling data integrity verification of the transmitted data. For instance, it can perform CRC32 (32-bit cyclic redundancy check) verification on the first and second results. Additionally, the hardware zero-copy interconnect module can also perform data integrity verification on the first and second task data.
[0106] If the verification fails, the hardware zero-copy interconnect module can send a retransmission command to the corresponding computing unit to ensure data integrity.
[0107] Based on the embodiments of the present invention, the accuracy of data transmission between computing units is ensured by the verification of the hardware zero-copy interconnect module during the transmission process, thereby further improving the accuracy and reliability of the heterogeneous computing platform operation.
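The verify-then-retransmit behaviour described above can be illustrated in software. The sketch below assumes a frame layout of payload followed by a 4-byte big-endian CRC32 trailer, and a hypothetical `read_frame` callable that stands in for reading a result and, on failure, issuing a retransmission command; neither is specified in the source, which performs this check in hardware.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a CRC32 trailer to the payload (assumed frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_frame(frame: bytes) -> bool:
    """Recompute the CRC32 and compare it with the trailer."""
    if len(frame) < 4:
        return False
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload) == int.from_bytes(trailer, "big")


def receive_with_retry(read_frame, max_retries: int = 3) -> bytes:
    """Read frames until one passes verification.

    Each extra call to read_frame models one retransmission command
    sent to the GPU or FPGA. max_retries is an assumed limit.
    """
    for _ in range(max_retries + 1):
        frame = read_frame()
        if verify_frame(frame):
            return frame[:-4]
    raise IOError("result failed CRC32 verification after retries")
```

Only a verified payload is returned to the caller, which mirrors the rule that results are written to the first memory address only after verification succeeds.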
[0108] Furthermore, the CPU is also configured to send a first direct-connection collaboration instruction to the hardware zero-copy interconnect module when the second result includes data that requires collaborative processing by the GPU; The hardware zero-copy interconnect module is also used to respond to the first direct connection collaboration instruction, read the second result from the third memory address based on the cross-unit memory mapping information, and write the second result into the second memory address; The GPU is also used to execute the AI computing power parallel subtask based on the second result and the first task data to obtain the first result.
[0109] A hardware zero-copy interconnect module can be used to enable collaborative processing between GPUs and FPGAs during task execution. For example, if the execution result of a subtask of a GPU or FPGA may include data required by the other party when executing its subtask, then fast and efficient data interconnection can be achieved directly through the hardware zero-copy interconnect module.
[0110] For example, the second result can be a real-time image stream processed by the FPGA, such as an image data stream in RGB (Red Green Blue) format. The AI computing power parallel subtask can be AI inference for the real-time image. In scenarios where the GPU needs to process the real-time image stream preprocessed by the FPGA, the real-time image stream processed by the FPGA can be directly read from the processing memory at the third memory address of the FPGA through a hardware zero-copy interconnect module and directly written into the processing memory of the GPU for AI inference.
[0111] Of course, for scenarios where the FPGA needs to process data already processed by the GPU, the interaction between the computing units and the hardware zero-copy interconnect module can include the following: when the first result includes data that requires collaborative processing by the FPGA, the CPU is also used to send a second direct-connection collaboration instruction to the hardware zero-copy interconnect module; the hardware zero-copy interconnect module is also used to respond to the second direct-connection collaboration instruction, read the first result from the second memory address of the GPU based on the pre-configured cross-unit memory mapping information, and write the read first result into the third memory address of the FPGA; the FPGA is also used to execute a customized AI acceleration subtask based on the first result in the third memory address, obtain a fourth result, and write the fourth result to the third memory address; and the hardware zero-copy interconnect module is also used to read the fourth result from the third memory address based on the pre-configured cross-unit memory mapping information, and write the read fourth result to the first memory address of the CPU.
[0112] For example, the FPGA may perform correlation processing on the GPU's AI inference results; in this case, the first result is the result data of the GPU's AI inference.
[0113] For example, in some industrial scenarios, the GPU can perform AI inference based on multiple machine images to obtain the confidence scores of these images. Images with confidence scores below a preset threshold correspond to images of machines with anomalies. The hardware zero-copy interconnect module can write the confidence scores of the multiple machines inferred by the GPU into the FPGA's processing memory, enabling the FPGA to select the signal of the abnormal machine from the signals transmitted by the multiple machines and process that signal.
[0114] For example, in a radar system, the target features extracted by the GPU are directly transmitted to the third memory address of the FPGA through a hardware zero-copy interconnect module. The FPGA can then associate its own radar echo signal with the target features of the GPU.
[0115] For example, in video surveillance, a hardware zero-copy interconnect module can be used to directly transmit the timestamp data of multiple cameras on the FPGA to the second memory address of the GPU. The GPU can then combine the timestamps of multiple cameras to fuse multiple frames of images from multiple cameras at the same time.
[0116] Based on the embodiments of the present invention, the GPU and FPGA can directly interconnect the task results during task execution through a hardware zero-copy interconnect module, without going through the CPU. This reduces the CPU utilization rate, allowing the CPU to release more computing power to support more complex multi-task concurrent scenarios and improve the collaborative efficiency between computing units.
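The direct GPU-FPGA path can be modelled as a transfer between mapped memory regions that never touches the CPU's region. The following is a software simulation only: the class name, the region names, and the use of `bytearray` buffers are assumptions, and a Python model cannot reproduce the hardware's zero-copy behaviour or latency.

```python
class ZeroCopyInterconnect:
    """Toy model of the cross-unit memory mapping.

    regions maps a unit name to a bytearray standing in for that
    unit's processing memory (hypothetical names and layout).
    """

    def __init__(self, regions: dict):
        self.regions = regions

    def transfer(self, src: str, dst: str, src_off: int,
                 dst_off: int, length: int) -> None:
        """Move length bytes from src memory directly into dst memory."""
        src_mem, dst_mem = self.regions[src], self.regions[dst]
        if src_off + length > len(src_mem) or dst_off + length > len(dst_mem):
            raise ValueError("transfer exceeds mapped region")
        # memoryview avoids building an intermediate bytes object.
        dst_mem[dst_off:dst_off + length] = \
            memoryview(src_mem)[src_off:src_off + length]


# Example: an FPGA-preprocessed RGB fragment goes straight to the GPU.
regions = {
    "cpu": bytearray(8),
    "gpu": bytearray(8),
    "fpga": bytearray(b"RGBRGB\x00\x00"),
}
interconnect = ZeroCopyInterconnect(regions)
interconnect.transfer("fpga", "gpu", src_off=0, dst_off=0, length=6)
```

After the transfer the GPU region holds the FPGA data while the CPU region is unchanged, which is the property the paragraph above attributes to the direct path.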
[0117] Furthermore, the FPGA is also used to acquire timestamp data and write it to the third memory address, and to execute the customized AI acceleration subtask based on the timestamp data and the second task data to obtain the second result, wherein the second result is timestamped. The hardware zero-copy interconnect module is also used to read the timestamp data from the third memory address based on the cross-unit memory mapping information, and write the timestamp data into the second memory address; The GPU is also used to execute the AI computing power parallel subtask based on the timestamp data and the first task data to obtain the first result, wherein the first result is timestamped.
[0118] In this embodiment of the invention, the FPGA can obtain current time data in real time through a timestamp interface, bind timestamps to task results, and temporarily store them in the DDR3 memory result area. Examples include timestamped equipment rotation speed data and equipment temperature data.
[0119] Furthermore, the latest timestamp data can be read from the FPGA's processing memory and directly written to the GPU's processing memory through a hardware zero-copy interconnect module. When the GPU detects real-time image streams, it can directly bind the latest timestamp to the currently executed image detection result, thus realizing timestamp binding of the GPU execution result.
[0120] Based on this invention, a new "GPU-FPGA direct connection path" (latency ≤ 1.5μs) is added. The GPU can directly read the timestamp data in the FPGA's DDR3 memory, meeting the low-latency requirements of multi-computing unit collaboration. Specifically, the latest timestamp data is obtained in real time through the FPGA, and the low-latency characteristics of the FPGA are combined with the GPU's task execution process through a hardware zero-copy interconnect module. This allows the different computing power characteristics of each computing unit to be effectively combined, improving the practicality of the heterogeneous computing platform.
[0121] Furthermore, the CPU is also configured to verify the first result and the second result respectively based on a unit verification strategy; if the verification of the first result fails, a reprocessing instruction is sent to the GPU; if the verification of the second result fails, a reprocessing instruction is sent to the FPGA; once both the first result and the second result have been successfully verified, a matching test is performed on the first result and the second result to obtain a test result; and based on the test result, the first result and the second result are fused to obtain the first fusion result.
[0122] In this embodiment of the invention, the unit verification strategy can be a verification strategy for the task execution results of the computing power unit. The verification strategies for different computing power units can be the same or different.
[0123] For example, the CPU performs format verification and data integrity verification on the first result; the CPU performs timing verification and precision verification on the second result.
[0124] The CPU verifies the format of the first result, for example, whether it conforms to a preset JSON format, and whether it contains fields such as "target coordinates, confidence level, and timestamp". The CPU also verifies the data integrity of the first result, for example, whether there is packet loss.
[0125] The CPU performs timing verification on the second result. For example, for RS422 filtered equipment speed data and B-code synchronization timestamps, the timing of the signal data can be checked to verify whether the timestamps are continuous. As another example, for speed data, its accuracy can be checked to verify whether it is within a reasonable range.
[0126] In this embodiment of the invention, a matching test is also performed on the first result and the second result. Based on the test result, the first result and the second result are fused to obtain a first fusion result.
[0127] Based on the embodiments of the present invention, the CPU can perform targeted verification on the results of different computing units, ensuring the reliability of the results of each computing unit.
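The per-unit verification strategies in paragraphs [0123] to [0125] might look like the following in software. This is a sketch under assumptions: the JSON key names, the sample format of `(timestamp, rpm)` pairs, the maximum timestamp gap, and the reasonable speed range are all hypothetical values chosen for illustration.

```python
import json

# Hypothetical key names for the fields listed in [0124].
REQUIRED_FIELDS = {"target_coordinates", "confidence", "timestamp"}


def verify_gpu_result(raw: str) -> bool:
    """Format check: valid JSON object containing the required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS.issubset(data)


def verify_fpga_result(samples, max_gap_s=1.0, rpm_range=(0.0, 3000.0)):
    """Timing and range check on (timestamp_s, rpm) samples.

    Timestamps must increase with no gap above max_gap_s, and every
    speed value must fall inside rpm_range (both limits assumed).
    """
    for (t0, _), (t1, _) in zip(samples, samples[1:]):
        if not 0 < t1 - t0 <= max_gap_s:
            return False
    low, high = rpm_range
    return all(low <= rpm <= high for _, rpm in samples)
```

A failed `verify_gpu_result` would correspond to sending a reprocessing instruction to the GPU, and a failed `verify_fpga_result` to sending one to the FPGA.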
[0128] In one possible embodiment, when the CPU performs a matching detection on the first result and the second result to obtain a detection result, it is specifically used for: The first result is parsed into a first data structure containing a preset field, and the second result is parsed into a second data structure containing the preset field. The field name of the preset field includes at least one of timestamp, task identifier, or task keyword. The matching result is obtained by performing a matching test on fields with the same field names in the first and second data structures.
[0129] The first and second results can be parsed into data structures containing fixed fields: "timestamp, task identifier, and task keyword." The timestamp indicates the execution time of the corresponding subtask. The task identifier represents the pending task from which the corresponding subtask originates. The task keyword can represent the task content, associated device, etc., of the corresponding subtask.
[0130] Specifically, if a field with the same name in the first data structure and the second data structure has matching content, then the first result and the second result match. If the content of that field does not match, then the first result and the second result do not match.
[0131] Based on the embodiments of the present invention, the CPU can further analyze the results of different computing units, specifically parsing the results of different computing units into data structures containing the same fields, so as to facilitate matching detection for the same field names, which can improve the accuracy of matching detection of the first result and the second result, and further improve the accuracy of the output of the system's heterogeneous computing platform.
[0132] In one possible embodiment, when the CPU performs a matching check on fields with the same field names in the first and second data structures to obtain the detection result, it is specifically used for at least one of the following: if the field name of the preset field includes a timestamp, comparing the timestamp fields in the first data structure and the second data structure, and if the timestamp fields are aligned, determining that the first result and the second result match; if the field name of the preset field includes a task identifier, comparing the task identifier fields in the first data structure and the second data structure, and if the task identifier fields indicate that the results come from the same task to be processed, determining that the first result and the second result match; if the field name of the preset field includes a task keyword, performing scenario matching on the task keyword fields in the first data structure and the second data structure, and if the task keyword fields indicate that the scenarios are related, determining that the first result and the second result match.
[0133] Specifically, if the time interval between the timestamps in the first data structure and the second data structure does not exceed a preset threshold, then the timestamp fields of the first and second data structures are aligned, and the first and second results are matched. If the time interval exceeds the preset threshold, then the timestamp fields are not aligned.
[0134] For example, the "gear wear result detected by the GPU at 10:05:30" can be linked with the "abnormal speed (1500rpm→800rpm) signal after FPGA filtering at the same time" to generate a "fault-signal association report".
[0135] If the task identifier field between the first data structure and the second data structure indicates the same source task, that is, if the two subtasks are obtained by disassembling the same task to be processed, then the first result and the second result match.
[0136] For example, a CPU can break down a large number of tasks to be processed and distribute them to a GPU or FPGA. Subtasks from different tasks can run on the GPU or FPGA. During fusion, the results of multiple subtasks from the same task can be selected for fusion.
[0137] For example, different tasks can be used to inspect machines and equipment in different workshops. Based on this, the inspection results of machines and equipment in the same workshop can be merged. For example, the inspection results of equipment 1, equipment 2, equipment 3, etc. in workshop 1 can be output as a single merged result. Similarly, the inspection results of equipment 1, equipment 2, equipment 3, etc. in workshop 2 can be output as a single merged result.
[0138] The system can pre-configure matching scenarios; for example, it can preset a matching scenario between a device fault test scenario and a device fault signal analysis scenario. If the scenario indicated by the task keyword in the first data structure is a device fault test scenario, and the scenario indicated by the task keyword in the second data structure is a device fault signal analysis scenario, then the first result and the second result are matched.
[0139] For example, a CPU can break down and distribute a large number of tasks to be processed in different scenarios to a GPU or FPGA. Subtasks specific to different scenarios can then run on the GPU or FPGA. Examples include routine equipment signal analysis, equipment fault signal analysis, equipment fault testing, and periodic equipment inspection. The "equipment fault signal analysis scenario" and the "equipment fault testing scenario" can be treated as related scenarios. Information such as equipment fault confidence, fault diagrams, and fault locations from the GPU's execution tasks (belonging to the equipment fault testing scenario) can be fused with signals such as equipment temperature, rotational speed, vibration acceleration, and power from the FPGA's execution tasks (belonging to the equipment fault signal analysis scenario).
[0140] Based on the embodiments of the present invention, task fusion can be provided based on multiple dimensions such as whether timestamps are aligned, whether they originate from the same overall task, or whether the scenarios match. In actual operation, different dimensions can be selected for fusion according to the actual scenario requirements, which expands the applicability of the heterogeneous computing system platform and improves the practicality of the system's heterogeneous computing platform.
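The three matching dimensions can be sketched as follows. The field names, the 0.5 s alignment threshold, the scenario labels, and the choice to treat identical keywords as related are assumptions made for illustration; the source only states that the dimensions are selected according to the scenario.

```python
from dataclasses import dataclass


@dataclass
class ParsedResult:
    """Hypothetical parsed form holding the three preset fields."""
    timestamp: float   # seconds
    task_id: str
    keyword: str


# Pre-configured related scenario pairs (labels are assumed).
RELATED_SCENARIOS = {
    frozenset({"equipment_fault_testing", "equipment_fault_signal_analysis"}),
}


def timestamps_aligned(a, b, max_gap_s=0.5):
    return abs(a.timestamp - b.timestamp) <= max_gap_s


def same_task(a, b):
    return a.task_id == b.task_id


def scenarios_related(a, b):
    # Assumption: identical keywords also count as related.
    return (a.keyword == b.keyword
            or frozenset({a.keyword, b.keyword}) in RELATED_SCENARIOS)


CHECKS = {
    "timestamp": timestamps_aligned,
    "task_id": same_task,
    "keyword": scenarios_related,
}


def results_match(a, b, dimensions=("timestamp",)):
    """Match if any selected dimension matches (assumed combination rule)."""
    return any(CHECKS[d](a, b) for d in dimensions)
```

For instance, a GPU result from the equipment fault testing scenario and an FPGA result from the equipment fault signal analysis scenario match on the `keyword` dimension even when their task identifiers differ.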
[0141] Furthermore, the CPU is also configured to, when the detection result indicates that the first result and the second result match, write the first result and the second result into the same result area in the first memory address, and perform fusion processing on the first result and the second result in the same result area to obtain the first fusion result.
[0142] In another embodiment of the present invention, if the detection result indicates that the first result and the second result do not match, the first result is written into the first result area in the first memory address, and the second result is written into the second result area in the first memory address. The first memory address is pre-allocated with the first result area corresponding to the GPU and the second result area corresponding to the FPGA, and the first result and the second result are output.
[0143] In this embodiment of the invention, a matching test is also performed on the first result and the second result to fuse the matching results. If they match, the fused result of different computing power units can be output; if they do not match, the individual results of each computing power unit can be output.
[0144] In this embodiment of the invention, the matching results between different computing power units are fused, so that the subtasks between each computing power unit can be effectively combined for output, thereby improving the practicality of collaboration between different computing power units.
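The placement rule in paragraphs [0141] to [0143] reduces to a single branch. In this minimal sketch, the dictionary keys standing in for the result areas of the first memory address and the `fuse` callable are hypothetical.

```python
def store_results(first, second, matched, fuse):
    """Return a dict modelling the result areas in the CPU's memory."""
    if matched:
        # Matching results share one area and are fused there.
        return {"shared_area": fuse(first, second)}
    # Non-matching results go to the pre-allocated per-unit areas
    # and are output individually.
    return {"gpu_area": first, "fpga_area": second}
```

With `matched=True` the caller receives one fused entry; otherwise it receives the two individual results unchanged.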
[0145] Furthermore, the hardware zero-copy interconnect module is also used to write the first result and the second result into the first memory address when the second result does not include real-time control signals.
[0146] For example, real-time control signals may include device signal data at the time of a fault.
[0147] In this embodiment of the invention, when the second result does not include real-time control signals, the first result and the second result are written to the first memory address so that the CPU can determine the fusion result based on the first result and the second result to complete the computational requirements of complex AI tasks.
[0148] Furthermore, the CPU is also configured to send a direct transmission instruction to the hardware zero-copy interconnect module; the hardware zero-copy interconnect module is also configured to, in response to the direct transmission instruction, read the first fusion result from the first memory address based on the cross-unit memory mapping information, and write the first fusion result to the third memory address; the FPGA is also configured to transmit the first fusion result to the control device through a specified serial port, so that the control device executes corresponding control instructions based on real-time signals.
[0149] In this embodiment of the invention, the fusion result can be directly transmitted to the FPGA via a hardware zero-copy interconnect module, and then output by the FPGA. A lower-latency transmission link can be configured between the FPGA and external devices; for example, the designated serial port can be an onboard dual RS422 serial port. Transmission via the FPGA can significantly reduce transmission latency.
[0150] In this embodiment of the invention, the low latency characteristics of FPGA and its links can be effectively utilized to output the fusion results in real time, thereby improving the efficiency and practicality of the heterogeneous computing platform.
[0151] Furthermore, the CPU is also used to output the first fusion result to a display via a multimedia display interface, or to send the first fusion result to a host computer via an Ethernet port mounted on the board, or to write the first fusion result to a solid-state drive.
[0152] In this embodiment of the invention, the CPU can display the fusion result. The fused result (such as a fault labeling image + signal curve) is output to a display via the front panel HDMI interface for on-site personnel to view. For example, if the fusion result includes real-time display data that needs to be displayed in real time, the CPU can output the fused "fault labeling image + signal curve" to the display for real-time display via one HDMI display interface. The data is read directly from the CPU's DDR4 memory without going through the VPX link.
[0153] For example, the CPU can transmit the fusion results over a network. The results can be uploaded to the host computer via the front panel RJ45 Gigabit Ethernet port or the onboard 1×1000Base-T Ethernet port. Alternatively, the CPU can store the fusion results locally. The complete results, including raw data, processing procedures, and the final report, can be written to an M.2 2TB SSD for easy traceability and review. For instance, non-real-time data in the fusion results can be archived outside the real-time path. The CPU can upload the associated report and raw data to the host computer via the 1×1000Base-T Gigabit Ethernet port, or write them to the M.2 2TB SSD for long-term storage. This leverages the versatility of Ethernet links and avoids consuming core bandwidth on the transmission links between computing units.
[0154] After the CPU finishes its output, it sends a task completion signal to the hardware device and releases the GPU and FPGA resources used for this task, preparing for the next task.
[0155] Based on the embodiments of the present invention, multiple output methods for the fusion results are provided, allowing the fusion results to be displayed in real time as needed, sent to a host computer, or stored on a solid-state drive. This improves the practicality of the heterogeneous computing platform.
[0156] Furthermore, the hardware zero-copy interconnect module is also used to read the first result from the second memory address and write the first result to the third memory address when the second result includes a real-time control signal; The FPGA is further configured to fuse the first result and the second result in the third memory address to obtain a second fusion result, and output the second fusion result to the control device so that the control device can perform real-time control based on the second fusion result.
[0157] In this embodiment of the invention, the FPGA may be configured with two onboard RS422 serial ports and a 32×LVDS interface to ensure low latency in real-time signal transmission. For example, the FPGA can directly receive data transmitted from external devices via the two onboard RS422 serial ports and the 32×LVDS interface. The FPGA can also transmit real-time control signals back to the control device via the two onboard RS422 serial ports, with a latency ≤ 0.5 seconds.
[0158] Based on this, the first result from the GPU can be directly transmitted to the FPGA via a hardware zero-copy interconnect module, and then fused and output by the FPGA, eliminating the CPU intermediate process.
[0159] For example, the results from GPU and FPGA can be fused to generate a related report such as "10:23:45, gear wear detected (confidence level 98%), at which time the equipment speed is abnormal (1500rpm→800rpm)", which can be sent back to the control equipment so that the control equipment can repair or debug the equipment based on the abnormal report.
[0160] Based on the embodiments of the present invention, the high efficiency and low latency of the FPGA and its transmission link can be used to transmit signals that require real-time control to external devices without going through the CPU. This ensures that the external devices can act on the results quickly and effectively, and improves the efficiency of the heterogeneous computing platform.
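The choice between CPU-side and FPGA-side fusion described in paragraphs [0145] and [0156] can be summarised as a routing decision. This sketch assumes the second result is represented as a dictionary with a hypothetical `realtime_control` flag; the returned labels are illustrative, not terms from the source.

```python
def choose_fusion_path(second_result: dict) -> dict:
    """Decide where the first and second results are fused."""
    if second_result.get("realtime_control", False):
        # [0156]: copy the GPU result to the FPGA, which fuses it
        # and outputs the second fusion result to the control device.
        return {"copy_to": "fpga_third_memory_address", "fused_by": "FPGA"}
    # [0145]: copy both results to the CPU, which produces the
    # first fusion result.
    return {"copy_to": "cpu_first_memory_address", "fused_by": "CPU"}
```

Keeping the decision in one place makes it explicit that the CPU is bypassed only when the FPGA result carries a real-time control signal.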
[0161] Figure 2 is the second structural schematic diagram of the heterogeneous computing platform for AI provided by the present invention. As shown in Figure 2, the heterogeneous computing platform adopts a heterogeneous computing architecture of CPU + FPGA + GPU + hardware zero-copy interconnect module. In this platform, the CPU interacts with the GPU and the FPGA respectively through the hardware zero-copy interconnect module. The memory addresses of the CPU, GPU, and FPGA are the first memory address, the second memory address, and the third memory address respectively, and the memory address of each computing unit includes processing memory and shared memory. The hardware zero-copy interconnect module can record the processing memory addresses of each computing unit. Through the hardware zero-copy interconnect module, data interaction can be realized between the processing memory of the CPU and the processing memory of the GPU, as well as between the processing memory of the CPU and the processing memory of the FPGA.
[0162] Figure 3 is a schematic diagram illustrating the principle of computation implemented by the heterogeneous computing platform for AI provided by the present invention. As shown in Figure 3, external data (such as sensor signals, image data, and task instructions) enters the CPU. After the CPU completes data loading, preprocessing, and distribution preparation, it obtains image data, task instructions, and signal data. The image data and signal data are distributed to the GPU and FPGA respectively through the hardware zero-copy interconnect module: image data is sent to the GPU for processing, and signal data is sent to the FPGA for processing. Task instructions can be processed by the CPU itself. The CPU then obtains the results from the GPU and FPGA through the hardware zero-copy interconnect module, fuses them, and outputs them. For example, the CPU can output the fused "fault labeling image + signal curve" to a display for real-time display via an HDMI display interface. As another example, the CPU transmits data to the FPGA over the GTH link of the P2/P3 interface through the hardware zero-copy interconnect module, and the FPGA transmits the data back to the field control equipment via two RS422 serial ports. As yet another example, the CPU uploads associated reports and raw data to the host computer via an Ethernet port. Finally, the CPU can also output debugging data.
[0163] It should be noted that in related technologies, traditional single-board computers mostly rely on the computing power of a single CPU core, which cannot support the parallel computing needs of complex tasks such as AI inference and signal processing. For example, in scenarios such as object detection based on YOLOv5 and natural language processing based on BERT, relying solely on serial CPU computation will cause task latency to exceed the application threshold.
[0164] The heterogeneous computing platform for AI provided in this invention integrates a CPU, FPGA, GPU, and hardware zero-copy interconnect module onto a single board, achieving hardware-level collaborative optimization of the heterogeneous architecture and breaking through the efficiency bottleneck of traditional architectures. Because the CPU allocates to the GPU and the FPGA sub-tasks that match their respective computing characteristics, each unit focuses on its core computing advantages, reducing cross-unit computing power redundancy. The platform can simultaneously achieve the high efficiency of parallel computing on GPUs and the low latency and high real-time performance of FPGAs, while improving the collaborative efficiency between units, thereby enhancing the overall processing efficiency of the heterogeneous computing platform. Furthermore, this invention employs a four-dimensional parallel technical approach of "CPU scheduling + GPU computing + FPGA preprocessing + hardware zero-copy interconnect transmission," which is particularly effective in complex task scenarios and multi-task concurrent scenarios. For example, in a "real-time signal processing + AI fault detection" scenario, end-to-end latency is significantly shortened and processing efficiency is improved. As another example, in a multi-task concurrent scenario of "simultaneously processing 2 YOLOv5 inference channels + 4 RS422 signal analysis channels", the CPU utilization rate in the data transmission stage is greatly reduced due to the hardware zero-copy interconnect transmission, which lowers the CPU load, and the computing power thus released can support more complex multi-task concurrency.
[0165] The calculation method provided by the present invention is described below. The calculation method described below can be referred to in correspondence with the heterogeneous computing platform for artificial intelligence (AI) described above.
[0166] Figure 4 is a flowchart illustrating the computation method provided by the present invention. The computation method is applied to the aforementioned heterogeneous computing platform for artificial intelligence (AI). The heterogeneous computing platform for AI includes a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board. The CPU is connected to the GPU and the FPGA respectively through the hardware zero-copy interconnect module. As shown in Figure 4, the method includes: Step 401: The CPU decomposes the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and sends the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA.
[0167] Step 402: The AI computing power parallel subtask is executed by the GPU based on the first task data in the second memory address of the GPU to obtain a first result, and the first result is written to the second memory address.
[0168] Step 403: Execute the customized AI acceleration subtask through the FPGA based on the second task data in the third memory address of the FPGA, obtain the second result, and write the second result into the third memory address.
[0169] Step 404: Based on the cross-unit memory mapping information, the hardware zero-copy interconnect module reads the first result from the second memory address and the second result from the third memory address, and writes the read first result and second result into the first memory address in the CPU.
[0170] The cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA.
[0171] Step 405: The CPU performs a fusion process on the first result and the second result to obtain a first fusion result.
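Steps 401 to 405 can be traced end to end with a small software stand-in. The sketch below is an assumption-laden model: `gpu_fn`, `fpga_fn`, and `fuse_fn` are hypothetical callables replacing the real computing units, dictionaries replace the three memory addresses, and a thread pool merely imitates the two subtasks running concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task: dict, gpu_fn, fpga_fn, fuse_fn):
    """Walk through steps 401-405 with software stand-ins."""
    # Step 401: the CPU decomposes the task and dispatches the data
    # (the "image" and "signal" keys are assumed task fields).
    gpu_mem = {"task_data": task["image"]}    # second memory address
    fpga_mem = {"task_data": task["signal"]}  # third memory address
    cpu_mem = {}                              # first memory address

    with ThreadPoolExecutor(max_workers=2) as pool:
        # Step 402: the GPU executes the parallel subtask.
        first = pool.submit(gpu_fn, gpu_mem["task_data"])
        # Step 403: the FPGA executes the customized subtask.
        second = pool.submit(fpga_fn, fpga_mem["task_data"])
        gpu_mem["result"] = first.result()
        fpga_mem["result"] = second.result()

    # Step 404: the interconnect moves both results to the CPU.
    cpu_mem["first_result"] = gpu_mem["result"]
    cpu_mem["second_result"] = fpga_mem["result"]

    # Step 405: the CPU fuses the two results.
    return fuse_fn(cpu_mem["first_result"], cpu_mem["second_result"])
```

Calling `run_task({"image": [1, 2, 3], "signal": [4, 5]}, sum, max, lambda a, b: {"gpu": a, "fpga": b})` exercises every step with trivial stand-in functions.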
[0172] Figure 5 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530, and a communication bus 540, wherein the processor 510, the communication interface 520, and the memory 530 communicate with each other through the communication bus 540. The processor 510 can call logical instructions in the memory 530 to execute a calculation method. This method is applied to the heterogeneous computing platform for artificial intelligence (AI) described above. The heterogeneous computing platform for AI includes a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board. The CPU is connected to the GPU and the FPGA respectively through the hardware zero-copy interconnect module. The method includes: decomposing, by the CPU, the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and sending the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA; executing, by the GPU, the AI computing power parallel subtask based on the first task data in the second memory address of the GPU to obtain a first result, and writing the first result to the second memory address; executing, by the FPGA, the customized AI acceleration subtask based on the second task data in the third memory address of the FPGA to obtain a second result, and writing the second result to the third memory address; reading, by the hardware zero-copy interconnect module, the first result from the second memory address and the second result from the third memory address based on cross-unit memory mapping information, and writing the read first result and second result to the first memory address of the CPU, wherein the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing power units in the CPU, GPU, and FPGA; and fusing, by the CPU, the first result and the second result to obtain a first fusion result.
[0173] Furthermore, the logical instructions in the aforementioned memory 530 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0174] In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the calculation method provided by the method embodiments described above. This method is applied to the heterogeneous computing platform for artificial intelligence (AI) described above. The heterogeneous computing platform for AI includes a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board. The CPU is connected to the GPU and the FPGA respectively through the hardware zero-copy interconnect module. The method includes: decomposing, by the CPU, the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and sending the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA; executing, by the GPU, the AI computing power parallel subtask based on the first task data in the second memory address of the GPU to obtain a first result, and writing the first result to the second memory address; executing, by the FPGA, the customized AI acceleration subtask based on the second task data in the third memory address of the FPGA to obtain a second result, and writing the second result to the third memory address; reading, by the hardware zero-copy interconnect module and based on cross-unit memory mapping information, the first result from the second memory address and the second result from the third memory address, and writing the first result and the second result to the first memory address of the CPU, wherein the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA; and fusing, by the CPU, the first result and the second result to obtain a first fusion result.
[0175] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the calculation method provided by the method embodiments described above. This method is applied to the aforementioned heterogeneous computing platform for artificial intelligence (AI), which includes a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board. The CPU is connected to the GPU and the FPGA respectively via the hardware zero-copy interconnect module. The method includes: decomposing, by the CPU, the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and sending the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA; executing, by the GPU, the AI computing power parallel subtask based on the first task data in the second memory address of the GPU to obtain a first result, and writing the first result to the second memory address; executing, by the FPGA, the customized AI acceleration subtask based on the second task data in the third memory address of the FPGA to obtain a second result, and writing the second result to the third memory address; reading, by the hardware zero-copy interconnect module and based on cross-unit memory mapping information, the first result from the second memory address and the second result from the third memory address, and writing the first result and the second result to the first memory address of the CPU, wherein the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA; and fusing, by the CPU, the first result and the second result to obtain a first fusion result.
[0176] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the embodiments of the present invention according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0177] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0178] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A heterogeneous computing platform for artificial intelligence (AI), characterized in that it includes: a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board, wherein the CPU is connected to the GPU and the FPGA respectively through the hardware zero-copy interconnect module; The CPU is used to decompose the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and to send the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA; The GPU is used to execute the AI computing power parallel subtask based on the first task data in the second memory address of the GPU, obtain a first result, and write the first result to the second memory address; The FPGA is used to execute the customized AI acceleration subtask based on the second task data in the third memory address of the FPGA, obtain a second result, and write the second result to the third memory address; The hardware zero-copy interconnect module is used to read the first result from the second memory address and the second result from the third memory address based on cross-unit memory mapping information, and to write the first result and the second result to the first memory address in the CPU, wherein the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA; The CPU is further configured to fuse the first result and the second result to obtain a first fusion result.
2. The heterogeneous computing platform for artificial intelligence (AI) according to claim 1, characterized in that, The hardware zero-copy interconnect module is further configured to read the first task data from the first memory address and write the first task data to the second memory address based on the cross-unit memory mapping information; and to read the second task data from the first memory address and write the second task data to the third memory address.
3. The heterogeneous computing platform for artificial intelligence (AI) according to claim 1, characterized in that, The CPU is also used to collect the operating status data of each computing unit in the CPU, GPU and FPGA, as well as the link status of the hardware zero-copy interconnect module, through the intelligent platform management interface bus during the process of the CPU executing the general computing subtask, the GPU executing the AI computing power parallel subtask, and the FPGA executing the customized AI acceleration subtask. The CPU is also configured to dynamically adjust the operating status of each computing unit in response to detecting an operating abnormality based on the operating status data of each computing unit. The CPU is further configured to adjust the transmission state of the hardware zero-copy interconnect module in response to detecting an abnormal transmission state of the hardware zero-copy interconnect module based on the link state of the hardware zero-copy interconnect module.
4. The heterogeneous computing platform for artificial intelligence (AI) according to claim 1, characterized in that, The CPU is further configured to send a first direct-connection collaboration instruction to the hardware zero-copy interconnect module when the second result includes data that requires the GPU to process collaboratively. The hardware zero-copy interconnect module is also used to respond to the first direct connection collaboration instruction, read the second result from the third memory address based on the cross-unit memory mapping information, and write the second result into the second memory address; The GPU is also used to execute the AI computing power parallel subtask based on the second result and the first task data to obtain the first result.
5. The heterogeneous computing platform for artificial intelligence (AI) according to claim 1, characterized in that, The FPGA is also used to acquire timestamp data and write it to the third memory address, and to execute the customized AI acceleration subtask based on the timestamp data and the second task data to obtain the second result, wherein the second result is timestamped. The hardware zero-copy interconnect module is also used to read the timestamp data from the third memory address based on the cross-unit memory mapping information, and write the timestamp data into the second memory address; The GPU is also used to execute the AI computing power parallel subtask based on the timestamp data and the first task data to obtain the first result, wherein the first result is timestamped.
6. The heterogeneous computing platform for artificial intelligence (AI) according to any one of claims 1 to 5, characterized in that, The CPU is further configured to: verify the first result and the second result respectively based on a unit verification strategy; if the verification of the first result fails, send a reprocessing instruction to the GPU; if the verification of the second result fails, send a reprocessing instruction to the FPGA; once both the first result and the second result have passed verification, perform a matching test on the first result and the second result to obtain a detection result; and, based on the detection result, fuse the first result and the second result to obtain the first fusion result.
7. The heterogeneous computing platform for artificial intelligence (AI) according to claim 6, characterized in that, The CPU is further configured to, when the detection result indicates that the first result and the second result match, write the first result and the second result into the same result area in the first memory address, and perform fusion processing on the first result and the second result in the same result area to obtain the first fusion result.
8. The heterogeneous computing platform for artificial intelligence (AI) according to any one of claims 1 to 5, characterized in that, The hardware zero-copy interconnect module is further configured to write the first result and the second result into the first memory address when the second result does not include real-time control signals.
9. The heterogeneous computing platform for artificial intelligence (AI) according to any one of claims 1 to 5, characterized in that, The CPU is also used to send a direct transmission instruction to the hardware zero-copy interconnect module; The hardware zero-copy interconnect module is also used to respond to the direct transmission instruction, read the first fusion result from the first memory address based on the cross-unit memory mapping information, and write the first fusion result to the third memory address; The FPGA is also used to transmit the first fusion result to the control device through a specified serial port, so that the control device executes corresponding control commands based on real-time signals.
10. The heterogeneous computing platform for artificial intelligence (AI) according to any one of claims 1 to 5, characterized in that, The CPU is also used to output the first fusion result to a display via a multimedia display interface, or to send the first fusion result to a host computer mounted on the board via an Ethernet port, or to write the first fusion result to a solid-state drive.
11. The heterogeneous computing platform for artificial intelligence (AI) according to any one of claims 1 to 5, characterized in that, The hardware zero-copy interconnect module is further configured to read the first result from the second memory address and write the first result to the third memory address when the second result includes real-time control signals; The FPGA is further configured to fuse the first result and the second result in the third memory address to obtain a second fusion result, and output the second fusion result to the control device so that the control device can perform real-time control based on the second fusion result.
12. A calculation method, characterized in that, The method is applied to the heterogeneous computing platform for artificial intelligence (AI) as described in any one of claims 1 to 11, wherein the heterogeneous computing platform for AI includes a central processing unit (CPU), a hardware zero-copy interconnect module, a graphics processing unit (GPU), and a field-programmable gate array (FPGA) integrated on the same board, wherein the CPU is connected to the GPU and the FPGA respectively through the hardware zero-copy interconnect module; the method includes: The CPU decomposes the task to be processed into a general computing subtask, an AI computing power parallel subtask, and a customized AI acceleration subtask, and sends the AI computing power parallel subtask to the GPU and the customized AI acceleration subtask to the FPGA; The GPU executes the AI computing power parallel subtask based on the first task data in the second memory address of the GPU to obtain a first result, and writes the first result to the second memory address; The FPGA executes the customized AI acceleration subtask based on the second task data in the third memory address of the FPGA to obtain a second result, and writes the second result to the third memory address; The hardware zero-copy interconnect module reads the first result from the second memory address and the second result from the third memory address based on cross-unit memory mapping information, and writes the first result and the second result to the first memory address in the CPU, wherein the cross-unit memory mapping information includes the mapping relationship of memory addresses between the computing units in the CPU, GPU, and FPGA; The CPU performs a fusion process on the first result and the second result to obtain a first fusion result.
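The verification and routing behaviour recited in claims 6, 8 and 11 can be sketched in software as follows. This is an illustrative sketch only, not the claimed implementation: the function names `verify_with_reprocessing` and `choose_fusion_site`, the `max_attempts` limit, the `real_time_control` key, and the sample results are all hypothetical, and the claims do not state a retry limit or a data layout for the results.

```python
def verify_with_reprocessing(run_unit, is_valid, max_attempts=3):
    """Re-run a computing unit until its result passes verification (claim 6).

    max_attempts is an assumption; the claims do not state a retry limit.
    Returns the verified result and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        result = run_unit()        # stands in for a (re)processing instruction
        if is_valid(result):       # stands in for the unit verification strategy
            return result, attempt
    raise RuntimeError("result failed verification after retries")


def choose_fusion_site(second_result):
    """Decide where fusion takes place (claims 8 and 11)."""
    if second_result.get("real_time_control", False):
        # Claim 11: the first result is written to the third memory address
        # and the FPGA produces the second fusion result.
        return "FPGA"
    # Claim 8: both results are written to the first memory address
    # and the CPU produces the first fusion result.
    return "CPU"


# Hypothetical GPU stand-in whose first output fails verification.
gpu_outputs = iter([None, [0.9, 0.1]])
first_result, gpu_attempts = verify_with_reprocessing(
    run_unit=lambda: next(gpu_outputs),
    is_valid=lambda r: r is not None,
)

# Hypothetical FPGA stand-in whose output carries a real-time control signal.
second_result, fpga_attempts = verify_with_reprocessing(
    run_unit=lambda: {"real_time_control": True, "signal": 1},
    is_valid=lambda r: "signal" in r,
)

site = choose_fusion_site(second_result)
print(gpu_attempts, fpga_attempts, site)
```

Keeping the routing decision in a separate function from the verification loop reflects the order recited in the claims: results are verified first, and only verified results are routed to the CPU or the FPGA for fusion.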