Fault-tolerant methods, systems, and media for neural network hardware inference in BNCT radiation environments
By performing block-level verification and local rollback on the weights of the deep learning model in the BNCT radiation environment, the invention addresses model corruption caused by single-event upsets and outputs highly reliable, low-latency inference results, meeting the real-time and continuity requirements of the BNCT scenario.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SICHUAN UNIV
- Filing Date
- 2026-06-02
- Publication Date
- 2026-06-30
AI Technical Summary
In the BNCT radiation environment, the model weight data of existing deep learning hardware accelerators is prone to corruption by single-event upsets, which invalidates inference results. Furthermore, existing error correction mechanisms cannot effectively cope with multi-bit errors and cannot meet the requirements of real-time performance and continuity.
A hardware-software co-processing weight block verification and pipeline local rollback mechanism is adopted. The weights of the 3D deep learning model are divided into multiple blocks, and an integrity check code is generated for each block. Verification is performed before loading, and a local rollback operation is performed when an error is detected to ensure that the inference pipeline is not interrupted and to maintain the continuity of the calculated intermediate feature maps and scheduling states.
It achieves highly reliable and low-latency deep learning inference in the BNCT radiation environment, avoids the cross-layer propagation of errors, maintains the throughput and real-time performance of the computation flow, and meets the high-confidence prediction requirements of BNCT scenarios.
Figure CN122309231A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer architecture and hardware fault tolerance technology, specifically to a neural network hardware inference fault tolerance method, system, and medium for BNCT radiation environments.
Background Technology
[0002] With the development of deep learning technology, using 3D deep learning models to process multimodal tensor data (such as medical images) has become a trend. However, 3D deep learning models typically have a large number of parameters, making it difficult to load them all at once into the on-chip cache of resource-constrained edge computing hardware (such as FPGAs and ASICs). Therefore, hardware accelerators typically employ dynamic weight-block loading and dataflow pipeline architectures to complete inference computation.
[0003] In specialized applications such as aerospace, the nuclear industry, and medical accelerators, computing devices face severe physical environmental challenges. Taking boron neutron capture therapy (BNCT) as an example, edge computing devices deployed near accelerators are exposed to intense neutron and gamma radiation. These high-energy particles can easily penetrate chip packaging and strike static random access memory (SRAM) cells or flip-flops, causing single-event upsets (SEUs). The result is unpredictable bit flips (i.e., "silent data corruption") in deep learning weight data stored in external memory or on-chip cache.
[0004] Existing deep learning hardware acceleration architectures are typically built on the assumption that "weight data is absolutely safe and statically immutable," and lack underlying mechanisms to detect and recover from weights altered by radiation at runtime. In traditional dataflow inference architectures, once a single-event upset occurs in the weights of a layer, the error is amplified by the nonlinear activation functions and propagates along the pipeline, ultimately invalidating the output entirely. Although some systems employ ECC (Error Correction Code) memory, it can only correct single-bit errors and cannot cope with multi-bit flips caused by radiation. If traditional software-level exception interrupt mechanisms are used for rollback and recalculation, all intermediate feature maps in the current pipeline must be cleared. This not only loses the inference scheduling context but also causes significant computational latency, making it impossible to meet the real-time and data continuity requirements of special scenarios such as BNCT.
[0005] Therefore, there is an urgent need for a fault-tolerant underlying hardware architecture for radiation environments that can achieve dynamic verification of model weights and rapid error recovery without interrupting the data flow inference pipeline.
Summary of the Invention
[0006] The purpose of this invention is to address the technical problem that existing resource-constrained edge computing hardware is prone to model weight data corruption due to single-event upsets when performing deep learning inference in high-radiation environments (such as around the BNCT accelerator), leading to pipeline interruptions and invalid inference results. This invention provides a fault-tolerant method, system, and medium for neural network hardware inference in the BNCT radiation environment. Through a hardware-software co-processing weight block verification and pipeline local rollback mechanism, this invention achieves highly reliable, low-latency inference of large deep learning models on high-radiation, low-cache hardware.
[0007] The technical solution of the present invention is as follows: A neural network hardware inference fault-tolerant method for BNCT radiation environments, executed by computing hardware deployed near boron neutron capture therapy equipment, for outputting a three-dimensional neutron flux distribution prediction result for a target region, including:
Step S1: Obtain a pre-trained 3D deep learning model, which is trained based on multimodal medical images and their corresponding voxel-level neutron flux labels;
Step S2: Based on the on-chip storage capacity of the computing hardware, divide the model weights of the 3D deep learning model into multiple weight blocks, and generate an integrity check code for each weight block;
Step S3: During the deep learning inference stage, load the weight blocks from the external memory into the on-chip cache of the computing hardware in weight block order, and perform an integrity check on each weight block after loading is completed and before it is sent to the computing unit for inference;
Step S4: When the verification fails due to BNCT radiation, perform a local rollback operation on the erroneous weight block without interrupting the current dataflow inference pipeline. The local rollback operation is selected from one or more of reloading the weight block and loading a pre-stored redundant copy, and is repeated until the verification passes or the rollback is successful;
Step S5: After all weight blocks have passed verification, the computing hardware performs three-dimensional deep learning inference calculations in a dataflow manner and outputs the three-dimensional neutron flux distribution prediction results.
[0008] Furthermore, the multimodal medical images include one or more of CT images, MRI images, and PET images, and the voxel-level neutron flux labels are generated by Monte Carlo simulation.
[0009] Furthermore, the weight blocks are divided in one or more ways, such as by convolution kernel group, by channel group, or by hierarchical structure, so that the byte size of each weight block is not greater than the on-chip storage capacity of the computing hardware.
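As an illustration of the channel-group division described above, the following minimal Python sketch splits each layer's output channels into groups whose weights fit within the on-chip capacity. The function name `partition_by_channel_group`, the layer names, and the byte sizes are hypothetical and chosen only for this example; they are not taken from the embodiments.

```python
from typing import Dict, List, Tuple


def partition_by_channel_group(
    layer_sizes: Dict[str, Tuple[int, int]],
    capacity_bytes: int,
) -> List[Tuple[str, int, int]]:
    """Split each layer's output channels into groups that fit on chip.

    layer_sizes maps a layer name to (out_channels, bytes_per_out_channel).
    Returns (layer, first_channel, end_channel_exclusive) tuples in load order.
    """
    blocks = []
    for name, (out_channels, bytes_per_channel) in layer_sizes.items():
        if bytes_per_channel > capacity_bytes:
            # A single channel does not fit; a finer division would be needed.
            raise ValueError(f"{name}: one channel exceeds on-chip capacity")
        channels_per_block = capacity_bytes // bytes_per_channel
        for start in range(0, out_channels, channels_per_block):
            end = min(start + channels_per_block, out_channels)
            blocks.append((name, start, end))
    return blocks


# Hypothetical layer table: 8 channels of 100 bytes, 5 channels of 300 bytes.
layers = {"conv1": (8, 100), "conv2": (5, 300)}
blocks = partition_by_channel_group(layers, capacity_bytes=640)
```

Every block produced this way has a byte size no greater than `capacity_bytes`, which is the constraint stated in the paragraph above.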
[0010] Furthermore, the integrity check code is selected from one or more of Cyclic Redundancy Check (CRC), checksum, and hash digest.
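A minimal software sketch of the CRC option is given below, assuming CRC-32 as provided by Python's standard `zlib` module. The helper names `make_check_code`, `verify_block`, and `flip_bit` are hypothetical; in the invention the verification is performed by hardware logic, so this only illustrates the check itself.

```python
import zlib


def make_check_code(block: bytes) -> int:
    """Generate a CRC-32 integrity check code for one weight block."""
    return zlib.crc32(block) & 0xFFFFFFFF


def verify_block(block: bytes, expected: int) -> bool:
    """Return True when the loaded block still matches its check code."""
    return make_check_code(block) == expected


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Simulate a single-event upset by flipping one bit of the block."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


weights = bytes(range(64))            # stand-in for a quantized weight block
code = make_check_code(weights)       # computed once, when the model is exported
upset = flip_bit(weights, 13)         # one flipped bit is detected by the CRC
```

A checksum or hash digest could be substituted for `zlib.crc32` without changing the structure of the check.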
[0011] Furthermore, the computing hardware is one or more hardware accelerators selected from FPGA, ASIC, DSP, or edge GPU; the data flow method is a computing structure in which the input feature map, weight block, and intermediate feature map flow continuously in the on-chip cache in a pipeline form.
[0012] Furthermore, in step S4, the local rollback operation ensures that the calculated intermediate feature maps are not cleared from the on-chip cache, and the pipeline's inference scheduling state remains continuous.
[0013] This invention also proposes a neural network hardware inference fault-tolerant system for BNCT radiation environments, comprising:
The weight management module is used to divide the weights of the pre-trained 3D deep learning model used to predict neutron flux distribution into multiple weight blocks according to the on-chip storage capacity of the computing hardware, and generate an integrity check code for each weight block.
The weight loading and verification module is used to load weight blocks from external memory to on-chip cache sequentially during the inference phase, and to perform integrity verification after loading and before sending them to the computing unit to detect weight data errors induced by BNCT radiation.
The rollback processing module is used to perform a local rollback operation on the erroneous weight block when the verification fails, without interrupting the current data flow inference pipeline. The local rollback operation is selected from one or more of reloading the weight block and loading a pre-stored redundant copy.
The dataflow inference module is used to perform 3D deep learning inference calculations in a dataflow manner after all weight blocks have passed verification, and output the 3D neutron flux distribution prediction results of the target region.
[0014] Furthermore, the dataflow inference module is deployed on one or more hardware platforms, including FPGA, ASIC, DSP, or edge GPU.
[0015] Furthermore, both the weight loading and verification module and the rollback processing module are implemented in hardware logic to avoid software scheduling errors caused by radiation.
[0016] The present invention also proposes a computer-readable storage medium for storing instructions that, when executed, cause the method described above to be implemented.
[0017] Compared with existing technologies, the advantages of this invention are: 1. Fine-grained hardware fault tolerance based on block-level verification. This invention physically divides the weights of a massive 3D deep learning model into blocks according to the on-chip storage capacity of the computing hardware and binds them with integrity check codes. Hardware-level verification is performed before data is sent to the computing unit for inference. This not only breaks through the storage bottleneck of limited hardware (such as FPGAs) that cannot deploy large models, but also strictly isolates radiation-induced single-event upset errors at the "block" level, preventing the cross-layer propagation of errors. This represents an architectural upgrade from passively accepting radiation errors to actively detecting silent data corruption.
[0018] 2. Non-blocking local rollback mechanism for the pipeline. To address the drawback of traditional software exception handling leading to pipeline cleanup, this invention designs a non-interrupted local rollback strategy. When radiation-induced weight verification failure is detected, the system only triggers the reloading of the erroneous weight block or replacement with a redundant copy, strictly ensuring that the intermediate feature maps already computed in the on-chip cache are not cleared, thus maintaining the continuity of the inference scheduling state. This mechanism reduces error recovery time from the traditional "network-wide recalculation level" to the "microsecond-level single-block data retrieval level," maximizing the throughput and real-time performance of the computation flow.
[0019] 3. Pure Hardware Logic Dataflow Computation Architecture. The dataflow approach of this invention enables the input feature map, weight block, and intermediate feature map to flow continuously in an on-chip cache in a pipelined manner. Furthermore, the weight loading and verification module and the rollback processing module of this invention are both implemented with pure hardware logic circuits, completely eliminating the dependence on general-purpose processor (CPU) software scheduling. This fundamentally eliminates the hidden danger of the entire operating system or scheduler crashing due to soft errors caused by CPU register radiation, greatly improving the robustness of the system in extreme physical environments.
[0020] 4. A highly adaptable and reliable prediction agent for BNCT scenarios. Based on the aforementioned underlying hardware fault-tolerant architecture, this invention can safely and stably process multimodal medical images and output high-confidence three-dimensional neutron flux distribution prediction results. Without altering the physical mechanisms of existing medical equipment, it provides a deep learning computing power foundation for BNCT radiation environments that maintains high availability even under harsh conditions, resolving the technical contradictions of excessively long traditional Monte Carlo simulation times and the inability of ordinary GPUs to stably survive in radiation zones.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention. Those skilled in the art can obtain other drawings based on these drawings.
[0022] Figure 1 is a flowchart of a neural network hardware inference fault-tolerant method for BNCT radiation environments;
Figure 2 is a flowchart of the fault-tolerant prediction execution method for a BNCT scenario proposed in Example 3;
Figure 3 is a flowchart of weight block loading and integrity verification in a radiation environment in Example 3, showing the external storage method of the weight blocks, the DMA loading order, the processing flow during verification, and the retransmission or backup weight replacement mechanism when verification fails;
Figure 4 is a schematic diagram of FPGA-based convolutional inference in Example 3, showing the encoder-decoder network after pruning and quantization, as well as the input-output channel relationship of each convolutional layer;
Figure 5 is a diagram of the hardware inference fault-tolerant device architecture proposed in Example 4;
Figure 6 illustrates the error recovery process under radiation conditions, showing the local rollback path and intermediate feature retention mechanism when verification fails.
Detailed Implementation
[0023] The features and performance of the present invention will be further described in detail below with reference to embodiments.
[0024] Example 1
This embodiment proposes a fault-tolerant method, system, and medium for neural network hardware inference in a BNCT radiation environment. It can be deployed on various heterogeneous computing hardware platforms such as FPGA, ASIC, DSP, and edge GPU to achieve stable deep learning inference in a radiation environment. Specifically addressing the single-event upsets induced by neutron and gamma radiation near the BNCT accelerator, this embodiment provides weight integrity verification and a local rollback mechanism to ensure stable inference.
[0025] In this embodiment, referring to Figure 1, a neural network hardware inference fault-tolerant method for BNCT radiation environments is provided. The method is executed by computing hardware deployed near the boron neutron capture therapy device to output a three-dimensional neutron flux distribution prediction result for the target region, and specifically includes the following steps:
Step S1: Obtain a pre-trained 3D deep learning model, which is trained based on multimodal medical images and their corresponding voxel-level neutron flux labels;
Step S2: Based on the on-chip storage capacity of the computing hardware, divide the model weights of the 3D deep learning model into multiple weight blocks, and generate an integrity check code for each weight block;
Step S3: During the deep learning inference stage, load the weight blocks from the external memory into the on-chip cache of the computing hardware in weight block order, and perform an integrity check on each weight block after loading is completed and before it is sent to the computing unit for inference;
Step S4: When the verification fails due to BNCT radiation, perform a local rollback operation on the erroneous weight block without interrupting the current dataflow inference pipeline. The local rollback operation is selected from one or more of reloading the weight block and loading a pre-stored redundant copy, and is repeated until the verification passes or the rollback is successful;
Step S5: After all weight blocks have passed verification, the computing hardware performs three-dimensional deep learning inference calculations in a dataflow manner and outputs the three-dimensional neutron flux distribution prediction results.
[0026] In this embodiment, specifically, the multimodal medical images include one or more of CT images, MRI images, and PET images, and the voxel-level neutron flux labels are generated by Monte Carlo simulation.
[0027] In this embodiment, the weight blocks are specifically divided by one or more of the following methods: division by convolution kernel group, division by channel group, and division by hierarchical structure, so that the byte size of each weight block is not greater than the on-chip storage capacity of the computing hardware.
[0028] In this embodiment, specifically, the integrity check code is selected from one or more of Cyclic Redundancy Check (CRC), checksum, and hash digest.
[0029] In this embodiment, specifically, the computing hardware is one or more hardware accelerators selected from FPGA, ASIC, DSP, or edge GPU; the data flow method is a computing structure in which the input feature map, weight block, and intermediate feature map flow continuously in the on-chip cache in a pipeline form.
[0030] In this embodiment, specifically, in step S4, the local rollback operation ensures that the calculated intermediate feature map is not cleared from the on-chip cache, and the pipeline's inference scheduling state remains continuous.
[0031] Based on the same inventive concept, this embodiment also proposes a neural network hardware inference fault-tolerant system for BNCT radiation environments, comprising:
The weight management module is used to divide the weights of the pre-trained 3D deep learning model used to predict neutron flux distribution into multiple weight blocks according to the on-chip storage capacity of the computing hardware, and generate an integrity check code for each weight block.
The weight loading and verification module is used to load weight blocks from external memory to on-chip cache sequentially during the inference phase, and to perform integrity verification after loading and before sending them to the computing unit to detect weight data errors induced by BNCT radiation.
The rollback processing module is used to perform a local rollback operation on the erroneous weight block when the verification fails, without interrupting the current data flow inference pipeline. The local rollback operation is selected from one or more of reloading the weight block and loading a pre-stored redundant copy.
The dataflow inference module is used to perform 3D deep learning inference calculations in a dataflow manner after all weight blocks have passed verification, and output the 3D neutron flux distribution prediction results of the target region.
[0032] In this embodiment, specifically, the data flow inference module is deployed on one or more hardware platforms, including FPGA, ASIC, DSP, or edge GPU.
[0033] In this embodiment, specifically, the weight loading and verification module and the rollback processing module are both implemented in hardware logic to avoid software scheduling errors caused by radiation.
[0034] Based on the same inventive concept, embodiments of the present invention also provide a storage medium storing computer instructions that, when executed on a computer, cause the computer to execute a neural network hardware inference fault-tolerant method for BNCT radiation environments as described above.
[0035] In some alternative embodiments, aspects of the neural network hardware inference fault-tolerant method for a BNCT radiation environment can also be implemented as a program product comprising program code. When the program product is run on a device, the program code causes the device to perform the steps of the neural network hardware inference fault-tolerant method for a BNCT radiation environment described above, according to the various exemplary embodiments of the present invention.
[0036] It should be noted that although several units or sub-units of the apparatus have been mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of the invention, the features and functions of two or more units described above can be embodied in one unit. Conversely, the features and functions of one unit described above can be further divided and embodied by multiple units. Furthermore, although the operation of the method of the invention is described in a specific order in the drawings, this does not require or imply that these operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.
[0037] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical memory, etc.) containing computer-usable program code.
[0038] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0039] Program code for performing the operations of this invention can be written using any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.
[0040] In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0041] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0042] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0043] Example 2
Example 2 is a further explanation of Example 1. The specific steps of the neural network hardware inference fault-tolerant method, system, and medium for BNCT radiation environments proposed in Example 1 during practical application and deployment are as follows: First, the input raw medical images are intensity-normalized and spatially aligned, and a unified three-dimensional coordinate system is established to ensure that the input data is consistent in spatial scale and grayscale distribution, thereby reducing the model's sensitivity to input perturbations at the source.
[0044] Based on this, a lightweight 3D deep learning model can be constructed, which can adopt an encoder-decoder structure or other network structures suitable for 3D volumetric data. The model size can be reduced through model pruning and weight quantization to meet the computational and cache requirements of embedded hardware. Specifically, quantization uses an 8-bit integer format, and the formula is as follows: $W_q = \mathrm{round}(W / S) + Z$, where $S$ is the scale factor, $Z$ is the quantization zero-point parameter, chosen so that the accuracy loss is less than 1%, $W_q$ is the quantized 8-bit integer weight used for actual inference calculations, and $W$ is the corresponding floating-point weight, i.e., the original input data of the quantization operation.
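The 8-bit quantization step can be sketched in plain Python as follows. This is only an illustration that assumes the common affine scheme q = round(w / S) + Z with a signed range of [-128, 127]; the function names `quantize_int8` and `dequantize`, the way the scale and zero point are derived from the weight range, and the sample weights are assumptions of this sketch, not details confirmed by the embodiment.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float, int]:
    """Affine-quantize float weights to signed 8-bit integers.

    Returns (quantized values, scale factor S, zero point Z).
    """
    lo = min(weights + [0.0])          # make sure 0.0 lies inside the range
    hi = max(weights + [0.0])
    scale = (hi - lo) / 255.0 or 1.0   # fall back to 1.0 for all-zero weights
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 8-bit integers back to approximate floating-point weights."""
    return [scale * (q - zero_point) for q in quantized]


float_weights = [-1.0, -0.25, 0.0, 0.5, 1.0]   # hypothetical sample weights
q, s, z = quantize_int8(float_weights)
restored = dequantize(q, s, z)
```

The reconstruction error of each weight is bounded by roughly one and a half quantization steps here; whether the resulting accuracy loss stays below 1% must be measured on the actual model.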
[0045] In model compression, the on-chip storage capacity of the hardware is used as one of the constraints, so that the compressed model weights can be divided into several weight sub-blocks that can be loaded independently, laying the foundation for the controllability and stability of weight loading.
[0046] After obtaining the compressed model, the model weights are organized into blocks according to the on-chip storage capacity of the hardware. Each convolutional layer or several channels corresponds to a set of independent weight blocks. The block partitioning scheme is statically generated during the model export stage and corresponds one-to-one with the model structure, avoiding the additional overhead caused by temporary splitting at runtime. As a result, the hardware can pipeline the loading of each weight block in a fixed order and achieve sustainable inference under limited cache conditions.
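The static block scheme generated at export time could be recorded as a manifest, as in the hedged sketch below. It assumes the compressed weights are available as one contiguous byte string and that CRC-32 is the chosen check code; the function `build_manifest` and the field names `block_id`, `offset`, `length`, and `crc32` are hypothetical.

```python
import zlib
from typing import Dict, List


def build_manifest(weight_blob: bytes, block_bytes: int) -> List[Dict[str, int]]:
    """Cut the exported weights into fixed-order blocks and record each one.

    The manifest is produced once, offline, so no splitting happens at runtime.
    """
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    manifest = []
    for block_id, offset in enumerate(range(0, len(weight_blob), block_bytes)):
        chunk = weight_blob[offset:offset + block_bytes]
        manifest.append({
            "block_id": block_id,          # position in the fixed load order
            "offset": offset,              # byte offset in external memory
            "length": len(chunk),          # never larger than block_bytes
            "crc32": zlib.crc32(chunk),    # integrity check code for the block
        })
    return manifest


blob = bytes(range(10))                    # stand-in for exported weights
manifest = build_manifest(blob, block_bytes=4)
```

At inference time the loader would walk `manifest` in order, which matches the fixed loading sequence described above.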
[0047] During the inference phase, each weight block is sequentially loaded from external memory into the on-chip cache. Specifically, an integrity check is performed after each weight block is loaded, using Cyclic Redundancy Check (CRC), hash check, or another lightweight consistency check method to determine whether the weight block has been corrupted by radiation effects such as SEUs. If the check fails, the local rollback mechanism is automatically triggered.
[0048] In this embodiment, the rollback mechanism is specifically designed to be block-level independent, ensuring the continuity of intermediate feature maps and scheduling states. The rollback latency is <50μs, avoiding overall inference interruption and supporting the millisecond-level response requirements of BNCT. Specifically, recovery includes reloading the weight block and replacing it from redundant copies, without interrupting the inference process. This approach breaks through the default assumption of "weights are read-only and necessarily correct" in deep learning inference, enabling the model to adapt to the actual working conditions of the BNCT radiation environment.
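The block-level recovery described above can be modelled in software as shown below. This is a behavioural sketch only: the invention implements the logic in hardware, and the names `load_verified_block` and `run_pipeline`, the retry limit `max_reloads`, and the use of CRC-32 are assumptions made for illustration.

```python
import zlib
from typing import Callable, Iterable, Optional, Tuple

Reader = Callable[[], bytes]


def load_verified_block(
    read_primary: Reader,
    read_backup: Reader,
    expected_crc: int,
    max_reloads: int = 2,
) -> Optional[bytes]:
    """Load one weight block, rolling back locally if verification fails."""
    for _ in range(1 + max_reloads):
        block = read_primary()                 # initial load, then reloads
        if zlib.crc32(block) == expected_crc:
            return block
    block = read_backup()                      # fall back to the redundant copy
    return block if zlib.crc32(block) == expected_crc else None


def run_pipeline(blocks: Iterable[Tuple[Reader, Reader, int]], feature_map, apply_block):
    """Apply weight blocks in order; only a faulty block is ever re-fetched."""
    for read_primary, read_backup, expected_crc in blocks:
        weights = load_verified_block(read_primary, read_backup, expected_crc)
        if weights is None:
            raise RuntimeError("weight block could not be recovered")
        # feature_map was not touched while the block was being recovered.
        feature_map = apply_block(feature_map, weights)
    return feature_map
```

Because `feature_map` is only updated after a block passes verification, a failed check never discards results that earlier blocks have already produced, which mirrors the continuity property claimed for the rollback.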
[0049] After the weight block verification passes, the inference unit inside the hardware performs deep learning computations in a dataflow manner. Input features, weight blocks, and intermediate results flow continuously in an on-chip cache in a pipeline manner, and multiple parallel multiply-accumulate computation units can achieve high-throughput inference under low power conditions. Specifically, through the collaborative design of weight block partitioning, cache reuse, and dataflow computation structure, the model can complete continuous inference without overall loading, significantly reducing the dependence on high-power devices such as GPUs, enabling real-time inference in a constrained environment near the BNCT device.
[0050] In a dataflow inference architecture, intermediate feature maps and scheduling states are continuously passed. If a weight verification fails and the pipeline is stopped and the state is reset, all intermediate features need to be recalculated, significantly increasing inference latency and failing to meet the real-time requirements of BNCT. Specifically, by designing block-level independent verification, scheduling state preservation, and local rollback strategies, this approach ensures that while the pipeline continues to run, only the currently used erroneous weight block is replaced, without destroying the generated intermediate features. This non-interrupted recovery mechanism requires a unified design across cache organization, scheduling control, and block structure, which is not inherent in existing FPGA inference architectures.
[0051] Through the above structure, the various technical features form a synergistic relationship: the lightweight model controls the size of the weight blocks, adapting to embedded hardware; weight block division provides the positioning granularity for the verification and rollback mechanisms; the verification and rollback mechanisms enable inference to adapt to radiation environments; and the dataflow inference structure enables real-time inference on constrained hardware. The absence of any one of these components would prevent the overall technical effect of this invention from being achieved. Therefore, this invention achieves stability, radiation resistance, and real-time performance that are difficult to attain with existing technologies, demonstrating significant technological advancements.
[0052] Through the above mechanism, the deep learning model can maintain continuous and reliable reasoning ability in the BNCT radiation environment. The three-dimensional neutron flux distribution obtained by this invention can be used for BNCT dose assessment, rapid prediction of treatment plans, and plan optimization, providing timely and reliable physical evidence for clinical practice.
[0053] Example 3
Example 3 is a specific application of the neural network hardware inference fault-tolerant method for BNCT radiation environments proposed in Example 1. It describes a fault-tolerant prediction execution flow for BNCT scenarios, as shown in Figure 2.
[0054] I. Input Data Acquisition and Preprocessing 1. CT image acquisition and formatting Acquire a sequence of CT images of the patient's treatment area, in DICOM format or as converted array data. To ensure model input consistency, perform the following steps on the images: Gray-scale normalization: This linearly maps CT values to the 0-1 range, reducing differences caused by different scanning conditions. Before normalization, linear offset compensation can be applied to the CT values based on secondary calibration parameters to reduce systematic differences between different equipment manufacturers.
[0055] Spatial alignment: Affine transformations are used to adjust the coordinate system so that the CT image is aligned with the neutron beam direction. In some BNCT devices, there may be rotational and translational deviations between the beam coordinate system and the original CT coordinate system. This embodiment uses feature points or fiducial markers based on anatomical structures to perform coordinate correction, thereby improving the spatial consistency of the model.
[0056] Region of Interest (ROI) extraction: Extracting the main regions containing tumors or important structures to reduce irrelevant data. The size of the ROI can be adaptively adjusted according to different treatment sites. For example, a 192×192 clipping window can be used for the head and neck to reduce the input size and reduce the computational cost of the model.
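As a non-limiting illustration of the gray-scale normalization and ROI extraction steps described above, the following Python sketch operates on a single 2D slice stored as nested lists. The function names `normalize_slice` and `center_crop` are hypothetical, and cropping a window centered on the slice is an assumption; the embodiment only states that the ROI size is adapted to the treatment site. For a full volume, the minimum and maximum would be taken over all voxels rather than per slice.

```python
def normalize_slice(slice_2d):
    """Linearly map the CT values of one slice to [0, 1] (min-max normalization)."""
    flat = [v for row in slice_2d for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: avoid division by zero
        return [[0.0 for _ in row] for row in slice_2d]
    return [[(v - lo) / (hi - lo) for v in row] for row in slice_2d]


def center_crop(slice_2d, size):
    """Cut a size x size window around the slice center (assumed ROI placement)."""
    rows, cols = len(slice_2d), len(slice_2d[0])
    if size > rows or size > cols:
        raise ValueError("crop window larger than image")
    r0, c0 = (rows - size) // 2, (cols - size) // 2
    return [row[c0:c0 + size] for row in slice_2d[r0:r0 + size]]
```

With a 256×256 slice, `center_crop(normalize_slice(ct_slice), 192)` would yield the 192×192 window mentioned for the head and neck.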
[0057] 2. Neutron Beam Parameter Acquisition The BNCT device provides real-time beam parameters, including: incident direction, beam intensity, energy spectrum or energy grouping, and collimator aperture structure. The beam parameters are encoded as fixed-length vectors and used as model input along with the image.
[0058] In one embodiment, the beam parameter vector can be expanded to include the proportions of cold neutrons, thermal neutrons, and fast neutrons to enhance the model's generalization ability to different energy spectrum conditions.
[0059] II. Deep Learning Model Construction and Training 1. Overall structure of the model This embodiment employs a lightweight encoder-decoder convolutional network, which mainly includes: Four encoding convolutional modules, a skip connection module, four decoding convolutional modules, and an output convolutional layer (1×1) are used to generate the flux value. The number of channels in each layer has been pruned to fit the FPGA's on-chip cache, as shown in the table below: Table 1 Comparison of channels before and after pruning in each convolutional layer
[0060] Structured channel pruning is used during pruning to ensure that the shape of the pruned network layer is consistent with the matrix size supported by the FPGA convolutional array, thus avoiding the need for additional padding during inference.
[0061] 2. Sources of model training data This embodiment employs a training scheme based on few-shot supervised learning. The training data is generated by Monte Carlo simulation and constructed in combination with real BNCT beam parameters.
[0062] (1) Construction of training dataset In this embodiment, approximately 100 simulated head BNCT cases can be constructed using Monte Carlo particle transport simulation software (e.g., MCNP6.2 software) based on publicly available BNCT beam simulation methods. Each case includes: CT image data: for example, a resolution of 256×256×64, with voxel sizes of approximately 1.0×1.0×2.0 mm³. CT images are normalized using the following formula:
[0063] $$I_{norm} = \frac{I - I_{min}}{I_{max} - I_{min}}$$ where $I_{norm}$ is the normalized voxel gray value, mapped to the [0, 1] interval; $I$ is the gray value of each voxel in the original CT image; $I_{min}$ is the minimum gray value over all voxels in the CT image; and $I_{max}$ is the maximum gray value over all voxels in the CT image.
[0064] This normalization improves the adaptability of the deep learning model under different scanning conditions. Neutron beam parameters: a multi-dimensional feature vector (e.g., 8-dimensional) describing the BNCT beam state, composed of energy, beam intensity, incident direction, collimator opening, energy spectrum grouping, etc. Label data: MCNP is used to simulate neutron transport on the above phantom to obtain the ground-truth voxel-level neutron flux. The number of simulated particles is, for example, 10⁷, which gives statistical accuracy sufficient for training. Optionally, to enhance the dose sensitivity of the model, the corresponding linear energy transfer (LET) data can be output at the same time as an auxiliary training signal.
[0065] To support the fusion of multimodal medical images, this embodiment can combine CT images with MRI or PET images. The specific fusion method employs feature-level fusion: before the model input layer, independent convolutional layers are used to extract features from each modality (e.g., CT extracts density information, MRI extracts soft tissue contrast, and PET extracts metabolic activity), and then these features are fused into a unified input tensor through concatenation or attention mechanisms. For example, for a single case, CT (256×256×64, 1 channel), MRI (same resolution, 1 channel), and PET (same resolution, 1 channel) are fused to form a 3-channel input. The fusion formula is as follows:
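[0066] $$X_{fused} = \mathrm{Conv}_{1\times1}\left(\mathrm{Concat}\left(F_{CT}, F_{MRI}, F_{PET}\right)\right)$$ where $X_{fused}$ is the unified input tensor after fusion; $\mathrm{Concat}(\cdot)$ is the concatenation operation along the channel dimension; $F_{CT}$, $F_{MRI}$, and $F_{PET}$ are the features extracted from each modality; and the 1×1 convolution is used for channel alignment. The fused model improves the SSIM metric by approximately 5% on the validation set, enhancing its generalization ability to complex tissue structures. During training, real MRI / PET images can be supplemented from public datasets, and the corresponding labels are generated through Monte Carlo simulation to ensure data consistency.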
[0067] (2) Dataset partitioning The simulated dataset is divided into training, validation, and test sets, for example, by 80, 10, and 10 cases respectively. Patient-level partitioning is used to avoid data leakage between slices.
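The patient-level partitioning described above can be sketched as follows. This is a minimal illustration, not the applicant's tooling: the function name `split_cases`, the fixed random seed, and the string case identifiers are assumptions. The point it shows is that the split is made over case IDs, so all slices of one case land in the same subset.

```python
import random


def split_cases(case_ids, n_train=80, n_val=10, n_test=10, seed=0):
    """Split case (patient) IDs so that no case appears in two subsets."""
    ids = sorted(set(case_ids))  # de-duplicate: one entry per case
    if len(ids) < n_train + n_val + n_test:
        raise ValueError("not enough distinct cases for the requested split")
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

Slices are then assigned to a subset by looking up the case they belong to, which avoids leakage between slices of the same patient.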
[0068] (3) Training hyperparameter settings Deep learning models are trained using the Adam optimizer, and the learning rate can be set to... The batch size can be set to, for example, 2. The model can be trained for approximately 300 epochs and can be completed on a server equipped with a high-performance GPU. A cosine annealing learning rate strategy can be used during training to improve the model's convergence stability.
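For illustration, a cosine annealing schedule of the kind mentioned above can be written as below. The function name `cosine_annealing_lr` and the `min_lr` floor are assumptions, and `base_lr` is a caller-supplied parameter because the embodiment's learning rate value is not reproduced here.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate that decays from base_lr to min_lr along a half cosine."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term
```

Over the approximately 300 epochs of this embodiment, the rate starts at `base_lr`, passes through the midpoint of the range at epoch 150, and reaches `min_lr` at epoch 300.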
[0069] (4) Loss function The model employs a hybrid loss function consisting of L1 loss and the structural similarity index (SSIM):
[0070] where $L_{total}$ is the hybrid loss function; the $L_1$ term, the pixel-wise absolute error between the predicted flux and the actual flux, constrains the overall numerical deviation; and the SSIM term constrains the consistency of spatial structure, so that tissue interfaces in the predicted flux distribution are located more accurately, which helps subsequent dose calculations. Specifically:
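[0071] $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$ where $\mathrm{SSIM}(x, y)$ measures the similarity between the predicted image $x$ and the real image $y$ in terms of brightness, contrast, and spatial structure; $\mu_x$ is the mean of image $x$; $\mu_y$ is the mean of image $y$; $\sigma_x^2$ is the variance of image $x$; $\sigma_y^2$ is the variance of image $y$; $\sigma_{xy}$ is the covariance of images $x$ and $y$; $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ are stabilizing constants; and $L$ is the dynamic range of the image.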
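A minimal sketch of a hybrid L1 + SSIM loss follows, under several stated assumptions: SSIM is computed once over the whole flattened image rather than over sliding windows, the constants `k1=0.01` and `k2=0.03` are the conventional defaults rather than values given in this embodiment, and the weighting `ssim_weight` between the two terms is hypothetical. The function names are likewise illustrative.

```python
def ssim_global(x, y, dynamic_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over two equally long lists of voxel values."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def hybrid_loss(pred, target, ssim_weight=0.5):
    """Mean absolute error plus a weighted (1 - SSIM) structural penalty."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    return l1 + ssim_weight * (1.0 - ssim_global(pred, target))
```

The `1 - SSIM` form turns the similarity index into a penalty that vanishes when the prediction matches the label, so both terms are minimized together.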
[0072] (5) Model performance example Under typical training conditions, the SSIM of the validation set can reach approximately 0.88, and the dose-related error of the test set can be controlled within approximately 6%–7%, which can meet the auxiliary evaluation requirements for BNCT treatment plans.
[0073] In some validation sets with significant variations in energy spectrum conditions, the model can automatically adapt to different energy conditions through beam parameter encoding, demonstrating good generalization ability.
[0074] (6) Model export After training, the model is quantized, for example, exported as an 8-bit quantized model, with a model size of approximately 3MB, for subsequent FPGA inference deployment.
[0075] After quantization, the model will simultaneously generate quantization scale and zero-point parameters to ensure that integer convolution operations during hardware inference can restore the precision of floating-point representation.
[0076] III. Model Lightweighting and Weight Blocking 1. Quantization The convolution weights are converted to 8-bit integer format using post-training quantization (PTQ) to reduce storage overhead and the complexity of multiply-accumulate operations on the FPGA convolutional array. Each floating-point weight $W_f$ is converted to an 8-bit integer $W_q$ as: $$W_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{W_f}{s}\right) + z,\ q_{min},\ q_{max}\right)$$ In the formula, $s$ is the quantization scale, $z$ is the zero point, and $[q_{min}, q_{max}]$ is the representable range of the 8-bit integer format.
[0077] This quantization method is based on the PTQ method, and experiments show that the accuracy loss is <0.5%, making it suitable for Zynq series FPGAs.
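As a hedged illustration of post-training quantization with a scale and zero point (the parameters mentioned in [0075]), the sketch below uses a per-tensor affine scheme. The choice of a signed range of -128 to 127, the min/max calibration rule, and the function names are assumptions; the embodiment does not specify them.

```python
def quantize_int8(weights, q_min=-128, q_max=127):
    """Affine per-tensor quantization: returns (q_values, scale, zero_point)."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / (q_max - q_min) or 1.0  # all-zero tensor: avoid scale 0
    zero_point = round(q_min - w_min / scale)
    q = [max(q_min, min(q_max, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q_values, scale, zero_point):
    """Recover approximate floating-point weights from the integer codes."""
    return [scale * (v - zero_point) for v in q_values]
```

Dequantizing the result reproduces each weight to within about one quantization step, which is the sense in which the scale and zero point let integer arithmetic stand in for the floating-point representation.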
[0078] 2. Channel pruning A pruning method based on channel importance is used to remove redundant convolution channels, so that the model maintains accuracy under hardware resource constraints.
[0079] The pruning ratio can be automatically calculated based on the number of FPGA BRAMs and DSPs, ensuring that the output channels of each layer are compatible with hardware parallelism.
[0080] 3. Weight block organization FPGAs have limited on-chip buffer (BRAM) capacity; for example, the Zynq-7020 has approximately 630 KB of available BRAM. To ensure that the weights of a single convolutional layer can fully reside in the buffer during inference, this embodiment employs the following method: The quantized weights of each convolutional layer are divided into multiple weight blocks of a fixed size (e.g., 64 KB). The weight block size is determined based on the on-chip storage capacity of the FPGA, and the calculation formula is as follows:
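[0081] where $C_{BRAM}$ is the capacity (in bits) of a single BRAM block, typically 36 Kb; $K$ is the convolution kernel size; and $C_{in}$ and $C_{out}$ are the numbers of input and output channels. The number of weight blocks per layer, $N_{blk}$, is: $$N_{blk} = \left\lceil \frac{W_{layer}}{S_{blk}} \right\rceil$$
[0082] In the formula, $W_{layer}$ is the total number of bytes of weights for this layer and $S_{blk}$ is the weight block size in bytes.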
[0083] Each block is stored independently and loaded block by block during inference.
[0084] To reduce external DDR access latency, each weight block is stored at consecutive addresses in external memory to reduce the number of cross-page accesses.
[0085] The following table shows examples of the Conv3 layer's block structure: Table 2. Examples of Conv3 layer segmentation
[0086] After the above steps, the weight block and checksum have been stored in external storage.
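The block division and checksum generation of this section can be sketched as follows, assuming the quantized weights of one layer are available as a byte string. The names `split_into_blocks` and `BLOCK_SIZE` are hypothetical; `zlib.crc32` from the Python standard library computes the CRC-32 that uses the IEEE 802.3 generator polynomial.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB, the example block size of this embodiment


def split_into_blocks(layer_weights: bytes, block_size: int = BLOCK_SIZE):
    """Cut one layer's weight bytes into fixed-size blocks, each with its CRC32.

    Returns a list of (payload, crc32) pairs; the last block may be shorter.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    for offset in range(0, len(layer_weights), block_size):
        payload = layer_weights[offset:offset + block_size]
        blocks.append((payload, zlib.crc32(payload)))
    return blocks
```

The number of pairs returned equals the layer size divided by the block size, rounded up, and each 32-bit checksum corresponds to the 4-byte check code stored with its block.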
[0087] IV. Weight Block Loading and Integrity Verification under Radiation Environment The presence of neutrons and gamma rays near the BNCT device may cause bit flips in external memory or during data transfer. Referring to Figure 3 and Figure 6, to ensure the reliability of inference, this invention employs the following verification mechanism: 1. DMA block loading process The embedded processor sends a "Load Block i" command to the DMA controller; the DMA moves the weight block from DDR / flash memory to the FPGA BRAM; verification is performed after loading. The DMA interface can use the AXI-HB bus, supporting a burst transfer bandwidth of approximately 800 MB/s at a maximum clock frequency of 150 MHz.
[0088] 2. Integrity Verification This invention preferably uses a CRC32 checksum, and the checksum polynomial is: $$R(x) = \left(M(x) \cdot x^{32}\right) \bmod G(x)$$ where $R(x)$ is the generated 32-bit check code polynomial; $M(x)$ is the input weight block data polynomial; $x$ is the formal variable of the polynomials; and $G(x)$ is the generator polynomial (using the IEEE 802.3 standard). Each block is appended with a 4-byte checksum.
[0089] The pre-stored checksum is stored along with the weight block; after loading, the real-time checksum of the current weight block is calculated and compared; if verification fails, it is retransmitted; if verification fails again, a backup weight block is reloaded or called. The probability of a single block loading failure is:
[0090] In the formula, $P_{fail}$ is the probability of a single block loading failure and $\lambda_{SEU}$ is the neutron-induced SEU rate (typically 10⁻⁷ /bit·h). This probability is based on radiation test data at a typical BNCT neutron fluence rate of 10⁵ n/cm²·s. If verification fails again, the backup weight block is used.
[0091] This step effectively reduces the impact of radiation-induced data corruption on prediction results. To provide the backup weight blocks, at least two independent quantized models are generated during the training phase and stored in external non-volatile memory, so that the primary and backup blocks serve as redundancy for each other.
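The load, verify, retransmit, and fall-back sequence described in this section can be illustrated in software as below. This is a sketch under stated assumptions, not the hardware logic of the invention: `read_primary` and `read_backup` are hypothetical callables standing in for DMA transfers, a single retransmission is attempted before the backup is used, and the backup copy is assumed to share the primary block's checksum.

```python
import zlib


def load_verified_block(read_primary, read_backup, expected_crc, max_retries=1):
    """Return (data, source) for a block whose CRC32 matches expected_crc.

    The primary copy is read once plus max_retries retransmissions; if every
    attempt fails verification, the backup copy is tried before giving up.
    """
    for _ in range(1 + max_retries):
        data = read_primary()
        if zlib.crc32(data) == expected_crc:
            return data, "primary"
    data = read_backup()
    if zlib.crc32(data) == expected_crc:
        return data, "backup"
    raise RuntimeError("weight block failed verification on all copies")
```

Only the failing block is re-read; nothing in this routine touches previously computed feature maps, which mirrors the local nature of the rollback.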
[0092] V. FPGA-based Convolutional Inference As shown in Figure 4, the following modules are implemented within the FPGA: a convolution array (multiple parallel MAC units), a data flow scheduler, an activation and addition module, and an output buffer.
[0093] The inference process includes: Read the input features and the verified weight blocks from the BRAM; The convolutional array performs matrix multiplication and accumulation. A convolutional array can contain 64–128 parallel DSP units to achieve high throughput. The single-layer convolution computation time $T_{layer}$ is: $$T_{layer} = \frac{H_{out} \cdot W_{out} \cdot C_{out} \cdot K^2 \cdot C_{in}}{N_{MAC} \cdot f_{clk}}$$
[0094] In the formula, $H_{out}$ is the output feature map height, $W_{out}$ is the output feature map width, $C_{out}$ is the number of output channels, $K^2$ is the square of the convolution kernel size, $C_{in}$ is the number of input channels, $N_{MAC}$ is the number of parallel multiply-accumulate units, and $f_{clk}$ is the clock frequency.
[0095] The result enters the activation module and is written to the on-chip cache for the next layer; Repeat this process for all layers until the flux prediction map is generated. The total inference delay $T_{total}$ is: $$T_{total} = \sum_{l} T_{layer}^{(l)} + N_{total} \cdot T_{load}$$
[0096] In the formula, $T_{total}$ is the overall inference delay, $\sum_{l} T_{layer}^{(l)}$ is the sum of the computation time for all layers, $N_{total}$ is the total number of weight blocks, $T_{load}$ is the loading and verification time of a single weight block, and $N_{total} \cdot T_{load}$ is the total overhead for weight loading and verification.
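The latency model of [0093] to [0096] can be evaluated numerically with the short sketch below. The function names are hypothetical, the model assumes every multiply-accumulate unit is busy on every cycle, and any layer shapes or block counts passed in are illustrative inputs rather than figures from this embodiment.

```python
def layer_time_s(h_out, w_out, c_out, kernel, c_in, n_mac, f_clk_hz):
    """Ideal compute time of one convolution layer, in seconds."""
    macs = h_out * w_out * c_out * kernel * kernel * c_in  # total multiply-accumulates
    return macs / (n_mac * f_clk_hz)


def total_latency_s(layer_times, n_blocks, t_load_s):
    """Sum of per-layer compute times plus per-block load-and-verify overhead."""
    return sum(layer_times) + n_blocks * t_load_s
```

As a rough consistency check on the overhead term, transferring one 64 KB block at the approximately 800 MB/s burst bandwidth stated above takes about 82 µs, before any CRC computation time is added.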
[0097] The final prediction results, output via the embedded processor, include: a two-dimensional heatmap, a three-dimensional voxel matrix, and floating-point data suitable for BNCT dose calculation. The voxel-level neutron flux is calculated as:
[0098] In the formula, $\Phi(x, y, z)$ is the voxel-level neutron flux; $\mathrm{Conv}_{out}$ is the last convolution operation in the decoder; $F_k$ is the $k$-th feature map; and $(x, y, z)$ is the voxel position at which the neutron flux is predicted.
[0099] To reduce off-chip access, the decoder section uses a combination of deconvolution and upsampling to keep intermediate features small within the FPGA.
[0100] Example 4 Example 4, based on the neural network hardware inference fault-tolerant system for BNCT radiation environments proposed in Example 1, proposes a specific hardware inference fault-tolerant device architecture. Referring to Figure 5, the device includes: 1. Image Preprocessing Module: Implements CT image normalization, alignment, and ROI extraction. This module can be deployed in an embedded processor or in the programmable logic area of the FPGA. Normalization and coordinate transformation can be implemented with lookup tables (LUTs) or fixed-point matrix calculations to reduce latency and power consumption.
[0101] 2. Weight Management Module: This module is used to divide the weights of a pre-trained 3D deep learning model for predicting neutron flux distribution into multiple weight blocks based on the on-chip storage capacity of the computing hardware, and to generate an integrity check code for each weight block. In a preferred embodiment, this module includes: The model layer description parsing unit is used to convert the layer structure of the quantized model into a sequence of scheduling instructions that can be executed by the FPGA. The feature map buffer management unit is used to arrange the storage location of feature maps of each layer in the limited BRAM to avoid overlay conflicts between different layers.
[0102] 3. External storage: Stores each weight block and its corresponding checksum. To improve data loading efficiency: The weight blocks are stored in contiguous regions; Each weight block is aligned to 64 KB to reduce cross-page access during DMA transfers; The checksum is stored at a fixed offset from its weight block, which makes it easy for the DMA to trigger the check immediately after loading.
[0103] 4. Weight Loading and Verification Module: This module loads weight blocks sequentially from external memory to the on-chip cache during the inference phase and performs integrity verification after loading but before sending them to the computation unit. In this embodiment: The DMA controller is responsible for moving weighted blocks from external storage to the on-chip cache; The verification module calculates the real-time CRC32 checksum and compares it with the pre-stored checksum. To improve radiation resistance, the verification module can employ fully hardware-based CRC generation and comparison logic to prevent the propagation of errors that may be caused by radiation during CPU calculations.
[0104] 5. Rollback Processing Module: This module performs a local rollback operation on the erroneous weight block when a verification fails, without interrupting the current data flow inference pipeline. In this embodiment: If the aforementioned verification fails, the module automatically executes a retransmission instruction and reloads the weight block; If multiple consecutive failures occur, the corresponding weight block from the standby redundant copy is loaded instead.
[0105] 6. Dataflow Inference Module (FPGA Acceleration Unit): After all weight blocks have passed verification, this module performs 3D deep learning inference calculations in a dataflow manner, outputting the predicted 3D neutron flux distribution of the target region. This module includes the following components: 1) Convolutional array (multiple parallel DSP units) 2) Data Stream Scheduler (used to direct the reading and writing order of input features, weight blocks, and output features) 3) Activation and Addition Module (used to perform ReLU, element-wise addition, skip connection fusion, and other operations) 4) On-chip cache (BRAM / URAM) 5) Output buffer Convolutional arrays can employ a row-column parallel structure, achieving high throughput operation by unrolling the input channels and the convolutional kernel rows.
[0106] Inter-module connection method: Modules are interconnected via AXI bus; the system acquires data via Ethernet, USB or serial port.
[0107] In a preferred embodiment: The input module communicates with peripherals via an AXI-Lite channel; Weight and feature map transfers use AXI-HP / AXI-HB channels to provide higher memory bandwidth; The acceleration unit uses an on-chip crossbar switch to reduce data contention.
[0108] Example 5 Example 5 builds on the neural network hardware inference fault-tolerant device for BNCT radiation environments proposed in Example 4 and provides a specific SoC heterogeneous deployment scheme, including: 1. Embedded Processor (ARM) The embedded processor is responsible for task scheduling and exception handling. In a preferred embodiment, the processor may be an embedded CPU based on the ARM Cortex-A9 or A53 of a Zynq SoC, or on the RISC-V architecture. The main functions of the processor include: Send weight loading commands to the DMA controller; Monitor CRC check results and execute error rollback procedures; Maintain the current inference state machine, including states such as "loading", "validating", "inference", and "rollback"; Provide an interface for encapsulating and transmitting prediction results to external parties; Optional: Perform a lightweight consistency check on the inference output to prevent output anomalies caused by the propagation of abnormal weights.
[0109] To enhance radiation resistance, critical scheduling states can be stored in ECC-protected SRAM to reduce scheduling errors caused by single-event upsets.
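The inference state machine maintained by the processor can be sketched as a transition table. The four states are those named in this example; the event names (`dma_done`, `crc_ok`, `crc_fail`, `retry`, `next_block`) and the specific transitions between states are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    LOADING = "loading"
    VALIDATING = "validating"
    INFERENCE = "inference"
    ROLLBACK = "rollback"


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.LOADING, "dma_done"): State.VALIDATING,
    (State.VALIDATING, "crc_ok"): State.INFERENCE,
    (State.VALIDATING, "crc_fail"): State.ROLLBACK,
    (State.ROLLBACK, "retry"): State.LOADING,
    (State.INFERENCE, "next_block"): State.LOADING,
}


def step(state, event):
    """Advance the state machine, rejecting events that are invalid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.value!r}")
```

Rejecting unexpected events, rather than ignoring them, gives the scheduler a way to notice a state variable that has itself been corrupted.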
[0110] 2. Inference device (corresponding to Embodiment 4) This system includes the device described in Embodiment 4, which performs functions such as image preprocessing, deep learning inference, weight loading and verification, and convolution acceleration.
[0111] The inference device interacts with the embedded processor via the AXI bus, wherein: AXI-Lite: Used for register-level control and status query; AXI-HB / HP: Used for high-bandwidth transmission of weight blocks and feature maps; Interrupt lines (IRQ): Used to notify the processor of events such as CRC check completion, DMA completion, and inference completion.
[0112] 3. Data Input Interface Used to receive CT images and beam parameters.
[0113] This interface may include any one or a combination of the following: A DICOM-over-IP interface is used to acquire CT data from a PACS system. A high-speed serial port or Ethernet interface is used to receive beam parameters output by the BNCT device; The USB interface is used for loading model test data during the debugging period.
[0114] To improve data reliability, CRC16 / 24 can be used for lightweight verification of input data to prevent radiation from damaging the input data.
[0115] 4. Output Interface Used to output neutron flux prediction maps or voxel matrices required for dose calculation.
[0116] Output formats may include: DICOM RTDose format; Numpy array format; CSV / binary matrix format; and two-dimensional images in heatmap form (PNG / RAW).
[0117] In a preferred embodiment, the output can be sent directly to the treatment planning system (TPS), eliminating the need for additional data conversion steps.
[0118] 5. Monitoring Module The monitoring module is used to monitor the verification status, DMA status, and inference consistency.
[0119] The monitoring module includes: CRC Status Register: Records the success / failure status of each weight block in real time; DMA Status Register: Records transfer time, error interrupt flags, etc.; Inference Consistency Monitor: Detects potential anomalies by performing lightweight statistical checks (such as energy conservation constraints and maximum flux comparisons) on the outputs of the previous and current frames; Anomaly Recovery Unit: When an inference anomaly is detected, it automatically switches to a backup weight model or executes a re-inference process.
[0120] The monitoring module can be implemented using hardware logic to avoid excessive processor involvement, thereby reducing the potential propagation of software-level errors caused by radiation.
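One possible form of the frame-to-frame consistency check is sketched below in software, although the embodiment prefers hardware logic. The function name `frames_consistent`, the 20% relative tolerance, and the use of the summed flux as a stand-in for the conservation constraint are all assumptions.

```python
def frames_consistent(prev_flux, curr_flux, rel_tol=0.2):
    """Compare peak and summed flux of two frames against a relative tolerance."""
    def rel_change(old, new):
        base = max(abs(old), 1e-12)  # guard against division by zero
        return abs(new - old) / base

    peak_ok = rel_change(max(prev_flux), max(curr_flux)) <= rel_tol
    total_ok = rel_change(sum(prev_flux), sum(curr_flux)) <= rel_tol
    return peak_ok and total_ok
```

A `False` result would be the trigger for the anomaly recovery unit to switch to the backup weight model or repeat the inference.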
[0121] Based on the five embodiments described above, the present invention can be used for: rapid flux assessment before BNCT treatment, real-time assistance for beam adjustment during treatment, embedded flux prediction for mobile / compact BNCT devices, and as an "accelerated surrogate model" for Monte Carlo simulation.
[0122] This invention can achieve a flux prediction latency of <200 ms on an embedded FPGA platform with power consumption of <5 W, offering significant deployment advantages; specifically as follows: (1) Pre-treatment rapid flux assessment During the patient treatment planning stage, it is necessary to quickly calculate the neutron flux distribution under different beam settings (including energy spectrum, collimator opening, irradiation direction, etc.).
[0123] This invention can output three-dimensional voxel-level neutron flux distribution within hundreds of milliseconds, enabling clinicians to quickly compare the differences between different irradiation protocols, thereby shortening the time required to optimize treatment plans.
[0124] Compared to Monte Carlo calculations that can take tens of minutes or hours, this system can provide near real-time preliminary results and can be used as a predictive input for Monte Carlo calculations when needed, improving the overall computational efficiency of the TPS (treatment planning system).
[0125] (2) Beam adjustment assistance during treatment During BNCT clinical irradiation, the beam energy spectrum and intensity may drift slightly depending on the accelerator's operating status.
[0126] This invention enables rapid prediction of flux distribution during irradiation by receiving real-time beam parameter input, assisting in real-time clinical monitoring and prompting adjustments to the treatment plan when necessary.
[0127] The radiation-resistant inference mechanism based on FPGA enables the prediction system to be deployed in a high-radiation region near the beam aperture, thereby achieving close coupling with the field beam monitoring system.
[0128] (3) Embedded prediction in compact or mobile BNCT devices Some BNCT devices use compact accelerators or mobile treatment units, which limit their size and prevent the use of conventional GPUs.
[0129] Because this invention uses FPGA hardware acceleration and includes weighted block loading and radiation verification mechanisms, it can maintain stable operation in environments with limited volume, limited power, and near accelerators with strong radiation, making it suitable for new portable or bedside BNCT treatment devices.
[0130] (4) Accelerated surrogate for Monte Carlo simulation Traditional BNCT dose calculation relies on the Monte Carlo method, which has a high computational cost.
[0131] This invention can accelerate the Monte Carlo simulation process as a "surrogate model" in the following situations: Predict the flux distribution using this model before Monte Carlo calculations as an initial estimate to reduce the necessary number of particles.
[0132] When iteratively optimizing the dose, the flux distribution predicted by this model is used as a guide.
[0133] It can quickly provide correction values when beam parameters change, reducing the number of repeated simulations.
[0134] Based on 10 simulation tests, this model, as a surrogate, can reduce the number of Monte Carlo iterations by 30%-50%, depending on the magnitude of the beam parameter variation. (5) Low-bandwidth operating mode in remote or cloud-based TPS systems In some telemedicine or cloud-based treatment planning systems, bandwidth limitations prevent the transmission of large amounts of CT data and computational models.
[0135] With this invention, only compressed input parameters need to be transmitted; inference is then performed independently by the front-end FPGA system of this invention, thereby reducing reliance on network bandwidth and improving remote planning efficiency.
[0136] (6) Patient-specific tissue structure adaptation Because the model of this invention incorporates CT anatomical structures and beam parameters during training, it can automatically adapt to the tissue density and anatomical variations of different patients.
[0137] In real clinical settings, patient positions, ROI ranges, or tumor sizes vary. This invention can maintain model input consistency through ROI extraction and spatial alignment, giving the system a strong ability to adapt to individual needs.
[0138] It should also be noted that, specifically through theoretical analysis and architectural derivation, the technical measures of this invention are achieved as follows: Limit the weight errors caused by radiation to the weight block level, rather than affecting the entire model; Complete error recovery without interrupting the pipeline and maintain real-time predictive capabilities; Implementing block-based inference for large models on embedded hardware with limited on-chip cache; By combining input normalization, model lightweighting, and block-based verification mechanisms, the stability and reliability of inference are significantly improved.
[0139] The aforementioned technical measures form an interdependent and inseparable overall architecture, making it possible to achieve stable, continuous, and reliable deep learning inference in a BNCT radiation environment. These effects were verified through FPGA prototyping; the code and model can be reproduced in a non-radiation environment, while radiation testing requires specialized equipment.
[0140] The embodiments described above merely illustrate specific implementation methods of this application, and while the descriptions are detailed and specific, they should not be construed as limiting the scope of protection of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the technical solution of this application, and these modifications and improvements all fall within the scope of protection of this application.
[0141] This background section is provided to generally present the context of the invention. The work of the currently named inventors, the work to the extent described in this background section, and aspects of this section that did not constitute prior art at the time of application are neither expressly nor impliedly acknowledged as prior art to the invention.
Claims
1. A fault-tolerant method for neural network hardware inference in BNCT radiation environments, characterized in that, The method is executed by computing hardware deployed near the boron neutron capture therapy device to output a three-dimensional neutron flux distribution prediction result for the target region, including: Step S1: Obtain a pre-trained 3D deep learning model, which is trained based on multimodal medical images and their corresponding voxel-level neutron flux labels; Step S2: Based on the on-chip storage capacity of the computing hardware, divide the model weights of the 3D deep learning model into multiple weight blocks, and generate an integrity check code for each weight block. Step S3: During the deep learning inference stage, the weight blocks are loaded from the external memory into the on-chip cache of the computing hardware in the order of weight blocks, and an integrity check is performed on the weight block after loading is completed and before it is sent to the computing unit for inference. Step S4: When the verification fails due to BNCT radiation, without interrupting the current data flow inference pipeline, a local rollback operation is performed on the erroneous weight block. The local rollback operation is selected from one or more of reloading the weight block and loading a pre-stored redundant copy, until the verification passes or the rollback is successful. Step S5: After all weight blocks have passed verification, the computing hardware performs three-dimensional deep learning inference calculations in a data stream manner and outputs the three-dimensional neutron flux distribution prediction results.
2. The neural network hardware inference fault-tolerant method for BNCT radiation environment according to claim 1, characterized in that, The multimodal medical images include one or more of CT images, MRI images, and PET images, and the voxel-level neutron flux labels are generated by Monte Carlo simulation.
3. The neural network hardware inference fault-tolerant method for BNCT radiation environment according to claim 1, characterized in that, The weight blocks are divided in one or more ways, such as by convolution kernel group, by channel group, or by hierarchical structure, so that the byte size of each weight block is not greater than the on-chip storage capacity of the computing hardware.
4. The neural network hardware inference fault-tolerant method for BNCT radiation environment according to claim 1, characterized in that, The integrity check code is selected from one or more of Cyclic Redundancy Check (CRC), checksum, and hash digest.
5. The neural network hardware inference fault-tolerant method for BNCT radiation environment according to claim 1, characterized in that, The computing hardware is one or more hardware accelerators selected from FPGA, ASIC, DSP or edge GPU; the data flow method is a computing structure in which the input feature map, weight block and intermediate feature map flow continuously in the on-chip cache in a pipeline form.
6. The fault-tolerant method for neural network hardware inference in BNCT radiation environments according to claim 1, characterized in that, in step S4, the local rollback operation ensures that the already-calculated intermediate feature maps are not cleared from the on-chip cache and that the inference scheduling state of the pipeline remains continuous.
7. A fault-tolerant system for neural network hardware inference in BNCT radiation environments, characterized in that the system comprises: a weight management module, configured to divide the weights of a pre-trained 3D deep learning model used to predict neutron flux distribution into multiple weight blocks according to the on-chip storage capacity of the computing hardware, and to generate an integrity check code for each weight block; a weight loading and verification module, configured to load the weight blocks from external memory into the on-chip cache sequentially during the inference phase, and to perform integrity verification on each weight block after it is loaded and before it is sent to the computing unit, so as to detect weight data errors induced by BNCT radiation; a rollback processing module, configured to perform a local rollback operation on the erroneous weight block when the verification fails, without interrupting the current dataflow inference pipeline, the local rollback operation being selected from one or more of reloading the weight block and loading a pre-stored redundant copy; and a dataflow inference module, configured to perform 3D deep learning inference calculations in a dataflow manner after all weight blocks have passed verification, and to output the 3D neutron flux distribution prediction result for the target region.
8. The fault-tolerant system for neural network hardware inference in BNCT radiation environments according to claim 7, characterized in that the dataflow inference module is deployed on one or more hardware platforms selected from an FPGA, an ASIC, a DSP, and an edge GPU.
9. The fault-tolerant system for neural network hardware inference in BNCT radiation environments according to claim 7, characterized in that both the weight loading and verification module and the rollback processing module are implemented in hardware logic, so as to avoid software scheduling errors caused by radiation.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when executed, cause the method according to any one of claims 1 to 6 to be implemented.
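For illustration only, the block division, integrity check, and local rollback recited in steps S2 to S4 might be sketched in software as follows. This is a minimal sketch under stated assumptions, not the claimed hardware implementation: CRC-32 is chosen as one of the check codes named in claim 4, the weights are divided by plain byte ranges rather than by kernel group, channel group, or hierarchical structure, and every name here (`WeightBlock`, `partition_weights`, `load_verified`, `flip_bit`, `max_reloads`) is hypothetical. The on-chip cache, intermediate feature maps, and pipeline scheduling state are not modeled.

```python
import zlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WeightBlock:
    """Hypothetical record: one weight block plus its integrity check code."""
    index: int
    payload: bytes
    crc: int


def partition_weights(weights: bytes, on_chip_capacity: int) -> list[WeightBlock]:
    """Step S2 (sketch): split weights so no block exceeds the on-chip capacity."""
    if on_chip_capacity <= 0:
        raise ValueError("on_chip_capacity must be positive")
    blocks = []
    for index, offset in enumerate(range(0, len(weights), on_chip_capacity)):
        payload = weights[offset:offset + on_chip_capacity]
        blocks.append(WeightBlock(index, payload, zlib.crc32(payload)))
    return blocks


def load_verified(
    block: WeightBlock,
    read_primary: Callable[[int], bytes],
    read_redundant: Callable[[int], bytes],
    max_reloads: int = 2,  # assumed retry budget, not taken from the claims
) -> tuple[bytes, str]:
    """Steps S3 and S4 (sketch): verify after loading; on failure, roll back
    only this block by reloading it, then by loading the redundant copy."""
    for attempt in range(1 + max_reloads):
        data = read_primary(block.index)
        if zlib.crc32(data) == block.crc:
            return data, "primary" if attempt == 0 else "reload"
    data = read_redundant(block.index)
    if zlib.crc32(data) == block.crc:
        return data, "redundant"
    raise RuntimeError(f"weight block {block.index} could not be recovered")


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate a radiation-induced single-bit upset."""
    corrupted = bytearray(data)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


# Demonstration with synthetic data (not real model weights).
weights = bytes(range(256)) * 4
blocks = partition_weights(weights, on_chip_capacity=300)

primary = {b.index: b.payload for b in blocks}
redundant = dict(primary)
primary[2] = flip_bit(primary[2], 5)  # persistent upset in external memory
transient = {1}                       # block 1 is corrupted on its first read only


def read_primary(index: int) -> bytes:
    data = primary[index]
    if index in transient:
        transient.discard(index)
        return flip_bit(data, 0)
    return data


def read_redundant(index: int) -> bytes:
    return redundant[index]


results = [load_verified(b, read_primary, read_redundant) for b in blocks]
sources = [source for _, source in results]
restored = b"".join(data for data, _ in results)
```

In this sketch a failed check affects only the block being loaded: blocks that were already verified are kept, which loosely mirrors the requirement of claim 6 that earlier results are not discarded. A transient upset is cleared by a reload, while a persistent upset in the primary store falls through to the redundant copy.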