An optimization method for effectively reducing flash storage of an algorithm model
By converting the single-precision model to a half-precision model and using the FP16 data type for storage and computation, the method addresses the wasted storage space and low computational efficiency of single-precision models on the T23 chip, thereby reducing storage space and improving the model's computational efficiency, running performance, and accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI JUNZHENG TECH CO LTD
- Filing Date
- 2024-12-17
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, single-precision models suffer from problems such as wasted storage space, low computational efficiency, accuracy loss, and limited applicability on resource-constrained devices such as the T23 chip. This is especially true in the application of deep learning models, particularly in object detection and classification algorithms, where storage space requirements are high, computational resources are consumed in large quantities, and accuracy is insufficient in certain scenarios.
By converting a single-precision model to a half-precision model and using the FP16 data type for storage and computation, the representation precision of the weight data is optimized, reducing storage space and computational resource requirements. Specifically, the method involves converting FP32 weight data to FP16 when generating the model and performing corresponding parsing and conversion during inference.
It significantly reduces storage space requirements, improves computational efficiency and accuracy, reduces computational resource consumption, and enhances the model's performance and speed on resource-constrained devices. In particular, the memory usage of the T23-YOLOv5 single humanoid detection algorithm has been reduced from 572KB to 292KB.
Figure CN122240004A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of neural network model technology, and specifically relates to an optimization method for effectively reducing the Flash storage of algorithm models.
Background Technology
[0002] In current technologies, the compression and acceleration of deep learning models is a crucial research topic. As deep learning is increasingly applied across various fields, the size and complexity of these models are constantly increasing, leading to significant demands on computing resources and high energy consumption. Therefore, effectively compressing and accelerating deep learning models has become a hot research topic.
[0003] The purpose of deep learning model compression is to reduce the size and complexity of the model in order to reduce the consumption of storage and computing resources.
[0004] Common model compression methods include weight pruning, quantization, and knowledge distillation. Weight pruning reduces the model size by removing some unimportant weight connections; quantization converts floating-point weights in the model into low-bit integer representations, thereby reducing storage and computational requirements; knowledge distillation improves the performance of smaller models by transferring knowledge from larger models to smaller models.
[0005] Besides model compression, accelerating the inference process of deep learning models is also an important research direction. Acceleration techniques can improve the running speed of models, enabling them to process data and make decisions faster. Common acceleration techniques include hardware optimization, parallel computing, and model architecture optimization. Hardware optimization refers to improving the running speed of models by improving hardware devices, such as using dedicated accelerators or optimizing processor architecture; parallel computing utilizes multiple computing units to execute computational tasks simultaneously to accelerate the inference speed of models; model architecture optimization improves the running efficiency of models by improving their structure and algorithms to reduce computational complexity.
[0006] Compression and acceleration techniques for deep learning models are of great significance in practical applications. By compressing and accelerating, deep learning models can run on resource-constrained devices, such as mobile devices and embedded systems, thereby expanding the application scope of deep learning. Furthermore, compression and acceleration can improve the running speed of models, meeting the demands of real-time applications, such as autonomous driving and speech recognition.
[0007] In summary, the compression and acceleration of deep learning models is an important research direction, aiming to address the computational resource and energy consumption issues arising from the increasing size and complexity of models. Effective compression and acceleration techniques can reduce the storage and computational requirements of models, improve their running speed, and thus promote the widespread application of deep learning in various fields.
[0008] Single-precision inference plays a crucial role in deep learning. By using 32-bit floating-point numbers to represent model weights and activations, it achieves a relatively balanced performance and resource consumption while maintaining high computational precision. For deep learning models that need to be deployed on resource-constrained devices, single-precision inference is a promising option with broad application prospects.
[0009] The T23 chip from Beijing Junzheng Integrated Circuit Co., Ltd. (hereinafter referred to as Junzheng) is a processor with specific computing capabilities. By design, this chip does not support the SIMD128 instruction set, so it cannot process multiple data elements with a single instruction. Therefore, when deploying neural network models, chips such as the Junzheng T23 can only run single-precision model inference, and the advantages of Magik low-bit model quantization do not apply to the T23 chip.
[0010] Therefore, the main disadvantages of this type of chip, including the T23 model chip, which uses a single-precision model (FP32), are as follows:
[0011] 1. Storage space:
[0012] The single-precision model uses 32 bits (4 bytes) to represent a floating-point number, while the half-precision model uses only 16 bits (2 bytes). Therefore, the single-precision model will occupy twice the storage space when storing the same number of floating-point numbers.
[0013] 2. Computational efficiency:
[0014] While single-precision models offer higher accuracy, this also means that more resources may be needed to process these extra bits during computation. In contrast, half-precision models, with fewer bits, can improve computational efficiency to some extent, especially in large-scale computing or resource-constrained environments.
[0015] 3. Accuracy loss:
[0016] While the accuracy of single-precision models is sufficient for deep learning and other scientific computing tasks in most cases, they may suffer from a loss of accuracy in some extreme cases, such as when dealing with very large or very small values.
[0017] 4. Applicability:
[0018] In certain application scenarios, such as deep learning inference on embedded devices or mobile devices, single-precision models may no longer be applicable due to resource constraints.
[0019] In addition, commonly used technical terms include:
[0020] FP16: Half-precision floating-point numbers are a format that uses 16 bits (2 bytes) to represent floating-point numbers. This format is widely used in computer graphics, deep learning, and other fields because it can significantly reduce the demand for storage and computing resources while maintaining a certain level of precision. Floating-point numbers are a data type used to represent real numbers (including decimals and integers), and they consist of three parts: a sign bit, exponent bits, and mantissa bits. The format of a half-precision floating-point number includes 1 sign bit, 5 exponent bits, and 10 mantissa bits.
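As a minimal illustration of the FP16 layout described above, the following C++ sketch splits a 16-bit pattern into its sign, exponent, and mantissa fields and reassembles it. The type `Fp16Fields` and the functions `split_fp16` and `join_fp16` are hypothetical names introduced for this example; they do not appear in the application.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the three fields of an FP16 bit pattern.
struct Fp16Fields {
    uint16_t sign;      // 1 bit
    uint16_t exponent;  // 5 bits, biased by 15
    uint16_t mantissa;  // 10 bits
};

// Split a raw 16-bit pattern into its FP16 fields.
inline Fp16Fields split_fp16(uint16_t bits) {
    Fp16Fields f;
    f.sign     = static_cast<uint16_t>((bits >> 15) & 0x1U);
    f.exponent = static_cast<uint16_t>((bits >> 10) & 0x1FU);
    f.mantissa = static_cast<uint16_t>(bits & 0x3FFU);
    return f;
}

// Reassemble the fields into a 16-bit pattern (inverse of split_fp16).
inline uint16_t join_fp16(const Fp16Fields& f) {
    assert(f.sign <= 0x1U && f.exponent <= 0x1FU && f.mantissa <= 0x3FFU);
    return static_cast<uint16_t>((f.sign << 15) | (f.exponent << 10) | f.mantissa);
}
```

For example, the pattern 0x3C00 (the value 1.0) splits into sign 0, exponent 15, and mantissa 0, and 0xC000 (the value -2.0) splits into sign 1, exponent 16, and mantissa 0.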
[0021] FP32: Single-Precision Floating-Point Numbers, is a format that uses 32 bits (4 bytes) to represent floating-point numbers. In specific computing environments or programming languages, single-precision floating-point numbers are typically represented using the float type. The format of a single-precision floating-point number includes 1 sign bit, 8 exponent bits, and 23 mantissa bits.
Summary of the Invention
[0022] To address the aforementioned issues, the purpose of this application is to provide a novel optimized implementation for deep learning model compression and acceleration. By converting single-precision models into half-precision models, it significantly reduces the Flash storage requirements of deep learning algorithms such as object detection and classification on the T23 chip, while maintaining computational accuracy. This innovative optimization will provide a more efficient and flexible solution for the development and deployment of deep learning applications.
[0023] The half-precision model employed in this invention is an optimized computational model that significantly reduces the demand for storage and computational resources while maintaining a certain level of accuracy. This model reduces the storage space required for each data point by decreasing the precision of the data representation. Simultaneously, the reduced amount of data processed also lowers the demand for computational resources.
[0024] In traditional computing models, the precision of data representation is usually fixed, leading to a waste of storage and computing resources. However, the half-precision model of this invention achieves more efficient storage and computation by intelligently adjusting the precision of data representation.
[0025] Specifically, this model selects an appropriate precision to represent the data based on actual needs and the characteristics of the data. This allows data that doesn't require high precision to be represented using lower precision, thus saving storage space.
[0026] Besides reducing storage space requirements, the half-precision model also has a positive impact on computing resource requirements. Due to the reduced data volume, the computing resources required during computation are correspondingly reduced. This means that more data can be processed under the same hardware conditions, improving computational efficiency. Furthermore, because the half-precision model reduces the precision of data representation, it also reduces the accumulation of errors during computation, thereby improving the accuracy of the calculation results.
[0027] In summary, the half-precision model used in this invention significantly reduces the demand for storage and computing resources by lowering the precision of data representation while maintaining a certain level of accuracy. This optimized computational model not only improves computational efficiency but also enhances the accuracy of computational results, bringing significant improvements to the field of data processing and analysis.
[0028] Specifically, this invention provides an optimization method for effectively reducing the Flash storage of algorithm models. During the generation of the board-side inference model, the method reduces the model's data volume by changing the type of weight data. Specifically, when using the Magik tool to generate the model, all weight data is forcibly converted from FP32 (32-bit floating-point data type) to FP16 (16-bit floating-point data type) and stored in the model file in float format. The method includes the following steps:
[0029] When parsing the model weight data:
[0030] S1: first, the weight data is read from the binary model file and stored in a 16-bit unsigned integer matrix;
[0031] S2: then, each element of this 16-bit unsigned integer matrix is traversed, converted into FP32 data, and stored in the weight data matrix for use in network inference computation.
[0032] In step S1: the weight data in the model file is converted from FP32 to FP16, specifically including:
[0033] Define a function `float_cov_uint16` that takes a floating-point number `value` as its input parameter, expressed as: `uint16_t float_cov_uint16(float value){`
[0034] Define a constant holding the bit pattern of FP32 infinity (all eight exponent bits set), expressed as:
[0035] const Fp32 f32infty={255U<<23};
[0036] Define a constant holding the FP16 infinity exponent (31) placed in the FP32 exponent field, expressed as:
[0037] const Fp32 f16infty={31U<<23};
[0038] Define a constant for floating-point number conversion, expressed as:
[0039] const Fp32 magic={15U<<23};
[0040] Define a mask to extract the sign bit of a floating-point number, represented as:
[0041] const uint32_t sign_mask=0x80000000U;
[0042] Define a mask for rounding floating-point numbers, expressed as: const uint32_t round_mask = ~0xFFFU;
[0043] Define a variable to store the input floating-point number, represented as:
[0044] Fp32 in;
[0045] Define a variable to store the output 16-bit unsigned integer, denoted as: uint16_t out;
[0046] Assigning the input floating-point number to in.f is represented as:
[0047] in.f = value;
[0048] Extract the sign bit of a floating-point number, and represent it as:
[0049] uint32_t sign=in.u&sign_mask;
[0050] Clearing the sign bit of a floating-point number to zero is represented as:
[0051] in.u^=sign;
[0052] Determine whether the floating-point number is infinity or NaN (all exponent bits set), expressed as: if(in.u>=f32infty.u){
[0053] NaN is converted to sNaN, and Inf remains Inf, expressed as:
[0054] out=(in.u>f32infty.u)? 0x7FFFU:0x7C00U;
[0055] }
[0056] Otherwise, the value is a denormalized number, zero, or a normalized number, expressed as:
[0057] else{
[0058] Clear the low 12 bits of the floating-point number in preparation for rounding, expressed as:
[0059] in.u &= round_mask;
[0060] Multiply the floating-point number by magic.f to rebias the exponent, expressed as:
[0061] in.f *= magic.f;
[0062] Subtract round_mask, which in 32-bit unsigned arithmetic adds the rounding increment 0x1000, expressed as:
[0063] in.u -= round_mask;
[0064] If a floating-point number overflows, it is truncated to signed infinity, represented as: if(in.u>f16infty.u){
[0065] in.u = f16infty.u;
[0066] }
[0067] Extract the significant bits of the floating-point number and convert it to a 16-bit unsigned integer, represented as: out = uint16_t(in.u>>13);
[0068] }
[0069] Add the sign bit to the output 16-bit unsigned integer, as follows:
[0070] out=uint16_t(out|(sign>>16));
[0071] The returned 16-bit unsigned integer is represented as:
[0072] return out;
[0073] }
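The listing above relies on a type `Fp32` that the application does not define; it appears to be a union exposing the same 32 bits as `float f` and `uint32_t u`. The following is a minimal, self-contained C++ sketch of the same conversion steps under that assumption. The helpers `float_bits` and `bits_float` are hypothetical names introduced here and use `std::memcpy` in place of the union. The sketch also assumes an IEEE 754 `float` and a floating-point unit that does not flush subnormal results to zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

// Hypothetical helpers standing in for the two views of the Fp32 union.
inline uint32_t float_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
inline float bits_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// FP32 value -> FP16 bit pattern, following the steps listed above.
inline uint16_t float_cov_uint16(float value) {
    const uint32_t f32infty   = 255U << 23;              // FP32 infinity bit pattern
    const uint32_t f16infty   = 31U << 23;               // FP16 infinity, in FP32 bit positions
    const float    magic      = bits_float(15U << 23);   // 2^-112, rebiases the exponent
    const uint32_t sign_mask  = 0x80000000U;
    const uint32_t round_mask = ~0xFFFU;

    uint32_t u = float_bits(value);
    const uint32_t sign = u & sign_mask;   // remember the sign
    u ^= sign;                             // work on the magnitude only

    uint16_t out;
    if (u >= f32infty) {
        // Inf or NaN: Inf stays Inf, any NaN becomes 0x7FFF.
        out = static_cast<uint16_t>((u > f32infty) ? 0x7FFFU : 0x7C00U);
    } else {
        u &= round_mask;                          // clear the low 12 bits
        u = float_bits(bits_float(u) * magic);    // exponent adjust via multiply
        u -= round_mask;                          // same as adding 0x1000 (mod 2^32)
        if (u > f16infty) u = f16infty;           // clamp overflow to infinity
        out = static_cast<uint16_t>(u >> 13);     // keep exponent and top 10 mantissa bits
    }
    assert((out & 0x8000U) == 0);                 // the magnitude never uses the sign bit
    return static_cast<uint16_t>(out | (sign >> 16));
}
```

With this sketch, `float_cov_uint16(1.0f)` yields 0x3C00 and `float_cov_uint16(-2.0f)` yields 0xC000; finite inputs too large for FP16, such as 1.0e6f, are clamped to the infinity pattern 0x7C00.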
[0074] In step S1: the weight data in the model file is stored in FP16 data type, specifically implemented as follows:
[0075] The size of the weight node is obtained as follows:
[0076] int weight_size=get_node_weight_size(weight_node);
[0077] The value of the weight tensor is obtained as follows:
[0078] const float*weight_value=get_tensor_value(weight_node);
[0079] Allocate memory space for the temporary weight value array, represented as follows:
[0080] uint16_t*weight_value_temp=(uint16_t*)malloc(weight_size*
[0081] sizeof(uint16_t));
[0082] Iterate through the array of weight values, converting floating-point numbers to unsigned 16-bit integers, expressed as: for(int i = 0; i < weight_size; ++i){
[0083] weight_value_temp[i]=float_cov_uint16(weight_value[i]);
[0084] }
[0085] The transformed weight values are filled into the data buffer and represented as follows:
[0086] fill_data_buffer_data_float(dataBuffer,(float*)weight_value_temp,
[0087] weight_size>>1);
[0088] Write the contents of the data buffer into the T23 model file, as follows:
[0089] dataBuffer->write_to_file(magik_model_t23);
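The storage step above writes the 16-bit values through a float-typed buffer, two values per float-sized slot. The following C++ sketch illustrates that packing with standard containers only. The function `pack_fp16_pairs` is a hypothetical name, and it does not reproduce `fill_data_buffer_data_float`, whose behavior the application does not describe. The sketch rounds the slot count up so that an odd number of weights is still fully covered; that choice is an assumption of this example, since the listing passes `weight_size>>1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack FP16 bit patterns two per 32-bit slot (the size of one float).
// Rounding the slot count up is an assumption of this sketch.
inline std::vector<uint32_t> pack_fp16_pairs(const std::vector<uint16_t>& halves) {
    const std::size_t slots = (halves.size() + 1) / 2;
    std::vector<uint32_t> packed(slots, 0U);   // zero padding for an odd tail
    assert(packed.size() * sizeof(uint32_t) >= halves.size() * sizeof(uint16_t));
    if (!halves.empty()) {
        // A byte-wise copy keeps the in-memory order of the 16-bit values.
        std::memcpy(packed.data(), halves.data(), halves.size() * sizeof(uint16_t));
    }
    return packed;
}
```

Three FP16 values, for instance, occupy two 32-bit slots, with the unused last 16 bits set to zero.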
[0090] In step S2: the FP16 weight data parsing implementation in the model file further includes: a function to convert half-precision FP16 to full-precision FP32, expressed as:
[0091] float Layer::uint16_cov_float(uint16_t value){
[0092] Define a constant `magic` to adjust the exponent, expressed as:
[0093] const Fp32 magic={(254U-15U)<<23};
[0094] Define a constant was_infnan to determine whether the value is infinity or NaN (not a number), expressed as: const Fp32 was_infnan={(127U+16U)<<23};
[0095] Define a variable of type Fp32, named out, to store the output result, expressed as: Fp32 out;
[0096] Perform a bitwise AND operation between the input value and 0x7FFF, then shift left by 13 bits to obtain the exponent and mantissa, represented as:
[0097] out.u=(value&0x7FFFU)<<13;
[0098] Multiply the floating-point part of `out` by the floating-point part of `magic` to adjust the exponent, expressed as:
[0099] out.f *= magic.f;
[0100] If the adjusted value is greater than or equal to was_infnan.f, the original value was infinity or NaN, and these special values need to be retained, expressed as:
[0101] if(out.f>=was_infnan.f)
[0102] {
[0103] Set all 8 exponent bits of out.u to 1 to indicate infinity or NaN, expressed as:
[0104] out.u|=255U<<23;
[0105] }
[0106] Perform a bitwise AND operation between the input value and 0x8000, then shift left by 16 bits to obtain the sign bit, represented as:
[0107] out.u|=(value&0x8000U)<<16;
[0108] Returning the floating-point part of `out` as the result is represented as:
[0109] return out.f;
[0110] }
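The reverse conversion above can be sketched in the same self-contained style. As before, the `Fp32` type is not defined in the application, so the hypothetical helpers `float_bits` and `bits_float` stand in for its two views, and the function is written as a free function rather than a member of `Layer`. An IEEE 754 `float` with subnormal support is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

// Hypothetical helpers standing in for the two views of the Fp32 union.
inline uint32_t float_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
inline float bits_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// FP16 bit pattern -> FP32 value, following the steps listed above.
inline float uint16_cov_float(uint16_t value) {
    const float magic      = bits_float((254U - 15U) << 23);   // 2^112
    const float was_infnan = bits_float((127U + 16U) << 23);   // 2^16 = 65536

    // Move the 5 exponent bits and 10 mantissa bits into FP32 positions.
    uint32_t u = static_cast<uint32_t>(value & 0x7FFFU) << 13;
    const float scaled = bits_float(u) * magic;                // exponent adjust
    u = float_bits(scaled);
    if (scaled >= was_infnan) {
        u |= 255U << 23;   // source was Inf or NaN: set all FP32 exponent bits
    }
    assert((u & 0x80000000U) == 0);                            // magnitude only so far
    u |= static_cast<uint32_t>(value & 0x8000U) << 16;         // restore the sign
    return bits_float(u);
}
```

With this sketch, `uint16_cov_float(0x3C00)` returns 1.0f, `uint16_cov_float(0x7BFF)` returns 65504.0f (the largest finite FP16 value), and the pattern 0x7C00 maps to infinity.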
[0111] Create a `Mat` object of type `uint16_t` to store the weight data, i.e., half-precision floating-point numbers, represented as:
[0112] Mat<uint16_t> weight_data_fp16;
[0113] Allocate the Mat object according to the size of the weight data, expressed as:
[0114] weight_data_fp16.create(weight_data_size);
[0115] The weight data is read from the binary file and stored in weight_data_fp16, represented as:
[0116] nread=fread(weight_data_fp16,weight_data_size*sizeof(uint16_t),1,binfp);
[0117] Create a Mat object of type float to store the weight data, i.e., single-precision floating-point numbers, represented as:
[0118] weight_data.create(weight_data_size);
[0119] Iterate through each element in weight_data_fp16, convert it to float type, and store it in weight_data, represented as:
[0120] for(int i = 0; i <weight_data_size;++i){
[0121] weight_data[i]=uint16_cov_float(weight_data_fp16[i]);
[0122] }
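Steps S1 and S2 can also be sketched without the `Mat` type, whose interface the application does not show. The following C++ example uses `std::vector` in place of `Mat` and `std::istream` in place of `fread`, so that it can be exercised from memory as well as from a file. The function `load_fp16_weights` is a hypothetical name, and the FP16-to-FP32 converter is passed in as a parameter (for instance a function such as `uint16_cov_float`). Unlike the listing above, the sketch checks for a short read before using the data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>   // std::istream, and std::istringstream for in-memory use
#include <vector>

// S1: read `count` 16-bit words from the stream.
// S2: convert each word to float with `to_float` and store it in `weight_data`.
// Returns false if the stream holds fewer than `count` words.
template <typename Convert>
bool load_fp16_weights(std::istream& in, std::size_t count, Convert to_float,
                       std::vector<float>& weight_data) {
    std::vector<uint16_t> weight_data_fp16(count);
    const std::streamsize bytes =
        static_cast<std::streamsize>(count * sizeof(uint16_t));
    if (count > 0) {
        in.read(reinterpret_cast<char*>(weight_data_fp16.data()), bytes);
        if (in.gcount() != bytes) {
            return false;   // short read: do not use partial data
        }
    }
    weight_data.resize(count);
    for (std::size_t i = 0; i < count; ++i) {
        weight_data[i] = to_float(weight_data_fp16[i]);
    }
    assert(weight_data.size() == weight_data_fp16.size());
    return true;
}
```

The words are read in host byte order, as with the `fread` call above, so the file is assumed to have been written on a machine with the same endianness as the target.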
[0123] The board-end inference model is the T23 board-end inference model, and the model weight data is parsed in the T23 underlying algorithm inference library.
[0124] The method described is applicable to the T23 algorithm model and is an optimization of the T23 algorithm model with half precision. It involves the half-precision conversion, storage, and parsing of the weight data of the T23 model.
[0125] The T23 includes the T23-YOLOv5 single-humanoid detection algorithm.
[0126] Therefore, the advantage of this application lies in the fact that the proposed half-precision optimization scheme for the T23 algorithm model significantly reduces the resource consumption of the T23 algorithm model in Flash storage without sacrificing performance. Taking the T23-YOLOv5 single-human detection algorithm as an example, the memory consumption of the T23 algorithm model generated by the existing FP32 technology solution is approximately 572KB. However, after adopting the FP16 technology solution of this application, the memory consumption of the T23 algorithm model is reduced to only 292KB. This optimization not only reduces the storage space requirement but also improves the algorithm's running efficiency and response speed.
Attached Figure Description
[0127] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0128] Figure 1 is a flowchart illustrating the method.
[0129] Figure 2 is a schematic diagram of the code for converting the weight data in the model file from FP32 to FP16 in step S1 of this method.
[0130] Figure 3 is a schematic diagram of the code for storing the weight data in the model file using the FP16 data type in step S1 of this method.
[0131] Figure 4 is a schematic diagram of the code implementation for parsing FP16 weight data in the model file in step S2 of this method.
Detailed Implementation
[0132] To better understand the technical content and advantages of the present invention, the present invention will now be described in further detail with reference to the accompanying drawings.
[0133] The core of this invention is an innovative method designed to effectively reduce the resource consumption of the T23 algorithm model in Flash storage. This method optimizes the algorithm and data structure to achieve efficient utilization of Flash storage space, thereby significantly reducing the storage space required by the T23 algorithm model.
[0134] In the traditional T23 algorithm model, the complex calculation process and extensive data interaction often lead to excessive consumption of Flash storage resources. This not only increases the system's operating costs but may also affect the performance and stability of the algorithm model.
[0135] To address this issue, this application proposes a novel optimization method. First, the T23 algorithm model is analyzed and studied in depth, identifying its key computational steps and data interaction patterns. Then, by optimizing these key steps and patterns, the algorithm model achieves efficient operation.
[0136] For example, this method employs an advanced data compression technique that effectively compresses redundant data in the T23 algorithm model, thereby significantly reducing the space occupied by Flash storage.
[0137] Specifically, this application proposes an optimization method. The core idea of this method is to reduce the amount of data in the model by changing the type of weight data during the generation of the T23 board-side inference model. Specifically, when using the Magik tool to generate the model, all weight data is forcibly converted from FP32 (32-bit floating-point) data type to FP16 (16-bit floating-point) data type and stored in the model file in the form of float type.
[0138] As a result, due to the change in the original weight data type, the amount of weight data in the model is reduced to half of its original size. This means that the size of the model file will be significantly reduced, thus saving storage space and transmission bandwidth.
[0139] As shown in Figure 1, when parsing model weight data in the T23 underlying algorithm inference library:
[0140] S1: first, the weight data is read from the binary model file and stored in a 16-bit unsigned integer matrix.
[0141] S2: then, each element of this 16-bit unsigned integer matrix is traversed, converted into FP32 data, and stored in the weight data matrix for use in network inference computation.
[0142] This optimization method can not only reduce the amount of data in the model, but also maintain the model's inference performance, thereby achieving efficient inference on the T23 board.
[0143] The implementation of converting weight data from FP32 to FP16 in the T23 model file is shown in Figure 2 and in the code below:
[0144] Define a function `float_cov_uint16` that takes a floating-point number `value` as its input parameter, expressed as: `uint16_t float_cov_uint16(float value){`
[0145] Define a constant holding the bit pattern of FP32 infinity (all eight exponent bits set), expressed as: const Fp32 f32infty = {255U << 23};
[0146] Define a constant holding the FP16 infinity exponent (31) placed in the FP32 exponent field, expressed as: const Fp32 f16infty = {31U << 23};
[0147] Define a constant for floating-point conversion, expressed as: const Fp32 magic = {15U << 23};
[0148] Define a mask to extract the sign bit of a floating-point number, expressed as: const uint32_t sign_mask = 0x80000000U;
[0149] Define a mask for rounding floating-point numbers, expressed as: const uint32_t round_mask = ~0xFFFU;
[0150] Define a variable to store the input floating-point number, represented as: Fp32 in;
[0151] Define a variable to store the output 16-bit unsigned integer, denoted as: uint16_t out;
[0152] Assigning the input floating-point number to in.f is represented as:
[0153] in.f = value;
[0154] Extract the sign bit of a floating-point number, and represent it as:
[0155] uint32_t sign=in.u&sign_mask;
[0156] Clearing the sign bit of a floating-point number to zero is represented as:
[0157] in.u^=sign;
[0158] Determine whether the floating-point number is infinity or NaN (all exponent bits set), expressed as: if(in.u>=f32infty.u){
[0159] NaN is converted to sNaN, and Inf remains Inf, expressed as: out = (in.u > f32infty.u) ? 0x7FFFU:0x7C00U;
[0160] }
[0161] Otherwise, the value is a denormalized number, zero, or a normalized number, expressed as:
[0162] else{
[0163] Clear the low 12 bits of the floating-point number in preparation for rounding, expressed as:
[0164] in.u &= round_mask;
[0165] Multiply the floating-point number by magic.f to rebias the exponent, expressed as:
[0166] in.f *= magic.f;
[0167] Subtract round_mask, which in 32-bit unsigned arithmetic adds the rounding increment 0x1000, expressed as:
[0168] in.u -= round_mask;
[0169] If a floating-point number overflows, it is truncated to signed infinity, represented as: if(in.u>f16infty.u){
[0170] in.u = f16infty.u;
[0171] }
[0172] Extract the significant bits of the floating-point number and convert it to a 16-bit unsigned integer, represented as: out = uint16_t(in.u>>13);
[0173] }
[0174] Add the sign bit to the output 16-bit unsigned integer, as follows:
[0175] out=uint16_t(out|(sign>>16));
[0176] The returned 16-bit unsigned integer is represented as:
[0177] return out;
[0178] }
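The role of `magic` and `round_mask` in the steps above can be made explicit with a short derivation. This is an explanatory sketch, not part of the application; it assumes an IEEE 754 FP32 input whose magnitude lies in the normal FP16 range, with biased exponent field $E$ and 23-bit mantissa field $m$.

```latex
\begin{aligned}
|x| &= 2^{E-127}\left(1 + \frac{m}{2^{23}}\right), \qquad
  \texttt{magic.f} = 2^{15-127} = 2^{-112} \\
|x| \cdot \texttt{magic.f} &= 2^{(E-112)-127}\left(1 + \frac{m}{2^{23}}\right)
  \quad\Longrightarrow\quad E' = E - 112 = (E - 127) + 15 \\
-\,\texttt{round\_mask} &= -\left(2^{32} - 2^{12}\right) \equiv 2^{12} \pmod{2^{32}}
\end{aligned}
```

The new exponent field $E'$ equals the FP16 biased exponent, so shifting right by 13 bits places it in bits 10 to 14 of the result and keeps the top 10 mantissa bits. Subtracting `round_mask` adds $2^{12}$, the weight of the highest of the 13 discarded bits, so the retained value is incremented when that bit is set.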
[0179] The implementation of storing weight data in the T23 model file using the FP16 data type is shown in Figure 3 and in the code below:
[0180] The size of the weight node is obtained as follows:
[0181] int weight_size=get_node_weight_size(weight_node);
[0182] The value of the weight tensor is obtained as follows:
[0183] const float*weight_value=get_tensor_value(weight_node);
[0184] Allocate memory space for the temporary weight value array, represented as follows:
[0185] uint16_t*weight_value_temp=(uint16_t*)malloc(weight_size*
[0186] sizeof(uint16_t));
[0187] Iterate through the array of weight values, converting floating-point numbers to unsigned 16-bit integers, expressed as: for(int i = 0; i < weight_size; ++i){
[0188] weight_value_temp[i]=float_cov_uint16(weight_value[i]);
[0189] }
[0190] The transformed weight values are filled into the data buffer and represented as follows:
[0191] fill_data_buffer_data_float(dataBuffer,(float*)weight_value_temp,
[0192] weight_size>>1);
[0193] Write the contents of the data buffer into the T23 model file, as follows:
[0194] dataBuffer->write_to_file(magik_model_t23);
[0195] The FP16 weight data parsing implementation in the T23 model file is shown in Figure 4 and in the code below:
[0196] The function to convert half-precision FP16 to full-precision FP32 is expressed as follows:
[0197] float Layer::uint16_cov_float(uint16_t value){
[0198] Define a constant `magic` to adjust the exponent, expressed as:
[0199] const Fp32 magic={(254U-15U)<<23};
[0200] Define a constant was_infnan to determine whether the value is infinity or NaN (not a number), expressed as: const Fp32 was_infnan={(127U+16U)<<23};
[0201] Define a variable of type Fp32, named out, to store the output result, expressed as: Fp32 out;
[0202] Perform a bitwise AND operation between the input value and 0x7FFF, then shift left by 13 bits to obtain the exponent and mantissa, represented as:
[0203] out.u=(value&0x7FFFU)<<13;
[0204] Multiply the floating-point part of `out` by the floating-point part of `magic` to adjust the exponent, expressed as:
[0205] out.f *= magic.f;
[0206] If the adjusted value is greater than or equal to was_infnan.f, the original value was infinity or NaN, and these special values need to be retained, expressed as:
[0207] if(out.f>=was_infnan.f)
[0208] {
[0209] Set all 8 exponent bits of out.u to 1 to indicate infinity or NaN, expressed as:
[0210] out.u|=255U<<23;
[0211] }
[0212] Perform a bitwise AND operation between the input value and 0x8000, then shift left by 16 bits to obtain the sign bit, represented as:
[0213] out.u|=(value&0x8000U)<<16;
[0214] Returning the floating-point part of `out` as the result is represented as:
[0215] return out.f;
[0216] }
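The constants in the reverse conversion follow from the same exponent arithmetic. The derivation below is an explanatory sketch, not part of the application; it assumes an FP16 input with 5-bit exponent field $e$ and 10-bit mantissa field $f$, and it writes $v$ for the FP32 value obtained after the left shift by 13 bits.

```latex
\begin{aligned}
v &= 2^{e-127}\left(1 + \frac{f}{2^{10}}\right), \qquad
  \texttt{magic.f} = 2^{(254-15)-127} = 2^{112} \\
v \cdot \texttt{magic.f} &= 2^{e-15}\left(1 + \frac{f}{2^{10}}\right)
  \quad \text{(the value encoded by the FP16 input)} \\
\texttt{was\_infnan.f} &= 2^{(127+16)-127} = 2^{16} = 65536 > 65504
\end{aligned}
```

Because 65504 is the largest finite FP16 value, a scaled result of at least $2^{16}$ can only come from an input with $e = 31$, that is, infinity or NaN, which is why the comparison against `was_infnan.f` identifies those cases.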
[0217] Create a `Mat` object of type `uint16_t` to store the weight data, i.e., half-precision floating-point numbers, represented as:
[0218] Mat<uint16_t> weight_data_fp16;
[0219] Allocate the Mat object according to the size of the weight data, expressed as:
[0220] weight_data_fp16.create(weight_data_size);
[0221] The weight data is read from the binary file and stored in weight_data_fp16, represented as:
[0222] nread=fread(weight_data_fp16,weight_data_size*sizeof(uint16_t),1,binfp);
[0223] Create a Mat object of type float to store the weight data, i.e., single-precision floating-point numbers, represented as:
[0224] weight_data.create(weight_data_size);
[0225] Iterate through each element in weight_data_fp16, convert it to float type, and store it in weight_data, represented as:
[0226] for(int i = 0; i <weight_data_size;++i){
[0227] weight_data[i]=uint16_cov_float(weight_data_fp16[i]);
[0228] }
[0229] Taking the T23-YOLOv5 single humanoid detection algorithm as an example, the memory usage of the T23 algorithm model generated by the existing FP32 technology solution is about 572KB. After adopting the FP16 technology solution of this application, the memory usage of the T23 algorithm model is reduced to only 292KB.
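The reported sizes can be checked with simple arithmetic. The following is an illustrative calculation under an assumption not stated in the application: that the model file consists of $N$ weights plus a fixed non-weight part of size $H$ that the conversion leaves unchanged.

```latex
\begin{aligned}
4N + H &= 572\ \text{KB} \quad (\text{FP32}) \\
2N + H &= 292\ \text{KB} \quad (\text{FP16}) \\
\Longrightarrow\quad 2N &= 280\ \text{KB}, \qquad 4N = 560\ \text{KB}, \qquad H = 12\ \text{KB}
\end{aligned}
```

Under that assumption, the weights would account for about 560 KB of the FP32 model, and the FP16 model is 292/572, or roughly 51%, of the original size, slightly more than one half because the non-weight part does not shrink.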
[0230] In summary, the efficient optimization method proposed in this invention for reducing Flash storage requirements of the T23 algorithm model not only significantly reduces Flash storage space usage but also improves the operational efficiency and stability of the algorithm model. This has significant practical implications and application value for promoting the application and development of the T23 algorithm model.
[0231] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.
[0232] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0233] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations can be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. An optimization method for effectively reducing the Flash storage of algorithm models, characterized in that, The method reduces the amount of data in the model by changing the type of weight data during the generation of the board-side inference model. Specifically, when using the Magik tool to generate the model, all weight data is forcibly converted from FP32 (32-bit floating-point data type) to FP16 (16-bit floating-point data type) and stored in the model file as float type. The method includes the following steps: When parsing model weight data S1 first reads the weight data from the binary model file and stores it in a 16-bit unsigned integer matrix; S2 then iterates through each element in this 16-bit unsigned integer matrix, converts each element into FP32 data, and stores it in the weight data matrix for use in network inference computation.
2. The optimization method for effectively reducing the Flash storage of algorithm models according to claim 1, characterized in that, in step S1, the weight data in the model file is converted from FP32 to FP16, specifically including:
Define a function float_cov_uint16 that takes a floating-point number value as its input parameter, expressed as: uint16_t float_cov_uint16(float value){
Define a constant holding the FP32 bit pattern of positive infinity (all exponent bits set), expressed as: const Fp32 f32infty={255U<<23};
Define a constant holding the FP16 infinity exponent (31) placed in the FP32 exponent field, used as the overflow threshold, expressed as: const Fp32 f16infty={31U<<23};
Define a scaling constant for the exponent conversion, expressed as: const Fp32 magic={15U<<23};
Define a mask to extract the sign bit of a floating-point number, expressed as: const uint32_t sign_mask=0x80000000U;
Define a mask that clears the low 12 bits for rounding, expressed as: const uint32_t round_mask=~0xFFFU;
Define a variable to store the input floating-point number, expressed as: Fp32 in;
Define a variable to store the output 16-bit unsigned integer, expressed as: uint16_t out;
Assign the input floating-point number to in.f, expressed as: in.f=value;
Extract the sign bit of the floating-point number, expressed as: uint32_t sign=in.u&sign_mask;
Clear the sign bit of the floating-point number, expressed as: in.u^=sign;
Determine whether the floating-point number is infinity or NaN (all exponent bits set), expressed as: if(in.u>=f32infty.u){
NaN is converted to the NaN encoding 0x7FFF, and Inf remains Inf, expressed as: out=(in.u>f32infty.u)?0x7FFFU:0x7C00U; }
Otherwise, the number is denormalized, zero, or normalized, expressed as: else{
Clear the low bits of the floating-point number for rounding, expressed as: in.u&=round_mask;
Multiply the floating-point number by magic.f, expressed as: in.f*=magic.f;
Subtract round_mask, which in 32-bit unsigned arithmetic adds 0x1000 as a rounding offset, expressed as: in.u-=round_mask;
If the floating-point number overflows, clamp it to infinity, expressed as: if(in.u>f16infty.u){ in.u=f16infty.u; }
Extract the significant bits of the floating-point number and convert them to a 16-bit unsigned integer, expressed as: out=uint16_t(in.u>>13); }
Add the sign bit to the output 16-bit unsigned integer, expressed as: out=uint16_t(out|(sign>>16));
Return the 16-bit unsigned integer, expressed as: return out; }.
3. The optimization method for effectively reducing the Flash storage of algorithm models according to claim 2, characterized in that, in step S1, the weight data in the model file is stored in the FP16 data type, specifically implemented as follows:
Obtain the size of the weight node, expressed as: int weight_size=get_node_weight_size(weight_node);
Obtain the value of the weight tensor, expressed as: const float*weight_value=get_tensor_value(weight_node);
Allocate memory for the temporary weight value array, expressed as: uint16_t*weight_value_temp=(uint16_t*)malloc(weight_size*sizeof(uint16_t));
Iterate through the array of weight values, converting each floating-point number to an unsigned 16-bit integer, expressed as: for(int i=0;i<weight_size;++i){ weight_value_temp[i]=float_cov_uint16(weight_value[i]); }
Fill the converted weight values into the data buffer, expressed as: fill_data_buffer_data_float(dataBuffer,(float*)weight_value_temp,weight_size>>1);
Write the contents of the data buffer into the T23 model file, expressed as: dataBuffer->write_to_file(magik_model_t23).
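The FP32 to FP16 conversion steps in claims 2 and 3 can be sketched as a self-contained function. This is a hedged illustration rather than the patented implementation. It uses std::memcpy in place of the Fp32 union, whose definition is not given in the claims, and the names bits_of, float_of, float_to_half_bits, and pack_weights are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers: move bits between float and uint32_t without a union.
std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

float float_of(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// FP32 value -> FP16 bit pattern, following the steps listed in claim 2.
std::uint16_t float_to_half_bits(float value) {
    const std::uint32_t f32infty = 255U << 23;     // FP32 +Inf bit pattern
    const std::uint32_t f16infty = 31U << 23;      // overflow threshold
    const float magic = float_of(15U << 23);       // scale factor 2^-112
    const std::uint32_t sign_mask = 0x80000000U;
    const std::uint32_t round_mask = ~0xFFFU;      // clears the low 12 bits

    std::uint32_t u = bits_of(value);
    const std::uint32_t sign = u & sign_mask;
    u ^= sign;                                     // work on the magnitude

    std::uint16_t out;
    if (u >= f32infty) {                           // Inf or NaN
        out = static_cast<std::uint16_t>((u > f32infty) ? 0x7FFFU : 0x7C00U);
    } else {                                       // zero, denormal, or normal
        u &= round_mask;
        u = bits_of(float_of(u) * magic);          // rescale the exponent
        u -= round_mask;                           // same as adding 0x1000
        if (u > f16infty) u = f16infty;            // clamp overflow to Inf
        out = static_cast<std::uint16_t>(u >> 13);
    }
    return static_cast<std::uint16_t>(out | (sign >> 16));
}

// Convert an array of FP32 weights to FP16 bit patterns (claim 3 loop).
std::vector<std::uint16_t> pack_weights(const float* weights, std::size_t count) {
    assert(weights != nullptr || count == 0);
    std::vector<std::uint16_t> packed(count);
    for (std::size_t i = 0; i < count; ++i) {
        packed[i] = float_to_half_bits(weights[i]);
    }
    return packed;
}
```

For example, 1.0f maps to 0x3C00, and values beyond the FP16 range such as 1.0e6f clamp to the infinity pattern 0x7C00. The sketch assumes the platform does not flush FP32 denormals to zero.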
4. The optimization method for effectively reducing the Flash storage of algorithm models according to claim 3, characterized in that, in step S2, the parsing of the FP16 weight data in the model file further includes:
Define the function that converts half-precision FP16 to single-precision FP32, expressed as: float Layer::uint16_cov_float(uint16_t value){
Define a constant magic to adjust the exponent, expressed as: const Fp32 magic={(254U-15U)<<23};
Define a constant was_infnan to detect infinity or NaN, expressed as: const Fp32 was_infnan={(127U+16U)<<23};
Define a variable out of type Fp32 to store the output result, expressed as: Fp32 out;
Perform a bitwise AND between the input value and 0x7FFF, then shift left by 13 bits to place the exponent and mantissa, expressed as: out.u=(value&0x7FFFU)<<13;
Multiply out.f by magic.f to adjust the exponent, expressed as: out.f*=magic.f;
If the adjusted value out.f is greater than or equal to was_infnan.f, the original value was infinity or NaN, and these special values must be retained, expressed as: if(out.f>=was_infnan.f){
Set all 8 exponent bits of out.u to 1 to indicate infinity or NaN, expressed as: out.u|=255U<<23; }
Perform a bitwise AND between the input value and 0x8000, then shift left by 16 bits to obtain the sign bit, expressed as: out.u|=(value&0x8000U)<<16;
Return out.f as the result, expressed as: return out.f; }
Declare a Mat object of type uint16_t to store the half-precision weight data, expressed as: Mat<uint16_t> weight_data_fp16;
Allocate the Mat object according to the size of the weight data, expressed as: weight_data_fp16.create(weight_data_size);
Read the weight data from the binary file into weight_data_fp16, expressed as: nread=fread(weight_data_fp16,weight_data_size*sizeof(uint16_t),1,binfp);
Allocate a Mat object of type float to store the single-precision weight data, expressed as: weight_data.create(weight_data_size);
Iterate through each element in weight_data_fp16, convert it to float type, and store it in weight_data, expressed as: for(int i=0;i<weight_data_size;++i){ weight_data[i]=uint16_cov_float(weight_data_fp16[i]); }.
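The FP16 to FP32 parsing steps in claim 4 can be sketched in the same way. This is an illustrative version under stated assumptions, not the T23 inference library code. std::memcpy stands in for the Fp32 union, a std::vector stands in for the Mat type, and the names u32_from_float, float_from_u32, half_bits_to_float, and load_weights are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers: move bits between float and uint32_t without a union.
std::uint32_t u32_from_float(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

float float_from_u32(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// FP16 bit pattern -> FP32 value, following the steps listed in claim 4.
float half_bits_to_float(std::uint16_t value) {
    const float magic = float_from_u32((254U - 15U) << 23);      // 2^112
    const float was_infnan = float_from_u32((127U + 16U) << 23); // 65536.0f

    // Place the FP16 exponent and mantissa in the FP32 layout.
    std::uint32_t u = (value & 0x7FFFU) << 13;
    const float scaled = float_from_u32(u) * magic;  // exponent adjust
    u = u32_from_float(scaled);
    if (scaled >= was_infnan) {
        u |= 255U << 23;                             // keep Inf/NaN as Inf/NaN
    }
    u |= (value & 0x8000U) << 16;                    // restore the sign bit
    return float_from_u32(u);
}

// Step S2 (sketch): widen every stored 16-bit word to an FP32 weight.
std::vector<float> load_weights(const std::uint16_t* half_words, std::size_t count) {
    assert(half_words != nullptr || count == 0);
    std::vector<float> weights(count);
    for (std::size_t i = 0; i < count; ++i) {
        weights[i] = half_bits_to_float(half_words[i]);
    }
    return weights;
}
```

Every finite FP16 value is exactly representable in FP32, so this widening step itself adds no error. Any precision loss comes from the earlier FP32 to FP16 rounding at model generation time.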
5. The optimization method for effectively reducing the Flash storage of algorithm models according to claim 1, characterized in that the board-side inference model is the T23 board-side inference model, and the model weight data is parsed in the T23 underlying algorithm inference library.
6. The optimization method for effectively reducing the Flash storage of algorithm models according to claim 1, characterized in that the method is applicable to the T23 algorithm model and is a half-precision optimization of that model, covering the half-precision conversion, storage, and parsing of the weight data of the T23 model.
7. The optimization method for effectively reducing the Flash storage of algorithm models according to claim 5 or 6, characterized in that the T23 algorithm model includes the T23-YOLOv5 single-humanoid detection algorithm.