46 results about "Loop unrolling" patented technology

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler.
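
As a minimal illustration of the transformation (a sketch, not drawn from any patent listed here), the Python functions below sum a list first with a plain loop and then with the body unrolled by a factor of four. The names `sum_rolled` and `sum_unrolled4` are hypothetical. Unrolling pays off mainly in compiled code, so Python is used here only to show the shape of the rewrite.

```python
def sum_rolled(values):
    """Baseline: one addition and one loop test per element."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled4(values):
    """Same sum with the body unrolled by 4: one loop test per four elements."""
    n = len(values)
    total = 0
    main_end = n - (n % 4)  # largest multiple of 4 not exceeding n
    for i in range(0, main_end, 4):
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    # Remainder loop handles the last n % 4 elements.
    for i in range(main_end, n):
        total += values[i]
    return total
```

The unrolled version has a longer body (the space cost) in exchange for fewer loop-control steps (the time saving), and it needs a remainder loop whenever the trip count is not a multiple of the unroll factor.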

A GEMM (general matrix-matrix multiplication) high-performance realization method based on a domestic SW 26010 many-core CPU

Active | CN107168683A | Solve the problem that the computing power of slave cores cannot be fully utilized | Improve performance | Register arrangements | Concurrent instruction execution | Function optimization | Assembly line
The invention provides a high-performance GEMM (general matrix-matrix multiplication) implementation method for the domestic SW 26010 many-core CPU. Based on the platform's storage structure, memory access behavior, hardware pipelines and register-level communication mechanism, the matrix partitioning and inter-core data mapping method is optimized and a top-down, three-level partitioned parallel block matrix multiplication algorithm is designed. A method for sharing data among slave-core computing resources is designed on top of the register-level communication mechanism, and a double-buffering strategy that overlaps computation with memory access is designed using the asynchronous DMA data transfer mechanism between master and slave cores. For a single slave core, a loop unrolling strategy and a software pipelining arrangement are designed, and the function is further optimized with an efficient register partitioning scheme, SIMD vectorization and multiply-add instructions. Compared with the single-core open-source BLAS math library GotoBLAS, the high-performance GEMM achieves an average speedup of 227.94 and a peak speedup of 296.93.
Owner:INST OF SOFTWARE - CHINESE ACAD OF SCI +1

Data receiver and semiconductor device including the data receiver

The invention is directed to data receivers such as those used in semiconductor devices. Embodiments of the invention provide a loop unrolling DFE receiver that uses analog control signals from each equalizer to avoid timing delays associated with the use of latched digital control signals in the conventional art. In addition, embodiments of the invention implement each equalizer with a single sense amplifier based flip flop (SAFF) to reduce circuit size and power consumption.
Owner:SAMSUNG ELECTRONICS CO LTD

High-speed accurate single-pattern character string matching method

The invention provides a high-speed exact single-pattern string matching method comprising a preprocessing phase and a search phase. The preprocessing phase has three main steps: preprocessing the pattern, preprocessing the text, and choosing the optimal matching action according to the matching conditions. The search phase performs the string matching and also has three main steps: Scan Loop, Match Loop and the subsequent decision action. The invention improves on SBNDM2, one of the fastest algorithms for matching in current English corpora, in the following ways: reducing the overhead of index bound checks by introducing an index bound protection mechanism; simplifying the algorithm by modifying the definitions of the bitmasks and bit vectors; and extending the loop unrolling mechanism of SBNDM2 to determine how to select the optimal loop-unrolling characters for different pattern lengths and corpora, which improves matching performance under different matching conditions. The method is a high-speed, bit-parallel exact single-pattern string matching method with high performance and a broad application range when the pattern length does not exceed the machine word length.
Owner:HARBIN ENG UNIV

Method of transforming variable loops into constant loops

A system and method for processing a variable looping statement into a constant looping statement to enable loop unrolling. A lower bound and an upper bound of the loop index within the variable looping statement are determined. A constant looping statement is then formed using the lower bound and upper bound to define a range over which the loop index varies within the constant looping statement. The constant looping statement further includes a conditional statement that reflects conditions in the initial expression and / or the exit expression of the variable looping statement. The conditional statement controls execution of the body of the generated constant looping statement, which includes the body from the original variable looping statement. Loop unrolling may then be performed on the generated constant looping statement.
Owner:ORACLE INT CORP +1
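
The idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not the patented implementation: the bounds `LOWER` and `UPPER`, the loop body (doubling an element) and all function names are hypothetical, and the loop index is assumed to always lie in `[LOWER, UPPER)`.

```python
LOWER, UPPER = 0, 8  # assumed known bounds on the loop index


def variable_loop(data, start, stop):
    """Original loop: trip count depends on the run-time values start and stop."""
    out = []
    for i in range(start, stop):
        out.append(data[i] * 2)
    return out


def constant_loop(data, start, stop):
    """Constant loop over [LOWER, UPPER); a guard reproduces the original range."""
    out = []
    for i in range(LOWER, UPPER):
        if start <= i < stop:
            out.append(data[i] * 2)
    return out


def constant_loop_unrolled(data, start, stop):
    """The constant trip count makes unrolling by 2 straightforward."""
    out = []
    for i in range(LOWER, UPPER, 2):
        if start <= i < stop:
            out.append(data[i] * 2)
        if start <= i + 1 < stop:
            out.append(data[i + 1] * 2)
    return out
```

The guard plays the role of the conditional statement described in the abstract: it carries the original initial and exit conditions, so the constant loop can be unrolled without knowing `start` and `stop` in advance.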

Runtime error analytical method based on abstract interpretation and model verification

The invention discloses a runtime error analysis method based on abstract interpretation and model verification. The method includes the following steps: based on abstract interpretation theory, the value ranges of the program's numerical variables are analyzed by forward iteration, and the value range information is obtained once the program points stabilize, with the iterative computation at loop nodes combining loop unrolling and delayed widening; according to the type of runtime error to be analyzed, the value range information at the program points to be checked is converted into assertions or assumptions and inserted into the program; the program with assertions or assumptions is converted into a Boolean formula comprising constraints and properties; and the properties in the Boolean formula are checked by a SAT solver. If they hold, the corresponding runtime errors do not exist; if not, the runtime errors exist and the relevant counterexample paths are output. The method strikes a balance between the precision and the efficiency of runtime error analysis.
Owner:China Aerospace Academy of Systems Science and Engineering (中国航天系统科学与工程研究院)
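
To make "delayed widening" concrete, here is a small sketch of interval analysis for the single loop `i = 0; while i < bound: i += 1`. It is a textbook-style illustration under stated assumptions, not the patented method: the function names and the `delay` parameter are hypothetical, and only the interval of `i` at the loop head is tracked.

```python
import math


def join(a, b):
    """Smallest interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def widen(old, new):
    """Standard interval widening: any bound that grows jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)


def analyze_counter_loop(bound, delay):
    """Interval of i at the head of: i = 0; while i < bound: i += 1.

    For the first `delay` iterations only joins are used (precise but slow);
    after that, widening forces termination at the cost of precision.
    """
    head = (0, 0)
    step = 0
    while True:
        # Apply the guard i < bound (integers, so i <= bound - 1), then i += 1.
        guarded_hi = min(head[1], bound - 1)
        if guarded_hi < head[0]:
            new = head  # the loop body is unreachable
        else:
            new = join((0, 0), (head[0] + 1, guarded_hi + 1))
        new = join(head, new)
        if step >= delay:
            new = widen(head, new)
        if new == head:  # fixed point reached
            return head
        head = new
        step += 1
```

With `bound=5`, a delay of 10 lets the iteration stabilize at the exact interval `(0, 5)`, while a delay of 0 widens immediately to `(0, inf)`. This is the precision/efficiency trade-off the abstract refers to.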

High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform

The invention proposes a high-performance implementation method for the BLAS (Basic Linear Algebra Subprograms) level-3 function GEMM on the SW platform. For the domestic SW1600 platform, a three-layer "interface - driver - assembly kernel" code design framework is adopted, and architecture-specific techniques, including multiply-add instructions, loop unrolling, instruction reordering for software pipelining, SIMD (Single Instruction Multiple Data) vector operations and register blocking, are used for manual assembly-level optimization. This solves the problem that a compiler cannot sufficiently optimize the compute-intensive function GEMM and greatly improves function performance. Compared with the open-source BLAS math library GotoBLAS, the method achieves an average speedup of 4.72 and a peak speedup of 5.61.
Owner:INST OF SOFTWARE - CHINESE ACAD OF SCI
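
Two of the techniques named above, loop unrolling and register blocking, can be sketched together. The example below is an assumption-laden illustration in Python rather than SW1600 assembly: the function names are hypothetical, the tile size of 2x2 is arbitrary, and the row count of `a` and column count of `b` are assumed to be even.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_blocked_2x2(a, b):
    """Unroll the i and j loops by 2 and keep a 2x2 tile of c in local variables.

    The four accumulators stand in for registers: each loaded value of a and b
    is reused twice before being discarded.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(0, m, 2):
            c00 = c01 = c10 = c11 = 0
            for p in range(k):
                a0, a1 = a[i][p], a[i + 1][p]
                b0, b1 = b[p][j], b[p][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    return c
```

Each inner iteration performs four multiply-adds from four loads, instead of one multiply-add from two loads in the naive loop, which is the kind of reuse a hand-written kernel aims for.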

High-speed and agile encoder for variable strength long BCH codes

Agile BCH encoders are useful when the noise characteristics of the channel change, which requires the strength of the error-correcting BCH code to be variable. An agile encoder for a linear cyclic code such as a BCH code is an encoder that switches code strength (depth) relatively quickly in unit increments. The generator polynomial for the BCH code is provided in factored form. The number of factored polynomials (minimal polynomials) chosen by the system determines the strength of the BCH code. The strength can vary from a weak code to a strong code in unit increments without a penalty on the storage required for the factored polynomials. The BCH codeword is formed by a dividing network and a combining network. A special method is described that provides a trade-off between latency and throughput while simultaneously optimizing the delay in the critical path, which is in the forward path. Speed enhancements at the minimal polynomial level are also provided by retiming, loop unfolding, loop unrolling and special mathematical transformations. The invention can be implemented as an apparatus using software or hardware, or in integrated circuit form.
Owner:CHERUKURI RAGHUNATH
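
The relationship between code strength and the number of minimal polynomials can be sketched for the small length-15 binary BCH family. This is a conventional illustration, not the patent's dividing and combining networks: the factors are simply multiplied out and a single polynomial division is used. Polynomials over GF(2) are stored as integer bit masks, and the function names are hypothetical.

```python
def gf2_mul(p, q):
    """Carry-less product of two GF(2) polynomials stored as int bit masks."""
    result = 0
    while q:
        if q & 1:
            result ^= p
        p <<= 1
        q >>= 1
    return result


def gf2_mod(p, m):
    """Remainder of p divided by m over GF(2)."""
    deg_m = m.bit_length() - 1
    while p.bit_length() - 1 >= deg_m:
        p ^= m << (p.bit_length() - 1 - deg_m)
    return p


# Minimal polynomials for length-15 BCH codes:
# x^4 + x + 1, x^4 + x^3 + x^2 + x + 1, x^2 + x + 1.
MINIMAL_POLYS = [0b10011, 0b11111, 0b111]


def generator(strength):
    """Generator polynomial as the product of the first `strength` factors."""
    g = 1
    for m in MINIMAL_POLYS[:strength]:
        g = gf2_mul(g, m)
    return g


def encode(message, strength):
    """Systematic encoding: message bits followed by the division remainder."""
    g = generator(strength)
    parity_bits = g.bit_length() - 1
    shifted = message << parity_bits
    return shifted | gf2_mod(shifted, g)
```

Raising `strength` from 1 to 3 only adds one stored factor per step, which mirrors the abstract's point that strength can change in unit increments without extra storage for each full generator polynomial.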

Method for optimizing finite difference algorithm in heterogeneous many-core framework

Inactive | CN106020773A | Implement and optimize parallel computing | Solve low computing performance | Register arrangements | Concurrent instruction execution | Extensibility | Analysis data
The invention belongs to the technical field of high-performance computing and relates to a method for optimizing a finite difference algorithm on a heterogeneous many-core architecture. The method optimizes the finite difference algorithm on a hybrid heterogeneous high-performance computer system based on many-core accelerators (MIC) and multi-core general-purpose processors (CPU) using three progressive optimization methods: a basic optimization method, a parallel optimization method and a heterogeneous collaborative optimization method. The beneficial effects are as follows: the three progressive methods address the low computing performance and poor parallel efficiency caused by strided access and a lack of parallel execution when porting the finite difference algorithm from a many-core system to a heterogeneous many-core system; the method is efficient and scalable, reducing the computational strength and clearing obstacles to vectorization through basic optimizations such as branch elimination, loop unrolling and invariant switching; and in the parallel optimization method, the core algorithm is rewritten with a vector instruction set after analyzing data dependences and partitioning loops, so that the multi-threading and long-vector mechanisms of the many-core processor are fully used.
Owner:THE PLA INFORMATION ENG UNIV +2

Loop-unrolled AES encryption/decryption circuit based on a data redundancy real-time error detection mechanism

The invention discloses an AES encryption/decryption circuit with a loop-unrolled structure based on a data redundancy real-time error detection mechanism, used to resist fault injection attacks or to improve circuit reliability in extreme application environments. The circuit comprises two parts, an AES encryption/decryption unit and a detecting unit. The AES encryption/decryption unit adopts the loop-unrolled structure and is formed by Nk round transformation units and a selector; the detecting unit is composed of Nk comparators. During data processing, the AES encryption/decryption unit applies data redundancy: two adjacent round transformation units perform the same operation on each group of data twice, and the comparators in the detecting unit compare the results of the two operations. If the results are the same, the AES encryption/decryption unit is working normally; if they differ, an error has occurred. Compared with the conventional structural redundancy error detection mechanism, the data redundancy error detection mechanism can greatly reduce circuit area.
Owner:NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

Complex matrix optimizing method

The invention discloses a complex matrix optimizing method comprising the following steps. First, the unrolling granularity specific to the Godson architecture is calculated, four-by-four loop unrolling is applied to the complex matrix, and a maximum value is selected as the matrix block size nb so as to obtain the ideal block size on the Godson, wherein the maximum value of nb is smaller than 52 and the product of 24 and the square of nb is smaller than the 64-kilobyte capacity of the level-1 data cache of the Godson processor. Second, the matrices in a matrix multiplication are divided and combined appropriately by exploiting the continuity and locality of data storage, reducing the number of accesses to the Godson's level-1 data cache. Third, the common complex addition and multiplication in complex matrix operations are carried out using the multiplication of two complex numbers from the classic complex algorithm so as to reduce the operation count. As a result, the performance of complex matrix multiplication on the Godson is improved by about 50%, and the operating rate of BLAS (Basic Linear Algebra Subprograms) on the Godson 3A is increased by more than 1.5 times.
Owner:UNIV OF SCI & TECH OF CHINA

Compiling method and compiler

Inactive | CN101452394A | Take advantage of parallelism | Lighten the programming burden | Concurrent instruction execution | Memory systems | Parallel computing | Loop unrolling
The invention relates to a compilation method and a compiler. The compilation method comprises: identifying a loop containing first instructions, wherein the loop has a definite control parameter and contains no branch instructions, and none of the first instructions has a loop-carried dependence; counting the first instructions and second instructions in the loop, and calculating, according to the execution capability of a first instruction execution unit and a second instruction execution unit, the loop unrolling factor and the number of iterations whose first instructions are to be converted into second instructions; and, when the loop unrolling factor is not equal to one, unrolling the loop and converting the first instructions in the unrolled loop into the corresponding second instructions according to that number. The compilation method and the compiler make full use of the parallelism of the instruction execution units in a processor, increasing program execution efficiency and reducing the programming burden on the user.
Owner:JIANGNAN INST OF COMPUTING TECH

High-throughput SHA-1 (Secure Hash Algorithm) based on FPGA

The invention provides a high-throughput SHA-1 (Secure Hash Algorithm) implementation based on an FPGA. The method comprises the steps of: S1, judging whether the length of the input message data exceeds 512 bits; S2, if it does, padding the message data until its length is an integer multiple of 512 bits; S3, segmenting the padded message data into multiple 512-bit data blocks and segmenting each data block into 16 words of 32 bits each; S4, applying loop unrolling to the original iteration formula, thereby forming a loop-unrolled structure; S5, determining the number of pipeline stages and forming a pipeline structure from intermediate registers and the loop-unrolled structure; and S6, feeding each word into the pipeline structure to obtain the SHA-1 result. The iteration is simplified and an intermediate variable is added, so the critical path is shortened and the calculation speed is improved. Moreover, pipelined processing increases the amount of data processed and improves throughput.
Owner:SHENZHEN FORWARD IND CO LTD
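
As a software analogue of step S4 above, the sketch below computes SHA-1 with the 80-round loop unrolled by a factor of two, so that each loop iteration evaluates two consecutive rounds. It is an illustration only: it uses the standard SHA-1 padding and round definitions rather than the patent's FPGA datapath, the unroll factor is an assumption, and the helper names are hypothetical.

```python
import struct

MASK = 0xFFFFFFFF


def rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def f_and_k(t, b, c, d):
    """Round function value and constant for round t (standard SHA-1)."""
    if t < 20:
        return (b & c) | (~b & d), 0x5A827999
    if t < 40:
        return b ^ c ^ d, 0x6ED9EBA1
    if t < 60:
        return (b & c) | (b & d) | (c & d), 0x8F1BBCDC
    return b ^ c ^ d, 0xCA62C1D6


def sha1_unrolled2(message):
    """SHA-1 hex digest with the round loop unrolled by 2."""
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack(">Q", len(message) * 8)
    for off in range(0, len(padded), 64):
        # 16 big-endian 32-bit words, expanded to 80 schedule words.
        w = list(struct.unpack(">16I", padded[off:off + 64]))
        for t in range(16, 80):
            w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = h
        for t in range(0, 80, 2):
            # Rounds t and t+1 share f and k because stage boundaries are even.
            f1, k = f_and_k(t, b, c, d)
            t1 = (rotl(a, 5) + f1 + e + k + w[t]) & MASK
            b30 = rotl(b, 30)
            f2, _ = f_and_k(t, a, b30, c)
            t2 = (rotl(t1, 5) + f2 + d + k + w[t + 1]) & MASK
            a, b, c, d, e = t2, t1, rotl(a, 30), b30, c
        h = [(x + y) & MASK for x, y in zip(h, (a, b, c, d, e))]
    return "".join(f"{x:08x}" for x in h)
```

In hardware, the two rounds inside one iteration would be computed in a single clock cycle, with intermediate registers between unrolled groups forming the pipeline described in step S5.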

Unrolling loops with partial hot traces

Methods and apparatus are disclosed for improved loop unrolling by a compiler. A large class of loops exists for which effective loop unrolling has not previously been performed because they are too large to be completely unrolled, but which do not have a single hot trace that covers an entire loop iteration. The present invention recognizes such loops that have partial hot traces identified using profile data. A set of instructions which constitute a proper superset of the hot trace and a proper subset of the entire loop, and which forms a complete loop iteration is identified. This set of instructions can then be unrolled without unrolling the entire loop.
Owner:INT BUSINESS MASCH CORP

Multi-rate multi-code length LDPC code decoding method based on SIMD instruction set

The invention provides a multi-rate, multi-code-length LDPC code decoding method based on a SIMD instruction set. The method comprises the following steps: exploiting the quasi-cyclic structure of the LDPC base matrix, check matrix information is gathered online from an external configuration file; using a fixed-point layered decoding scheme, the decoder constructs a dedicated check node computing unit for each row weight and selects the check node computing unit according to the row weight, wherein loop unrolling is used inside the check node computing unit. Gathering the check matrix information online reduces the storage required by the multi-rate, multi-code-length LDPC decoder. Compared with existing algorithms, the dependence on matrix statistics is eliminated and the implementation complexity of the decoder is reduced, speed loss is avoided, and the online computation module can be modified, which makes the method general.
Owner:SOUTHEAST UNIV

Optimization of floating point complex vector summation based on BWDSP chips

The invention belongs to the field of low-level function optimization for digital signal processors and discloses an optimization of floating-point complex vector summation for the high-performance general-purpose signal processor BWDSP. The floating-point complex vector summation adds a first floating-point complex vector to a second floating-point complex vector, which amounts to a loop of repeated floating-point complex additions. Each floating-point complex addition is optimized in three ways: instruction-level parallel optimization based on the BWDSP chip, in which one instruction controls more than one operation unit to execute the same operation simultaneously; loop-based optimization, in which the same loop code is repeated several times within one loop; and software pipelining, in which several executions of the same loop code are interleaved in parallel. The hardware resources of the BWDSP chip can thus be fully used to obtain efficient low-level functions.
Owner:XIDIAN UNIV +1

Constant-temperature instruction level self-testing method for testing time delay faults in inner heating manner

The invention relates to a constant-temperature instruction-level self-test method for testing delay faults by internal heating, in which a high-temperature delay test is performed on a processor. The method comprises the following steps: an original instruction-level self-test program module is obtained; the module is transformed by loop unrolling; the module is transformed to trigger cache misses; a feasible schedule is obtained within a set test temperature interval using a constant-temperature test program scheduling algorithm; and the processor is heated to the lower bound of the test temperature interval, the corresponding programs are executed according to the feasible schedule, and internally heated constant-temperature tests are performed for delay faults. Compared with the prior art, the method can test delay faults effectively under high-temperature conditions, guarantees high fault coverage and reduces wear on the processor.
Owner:TONGJI UNIV

Method and apparatus for efficiently processing array operation in computer system

An apparatus and a method for processing an array in a loop in a computer system, including: applying loop unrolling to a multi-dimensional array included in a loop based on a predetermined unrolling factor to generate a plurality of unrolled multi-dimensional arrays; and transforming each of the plurality of unrolled multi-dimensional arrays into a one-dimensional array having an array subscript expression in a form of an affine function with respect to a loop counter variable.
Owner:SAMSUNG ELECTRONICS CO LTD
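
A minimal sketch of the two steps in the abstract above, under stated assumptions: a fixed-size `ROWS` x `COLS` array stored in row-major order, an unroll factor of 2, and hypothetical function names. After unrolling, each copy of the access `grid[i][col]` becomes a one-dimensional access whose subscript is an affine function of the loop counter `i`.

```python
ROWS, COLS = 4, 3  # assumed fixed array shape; ROWS is even


def column_sum_2d(grid, col):
    """Original loop over a two-dimensional array."""
    total = 0
    for i in range(ROWS):
        total += grid[i][col]
    return total


def column_sum_flat_unrolled(flat, col):
    """Unrolled by 2 over the row-major flattening of the same array.

    Both subscripts have the affine form (stride * i + offset):
      first copy:  COLS * i + col
      second copy: COLS * i + (COLS + col)
    """
    total = 0
    for i in range(0, ROWS, 2):
        total += flat[COLS * i + col]
        total += flat[COLS * i + (COLS + col)]
    return total
```

Keeping each subscript in the form `stride * i + offset` is what lets later passes treat the access as a simple strided walk through memory.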

Successive approximation register (SAR) analog to digital converter (ADC) with partial loop-unrolling

A receiver system that includes an ADC for converting analog values to digital representations. A digital representation is a sum of discrete values, some of which are non-binary scaled while the others are binary scaled. The ADC includes dedicated comparators to determine whether to add or to subtract the non-binary scaled values. A comparator is used to determine whether to add or to subtract the binary scaled values. The ADC further calibrates offset voltages of the comparators to substantially remove dead zone and conversion errors, without compromising the conversion speed. The calibration can be performed both in foreground and background.
Owner:INPHI

Data processing method, device and equipment and computer storage medium

The invention provides a data processing method, apparatus and device, and a computer storage medium. The method comprises the steps of: obtaining an intermediate representation of a deep learning model; determining a loop unrolling factor, wherein the loop unrolling factor is related to information about the execution of the intermediate representation on the back-end hardware device and/or device information of the back-end hardware device; unrolling loops in the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation; and compiling the optimized intermediate representation to obtain target code executable by the back-end hardware device, so that the back-end hardware device executes the target code to realize its function. With the embodiments of the invention, the loop unrolling factor can be calculated from the execution information and/or the device information of the back-end hardware device, giving a more accurate factor; unrolling the intermediate representation with this factor allows instruction scheduling over a larger range and improves the portability of the intermediate representation.
Owner:PHYTIUM TECH CO LTD

Method for performing loop unrolled decision feedback equalization in an electronic device

A method for performing loop unrolled decision feedback equalization (DFE) and an associated apparatus are provided. The method includes: receiving a tap control signal and an offset control signal from a digital domain of a DFE receiver in an electronic device, and generating DFE information respectively corresponding to the tap control signal and the offset control signal in an analog domain of the DFE receiver; broadcasting the DFE information respectively corresponding to the tap control signal and the offset control signal toward comparators in the DFE receiver; utilizing the comparators to perform comparison operations according to the DFE information respectively corresponding to the tap control signal and the offset control signal to generate comparison results; and selectively adjusting the tap control signal and the offset control signal according to the comparison results, to optimize the DFE information respectively corresponding to the tap control signal and the offset control signal, respectively. The method and the apparatus can adaptively adjust equalization information input to the comparators in the receiver such as the DFE receiver.
Owner:MEDIATEK INC

An accelerator structure and loop unrolling method for binarized neural network

The invention discloses an accelerator structure and a loop unrolling method for binarized neural networks. For a hardware accelerator with 1-bit weights and n-bit feature values, the invention includes the hardware structure design of the accelerator, a loop unrolling structure optimized for binarized neural networks, and a storage order for weights and feature values in SRAM. The hardware structure includes SRAM for storing weights and feature values, a dedicated convolution calculation module and an adder tree unit. The dedicated convolution module implements a new convolution calculation method, and the adder tree keeps the data flowing in a pipeline. Combining the loop unrolling method with the accumulator gives the accelerator very good scalability: the block size K can be chosen freely according to the complexity of the network and the available hardware resources without changing the control logic of the circuit. To complement this loop unrolling method, the invention also proposes a storage order for weights and feature values that simplifies the access logic.
Owner:XI AN JIAOTONG UNIV +1

A fixed-temperature command-level self-test method for detecting time-delay faults by internal temperature rise

The invention relates to a fixed-temperature command-level self-test method for detecting time-delay faults by internal temperature rise, in which a high-temperature time-delay test is performed on a processor. The method comprises the following steps: obtaining the original command-level self-test program module; transforming the original command-level self-test program module by loop unrolling; transforming the original command-level self-test program module to trigger cache misses; within the set test temperature range, using the fixed-temperature test program scheduling algorithm to obtain a feasible schedule; and heating the processor to the lower bound of the test temperature range, executing the corresponding program according to the feasible schedule, and carrying out the internally heated constant-temperature test for delay faults. Compared with the prior art, the invention can effectively test delay faults under high-temperature conditions, ensures high fault coverage and reduces processor wear.
Owner:TONGJI UNIV

A method to reduce register overflow caused by fine-grained randomization security optimization

Active | CN109240699B | Reduce overflow | Improve loop optimization | Code compilation | Algorithm | Loop optimization
The invention discloses a method for reducing register overflow (spilling) caused by fine-grained randomization security optimization and relates to the technical field of compiler loop optimization. First, the variables held in registers in the loop body are classified into loop invariants, loop induction variables and loop variants; after classification, the variables in registers in the loop body are identified; finally, the loop unrolling factor is obtained from the numbers of loop invariants, loop induction variables and loop variants identified in the loop body's registers. The register-pressure-sensitive loop unrolling method proposed by the invention can improve the effect of loop optimization to a certain extent and reduce register overflow. In addition, for randomization optimization, hot code (generally the loop body) is more sensitive to the performance overhead that randomization introduces, so improving the loop unrolling optimization can also improve the performance of fine-grained randomization security optimization.
Owner:Guangdong Zhongke Shishu Technology Co., Ltd. (广东中科实数科技有限公司)
