46 results about "Loop unrolling" patented technology

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler.
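As a minimal sketch of the transformation described above, the following compares a plain loop with a manually unrolled version. The function names and the unroll factor of 4 are illustrative assumptions, and in Python the sketch only shows the shape of the rewrite, not a realistic speed-up.

```python
def sum_rolled(values):
    """Baseline: one addition and one loop test per element."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled_by_4(values):
    """Same result, with the loop body replicated four times per iteration."""
    total = 0
    n = len(values)
    limit = n - (n % 4)  # largest multiple of 4 that fits in n
    i = 0
    while i < limit:
        # Four copies of the body: fewer loop tests, more code.
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Remainder loop handles the 0 to 3 leftover elements.
    while i < n:
        total += values[i]
        i += 1
    return total
```

The remainder loop is what keeps the unrolled version correct when the element count is not a multiple of the unroll factor.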

Compiler apparatus and method for optimizing loops in a computer program

Inactive | US6938249B2 | Minimize time | Software engineering | Hardware monitoring | Loop optimization | Loop unrolling

A profile-based loop optimizer generates an execution frequency table for each loop that gives more detailed profile data that allows making a more intelligent decision regarding if and how to optimize each loop in the computer program. The execution frequency table contains entries that correlate a number of times a loop is executed each time the loop is entered with a count of the occurrences of each number during the execution of an instrumented instruction stream. The execution frequency table is used to determine whether there is one dominant mode that appears in the profile data, and if so, optimizes the loop according to the dominant mode. The optimizer may perform optimizations by peeling a loop, by unrolling a loop, and by performing both peeling and unrolling on a loop according to the profile data in the execution frequency table for the loop. In this manner the execution time of the resulting code is minimized according to the detailed profile data in the execution frequency tables, resulting in a computer program with loops that are more fully optimized.

Owner:INTELLECTUAL DISCOVERY INC
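The execution frequency table in the abstract above can be pictured as a histogram of trip counts per loop entry. The sketch below is one hedged reading of that idea, not the patented method: the 0.8 dominance threshold, the `max_peel` cutoff, and all function names are assumptions made for illustration.

```python
from collections import Counter


def build_frequency_table(trip_counts):
    """Map each observed trip count to how often it occurred."""
    return Counter(trip_counts)


def dominant_mode(table, threshold=0.8):
    """Return the trip count covering at least `threshold` of entries, else None."""
    total = sum(table.values())
    if total == 0:
        return None
    trips, hits = table.most_common(1)[0]
    return trips if hits / total >= threshold else None


def choose_strategy(table, max_peel=4):
    """Hypothetical policy: peel short dominant loops, unroll long ones."""
    mode = dominant_mode(table)
    if mode is None:
        return "leave"  # no single dominant mode in the profile
    if mode <= max_peel:
        return "peel"
    return "unroll"
```

For example, a loop that ran 3 iterations on four of five entries has a dominant mode of 3 and would be peeled under these assumed thresholds.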

A GEMM (general matrix-matrix multiplication) high-performance realization method based on a domestic SW 26010 many-core CPU

Active | CN107168683A | Solve the problem that the computing power of slave cores cannot be fully utilized | Improve performance | Register arrangements | Concurrent instruction execution | Function optimization | Assembly line

The invention provides a high-performance GEMM (general matrix-matrix multiplication) implementation method for the domestic SW 26010 many-core CPU. Based on the platform characteristics of the SW 26010 (storage structure, memory access, hardware pipelines and register-level communication mechanism), the matrix partitioning and inter-core data mapping method is optimized and a top-down three-level partitioned parallel block matrix multiplication algorithm is designed; a data sharing method for slave-core computing resources is designed based on the register-level communication mechanism, and a double-buffering strategy that overlaps computation with memory access is designed using the asynchronous DMA data transfer mechanism between master and slave cores; for a single slave core, a loop unrolling strategy and a software pipelining arrangement are designed; function optimization is achieved with an efficient register blocking scheme and SIMD vectorized multiply-add instructions. Compared with the single-core open-source BLAS math library GotoBLAS, the high-performance GEMM achieves an average speed-up ratio of 227.94 and a highest speed-up ratio of 296.93.

Owner:INST OF SOFTWARE - CHINESE ACAD OF SCI +1

Data receiver and semiconductor device including the data receiver

Active | US20080089155A1 | Avoid time delay | Reduce circuit size | Current/voltage measurement | Equalisers | Audio power amplifier | Control signal

The invention is directed to data receivers such as those used in semiconductor devices. Embodiments of the invention provide a loop unrolling DFE receiver that uses analog control signals from each equalizer to avoid timing delays associated with the use of latched digital control signals in the conventional art. In addition, embodiments of the invention implement each equalizer with a single sense amplifier based flip flop (SAFF) to reduce circuit size and power consumption.

Owner:SAMSUNG ELECTRONICS CO LTD
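For readers unfamiliar with the term, the general idea of a loop-unrolled decision feedback equalizer (DFE) can be sketched in software. This is a behavioral model of a one-tap DFE with symbols of +1 and -1, written under assumed names and values; it does not model the analog control signals or the SAFF circuits the abstract describes.

```python
def dfe_direct(samples, tap, first_prev=1):
    """Conventional feedback: each decision waits for the previous one."""
    prev = first_prev
    decisions = []
    for x in samples:
        d = 1 if x - tap * prev >= 0 else -1
        decisions.append(d)
        prev = d
    return decisions


def dfe_unrolled(samples, tap, first_prev=1):
    """Loop-unrolled form: both candidate decisions are computed up front."""
    # These two comparisons do not depend on the previous decision,
    # so the feedback subtraction leaves the critical path.
    if_prev_plus = [1 if x - tap >= 0 else -1 for x in samples]
    if_prev_minus = [1 if x + tap >= 0 else -1 for x in samples]
    prev = first_prev
    decisions = []
    for plus, minus in zip(if_prev_plus, if_prev_minus):
        d = plus if prev == 1 else minus  # only a selection remains in the loop
        decisions.append(d)
        prev = d
    return decisions
```

Both functions produce the same decisions; the unrolled form trades a second comparator per tap for a shorter feedback path.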

High-speed accurate single-pattern character string matching method

Inactive | CN101609455A | Little impact on performance | Improve performance | Special data processing applications | Protection mechanism | Theoretical computer science

The invention provides a high-speed exact single-pattern string matching method comprising a preprocessing phase and a search phase. The preprocessing phase comprises three main steps: preprocessing the pattern, preprocessing the text, and selecting the optimal matching action according to the matching conditions. The search phase is the string matching process and comprises three main steps: the Scan Loop, the Match Loop and the subsequent decision action. The invention builds on the SBNDM2 algorithm, one of the fastest methods for matching in current English corpora, with the following improvements: reducing the cost of index bound checks by introducing an index bound protection mechanism; simplifying the algorithm by modifying the definitions of the bitmasks and bit vectors; and extending the loop unrolling mechanism of SBNDM2 to determine how to select the optimal number of loop-unrolled characters for different pattern lengths and different corpora, improving matching performance under different matching conditions. The method is a high-speed, bit-parallel, exact single-pattern string matching method with high performance and a broad application range when the pattern length does not exceed the machine word length.

Owner:HARBIN ENG UNIV

Method of transforming variable loops into constant loops

Inactive | US6988266B2 | Without risk | Limited value range | Software engineering | Program control | Lower limit | Theoretical computer science

A system and method for processing a variable looping statement into a constant looping statement to enable loop unrolling. A lower bound and an upper bound of the loop index within the variable looping statement are determined. A constant looping statement is then formed using the lower bound and upper bound to define a range over which the loop index varies within the constant looping statement. The constant looping statement further includes a conditional statement that reflects conditions in the initial expression and/or the exit expression of the variable looping statement. The conditional statement controls execution of the body of the generated constant looping statement, which includes the body from the original variable looping statement. Loop unrolling may then be performed on the generated constant looping statement.

Owner:ORACLE INT CORP +1
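The transformation in the abstract above can be illustrated with a small sketch. It assumes that the loop index is known to stay within the hypothetical constants `LOWER_BOUND` and `UPPER_BOUND`; the function names and the bound values are illustrative and are not taken from the patent.

```python
LOWER_BOUND = 0  # assumed smallest value the start expression can take
UPPER_BOUND = 8  # assumed largest value the exit expression can take


def variable_loop(start, stop, body):
    """Original form: the trip count depends on run-time values."""
    for i in range(start, stop):
        body(i)


def constant_loop(start, stop, body):
    """Constant form: fixed range, with a guard carrying the original conditions."""
    for i in range(LOWER_BOUND, UPPER_BOUND):
        # The conditional reflects the initial and exit expressions
        # of the variable loop and controls execution of the body.
        if start <= i < stop:
            body(i)
```

Because `constant_loop` always runs from `LOWER_BOUND` to `UPPER_BOUND`, its trip count is fixed and it could be fully unrolled into eight guarded copies of the body.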

Runtime error analytical method based on abstract interpretation and model verification

Active | CN103617115A | Shrink state space | Improve inspection efficiency | Software testing/debugging | Hypothesis | Iterative methodology

The invention discloses a runtime error analysis method based on abstract interpretation and model verification. The method includes the following steps: on the basis of abstract interpretation theory, the value ranges of the program's numerical variables are analyzed by forward iteration, the value range information is obtained once the program points are stable, and the iterative computation at loop nodes combines loop unrolling with delayed widening; according to the type of runtime error to be analyzed, the value range information at the relevant program points is converted into assertions or assumptions and inserted into the program; the program with assertions or assumptions is converted into a Boolean formula comprising constraints and properties; the correctness of the properties in the Boolean formula is checked with a SAT verifier. If the properties hold, the relevant runtime errors do not exist; if they do not hold, the relevant runtime errors exist and the corresponding counterexample paths are output. The method strikes a balance between the precision and the efficiency of runtime error analysis.

Owner:China Aerospace Academy of Systems Science and Engineering (中国航天系统科学与工程研究院)
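To make delayed widening concrete, here is a minimal interval-analysis sketch for the counter loop `i = 0; while i < bound: i += 1`. It shows only the delayed-widening part on an interval domain, with plain joins during the first `delay` passes; the function names, the fixed example loop, and the `delay` parameter are assumptions, and the patent's combination with loop unrolling and SAT checking is not modeled.

```python
import math


def join(a, b):
    """Smallest interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def widen(old, new):
    """Standard interval widening: unstable bounds jump to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)


def analyze_counter_loop(bound, delay):
    """Interval of i at the loop head, widening only after `delay` passes."""
    head = (0, 0)  # i = 0 on loop entry
    iteration = 0
    while True:
        guarded_hi = min(head[1], bound - 1)  # effect of the guard i < bound
        if head[0] > guarded_hi:
            new_head = head  # loop body unreachable
        else:
            body_out = (head[0] + 1, guarded_hi + 1)  # effect of i += 1
            new_head = join(head, body_out)
        if iteration >= delay:
            new_head = widen(head, new_head)
        if new_head == head:
            return head  # program point is stable
        head = new_head
        iteration += 1
```

With a large enough `delay` the analysis finds the exact range of `i`; with immediate widening it gives up precision on the upper bound in exchange for faster convergence.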

High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform

Active | CN105808309A | Improve performance | Reduce consumption | Program control | Memory systems | Three level | Open source

The invention proposes a high-performance implementation method of the BLAS (Basic Linear Algebra Subprograms) level-3 function GEMM on the SW platform. For the domestic SW1600 platform, a three-layer "interface, driver, assembly kernel" code design framework is adopted, and architecture-specific techniques including multiply-add instructions, loop unrolling, software-pipelined instruction rearrangement, SIMD (Single Instruction Multiple Data) vector operations and register blocking are used to perform manual assembly-level optimization. This solves the problem that a compiler cannot sufficiently optimize the compute-intensive function GEMM, and function performance is greatly improved. Compared with the open-source BLAS math library GotoBLAS, the method achieves an average speed-up ratio of 4.72 and a highest speed-up ratio of 5.61.

Owner:INST OF SOFTWARE - CHINESE ACAD OF SCI
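Register blocking combined with loop unrolling, two of the techniques named in the abstract above, can be sketched in a few lines. The 2x2 block size and the function names are assumptions chosen for readability; the patented kernel is hand-written assembly for the SW1600 and is not reproduced here.

```python
def matmul_reference(a, b):
    """Plain triple loop, used only to check the blocked version."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_2x2_blocked(a, b):
    """Compute C = A * B with a 2x2 block of C held in local accumulators."""
    n, k, m = len(a), len(b), len(b[0])
    if n % 2 or m % 2:
        raise ValueError("this sketch assumes even row and column counts")
    c = [[0] * m for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(0, m, 2):
            # Four accumulators stand in for registers.
            c00 = c01 = c10 = c11 = 0
            for p in range(k):
                a0, a1 = a[i][p], a[i + 1][p]
                b0, b1 = b[p][j], b[p][j + 1]
                # The i and j loops are unrolled by 2: four multiply-adds
                # reuse each loaded value of A and B twice.
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    return c
```

Each pass of the inner loop loads four values and performs four multiply-adds, whereas the plain loop loads two values per multiply-add.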

High-speed and agile encoder for variable strength long BCH codes

Inactive | US20110185265A1 | No additional cost | Minimal cost | Code conversion | Cyclic codes | Theoretical computer science | Trade offs

Agile BCH encoders are useful when the noise characteristics of the channel change, which requires the strength of the error-correcting BCH code to be variable. An agile encoder for a linear cyclic code such as a BCH code is an encoder that switches code strength (depth) relatively quickly in unit increments. The generator polynomial for the BCH code is provided in factored form. The number of factored polynomials (minimal polynomials) chosen by the system determines the strength of the BCH code. The strength can vary from a weak code to a strong code in unit increments without a penalty on the storage required for the factored polynomials. The BCH codeword is formed by a dividing network and a combining network. A special method is described that provides a trade-off mechanism between latency and throughput while simultaneously optimizing the delay in the critical path, which is in the forward path. Speed enhancements at the minimal polynomial level are also provided by retiming, loop unfolding, loop unrolling, and special mathematical transformations. The invention can be implemented as an apparatus using software or hardware, or in integrated circuit form.

Owner:CHERUKURI RAGHUNATH
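The factored-generator idea in the abstract above can be shown with a toy example. The sketch uses the well-known minimal polynomials for length-15 binary BCH codes as illustrative constants (they are not from the patent), represents GF(2) polynomials as integer bitmasks, and does not model the dividing and combining networks or any of the speed enhancements.

```python
def gf2_poly_mul(a, b):
    """Multiply two GF(2) polynomials stored as bitmasks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division."""
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        dividend ^= divisor << (dividend.bit_length() - width)
    return dividend


# Minimal polynomials for length-15 BCH codes: x^4+x+1, x^4+x^3+x^2+x+1, x^2+x+1.
MINIMAL_POLYS = [0b10011, 0b11111, 0b111]


def generator(strength):
    """Generator polynomial as the product of the first `strength` factors."""
    g = 1
    for m in MINIMAL_POLYS[:strength]:
        g = gf2_poly_mul(g, m)
    return g


def encode_systematic(message, strength):
    """Append parity bits so the codeword is divisible by the generator."""
    g = generator(strength)
    parity_bits = g.bit_length() - 1
    shifted = message << parity_bits
    return shifted | gf2_poly_mod(shifted, g)
```

Raising `strength` by one only multiplies in one more stored factor, which is the sense in which code strength changes in unit increments without storing a separate generator per strength.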

Data receiver and semiconductor device including the data receiver

Active | US7701257B2 | Avoid time delay | Reduce power consumption | Current/voltage measurement | Equalisers | Audio power amplifier | Control signal

The invention is directed to data receivers such as those used in semiconductor devices. Embodiments of the invention provide a loop unrolling DFE receiver that uses analog control signals from each equalizer to avoid timing delays associated with the use of latched digital control signals in the conventional art. In addition, embodiments of the invention implement each equalizer with a single sense amplifier based flip flop (SAFF) to reduce circuit size and power consumption.

Owner:SAMSUNG ELECTRONICS CO LTD

Method for optimizing finite difference algorithm in heterogeneous many-core framework

Inactive | CN106020773A | Implement and optimize parallel computing | Solve low computing performance | Register arrangements | Concurrent instruction execution | Extensibility | Analysis data

The invention belongs to the technical field of high-performance computing and relates to a method for optimizing a finite difference algorithm in a heterogeneous many-core framework. The method optimizes the finite difference algorithm on a hybrid heterogeneous high-performance computer system based on many-core accelerators (MIC) and multi-core general-purpose processors (CPU) by using three progressive optimization methods: a basic optimization method, a parallel optimization method and a heterogeneous collaborative optimization method. The beneficial effects of the method are as follows: the three progressive optimization methods address the low computing performance and poor parallel behavior caused by strided memory access and a lack of parallel execution when converting the finite difference algorithm from a many-core system to a heterogeneous many-core system; the method is efficient and scalable, reducing computational strength and clearing obstacles to vectorization through basic optimizations such as branch elimination, loop unrolling and invariant switching; and in the parallel optimization method, the core algorithm is rewritten with a vector instruction set after analyzing data dependencies and partitioning loops, so that the multi-threading and long-vector mechanisms of the many-core processor are fully utilized.

Owner:THE PLA INFORMATION ENG UNIV +2

Circulating-unfolded-structured AES encryption/decryption circuit based on data redundancy real-time error detection mechanism

Active | CN104158652A | Reduce area | Avoid transmission | Error prevention | Encryption apparatus with shift registers/memories | Time error | Circuit reliability

The invention discloses an AES encryption/decryption circuit with a loop-unrolled (circulating-unfolded) structure based on a data redundancy real-time error detection mechanism, used for resisting fault injection attacks or for improving circuit reliability in extreme application environments. The circuit comprises two parts, an AES encryption/decryption unit and a detecting unit. The AES encryption/decryption unit adopts the loop-unrolled structure and is formed by Nk round transformation units and an alternative selector; the detecting unit is composed of Nk comparators. During data processing, the AES encryption/decryption unit applies data redundancy: two adjacent round transformation units perform the same operation on each group of data twice, and the comparators in the detecting unit compare the results of the two operations. If the results are the same, the AES encryption/decryption unit is working normally; if the results differ, an error has occurred in the unit. Compared with the conventional structural redundancy error detection mechanism, the data redundancy error detection mechanism greatly reduces the circuit area.

Owner:NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

Complex matrix optimizing method

Inactive | CN102722472A | Improve computing power | Improve computing efficiency | Complex mathematical operations | Cache access | Granularity

The invention discloses a complex matrix optimizing method comprising the following steps: first, calculating the unrolling granularity specific to the Godson architecture, applying four-by-four loop unrolling to the complex matrix, and selecting the maximum value as the matrix partitioning block size nb so as to obtain the ideal block size on the Godson, wherein the maximum value of nb is smaller than 52 and the product of 24 and the square of nb is smaller than the 64-kilobyte capacity of the first-level data cache of the Godson processor; second, dividing and combining the matrices in a matrix multiplication in a way that exploits the continuity and locality of data storage, reducing the number of accesses to the Godson's first-level data cache; and third, carrying out the complex additions and multiplications common in complex matrix operations using the two-complex multiplication of the classic complex algorithm so as to reduce the operation count. As a result, the performance of complex matrix multiplication on the Godson is enhanced by about 50%, and the operating rate of BLAS (Basic Linear Algebra Subprograms) based on the Godson 3A is increased by more than 1.5 times.

Owner:UNIV OF SCI & TECH OF CHINA

Compiling method and compiler

Inactive | CN101452394A | Take advantage of parallelism | Lighten the programming burden | Concurrent instruction execution | Memory systems | Parallel computing | Loop unrolling

The invention relates to a compilation method and a compiler. The compilation method comprises: identifying a loop containing first instructions, wherein the loop has a definite control parameter and contains no branch instructions, and none of the first instructions has a loop-carried dependence; counting the first instructions and second instructions in the loop, and calculating, according to the execution capability of a first instruction execution unit and a second instruction execution unit, the loop unrolling factor and the number of iterations in which the first instructions are to be converted into second instructions; and unrolling the loop when the loop unrolling factor is not equal to one, and converting the first instructions in the unrolled loop into the corresponding second instructions according to that number of iterations. The compilation method and the compiler make full use of the parallelism of the instruction execution units in a processor, increasing program execution efficiency and reducing the programming burden on the user.

Owner:JIANGNAN INST OF COMPUTING TECH

High-throughput SHA-1 (Secure Hash Algorithm) based on FPGA

Inactive | CN106100825A | Reduce clock cycles | Calculation speed | Encryption apparatus with shift registers/memories | Processor register | Critical path method

The invention provides a high-throughput SHA-1 (Secure Hash Algorithm) implementation based on an FPGA. The method comprises the steps of: S1, judging whether the length of the input message data exceeds 512 bits; S2, if the length of the input message data exceeds 512 bits, padding the message data until its length is an integer multiple of 512 bits; S3, segmenting the padded message data into multiple 512-bit data blocks, and segmenting each data block into 16 words of 32 bits each; S4, performing loop unrolling on the original iteration formula to form a loop-unrolled structure; S5, determining the number of pipeline stages, and forming a pipeline structure from intermediate registers and the loop-unrolled structure; and S6, feeding each word into the pipeline structure to obtain the SHA-1 result. The iteration operation is simplified and an intermediate variable is added, so the critical path is shortened and the calculation speed is improved. Moreover, pipelined processing increases the amount of data processed and improves throughput.

Owner:SHENZHEN FORWARD IND CO LTD
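Steps S2 and S3 of the abstract above correspond to the standard SHA-1 message preparation, which can be sketched in software as follows. The function names are assumptions, the padding rule shown is the one from the SHA-1 standard rather than anything specific to the patent, and the loop-unrolled pipeline of steps S4 to S6 is not modeled.

```python
import struct


def pad_message(message: bytes) -> bytes:
    """Pad to a multiple of 64 bytes (512 bits) as SHA-1 requires."""
    bit_length = len(message) * 8
    padded = message + b"\x80"  # a single 1 bit followed by zeros
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    return padded + struct.pack(">Q", bit_length)  # 64-bit big-endian length


def split_blocks(padded: bytes):
    """Split into 512-bit blocks, each as sixteen 32-bit big-endian words."""
    blocks = []
    for offset in range(0, len(padded), 64):
        words = struct.unpack(">16I", padded[offset:offset + 64])
        blocks.append(list(words))
    return blocks
```

Each inner list is the 16-word input that a SHA-1 compression stage, hardware or software, would consume for one block.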

Unrolling loops with partial hot traces

Inactive | US7120907B2 | Software engineering | Digital computer details | Parallel computing | Loop unrolling

Methods and apparatus are disclosed for improved loop unrolling by a compiler. A large class of loops exists for which effective loop unrolling has not previously been performed because they are too large to be completely unrolled, but which do not have a single hot trace that covers an entire loop iteration. The present invention recognizes such loops that have partial hot traces identified using profile data. A set of instructions which constitute a proper superset of the hot trace and a proper subset of the entire loop, and which forms a complete loop iteration is identified. This set of instructions can then be unrolled without unrolling the entire loop.

Owner:INT BUSINESS MASCH CORP

Multi-rate multi-code length LDPC code decoding method based on SIMD instruction set

Active | CN108365849A | Realize online statistics | Reduce storage | Error correction/detection using multiple parity bits | Code conversion | Coding decoding | Computer module

The invention provides a multi-rate, multi-code-length LDPC code decoding method based on a SIMD instruction set. The method comprises the following steps: gathering check matrix information online from an external configuration file, exploiting the quasi-cyclic structure of the LDPC base matrix; and, using a fixed-point layered decoding scheme, having the decoder construct a dedicated check node computing unit for each row weight and select the check node computing unit according to the row weight, wherein loop unrolling is adopted inside the check node computing unit. Online collection of the check matrix information is achieved, and the storage required by the multi-rate, multi-code-length LDPC decoder is reduced; compared with existing algorithms, the dependence on precomputed matrix statistics is eliminated and the implementation complexity of the decoder is reduced; compared with existing algorithms, there is no loss of speed, and because the online computing module can be modified, the method is general.

Owner:SOUTHEAST UNIV

Optimization of floating point complex vector summation based on BWDSP chips

Active | CN107357552A | Improve execution efficiency | Eliminate pause | Machine execution arrangements | Floating point | Euclidean vector

The invention belongs to the field of low-level function optimization for digital signal processors and discloses an optimization of floating-point complex vector summation based on the high-performance general-purpose signal processor BWDSP chip. The floating-point complex vector summation is the sum of a first floating-point complex vector and a second floating-point complex vector, computed as a loop of repeated floating-point complex additions. Each floating-point complex addition includes: instruction-level parallel optimization based on the BWDSP chip, in which one instruction controls more than one operation unit to execute the same operation simultaneously; loop-based optimization, in which the same loop body code is replicated several times within one loop; and software-pipelining optimization, in which multiple executions of the same loop code are interleaved in parallel. The hardware resources of the BWDSP chip can thus be fully utilized to obtain efficient low-level functions.

Owner:XIDIAN UNIV +1

Constant-temperature instruction level self-testing method for testing time delay faults in inner heating manner

Inactive | CN104699578A | Achieve internal heating | Valid test | Detecting faulty computer hardware | Software testing/debugging | Time delays | Loop unrolling

The invention relates to a constant-temperature instruction-level self-testing method that tests time delay faults by internal heating, in which a processor is tested for time delay faults at high temperature. The method comprises the following steps: an original instruction-level self-testing program module is obtained; the module is transformed by loop unrolling; the module is transformed on the basis of cache misses; a feasible schedule is obtained within a set test temperature interval using a constant-temperature test program scheduling algorithm; and the processor is heated to the lower bound of the test temperature interval, the corresponding programs are executed according to the feasible schedule, and internally heated constant-temperature tests are performed for the time delay faults. Compared with the prior art, the method can test time delay faults effectively under high-temperature conditions, guarantees high fault coverage, and reduces wear on the processor.

Owner:TONGJI UNIV

Method and apparatus for efficiently processing array operation in computer system

Active | US8024717B2 | Effective calculation | Reduce overhead | Software engineering | Digital data processing details | Multi dimensional | Loop unrolling

An apparatus and a method for processing an array in a loop in a computer system, including: applying loop unrolling to a multi-dimensional array included in a loop based on a predetermined unrolling factor to generate a plurality of unrolled multi-dimensional arrays; and transforming each of the plurality of unrolled multi-dimensional arrays into a one-dimensional array having an array subscript expression in a form of an affine function with respect to a loop counter variable.

Owner:SAMSUNG ELECTRONICS CO LTD
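The two steps in the abstract above, unrolling a loop over a multi-dimensional array and rewriting each access as a one-dimensional array with an affine subscript, can be sketched as follows. The array shape, the unroll factor of 2, and the function names are assumptions for illustration only.

```python
ROWS, COLS = 4, 6  # assumed shape; COLS must be even for the factor-2 unroll


def scale_2d(grid, factor):
    """Original form: a two-dimensional array indexed as grid[i][j]."""
    for i in range(ROWS):
        for j in range(COLS):
            grid[i][j] *= factor


def scale_flat_unrolled(flat, factor):
    """Unrolled by 2 over j, with one-dimensional affine subscripts."""
    for i in range(ROWS):
        for j in range(0, COLS, 2):
            # Each subscript has the form COLS*i + j + c, an affine
            # function of the loop counters i and j.
            flat[COLS * i + j] *= factor
            flat[COLS * i + j + 1] *= factor
```

After unrolling, the two accesses differ only by a constant offset, which is the kind of regularity that makes address computation cheap.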

Incremental Loop Modification for LDPC Encoding

Active | US20160352458A1 | Error prevention | Error correction/detection using multiple parity bits | Parallel computing | Mobile device

Techniques are disclosed relating to encoding communications. In some embodiments, for different rows of an encoding matrix, the following operations are performed: generate a set of operations for entries in the row, where the set of operations includes respective operations to be performed on the entries for multiplication of the matrix by a vector, propagate values of entries in the encoding matrix into the set of operations, and simplify ones of the set of operations based on the propagated values to generate an output set of operations. In some embodiments, the output sets of operations are usable to encode input data for communication over a medium. In some embodiments, the disclosed techniques facilitate loop unrolling within compiler memory constraints. In some embodiments, an apparatus (e.g., a mobile device) is configured with the output sets of operations.

Owner:NATIONAL INSTRUMENTS

Successive approximation register (SAR) analog to digital converter (ADC) with partial loop-unrolling

Active | US10469096B1 | Improve area efficiency | Increase power | Electric signal transmission systems | Analogue-digital converters | Digital down converter | Analog-to-digital converter

A receiver system that includes an ADC for converting analog values to digital representations. A digital representation is a sum of discrete values some of which are non-binary scaled and the other are binary scaled. The ADC includes dedicated comparators to determine whether to add or to subtract the non-binary scaled values. A comparator is used to determine whether to add or to subtract the binary scaled values. The ADC further calibrates offset voltages of the comparators to substantially remove dead zone and conversion errors, without compromising the conversion speed. The calibration can be performed both in foreground and background.

Owner:INPHI

Hardware-based data prefetching based on loop-unrolled instructions

ActiveUS20190347103A1Memory architecture accessing/allocationInstruction analysisParallel computingComputer science

Prefetching data by determining that a first set of instructions that is processed by a computer processor indicates that a second set of instructions includes multiple iteration groups, where each of the iteration groups includes one or more loop-unrolled instructions, monitoring the second set of instructions as the second set of instructions is processed by the computer processor after the first set of instructions is processed by the computer processor, mapping a corresponding one of the loop-unrolled instructions in each of the iteration groups of the second set of instructions to a stride-tracking record that is shared by the corresponding loop-unrolled instructions, and prefetching data into a cache memory of the computer processor based on the stride-tracking record.
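
A minimal software model may help show why sharing one stride-tracking record across corresponding loop-unrolled instructions matters. The class, the function, and the trace format are hypothetical illustrations, not the patented hardware design.

```python
class StrideRecord:
    """Tracks the last address and stride seen by the instructions mapped to it."""

    def __init__(self):
        self.last_addr = None
        self.stride = None

    def observe(self, addr):
        """Return a prefetch address once the same stride is seen twice in a row."""
        prefetch = None
        if self.last_addr is not None:
            new_stride = addr - self.last_addr
            if new_stride == self.stride:
                prefetch = addr + new_stride
            self.stride = new_stride
        self.last_addr = addr
        return prefetch


def simulate(trace, group_size, shared):
    """trace: (instruction_slot, address) pairs in program order.

    With shared=True, the instruction at the same position in every iteration
    group maps to one record (slot % group_size). With shared=False, every
    instruction slot gets its own record.
    """
    records = {}
    prefetches = []
    for slot, addr in trace:
        key = slot % group_size if shared else slot
        hit = records.setdefault(key, StrideRecord()).observe(addr)
        if hit is not None:
            prefetches.append(hit)
    return prefetches


# A loop unrolled 4 times with one load per iteration group, 8-byte elements.
TRACE = [(i % 4, 8 * i) for i in range(6)]
```

With the shared record the tracker sees the addresses 0, 8, 16, ... as one stream and starts prefetching at the third access; with per-instruction records each slot sees too few accesses in this short trace to confirm a stride.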

Owner:IBM CORP

Data processing method, device and equipment and computer storage medium

PendingCN114385180AImprove compilation optimization effectImprove versatilityNeural architecturesNeural learning methodsComputer hardwareScheduling instructions

The invention provides a data processing method, apparatus, and device, and a computer storage medium. The method comprises: obtaining an intermediate representation of a deep learning model; determining a loop unrolling factor, where the factor is related to information about the execution of the intermediate representation on the back-end hardware device and/or device information of the back-end hardware device; performing loop unrolling on the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation; and compiling the optimized intermediate representation to obtain target code that the back-end hardware device can execute, so that the device executes the target code to perform its function. In embodiments of the invention, calculating the loop unrolling factor from the execution information and/or the device information yields a more accurate factor. Unrolling the intermediate representation with this factor lets instruction scheduling operate over a larger range and improves the portability of the intermediate representation.
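
The unrolling step itself can be sketched generically as follows. The abstract does not say how the factor is computed from device information, so the factor is simply a parameter here; the function names are hypothetical.

```python
def unroll_schedule(n, factor):
    """Split iterations 0..n-1 into unrolled groups plus a remainder loop."""
    main_end = n - n % factor
    groups = [list(range(i, i + factor)) for i in range(0, main_end, factor)]
    remainder = list(range(main_end, n))
    return groups, remainder


def run_unrolled(body, n, factor):
    """Execute body(i) for every i, grouped the way an unrolled loop would be.

    Each group stands in for `factor` copies of the loop body laid out back
    to back, which gives an instruction scheduler a larger window to work in.
    """
    groups, remainder = unroll_schedule(n, factor)
    for group in groups:
        for i in group:
            body(i)
    for i in remainder:
        body(i)
```

For `n = 10` and `factor = 4`, the schedule has two unrolled groups covering iterations 0 to 7 and a remainder loop for iterations 8 and 9.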

Owner:PHYTIUM TECH CO LTD

Method for performing loop unrolled decision feedback equalization in an electronic device

ActiveCN105391660AAccurate trackingAccurate recoveryTransmitter/receiver shaping networksControl signalEqualization

A method for performing loop unrolled decision feedback equalization (DFE) and an associated apparatus are provided. The method includes: receiving a tap control signal and an offset control signal from a digital domain of a DFE receiver in an electronic device, and generating DFE information respectively corresponding to the tap control signal and the offset control signal in an analog domain of the DFE receiver; broadcasting the DFE information respectively corresponding to the tap control signal and the offset control signal toward comparators in the DFE receiver; utilizing the comparators to perform comparison operations according to the DFE information respectively corresponding to the tap control signal and the offset control signal to generate comparison results; and selectively adjusting the tap control signal and the offset control signal according to the comparison results, to optimize the DFE information respectively corresponding to the tap control signal and the offset control signal, respectively. The method and the apparatus can adaptively adjust equalization information input to the comparators in the receiver such as the DFE receiver.

Owner:MEDIATEK INC

An accelerator structure and loop unrolling method for binarized neural network

ActiveCN111797977BNeural architecturesEnergy efficient computingHardware structureHardware acceleration

The invention discloses an accelerator structure and a loop unrolling method for binarized neural networks. Targeting a hardware accelerator with 1-bit weights and n-bit feature values, the invention comprises the hardware structure design of the accelerator, a loop unrolling structure optimized for binarized neural networks, and a storage order for weights and feature values in SRAM. The hardware structure includes SRAM for storing weights and feature values, a dedicated convolution calculation module, and an adder tree unit. The dedicated convolution module implements a new convolution calculation method, and the adder tree keeps the data flowing through the pipeline. Combining the loop unrolling method with an accumulator makes the accelerator highly scalable: the block size K can be chosen freely according to network complexity and hardware resources without changing the control logic of the circuit. To complement this loop unrolling method, the invention also proposes a storage order for weights and feature values that simplifies the access logic.
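
The interplay of 1-bit weights, n-bit feature values, and a freely chosen block size K can be sketched in software as below. The weight encoding (bit 1 means +1, bit 0 means -1) is an assumption; the abstract does not specify it, and the function names are hypothetical.

```python
def binary_dot(features, weight_bits):
    """One block: add the feature when the weight bit is 1, subtract when 0.

    This stands in for the dedicated convolution module plus adder tree;
    no multiplier is needed because weights are assumed to be +1 or -1.
    """
    return sum(f if w else -f for f, w in zip(features, weight_bits))


def blocked_binary_dot(features, weight_bits, k):
    """Accumulate block results of size k; the loop structure is the same for any k."""
    acc = 0  # the accumulator that combines partial sums across blocks
    for start in range(0, len(features), k):
        acc += binary_dot(features[start:start + k], weight_bits[start:start + k])
    return acc
```

Changing `k` changes only how much work each block does, not the control flow, which mirrors the scalability claim in the abstract.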

Owner:XI AN JIAOTONG UNIV +1

Successive approximation register (SAR) analog to digital converter (ADC) with partial loop-unrolling

ActiveUS10454491B1Improve area efficiencyIncrease powerElectric signal transmission systemsParallel/series conversionDigital down converterBinary scaling

A receiver system includes an ADC for converting analog values to digital representations. A digital representation is a sum of discrete values, some of which are non-binary scaled and the others binary scaled. The ADC includes dedicated comparators to determine whether to add or to subtract the non-binary scaled values. A single comparator is used to determine whether to add or to subtract the binary scaled values. The ADC further calibrates the offset voltages of the comparators to substantially remove dead zones and conversion errors without compromising conversion speed. The calibration can be performed in both foreground and background.

Owner:INPHI

A fixed-temperature command-level self-test method for detecting time-delay faults by internal temperature rise

InactiveCN104699578BAchieve internal heatingValid testDetecting faulty computer hardwareSoftware testing/debuggingFault coverageTest procedures

The invention relates to a fixed-temperature instruction-level self-test method for detecting delay faults by means of internal temperature rise. A high-temperature delay test is performed on a processor in the following steps: obtaining the original instruction-level self-test program module; transforming the original instruction-level self-test program module by loop unrolling; transforming the original instruction-level self-test program module so that it triggers cache misses; within the set test temperature range, using a fixed-temperature test program scheduling algorithm to obtain a feasible schedule; and heating the processor to the lower bound of the test temperature range, then executing the corresponding programs according to the feasible schedule to carry out an internally heated fixed-temperature test for delay faults. Compared with the prior art, the invention can effectively test delay faults under high-temperature conditions, ensures high fault coverage, and reduces processor wear.

Owner:TONGJI UNIV

A method to reduce register overflow caused by fine-grained randomization security optimization

ActiveCN109240699BReduce overflowImprove loop optimizationCode compilationAlgorithmLoop optimization

The invention discloses a method for reducing register overflow caused by fine-grained randomization security optimization, and relates to the technical field of compiler loop optimization. First, the variables held in registers within the loop body are classified into loop invariants, loop induction variables, and loop variants. After classification, the variables in the loop-body registers are identified. Finally, the loop unrolling factor is derived from the numbers of loop invariants, loop induction variables, and loop variants identified in the loop-body registers. The proposed register-pressure-sensitive loop unrolling method can improve the effect of loop optimization to some extent and reduce the occurrence of register overflow. In addition, hot code, which generally consists of loop bodies, is more sensitive to the performance overhead introduced by randomization, so improving loop unrolling can also improve the performance of fine-grained randomization security optimization.
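
The abstract states that the unrolling factor is derived from the three variable counts but gives no formula. The sketch below is therefore a hypothetical heuristic, not the patented calculation: it assumes loop invariants and induction variables are shared by all unrolled copies, while each copy needs its own registers for loop variants.

```python
def unroll_factor(num_regs, invariants, induction_vars, variants, max_factor=8):
    """Hypothetical register-pressure-sensitive choice of unrolling factor.

    Registers left after the shared values (invariants, induction variables)
    are divided among the per-copy loop variants. The result is clamped to
    [1, max_factor] so the loop is never unrolled past the register budget.
    """
    free = num_regs - invariants - induction_vars
    if variants <= 0:
        return max_factor if free >= 0 else 1
    return max(1, min(max_factor, free // variants))
```

For instance, with 16 registers, 3 invariants, 1 induction variable, and 4 variants per copy, the heuristic allows 3 copies; with 10 invariants, 2 induction variables, and 5 variants it falls back to 1, that is, no unrolling.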

Owner:广东中科实数科技有限公司

AES Encryption/Decryption Circuit Based on Data Redundancy Real-time Error Detection Mechanism

ActiveCN104158652BReduce areaAvoid transmissionError preventionEncryption apparatus with shift registers/memoriesTime errorCircuit reliability

The invention discloses a loop-unrolled AES encryption/decryption circuit based on a data-redundancy real-time error detection mechanism, used to resist fault injection attacks or to improve circuit reliability in extreme application environments. The circuit comprises two parts: an AES encryption/decryption unit and a detection unit. The AES encryption/decryption unit adopts the loop-unrolled structure and is formed by Nk round transformation units and a 2-to-1 selector; the detection unit is composed of Nk comparators. During data processing, the AES encryption/decryption unit applies data redundancy: two adjacent round transformation units perform the same operation on each group of data, so each operation is carried out twice. The comparators in the detection unit compare the results of the two operations. If the results are the same, the AES encryption/decryption unit is working normally; if they differ, an error has occurred in the unit. Compared with the conventional structural-redundancy error detection mechanism, the data-redundancy mechanism greatly reduces circuit area.
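
The compare-two-computations idea can be shown with a toy software model. The round function below is a stand-in and is not AES; the fault-injection parameter and all names are hypothetical, and the hardware pipelining of adjacent round units is not modeled.

```python
def toy_round(state, round_key):
    """Stand-in for one round transformation. This is NOT real AES."""
    return ((state ^ round_key) * 0x9E3779B1 + 1) & 0xFFFFFFFF


def redundant_round(state, round_key, unit_a, unit_b):
    """Two round units perform the same operation; a comparator checks the results."""
    a = unit_a(state, round_key)
    b = unit_b(state, round_key)
    return a, a == b


def encrypt(state, round_keys, faulty_round=None):
    """Run all rounds; return (final_state, ok), where ok is False on any mismatch."""
    ok = True
    for i, rk in enumerate(round_keys):
        unit_b = toy_round
        if i == faulty_round:
            # Simulate an injected fault: flip one bit in the second unit's output.
            unit_b = lambda s, k: toy_round(s, k) ^ 1
        state, match = redundant_round(state, rk, toy_round, unit_b)
        ok = ok and match
    return state, ok
```

A fault that disturbs only one of the two computations makes the comparator inputs differ, so the error flag is raised in the round where it occurs.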

Owner:NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

Complex matrix optimizing method

InactiveCN102722472BImprove computing powerImprove computing efficiencyComplex mathematical operationsCache accessGranularity

The invention discloses a complex matrix optimizing method comprising the following steps. First, the unrolling granularity specific to the Godson architecture is calculated, four-by-four loop unrolling is applied to the complex matrix, and the maximum feasible value is selected as the matrix block size nb, giving the ideal block size on Godson; here nb is smaller than 52, and 24 times the square of nb is smaller than the 64-kilobyte capacity of the Godson processor's level-1 data cache. Second, the matrices in a matrix multiplication are divided and combined by exploiting the continuity and locality of data storage, reducing the number of accesses to Godson's level-1 data cache. Third, the complex additions and multiplications common in complex matrix operations are carried out using the classic algorithm for multiplying two complex numbers, reducing the amount of computation. As a result, the performance of complex matrix multiplication on Godson improves by about 50%, and the speed of BLAS (Basic Linear Algebra Subprograms) based on the Godson 3A increases by more than 1.5 times.
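
The cache constraint on the block size can be checked with a few lines of arithmetic. The sketch assumes that 64 kilobytes means 65,536 bytes and takes the factor 24 from the abstract as given; the function name is hypothetical.

```python
L1_DCACHE_BYTES = 64 * 1024  # assumed: 64 KB = 65,536 bytes


def max_block_size(cache_bytes=L1_DCACHE_BYTES, bytes_factor=24):
    """Largest nb such that bytes_factor * nb**2 stays below cache_bytes."""
    nb = 0
    while bytes_factor * (nb + 1) ** 2 < cache_bytes:
        nb += 1
    return nb
```

Under these assumptions the inequality holds up to nb = 52 (24 x 52² = 64,896 bytes) and fails at nb = 53 (67,416 bytes). The abstract's separate bound on nb is not applied in this sketch.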

Owner:UNIV OF SCI & TECH OF CHINA
