Processors, components, devices, and methods for filtering processing of IIR filters
By configuring the input mode of the IIR filter processor to broadcast or copy mode, parallel multiplication and accumulation operations of the coefficient matrix are achieved, solving the problem of low computational efficiency of IIR filters in video and image data processing and achieving a significant acceleration effect.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BESTECHNIC SHANGHAI CO LTD
- Filing Date
- 2022-12-21
- Publication Date
- 2026-06-26
AI Technical Summary
Existing IIR filters are computationally inefficient and offer little speedup when processing massive amounts of multimedia data, such as video and images. In particular, there is room for improvement in computational efficiency when using the arm_biquad_cascade_df1_fast_q15 function of CMSIS.
An IIR filter processor is provided, including a configuration register, a general-purpose register, and a matrix multiplication and accumulation unit. By configuring the input mode to broadcast or copy mode, it can realize parallel multiplication and accumulation operations of the coefficient matrix to meet the computational requirements of q15 type.
It significantly shortens the IIR filtering operation time by about 25% compared to the CMSIS method, thus improving computational efficiency.
Figure CN115913176B_ABST
Abstract
Description
Technical Field
[0001] This application relates to filters and filtering processes in the field of wireless communication, and more specifically, to processors, components, devices, and methods for filtering processing of IIR filters.
Background Technology
[0002] In the field of digital signal processing, IIR filters, compared with FIR filters, have the disadvantage of poor phase characteristics. However, they are simple in structure, require less computation, and are economical and efficient. Furthermore, they can achieve high selectivity with a lower order, thus gaining widespread application. The general expression of the difference equation for an IIR filter is shown in equation (a) below:
[0003] $$y(n) = \sum_{i=0}^{N} b_i\,x(n-i) - \sum_{i=1}^{N} a_i\,y(n-i) \qquad \text{(a)}$$
[0004] Where x(n) is the input sequence, y(n) is the output sequence, $a_i$ and $b_i$ are the filter coefficients, and N is the order of the IIR filter. IIR filters have an infinitely long unit impulse response, and their structure contains a feedback loop, i.e., it is recursive. That is, the output of the IIR filter depends not only on the inputs at past time points but also on the outputs at past time points, as can be seen from formula (a) above. This dependency leads to low computational efficiency in signal filtering.
[0005] The industry generally uses functions such as `arm_biquad_cascade_df1_fast_q15` from CMSIS (the ARM Cortex™ Microcontroller Software Interface Standard) for IIR filter processing. This function uses cascaded second-order sections (biquads). Specifically, `arm_biquad_cascade_df1_fast_q15` arranges the filter coefficients in a certain order, multiplies the corresponding coefficients in sequence, and then accumulates the products. However, the resulting speedup is insufficient. Especially in applications that require processing massive amounts of multimedia data such as video and images, the speedup is not significant, and there is room for improvement in computational efficiency.
Summary of the Invention
[0006] This application addresses the aforementioned deficiencies in the prior art. There is a need for a processor, component, device, and method for filtering IIR filters, which can significantly reduce the time consumed by IIR filtering operations compared to CMSIS and existing IIR filtering calculation methods, providing a substantial acceleration effect in application scenarios requiring the processing of massive multimedia data such as video and images.
[0007] According to a first aspect of this application, a processor for filtering processing of an IIR filter is provided. The processor includes a first configuration register, a second configuration register, at least one general-purpose register, and a matrix multiplication and accumulation unit. The second configuration register is used to configure the output data readout mode. The first configuration register is used to configure the data type of the arithmetic logic units, including the matrix multiplication and accumulation unit, and to configure the input mode of input data to the matrix multiplication and accumulation unit as either a copy mode or a broadcast mode. The at least one general-purpose register is configured to sequentially read and store the coefficients of each row of the coefficient matrix of each order of the IIR filter. The matrix multiplication and accumulation unit is configured to perform the following for the same input vector, taking each row of coefficients in turn as the current row coefficients: when the first configuration register configures the input mode as broadcast mode, obtain the single input element of the input vector that corresponds to the current row coefficients; when the first configuration register configures the input mode as copy mode, copy the stored current row coefficients; multiply the current row coefficients in parallel by the corresponding input element to obtain the product for the current row; and successively accumulate the products of all rows to obtain the final output values, so that each complete multiplication-accumulation operation yields the output variables at four or more sequential sampling times. This satisfies the calculation requirements when the data type of the IIR filter coefficients is q15.
[0008] According to a second aspect of this application, an IIR filter component is provided, which includes a processor according to various embodiments of this application. The processor includes a first configuration register, a second configuration register, at least one general-purpose register, and a matrix multiplication accumulation unit. The second configuration register is used to configure the output data readout mode. The first configuration register is used to configure the data type of the arithmetic logic units, including the matrix multiplication accumulation unit, and to configure the input mode of input data to the matrix multiplication accumulation unit as a copy mode or a broadcast mode. The at least one general-purpose register is configured to sequentially read and store the row coefficients of the coefficient matrix of each order of the IIR filter. The matrix multiplication accumulation unit is configured to perform the following for the same input vector, taking each row of coefficients in turn as the current row coefficients: when the first configuration register configures the input mode as broadcast mode, obtain the single input element of the input vector that corresponds to the current row coefficients; when the first configuration register configures the input mode as copy mode, copy the stored current row coefficients; multiply the current row coefficients in parallel by the corresponding input element to obtain the product for the current row; and successively accumulate the products of all rows to obtain the final output values, so that each complete multiplication-accumulation operation yields the output variables at four or more sequential sampling times, which meets the calculation requirements when the data type of the IIR filter coefficients is q15.
[0009] According to a third aspect of this application, a smart portable device with an IIR filter assembly is provided. The smart portable device includes an NPU configured to process multimedia data including at least one of audio, video, and images. The smart portable device also includes a processor according to various embodiments of this application, serving as an NPU coprocessor. The processor includes a first configuration register, a second configuration register, at least one general-purpose register, and a matrix multiplication and accumulation unit. The second configuration register is used to configure the output data readout method. The first configuration register is used to configure the data type of the arithmetic logic unit, including the matrix multiplication and accumulation unit, and to configure the input mode of input data to the matrix multiplication and accumulation unit as a copy mode or a broadcast mode. The at least one general-purpose register is configured to sequentially read and store the row coefficients of the coefficient matrix of each order of the IIR filter. The matrix multiplication accumulation unit is configured as follows: for the same input vector, each row coefficient is used as the current row coefficient. 
When the first configuration register configures the input mode as broadcast mode, the single input element of the input vector that corresponds to the current row coefficients is obtained; when the first configuration register configures the input mode as copy mode, the stored current row coefficients are copied; the current row coefficients are multiplied in parallel by the corresponding input element to obtain the product for the current row; and the products of all rows are successively accumulated to obtain the final output values, so that each complete multiplication-accumulation operation yields the output variables at four or more sequential sampling times, which meets the calculation requirements when the data type of the IIR filter coefficients is q15.
[0010] According to a fourth aspect of this application, a filtering method for an IIR filter is provided. The filtering method includes the following steps: determining the coefficient matrix of each order of the IIR filter; sequentially reading and storing the coefficients of each row of the coefficient matrix of each order of the IIR filter; using a matrix multiplication accumulation unit, for the same input vector and taking each row of coefficients in turn as the current row coefficients, obtaining the single input element of the input vector that corresponds to the current row coefficients, and multiplying the current row coefficients in parallel by the corresponding input element to obtain the product for the current row; and successively accumulating the products of all rows to obtain the final output values, such that each complete multiplication-accumulation operation yields the output variables at four or more sequential sampling times, which satisfies the calculation requirements when the data type of the IIR filter coefficients is q15.
[0011] The processors, components, devices, and methods for filtering processing of IIR filters provided in the various embodiments of this application pre-expand the coefficient matrix row by row. With the configuration of the matrix multiplication accumulation unit, at least four dot products can be calculated at once and accumulated directly, which meets the calculation requirements when the data type of the IIR filter coefficients is q15. In application scenarios where smart portable devices need to process massive multimedia data such as video and images, this can shorten the IIR filtering operation time by about 25% compared to CMSIS and existing IIR filtering calculation methods, providing a sufficient acceleration effect.
Description of the Drawings
[0012] The features, advantages, and technical and industrial significance of exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, wherein like reference numerals denote like elements, and wherein:
[0013] Figure 1 shows a structural diagram of a processor for filtering processing of an IIR filter according to an embodiment of this application;
[0014] Figure 2 shows a flowchart of the arithmetic processing performed by the matrix multiplication accumulation unit according to an embodiment of this application;
[0015] Figure 3 shows a flowchart of an accelerated filtering method for an IIR filter executed by a processor according to an embodiment of this application; and
[0016] Figure 4 shows a flowchart of a filtering method for an IIR filter according to an embodiment of this application.
Detailed Description
[0017] To enable those skilled in the art to better understand the technical solutions of this application, the embodiments of this application are described in detail below with reference to the accompanying drawings and specific examples, which are not intended to limit the scope of this application.
[0018] The terms "first," "second," and similar terms used in this application do not indicate any order, quantity, or importance, but are merely used for distinction. Terms such as "comprising" or "including" mean that the element preceding the term encompasses the elements listed after it, without excluding the possibility of encompassing other elements. The arrows in the accompanying drawings are not intended to limit the execution order of the steps connected by the arrows. The execution order of the steps in various embodiments may also be different from the order shown by the arrows, as long as the logical consistency of the steps is not contradictory.
[0019] Figure 1 shows a structural diagram of a processor for filtering processing of an IIR filter according to an embodiment of this application. As shown in Figure 1, the processor 100 includes a first configuration register 101, a second configuration register 102, at least one general-purpose register 103, and a matrix multiplication and accumulation unit 104.
[0020] For example, the first configuration register 101 and the second configuration register 102 can be two separate global configuration registers used to control the operation of the processor 100. The first configuration register 101 is mainly used to configure the data type of the arithmetic logic unit (ALU), including the matrix multiplication accumulator unit 104. An arithmetic logic unit (ALU) is a combinational logic circuit capable of implementing multiple sets of arithmetic and logical operations. The matrix multiplication accumulator unit 104 in this application, as an ALU, is implemented as a combinational logic circuit capable of performing matrix multiplication and accumulation addition operations. Specifically, the matrix multiplication accumulator unit 104 can be formed as an array of basic arithmetic units, multipliers and accumulators, used to perform matrix multiplication operations, that is, multiple multiplication operations in parallel, adding the product of the matrix multiplication to the value of the accumulator, and then storing it in the accumulator.
[0021] The second configuration register 102 is used to configure the output data reading method, such as how many bits the output data is packed into.
[0022] The first configuration register 101 is also used to configure the input mode of the input data to the matrix multiplication accumulator unit 104 as either copy mode or broadcast mode. The at least one general-purpose register 103 is used to store the written data, such as the row coefficients of the coefficient matrix of each order of the IIR filter, etc., as described below. Note that the input vector is not stored in the first configuration register 101, but is directly broadcast to the matrix multiplication accumulator unit 104 for parallel multiplication operations.
[0023] The at least one general-purpose register 103 is configured to sequentially read and store the coefficients of each row of the coefficient matrix of each order of the IIR filter. For example, as shown in Figure 2, the matrix multiplication accumulation unit 104 is configured to perform the following steps for each input vector in order to complete the full multiplication between the input vector and the coefficient matrix. Specifically, for the same input vector, step 201 is performed with each row of coefficients taken in turn as the current row coefficients. In step 201, when the first configuration register 101 configures the input mode as broadcast mode, the matrix multiplication accumulation unit 104 obtains, by receiving the broadcast, the single input element of the input vector that corresponds to the current row coefficients, as described in detail below. Note that the at least one general-purpose register 103 can preset two values. The first value corresponds to mode A, the broadcast mode, and can also be called the A-mode parameter. The second value corresponds to mode B, the copy mode, and can also be called the B-mode parameter.
[0024] In step 202, the matrix multiplication accumulation unit 104 copies the stored current row coefficients when the first configuration register 101 configures the input mode as copy mode. In step 203, the matrix multiplication accumulation unit 104 multiplies the current row coefficients in parallel by their corresponding input element to obtain the product for the current row. In step 204, the matrix multiplication accumulation unit 104 successively accumulates the products of all rows to obtain the final output values, so that each complete multiplication-accumulation operation yields the output variables at four or more sequential sampling times. This meets the calculation requirements of the Q15 data type, which is widely used for the filter coefficients of IIR filters in many scenarios.
[0025] Specifically, the first configuration register 101 can use the A mode parameter to control the input mode of each element of the input vector to the matrix multiplication accumulation unit 104 to be broadcast mode, that is, the values in the connected matrix multiplication accumulation unit 104 are all obtained by broadcasting the input elements in the same input vector; the B mode parameter can be used to control the input mode of the coefficient matrix to the matrix multiplication accumulation unit 104 to be copy mode, that is, the data stored in the at least one general-purpose register 103 is directly copied to the connected matrix multiplication accumulation unit 104.
[0026] First, the first row of coefficients in the coefficient matrix is read by at least one general-purpose register 103. By setting the A-mode parameter and the B-mode parameter in the first configuration register 101 respectively, the first element of the same input vector can be broadcast into the matrix multiplication accumulator unit 104, and the first row of coefficients can be copied into the matrix multiplication accumulator unit 104. Parallel matrix multiplication is performed to obtain the first product corresponding to the first row of coefficients, which is then stored in the accumulator. Then, the second row of coefficients in the coefficient matrix is read by at least one general-purpose register 103 (which may be the same general-purpose register 103). By setting the A-mode parameter and the B-mode parameter in the first configuration register 101 respectively, the next element of the same input vector (as the common multiplier of the second row of coefficients) can be broadcast into the matrix multiplication accumulator unit 104, and the second row of coefficients can be copied into the matrix multiplication accumulator unit 104. Parallel matrix multiplication is performed to obtain the second product corresponding to the second row of coefficients, which is then added to the current value of the accumulator and stored in the accumulator. This process is repeated sequentially until at least one general-purpose register 103 has retrieved the coefficients of all rows of the coefficient matrix. The product results corresponding to the coefficients of each row are accumulated at the corresponding positions in the matrix multiplication accumulation unit 104. The final accumulated result is then read out, which yields the final output variable. The above operations and interactions of the various components in the processor 100 can be implemented using assembly instructions.
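The row-by-row sequence described above can be sketched in software. The following Python is an illustrative model only, not the assembly instructions of the processor 100: the function name `mac_rowwise` is hypothetical, the inner loop stands in for multiplications that the hardware would perform in parallel, and plain Python numbers are used instead of a fixed-point data type.

```python
def mac_rowwise(input_vector, coeff_rows):
    """Accumulate input_vector x coeff_matrix one coefficient row at a time."""
    lanes = len(coeff_rows[0])
    acc = [0] * lanes                      # one accumulator per output lane
    for element, row in zip(input_vector, coeff_rows):
        broadcast = [element] * lanes      # mode A: one input element on every lane
        copied = list(row)                 # mode B: row coefficients copied from the register
        for lane in range(lanes):          # sequential stand-in for the parallel multiplies
            acc[lane] += broadcast[lane] * copied[lane]
    return acc                             # read out once all rows have been consumed


if __name__ == "__main__":
    # three input elements, three coefficient rows, two output lanes
    print(mac_rowwise([1, 2, 3], [[1, 0], [0, 1], [2, 2]]))
```

Each pass of the outer loop corresponds to one row of the coefficient matrix: one input element is broadcast, one row is copied, and the products are added into the accumulators at the matching positions.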
[0027] The coefficient matrix of the IIR filter is expanded into rows of coefficients, and a general-purpose register 103 stores one row of coefficients at a time. The matrix multiplication and accumulation unit 104 reads in the row of coefficients and the common input variable at high speed and performs parallel multiplication and accumulation on them, so that multiple multiplications are calculated at once. For example, with the Q15 data type, four multiplications can be calculated at once, which can significantly accelerate the filtering process of the IIR filter: the at least one general-purpose register is 64 bits wide and can therefore read and store four 16-bit numbers at once, which is exactly the four coefficients of one row. Each complete multiplication and accumulation operation then yields the output variables at four sequential sampling times.
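To make the register arithmetic concrete, the sketch below packs one row of four Q15 coefficients into a single 64-bit word using Python's standard `struct` module. The helper names and the little-endian lane order are assumptions made for illustration; the source does not specify how the lanes are laid out inside the general-purpose register 103.

```python
import struct


def float_to_q15(value):
    """Quantize a value in [-1, 1) to a signed 16-bit Q15 integer, with saturation."""
    scaled = int(round(value * 32768))
    return max(-32768, min(32767, scaled))


def pack_q15_row(row):
    """Pack four Q15 values into one 64-bit word (lane order is an assumption)."""
    if len(row) != 4:
        raise ValueError("a 64-bit register holds exactly four 16-bit values")
    return struct.unpack("<Q", struct.pack("<4h", *row))[0]


def unpack_q15_row(word):
    """Recover the four signed 16-bit lanes from a 64-bit word."""
    return list(struct.unpack("<4h", struct.pack("<Q", word)))


if __name__ == "__main__":
    row = [float_to_q15(v) for v in (0.5, -0.25, 0.125, -1.0)]
    word = pack_q15_row(row)
    print(hex(word), unpack_q15_row(word))
```

The same 64-bit width would hold eight 8-bit values, which is why a narrower data type allows more lanes per row.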
[0028] Specifically, the filtering process of the IIR filter is improved by using the inputs $x_{m-1}$, $x_{m-2}$ and the outputs $y_{m-1}$, $y_{m-2}$ at the previous two time steps (m-1 and m-2 for the current time step m), together with the inputs $x_m \sim x_{m+n-1}$ at the n time steps starting from the current time, to express the n outputs $y_m \sim y_{m+n-1}$ starting from the current time. The corresponding coefficient matrix is calculated, and this coefficient matrix can be used directly to calculate each subsequent group of n outputs within the same order, so that each complete multiplication and accumulation operation yields the output variables at n sequential sampling times.
[0029] In some embodiments, the coefficient matrix is the same for each complete multiplication-accumulation operation of the same order, that is, for each input variable of the same order, and can be calculated using the following formula (1):
[0030] $y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} - a_1 y_{n-1} - a_2 y_{n-2}$ Formula (1)
[0031] Each complete multiplication-accumulation operation yields the output variables at n sequential sampling times. In formula (1), $x_n$, $x_{n-1}$, and $x_{n-2}$ are the input variables at times n, n-1, and n-2, respectively; $y_n$, $y_{n-1}$, and $y_{n-2}$ are the output variables at times n, n-1, and n-2, respectively; and b0, b1, b2, a1, and a2 are the filter coefficients of that order. The coefficient matrix is calculated through the following steps: the values of y0 and y1 are obtained by setting to 0 the terms of formula (1) whose variables do not exist (for example, $x_{-1}$ and $y_{-1}$ when n = 0), and they are then substituted into formula (1) for successive calculations. This yields the mapping matrix, expressed in the filter coefficients b0, b1, b2, a1, and a2, from the total of n+4 input variables y0, y1, and x0 to $x_{n+1}$ to the output variables y2 to $y_{n+1}$ at n time points (i.e., the mapping matrix at the current time m=2), which is used as the coefficient matrix. This coefficient matrix applies to the calculation of the output variables at all times after the current time m=2 within the same order, until all output variables of that order (e.g., N output variables) have been calculated, before proceeding to the multiplication and accumulation of the next order. The same coefficient matrix is used throughout, and only the input vector changes from one iteration to the next. This is because, within one order, the filter coefficients are fixed, so the mapping from the input and output variables of the previous two times to the output variables of the current and subsequent times stays the same as time advances. This simple and efficient design significantly improves the acceleration effect.
[0032] Specifically, for the current time m=2, the current operation is the first complete multiplication and accumulation operation of the n output variables, which can be represented by formula (2):
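[0033] $$\begin{bmatrix} y_0 & y_1 & x_0 & x_1 & \cdots & x_{n+1} \end{bmatrix} \cdot \mathbf{C} = \begin{bmatrix} y_2 & y_3 & \cdots & y_{n+1} \end{bmatrix}$$ Formula (2)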
[0034] The first term on the left side of the equals sign is the same input vector used throughout this multiplication-accumulation operation, with dimensions [1, n+4]. The second term on the left side of the equals sign is the coefficient matrix, with dimensions [n+4, n]. Each row of the coefficient matrix is multiplied by its corresponding input element of the same input vector, in the following order: y0 (first row of coefficients), y1 (second row), x0 (third row), x1 (fourth row), and so on. The right side of the equals sign is the final output vector of one complete multiplication-accumulation operation, with dimensions [1, n]. With the current time m = 2 and n = 4, which corresponds to the Q15 data type, the final output vector of one complete multiplication-accumulation operation is {y2 y3 y4 y5}, representing the outputs at times 2, 3, 4, and 5. This reduces the computation time by approximately 25% compared to CMSIS and existing IIR filtering methods.
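One way to obtain such a coefficient matrix in software is to treat y0, y1, and x0 to x_{n+1} as basis variables and unroll formula (1) symbolically. The sketch below is an illustration under stated assumptions, not the procedure claimed in the source: the names `build_coeff_matrix` and `apply_block` are hypothetical, and floating point stands in for the Q15 arithmetic.

```python
def build_coeff_matrix(b0, b1, b2, a1, a2, n=4):
    """Return the (n+4) x n matrix mapping [y0, y1, x0..x_{n+1}] to [y2..y_{n+1}].

    Basis index 0 is y0, index 1 is y1, and index 2 + j is x_j.
    """
    size = n + 4
    prev2 = [1.0 if i == 0 else 0.0 for i in range(size)]   # y0 as a basis vector
    prev1 = [1.0 if i == 1 else 0.0 for i in range(size)]   # y1 as a basis vector
    columns = []
    for k in range(2, n + 2):                               # outputs y2 .. y_{n+1}
        # feedback part: -a1 * y_{k-1} - a2 * y_{k-2}, already expressed in the basis
        col = [-a1 * p1 - a2 * p2 for p1, p2 in zip(prev1, prev2)]
        col[2 + k] += b0                                    # x_k
        col[1 + k] += b1                                    # x_{k-1}
        col[k] += b2                                        # x_{k-2}
        columns.append(col)
        prev2, prev1 = prev1, col
    # transpose: row i holds the coefficients that multiply input element i
    return [[columns[j][i] for j in range(n)] for i in range(size)]


def apply_block(input_vector, matrix):
    """Row vector times matrix: one complete multiplication-accumulation operation."""
    lanes = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(input_vector, matrix)) for j in range(lanes)]


if __name__ == "__main__":
    C = build_coeff_matrix(0.5, 0.25, 0.125, -0.5, 0.25, n=4)
    for row in C:
        print(row)
```

Because every entry depends only on b0, b1, b2, a1, and a2, the matrix can be computed once per order and reused for every block of n outputs.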
[0035] In some embodiments, the IIR filter assembly may include a processor 100 according to various embodiments of this application to efficiently and rapidly perform IIR filtering operations of the IIR filter assembly.
[0036] In some embodiments, the processor 100 can be widely used in various smart portable devices with IIR filter components. These smart portable devices may include, but are not limited to, at least one of headphones, watches, speakers, glasses, and helmets, including an NPU (embedded neural network processor) employing a data-driven parallel computing architecture configured to process multimedia data including at least one of audio, video, and images, to meet the increasing demands of users for processing massive amounts of multimedia data such as audio, video, and images, and for processing efficiency. In some embodiments, these smart portable devices can assist users in freely processing massive amounts of multimedia data and various interactive data in VR (virtual reality) and AR (augmented reality) environments to achieve an immersive and realistic experience. In these smart portable devices, which typically run bus-constrained systems, the processor 100 of the various embodiments of this application can act as an NPU coprocessor to efficiently and quickly handle various parallel multiplication and accumulation workloads.
[0037] In some embodiments, the data type includes Q15 format and Q7 format. When the data type of the IIR filter coefficients is Q15 format, each complete multiplication-accumulation operation yields output variables at four sequential sampling times, while when the data type is Q7 format, each complete multiplication-accumulation operation yields output variables at eight sequential sampling times. The at least one general-purpose register 103 is 64 bits. This storage capacity is sufficient for each parallel multiplication-accumulation operation for Q15 format, and also sufficient for each parallel multiplication-accumulation operation for IIR filter coefficients with Q7 data type. In some embodiments, the matrix multiplication-accumulation unit includes at least four accumulators to perform at least four dot products in parallel, thereby meeting the acceleration requirements of the Q15 format data type.
[0038] In some embodiments, the at least one general-purpose register includes eight 64-bit general-purpose registers, and the matrix multiplication accumulation unit includes 64 accumulators to enable parallel computation of up to 64 dot products. This provides the processor 100 with ample storage space and parallel multiplication accumulation operations, allowing it to perform other parallel multiplication accumulation operations besides IIR filter processing, or filtering of other data types (e.g., Q7 format requiring parallel execution of eight or more dot product accumulation operations), and to store other operating parameters.
[0039] In some embodiments, different orders have independent and different coefficient matrices, and the coefficient matrix of each order is calculated based on the filter coefficients of that order.
[0040] In some embodiments, the coefficient matrix is the same for each complete multiplication accumulation operation of the same order, and is calculated using the formula (1):
[0041] $y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} - a_1 y_{n-1} - a_2 y_{n-2}$ Formula (1)
[0042] Each complete multiplication-accumulation operation yields the output variables at n sequential sampling times. In formula (1), $x_n$, $x_{n-1}$, and $x_{n-2}$ are the input variables at times n, n-1, and n-2, respectively; $y_n$, $y_{n-1}$, and $y_{n-2}$ are the output variables at times n, n-1, and n-2, respectively; and b0, b1, b2, a1, and a2 are the filter coefficients of that order. The coefficient matrix is calculated through the following steps: the values of y0 and y1 are obtained by setting to 0 the terms of formula (1) whose variables do not exist (for example, $x_{-1}$ and $y_{-1}$ when n = 0), and they are then substituted into formula (1) for successive calculations. This yields the mapping matrix, expressed in the filter coefficients b0, b1, b2, a1, and a2, from the total of n+4 input variables y0, y1, and x0 to $x_{n+1}$ to the output variables y2 to $y_{n+1}$ at n time points, which is used as the coefficient matrix.
[0043] In some embodiments, for multi-order IIR filtering, the matrix multiplication accumulation unit is further configured to, for the current order: based on the input variables $x_{N-2}$, $x_{N-1}$ at the last two time points N-2 and N-1 of the previous order and the output variables $y_{N-2}$, $y_{N-1}$ at the last two time points of the previous order, take the output variables y0 to $y_{N-1}$ obtained from the previous-order calculation as the input variables x0 to $x_{N-1}$ of the current order, and perform the multiplication and accumulation calculation using the coefficient matrix of the current order, so that each complete multiplication and accumulation operation yields the output variables at n sequential sampling times, where N is the total number of sampling times of the previous order.
[0044] The multi-order IIR filtering process is explained in detail below with reference to Figure 3.
[0045] In step 301, the output variables y0 and y1 can be obtained using formula (1). Specifically, for each order, b0, b1, b2, a1, and a2 are the filter coefficients of that order, and the filter coefficients of different orders differ. Accordingly, the output variables y0 and y1 are obtained for each order from the filter coefficients of that order, so that the coefficient matrices of all orders can be pre-calculated in step 302. For ease of description, the filter coefficients of every order are written uniformly as b0, b1, b2, a1, and a2, although their values can differ between orders. The terms of formula (1) whose variables do not exist can be set to 0. Let n = 0, then y0 = b0x0. Further, let n = 1, then y1 = b0x1 + b1x0 - a1y0.
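The start-up values in step 301 follow from evaluating formula (1) with an all-zero history. A minimal Python sketch, in which `biquad_direct` is a name chosen here for illustration and floating point stands in for the Q15 arithmetic:

```python
def biquad_direct(x, b0, b1, b2, a1, a2):
    """Evaluate formula (1) sample by sample, treating missing past samples as 0."""
    x1 = x2 = y1 = y2 = 0.0            # x[n-1], x[n-2], y[n-1], y[n-2]
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn                # shift the input history
        y2, y1 = y1, yn                # shift the output history
        out.append(yn)
    return out


if __name__ == "__main__":
    y = biquad_direct([1.0, 2.0], 0.5, 0.25, 0.125, -0.5, 0.25)
    # y[0] equals b0*x0 and y[1] equals b0*x1 + b1*x0 - a1*y0
    print(y)
```

This sample-by-sample recursion is also the sequential baseline against which a block-wise matrix formulation can be checked.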
[0046] In step 302, the filter coefficients b0, b1, b2, a1, and a2 of each order can be used to express the n outputs y2 to $y_{n+1}$ in terms of the inputs x0 to $x_{n+1}$ and the outputs y0 and y1, so that the coefficient matrix of this order can be pre-calculated. Similarly, the coefficient matrices of all orders can be pre-calculated.
[0047] According to formula (1), the calculation expression for y2 can be obtained as shown in formula (3) below:
[0048] $y_2 = b_0 x_2 + b_1 x_1 + b_2 x_0 - a_1 y_1 - a_2 y_0$ Formula (3)
[0049] According to formula (1), the calculation expression for y3 can be obtained as shown in formula (4) below:
[0050] y3 = b0x3 + b1x2 + b2x1 – a1y2 – a2y1 Formula (4)
[0051] Substituting formula (3) into formula (4), we obtain the calculation expression for y3 in terms of the input variables x0 to x3, the output variables y0 and y1, and the filter coefficients b0, b1, b2, a1, and a2, as shown in formula (5) below:
[0052] y3 = b0x3 + b1x2 + b2x1 – a1(b0x2 + b1x1 + b2x0 – a1y1 – a2y0) – a2y1 Formula (5)
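Collecting the terms of formula (5) per variable gives the weights that each of x0 to x3, y0, and y1 contributes to y3. The following sketch checks this grouping against the recursion of formulas (3) and (4), using exact rational arithmetic; the numeric values are arbitrary assumptions.

```python
from fractions import Fraction

# Illustrative (hypothetical) coefficients and inputs, kept exact with Fraction.
b0, b1, b2, a1, a2 = (Fraction(p, 8) for p in (4, 2, 1, -4, 2))
x0, x1, x2, x3 = (Fraction(v) for v in (1, 2, -3, 5))

y0 = b0 * x0
y1 = b0 * x1 + b1 * x0 - a1 * y0
y2 = b0 * x2 + b1 * x1 + b2 * x0 - a1 * y1 - a2 * y0         # formula (3)
y3_recursive = b0 * x3 + b1 * x2 + b2 * x1 - a1 * y2 - a2 * y1  # formula (4)

# Formula (5) with like terms collected: one weight per variable.
y3_grouped = (b0 * x3
              + (b1 - a1 * b0) * x2
              + (b2 - a1 * b1) * x1
              - a1 * b2 * x0
              + (a1 * a1 - a2) * y1
              + a1 * a2 * y0)

print(y3_recursive == y3_grouped)
```

Each weight depends only on b0, b1, b2, a1, and a2, which is why the expression can be precomputed once per order.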
[0053] Thus, for this order, the expression for y_t in terms of the input variables x0 to x_t, the output variables y0 and y1, and the filter coefficients b0, b1, b2, a1, and a2 is substituted into y_{t+1} = b0x_{t+1} + b1x_t + b2x_{t-1} – a1y_t – a2y_{t-1}, for t = 2, 3, ..., n. This finally yields expressions for the output values y2 to y_{n+1} at n time points in terms of a total of n+4 variables, namely y0, y1, and x0 to x_{n+1}, and the filter coefficients b0, b1, b2, a1, and a2. These expressions can be rearranged into a coefficient matrix expressed in the filter coefficients b0, b1, b2, a1, and a2. For example, returning to formula (2), the second term on the left side of the equal sign is the coefficient matrix with dimensions [n+4, n], and each element of this coefficient matrix is expressed only in terms of the filter coefficients b0, b1, b2, a1, and a2.
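The substitution described above can be sketched as follows. The function names and the ordering of the n+4 variables as [y0, y1, x0, ..., x_{n+1}] are assumptions made for this illustration; the layout used in formula (2) may differ.

```python
from fractions import Fraction


def coefficient_matrix(b0, b1, b2, a1, a2, n):
    """Return an (n+4) x n matrix C with [y0, y1, x0..x_{n+1}] . C = [y2..y_{n+1}]."""
    dim = n + 4

    def unit(i):
        return [1 if k == i else 0 for k in range(dim)]

    # Each y_t is stored as its weights on the n+4 variables (assumed order above).
    y = [unit(0), unit(1)]
    for t in range(2, n + 2):
        xt, xt1, xt2 = unit(2 + t), unit(1 + t), unit(t)  # x_t, x_{t-1}, x_{t-2}
        y.append([b0 * xt[k] + b1 * xt1[k] + b2 * xt2[k]
                  - a1 * y[t - 1][k] - a2 * y[t - 2][k] for k in range(dim)])
    # Rows index the variables, columns index the outputs y2..y_{n+1}.
    return [[y[t][k] for t in range(2, n + 2)] for k in range(dim)]


def direct_biquad(b0, b1, b2, a1, a2, xs):
    """Plain sample-by-sample evaluation of formula (1), used as a reference."""
    ys = []
    for t, xt in enumerate(xs):
        x1 = xs[t - 1] if t >= 1 else 0
        x2 = xs[t - 2] if t >= 2 else 0
        y1 = ys[t - 1] if t >= 1 else 0
        y2 = ys[t - 2] if t >= 2 else 0
        ys.append(b0 * xt + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2)
    return ys


n = 4
coeffs = tuple(Fraction(p, 8) for p in (4, 2, 1, -4, 2))  # b0, b1, b2, a1, a2 (illustrative)
xs = [Fraction(v) for v in (1, 2, -3, 5, 0, 7)]            # x0 .. x_{n+1}
ys = direct_biquad(*coeffs, xs)
C = coefficient_matrix(*coeffs, n)
v = [ys[0], ys[1]] + xs                                    # the n+4 variables
block = [sum(v[k] * C[k][j] for k in range(n + 4)) for j in range(n)]
print(block == ys[2:])
```

In this sketch the second column of `C` holds the grouped weights of formula (5), and the matrix product reproduces y2 to y_{n+1} exactly.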
[0054] In step 303, the coefficient matrix of the first order is sequentially input into a general-purpose register, and the row coefficients in the general-purpose register are multiplied in parallel with the corresponding variables until the multiplication with all variables x0 to x_{n+1}, y0, and y1 is completed. The coefficient matrix remains unchanged. The inputs x_n and x_{n+1} and the outputs y_n and y_{n+1} from the last two time steps of the previous calculation are retained and, together with the inputs from the next n time steps, form an array of length n+4. Iterative calculations are performed in this manner until all input data has been taken and the calculation is completed.
[0055] For example, the output variables at n time points in the next round can be calculated according to formula (6):
[0056]
[0057] Note that the second term on the left side of the equals sign in formula (6) is the coefficient matrix. This coefficient matrix is the same as the coefficient matrix used in formula (2) for calculating the output variables at n times in the previous round (first round), which significantly speeds up the calculation.
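The reuse of one coefficient matrix across rounds can be sketched as below. This is an illustration under stated assumptions, not the application's implementation: the function names, the variable ordering [y, y, x, ..., x], and the restriction that the input length equals 2 + m·n are all choices made for the example.

```python
from fractions import Fraction


def coefficient_matrix(b0, b1, b2, a1, a2, n):
    """(n+4) x n matrix mapping [y0, y1, x0..x_{n+1}] to [y2..y_{n+1}] (assumed layout)."""
    dim = n + 4
    unit = lambda i: [1 if k == i else 0 for k in range(dim)]
    y = [unit(0), unit(1)]
    for t in range(2, n + 2):
        xt, xt1, xt2 = unit(2 + t), unit(1 + t), unit(t)
        y.append([b0 * xt[k] + b1 * xt1[k] + b2 * xt2[k]
                  - a1 * y[t - 1][k] - a2 * y[t - 2][k] for k in range(dim)])
    return [[y[t][k] for t in range(2, n + 2)] for k in range(dim)]


def block_filter(coeffs, xs, n):
    """Filter xs in rounds of n outputs; assumes len(xs) == 2 + m*n for an integer m >= 0."""
    b0, b1, b2, a1, a2 = coeffs
    C = coefficient_matrix(b0, b1, b2, a1, a2, n)  # computed once, reused in every round
    y0 = b0 * xs[0]
    ys = [y0, b0 * xs[1] + b1 * xs[0] - a1 * y0]
    pos = 2                                        # index of the next output to compute
    while pos + n <= len(xs):
        # Retained last two outputs and inputs, followed by the next n inputs: length n+4.
        v = ys[pos - 2:pos] + xs[pos - 2:pos + n]
        ys += [sum(v[k] * C[k][j] for k in range(n + 4)) for j in range(n)]
        pos += n
    return ys


def direct_filter(coeffs, xs):
    """Sample-by-sample evaluation of formula (1), used as a reference."""
    b0, b1, b2, a1, a2 = coeffs
    ys = []
    for t, xt in enumerate(xs):
        x1 = xs[t - 1] if t >= 1 else 0
        x2 = xs[t - 2] if t >= 2 else 0
        y1 = ys[t - 1] if t >= 1 else 0
        y2 = ys[t - 2] if t >= 2 else 0
        ys.append(b0 * xt + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2)
    return ys


coeffs = tuple(Fraction(p, 8) for p in (4, 2, 1, -4, 2))     # illustrative values
xs = [Fraction(v) for v in range(-7, 7)]                      # 14 samples = 2 + 3*4
print(block_filter(coeffs, xs, 4) == direct_filter(coeffs, xs))
```

Because the recursion is time-invariant, shifting the window by n samples leaves the weights unchanged, so only the input array differs between rounds.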
[0058] In step 304, for multi-order IIR filtering, the inputs x_{N-2} and x_{N-1} and the outputs y_{N-2} and y_{N-1} at the last two time points of the previous order are retained, and the outputs y0 to y_{N-1} obtained from the previous order's calculation are taken as the inputs x0 to x_{N-1} of the next order. Using the coefficient matrix of that order, the calculation continues according to the previous steps until the calculation of that order is completed. The calculation then proceeds in the same way to the next order, until the filtering processing operations of all orders are completed.
[0059] Figure 4 shows a flowchart illustrating a filtering method for an IIR filter according to an embodiment of this application. As shown in Figure 4, in step 401, the coefficient matrices of each order of the IIR filter are determined.
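The chaining of orders can be sketched as follows, with each order using its own coefficient matrix and the outputs of one order serving as the inputs of the next. The function names and numeric coefficients are hypothetical. The sketch processes one whole buffer per order and does not model state carried between successive buffers.

```python
from fractions import Fraction


def coefficient_matrix(b0, b1, b2, a1, a2, n):
    """(n+4) x n matrix mapping [y0, y1, x0..x_{n+1}] to [y2..y_{n+1}] (assumed layout)."""
    dim = n + 4
    unit = lambda i: [1 if k == i else 0 for k in range(dim)]
    y = [unit(0), unit(1)]
    for t in range(2, n + 2):
        xt, xt1, xt2 = unit(2 + t), unit(1 + t), unit(t)
        y.append([b0 * xt[k] + b1 * xt1[k] + b2 * xt2[k]
                  - a1 * y[t - 1][k] - a2 * y[t - 2][k] for k in range(dim)])
    return [[y[t][k] for t in range(2, n + 2)] for k in range(dim)]


def filter_one_order(coeffs, xs, n):
    """Block-wise filtering of one order; assumes len(xs) == 2 + m*n."""
    b0, b1, b2, a1, a2 = coeffs
    C = coefficient_matrix(b0, b1, b2, a1, a2, n)
    y0 = b0 * xs[0]
    ys = [y0, b0 * xs[1] + b1 * xs[0] - a1 * y0]
    pos = 2
    while pos + n <= len(xs):
        v = ys[pos - 2:pos] + xs[pos - 2:pos + n]
        ys += [sum(v[k] * C[k][j] for k in range(n + 4)) for j in range(n)]
        pos += n
    return ys


def cascade(orders, xs, n):
    """Run every order in turn; the outputs of one order are the inputs of the next."""
    signal = xs
    for coeffs in orders:
        signal = filter_one_order(coeffs, signal, n)
    return signal


def reference_cascade(orders, xs):
    """Sample-by-sample cascade of formula (1), used only to check the sketch."""
    signal = xs
    for b0, b1, b2, a1, a2 in orders:
        ys = []
        for t, xt in enumerate(signal):
            x1 = signal[t - 1] if t >= 1 else 0
            x2 = signal[t - 2] if t >= 2 else 0
            y1 = ys[t - 1] if t >= 1 else 0
            y2 = ys[t - 2] if t >= 2 else 0
            ys.append(b0 * xt + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2)
        signal = ys
    return signal


orders = [tuple(Fraction(p, 8) for p in (4, 2, 1, -4, 2)),   # order 1 (illustrative)
          tuple(Fraction(p, 8) for p in (2, -1, 3, 1, -2))]  # order 2 (illustrative)
xs = [Fraction(v) for v in range(10)]                         # 10 samples = 2 + 2*4
print(cascade(orders, xs, 4) == reference_cascade(orders, xs))
```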
[0060] In step 402, the coefficients of each row of the coefficient matrix of each order of the IIR filter are read and stored sequentially.
[0061] In step 403, using the matrix multiplication accumulation unit, for the same input vector, each row coefficient is used as the current row coefficient, and a single input element in the input vector corresponding to the current row coefficient is obtained. The current row coefficient is multiplied in parallel with the corresponding input element to obtain the corresponding product of the current row coefficient.
[0062] In step 404, the matrix multiplication accumulation unit is used to successively accumulate the product results of the coefficients in each row to obtain the final output values, so that each complete multiplication accumulation operation yields the output variables at no fewer than 4 sequential sampling times. This can meet the calculation requirements of the Q15 data type, which is widely used for the filter coefficients of IIR filters in many scenarios.
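Steps 403 and 404 can be illustrated with a software model of the broadcast input mode. The model assumes four 16-bit lanes per row (a 64-bit register holding Q15 coefficients) and wide accumulators. Rounding, saturation, and the copy mode are not modelled, and the names `to_q15` and `mac_broadcast` are hypothetical.

```python
LANES = 4  # assumed: one 64-bit register holds four 16-bit Q15 row coefficients


def to_q15(value):
    """Convert a real number in [-1, 1) to a Q15 integer (no saturation in this sketch)."""
    return int(round(value * (1 << 15)))


def mac_broadcast(rows_q15, vec_q15):
    """Row-by-row multiply-accumulate: one input element is broadcast to every lane."""
    acc = [0] * LANES                      # wide accumulators, one per output sample
    for row, element in zip(rows_q15, vec_q15):
        broadcast = [element] * LANES      # broadcast mode: same element in all lanes
        for lane in range(LANES):          # the hardware would do these lanes in parallel
            acc[lane] += row[lane] * broadcast[lane]  # Q15 * Q15 gives Q30
    return [a >> 15 for a in acc]          # scale Q30 back to Q15


# Two illustrative rows and the two matching input elements.
rows = [[to_q15(0.5)] * LANES, [to_q15(0.25)] * LANES]
vec = [to_q15(0.5), to_q15(0.5)]
out = mac_broadcast(rows, vec)
print(out)  # [12288, 12288, 12288, 12288], i.e. 0.375 in Q15
```

After all rows have been processed, each accumulator holds the dot product of the input vector with one column of the coefficient matrix, so one pass yields four output samples.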
[0063] With the above filtering method, the coefficient matrix of each order is calculated in advance and the same coefficient matrix is applied to all filtering operations of that order. The coefficient matrix is expanded row by row. With the configuration of the matrix multiplication accumulation unit, at least 4 dot products can be calculated at once and directly accumulated. In application scenarios that require processing massive multimedia data such as video and images, this reduces the time consumed by IIR filtering operations by about 25% compared with the existing IIR filtering calculation methods of CMSIS, providing a sufficient acceleration effect.
[0064] The various steps of the processing procedure described in conjunction with the processor, components, apparatus and device according to the various embodiments of this application may also be used, independently or in combination, as embodiments of the configuration method, and will not be described in detail here.
[0065] Furthermore, although exemplary embodiments have been described herein, their scope includes any and all embodiments based on this application that have equivalent elements, modifications, omissions, combinations (e.g., schemes involving intersections of various embodiments), adaptations, or alterations. Elements in the claims will be interpreted broadly based on the language used in the claims and are not limited to the examples described in this specification or during the implementation of this application, which will be interpreted as non-exclusive. Therefore, this specification and examples are intended to be considered illustrative only, and the true scope and spirit are indicated by the full scope of the following claims and their equivalents.
[0066] The above description is intended to be illustrative and not restrictive. For example, the above examples (or one or more of them) can be used in combination with each other. Other embodiments may be used by those skilled in the art upon reading the above description. Furthermore, in the above detailed description, various features may be grouped together to simplify the application. This should not be construed as an intention that a feature of an unclaimed application is necessary for any claim. Rather, the subject matter of this application may be less than all the features of an embodiment of a particular application. Thus, the claims are incorporated herein by reference as examples or embodiments, wherein each claim is independently considered as a separate embodiment, and these embodiments are contemplated as being able to be combined with each other in various combinations or arrangements. The scope of the invention should be determined by reference to the appended claims and the full scope of their equivalents.
[0067] The above embodiments are merely exemplary embodiments of this application and are not intended to limit the present invention. The scope of protection of the present invention is defined by the claims. Those skilled in the art can make various modifications or equivalent substitutions to the present invention within the spirit and scope of this application, and such modifications or equivalent substitutions should also be considered to fall within the scope of protection of the present invention.
Claims
1. A processor for filtering processing of IIR filters, characterized in that, it includes a first configuration register, a second configuration register, at least one general-purpose register, and a matrix multiplication accumulation unit; the second configuration register is used to configure the output data reading method; the first configuration register is used to configure the data type of the arithmetic logic unit, including the matrix multiplication accumulation unit, and to configure the input mode of the input data to the matrix multiplication accumulation unit as either copy mode or broadcast mode; the at least one general-purpose register is configured to: sequentially read and store the coefficients of each row of the coefficient matrix of each order of the IIR filter; the matrix multiplication accumulation unit is configured to, for the same input vector: take the coefficients of each row as the current row coefficients; when the input mode is configured as broadcast mode in the first configuration register, obtain a single input element in the input vector corresponding to the current row coefficients; when the input mode is configured as copy mode in the first configuration register, copy the stored current row coefficients; multiply the current row coefficients in parallel with the corresponding input element to obtain the product result of the current row coefficients; and obtain the final output value by successively accumulating the product results of the coefficients in each row, so that each complete multiplication and accumulation operation yields the output variables at no fewer than 4 sequential sampling times; for each complete multiplication and accumulation operation of the same order, the coefficient matrix is the same and is calculated using the following formula (1): y_n = b0x_n + b1x_{n-1} + b2x_{n-2} – a1y_{n-1} – a2y_{n-2} Formula (1), wherein each complete multiplication-accumulation operation yields the output variables at n sequential sampling times; x_n, x_{n-1}, and x_{n-2} are the input variables at times n, n-1, and n-2, respectively; y_n, y_{n-1}, and y_{n-2} are the output variables at times n, n-1, and n-2, respectively; and b0, b1, b2, a1, and a2 are the filtering coefficients of that order; the coefficient matrix is calculated by the following steps: the values of y0 and y1 are obtained by setting the coefficients of the variables not appearing in formula (1) to 0, and these are then substituted into formula (1) for successive calculations to obtain the mapping matrix, expressed in the filter coefficients b0, b1, b2, a1, and a2, from a total of n+4 input variables, namely y0, y1, and x0 to x_{n+1}, to the output variables y2 to y_{n+1} at n time points, and this mapping matrix is used as the coefficient matrix.
2. The processor according to claim 1, characterized in that, The processor is an NPU coprocessor.
3. The processor according to claim 1, characterized in that, The at least one general-purpose register is 64 bits, and the data type includes Q15 format and Q7 format. When the data type of the filter coefficients of the IIR filter is Q15 format, each complete multiplication-accumulation operation yields the output variables at 4 sequential sampling times. When the data type of the filter coefficients of the IIR filter is Q7 format, each complete multiplication-accumulation operation yields the output variables at 8 sequential sampling times.
4. The processor according to claim 1, characterized in that, Different orders have independent and different coefficient matrices, and the coefficient matrix of each order is calculated based on the filter coefficients of that order.
5. The processor according to claim 1, characterized in that, for multi-order IIR filtering, the matrix multiplication accumulation unit is further configured such that, for the current order: based on the input variables x_{N-2} and x_{N-1} at the last two time points N-2 and N-1 of the previous order and the output variables y_{N-2} and y_{N-1} at the last two time points of the previous order, and taking the output variables y0 to y_{N-1} obtained from the previous order's calculation as the input variables x0 to x_{N-1} of the current order, the multiplication and accumulation calculation is performed using the coefficient matrix of the current order, so that each complete multiplication and accumulation operation yields the output variables at n sequential sampling times, where N is the total number of sampling times of the previous order.
6. The processor according to claim 1, characterized in that, The matrix multiplication accumulator unit includes at least four accumulators to perform at least four dot products in parallel.
7. An IIR filter assembly comprising the processor according to any one of claims 1-6.
8. A smart portable device with an IIR filter assembly, characterized in that, it includes: an NPU configured to process multimedia data including at least one of audio, video, and images; and the processor according to any one of claims 1-6, serving as an NPU coprocessor.
9. A filtering method for an IIR filter, characterized in that, it includes the following steps: determining the coefficient matrices of each order of the IIR filter; reading and storing the coefficients of each row of the coefficient matrix of each order of the IIR filter in sequence; using the matrix multiplication accumulation unit, for the same input vector, taking each row coefficient as the current row coefficient, obtaining a single input element in the input vector corresponding to the current row coefficient, and multiplying the current row coefficient in parallel with the corresponding input element to obtain the product result of the current row coefficient; and obtaining the final output value by successively accumulating the product results of the coefficients in each row, so that each complete multiplication and accumulation operation yields the output variables at no fewer than 4 sequential sampling times; for each complete multiplication and accumulation operation of the same order, the coefficient matrix is the same and is calculated using the following formula (1): y_n = b0x_n + b1x_{n-1} + b2x_{n-2} – a1y_{n-1} – a2y_{n-2} Formula (1), wherein each complete multiplication-accumulation operation yields the output variables at n sequential sampling times; x_n, x_{n-1}, and x_{n-2} are the input variables at times n, n-1, and n-2, respectively; y_n, y_{n-1}, and y_{n-2} are the output variables at times n, n-1, and n-2, respectively; and b0, b1, b2, a1, and a2 are the filtering coefficients of that order; the coefficient matrix is calculated by the following steps: the values of y0 and y1 are obtained by setting the coefficients of the variables not appearing in formula (1) to 0, and these are then substituted into formula (1) for successive calculations to obtain the mapping matrix, expressed in the filter coefficients b0, b1, b2, a1, and a2, from a total of n+4 input variables, namely y0, y1, and x0 to x_{n+1}, to the output variables y2 to y_{n+1} at n time points, and this mapping matrix is used as the coefficient matrix.