Spatial cache

By employing a non-linear cache design and utilizing row and column addressing units and multiplexers, the problem of low efficiency in accessing multidimensional objects in existing technologies is solved, enabling efficient and flexible access to two-dimensional data structures, which is suitable for image processing and analysis.

CN115698962B · Active · Publication Date: 2026-06-19 · IDEX ASA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
IDEX ASA
Filing Date
2021-06-11
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing semiconductor memories suffer from low access efficiency and slow speed when processing multidimensional objects, especially two-dimensional images or matrices, particularly when multiple points within a small distance from the center point need to be accessed. Current linear memory arrangements do not meet these requirements.

Method used

A high-speed cache that represents multidimensional objects in a non-linear manner supports parallel reading and writing of multiple rows and/or columns of the storage cell array through a combination of row addressing units and column addressing units. It utilizes multiplexers to achieve flexible data access and combines control and decoding circuits to convert virtual addresses into physical addresses.

Benefits of technology

It enables efficient access to multidimensional objects, improves reading flexibility and bandwidth, and is suitable for processing two-dimensional data structures such as images or matrices. In particular, it improves the efficiency and speed of data access in image processing and analysis.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN115698962B_ABST
Patent Text Reader

Abstract

A cache includes: a p × q array of memory units; a row addressing unit; and a column addressing unit. Each memory unit has an m × n array of memory cells. The column addressing unit has, for each memory unit, m n-to-1 multiplexers, one associated with each of the m rows of the memory unit, wherein each n-to-1 multiplexer has an input coupled to each of the n memory cells in the row associated with that multiplexer. The row addressing unit has, for each memory unit, n m-to-1 multiplexers, one associated with each of the n columns of the memory unit, wherein each m-to-1 multiplexer has an input coupled to each of the m memory cells in the column associated with that multiplexer. The row addressing unit and the column addressing unit support reads and/or writes to the array of memory units, for example using virtual or physical addresses.

Description

[0001] Cross-references to related applications

[0002] This application claims priority to U.S. Patent Application No. 16/910,813, filed June 24, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

[0003] Implementations relating to a specialized type of cache memory are disclosed.

Background Technology

[0004] Semiconductor memories (including caches) are linearly arranged and addressed. When processing multidimensional objects (such as two-dimensional images or matrices), these objects are "flattened," for example by concatenating rows one after another. Some types of processing algorithms need to access specific portions of the multidimensional object in patterns that do not fit well with this linear memory arrangement. For example, some processing may require accessing multiple points within a small distance of a center point, but because these points may be stored in locations that are far apart and irregularly spaced, current memory and cache access may be inefficient and slow, and may require many separate read operations to access the required data.

Summary of the Invention

[0005] Therefore, there is a need for improved caches, such as improved caches that can increase read flexibility and bandwidth when processing two-dimensional data structures such as images or matrices. Implementations provide caches capable of representing portions of multidimensional objects (e.g., two-dimensional images or matrices) in a non-linear manner, thereby allowing, for example, efficient access to nearby pixels of an image.
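The flattening problem described in the background can be made concrete with a small sketch. The image width and the center point below are illustrative assumptions, not values from the patent; the point is only that a compact 3×3 neighborhood maps to three short runs of addresses separated by almost a full image row.

```python
# Row-major ("flattened") addresses of a 3x3 neighborhood in an image.
# WIDTH and the center point are illustrative assumptions.
WIDTH = 640  # hypothetical image width in pixels, one byte per pixel


def flat_address(row, col, width=WIDTH):
    """Linear address of pixel (row, col) when rows are stored one after another."""
    return row * width + col


def neighborhood(center_row, center_col, radius=1):
    """Sorted flat addresses of all pixels within `radius` of the center."""
    return sorted(
        flat_address(r, c)
        for r in range(center_row - radius, center_row + radius + 1)
        for c in range(center_col - radius, center_col + radius + 1)
    )


addrs = neighborhood(100, 200)
span = addrs[-1] - addrs[0]
print(addrs)  # nine addresses in three runs of three
print(span)   # distance between the first and last byte touched
```

With these assumed numbers, nine neighboring pixels are spread over a span of two image rows plus two bytes of linear memory.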

[0006] According to a first aspect, a cache is provided. The cache includes a p (row) × q (column) array of memory units; a row addressing unit; and a column addressing unit. Each memory unit has an m (row) × n (column) array of memory cells. The column addressing unit has, for each memory unit, m n-to-1 multiplexers, one associated with each row of the memory unit, wherein each n-to-1 multiplexer has an input coupled to each of the n memory cells in the row associated with that multiplexer. The row addressing unit has, for each memory unit, n m-to-1 multiplexers, one associated with each column of the memory unit, wherein each m-to-1 multiplexer has an input coupled to each of the m memory cells in the column associated with that multiplexer. The row addressing unit and the column addressing unit support reading and/or writing of the array of memory units, so that multiple rows and/or columns of the array can be read and/or written in parallel.

[0007] In some implementations, m = n = 4, and each memory cell includes one byte, such that each memory unit includes 16 bytes, and where p = q = 8, such that the memory cell array includes 1024 bytes. In some implementations, row addressing units and column addressing units support reading and / or writing multiple rows and / or columns of memory cells in one or more memory units in a single clock cycle. In some implementations, row addressing units are capable of addressing up to p*m rows of memory cells across one or more memory cell arrays and reading any cell in each of the p*m rows, wherein no two such cells are in the same row.

[0008] In some implementations, the column addressing unit is capable of addressing up to q*n columns of memory cells across one or more of the memory cell array and reading any cell in each of the q*n columns, wherein no two such cells are in the same column. In some implementations, for each memory cell not in the first row of the memory cell array, the row addressing unit also has a 2-to-1 multiplexer having inputs coupled to the outputs of an n-to-1 multiplexer associated with each column of the memory cell and the outputs of an n-to-1 multiplexer associated with a memory cell in a previous row; and for each memory cell not in the first column of the memory cell array, the column addressing unit also has a 2-to-1 multiplexer having inputs coupled to the outputs of an m-to-1 multiplexer associated with each row of the memory cell and the outputs of an m-to-1 multiplexer associated with a memory cell in a previous column.

[0009] In some implementations, row addressing units and column addressing units each support reading from storage cells in the storage cell array, and the row addressing units support writing to storage cells in the storage cell array. In some implementations, only row addressing units support writing to storage cells in the storage cell array, such that column addressing units do not support writing to storage cells in the storage cell array. In some implementations, storage cells in a p×q storage cell array represent the smallest entity that can be represented by a virtual address.

[0010] In some implementations, each storage cell in a p×q storage cell array is the minimum addressable data size in the cache and has only a physical address within the storage cell. In some implementations, row addressing units have separate addresses for each of the q*n columns, and column addressing units have separate addresses for each of the p*m rows, such that the row and column addressing units support simultaneous reading and / or writing of up to p*m storage cells from different rows and up to q*n storage cells from different columns within the storage cell array and the array of storage cells within each storage cell.

[0011] In some embodiments, the cache further includes a load / store unit and control and decoding circuitry. The load / store unit is capable of filling some or all of the memory cells with a remote memory representing a two-dimensional data structure. The control and decoding circuitry is capable of translating a virtual address representing a portion of the two-dimensional data structure represented by the remote memory into control signals for directing row and column addressing units to access specific memory cells. In some embodiments, the control and decoding circuitry maintains an operand region with a virtual origin, such that the virtual origin serves as a reference point for an address template that includes multiple virtual addresses for the remote memory. The control and decoding circuitry is also capable of decoding the address template to determine the multiple virtual addresses. In some embodiments, the control and decoding circuitry is also capable of manipulating the virtual origin and, when the virtual origin is manipulated, instructing the load / store unit to initialize and / or update the memory cells by reading data from the remote memory.

[0012] According to a second aspect, a method for accessing a cache according to any embodiment of the first aspect is provided. The method includes: initializing a first plurality of memory cells with a remote memory representing a two-dimensional data structure; and accessing one or more memory cells within the first plurality of memory cells via row-addressing units and / or column-addressing units using a virtual address, the virtual address indicating a portion of a two-dimensional data structure represented by the contents of the respective memory cell.

[0013] In some embodiments, the method further includes converting a virtual address indicating a portion of a two-dimensional data structure into a physical address indicating a corresponding storage cell. In some embodiments, the method further includes forming a read control signal and sending the read control signal to the row addressing unit and/or the column addressing unit to read the contents of the corresponding storage cell. In some embodiments, accessing one or more storage cells within the first plurality of storage units via the row addressing unit and/or the column addressing unit using virtual addresses (where the virtual address indicates a portion of a two-dimensional data structure represented by the contents of the corresponding storage cell) includes: decoding an address template having a plurality of virtual addresses; and forming an operand vector with the contents of the storage cell corresponding to each of the plurality of virtual addresses.
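Decoding an address template and forming an operand vector might look like the following minimal sketch. It assumes, purely for illustration, that a template is a list of (dx, dy) offsets from the virtual origin and that the cache can be modeled as a dictionary keyed by virtual address. The names `decode_template` and `form_operand_vector` are hypothetical and do not come from the patent.

```python
# Hypothetical model: a template is a list of (dx, dy) offsets from the virtual origin.
def decode_template(origin, template):
    """Turn template offsets into absolute virtual addresses (x, y)."""
    ox, oy = origin
    return [(ox + dx, oy + dy) for dx, dy in template]


def form_operand_vector(cache, origin, template):
    """Read the cell behind each decoded virtual address, in template order."""
    return bytes(cache[addr] for addr in decode_template(origin, template))


# Toy cache contents: the byte at virtual address (x, y) is (x + 16*y) % 256.
cache = {(x, y): (x + 16 * y) % 256 for x in range(16) for y in range(16)}
cross = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # center plus four neighbors
vector = form_operand_vector(cache, origin=(5, 5), template=cross)
print(list(vector))
```

Because the template is expressed relative to the origin, the same template can be reused at every position simply by moving the origin.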

[0014] In some embodiments, the method further includes maintaining an operand region having a virtual origin, wherein the operand region includes storage cells representing a portion of a two-dimensional data structure. In some embodiments, the method further includes moving the virtual origin and the operand region associated with the virtual origin; and in response to moving the virtual origin and the operand region associated with the virtual origin, initializing a second plurality of storage cells in a remote memory representing a portion of the two-dimensional data structure, such that the second plurality of storage cells represent a portion of the two-dimensional data structure.

[0015] In some implementations, initializing, in response to moving the virtual origin and the operand region associated with the virtual origin, a second plurality of memory cells in the remote memory representing the two-dimensional data structure, such that the second plurality of memory cells represent a portion of the two-dimensional data structure, includes one of the following:
(1) in response to moving the virtual origin and the operand region associated with the virtual origin to the right, replacing the previous leftmost column of memory cells with the new rightmost column of memory cells, and reassigning the virtual address of the new column to the sum of the virtual address of the previous rightmost column and the width of a single memory cell;
(2) in response to moving the virtual origin and the operand region associated with the virtual origin to the left, replacing the previous leftmost column of memory cells with the new leftmost column of memory cells;
(3) in response to moving the virtual origin and the operand region associated with the virtual origin upwards, reassigning the virtual address of the new row to the virtual address of the previous rightmost column minus the width of a single storage cell;
(4) in response to moving the virtual origin and the operand region associated with the virtual origin downwards, reassigning the virtual address of the new row to the virtual address of the previous topmost row plus the height of a single storage cell;
(5) in response to moving the virtual origin and the operand region associated with the virtual origin downwards, reassigning the virtual address of the new row to the virtual address of the previous bottommost row minus the height of a single storage cell.
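Case (1), moving the origin to the right, can be sketched as a sliding window over column base addresses. This is a minimal illustration under stated assumptions: each unit column is tracked only by a hypothetical virtual x-address, the width of a unit is taken as 4 cells, and the actual load from remote memory is left out. The other cases are not sketched.

```python
from collections import deque

UNIT_WIDTH = 4  # assumed width of one unit, in cells
Q = 8           # assumed number of unit columns


def move_right(column_bases):
    """Recycle the leftmost unit column as the new rightmost one (case 1).

    `column_bases` holds the virtual x-address of each unit column, left to right.
    Returns the virtual address of the column that would need loading.
    """
    new_base = column_bases[-1] + UNIT_WIDTH  # previous rightmost + one unit width
    column_bases.popleft()                    # previous leftmost column is dropped
    column_bases.append(new_base)             # and reused as the new rightmost column
    return new_base


bases = deque(range(0, Q * UNIT_WIDTH, UNIT_WIDTH))  # 0, 4, ..., 28
loaded = move_right(bases)
print(list(bases), loaded)
```

Only one column changes per move, so the other columns keep their contents and their virtual addresses.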

[0016] In some implementations, only a subset of the storage cell array is used to store data corresponding to the two-dimensional data structure as part of processing the two-dimensional data structure, and the remainder of the storage cell array is used as temporary register space. In some implementations, the two-dimensional data structure includes image data. In some implementations, the two-dimensional data structure includes a matrix.

Description of the Drawings

[0017] The accompanying drawings, which are included herein and form part of the specification, illustrate various embodiments.

[0018] Figure 1 shows a cache according to an embodiment.

[0019] Figure 2 shows a storage unit according to an embodiment.

[0020] Figure 3 shows an image analyzed by an image analysis algorithm.

[0021] Figure 4 shows an operand region according to an embodiment.

[0022] Figures 5A-5B show physical and virtual addressing according to an embodiment.

[0023] Figures 6A-6D show an address template according to an embodiment.

[0024] Figures 7A-7H show an address template according to an embodiment.

[0025] Figure 8 is a flowchart illustrating a process according to an embodiment.

[0026] Figure 9 is a block diagram of a device according to an embodiment.

[0027] Figure 10A shows a linear array of memory; Figure 10B shows a two-dimensional view of a linear array of memory.

Detailed Description

[0028] Figure 1 shows a cache 100 according to an embodiment.

[0029] The cache 100 may include one or more storage units 102, one or more multiplexers 104, and one or more multiplexers 106.

[0030] As shown in the figure, the storage units 102 are arranged in a p×q array (p rows and q columns of storage units 102). In the illustrated embodiment, p = q = 8 = 2^3. In general, other values for p and q can be used, such as other powers of 2 or, more generally, any other value. The values of p and q can be the same or different from each other. The array can be a logical grouping of storage units and does not necessarily indicate their physical implementation on, for example, silicon.

[0031] Multiplexers 104 and 106 can be arranged in various ways in cache 100. For example, as shown, there is a multiplexer 104 between each pair of adjacent storage units 102 in a given row of storage units 102, and an additional multiplexer 104 at the end of the row (resulting in q multiplexers 104 for each row of storage units 102); similarly, there is a multiplexer 106 between each pair of adjacent storage units 102 in a given column of storage units 102, and an additional multiplexer 106 at the end of the column (resulting in p multiplexers 106 for each column of storage units 102). In this configuration, each storage unit 102 can be viewed as being associated with one multiplexer 104 and one multiplexer 106, with the multiplexer 104 shown to the right of the storage unit 102 and the multiplexer 106 shown at the bottom of the storage unit 102.

[0032] Multiplexers 104 are used to address columns of storage units 102 and of storage cells, and the set of multiplexers 104 may be referred to herein as the column addressing unit. For clarity, the column addressing unit refers to the structure formed by the set of multiplexers 104. In the illustrated embodiment, with respect to the illustrated storage units 102 and storage cells, the column addressing unit reads data in a left-to-right order.

[0033] Multiplexers 106 are used to address rows of storage units 102 and of storage cells, and the set of multiplexers 106 may be referred to herein as the row addressing unit. For clarity, the row addressing unit refers to the structure formed by the set of multiplexers 106. In the illustrated embodiment, with respect to the illustrated storage units 102 and storage cells, the row addressing unit reads data in a top-to-bottom order.

[0034] Figure 2 shows a storage unit 102 according to an embodiment. The two multiplexers 104 and 106 associated with the storage unit 102 are also shown.

[0035] Each storage unit 102 may include one or more storage cells, labeled B0 to B15 in the figure. As shown, storage cells B0 to B15 are arranged in an m×n array (having m rows and n columns of storage cells). In the illustrated embodiment, m = n = 4 = 2^2. In general, other values for m and n can be used, such as other powers of 2, and an m×n array is usually at least as large as a 2×2 array. The values of m and n can be the same or different from each other. The array can be a logical grouping of storage cells and does not necessarily indicate their physical implementation on, for example, silicon.

[0036] In some implementations, a storage cell can constitute one byte of memory. For the illustrated implementation, this means that storage unit 102 constitutes 16 bytes (= m*n*1 byte = 4*4*1 bytes), and cache 100 constitutes 1 kilobyte (= p*q*16 bytes = 8*8*16 bytes). In general, a storage cell can constitute any amount of memory suitable for a particular application, meaning that storage unit 102 and cache 100 can likewise constitute any amount of memory suitable for a particular application. Typically, for the desired purpose, the amount of memory in each of the storage cells, storage unit 102, and cache 100 will be a power of 2.
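The size arithmetic above generalizes to other parameter choices. The helper names below are hypothetical and only restate the formulas in the paragraph.

```python
def sizes(p, q, m, n, cell_bytes=1):
    """Bytes per storage unit and per cache for a p x q array of m x n units."""
    unit_bytes = m * n * cell_bytes
    cache_bytes = p * q * unit_bytes
    return unit_bytes, cache_bytes


def is_power_of_two(value):
    """True for 1, 2, 4, 8, ..."""
    return value > 0 and value & (value - 1) == 0


unit_bytes, cache_bytes = sizes(p=8, q=8, m=4, n=4)
print(unit_bytes, cache_bytes)  # 16 1024
```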

[0037] For discussion purposes, a column of storage unit 102 refers to m storage cells in a specific column out of n columns. As shown in the figure, each of the four columns of storage cells contains four storage cells. The first column of storage cells includes B15, B11, B7, and B3; the second column includes B14, B10, B6, and B2; the third column includes B13, B9, B5, and B1; and the fourth column includes B12, B8, B4, and B0. Similarly, a row of storage unit 102 refers to n storage cells in a specific row out of m rows of storage cells. As shown in the figure, each of the four rows of storage cells contains four storage cells. The first row of storage cells includes B15, B14, B13, and B12; the second row includes B11, B10, B9, and B8; the third row includes B7, B6, B5, and B4; and the fourth row includes B3, B2, B1, and B0.
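The labeling of rows and columns described above follows a simple rule for the illustrated 4×4 unit: B15 sits at the top left and the index decreases from left to right and then from top to bottom. A short sketch, with a hypothetical helper `label`, reproduces the listed rows and columns.

```python
M, N = 4, 4  # rows and columns of cells in one unit (illustrated embodiment)


def label(row, col):
    """Cell label at 0-based (row, col); B15 is top-left, B0 is bottom-right."""
    return f"B{M * N - 1 - (row * N + col)}"


rows = [[label(r, c) for c in range(N)] for r in range(M)]
cols = [[label(r, c) for r in range(M)] for c in range(N)]
print(rows[0])  # ['B15', 'B14', 'B13', 'B12']
print(cols[0])  # ['B15', 'B11', 'B7', 'B3']
```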

[0038] Multiplexer 104 (shown to the right of storage unit 102) can be used to address columns of storage unit 102. As shown, multiplexer 104 includes multiple multiplexers 202. Specifically, multiplexer 104 may include m multiplexers 202, where each multiplexer 202 may be an n-to-1 multiplexer. Each of the multiplexers 202 may correspond to a specific row of storage cells, and each may be connected to every storage cell in that row. For example, as shown in Figure 2, the topmost multiplexer 202 is associated with the first row of storage cells and has inputs connected to storage cells B15, B14, B13, and B12 of the first row. Similarly, the multiplexer 202 directly below the topmost multiplexer is associated with the second row of storage cells and has inputs connected to storage cells B11, B10, B9, and B8 of the second row. The other multiplexers 202 are similarly shown as being associated with a row of storage cells and connected to the storage cells in their respective rows as inputs. Each multiplexer 202 has a single output corresponding to the selection of one of its inputs.

[0039] Exemplary connections between memory cells and multiplexers 202 are shown with solid arrows. These solid arrows connect to dashed arrows leading to the corresponding memory cells. The outputs of multiplexers 202 are also indicated by arrows. The text on the output arrows indicates a specific portion of the memory output corresponding to the multiplexer 202. For example, as shown, there are four multiplexers 202, each selecting from a memory cell of one byte, meaning the combined output of the four multiplexers 202 is a 32-bit word (in this example). As shown, the topmost multiplexer 202 corresponds to bits [31:24] of the 32-bit word, the next multiplexer 202 corresponds to bits [23:16], the next corresponds to bits [15:8], and the last one corresponds to bits [7:0].
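The bit assignment described above amounts to packing four byte-wide outputs into one word, topmost multiplexer first. The function name and the sample byte values in this sketch are illustrative assumptions.

```python
def pack_word(mux_outputs):
    """Combine four byte-wide multiplexer outputs, topmost first, into a 32-bit word."""
    word = 0
    for byte in mux_outputs:  # the first (topmost) output ends up in bits [31:24]
        word = (word << 8) | (byte & 0xFF)
    return word


word = pack_word([0xAA, 0xBB, 0xCC, 0xDD])
print(hex(word))  # 0xaabbccdd
```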

[0040] Multiplexer 106 (shown at the bottom of storage unit 102) can be used to address rows of storage unit 102. As shown, multiplexer 106 includes multiple multiplexers 204. Specifically, multiplexer 106 may include n multiplexers 204, where each multiplexer 204 may be an m-to-1 multiplexer. Each of the multiplexers 204 may correspond to a specific column of storage cells, and each may be connected to every storage cell in that column. For example, as shown in Figure 2, the leftmost multiplexer 204 is associated with the first column of storage cells and has inputs connected to storage cells B15, B11, B7, and B3 of the first column. Similarly, the multiplexer 204 immediately to the right of the leftmost multiplexer 204 is associated with the second column of storage cells and has inputs connected to storage cells B14, B10, B6, and B2 of the second column. The other multiplexers 204 are similarly shown as being associated with a column of storage cells and connected to the storage cells in their respective columns as inputs. Each multiplexer 204 has a single output corresponding to the selection of one of its inputs.

[0041] An exemplary connection between the memory cell and multiplexer 204 is shown by dashed arrows. The dashed arrows lead directly from the memory cell to the corresponding multiplexer 204. The output of multiplexer 204 is also shown by arrows. The text on the output arrows indicates a specific portion of the memory output corresponding to multiplexer 204. For example, as shown, there are four multiplexers 204, each selecting from a memory cell of one byte, meaning that the combined output of the four multiplexers 204 is a 32-bit word (in this example). As shown, the leftmost multiplexer 204 corresponds to bits [31:24] of the 32-bit word, the next multiplexer 204 corresponds to bits [23:16], the next corresponds to bits [15:8], and the last rightmost corresponds to bits [7:0].

[0042] In addition to multiplexers 202 and 204, which can select desired memory outputs (e.g., 32-bit words as shown) from storage unit 102, multiplexers 104 and 106 may also include additional multiplexers. For example, each multiplexer 104 (except for the multiplexer 104 associated with the leftmost column of storage units 102) may include a 2-to-1 multiplexer for each row of the storage unit 102, which passes either the output of the storage unit 102 associated with that multiplexer 104 or the output of the multiplexer 104 associated with the column of storage units 102 directly to its left. Similarly, each multiplexer 106 (except for the multiplexer 106 associated with the top row of storage units 102) may include a 2-to-1 multiplexer for each column of the storage unit 102, which passes either the output of the storage unit 102 associated with that multiplexer 106 or the output of the multiplexer 106 associated with the row of storage units 102 directly above it.
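The pass-through behavior of the 2-to-1 multiplexers along one row of units can be modeled as a chain. This is a behavioral sketch only; the function name, the select-bit convention, and the sample values are assumptions made for illustration.

```python
def chain_select(unit_outputs, take_own):
    """Propagate one value along a row of units through 2-to-1 multiplexers.

    unit_outputs[j] is the value offered by unit j (left to right).
    take_own[j] is the select bit of unit j's 2-to-1 multiplexer: True passes the
    unit's own output, False passes whatever arrived from the left.
    The leftmost unit has no 2-to-1 multiplexer, so take_own[0] is ignored.
    """
    value = unit_outputs[0]
    for own, take in zip(unit_outputs[1:], take_own[1:]):
        value = own if take else value
    return value


outputs = [10, 11, 12, 13, 14, 15, 16, 17]  # one value per unit, q = 8
select_unit_3 = [False, False, False, True, False, False, False, False]
print(chain_select(outputs, select_unit_3))  # 13
```

Setting exactly one select bit forwards that unit's output to the end of the row, which is how the chain behaves like a single wide multiplexer.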

[0043] As described above, the column addressing unit (i.e., the set of multiplexers 104) may include p*m multiplexers, each of which is an (n*q)-to-1 multiplexer, where an n-to-1 multiplexer is used for each storage unit 102 and a q-to-1 multiplexer is used to select the output of one column from the columns of storage units 102. These p*m (n*q)-to-1 multiplexers can be implemented in a variety of functionally equivalent ways. For example, the q-to-1 portion can be distributed as q 2-to-1 multiplexers. Taking q=8 as an example, eight 2-to-1 multiplexers in a tree can be equivalent to one 8-to-1 multiplexer. In general, the multiplexers of the column addressing unit can be distributed in a modular manner so that they can be physically implemented as circuits. The specific implementation can also be further optimized, for example to improve the interconnection between storage cells and multiplexers.

[0044] Similarly, the row addressing unit (i.e., the set of multiplexers 106) may include q*n multiplexers, each of which is an (m*p)-to-1 multiplexer, where an m-to-1 multiplexer is used for each storage unit 102 and a p-to-1 multiplexer is used to select the output of one row from the rows of storage units 102. These q*n (m*p)-to-1 multiplexers can be implemented in a variety of functionally equivalent ways. For example, the p-to-1 portion can be distributed as p 2-to-1 multiplexers. Taking p=8 as an example, eight 2-to-1 multiplexers in a tree can be equivalent to one 8-to-1 multiplexer. In general, the multiplexers of the row addressing unit can be distributed in a modular manner so that they can be physically implemented as circuits. The specific implementation can also be further optimized, for example to improve the interconnection between storage cells and multiplexers.

[0045] Cache 100 supports flexible methods for reading and writing operations.

[0046] Regarding read operations, cache 100 can be considered to have two read ports, an "X" port and a "Y" port. The "X" port reads vertically, based on the row addressing unit (e.g., as shown in Figures 1 and 2), such as from top to bottom. The "Y" port reads horizontally, based on the column addressing unit (e.g., as shown in Figures 1 and 2), such as from left to right. Selection signals can select which storage cells are read to form the output.

[0047] Regarding read operations within a single storage unit 102, in Figure 2 port "X" is labeled "x_rd" and port "Y" is labeled "y_rd". As an example read operation on port "X", bytes B15, B14, B13, and B12 (corresponding to the first row of storage cells) can be read by signaling each multiplexer 204 to select, as its output, the storage cell in the first row that lies in the column associated with that multiplexer 204. Bytes in the other rows of storage cells can be read similarly. Other read patterns are also possible. For example, another read operation can read bytes B3, B6, B9, and B12 (a ladder pattern), for example by signaling the multiplexers 204 to select storage cells in successive rows, each in the column associated with the respective multiplexer 204. Similarly, byte patterns such as B15, B10, B9, B4 or B7, B6, B8, B9 can be read. However, in the implementation illustrated in Figure 2, two bytes in the same column of storage cells (e.g., bytes B3 and B7) cannot be read together by the "X" port, because both would have to be selected by the same multiplexer 204, which has only one output. As will become clear from the description of the "Y" port, two bytes in the same column of storage cells can be read by the "Y" port. In general, the "X" port can read bytes in any pattern, as long as no two bytes in the same column of storage cells are read together.
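A behavioral sketch of the "X" port within one unit may help here. It models each of the n column multiplexers 204 as a row index per column; the function name and the list-based select encoding are assumptions made for illustration.

```python
# One unit with the cell layout of Figure 2: B15 top-left, B0 bottom-right.
M, N = 4, 4
unit = [[f"B{M * N - 1 - (r * N + c)}" for c in range(N)] for r in range(M)]


def x_port_read(unit, row_select):
    """Model the n column multiplexers 204: row_select[c] picks a row in column c."""
    return [unit[row_select[c]][c] for c in range(len(row_select))]


print(x_port_read(unit, [0, 0, 0, 0]))  # first row
print(x_port_read(unit, [3, 2, 1, 0]))  # ladder pattern
```

Because each column contributes exactly one select value, the model cannot express a read of two cells from the same column, which mirrors the restriction stated above.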

[0048] As an example read operation on the "Y" port, bytes B15, B11, B7, and B3 (corresponding to the first column of storage cells) can be read by signaling each multiplexer 202 to select, as its output, the storage cell in the first column that lies in the row associated with that multiplexer 202. Bytes in the other columns of storage cells can be read similarly. Other read patterns are also possible. For example, another read operation can read bytes B3, B6, B9, and B12 (a ladder pattern), for example by signaling the multiplexers 202 to select storage cells in successive columns, each in the row associated with the respective multiplexer 202. Similarly, byte patterns such as B15, B10, B6, and B1, or B14, B10, B5, and B1 can be read. However, in the implementation illustrated in Figure 2, two bytes in the same row of storage cells (e.g., bytes B5 and B4) cannot be read together by the "Y" port, because both would have to be selected by the same multiplexer 202, which has only one output. Two such bytes in the same row of storage cells can, however, be read through the "X" port. In general, the "Y" port can read bytes in any pattern, as long as no two bytes in the same row of storage cells are read together.
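The two port restrictions can be checked mechanically. The sketch below assumes the Figure 2 layout of a single 4×4 unit; `position` and `readable` are hypothetical helper names.

```python
M, N = 4, 4  # illustrated embodiment


def position(label):
    """0-based (row, col) of a cell label such as 'B9' in the Figure 2 layout."""
    index = M * N - 1 - int(label[1:])
    return divmod(index, N)


def readable(labels, port):
    """True if the pattern fits in one read on the given port ('X' or 'Y').

    Port X has one multiplexer per column and port Y one per row, so a pattern
    conflicts when two of its cells share a column (X) or a row (Y).
    """
    axis = 1 if port == "X" else 0
    used = [position(label)[axis] for label in labels]
    return len(used) == len(set(used))


print(readable(["B15", "B10", "B6", "B1"], "Y"))  # True
print(readable(["B5", "B4"], "Y"))                # False
print(readable(["B5", "B4"], "X"))                # True
```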

[0049] The read operations of cache 100 are similar to those described above for an individual storage unit 102. In a given read cycle (e.g., corresponding to a single clock cycle), each of the "X" and "Y" ports can be signaled to select up to m storage cells per column of storage units 102 (for the "X" port), or up to n storage cells per row of storage units 102 (for the "Y" port), up to a maximum of q*m storage cells (across the entire cache, for the "X" port) or p*n storage cells (across the entire cache, for the "Y" port). For the illustrated values of p, q, m, and n, this amounts to up to 4 bytes read from each row or column of storage units 102, up to a maximum of 32 bytes. Some rows or columns of storage units 102 may have no storage cells selected, while others may have only some of their storage cells selected. Memory read through ports "X" and/or "Y" can be assembled (e.g., the bytes read together) into vectors (e.g., operand vectors) so that processing elements can operate on them. A processing element (e.g., the vector processor 902 shown in Figure 9) can be designed to operate on data of a specific size (e.g., 128 bits), such as a single instruction, multiple data (SIMD) processing element.

[0050] Preparing vectors for processing elements may also include additional multiplexing and alignment operations for the "X" and "Y" ports, so that only the relevant set of memory (e.g., only that corresponding to the relevant pixels) is transferred from cache 100 to the vector on which the processing element operates. In some cases, it may be useful to select and read memory (e.g., corresponding to pixels) from the entire width of cache 100 (e.g., from any storage cell of any storage unit 102). In other cases, smaller regions of interest can be isolated, improving efficiency (e.g., power efficiency) by focusing on those regions. Such implementations are described below.

[0051] Multiple rows or columns of memory cells can be read in parallel. Address templates (described below) can be used to facilitate this reading. Reading memory cells in different patterns in this way may be particularly useful in certain applications (e.g., applications related to analyzing or processing images, including extracting image features). Linear algebra is another possible area of application. For example, in the same read cycle (e.g., corresponding to a single clock cycle), the "X" read port can provide access to a row of data while the "Y" read port provides access to a column of data, which may be helpful for some algorithms. More generally, the embodiments disclosed herein can access other types of multidimensional data in a non-linear manner, and therefore algorithms that require non-linear access to data can benefit from these embodiments.

[0052] Complex addressing schemes for read operations involve trade-offs between wiring complexity and read flexibility. In some applications, it may be necessary to implement flexible read operations in only one of the "X" or "Y" ports within a given clock cycle. However, in other applications, the flexibility to perform read operations in both "X" and "Y" ports within the same clock cycle may be beneficial and justify the added complexity and power. For example, Figure 3 shows an example image analyzed by an image processing algorithm. The numbers "1", "2", "3", etc., up to "8" in the boxes (representing sub-regions of the image, such as pixels) indicate the paths to be analyzed; the same number represents the same path. For an algorithm that accesses the rightmost path represented by "8" in a single read cycle, the "Y" port is needed because in the "X" port, four of the five sub-regions have read contention in the vertical direction (see the horizontal dashed arrow). If the "X" port were used, two read cycles would be required to read them (see the vertical dashed arrow). On the other hand, in other cases, the algorithm might find it more efficient to use the "X" port. It is therefore beneficial to provide flexible reads from both the "X" and "Y" ports. In still other cases, such as when read contention exists in both the "X" and "Y" ports, it would be useful to read from both the "X" and "Y" ports within the same clock cycle. This reduces the total number of clock cycles required to read a given set of storage cells.

[0053] Regarding write operations, cache 100 can support capabilities similar to read operations, allowing for nearly arbitrary write operations. However, in some implementations, writes can be implemented in a simpler way, for example, by allowing only writes to the "X" port for bytes in the same row of storage cells, or only writes to the "Y" port for bytes in the same column of storage cells, or allowing writes to either the "X" or "Y" ports, but only for storage cells in the same row or column, respectively. For example, a write can be the same operation as in a standard register file. In some applications, the usefulness of being able to perform different read modes does not necessarily extend to writes, and therefore cache 100 can be implemented more simply by having simpler write operations. For example, an image analysis algorithm may be able to analyze an image using specific access modes, but may not need to use those access modes to update the image.

[0054] The portion of storage unit 102 used for processing is referred to as a virtual canvas. As described herein, this can include all storage units 102 or a subset thereof. By extension (similar to that described for operand region 402), the virtual canvas can also refer to the contents of remote memory currently mirrored in those storage units 102.

[0055] Typically, the virtual canvas of a cache can be a read-centric resource and can rely on the tendency of applications to perform far more reads than writes from remote memory during processing (e.g., image filtering). For example, during image analysis, some applications may not perform any writes to image memory at all. Therefore, some cache implementations may rely primarily or exclusively on "write-around" behavior, rather than the "write-through" or "write-back" mechanisms employed by some other caches. "Write-through," "write-back," and "write-around" behaviors refer to signaling I / O completion upon write, specifically during remote memory updates ("write-around"), cache updates ("write-back"), or only after both have been updated ("write-through"). In these implementations, processing elements can perform "write-around" behavior, where infrequent remote memory updates completely bypass the cache and go directly to remote memory. Such behavior simplifies cache operation and naturally preserves the modified portions of remote memory in the virtual canvas, derived from conventional spatial filtering techniques. This allows implementations to leverage the unique needs of certain processing applications (e.g., image processing and analysis) to circumvent the performance trade-offs associated with maintaining cache coherence relative to remote image memory.
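
As a non-limiting illustration of the "write-around" behavior described above, the following minimal Python sketch models a cache whose writes bypass the mirrored window entirely. The class name WriteAroundCache, the use of a bytearray as remote memory, and the address range chosen are all hypothetical and are not taken from the disclosed implementation.

```python
class WriteAroundCache:
    """Toy model: reads are served from a mirrored window, writes bypass it."""

    def __init__(self, remote, window):
        self.remote = remote                          # hypothetical remote memory (bytearray)
        self.lines = {a: remote[a] for a in window}   # snapshot of the mirrored addresses

    def read(self, addr):
        # Serve from the mirrored copy when present, otherwise from remote memory.
        return self.lines[addr] if addr in self.lines else self.remote[addr]

    def write(self, addr, value):
        # Write-around: only remote memory is updated; the cache is not touched.
        self.remote[addr] = value


remote = bytearray(range(16))
cache = WriteAroundCache(remote, range(4, 8))
cache.write(5, 99)
print(remote[5], cache.read(5))   # remote memory holds 99; the mirrored copy still holds 5
```

In this sketch the mirrored copy keeps the value loaded at the last fill until it is explicitly refreshed, which is why no coherence bookkeeping is needed on the write path.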

[0056] As described above, there are situations in which a small region of interest can be isolated and focused on during a read or processing operation. This smaller region of interest is referred to as the operand region. In an implementation, the operand region may include an origin, and the region may have any particular shape or size; for example, the operand region may be circular or elliptical, and described by a radius or by a length and width.

[0057] Figure 4 shows an operand region 402 according to an embodiment. For illustrative purposes, cache 100 is shown with some memory cells 102 removed. Operand region 402 is associated with virtual origin 404 and (partially or wholly) contains one or more memory cells 102. In general, an operand region can be any particular shape containing one or more memory cells 102. As shown, operand region 402 is a circle centered approximately on virtual origin 404. Any operand within operand region 402 is reachable using address templates (as described herein).

[0058] The operand region 402 shown in Figure 4 is a region containing one or more memory cells 102. By extension, the remote memory region represented by the contents of one or more memory cells 102 (i.e., a portion of the remote memory mirrored in these memory cells 102) can also be regarded as operand region 402.

[0059] The range of operand region 402 can be determined by the design of the address template used. For example, the address template can use two's complement to reference the initial operand relative to the virtual origin 404. For n-bit two's complement, the first operand can be located anywhere in the range -2^(n-1) to +2^(n-1) - 1 relative to the virtual origin 404. Additional operands can be computed in the same way (i.e., relative to the virtual origin 404), in which case the shaded operand region 402 shown in Figure 4 represents the operands reachable from the virtual origin 404 via the address template. Alternatively, additional operands can be calculated as offsets from other operands, such as offsets from the previous operand. Depending on the number of operands, the number of bits used to derive each operand, and how each operand is derived, the region (operand region 402) containing all operands reachable from the virtual origin 404 via the given address template can be irregularly shaped and can cover all or almost all memory cells 102 of the cache 100, or at least the memory cells 102 that mirror remote memory.
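
By way of a hedged illustration, the reach of a two's-complement offset can be worked out as follows. The sketch uses the standard n-bit two's-complement range; the function names offset_range and decode_offset are illustrative only, and the 5-bit width is chosen to match the 5-bit offset fields of the example address template described later.

```python
def offset_range(n_bits):
    """Inclusive (min, max) of an n-bit two's-complement value."""
    return -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1


def decode_offset(raw, n_bits):
    """Interpret the low n_bits of raw as a signed two's-complement offset."""
    raw &= (1 << n_bits) - 1
    return raw - (1 << n_bits) if raw & (1 << (n_bits - 1)) else raw


print(offset_range(5))             # (-16, 15)
print(decode_offset(0b11001, 5))   # -7
```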

[0060] Restricting read operations to read only memory cells within operand region 402 can improve the efficiency of forming operand vectors for processing elements, such as improving power efficiency.

[0061] For discussion purposes, the following description uses image analysis algorithms as example applications. This discussion should be understood to generally apply to other applications that can utilize the cache 100 described herein. Furthermore, for discussion purposes, it will be assumed that the cache 100 has p = q = 8 and m = n = 4, where the size of the storage cell is one byte.

[0062] The storage units of cache 100 contain the contents of remote memory, such as pixel data of an image. For this discussion, remote memory and image memory will be used interchangeably, without limiting the implementation to image data. For this discussion, it will be assumed that cache 100 contains the image being analyzed. Typically, image data (e.g., 256 × 256 bytes = 65536 bytes = 64KB) will be much larger than the size of cache 100 (1KB in this example), so cache 100 can only store a portion of the image data at any given time, which is conceptually a two-dimensional window into the entire image. For this example, it is assumed that the maximum size of the image data is 64KB.

[0063] In the following discussion, the terms "virtual address" and "physical address" refer to different schemes for addressing the contents of cache 100. As used herein, a physical address is an address within cache 100 that refers to a single storage cell. In this example, the physical address requires 10 bits, with 5 bits for the "x address" and 5 bits for the "y address," each between 0 and 31. A virtual address, on the other hand, refers to a portion of the image data mirrored in cache 100. In this example, the virtual address requires 16 bits, with 8 bits for the "x address" and 8 bits for the "y address," each between 0 and 255 (in this example, based on the maximum size of the image). In some implementations, the virtual address resolves only to the granularity of memory cell 102. For example, the six most significant bits of each of the x and y portions of the virtual address can be used to refer to a specific portion of the image data that fits within one memory cell 102, and the two least significant bits of each of the x and y portions of the virtual address can be used to refer to a storage cell within that memory cell 102, and thus can correspond to the two least significant bits of the physical address representing the same storage cell. In cases where cache 100 is smaller than the image (i.e., the entire image cannot be contained in cache 100), there will be more virtual addresses than physical addresses. There may be a mapping between virtual and physical addresses; therefore, a virtual address indirectly refers to a single storage cell (as long as the image memory that the virtual address represents is currently being mirrored in cache 100).
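
The bit split described above can be sketched as follows for the example dimensions (p = q = 8, m = n = 4). This is only an illustration; the names UNIT_BITS, split_virtual, and split_physical are hypothetical.

```python
UNIT_BITS = 2   # low bits select one of 4 storage cells along one axis of a 4x4 unit


def split_virtual(coord):
    """Split one 8-bit virtual coordinate into (6-bit unit index, 2-bit cell index)."""
    if not 0 <= coord <= 255:
        raise ValueError("virtual coordinate must fit in 8 bits")
    return coord >> UNIT_BITS, coord & ((1 << UNIT_BITS) - 1)


def split_physical(coord):
    """Split one 5-bit physical coordinate into (3-bit unit index, 2-bit cell index)."""
    if not 0 <= coord <= 31:
        raise ValueError("physical coordinate must fit in 5 bits")
    return coord >> UNIT_BITS, coord & ((1 << UNIT_BITS) - 1)


print(split_virtual(34))    # (8, 2)
print(split_physical(22))   # (5, 2)
```

The cell index is the same in both views; only the unit index differs between the virtual and physical coordinate.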

[0064] In some implementations, the virtual address of the memory cell 102 corresponding to the earliest position in the image being read into cache 100 (e.g., the lower left memory cell 102 of the virtual canvas) must be aligned on even 4-byte boundaries (in both row height and column width), but there are no other restrictions. This alignment is advantageous in this implementation because 4 bytes is the size of memory cell 102 (i.e., 4 bytes × 4 bytes in this example). Because virtual addressing resolves only to the granularity of a single memory cell 102 as described above, maintaining this alignment allows a "virtual column" or "virtual row" of memory cells 102 to be easily reallocated during refresh operations, simplifying the migration of cache 100 across different parts of remote memory. With this alignment maintained, migrations across remote memory can always be performed in 4-byte increments in any given direction.

[0065] Because cache 100 is typically not large enough to contain the entire image being analyzed at once, it is advantageous to have a cache management strategy to refresh the contents of cache 100 in order to execute image analysis algorithms. This cache management strategy can take various forms. The main goal is to ensure that the image data required by the image analysis algorithm is mirrored into cache 100 in a timely manner. For example, in some algorithms, it is possible to predict with reasonable accuracy that image data from a specific region will be needed at a given time. For example, the algorithm described with respect to Figure 3 can follow a path, and the image data needed can be predicted based on path information. In other algorithms, there may be some other directionality in the image data being processed. In still other algorithms, additional information (e.g., about the image, the algorithm, or something else) can be used to predict what image data might be needed.

[0066] An example of a cache management strategy is to use a virtual origin 404 and refresh cache 100 when the virtual origin approaches the window boundary of the image mirrored in cache 100. For example, if the virtual origin 404 is close to the top of the image data mirrored in cache 100, it can be inferred that the bottom of the image data mirrored in cache 100 is unlikely to be needed and can be replaced (e.g., updated or refreshed) by image data located above the top currently mirrored in cache 100. In this way, the region of remote memory mirrored in cache 100 can be changed to predict the needs of the image analysis algorithm. The image analysis algorithm can move the virtual origin 404 based on its processing to manage the contents of cache 100, causing cache 100 to occasionally trigger update or refresh operations. Sometimes, instead of updating or refreshing in this way, cache 100 can be refreshed on demand (similar to a traditional central processing unit (CPU) cache), for example, when the image analysis algorithm references operands outside the virtual canvas. This may result in some performance loss because more data needs to be read into the cache, but it can also provide flexibility for image analysis algorithms to reference arbitrary parts of the image.
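
A minimal sketch of the boundary test implied by this strategy is given below. It assumes coordinates in units of storage units, virtual row numbers that increase upward (as in the example sequence described later), and a hypothetical threshold parameter margin; the source does not specify how close the virtual origin must be to an edge before a refresh is triggered.

```python
def edges_to_refresh(origin, window, margin):
    """Return the window edges that the virtual origin is within `margin` units of.

    origin: (row, col) of the virtual origin, in storage-unit coordinates.
    window: (row_min, row_max, col_min, col_max), inclusive, of the mirrored window.
    margin: hypothetical threshold, not taken from the source.
    """
    row, col = origin
    row_min, row_max, col_min, col_max = window
    edges = []
    if row_max - row < margin:
        edges.append("top")      # larger virtual row numbers are higher up
    if row - row_min < margin:
        edges.append("bottom")
    if col - col_min < margin:
        edges.append("left")
    if col_max - col < margin:
        edges.append("right")
    return edges


print(edges_to_refresh((12, 5), (8, 13, 2, 9), margin=2))   # ['top']
```

An empty result means no refresh is needed; an on-demand policy would instead react only when an operand falls outside the window.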

[0067] As the virtual origin 404 and the associated operand region 402 move, the storage cells in the cache 100 may need to be refreshed or updated with different portions of the image data. In fact, when the virtual origin 404 and the associated operand region 402 move, the image portion mirrored in the virtual canvas moves accordingly, for example, to keep the virtual origin 404 nearly centered within the virtual canvas. Some implementations may bias the shape or offset of the operand region 402 relative to the virtual origin 404, or may bias the cache refresh strategy to keep the virtual origin 404 within a specific portion of the virtual canvas to meet the needs of a particular application.

[0068] An example of image processing will now be described. Before processing, the load / store unit (e.g., the load / store module 906 shown in Figure 9) can fill some or all of the memory cells in cache 100. For example, the load / store unit can fill memory cells from image data stored in remote memory (e.g., static random access memory (SRAM)). Typically, images are stored linearly in SRAM, with one row of pixels stored sequentially after another. A series of reads (e.g., along rows of pixels) can be used to fill memory cells in cache 100. Once cache 100 is initialized with a portion of the image data, processing can proceed. As processing occurs, the processing element can move the virtual origin 404, and the associated operand region 402 can move along with the virtual origin 404. When the virtual origin 404 approaches an edge of the virtual canvas, such as the right or bottom edge, the load / store unit retrieves image data from the appropriate memory address (i.e., the address representing image data adjacent to the image data stored at the approaching edge) to fill memory cells in cache 100 with that data. The edge of the virtual canvas refers to the edge of the window of the image from remote memory currently mirrored in the virtual canvas.

[0069] As shown below, as the window moves around the image, the contents mirrored in the virtual canvas keep the virtual row or column numbers of storage units 102 in ascending order, while the order of the physical row or column numbers changes in the process. For example, when approaching the right edge of the virtual canvas, a new column of memory cells 102 in cache 100 can be filled by effectively removing the column of memory cells 102 that is now furthest from the virtual origin 404. That is, the memory cells 102 of the leftmost physical column can be refilled with image data from the virtual address to the right of the rightmost physical column. Similarly, when approaching the bottom edge of the virtual canvas, for example, a new row of memory cells 102 in cache 100 can be filled by effectively removing the row of memory cells 102 that is now furthest from the virtual origin 404. That is, the memory cells 102 of the topmost physical row can be refilled with image data from the virtual address below the bottommost physical row. This update is performed without having to reposition the contents of other memory cells 102 in cache 100. A mapping (e.g., a mapping between virtual column numbers and physical column numbers) is maintained to track which part (virtual address) of the image data is assigned to which memory cell 102 (physical address).
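
One possible way to keep such a mapping without moving any contents is a modulo (ring) placement, sketched below for columns. This is an illustrative assumption rather than the disclosed mapping; the class name ColumnMap and its methods are hypothetical.

```python
class ColumnMap:
    """Tracks which physical column of units holds each mirrored virtual column."""

    def __init__(self, first_virtual, num_physical):
        self.base = first_virtual             # virtual column initially at physical column 0
        self.first_virtual = first_virtual    # leftmost virtual column currently mirrored
        self.num_physical = num_physical      # q, the number of physical columns of units

    def physical(self, virtual):
        if not self.first_virtual <= virtual < self.first_virtual + self.num_physical:
            raise KeyError("virtual column is not mirrored in the cache")
        # Ring placement: contents never move, only the mapping changes.
        return (virtual - self.base) % self.num_physical

    def shift_right(self):
        """Drop the leftmost virtual column; return (physical column to refill, new virtual column)."""
        new_virtual = self.first_virtual + self.num_physical
        self.first_virtual += 1
        return self.physical(new_virtual), new_virtual


cmap = ColumnMap(first_virtual=2, num_physical=8)
print(cmap.shift_right())   # (0, 10): physical column 0 is refilled with virtual column 10
```

A row map can be built the same way, and shifting left or down would mirror the shift_right step.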

[0070] As described above, in some implementations, the virtual address resolves only to the granularity of memory cell 102, and the portion of the virtual address referencing a storage cell within memory cell 102 is equal to the physical address of that storage cell within memory cell 102. That is, the 16-bit address (in this example) can be viewed as an 8-bit row address (x address) and an 8-bit column address (y address). While the term "virtual address" can refer to the entire 8-bit row or column address, only the higher or most significant 6 bits (specifying one of the 64 rows or columns of memory cell 102 used for images in remote memory) are virtual, while the lower or least significant 2 bits (specifying one of the 4 rows or columns of storage cells within memory cell 102) are physical. In other words, the higher 6 bits undergo address translation to dynamically map which physical row or column of memory cell 102 in cache 100 corresponds to the virtual row or column of the window in remote memory. The lower or least significant 2 bits are not translated and are used to look up one of the four bytes exactly as specified. Other addressing or translation schemes are also possible.
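
The translation described above can be sketched as a table lookup on the upper six bits with the lower two bits passed through unchanged. The dictionaries x_table and y_table are hypothetical stand-ins for whatever structure holds the mapping, and the initial contents shown are only an example.

```python
def translate(vx, vy, x_table, y_table):
    """Translate 8-bit virtual (x, y) coordinates to 5-bit physical (x, y).

    x_table / y_table: hypothetical dicts mapping a 6-bit virtual unit index
    to a 3-bit physical unit index along each axis.
    """
    px = (x_table[vx >> 2] << 2) | (vx & 0b11)   # upper 6 bits translated, low 2 kept
    py = (y_table[vy >> 2] << 2) | (vy & 0b11)
    return px, py


# Example mapping: virtual units 2..9 along x and 8..15 along y map to physical 0..7.
x_table = {2 + i: i for i in range(8)}
y_table = {8 + i: i for i in range(8)}

print(translate(10, 33, x_table, y_table))   # (2, 1)
```

A lookup that fails (a KeyError in this sketch) corresponds to a virtual address whose image data is not currently mirrored in the cache.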

[0071] For an image of size 256×256 (64KB), cache 100 (1KB in this example) can only contain a maximum of 1 / 64 of the image data. This means that the image portion in cache 100 is always a small window of the complete image content. While processing the image, the position of this small window may move, but the window size remains constant.

[0072] Since remote memory (e.g., SRAM) typically represents an image as a linear array of bytes, in the case of an image size of 256×256 pixels, the memory will store 256 connected rows, which can be addressed from a certain offset addr up to addr+65535. One implication of this arrangement is that at any given time, cache 100 can contain 32 segmented byte intervals from the linear array in remote memory, each segmented byte interval being separated by 256 bytes (the length of a row). For example, in the case where the lower left memory cell 102 is mapped to virtual row = 8 and virtual column = 2, the rows of memory cell 102 in cache 100 contain linear array entries addr+2112 to addr+2143, addr+2368 to addr+2399, addr+2624 to addr+2655, and so on, up to addr+3904 to addr+3935.

[0073] This is illustrated in Figures 10A and 10B. Figure 10A shows a remote memory employing a linear array 1002 of bytes. For example, an image can be represented linearly by addresses ranging from 0x0000 to 0xffff (i.e., 0 to 2^16 - 1). A portion 1004 of this memory is shown magnified, together with multiple 32-byte stripes (or intervals) 1006. These 32-byte stripes (or intervals) are separated by 224 bytes (i.e., 256 bytes - 32 bytes), where 224 bytes is the difference between the width of the image (256 bytes in this example) and the number of storage cells in a row of cache 100 (i.e., q*n, which equals 32 bytes in this example). Each sequential 32-byte stripe separated by the image width can be filled into cache 100, for example, starting from the lower left storage cell 102. The linear array of bytes 1002 can also be viewed as a two-dimensional structure, as shown in Figure 10B. The two-dimensional image 1010 can be implemented as a linear array of bytes 1002 in remote memory. The cache 100 can contain the contents of a portion 1012 of the image, which is a portion represented by sequential 32-byte stripes, each separated from the next by the width of the image.
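
The stripe layout can be illustrated with a short calculation. The sketch below assumes a window whose top-left byte sits at a given byte row and byte column of a 256-byte-wide image; the function name stripe_ranges and its parameters are hypothetical.

```python
IMAGE_WIDTH = 256   # bytes per image row in remote memory (example value)
STRIPE = 32         # q * n bytes per row of storage cells in the cache


def stripe_ranges(base, byte_row, byte_col, rows=32):
    """Return inclusive (start, end) linear addresses of each cached stripe."""
    if byte_col + STRIPE > IMAGE_WIDTH:
        raise ValueError("window would run past the end of an image row")
    ranges = []
    for r in range(rows):
        start = base + (byte_row + r) * IMAGE_WIDTH + byte_col
        ranges.append((start, start + STRIPE - 1))
    return ranges


first, second = stripe_ranges(base=0, byte_row=0, byte_col=0, rows=2)
print(first, second)                 # (0, 31) (256, 287)
print(second[0] - first[1] - 1)      # 224 bytes skipped between stripes
```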

[0074] In some implementations, all storage cells in cache 100 are available for processing by image processing algorithms. In other implementations, only a portion of the storage units 102 in cache 100 is used for processing by image processing algorithms. As described above, the portion of storage units 102 used for processing is referred to as a virtual canvas. For example, for low-pass image filtering, only the upper half of the cache (the top 4 rows by 8 columns of storage units 102) may be needed, while for other applications, such as performing certain other image processing algorithms (e.g., feature extraction), only a subset of 6 rows by 6 columns of storage units 102 may be needed. This then leaves at least the bottom two rows and the leftmost two columns of storage units 102 as scratchpad space, for example, for working variables, with the remainder used as the virtual canvas (the portion of cache 100 that mirrors remote memory). In some implementations, the number of storage units 102 reserved for scratchpad space (if any) is taken into account by updating or refreshing only the virtual canvas when the load / store unit fills storage cells with image data, and when determining whether the operand region 402 is approaching the edge of the cache. In other words, the number of automatically refreshed or updated storage units 102 is configured to meet the needs of a given application, which can help minimize irrelevant memory traffic.

[0075] When a processing element accesses local variables stored in the scratchpad space, it uses the physical address of storage unit 102 and treats cache 100 as a register file. Consistent access to the scratchpad space requires that refresh or update operations affecting the virtual canvas do not alter (e.g., overwrite or preempt) the storage unit 102 used as the scratchpad space.

[0076] When accessing image data in cache 100, the processing element can use a virtual address that reflects the portion of the image data mirrored in cache 100. The virtual addresses are updated when the virtual origin 404 approaches an edge of the virtual canvas (e.g., the left or right edge) and causes a new row or column of memory cells 102 to be filled. As the virtual canvas comes to store different portions of remote memory, the update or refresh process can track a horizontal line marking where the row indices wrap from largest to smallest, and a vertical line marking where the column indices wrap from largest to smallest (shown as bold lines in the example below).

[0077] The following sequence illustrates an example in which the virtual canvas of cache 100 is 6 rows by 8 columns of storage units 102, and the scratchpad space is 2 rows by 8 columns of storage units 102. Cache 100 is initialized starting from virtual row 8 and virtual column 2 of the image data, which is loaded into the lower left storage unit 102 of the virtual canvas.

[0078] During initialization, the cache is configured as follows.

[0079]

[0080] The load / store unit has filled the storage cells in the virtual canvas with the appropriate image data. The virtual addresses of this image data are shown above. The top-left storage unit 102 has virtual address "13,2" (indicating that the image region at virtual row 13 and virtual column 2 is mirrored in the cache at this storage unit 102), and the bottom-right storage unit 102 has virtual address "8,9". During initialization, the vertical and horizontal wraparound lines (in bold) are located at the far right and top of cache 100, respectively. Note that the bold lines are conceptual boundaries used to aid in visualizing the reallocation of virtual rows or columns of storage units during cache refreshes. Tracking these boundaries is also helpful for processing elements in managing cache refresh strategies.

[0081] When the virtual canvas moves up one row of storage units 102 (e.g., when the virtual origin 404 is near the top edge), the bottom row of the virtual canvas (i.e., the row "above" the bold horizontal wrapping line) is cleared and filled with the image portion adjacent to that indicated by the bold horizontal wrapping line. After the move, the horizontal wrapping line is updated as shown by moving it up (in this example, this causes the line to "wrap" around the top of the virtual canvas and move to the bottom). Moving the virtual canvas means that the window of the image in remote memory that is mirrored in the virtual canvas moves.

[0082]

[0083] When the virtual canvas moves one column of storage cells 102 to the right (e.g., when the virtual origin 404 is near the right edge), the leftmost column of storage cells 102 (indicated by the bold vertical wrapping line) is cleared and filled with the image portion adjacent to that indicated by the bold vertical wrapping line. After the move, the vertical wrapping line is updated as shown by moving it to the right (in this example, this would cause the line to "wrap" around the right side of the virtual canvas and move to the left side).

[0084]

[0085] When the virtual canvas moves up one row of storage cells 102 (e.g., when the virtual origin 404 is near the top edge), the second-to-last row of storage cells 102 of the virtual canvas (indicated by the bold horizontal wrapping line) is cleared and filled with the image portion adjacent to that indicated by the bold horizontal wrapping line. After the movement, the horizontal wrapping line is updated as shown by moving upwards.

[0086]

[0087] When the virtual canvas moves up one row of storage cells 102 (e.g., when the virtual origin 404 is near the top edge), the third-to-last row of storage cells 102 of the virtual canvas (indicated by the bold horizontal wrapping line) is cleared and filled with the image portion adjacent to that indicated by the bold horizontal wrapping line. After the movement, the horizontal wrapping line is updated as shown by moving upwards.

[0088]

[0089] When the virtual canvas moves one column of storage cells 102 to the right (for example, when the virtual origin 404 is near the right edge), the storage cells 102 in the second column from the left of the virtual canvas (indicated by the bold vertical wrapping line) are cleared and filled with the image portion adjacent to that indicated by the bold vertical wrapping line. After the movement, the vertical wrapping line is updated as shown by moving to the right.

[0090]

[0091] At this point, the virtual canvas has moved up 3 rows of storage cells 102 and to the right 2 columns of storage cells 102, which means that all storage cells in the cache 100 (except those in the shaded area shown below, which are depicted by bold wrapping lines) have been updated.
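
The five moves of this sequence (up, right, up, up, right) can be replayed with a small simulation that tracks only which virtual row or column each physical row or column of the virtual canvas holds. The list-based representation and the function names move_up and move_right are illustrative assumptions; physical index 0 denotes the bottom row or the leftmost column of the virtual canvas.

```python
ROWS, COLS = 6, 8                        # virtual canvas size in storage units

phys_rows = list(range(8, 8 + ROWS))     # virtual rows 8..13, index 0 = bottom row
phys_cols = list(range(2, 2 + COLS))     # virtual columns 2..9, index 0 = leftmost


def move_up(rows):
    """Refill the physical row holding the lowest virtual row with the next row above."""
    i = rows.index(min(rows))
    rows[i] = max(rows) + 1
    return i


def move_right(cols):
    """Refill the physical column holding the lowest virtual column with the next one right."""
    i = cols.index(min(cols))
    cols[i] = max(cols) + 1
    return i


replaced = [move_up(phys_rows), move_right(phys_cols), move_up(phys_rows),
            move_up(phys_rows), move_right(phys_cols)]

print(replaced)     # [0, 0, 1, 2, 1]
print(phys_rows)    # [14, 15, 16, 11, 12, 13]
print(phys_cols)    # [10, 11, 4, 5, 6, 7, 8, 9]
```

The replaced indices correspond to the bottom row, the leftmost column, the second and third rows from the bottom, and the second column from the left, in that order; the three rows and six columns never touched are the storage units that have not been updated.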

[0092]

[0093] In this example, the direction of the update or refresh process follows the virtual origin 404 of the operand region 402, and the direction can be reversed at any time depending on the movement of the virtual origin 404. In some cases, the image processing algorithm may need to access substantially different parts of the image, and may need to completely reinitialize the cache 100, rather than just updating a small number of rows or columns of storage units 102.

[0094] While the physical address of a given byte in cache 100 (which can be represented as 10 bits in this example) is easier to manipulate in some respects than the virtual address (which can be represented as 16 bits in this example), it is helpful to operate on virtual addresses in most applications. This can be illustrated by comparing Figures 5A and 5B. Figure 5A shows the image from the image analysis algorithm with its pixel fields labeled (numbered "1" to "8"). The image in Figure 5A represents the virtual address view of the image data in cache 100, regardless of how the virtual origin 404 moves. On the other hand, after being refreshed twice due to the rightmost column of memory cells 102 being replaced, the physical address view of the image data in cache 100 can resemble Figure 5B. In other words, the virtual address view presents the image data in its original spatial arrangement, while the physical address view can be shifted or segmented.

[0095] Although in this example the virtual address is 3 bits wider than the physical address in each of the x and y dimensions (8 bits versus 5 bits), virtual address operations are guaranteed to remain within the boundaries of the virtual canvas, and a simple table lookup is used at the end to map the virtual address to the physical address.

[0096] As described herein, cache 100 allows parallel access through ports "X" and "Y", allowing storage cells in close proximity to be read in arbitrary patterns. In some implementations, using an address template to signal the pattern of storage cells to be accessed can be useful for image processing algorithms. An address template is a compact representation of multiple storage cells to be read. In the current example, anywhere from one to eight storage cells (bytes) can be signaled within a given address template. Control and decoding circuitry (e.g., the control and decoding circuitry 903 shown in Figure 9) can process the address template to perform a read operation on cache 100 and fill a data vector with the contents of the storage cells, allowing the processing element to operate on the vector. The control and decoding circuitry can achieve this, for example, by sending an appropriate read control signal to cache 100 (using row and / or column addressing units) after decoding the address template. The control and decoding circuitry can further fill the data vector from the result of this read, for example, by concatenating different bytes together.

[0097] Address templates can take many different forms. In some implementations, address templates can be described as follows, where two basic types exist: pseudo-linear type or linear type:

[0098]

[0099] For each of the pseudo-linear and linear type templates, the first four bits of the structure ([3:0]) indicate the type of the template. For pseudo-linear types, the type is 0, and for linear types, the type is 1. As shown, four bits indicate the type to allow for the flexible addition of more types; in the case of only two types as shown, a single bit is sufficient to indicate the type. Similarly, for each of the pseudo-linear and linear type templates, the next bit [4] specifies the default read port ("X" port = 0 or "Y" port = 1), and the following ten bits specify the signed offset from the origin 404 to the first byte to be read ([9:5] for the X offset, [14:10] for the Y offset). After this, the formats of the two types of templates differ.
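
The common header fields can be extracted as sketched below, following the bit ranges stated above ([3:0], [4], [9:5], and [14:10]). The sketch assumes that the signed offsets use two's complement and that bit 0 is the least significant bit of the template word; the helper names bits, signed, and parse_header are hypothetical.

```python
def bits(word, hi, lo):
    """Extract the inclusive bit range [hi:lo] from word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def signed(raw, width):
    """Interpret a width-bit field as two's complement (an assumption of this sketch)."""
    return raw - (1 << width) if raw >> (width - 1) else raw


def parse_header(template):
    return {
        "type": bits(template, 3, 0),                    # 0 = pseudo-linear, 1 = linear
        "port": "Y" if bits(template, 4, 4) else "X",    # default read port
        "x_offset": signed(bits(template, 9, 5), 5),
        "y_offset": signed(bits(template, 14, 10), 5),
    }


# Linear type, default port "Y", first byte at offset (-7, 5) from the origin.
word = 1 | (1 << 4) | ((-7 & 0x1F) << 5) | (5 << 10)
print(parse_header(word))
```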

[0100] For pseudo-linear type templates, the subsequent offsets, each specifying an x and y offset, are provided as three-bit fields ([17:15], [20:18], [23:21], [26:24], [29:27], [32:30], and [35:33]). In implementations, the template can include any number of bytes to be read, such as anywhere from 1 byte to 8 bytes. For a given operation, the vector length can be specified by the processing element. In some implementations, the meaning of the three bits used for each offset can be described as follows:

[0101]
Δx,Δy[2:0]   Description
000          Δx = 1, Δy = 0
001          Δx = 0, Δy = 1
010          Δx = 1, Δy = 1
011          Δx = 2, Δy = 0
100          Δx = 0, Δy = 2
101          Δx = 2, Δy = 1
110          Δx = 1, Δy = 2
111          Δx = 2, Δy = 2

[0102] Pseudo-linear type templates form approximately linear patterns (e.g., angles or arcs) in which the x and y offsets do not reverse direction.
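
As an illustration, the three-bit offset codes can be decoded with a lookup table. The field positions follow the bit ranges given for the pseudo-linear template (the first field at [17:15], and each subsequent field three bits higher); the names DELTAS and pseudo_linear_deltas are hypothetical.

```python
# Three-bit code -> (dx, dy), as listed in the table of offset codes.
DELTAS = {
    0b000: (1, 0), 0b001: (0, 1), 0b010: (1, 1), 0b011: (2, 0),
    0b100: (0, 2), 0b101: (2, 1), 0b110: (1, 2), 0b111: (2, 2),
}


def pseudo_linear_deltas(template, count):
    """Decode the first `count` (at most 7) three-bit offset fields, starting at bit 15."""
    if not 0 <= count <= 7:
        raise ValueError("a template carries at most 7 relative offsets")
    return [DELTAS[(template >> (15 + 3 * i)) & 0b111] for i in range(count)]


word = (0b100 << 15) | (0b010 << 18) | (0b010 << 21) | (0b110 << 24)
print(pseudo_linear_deltas(word, 4))   # [(0, 2), (1, 1), (1, 1), (1, 2)]
```

Because no code encodes a zero step in both directions, the number of bytes to read has to come from elsewhere, consistent with the vector length being specified by the processing element.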

[0103] For linear type templates, the subsequent offsets, each specifying an x offset, are provided as two-bit fields ([16:15], [18:17], [20:19], [22:21], [24:23], [26:25], and [28:27]). In implementations, the template can include any number of bytes to be read, such as anywhere from 1 byte to 8 bytes. For a given operation, the vector length can be specified by the processing element. Linear type templates form either horizontal or vertical lines, with the bytes packed together or spaced apart. A pseudo-linear type template can also be used to specify a linear pattern by using offsets in only one of x or y (the other offset being 0). The advantage of linear type templates is that they can be specified more compactly.

[0104] In addition to the templates mentioned above, reflection control structures can also be used. For example, an application can set up reflection control once to apply to a sequence of read operations using an address template. Reflection control can be configured as follows:

[0105]
Field | Description
[0] | Polarity of the first Δx
[1] | Polarity of the first Δy
[2] | Polarity of the remaining Δx
[3] | Polarity of the remaining Δy
[4] | Swap x and y

[0106] As described above, when bit [4] of the address template is 0, the default read port is port "X", and when bit [4] of the address template is 1, the default read port is port "Y". This behavior can be changed using reflection control; for example, if bit [4] of the reflection control is 1, the default read port of any given template is swapped, and if bit [4] of the reflection control is 0, the default read port of any given template retains its normal behavior.

[0107] Reflection control can be implemented as a programmable register. When the control and decoding circuitry operates on the address template to perform a read operation, reflection control can indicate the polarity (sign) of Δx and Δy of the first byte of the read operation, as well as the polarity (sign) of Δx and Δy of the remaining bytes of the read operation. Reflection control can also indicate that the “X” and “Y” ports are swapped (bit [4]). This can have the effect of, for example, rotating the read pattern by 90°.
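
A rough sketch of how reflection control might be applied to a decoded template follows. The text does not define what a set polarity bit means or the order in which sign inversion and swapping occur. This sketch assumes a set bit negates the value and that the swap is applied after the sign changes. The function name `apply_reflection` is hypothetical.

```python
def apply_reflection(first, steps, control):
    """Apply a five-bit reflection control word to a decoded template.

    first   -- (x, y) offset of the first byte from the origin
    steps   -- list of (dx, dy) steps between consecutive bytes
    control -- bits [0] to [3] are polarities, bit [4] swaps x and y
    """
    def flip(value, bit):
        # Assumption: a set polarity bit negates the value.
        return -value if (control >> bit) & 1 else value

    first = (flip(first[0], 0), flip(first[1], 1))
    steps = [(flip(dx, 2), flip(dy, 3)) for dx, dy in steps]
    if (control >> 4) & 1:
        # Assumption: the swap exchanges x and y after the signs are applied.
        first = (first[1], first[0])
        steps = [(dy, dx) for dx, dy in steps]
    return first, steps


print(apply_reflection((7, 3), [(0, 2), (1, 1)], 0b00001))
```

With five control bits, a single stored template can yield up to 32 variants under these assumptions, which is the reuse that the reflection mechanism is intended to provide.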

[0108] Figures 6A-6D show examples of address templates used to facilitate certain read patterns. For example, Figure 6A shows an example of a pseudo-linear type template. This is a pattern that can be used in certain parts of an image analysis algorithm. In this example, the initial storage cell (byte) to be read is given relative to the reference origin 404, here as an (x, y) offset of (7, 3). The next storage cell (byte) to be read is given as an (x, y) offset from the first byte, here (0, 2). Similarly, each consecutive storage cell (byte) to be read is given as an (x, y) offset from the previous byte, here (1, 1), (1, 1), and (1, 2). Figure 6B shows an example of a linear type template. In this example, the initial storage cell (byte) to be read is given relative to the reference origin 404, here as an (x, y) offset of (-7, 5). Each subsequent storage cell (byte) to be read is given as an x offset from the previously read byte, here 2, 2, 2, 2, 2, 2, 2.
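
Because each offset is relative to the previous byte, the positions relative to the origin are a running sum. The following sketch works through the values quoted for Figures 6A and 6B. The helper name `template_positions` is hypothetical.

```python
from itertools import accumulate


def template_positions(first, steps):
    """Return the (x, y) position of every byte relative to the origin, in read order."""
    def add(point, step):
        return (point[0] + step[0], point[1] + step[1])
    return list(accumulate(steps, add, initial=first))


fig_6a = template_positions((7, 3), [(0, 2), (1, 1), (1, 1), (1, 2)])
print(fig_6a)  # [(7, 3), (7, 5), (8, 6), (9, 7), (10, 9)]

fig_6b = template_positions((-7, 5), [(2, 0)] * 7)
print(fig_6b[-1])  # (7, 5)
```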

[0109] Similarly, Figures 6C and 6D show examples of linear type templates. In Figure 6C, the initial storage cell (byte) to be read is given as an (x, y) offset of (1, 0) from the origin 404, and each subsequent storage cell (byte) to be read is given an x offset of 1, 1, 1, 1, 1, 1, 1. In Figure 6D, the initial storage cell (byte) to be read is given as an (x, y) offset of (0, -1) from the origin 404, and each subsequent storage cell (byte) to be read is given a y offset of -1, -1, -1, -1, -1, -1, -1, and -1. The template of Figure 6D can also be derived from the template of Figure 6C using reflection control.

[0110] Figures 7A-7H show examples of address templates used to facilitate certain read patterns. Specifically, Figures 7A-7H each show the same pseudo-linear type template, differing only in the sign inversions applied to the x and y offsets by reflection control. For example, inverting the polarity of Δx or Δy for the first byte reflects the pattern about the origin, and inverting the polarity of Δx or Δy for subsequent bytes reflects the pattern about the first byte. By swapping the definitions of x and y, this same template produces eight more configurations, which are identical to those shown in Figures 7A-7H but rotated by 90 degrees.

[0111] As can be seen, by using reflection control, templates can be reused, which significantly reduces the number of templates that the processing element must store in local memory (e.g., SRAM).

[0112] Figure 8 shows a flowchart according to an embodiment. Process 800 is a method for accessing a cache according to any of the embodiments disclosed herein. The method may begin at step s802.

[0113] Step s802 includes initializing a first plurality of memory cells with a remote memory representing a two-dimensional data structure.

[0114] Step s804 includes accessing one or more storage cells within a first plurality of storage cells via row and / or column addressing units with a virtual address indicating a portion of a two-dimensional data structure represented by the contents of the respective storage cell.

[0115] In some embodiments, the method further includes translating a virtual address indicating a portion of a two-dimensional data structure into a physical address indicating a corresponding storage cell (step s806). In some embodiments, the method further includes generating a read control signal and sending the read control signal to row and / or column addressing units to read the contents of the corresponding storage cell (step s808).

[0116] In some implementations, accessing one or more storage cells within a first plurality of storage cells via row and / or column addressing units with virtual addresses (the virtual addresses indicating a portion of a two-dimensional data structure represented by the contents of the respective storage cells) includes: decoding an address template having a plurality of virtual addresses; and forming an operand vector with the contents of the storage cells corresponding to each of the plurality of virtual addresses.
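
The gathering of an operand vector from decoded virtual addresses can be sketched in software as follows. This is only a functional model under stated assumptions: the two-dimensional data structure is a plain list of rows, y grows with the row index, and offsets are already resolved relative to the virtual origin. It does not model the multiplexer hardware, and the name `read_operand_vector` is hypothetical.

```python
def read_operand_vector(data, origin, offsets):
    """Gather one byte per offset from a row-major two-dimensional structure.

    data    -- list of rows standing in for the cached two-dimensional structure
    origin  -- (x, y) position of the virtual origin inside `data`
    offsets -- (x, y) offsets from the origin, as produced by template decoding
    """
    ox, oy = origin
    vector = []
    for dx, dy in offsets:
        x, y = ox + dx, oy + dy
        if not (0 <= y < len(data) and 0 <= x < len(data[0])):
            raise IndexError(f"offset ({dx}, {dy}) falls outside the structure")
        vector.append(data[y][x])
    return vector


# A 16 x 16 test pattern in which each byte encodes its own position.
image = [[16 * y + x for x in range(16)] for y in range(16)]
print(read_operand_vector(image, (4, 4), [(1, 0), (2, 0), (2, 1)]))  # [69, 70, 86]
```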

[0117] In some embodiments, the method further includes maintaining an operand region having a virtual origin, wherein the operand region includes storage units representing a portion of a two-dimensional data structure (step s810). In some embodiments, the method further includes moving the virtual origin and the operand region associated with the virtual origin; and initializing a second plurality of storage units with remote memory representing the two-dimensional data structure such that, in response to moving the virtual origin and the operand region associated with the virtual origin, the second plurality of storage units represent a portion of the two-dimensional data structure (step s812).

[0118] In some implementations, initializing the second plurality of storage units with a remote memory representing the two-dimensional data structure, such that, in response to moving the virtual origin and the operand region associated with the virtual origin, the second plurality of storage units represent a portion of the two-dimensional data structure, includes one of the following: (1) in response to moving the virtual origin and the operand region associated with the virtual origin to the right, replacing the previous leftmost column of storage units with a new rightmost column of storage units, and reassigning the virtual address of the new column to be the virtual address of the previous rightmost column plus the width of a single storage unit; (2) in response to moving the virtual origin and the operand region associated with the virtual origin to the left, replacing the previous rightmost column of storage units with a new leftmost column of storage units, and reassigning the virtual address of the new column to be the virtual address of the previous rightmost column minus the width of a single storage unit; (3) in response to moving the virtual origin and the operand region associated with the virtual origin upwards, replacing the previous bottom row of storage units with a new top row of storage units, and reassigning the virtual address of the new row to be the virtual address of the previous top row plus the height of a single storage unit; and (4) in response to moving the virtual origin and the operand region associated with the virtual origin downwards, replacing the previous top row of storage units with a new bottom row of storage units, and reassigning the virtual address of the new row to be the virtual address of the previous bottom row minus the height of a single storage unit.
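
The rightward move in case (1) amounts to recycling a physical column under a new virtual address. The sketch below models only that case, under the assumptions that virtual addresses are plain integers and that the refill from remote memory is left to the caller. The class name `ColumnWindow` is hypothetical.

```python
from collections import deque


class ColumnWindow:
    """Tracks (physical column, virtual x address) pairs, leftmost column first."""

    def __init__(self, q, unit_width, start=0):
        self.unit_width = unit_width
        self.columns = deque((i, start + i * unit_width) for i in range(q))

    def move_right(self):
        """Reuse the leftmost physical column as the new rightmost column."""
        _, rightmost_address = self.columns[-1]
        physical, _ = self.columns.popleft()
        # New virtual address = previous rightmost address + width of one unit.
        new_address = rightmost_address + self.unit_width
        self.columns.append((physical, new_address))
        return physical, new_address  # the caller would refill this column


window = ColumnWindow(q=8, unit_width=4)
print(window.move_right())  # (0, 32)
```

No data is copied between columns in this model; only the mapping changes, so the other columns keep their contents and virtual addresses.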

[0119] In some implementations, only a subset of the storage cell array is used to store data corresponding to the two-dimensional data structure, as part of processing the two-dimensional data structure, while the remainder of the storage cell array is used for temporary register space. In some implementations, the two-dimensional data structure includes image data. In some implementations, the two-dimensional data structure includes a matrix.

[0120] In some implementations, cache 100 may be implemented in a larger system, such as in device 900. Cache 100 and / or device 900 may be part of, or configured to operate with, one or more of a general-purpose computer, CPU, graphics processing unit (GPU), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or any other type of computer hardware component. The term "cache" may be used to refer only to cache 100, or, more broadly, to device 900 including cache 100, depending on the context in which the term is used.

[0121] Figure 9 is a block diagram of device 900 according to some implementations. As shown in Figure 9, device 900 may include: cache 100, vector processor 902, network interface 904, load / store unit 906, remote memory 908, and image capture interface 910. Vector processor 902 may communicate with cache 100, for example, by reading data from or writing data to cache 100. Vector processor 902 or a similar parallel processing entity may send a read instruction to cache 100 using an address template, and cache 100 may then send a result (e.g., one or more operands) to vector processor 902 in response. For example, vector processor 902 may include control and decoding circuitry 903 (shown as part of vector processor 902, but it may also be separate from vector processor 902). Control and decoding circuitry 903 processes the address template by decoding it and forming appropriate read control signals to send to cache 100, thereby forming an operand vector having each operand specified in the address template and providing it to vector processor 902. As shown, the address template may be an input to control and decoding circuitry 903 (e.g., received from vector processor 902). Vector processor 902 can also read from cache 100 using physical addresses, for example, when accessing data from a scratchpad space separate from the virtual canvas in cache 100. In addition to reading, vector processor 902 can also write to cache 100, for example, by writing to scratchpad space or the virtual canvas. Such writing may include intermediate data or may include writing the result of performing vector operations or other processing on one or more operands read from cache 100. Although not shown, vector processor 902 may also be coupled to other components of device 900, including other types of caches (e.g., L1 or L2 caches), register files, buses, or peripheral devices.
Vector processor 902 can communicate with other components or systems (including other components of device 900 or components not part of device 900) via network interface 904.

[0122] Load / store unit 906 is coupled to cache 100 and can be used to fill or populate the contents of cache 100. For example, load / store unit 906 can access remote memory 908 (such as image memory) to fill or populate the contents of cache 100. Remote memory 908 can be any type of memory and can be coupled to other components, such as image capture interface 910, which can capture images and digitally store them into remote memory 908. In the context of a vector processor, load / store unit 906 is sometimes referred to as a load / store vector. Load / store unit 906 is responsible for executing load and store instructions.

[0123] Brief description of various embodiments

[0124] A1. A cache, comprising:

[0125] a p (rows) × q (columns) array of storage units;

[0126] a row addressing unit; and

[0127] a column addressing unit;

[0128] wherein each storage unit has an m (rows) × n (columns) array of storage cells;

[0129] the column addressing unit has m n-to-1 multiplexers for each storage unit, with one n-to-1 multiplexer associated with each of the m rows of the storage unit, wherein each n-to-1 multiplexer has an input coupled to each of the n storage cells in the row associated with that multiplexer;

[0130] the row addressing unit has n m-to-1 multiplexers for each storage unit, with one m-to-1 multiplexer associated with each of the n columns of the storage unit, wherein each m-to-1 multiplexer has an input coupled to each of the m storage cells in the column associated with that multiplexer; and

[0131] the row addressing unit and the column addressing unit support reading and / or writing of the array of storage units, enabling multiple rows and / or columns of the array of storage units to be read and / or written in parallel.

[0132] A2. The cache according to embodiment A1, wherein m = n = 4, and each storage cell includes one byte, such that each storage unit includes 16 bytes, and wherein p = q = 8, such that the storage unit array includes 1024 bytes.

[0133] A3. The cache according to any one of embodiments A1-A2, wherein the row addressing unit and column addressing unit support reading and / or writing of multiple rows and / or columns of storage cells of one or more storage units in a single clock cycle.

[0134] A4. The cache according to any one of embodiments A1-A3, wherein the row addressing unit is capable of addressing up to p*m rows of storage cells across one or more storage cell arrays and reading any cell in each of the p*m rows, wherein no two such cells are in the same row.

[0135] A5. The cache according to any one of embodiments A1-A4, wherein the column addressing unit is capable of addressing up to q*n columns of storage cells across one or more of the storage cell array and reading any cell in each of the q*n columns, wherein no two such cells are in the same column.

[0136] A6. A cache according to any one of embodiments A1-A5, wherein:

[0137] for each storage unit not in the first row of the array of storage units, the row addressing unit also has a 2-to-1 multiplexer having inputs coupled to the output of the n-to-1 multiplexer associated with each column of the storage unit and to the output of the n-to-1 multiplexer associated with the storage unit in the previous row; and

[0138] for each storage unit not in the first column of the array of storage units, the column addressing unit also has a 2-to-1 multiplexer having inputs coupled to the output of the m-to-1 multiplexer associated with each row of the storage unit and to the output of the m-to-1 multiplexer associated with the storage unit in the previous column.

[0139] A7. The cache according to any one of embodiments A1-A6, wherein the row addressing unit and the column addressing unit each support reading the storage cells of the storage cell array, and wherein the row addressing unit supports writing to the storage cells of the storage cell array.

[0140] A8. The cache according to embodiment A7, wherein only the row addressing unit supports writing to the storage cells of the storage cell array, such that the column addressing unit does not support writing to the storage cells of the storage cell array.

[0141] A9. The cache according to any one of embodiments A1-A8, wherein a storage unit in the p×q array of storage units represents the smallest entity that can be represented by a virtual address.

[0142] A10. The cache according to any one of embodiments A1-A9, wherein, for each storage unit in the p×q array of storage units, each storage cell within the storage unit is the smallest addressable quantity of data in the cache and has only a physical address within the storage unit.

[0143] A11. The cache according to any one of embodiments A1-A10, wherein the row addressing unit has a separate address for each of the q*n columns, and the column addressing unit has a separate address for each of the p*m rows, such that the row and column addressing units support simultaneous reading and / or writing of up to p*m storage cells from different rows and up to q*n storage cells from different columns within the array of storage units and the array of storage cells within each storage unit.

[0144] A12. The cache according to any one of embodiments A1-A11 further includes: a load / store unit capable of filling some or all of the storage cells from a remote memory representing a two-dimensional data structure; and a control and decoding circuit capable of converting a virtual address representing a portion of the two-dimensional data structure represented by the remote memory into control signals for directing the row and column addressing units to access specific storage cells.

[0145] A13. The cache according to embodiment A12, wherein the control and decoding circuitry maintains an operand region having a virtual origin, such that the virtual origin is used as a reference point for an address template, the address template including a plurality of virtual addresses for the remote memory, and wherein the control and decoding circuitry is also capable of decoding the address template to determine the plurality of virtual addresses.

[0146] A14. The cache according to embodiment A13, wherein the control and decoding circuitry is also capable of manipulating the virtual origin and instructing the load / store unit to initialize and / or update the storage cell by reading data from the remote memory when manipulating the virtual origin.

[0147] B1. A method for accessing a cache according to any one of embodiments A1-A14, the method comprising:

[0148] initializing a first plurality of storage units with a remote memory representing a two-dimensional data structure; and

[0149] accessing one or more storage cells within the first plurality of storage units via the row and / or column addressing units using virtual addresses, the virtual addresses indicating a portion of the two-dimensional data structure represented by the contents of the respective storage cell.

[0150] B2. The method according to embodiment B1 further includes converting the virtual address indicating a portion of the two-dimensional data structure into a physical address indicating the corresponding storage cell.

[0151] B3. The method according to embodiment B2 further includes forming a read control signal and sending the read control signal to the row addressing unit and / or column addressing unit to read the contents of the corresponding storage cell.

[0152] B4. The method according to any one of embodiments B1-B3, wherein accessing one or more storage cells within the first plurality of storage units via the row and / or column addressing units using virtual addresses, the virtual addresses indicating a portion of the two-dimensional data structure represented by the contents of the respective storage cell, comprises:

[0153] decoding an address template having a plurality of virtual addresses; and

[0154] forming an operand vector with the contents of the storage cells corresponding to each of the plurality of virtual addresses.

[0155] B5. The method according to any one of embodiments B1-B4 further includes:

[0156] maintaining an operand region having a virtual origin, wherein the operand region contains storage units representing a portion of the two-dimensional data structure.

[0157] B6. The method according to embodiment B5 further includes:

[0158] moving the virtual origin and the operand region associated with the virtual origin; and

[0159] initializing a second plurality of storage units with a remote memory representing the two-dimensional data structure, such that, in response to moving the virtual origin and the operand region associated with the virtual origin, the second plurality of storage units represent a portion of the two-dimensional data structure.

[0160] B7. The method according to embodiment B6, wherein initializing the second plurality of storage units with a remote memory representing the two-dimensional data structure, such that, in response to moving the virtual origin and the operand region associated with the virtual origin, the second plurality of storage units represent a portion of the two-dimensional data structure, includes one of the following:

[0161] (1) in response to moving the virtual origin and the operand region associated with the virtual origin to the right, replacing the previous leftmost column of storage units with a new rightmost column of storage units, and reassigning the virtual address of the new column to be the virtual address of the previous rightmost column plus the width of a single storage unit;

[0162] (2) in response to moving the virtual origin and the operand region associated with the virtual origin to the left, replacing the previous rightmost column of storage units with a new leftmost column of storage units, and reassigning the virtual address of the new column to be the virtual address of the previous rightmost column minus the width of a single storage unit;

[0163] (3) in response to moving the virtual origin and the operand region associated with the virtual origin upwards, replacing the previous bottom row of storage units with a new top row of storage units, and reassigning the virtual address of the new row to be the virtual address of the previous top row plus the height of a single storage unit; and

[0164] (4) in response to moving the virtual origin and the operand region associated with the virtual origin downwards, replacing the previous top row of storage units with a new bottom row of storage units, and reassigning the virtual address of the new row to be the virtual address of the previous bottom row minus the height of a single storage unit.

[0165] B8. The method according to any one of embodiments B1-B7, wherein only a subset of the storage cell array is used to store data corresponding to the two-dimensional data structure as part of processing the two-dimensional data structure, and the remainder of the storage cell array is used for temporary storage space.

[0166] B9. The method according to any one of embodiments B1-B8, wherein the two-dimensional data structure includes image data.

[0167] B10. The method according to any one of embodiments B1-B8, wherein the two-dimensional data structure includes a matrix.

[0168] C1. A computer program comprising instructions that, when executed by a processing circuit system, cause the processing circuit system to perform the method according to any one of embodiments B1-B10.

[0169] C2. A carrier comprising the computer program described in embodiment C1, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer-readable storage medium.

[0170] D1. A device comprising a cache according to any one of embodiments A1-A14, wherein the device is one of a general-purpose computer, a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), and a field-programmable gate array (FPGA).
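
The sizes implied by embodiments A1, A2, A4, and A5 can be checked with simple arithmetic. The sketch below counts bytes and first-level multiplexers for given p, q, m, and n. It does not count the 2-to-1 multiplexers of embodiment A6, and the function name `cache_geometry` is hypothetical.

```python
def cache_geometry(p, q, m, n, bytes_per_cell=1):
    """Derive storage and multiplexer counts from the array dimensions."""
    units = p * q
    return {
        "bytes_per_unit": m * n * bytes_per_cell,
        "total_bytes": units * m * n * bytes_per_cell,
        "n_to_1_muxes": units * m,    # one per row of every storage unit
        "m_to_1_muxes": units * n,    # one per column of every storage unit
        "addressable_rows": p * m,
        "addressable_columns": q * n,
    }


print(cache_geometry(p=8, q=8, m=4, n=4))
```

For m = n = 4 and p = q = 8 this gives 16 bytes per storage unit and 1024 bytes in total, matching embodiment A2, along with 32 addressable rows and 32 addressable columns.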

[0171] While various embodiments of this disclosure have been described herein, it should be understood that these embodiments are presented by way of example only and not as a limitation. Therefore, the breadth and scope of this disclosure should not be limited to any of the exemplary embodiments described above. Furthermore, any combination of the foregoing elements in all possible variations is included within this disclosure unless otherwise indicated herein or explicitly stated otherwise by context.

[0172] Furthermore, although the process described above and illustrated in the figures is presented as a series of steps, this is merely for illustrative purposes. Therefore, it is contemplated that steps can be added, steps can be omitted, the order of steps can be rearranged, and some steps can be performed in parallel.

Claims

1. A cache (100), comprising: a p (rows) × q (columns) array of storage units (102); a row addressing unit; and a column addressing unit; wherein each storage unit (102) has an m (rows) × n (columns) array of storage cells (B0 to B15); the column addressing unit has m n-to-1 multiplexers (104) for each storage unit (102), and each of the m rows of the storage unit (102) is associated with one n-to-1 multiplexer (104), wherein each n-to-1 multiplexer (104) has an input coupled to each of the n storage cells (B0 to B15) in the row associated with that multiplexer (104); the row addressing unit has n m-to-1 multiplexers (106) for each storage unit (102), and each of the n columns of the storage unit (102) is associated with one m-to-1 multiplexer (106), wherein each m-to-1 multiplexer (106) has an input coupled to each of the m storage cells (B0 to B15) in the column associated with that multiplexer (106); and the row addressing unit and the column addressing unit support reading and / or writing of the array of storage units (102), enabling multiple rows and / or columns of the array of storage units (102) to be read and / or written in parallel.

2. The cache according to claim 1, wherein m = n = 4, and each storage cell (B0 to B15) includes one byte, such that each storage unit (102) includes 16 bytes, and wherein p = q = 8, such that the array of storage units (102) includes 1024 bytes.

3. The cache according to claim 1, wherein the row addressing unit and the column addressing unit support reading and / or writing of multiple rows and / or columns of storage cells (B0 to B15) of one or more storage units (102) in a single clock cycle.

4. The cache according to claim 1, wherein, The row addressing unit is capable of addressing up to p*m rows of storage cells (B0 to B15) across one or more of the storage cell (102) array and reading any cell in each of the p*m rows, wherein no two such cells are in the same row.

5. The cache according to claim 1, wherein, The column addressing unit is capable of addressing up to q*n columns of storage cells (B0 to B15) across one or more of the storage cell (102) array and reading any cell in each of the q*n columns, wherein no two such cells are in the same column.

6. The cache according to any one of claims 1-5, wherein: for each storage unit (102) not in the first row of the array of storage units (102), the row addressing unit also has a two-to-one multiplexer having inputs coupled to the output of the n-to-1 multiplexer (104) associated with each column of the storage unit (102) and to the output of the n-to-1 multiplexer (104) associated with the storage unit (102) in the previous row; and for each storage unit (102) not in the first column of the array of storage units (102), the column addressing unit also has a two-to-one multiplexer having inputs coupled to the output of the m-to-1 multiplexer (106) associated with each row of the storage unit (102) and to the output of the m-to-1 multiplexer (106) associated with the storage unit (102) in the previous column.

7. The cache according to any one of claims 1-5, wherein, The row addressing unit and the column addressing unit each support reading the storage cells (B0 to B15) of the storage cell (102) array, and wherein the row addressing unit supports writing to the storage cells (B0 to B15) of the storage cell (102) array.

8. The cache according to claim 7, wherein only the row addressing unit supports writing to the storage cells (B0 to B15) of the storage cell (102) array, such that the column addressing unit does not support writing to the storage cells (B0 to B15) of the storage cell (102) array.

9. The cache according to any one of claims 1-5, wherein the storage cell (102) in the p×q storage cell (102) array represents the smallest entity that can be represented by a virtual address.

10. The cache according to any one of claims 1-5, such that for each storage cell in the p×q storage cell (102) array, each storage cell (B0 to B15) within the storage cell (102) is the minimum addressable data amount in the cache (100) and has only the physical address within the storage cell (102).

11. The cache according to any one of claims 1-5, wherein the row addressing unit has a separate address for each of the q*n columns, and the column addressing unit has a separate address for each of the p*m rows, such that the row addressing unit and the column addressing unit support simultaneous reading and / or writing of up to p*m storage cells (B0 to B15) from different rows and up to q*n storage cells (B0 to B15) from different columns within the array of storage cells (102) and the array of storage cells (B0 to B15) within each storage cell (102).

12. The cache according to any one of claims 1-5, further comprising: a load / store unit (906) capable of filling some or all of the storage cells (B0 to B15) from a remote memory representing a two-dimensional data structure; and control and decoding circuitry (903) capable of converting a virtual address representing a portion of the two-dimensional data structure represented by the remote memory into control signals for directing the row addressing unit and the column addressing unit to access specific storage cells (B0 to B15).

13. The cache of claim 12, wherein the control and decoding circuit (903) maintains an operand region (402) having a virtual origin (404) such that the virtual origin (404) serves as a reference point for an address template including a plurality of virtual addresses for the remote memory, and wherein the control and decoding circuit (903) is also capable of decoding the address template to determine the plurality of virtual addresses.

14. The cache according to claim 13, wherein, The control and decoding circuit (903) is also capable of manipulating the virtual origin (404) and instructing the load / store unit (906) to initialize and / or update the storage cells (B0 to B15) by reading data from the remote memory when manipulating the virtual origin (404).

15. A method for accessing a cache (100) according to claim 1, the method comprising: initializing a first plurality of storage units (102) with a remote memory representing a two-dimensional data structure; and accessing one or more storage cells (B0 to B15) within the first plurality of storage units (102) via the row addressing unit and / or the column addressing unit using virtual addresses, the virtual addresses indicating a portion of the two-dimensional data structure (1010) represented by the contents of the respective storage cells (B0 to B15).

16. The method of claim 15, further comprising converting the virtual address indicating a portion of the two-dimensional data structure (1010) into a physical address indicating the corresponding storage cell (B0 to B15).

17. The method of claim 16, further comprising forming a read control signal and sending the read control signal to the row addressing unit and / or column addressing unit to read the contents of the corresponding storage cell (B0 to B15).

18. The method according to any one of claims 15-17, wherein accessing one or more storage cells (B0 to B15) within the first plurality of storage units (102) via the row addressing unit and / or the column addressing unit using virtual addresses, the virtual addresses indicating a portion of the two-dimensional data structure (1010) represented by the contents of the respective storage cells (B0 to B15), comprises: decoding an address template having a plurality of virtual addresses; and forming an operand vector with the contents of the storage cells (B0 to B15) corresponding to each of the plurality of virtual addresses.

19. The method according to any one of claims 15-17, further comprising: maintaining an operand region (402) having a virtual origin (404), wherein the operand region (402) contains storage units (102) representing a portion of the two-dimensional data structure (1010).

20. The method of claim 19, further comprising: moving the virtual origin (404) and the operand region (402) associated with the virtual origin (404); and initializing a second plurality of storage units (102) with a remote memory representing the two-dimensional data structure (1010) such that, in response to moving the virtual origin (404) and the operand region (402) associated with the virtual origin (404), the second plurality of storage units (102) represent a portion of the two-dimensional data structure (1010).

21. The method of claim 20, wherein initializing the second plurality of memory cells (102) from a remote memory representing the two-dimensional data structure (1010) such that, in response to moving the virtual origin (404) and the operand region (402) associated with the virtual origin (404), the second plurality of memory cells (102) represent a portion of the two-dimensional data structure (1010), includes one of the following:
(1) in response to moving the virtual origin (404) and the operand region (402) associated with the virtual origin (404) to the right, replacing the previous leftmost column of storage cells (102) with a new rightmost column of storage cells (102), and reassigning the virtual address of the new column as the sum of the previous rightmost column virtual address and the width of a single storage cell (102);
(2) in response to moving the virtual origin (404) and the operand region (402) associated with the virtual origin (404) to the left, replacing the previous rightmost column of storage cells (102) with a new leftmost column of storage cells (102), and reassigning the virtual address of the new column as the previous rightmost column virtual address minus the width of a single storage cell (102);
(3) in response to moving the virtual origin (404) and the operand region (402) associated with the virtual origin (404) upward, replacing the previous bottom row of storage units (102) with a new top row of storage units (102), and reassigning the virtual address of the new row as the sum of the previous top row virtual address and the height of a single storage unit (102); and
(4) in response to moving the virtual origin (404) and the operand region (402) associated with the virtual origin (404) downward, replacing the previous top row of storage cells (102) with a new bottom row of storage cells (102), and reassigning the virtual address of the new row as the previous bottom row virtual address minus the height of a single storage cell (102).
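
The column cases (1) and (2) can be sketched as a small bookkeeping model; rows in cases (3) and (4) would follow the same pattern with heights in place of widths. The class `ColumnWindow` and its methods are hypothetical. One assumption should be noted: for a leftward move the sketch subtracts one cell width from the previous leftmost column's address, which keeps the addresses contiguous, whereas item (2) as worded refers to the previous rightmost column.

```python
class ColumnWindow:
    """Tracks which virtual x address each physical column currently holds."""

    def __init__(self, num_columns, unit_width, origin=0):
        self.unit_width = unit_width
        # virtual_address[i] is the virtual x address held by physical column i.
        self.virtual_address = [origin + i * unit_width for i in range(num_columns)]

    def _index_of(self, pick):
        # Physical column whose virtual address is the min or max, per `pick`.
        target = pick(self.virtual_address)
        return self.virtual_address.index(target)

    def move_right(self):
        """Recycle the leftmost column as the new rightmost column."""
        recycled = self._index_of(min)
        self.virtual_address[recycled] = max(self.virtual_address) + self.unit_width
        return recycled  # physical column to refill from remote memory

    def move_left(self):
        """Recycle the rightmost column as the new leftmost column (assumed rule)."""
        recycled = self._index_of(max)
        self.virtual_address[recycled] = min(self.virtual_address) - self.unit_width
        return recycled  # physical column to refill from remote memory

window = ColumnWindow(num_columns=4, unit_width=4)
first = window.move_right()
after_first = list(window.virtual_address)
second = window.move_right()
after_second = list(window.virtual_address)
third = window.move_left()
after_third = list(window.virtual_address)
```

Only the recycled column needs to be reloaded from remote memory on each move; the other columns keep both their contents and their virtual addresses.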

22. The method according to any one of claims 15-17, wherein only a subset of the array of storage cells (102) is used to store data corresponding to the two-dimensional data structure (1010) as part of processing the two-dimensional data structure (1010), and the remainder of the array of storage cells (102) is used as temporary storage space.

23. The method according to any one of claims 15-17, wherein the two-dimensional data structure (1010) includes image data.

24. The method according to any one of claims 15-17, wherein the two-dimensional data structure (1010) comprises a matrix.