Method and system for altering data movement path of large memory transactions
By detecting the memory transaction type and block size, and selecting the data movement path, the problem of large memset and memcopy operations polluting the processor cache and consuming power is solved, achieving more efficient data movement and improved processor performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QUALCOMM INC
- Filing Date
- 2024-03-11
- Publication Date
- 2026-06-30
AI Technical Summary
Large memset and memcopy operations pollute the processor's L1 to L3 caches, increase power consumption, and cause the processor silicon to age faster.
By detecting the type and block size of memory transactions, the system selects a path to change data movement, avoiding involvement of L1 to L3 caches and interconnect devices, and moving data directly between the last-level cache and system memory.
It reduces cache contamination, lowers power consumption, and slows down processor silicon aging.
Figure CN120917435B_ABST
Abstract
Description
[0001] Description of the Related Art
[0002] Computing devices may include multiple processor-based subsystems. Such computing devices may be, for example, portable computing devices (“PCDs”), such as laptops or handheld computers, cellular phones or smartphones, portable digital assistants, portable game consoles, server processors, etc. Other types of PCDs may be included in applications such as autonomous vehicle systems and the Internet of Things (“IoT”).
[0003] Multiple processor-based subsystems may be included within the same integrated circuit chip or on different chips. A “system-on-a-chip” (SoC) is an example of such a chip, integrating numerous components to provide system-level functionality. For example, an SoC may include one or more types of processors, such as a central processing unit (“CPU”), a graphics processing unit (“GPU”), a digital signal processor (“DSP”), and a neural processing unit (“NPU”). An SoC may include other subsystems, such as a transceiver or “modem” subsystem providing wireless connectivity, a memory subsystem, etc.
[0004] SoC processors, such as CPUs and GPUs, utilize on-chip memory residing within the processor chip (such as core registers and L1 to L3 cache memories), as well as other types of memory external to the processor (such as last-level cache (LLC) and dynamic random-access memory (DRAM)). DRAM and LLC are typically shared resources of the SoC, utilized by multiple processors within the SoC. All these different types of memory devices constitute the memory hierarchy used by the SoC's processors.
[0005] Memory set (memset) and memory copy (memcopy) operations invoked through an application programming interface (API) of the SoC actively involve all levels of the memory hierarchy. A memset operation sets a block of memory addresses to a specific value. A memcopy operation copies a block of data from one set of addresses in memory to another set of addresses in memory. Large memset and memcopy operations have undesirable effects, including contamination of the processor's L1 to L3 caches (which adversely affects the processor's performance), power consumption due to data movement during cache memory transactions, and accelerated silicon aging of the processor due to redundant transactions in the processor core domain. Solutions are needed to mitigate the undesirable effects caused by these large memory transactions.
Summary of the Invention
[0006] Systems, methods, and other examples for performing memory transactions in a manner that reduces data movement within the memory hierarchy are disclosed.
[0007] An exemplary method for performing a memory transaction in a manner that reduces data movement in the memory hierarchy includes: determining whether the type of a memory transaction being queued in one or more cores of a processor for execution by the processor is one of the preselected types, for which data path change is an option. The method may further include: determining whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1. The method may further include: if it is determined that the memory transaction type is one of the preselected types and the memory block size exceeds the S_TH1 value, selecting a modified data movement path for performing the memory transaction. The selected modified path reduces the amount of data movement compared to an unchanged data movement path used to perform a memory transaction that does not belong to one of the preselected types. The method may further include: causing the memory transaction to be performed using the modified data movement path.
[0008] An exemplary embodiment of a system for reducing data movement when performing memory transactions in a memory hierarchy includes: a processor including logic configured to: determine whether a memory transaction being queued in one or more cores of the processor for execution by the processor is a preselected type of a plurality of preselected memory transaction types, and determine whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1. The processor may further include logic configured to: output the memory transaction type, the memory block size, and one or more system memory addresses associated with the memory transaction if it is determined that the memory transaction type is a preselected type and the memory block size exceeds the S_TH1 value.
[0009] The system may further include: an LLC controller for the memory hierarchy, electrically coupled to the processor via an interconnect device of the memory hierarchy. The LLC controller receives, via the interconnect device, the memory transaction type, the memory block size, and one or more system memory addresses associated with the memory transaction, output by the processor. The LLC controller includes LLC memory and a last-level coprocessor (LCP). The LCP includes logic configured to: select a modified data movement path for executing the transaction, the modified data movement path reducing the amount of data movement compared to an unchanged data movement path for executing memory transactions not belonging to these pre-selected types. The logic of the LCP is further configured to cause the memory transaction to be executed using the modified data movement path.
[0010] An exemplary embodiment of a non-transitory computer-readable medium includes computer instructions for execution by a processor and by an LLC controller of a memory hierarchy to reduce data movement during memory transactions. These computer instructions include: a first set of computer instructions for determining whether the type of a memory transaction being queued in one or more cores of the processor for execution by the processor is one of a plurality of preselected types, for which data path modification is an option. These computer instructions may further include: a second set of computer instructions for determining whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1. The system memory is part of the memory hierarchy.
[0011] These computer instructions may further include a third set of computer instructions, which are used to forward the memory transaction type, the size of the memory block, and one or more system memory addresses associated with the memory transaction to the LLC controller of the memory hierarchy when the processor determines, while executing the first set of computer instructions and the second set of computer instructions, that the memory transaction type is one of the preselected types and the memory block size exceeds the S_TH1 value.
[0012] These computer instructions may further include: a fourth set of computer instructions, which are configured to receive, in the LLC controller via the interconnects of the memory hierarchy, the transaction type, the size of the memory block, and the one or more system memory addresses, and to select, relative to an unchanged data movement path for executing a memory transaction that does not belong to these pre-selected types, a modified data movement path for executing the transaction. These computer instructions may further include: a fifth set of computer instructions, which are executed by the LLC controller to cause the memory transaction to be executed using the modified data movement path.
[0013] These and other features and advantages will become apparent from the following description, drawings, and claims.
Brief Description of the Drawings
[0014] In the accompanying drawings, unless otherwise indicated, like reference numerals refer to like parts throughout the various views. For reference numerals with letter character designations, such as "101a" or "101b", the letter character designations distinguish two similar parts or elements present in the same figure. The letter character designations may be omitted when a reference numeral is intended to cover all parts having the same reference numeral in all figures.
[0015] Figure 1 is a block diagram of an example of a memory hierarchy currently used by the CPU of an SoC, which includes the CPU's core with registers, L1 to L3 cache memory devices, external system memory, and an interconnect structure that interconnects the CPU with the system memory.
[0016] Figure 2 is a block diagram of an example of a memory hierarchy according to a representative embodiment, the memory hierarchy including the components of the memory hierarchy shown in Figure 1, modified to reduce the aforementioned undesirable effects caused by large memset and memcopy operations.
[0017] Figure 3 is a flowchart of a method, performed by the combination of the aforementioned logic circuitry of the core with registers and the LLC controller shown in Figure 2, for selecting a data movement path that minimizes the aforementioned undesirable effects when large memset and memcopy transactions are being executed.
[0018] Figure 4 is a flowchart of the process, represented by block 310 of Figure 3, used to select a data movement path for a large memory transaction that is being queued, according to an exemplary embodiment.
[0019] Figure 5 is a block diagram of the core 201 with registers shown in Figure 2, which includes branch predictor logic and loop stream decoder logic for detecting the type of instructions being queued and the size of memory blocks associated with memory transactions.
[0020] Figure 6 is a block diagram of the interconnect device, system memory, LLC, and LLC controller shown in Figure 2, the LLC controller including logic components configured to perform the operations described by the flowchart of Figure 4 to select a data path based on the size of the memory block and write some or all of the associated data to the LLC.
[0021] Figure 7 illustrates an example of a PCD in which exemplary embodiments of systems, methods, computer-readable media, and other examples according to this disclosure may be implemented.
Detailed Description
[0022] This disclosure describes systems and methods for reducing data movement in a memory hierarchy when performing large memory transactions. For certain preselected types of large memory transactions, such as memset and memcopy operations, the processor's logic unit determines whether the type of the memory transaction being queued is one of the preselected large transaction types for which changing the data movement path is an option. The processor's logic unit also determines whether the size of the memory block associated with the transaction is large enough to warrant changing the data movement path. If the type is a preselected type and the memory block size is large enough, the LLC controller's logic unit selects the changed data movement path to reduce data movement. The LLC executes the transaction using the changed path.
[0023] Figure 1 is a block diagram of an example memory hierarchy 100 currently used by the CPU 110 of an SoC. This memory hierarchy includes a core 101 of the CPU 110 with registers, L1 to L3 cache memory devices 102 to 104, respectively, external system memory 120, and an interconnect structure 105 interconnecting the CPU 110 with the system memory 120. The interconnect structure 105 includes an interconnect device 106, an LLC memory device 107, and an LLC controller 108. The external system memory 120 includes a system memory device 121 (e.g., a bank of double data rate (DDR) DRAM), a system memory physical layer (PHY) 122, and a system memory controller 123. In addition to being used by the CPU 110, the system memory 120, LLC memory device 107, and LLC controller 108 are typically also used by other processors of the SoC.
[0024] Currently, the memcopy operation is executed by the CPU 110 core in the following manner: for each word of memory being copied, a read operation is initiated on the source address of the system memory 120, and then a write operation is initiated on the destination address in the system memory 120. The memset operation is executed by the CPU 110 core in the following manner: for each word of memory being set, a write operation is initiated. The read and write operations initiated by the CPU 110 core involve all components of the memory hierarchy 100, namely, the core 101 with registers, L1 to L3 cache memory devices 102 to 104 respectively, LLC memory device 107 / LLC controller 108, system memory device 121, PHY 122, and system memory controller 123.
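The baseline behavior described above can be sketched in a few lines. This is only an illustrative model, not the patent's implementation: the word size `WORD_BYTES`, the function names, and the use of a `bytearray` as a stand-in for system memory are all assumptions made for the example. It shows that a word-by-word memcopy issues one read and one write per word, while a memset issues one write per word, and every one of those transactions would traverse the full hierarchy of Figure 1.

```python
WORD_BYTES = 8  # assumed word size; the source does not specify one


def baseline_memcopy(memory, src, dst, size):
    """Copy `size` bytes word by word; return (reads, writes) issued by the core.

    Assumes the source and destination ranges do not overlap.
    """
    reads = writes = 0
    for offset in range(0, size, WORD_BYTES):
        n = min(WORD_BYTES, size - offset)
        word = bytes(memory[src + offset:src + offset + n])  # read at source address
        reads += 1
        memory[dst + offset:dst + offset + n] = word  # write at destination address
        writes += 1
    return reads, writes


def baseline_memset(memory, dst, value, size):
    """Set `size` bytes to `value` word by word; return the number of writes."""
    writes = 0
    for offset in range(0, size, WORD_BYTES):
        n = min(WORD_BYTES, size - offset)
        memory[dst + offset:dst + offset + n] = bytes([value]) * n
        writes += 1
    return writes


memory = bytearray(64)  # hypothetical 64-byte system memory
memory[0:16] = bytes(range(16))
copy_counts = baseline_memcopy(memory, 0, 32, 16)
set_count = baseline_memset(memory, 48, 0xFF, 10)
```

In this model the transaction count grows linearly with the block size, which is why large blocks are the case the disclosure targets.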
[0025] Large memset and memcopy operations have certain undesirable effects, including: (1) contamination of L1 to L3 cache memory devices 102 to 104, which adversely affects the performance of CPU 110; (2) power consumption due to data movement during cache memory transactions; and (3) accelerated aging of the CPU 110 silicon wafer due to redundant transactions in the core processing logic unit and register 101.
[0026] In this current specific implementation, the total power consumed by the memory hierarchy 100 can be expressed as:
[0027] Total power consumed = Power usage in core / register 101 + Power usage in L1 cache 102 + L2 cache 103 + L3 cache 104 + Power usage in interconnect device 106 + Power usage in LLC memory device 107 + Power usage in LLC controller 108 + Power usage in system memory 120. (Equation 1).
[0028] As will be described below with reference to representative or exemplary embodiments, this disclosure discloses a modified memory hierarchy and method that prevents the entire memory hierarchy from being involved in each memset and memcopy operation, thereby reducing these undesirable effects.
[0029] In the following detailed description, exemplary or representative embodiments disclosing specific details are set forth for purposes of explanation and not limitation, in order to provide a thorough understanding of embodiments according to the present teachings. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” The word “representative” is used herein synonymously with “exemplary.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects. However, it will be apparent to those skilled in the art having the benefit of this disclosure that other embodiments according to the present teachings that depart from the specific details disclosed herein remain within the scope of the appended claims. Furthermore, descriptions of well-known apparatuses and methods may be omitted so as not to obscure the description of the exemplary embodiments. Such methods and apparatuses are clearly within the scope of the present teachings.
[0030] The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting. The defined terms are in addition to their technical and scientific meanings as commonly understood and accepted in the technical field of the present teachings.
[0031] Unless the context clearly indicates otherwise, the terms “a,” “an,” and “the” as used in the specification and appended claims include both singular and plural references. Thus, for example, “an apparatus” includes one apparatus and multiple apparatuses.
[0032] Relative terms are used to describe the relationships between individual elements, as illustrated in the accompanying drawings. In addition to the orientations depicted in the drawings, these relative terms are intended to cover different orientations of the equipment and / or elements.
[0033] It should be understood that when an element is described as being "connected to," "coupled to," or "electrically coupled to" another element, the element may be directly connected or coupled, or there may be intermediate elements present.
[0034] As used herein, the terms "memory" or "memory device" are intended to refer to a non-transitory, computer-readable storage medium capable of storing computer instructions or computer code for execution by one or more processors. References to "memory" or "memory device" herein should be interpreted as one or more memories or memory devices. For example, memory may refer to multiple memories within the same computer system. Memory may also refer to multiple memories distributed across multiple computer systems or computing devices.
[0035] As used herein, the term "processor" encompasses any electronic component capable of executing computer programs or computer instructions. References to a computer including "processor" herein should be interpreted as one or more processors. A processor may be, for example, a multi-core processor comprising multiple processing cores, each of which may include multiple processing stages in a processing pipeline. A processor may also refer to a collection of processors within a single computer system or distributed across multiple computer systems.
[0036] Computing devices may include multiple subsystems, cores, or other components. Such computing devices may be, for example, PCDs, such as laptops or handheld computers, cellular phones or smartphones, portable digital assistants, portable game consoles, and automotive safety systems for autonomous vehicles.
[0037] Figure 2 is a block diagram of an example memory hierarchy 200 according to a representative embodiment, the memory hierarchy including the components of the memory hierarchy 100 shown in Figure 1, modified to reduce the aforementioned undesirable effects caused by large memset and memcopy operations. Figure 2 is the same as Figure 1, except that the circuitry of the core 101 with registers of CPU 110 and the circuitry of the LLC controller 108 shown in Figure 1 have been modified. Specifically, in Figure 2, the core 201 with registers and the LLC controller 208 include logic circuitry that works together to alter data movement paths within the memory hierarchy 200 for large memset and memcopy operations. An exemplary implementation of the logic circuitry of the core 201 with registers and of the LLC controller 208 is described below with reference to Figures 5 and 6.
[0038] Referring to Figure 2, the memory hierarchy 200 includes a CPU 210 with a core 201 having registers and L1 to L3 cache memory devices 102 to 104, an interconnect structure 205 with interconnect device 106, LLC memory device 107, and LLC controller 208, and system memory 120. The interconnect structure 205 interconnects the CPU 210 with LLC 107, LLC controller 208, and system memory 120. System memory 120 includes system memory device 121 (e.g., the memory bank of a DDR DRAM device), PHY 122, and system memory controller 123 (e.g., a DDR DRAM memory controller).
[0039] Figure 3 is a flowchart illustrating a method 300, performed by the combination of the aforementioned logic circuitry of the core 201 with registers and the LLC controller 208, for selecting a data movement path that minimizes the aforementioned undesirable effects when large memset and memcopy transactions are being executed. As indicated by block 301, the logic circuitry of the CPU core 201 determines the type of the memory transaction being queued in the core 201 with registers for execution and the size of the memory block associated with that transaction. As indicated by block 302, the logic circuitry of the core 201 with registers forwards the transaction type and size to the LLC controller 208 via interconnect device 106 (e.g., in the form of metadata). An exemplary implementation of the logic circuitry that performs these tasks is described below with reference to Figure 5.
[0040] The circuitry of LLC controller 208 is configured to receive and process the information identifying the type of memory transaction to be performed in order to determine whether the transaction type is one of a number of preselected large memory transaction types for which a data movement path change will be performed. The preselected types include memset and memcopy transactions, but may include additional large memory transactions. If it is determined at block 303 that the queued memory transaction is not one of the preselected types, the process proceeds to block 306, and the normal data movement path discussed above with reference to Figure 1 is used to execute the transaction.
[0041] If, at block 303, it is determined that the queued memory transaction is one of the preselected types, the process proceeds to block 304, where the logic unit of core 201 with registers determines whether the size of the memory block associated with the queued transaction exceeds a preselected configurable size threshold S_TH. The S_TH value is preferably based on, and preferably equal to, the size of the LLC memory device 107. The S_TH value is preferably updatable in the firmware of the LLC controller 208.
[0042] If it is determined at block 304 that the size of the memory block associated with the transaction does not exceed the S_TH value, the process proceeds to block 306, and LLC controller 208 enables use of the normal data movement path discussed above with reference to Figure 1 to execute the transaction. If, at block 304, the logic unit of core 201 with registers determines that the size of the memory block exceeds the S_TH value, then core 201 with registers forwards the memory block size and transaction type to LLC controller 208, as indicated by block 309. LLC controller 208 then selects a data movement path that reduces data movement during execution of the memory transaction, as indicated by block 310. It should be noted that the processes represented by blocks 303 and 304 can be performed in reverse order or simultaneously.
[0043] According to the preferred embodiment, the selected data movement path does not include interconnect device 106 and CPU 210, as will be described in more detail below with reference to Figures 5 and 6. Excluding interconnect device 106 and CPU 210 from the data movement path reduces the aforementioned undesirable effects: (1) contamination of L1 to L3 caches 102 to 104, respectively; (2) power consumption due to L1 to L3 cache memory transactions; and (3) accelerated aging of the CPU 210 silicon due to redundant transactions in CPU 210. Regarding the reduction in power consumption, the total power consumed by the memory hierarchy according to this embodiment can be expressed as:
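The decision flow of blocks 303, 304, 306, and 310 can be summarized as a small function. This is a minimal sketch under stated assumptions: the names `Path`, `PRESELECTED_TYPES`, and `select_path`, the string transaction types, and the 4 MiB threshold in the usage lines are hypothetical and chosen only for illustration. The source states only that S_TH is configurable and preferably equal to the size of the LLC memory device 107.

```python
from enum import Enum


class Path(Enum):
    NORMAL = "normal"      # unchanged path through the full hierarchy (block 306)
    MODIFIED = "modified"  # reduced-movement path chosen by the LLC controller (block 310)


# Preselected types named in the text; other large transaction types may be added.
PRESELECTED_TYPES = {"memset", "memcopy"}


def select_path(txn_type, block_size, s_th):
    """Return the data movement path for a queued memory transaction."""
    if txn_type not in PRESELECTED_TYPES:  # block 303
        return Path.NORMAL
    if block_size <= s_th:                 # block 304: size does not exceed S_TH
        return Path.NORMAL
    return Path.MODIFIED


S_TH = 4 * 1024 * 1024  # hypothetical threshold (4 MiB), for illustration only
large_memset = select_path("memset", 8 * 1024 * 1024, S_TH)
ordinary_load = select_path("load", 8 * 1024 * 1024, S_TH)
at_threshold = select_path("memcopy", S_TH, S_TH)
```

Because the two checks are independent, they can be evaluated in either order, which matches the note that blocks 303 and 304 may be performed in reverse order or simultaneously.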
[0044] Total power consumed = Power usage of LLC 107 + Power usage of LLC controller 208 + Power usage of system memory 120 (Equation 2)
[0045] The power savings achieved according to this implementation scheme can be seen from a comparison of Equations 1 and 2. In Equation 2, the following terms from Equation 1 have been eliminated: power usage in core / register 101 + power usage of L1 cache + L2 cache + L3 cache 102 to 104 + power usage of interconnect device 106. Therefore, significant power savings are achieved by changing the data movement path used for large memory transactions.
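The comparison of Equations 1 and 2 can be worked through numerically. The sketch below is purely illustrative: the per-component power figures are made-up placeholder values in arbitrary units, not measurements from the source, and the dictionary keys are hypothetical labels for the components named in the equations.

```python
# Terms of Equation 1 (unchanged path) and Equation 2 (modified path).
EQ1_TERMS = [
    "core_registers", "l1_cache", "l2_cache", "l3_cache",
    "interconnect", "llc", "llc_controller", "system_memory",
]
EQ2_TERMS = ["llc", "llc_controller", "system_memory"]


def total_power(usage, terms):
    """Sum the power usage of the components that take part in a path."""
    return sum(usage[name] for name in terms)


# Placeholder values in arbitrary units (assumed, for illustration only).
usage = {
    "core_registers": 5, "l1_cache": 3, "l2_cache": 4, "l3_cache": 6,
    "interconnect": 7, "llc": 8, "llc_controller": 2, "system_memory": 20,
}

power_eq1 = total_power(usage, EQ1_TERMS)
power_eq2 = total_power(usage, EQ2_TERMS)
savings = power_eq1 - power_eq2  # equals the sum of the eliminated terms
```

Whatever the real per-component values are, the savings equal the sum of the core, L1 to L3 cache, and interconnect terms that drop out of Equation 2.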
[0046] Figure 4 is a flowchart of the process, represented by block 310 of Figure 3, used to select a data movement path for a large memory transaction that is being queued, according to an exemplary embodiment. According to this embodiment, the process of selecting a data movement path depends on whether the size S_M of the memory block exceeds the size S_L of LLC 107. At the step represented by block 401, LLC controller 208 determines whether the size S_M of the memory block to be transferred exceeds the size S_L of LLC 107. If not, LLC controller 208 writes all bytes of the memory block to LLC 107, as indicated by block 402.
[0047] If LLC controller 208 determines at block 401 that the size S_M of the memory block exceeds the size S_L of LLC 107, then LLC controller 208 writes as many bytes of the memory block as possible into LLC 107 until LLC 107 is full, without using the remaining bytes to update LLC 107, as indicated by block 403. System memory 120 causes its memory bank 121 to be updated with the remaining bytes, which may occur in any manner, including that of the implementation described above with reference to Figure 1. The sequence of bytes written to LLC 107 may begin with the first S_L kilobytes (KB), the last S_L KB, or some other point of interest, depending on how LLC controller 208 is configured to perform the task.
[0048] Figure 5 is a block diagram of the core 201 with registers shown in Figure 2, which includes a branch predictor logic unit 510 and a loop stream decoder logic unit 520 for detecting the type of instruction being queued and the size of the memory block associated with the memory transaction, respectively. The instruction fetch logic unit 501, instruction decode logic unit 502, and execution engine 503 are existing components of the CPU's processing core. The branch predictor logic unit 510 and loop stream decoder logic unit 520 have been modified according to embodiments of this disclosure to allow the CPU 210 to perform the operations described above with reference to block 303 of the flowchart of Figure 3 to determine the type of instruction being queued and the size of the associated memory block.
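The size comparison of blocks 401 to 403 amounts to choosing which byte range of the block is placed in the LLC. The following is a sketch only: the function name `llc_window` and the `policy` parameter are assumptions introduced for the example, and only the "first" and "last" options mentioned in the text are modeled.

```python
def llc_window(start, s_m, s_l, policy="first"):
    """Return the [begin, end) address range of the block that is written to the LLC.

    start: start address of the memory block
    s_m:   size of the memory block in bytes
    s_l:   size of the LLC in bytes
    """
    cached = min(s_m, s_l)  # block 402 if s_m <= s_l, otherwise block 403
    if policy == "first":
        begin = start
    elif policy == "last":
        begin = start + s_m - cached
    else:
        raise ValueError("unsupported policy: " + policy)
    return begin, begin + cached


first_window = llc_window(0x1000, 10, 4, "first")
last_window = llc_window(0x1000, 10, 4, "last")
whole_block = llc_window(0, 3, 4)
```

Bytes outside the returned window are the "remaining bytes" that update only the system memory bank.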
[0049] The processor executes instructions in a sequential control flow (i.e., one instruction following another). Branching allows the program to change the execution flow by jumping to the beginning of a new sequence of instructions to be executed. Branch predictor logic unit 510 is the logic unit that predicts the beginning of the next instruction sequence after jumping to the new instruction sequence. This helps instruction fetch logic unit 501 fetch the next instruction sequence in advance and forward it to instruction decode logic unit 502.
[0050] A loop is a sequence of instructions that repeats continuously until a specific condition is met. Typically, the execution of this sequence of instructions leads to the retrieval of a data item and the processing of that data item to modify it. This processing is repeated until a specific condition, such as whether a counter has reached a predetermined number, is true. In processor execution flow, there is the movement of instructions from memory to the processor core and the bidirectional movement of data between the processor core and memory. A stream refers to the flow of instructions or data between memory and the processor core. For example, copying a file from one folder to another results in a streaming transfer of the data content from the beginning byte of the file to the end byte of the file.
[0051] The instruction decoding logic unit 502 decodes the extracted instructions and executes them. Figure 3 The process represented by box 303 determines the type of instruction to be executed. The loop stream decoder logic unit 520 analyzes the decoded instruction to obtain a loop count and / or stream length. Based on the loop count and / or stream length, the loop stream decoder logic unit 520 determines the size of the memory block associated with the instruction being decoded and executes the process in box 304 to determine if that size exceeds a size threshold. If so, the CPU 210 forwards the memory block size and transaction type to the LLC controller 208. Otherwise, the CPU 210 uses the normal data flow path—executing the instruction in the execution engine 203 and transferring the result to other registers of the CPU 210 or to system memory 120.
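How a block size might be derived from a loop count, and what is forwarded when the threshold is exceeded, can be sketched as follows. The source says only that the size is determined "based on the loop count and / or stream length"; the specific formula used here (loop count multiplied by the bytes touched per iteration) is an assumption, as are the function names and the dictionary used to stand in for the forwarded metadata.

```python
def estimate_block_size(loop_count, bytes_per_iteration):
    """Assumed model: each loop iteration touches a fixed number of bytes."""
    return loop_count * bytes_per_iteration


def metadata_to_forward(txn_type, loop_count, bytes_per_iteration, s_th):
    """Return the metadata sent to the LLC controller, or None for the normal path."""
    size = estimate_block_size(loop_count, bytes_per_iteration)
    if size > s_th:  # block 304: size exceeds the threshold
        return {"type": txn_type, "size": size}
    return None      # instruction is executed by the execution engine as usual


forwarded = metadata_to_forward("memset", 1024, 8, 4096)
not_forwarded = metadata_to_forward("memset", 512, 8, 4096)
```

In the second call the estimated size equals the threshold, so nothing is forwarded, consistent with the "exceeds" wording of block 304.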
[0052] Figure 6 This is a block diagram of interconnect device 106, system memory 120, LLC 107, and LLC controller 208, the LLC controller including logic components configured to perform the operations described in the reference above. Figure 4 The flowchart describes the operations to select a data path based on the size of the memory block and write some or all of the associated data to the LLC107. The last-stage coprocessor (LCP) 610 of the LLC controller 208 is configured to perform the operations described by the flowchart. Figure 4The flowchart illustrates the logical components of the process. The front-end (FE) 601 interfaces with interface device 106, decodes addresses, and extracts data from incoming transactions. LLC 107 includes TAG RAM 602 and data RAM 603, which store tags and data in a cache structure, respectively. The back-end (BE) 604 interfaces with system memory controller 123. When a cache miss occurs on LLC 107, the transaction is forwarded to system memory controller 123 via BE 604. BE 604 also transmits details of the incoming read or write transaction along with metadata to LCP 610.
[0053] As noted above, LCP 610 is an intelligent entity residing in LLC controller 208 that can read the metadata of incoming transactions and take necessary actions, such as generating a transaction to FE 601 to simulate an incoming transaction from interconnect device 106 based on the size of the memory block in the metadata of the memcopy and memset APIs. Figure 4 The decision-making process of frame 401 and by Figure 4 The operations indicated by boxes 402 and 403 are performed by LCP 610.
[0054] For those with more than Figure 3 The memset operation of the memory address block of the size of the S_TH value in box 304 encodes the start address, memory block size, and data in system memory 120 in metadata received from CPU 210 in LLC controller 208. LLC controller 208 writes the data from the start address to the end address equal to the start address plus the memory block size into the memory bank of system memory 121 and into LLC 107. The data movement paths for writing are indicated by arrows 621 and 622. Path 621 is from interconnect device 106 to FE 601 to LLC 107. Path 622 is from LLC 107 via BE 604 to system memory controller 123, to PHY 122, and then to memory bank 121.
[0055] Assuming the size of the data to be written does not exceed the size of LLC 107, as defined in box 401, LCP 610 reads the metadata received by FE 601 from interconnect device 106, generates the next transaction, and updates LLC 107 until the size of the memory to be set is reached. If the size of the data to be written exceeds the size of LLC 107, as defined in box 401, only the portion of the data equal in size to LLC 107 is written to LLC 107, as indicated in box 403. The path for this write is represented by arrow 623 from LCP 610 to LLC 107 and by arrow 622 from LLC 107 via BE 604 to system memory controller 123, to PHY 122, and then to memory bank 121.
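The memset handling described above can be modeled in software to make the sequence concrete. This is a simplified sketch, not the LCP's actual logic: the 64-byte transaction granularity `LINE_BYTES`, the `MemsetMetadata` class, the `lcp_memset` function, the dictionary standing in for LLC 107, and the choice to cache the first bytes of the block are all assumptions made for the example.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed transaction granularity; not specified in the source


@dataclass
class MemsetMetadata:
    start: int  # start address in system memory
    size: int   # memory block size in bytes
    value: int  # byte value to write


def lcp_memset(meta, system_memory, llc, llc_capacity):
    """Generate successive writes from start to start + size.

    Every chunk updates system memory; chunks are also placed in the LLC
    until `llc_capacity` bytes have been cached (blocks 401 to 403).
    Returns the number of transactions generated.
    """
    end = meta.start + meta.size
    addr = meta.start
    transactions = 0
    while addr < end:
        n = min(LINE_BYTES, end - addr)
        chunk = bytes([meta.value]) * n
        system_memory[addr:addr + n] = chunk        # toward the memory bank
        if (addr - meta.start) + n <= llc_capacity:
            llc[addr] = chunk                       # LLC update for the part that fits
        transactions += 1
        addr += n
    return transactions


system_memory = bytearray(512)  # hypothetical system memory
llc = {}                        # hypothetical LLC: chunk address -> data
count = lcp_memset(MemsetMetadata(64, 256, 0xAB), system_memory, llc, llc_capacity=128)
```

The CPU supplies only the metadata once; all subsequent transactions are generated on the LLC side, which is what keeps the core and L1 to L3 caches out of the path.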
[0056] For more than Figure 3 The memcopy operation of the memory address block with the S_TH value of box 304, the source address in system memory device 120, the destination address in system memory 120, and the size of the address block to be copied are encoded in metadata received from interconnect device 106 via FE 601 in LLC controller 208. Data is read from the source address and LLC 107 is updated. The corresponding data movement paths are indicated by arrows 621, 622, 625, and 623. The corresponding data movement paths are indicated by arrows 621, 622, and 625. Then, LCP 610 uses the received data to generate a write transaction for the destination address to copy the received data into system memory 120. The corresponding data movement paths are indicated by arrows 626, 627 (optional), and 622. LCP 610 generates the next read and associated write until the copy size is met. If the data size is larger than the size of LLC 107, update LLC 107 only on paths 626 and 627 for the portion of the data that can be contained in LLC 107, such as... Figure 4 As indicated by box 403. For portions of the data exceeding the size of LLC 107, data movement is along paths 626 and 622.
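The read-then-write loop that the LCP generates for a memcopy can be sketched in the same style. Again this is an illustrative model under assumptions: `LINE_BYTES`, the function name `lcp_memcopy`, and the dictionary standing in for LLC 107 are hypothetical, only destination chunks are tracked in the modeled LLC (a simplification of the LLC updates described in the text), and the source and destination ranges are assumed not to overlap.

```python
LINE_BYTES = 64  # assumed transaction granularity; not specified in the source


def lcp_memcopy(src, dst, size, system_memory, llc, llc_capacity):
    """Copy `size` bytes from src to dst using LCP-generated reads and writes.

    Chunks are also kept in the LLC until `llc_capacity` bytes are cached;
    the rest of the block moves only between the LCP and system memory.
    Returns the number of bytes copied.
    """
    copied = 0
    while copied < size:
        n = min(LINE_BYTES, size - copied)
        data = bytes(system_memory[src + copied:src + copied + n])  # generated read
        system_memory[dst + copied:dst + copied + n] = data         # generated write
        if copied + n <= llc_capacity:
            llc[dst + copied] = data  # optional LLC update for the part that fits
        copied += n
    return copied


system_memory = bytearray(range(256)) + bytearray(256)  # hypothetical contents
llc = {}
copied_bytes = lcp_memcopy(0, 256, 200, system_memory, llc, llc_capacity=128)
```

With a 128-byte modeled LLC and a 200-byte copy, only the first two 64-byte chunks are cached, mirroring the split described for block 403.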
[0057] As can be seen from the description of Figure 6, none of these large memory transactions involves the portion of memory hierarchy 200 (Figure 2) that includes interconnect device 106 and CPU 210. This results in zero contamination of the L1 to L3 caches and registers of the CPU 210 cores (which improves the performance of CPU 210), lower power consumption, and reduced aging of the CPU 210 silicon, because the CPU 210 cores perform fewer computations.
[0058] Figure 7 illustrates an example of a PCD 700, such as a mobile phone, a smartphone, a portable game console (such as an extended reality (XR) device, a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), an autonomous driving system for an automobile, etc., in which exemplary embodiments of the systems, methods, computer-readable media, and other examples can be implemented according to the inventive principles and concepts of this disclosure. For clarity, some interconnects, signals, devices, etc. are not shown in Figure 7.
[0059] PCD 700 may include SoC 702. SoC 702 includes CPU 210, NPU 705, GPU 706, DSP 707, analog signal processor 708, modem or modem subsystem 754, or other processors. CPU 210 may include one or more CPU cores, such as first CPU core 201_1, second CPU core 201_2, etc., up to Mth CPU core 201_M. Branch predictor logic unit 510 and loop stream decoder logic unit 520 may be employed in one or more of CPU cores 201_1 to 201_M, although these logic components will typically be employed in all of cores 201_1 to 201_M. Additionally, although CPU 210 is depicted in Figure 7 as a multi-core CPU, it can have as few as a single core that employs branch predictor logic unit 510 and loop stream decoder logic unit 520. CPU cores 201_1 to 201_M also perform other operations of the type typically performed in a PCD. Alternatively or additionally, any of the processors of SoC 702 (such as NPU 705, GPU 706, DSP 707, etc.) may have cores configured in the manner described above with reference to Figures 2 to 5 to perform the operations described above with reference to Figures 2, 3, and 5.
[0060] CPU 210 is interconnected with system memory 120 via interconnect structure 205. As indicated above, interconnect structure 205 includes interconnect device 106, LLC 107, and LLC controller 208. LLC controller 208 includes the components shown in Figure 6, including LCP 610, which is logic configured to perform the operations described above with reference to Figures 3, 4, and 6.
[0061] Display controller 710 and touchscreen controller 712 may be coupled to CPU 210. A touchscreen display 714 external to SoC 702 may be coupled to display controller 710 and touchscreen controller 712. PCD 700 may also include a video decoder 716 coupled to CPU 210. Video amplifier 718 may be coupled to video decoder 716 and touchscreen display 714. Video port 720 may be coupled to video amplifier 718. Universal Serial Bus (“USB”) controller 722 may also be coupled to CPU 210, and USB port 724 may be coupled to USB controller 722. Subscriber Identity Module (“SIM”) card 726 may also be coupled to CPU 210.
[0062] A stereo audio codec 734 may be coupled to analog signal processor 708. Additionally, an audio amplifier 736 may be coupled to stereo audio codec 734. A first stereo speaker 738 and a second stereo speaker 740 may each be coupled to audio amplifier 736. Furthermore, a microphone amplifier 742 may be coupled to stereo audio codec 734, and a microphone 744 may be coupled to microphone amplifier 742. An FM radio tuner 746 may be coupled to stereo audio codec 734. An FM antenna 748 may be coupled to FM radio tuner 746. Additionally, stereo headphones 750 may be coupled to stereo audio codec 734. Examples of other devices that may be coupled to CPU 210 include one or more digital (e.g., CCD or CMOS) cameras 752.
[0063] A modem or RF transceiver 754 may be coupled to an analog signal processor 708 and a CPU 210. An RF switch 756 may be coupled to an RF transceiver 754 and an RF antenna 758. Additionally, a keypad 760 and a mono headset 762 with a microphone may be coupled to the analog signal processor 708. The SoC 702 may have one or more internal or on-chip thermal sensors 770. A power supply 774 and a PMIC 776 may power the SoC 702.
[0064] Firmware or software may be stored in any of the aforementioned memories, or in local memory directly accessible to the processor hardware on which the software or firmware executes. Execution of such firmware or software can control aspects of any of the aforementioned methods or configure aspects of any of the aforementioned systems. Any such memory or other non-transitory storage medium having firmware or software stored therein in computer-readable form for execution by processor hardware is an example of a "computer-readable medium," as that term is understood in the patent lexicon.
[0065] Specific implementation examples are described in the following numbered clauses.
[0066] 1. A method for performing memory transactions in a manner that reduces data movement within a memory hierarchy, the method comprising:
[0067] determining whether the type of a memory transaction being queued in one or more cores of a processor for execution by the processor is one of a plurality of preselected types for which a data movement path change is an option;
[0068] determining whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1;
[0069] if it is determined that the memory transaction type is one of the preselected types and the memory block size exceeds the S_TH1 value, selecting a changed data movement path for performing the memory transaction, wherein the selected path reduces the amount of data movement compared to an unchanged data movement path used to perform a memory transaction that does not belong to the preselected types; and
[0070] causing the memory transaction to be performed using the changed data movement path.
[0071] 2. The method according to Clause 1, wherein the determining steps are performed in the processor and the selecting step is performed in a last-level cache (LLC) controller of the memory hierarchy, the method further comprising:
[0072] before performing the selecting step:
[0073] forwarding, by the processor, the memory transaction type, the size of the memory block, and one or more system memory addresses associated with the memory transaction to the LLC controller;
[0074] receiving, in the LLC controller, the transaction type, the size of the memory block, and the one or more system memory addresses via an interconnect device of the memory hierarchy;
[0075] and
[0076] during the step of causing the memory transaction to be performed using the changed data movement path:
[0077] using, in the LLC controller, the memory transaction type, the size of the memory block, and the one or more system memory addresses associated with the memory transaction to cause the memory transaction to be performed.
[0078] 3. The method according to Clause 2, further comprising:
[0079] if the processor determines that the memory transaction type is not one of the preselected types or that the memory block size does not exceed the S_TH1 value, causing the memory transaction to be performed using an unchanged data movement path.
[0080] 4. The method according to any one of Clauses 2 and 3, wherein the step of selecting the changed data movement path includes:
[0081] determining, in the LLC controller, whether the size of the memory block exceeds a second preselected size threshold value S_TH2; and
[0082] if the LLC controller determines that the size of the memory block does not exceed the S_TH2 value, the changed data movement path selected by the LLC controller is a path that includes the LLC controller writing the entire memory block associated with the memory transaction to an LLC memory that is part of the memory hierarchy.
[0083] 5. The method according to Clause 4, wherein the step of selecting the changed data movement path further comprises:
[0084] if the LLC controller determines that the size of the memory block exceeds the S_TH2 value, the changed data movement path selected by the LLC controller is a path that includes the LLC controller writing bytes of a portion of the memory block to the LLC memory, the portion of the memory block having a size not exceeding the size of the LLC memory, and any remaining bytes of the memory block not written to the LLC memory being written to system memory.
[0085] 6. The method according to any one of Clauses 2 to 5, wherein the interconnect device interconnects the processor with the LLC controller and a memory controller with the system memory, and wherein the LLC controller is configured such that bytes of the memory block are written to at least one of the system memory and the LLC memory by transmitting a transaction to the front end (FE) of the LLC controller, the transaction simulating a transaction received from the interconnect device in the FE, and wherein the simulated transaction is based at least in part on the size of the memory block transmitted from the processor to the LLC controller via the interconnect device.
[0086] 7. The method according to any one of Clauses 1 to 6, wherein the multiple preselected types of memory transactions include memory set (memset) and memory copy (memcopy) operations, wherein the memset operation assigns the same preselected value to the entire memory block associated with the memory transaction, and wherein the memcopy operation copies the entire memory block from one set of addresses in system memory to another set of addresses in system memory.
[0087] 8. The method according to any one of Clauses 1 to 7, wherein the processor is a component of a system-on-chip (SoC) integrated circuit package of a personal computing device (PCD).
[0088] 9. The method according to Clause 8, wherein the processor is the central processing unit (CPU) of the SoC integrated circuit package.
[0089] 10. A system for reducing data movement when performing memory transactions in a memory hierarchy, the system comprising:
[0090] A processor, comprising logic configured to: determine whether the type of a memory transaction being queued in one or more cores of the processor for execution by the processor is a preselected type among a plurality of preselected memory transaction types, and determine whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1, the processor comprising logic configured to: output the memory transaction type, the memory block size, and one or more system memory addresses associated with the memory transaction if it is determined that the memory transaction type is a preselected type and the memory block size exceeds the S_TH1 value; and
[0091] The LLC controller of the memory hierarchy, electrically coupled to the processor via interconnects of the memory hierarchy, receives, via the interconnects, the memory transaction type, the memory block size, and one or more system memory addresses associated with the memory transaction output by the processor. The LLC controller includes an LLC memory and a last-level coprocessor (LCP), the LCP including logic components configured to: select a modified data movement path for executing the transaction relative to an unchanged data movement path for executing a memory transaction that does not belong to the preselected type; and cause the memory transaction to be executed using the modified data movement path.
[0092] 11. The system according to Clause 10, wherein the processor further includes a logic component configured to perform the memory transaction using an unchanged data movement path if the processor determines that the memory transaction type is not one of the preselected types or the memory block size does not exceed the S_TH1 value.
[0093] 12. The system according to any one of Clauses 10 to 11, wherein the logic component of the LLC controller configured to select a changed data movement path selects the changed data movement path by determining whether the size of the memory block exceeds a second preselected size threshold value S_TH2, and if not, selects the changed data movement path comprising the entire memory block associated with the memory transaction written by the LLC controller to the LLC memory, the LLC memory being part of the memory hierarchy.
[0094] 13. The system according to any one of clauses 10 to 12, wherein, when the memory block size exceeds the S_TH2 value, the logic component of the LLC controller configured to select a changed data movement path selects the changed data movement path including the LLC controller writing only a portion of bytes of the memory block to the LLC memory, the only portion of the memory block having a size not exceeding the size of the LLC memory.
[0095] 14. The system according to any one of Clauses 10 to 13, wherein the LCP is configured such that bytes of the memory block are written to at least one of the system memory and the LLC memory by transmitting a transaction to the front end (FE) of the LLC controller, the transaction simulating a transaction received from the interconnect device in the FE, and wherein the simulated transaction is based at least in part on the size of the memory block transmitted from the processor to the LLC controller via the interconnect device.
[0096] 15. The system according to any one of Clauses 10 to 14, wherein the plurality of preselected types of memory transactions include memory set (memset) and memory copy (memcopy) operations, wherein the memset operation assigns the same preselected value to the entire memory block associated with the memory transaction, and wherein the memcopy operation copies the entire memory block from one set of addresses in system memory to another set of addresses in system memory.
[0097] 16. The system according to any one of clauses 10 to 15, wherein the system comprises a system-on-chip (SoC) integrated circuit package, the SoC integrated circuit package including the processor.
[0098] 17. The system according to Clause 16, wherein the processor is the central processing unit (CPU) of the SoC integrated circuit package.
[0099] 18. The system according to any one of Clauses 16 to 17, wherein the SoC integrated circuit package is a component of a personal computing device (PCD).
[0100] 19. A non-transitory computer-readable medium comprising computer instructions, the computer instructions being executed by a processor and a last-level cache (LLC) controller of a memory hierarchy to reduce data movement during memory transactions, the computer instructions comprising:
[0101] a first set of computer instructions for determining whether the type of a memory transaction being queued in one or more cores of the processor for execution by the processor is one of a plurality of preselected types for which a data movement path change is an option;
[0102] a second set of computer instructions for determining whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1, wherein the system memory is part of the memory hierarchy;
[0103] a third set of computer instructions for forwarding the memory transaction type, the size of the memory block, and one or more system memory addresses associated with the memory transaction to the LLC controller of the memory hierarchy when the processor determines, while executing the first and second sets of computer instructions, that the memory transaction type is one of the preselected types and the memory block size exceeds the S_TH1 value;
[0104] a fourth set of computer instructions for receiving, in the LLC controller via an interconnect of the memory hierarchy, the transaction type, the size of the memory block, and the one or more system memory addresses, and for selecting a changed data movement path relative to an unchanged data movement path used for executing a memory transaction not belonging to the preselected types; and
[0105] a fifth set of computer instructions for causing the memory transaction to be performed using the changed data movement path.
[0106] 20. The computer-readable medium according to Clause 19, wherein the fourth set of instructions comprises:
[0107] computer instructions for determining whether the size of the memory block exceeds a second preselected size threshold value S_TH2;
[0108] computer instructions for, if it is determined that the size of the memory block does not exceed the S_TH2 value, selecting a changed data movement path that includes the LLC controller writing the entire memory block associated with the memory transaction to the LLC memory; and
[0109] computer instructions for, when the LLC controller determines that the size of the memory block exceeds the S_TH2 value, selecting a changed data movement path that includes the LLC controller writing bytes of a portion of the memory block to the LLC memory, the portion of the memory block having a size not exceeding the size of the LLC memory.
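By way of non-limiting illustration, the two-threshold selection recited in the clauses above can be sketched as a pair of checks: the processor tests the transaction type and S_TH1, and the LLC controller then tests S_TH2. The names `Path`, `PRESELECTED_TYPES`, `processor_offloads`, and `select_path`, the string labels for transaction types, and the threshold values used in the usage note are hypothetical and are not taken from the disclosure.

```python
from enum import Enum


class Path(Enum):
    UNCHANGED = "unchanged data movement path"
    CHANGED_FULL_LLC = "changed path, entire block written to the LLC memory"
    CHANGED_PARTIAL_LLC = "changed path, only an LLC-sized portion written to the LLC memory"


# Assumption: the preselected types are memset and memcopy, per Clause 7.
PRESELECTED_TYPES = frozenset({"memset", "memcopy"})


def processor_offloads(txn_type: str, block_size: int, s_th1: int) -> bool:
    """Processor-side check: preselected type and block size above S_TH1."""
    return txn_type in PRESELECTED_TYPES and block_size > s_th1


def select_path(txn_type: str, block_size: int, s_th1: int, s_th2: int) -> Path:
    """Combine the processor-side check with the LLC-controller-side S_TH2 check."""
    if not processor_offloads(txn_type, block_size, s_th1):
        return Path.UNCHANGED
    if block_size <= s_th2:
        # Block does not exceed S_TH2: the entire block goes to the LLC memory.
        return Path.CHANGED_FULL_LLC
    # Block exceeds S_TH2: only a portion no larger than the LLC is cached.
    return Path.CHANGED_PARTIAL_LLC
```

For example, with an assumed S_TH1 of 4096 bytes and an assumed S_TH2 of 65536 bytes, `select_path("memset", 8192, 4096, 65536)` selects the full-LLC changed path, while a 1 MiB memcopy selects the partial-LLC changed path and an ordinary load stays on the unchanged path.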
[0110] Alternative embodiments will become apparent to those skilled in the art to which this invention pertains. Therefore, although alternative aspects have been illustrated and described in detail, it should be understood that various substitutions and changes may be made therein.
Claims
1. A method for performing memory transactions in a manner that reduces data movement within a memory hierarchy, the method comprising: determining, by a processor, whether the type of a memory transaction being queued in one or more cores of the processor for execution by the processor is one of a plurality of preselected types for which a data movement path change is an option; determining, by the processor, whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1; when a last-level cache (LLC) controller determines that the memory transaction type is one of the preselected types and the memory block size exceeds the S_TH1 value, selecting, by the LLC controller, a changed data movement path for performing the memory transaction, wherein the selected changed data movement path reduces the amount of data movement compared to an unchanged data movement path for performing a memory transaction that does not belong to the preselected types; and when the memory transaction is one of the plurality of preselected types, causing the memory transaction to be performed using the changed data movement path.
2. The method according to claim 1, further comprising: before performing the selecting step: forwarding, by the processor, the memory transaction type, the size of the memory block, and one or more system memory addresses associated with the memory transaction to the LLC controller; and receiving, in the LLC controller, the transaction type, the size of the memory block, and the one or more system memory addresses via an interconnect device of the memory hierarchy; and during the step of causing the memory transaction to be performed using the changed data movement path: using, in the LLC controller, the memory transaction type, the size of the memory block, and the one or more system memory addresses associated with the memory transaction to cause the memory transaction to be performed.
3. The method according to claim 2, further comprising: if the processor determines that the memory transaction type is not one of the preselected types or that the memory block size does not exceed the S_TH1 value, causing the memory transaction to be performed using an unchanged data movement path.
4. The method of claim 2, wherein the step of selecting the changed data movement path comprises: determining, in the LLC controller, whether the size of the memory block exceeds a second preselected size threshold value S_TH2; and when the LLC controller determines that the size of the memory block does not exceed the S_TH2 value, the changed data movement path selected by the LLC controller is a data movement path that includes the LLC controller writing the entire memory block associated with the memory transaction to an LLC memory that is part of the memory hierarchy.
5. The method of claim 4, wherein the step of selecting the changed data movement path further comprises: when the LLC controller determines that the size of the memory block exceeds the S_TH2 value, the changed data movement path selected by the LLC controller is a data movement path that includes the LLC controller writing bytes of a portion of the memory block to the LLC memory, the portion of the memory block having a size that does not exceed the size of the LLC memory, and bytes of the memory block that are not written to the LLC memory being written to the system memory.
6. The method of claim 5, wherein the interconnect device interconnects the processor with the LLC controller and a memory controller with the system memory, and wherein the LLC controller is configured to write bytes of the memory block to at least one of the system memory and the LLC memory by transmitting a transaction to a front end (FE) of the LLC controller, the transaction simulating a transaction received from the interconnect device in the FE, and wherein the simulated transaction is based at least in part on the size of the memory block transmitted from the processor to the LLC controller via the interconnect device.
7. The method of claim 1, wherein the multiple preselected types of memory transactions include memory set (memset) and memory copy (memcopy) operations, wherein the memset operation assigns the same preselected value to the entire memory block associated with the memory transaction, and wherein the memcopy operation copies the entire memory block from one set of addresses in system memory to another set of addresses in system memory.
8. The method of claim 1, wherein the processor is a component of a system-on-chip (SoC) integrated circuit package of a personal computing device (PCD).
9. The method of claim 8, wherein the processor is a central processing unit (CPU) of the SoC integrated circuit package.
10. A system for reducing data movement when performing memory transactions in a memory hierarchy, the system comprising: a processor comprising logic configured to: determine whether the type of a memory transaction being queued in one or more cores of the processor for execution by the processor is a preselected type among a plurality of preselected memory transaction types, and determine whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1, the processor further comprising logic configured to: output the memory transaction type, the memory block size, and one or more system memory addresses associated with the memory transaction if it is determined that the memory transaction type is a preselected type and the memory block size exceeds the S_TH1 value; and a last-level cache (LLC) controller of the memory hierarchy, electrically coupled to the processor via an interconnect of the memory hierarchy, the LLC controller receiving, via the interconnect, the memory transaction type, the memory block size, and the one or more system memory addresses associated with the memory transaction that are output by the processor, the LLC controller including an LLC memory and a last-level coprocessor (LCP), the LCP including logic configured to: select a changed data movement path for executing the transaction relative to an unchanged data movement path; select the unchanged data movement path for executing memory transactions that do not belong to the preselected types; and, for memory transactions belonging to the preselected types, cause the memory transaction to be executed using the changed data movement path.
11. The system of claim 10, wherein the processor further comprises a logic component configured to perform the following operation: if the processor determines that the memory transaction type is not one of the preselected types or the memory block size does not exceed the S_TH1 value, causing the memory transaction to be performed using an unchanged data movement path.
12. The system of claim 10, wherein the logic component of the LLC controller configured to select the changed data movement path selects the changed data movement path by determining whether the size of the memory block exceeds a second preselected size threshold value S_TH2.
13. The system of claim 12, wherein when the memory block size exceeds the S_TH2 value, the logic component of the LLC controller configured to select the changed data movement path selects the changed data movement path including the LLC controller writing only a portion of bytes of the memory block to the LLC memory, the only portion of the memory block having a size not exceeding the size of the LLC memory.
14. The system of claim 13, wherein the LCP is configured to write bytes of the memory block to at least one of the system memory and the LLC memory by transmitting a transaction to a front end (FE) of the LLC controller, the transaction simulating a transaction received from the interconnect device in the FE, and wherein the simulated transaction is based at least in part on the size of the memory block transmitted from the processor to the LLC controller via the interconnect device.
15. The system of claim 10, wherein the multiple preselected types of memory transactions include memory set (memset) and memory copy (memcopy) operations, wherein the memset operation assigns the same preselected value to the entire memory block associated with the memory transaction, and wherein the memcopy operation copies the entire memory block from one set of addresses in system memory to another set of addresses in system memory.
16. The system of claim 10, wherein the system comprises a system-on-a-chip (SoC) integrated circuit package, the SoC integrated circuit package including the processor.
17. The system of claim 16, wherein the processor is a central processing unit (CPU) of the SoC integrated circuit package.
18. The system of claim 17, wherein the SoC integrated circuit package is a component of a personal computing device (PCD).
19. A non-transitory computer-readable medium comprising computer instructions, the computer instructions being executed by a processor and a last-level cache (LLC) controller of a memory hierarchy to reduce data movement during memory transactions, the computer instructions comprising: a first set of computer instructions for determining whether the type of a memory transaction being queued in one or more cores of the processor for execution by the processor is one of a plurality of preselected types for which a data movement path change is an option; a second set of computer instructions for determining whether the size of a memory block in system memory associated with the memory transaction exceeds a first preselected size threshold value S_TH1, wherein the system memory is part of the memory hierarchy; a third set of computer instructions for forwarding the memory transaction type, the size of the memory block, and one or more system memory addresses associated with the memory transaction to the LLC controller of the memory hierarchy when the processor determines, while executing the first and second sets of computer instructions, that the memory transaction type is one of the preselected types and the memory block size exceeds the S_TH1 value; a fourth set of computer instructions for receiving, in the LLC controller via an interconnect of the memory hierarchy, the transaction type, the size of the memory block, and the one or more system memory addresses, and for selecting a changed data movement path relative to an unchanged data movement path used for executing a memory transaction that does not belong to the preselected types; and a fifth set of computer instructions for causing the memory transaction to be performed using the changed data movement path.
20. The computer-readable medium of claim 19, wherein the fourth set of instructions comprises: computer instructions for determining whether the size of the memory block exceeds a second preselected size threshold value S_TH2; computer instructions for, when it is determined that the size of the memory block does not exceed the S_TH2 value, selecting a changed data movement path that includes the LLC controller writing the entire memory block associated with the memory transaction to the LLC memory; and computer instructions for, when the LLC controller determines that the size of the memory block exceeds the S_TH2 value, selecting a changed data movement path that includes the LLC controller writing bytes of a portion of the memory block to the LLC memory, the portion of the memory block having a size not exceeding the size of the LLC memory.