Heterogeneous storage-compute integrated acceleration system, method, and storage device
By using a heterogeneous in-memory computing acceleration system, the shortcomings of a single processor architecture in the standard compression process of gene sequencing data are solved. It achieves efficient collaboration of global redundancy mining, local pattern extraction and entropy coding, reduces data migration and synchronization overhead, and improves compression efficiency.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI
- Filing Date: 2026-04-20
- Publication Date: 2026-06-12
AI Technical Summary
In existing technologies, the standard compression process for gene sequencing data, when executed on a single processor architecture, suffers from problems such as long data paths, imbalance between computation and memory access, and high overhead in process control and data migration, making it difficult to form a unified acceleration system covering the entire process.
A heterogeneous in-memory computing acceleration system is adopted. The main control processing module generates task objects, and the unified flow scheduling control module performs stage-aware mapping to dispatch different task objects to the corresponding execution modules, including a near-memory reordering acceleration module, an in-memory conversion encoding acceleration module, and an entropy encoding dedicated acceleration module, which respectively handle global redundancy mining, local pattern extraction, and entropy encoding operations. Data handover is carried out using a shared buffer and intermediate result storage module.
It enables differentiated deployment of the different stages and end-to-end pipeline collaboration, which reduces inter-stage and system-level data migration and synchronization overhead and improves compression efficiency.
Figure CN122201457A_ABST
Description
Technical Field
[0001] This invention relates to the fields of computer architecture and intelligent storage device technology, and in particular to a heterogeneous in-memory computing acceleration system, method, and storage device.
Background Technology
[0002] With the continuous development of high-throughput sequencing technology, gene sequencing data is rapidly increasing in applications such as precision medicine, population genetics, pathogen surveillance, and molecular breeding. The scale of raw read data, compressed archived data, and subsequent analysis results continues to expand. To meet the needs of long-term archiving, standardized exchange, and selective access, standardized compression for gene sequencing data has gradually become a key link in the data processing chain. For unaligned raw read data, standard compression coding typically includes multiple stages such as preprocessing, descriptor generation, transformation coding, and entropy coding, while also needing to balance high compression ratios, random access capabilities, and stream-level interoperability.
[0003] Raw gene sequencing data, exemplified by FASTQ, typically contains multiple fields, including read identifiers, base sequences, and quality scores. These fields differ significantly in statistical characteristics, redundancy patterns, and access methods. Standard compression workflows, to fully extract cross-read redundancy and local patterns within fields, require generating descriptors for different fields and further performing transformation, entropy encoding, and encapsulation processes. Therefore, the entire encoding chain is long, involves a large amount of intermediate data, and involves frequent data transfers between stages.
[0004] In existing technologies, standard compression of gene sequencing data largely relies on general-purpose processor platforms. These solutions typically place preprocessing, reordering, transformation encoding, entropy encoding, and encapsulation within a unified host processing path, with the processor core, cache, and main memory sharing the data movement and computation tasks. Because the standard compression process includes large working set operations for global redundancy mining, fine-grained high-frequency operations for local pattern extraction, and highly sequential operations dependent on context modeling and bitstream organization, the requirements for computation location and storage hierarchy differ significantly at each stage. When executing this process using a single processor architecture, at least the following problems exist: First, sharing the same execution path across different stages leads to a large amount of intermediate data needing to be repeatedly moved between the core, cache, main memory, and peripherals, resulting in long data paths and significant memory access wait times. Second, reordering tasks often involve large working sets and irregular accesses, making it difficult to fully utilize traditional cache levels; while transformation encoding tasks typically involve fine-grained comparisons, position extraction, and high-frequency regular operations, leading to a computation-memory access imbalance when executed using a general-purpose processor. Third, entropy encoding and file encapsulation rely continuously on context state and output stream order. If they share the same host path with other stages, it can easily lead to increased overhead in process control, synchronization, and intermediate buffer management. 
Fourth, when compression occurs at the data ingestion or archiving stage, existing host-side implementations still require continuous participation from the host CPU, which consumes host computing resources and increases the data transfer burden between the host and storage devices.
[0005] Existing local acceleration approaches typically focus on a single module, making it difficult to form a unified system architecture covering the entire standard compression workflow. Especially in multi-stage standard compression workflows, even if a particular hotspot is accelerated, the bottleneck simply shifts to the control and data transfer paths if task handover, data formatting, synchronization mechanisms, and pipeline organization between stages still rely on frequent host intervention. Therefore, a heterogeneous in-memory computing system for gene sequencing data standard compression workflows is needed. This system should map different types of hotspot stages to different execution locations and organize each stage into an end-to-end collaborative execution path through a unified data / task control mechanism. Furthermore, with the development of intelligent storage devices and computational storage devices, storage devices typically already possess basic resources such as a main controller processor, on-chip SRAM, off-chip DRAM, and flash memory controllers. If a heterogeneous in-memory computing architecture for gene sequencing data standard compression workflows can be deployed within the storage device, compression can be completed on the device side before writing to non-volatile media, thereby reducing host involvement and bus traffic and saving storage space. Therefore, how to build a scalable, deployable, and standard compression-compatible heterogeneous storage and computing solution at both the system and device levels has become a key technical problem that needs to be solved.
Summary of the Invention
[0006] To address the issues of significant differences in load characteristics at different stages in the standard compression workflow for gene sequencing data, the difficulty in uniformly adapting a single processing architecture, the large overhead of data migration and synchronization between stages, and the reliance on continuous host participation for compression at the storage entry point, a heterogeneous in-memory computing acceleration system, method, and storage device are proposed. Under the premise of ensuring the correct functionality and stable compression effect of the standard compression workflow, it enables differentiated deployment at different hotspot stages, unified organization of cross-stage data / tasks, and end-to-end pipeline collaboration.
[0007] This invention provides a heterogeneous in-memory computing acceleration system, comprising:
[0008] The main control processing module is used to receive gene sequencing data to be compressed and generate multiple task objects;
[0009] A unified flow scheduling and control module, connected to the main control processing module, is used to generate a task descriptor for each task object and perform stage-aware mapping according to stage attributes, operator attributes and data attributes, and dispatch different task objects to the corresponding target execution modules.
[0010] The near-memory reordering acceleration module is connected to the unified flow scheduling control module and is configured to receive reordering-related task objects for global redundancy mining and perform reordering operations.
[0011] The in-memory conversion encoding acceleration module is connected to the unified flow scheduling control module to receive conversion encoding related task objects extracted for local patterns, and to perform matching encoding or lookup table encoding according to the task type.
[0012] An entropy coding-specific acceleration module is connected to the unified flow scheduling and control module to receive entropy coding-related task objects and perform entropy coding operations.
[0013] The shared buffer and intermediate result storage module is connected to the unified flow scheduling control module, the near-memory reordering acceleration module, the in-memory conversion encoding acceleration module, and the entropy encoding dedicated acceleration module, respectively, to buffer the output results of each module.
[0014] In one embodiment of the present invention, the main control processing module receives gene sequencing data to be compressed and generates multiple task objects, including:
[0015] The gene sequencing data to be compressed is split into fields, namely, an identifier stream, a base sequence stream, a quality score stream, and a related control information stream.
[0016] Based on preset compression rules, the split fields are organized into different granularities such as access units, descriptor blocks, field groups, or task blocks, and multiple task objects are generated according to different granularity types.
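By way of illustration only, the field splitting of [0015] and the task object generation of [0016] can be sketched in Python as follows. The sketch assumes four-line FASTQ records and a fixed number of records per access unit as the "preset compression rule"; the names `TaskObject`, `split_fastq`, `make_task_objects`, and `records_per_unit` are hypothetical and do not appear in the application, and the control information stream is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class TaskObject:
    access_unit: int   # index of the access unit this task belongs to
    field_type: str    # "identifier", "sequence", or "quality"
    payload: list      # field values grouped into this task

def split_fastq(text):
    """Split four-line FASTQ records into per-field streams."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("input is not a whole number of FASTQ records")
    streams = {"identifier": [], "sequence": [], "quality": []}
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        streams["identifier"].append(header[1:])
        streams["sequence"].append(seq)
        streams["quality"].append(qual)
    return streams

def make_task_objects(streams, records_per_unit=2):
    """Group each stream into access units; emit one task per (unit, field)."""
    total = len(streams["identifier"])
    tasks = []
    for unit, start in enumerate(range(0, total, records_per_unit)):
        for field_type, values in streams.items():
            chunk = values[start:start + records_per_unit]
            tasks.append(TaskObject(unit, field_type, chunk))
    return tasks

# Three records with two records per access unit give two access units.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nIIIH\n@r3\nTTTT\n+\n####\n"
tasks = make_task_objects(split_fastq(fastq))
```

Other granularities named in [0016] (descriptor blocks, field groups, task blocks) would change only the grouping rule in `make_task_objects`.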
[0017] In one embodiment of the present invention, the main control processing module is further configured to execute task objects related to control, organization, non-hotspot descriptor generation, or exception handling.
[0018] In one embodiment of the present invention, the unified flow scheduling control module is configured with a unified task descriptor, which includes at least a task number, a process stage number, an operator type, an input address, an input length, an output address, a reserved output length, a target execution domain identifier, a context parameter pointer, a dependency identifier, a priority, a completion status, an exception status, and a rollback flag.
[0019] In one embodiment of the present invention, the task descriptor further includes an access unit number, a descriptor block number, a field type, a data alignment method, a compression strategy identifier, and a verification field.
[0020] In one embodiment of the present invention, the unified flow scheduling control module is further configured with a result descriptor, which includes at least the result address, result length, result type, stage completion identifier, next stage suggested mapping position, valid bits, and error code.
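By way of illustration only, the task descriptor of [0018] and the result descriptor of [0020] might be modelled as the records below. The field names, types, and the helper `next_task` are assumptions made for this sketch; the application lists the fields but does not specify their encoding. The optional fields of [0019] are omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Domain(Enum):
    MAIN = "main"
    NEAR_MEMORY = "near_memory"
    IN_MEMORY = "in_memory"
    ENTROPY = "entropy"

@dataclass
class TaskDescriptor:
    task_id: int
    stage_id: int
    operator_type: str
    input_addr: int
    input_len: int
    output_addr: int
    reserved_output_len: int
    target_domain: Domain
    context_ptr: Optional[int] = None
    depends_on: Optional[int] = None
    priority: int = 0
    completed: bool = False
    exception: bool = False
    rollback: bool = False

@dataclass
class ResultDescriptor:
    result_addr: int
    result_len: int
    result_type: str
    stage_done: bool
    suggested_next_domain: Optional[Domain]
    valid: bool
    error_code: int = 0

def next_task(prev, result, task_id, operator_type, output_addr, reserved_output_len):
    """Build the next-stage descriptor directly from a valid result descriptor."""
    if not (result.valid and result.stage_done and result.error_code == 0):
        raise ValueError("previous stage did not produce a usable result")
    domain = result.suggested_next_domain
    if domain is None:
        domain = Domain.MAIN  # assumption: fall back to the main control domain
    return TaskDescriptor(
        task_id=task_id,
        stage_id=prev.stage_id + 1,
        operator_type=operator_type,
        input_addr=result.result_addr,   # output of the previous stage becomes input
        input_len=result.result_len,
        output_addr=output_addr,
        reserved_output_len=reserved_output_len,
        target_domain=domain,
        depends_on=prev.task_id,
        priority=prev.priority,
    )

reorder = TaskDescriptor(1, 0, "reorder", 0x1000, 4096, 0x2000, 4096, Domain.NEAR_MEMORY)
res = ResultDescriptor(0x2000, 3000, "reordered_reads", True, Domain.IN_MEMORY, True)
transform = next_task(reorder, res, 2, "match_encode", 0x3000, 4096)
```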
[0021] In one embodiment of the present invention, the unified flow scheduling control module is further configured to perform pipeline scheduling. When the result of the preceding stage of the i-th access unit meets the requirements of the subsequent stage, the next stage task is triggered to form an asynchronous pipeline: while performing a reordering operation on the i-th access unit, a transformation encoding operation is performed on the (i-1)-th access unit, an entropy encoding operation is performed on the (i-2)-th access unit, and an encapsulation and index organization operation is performed on the (i-3)-th access unit.
[0022] Based on the input size, buffer usage, and device load, task scheduling is dynamically selected to be performed at different granularities, such as access unit, descriptor block, field group, or task block.
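By way of illustration only, the overlap described in [0021] can be tabulated with the short Python sketch below. It assumes an idealized lock-step pipeline in which every stage takes one time step; the actual pipeline is asynchronous and driven by result descriptors, so this shows only the steady-state pattern. The names `STAGES` and `pipeline_schedule` are hypothetical.

```python
STAGES = ["reorder", "transform", "entropy", "package"]

def pipeline_schedule(num_units):
    """Return, for each time step, which access unit each stage is working on."""
    schedule = []
    for step in range(num_units + len(STAGES) - 1):
        active = {}
        for depth, stage in enumerate(STAGES):
            unit = step - depth  # stage at depth d lags the first stage by d units
            if 0 <= unit < num_units:
                active[stage] = unit
        schedule.append(active)
    return schedule

schedule = pipeline_schedule(5)
for step, active in enumerate(schedule):
    print(step, active)
```

Under these assumptions, five access units finish in 8 steps rather than the 20 steps a fully serial execution of four stages would need.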
[0023] In one embodiment of the present invention, the unified flow scheduling control module is further configured to: when it is detected that the input data does not conform to the gene sequencing data format, the device load is too high, the buffer is insufficient, the execution module is abnormal, or the compression result does not meet the preset threshold, execute the abnormal rollback mechanism, trigger the normal write rollback path, control the gene sequencing data to be compressed to be directly written to the non-volatile storage medium without compression, and record the rollback flag in the metadata.
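By way of illustration only, the rollback decision of [0023] might be expressed as below. The threshold values `max_load` and `max_ratio`, the reason strings, and the function names are assumptions for this sketch; the application names the trigger conditions but not their numeric limits.

```python
def rollback_reason(valid_format, device_load, free_buffer, needed_buffer,
                    module_ok, raw_len, compressed_len=None,
                    max_load=0.9, max_ratio=0.95):
    """Return a reason string if the uncompressed write path should be used, else None."""
    if not valid_format:
        return "format"
    if device_load > max_load:
        return "load"
    if free_buffer < needed_buffer:
        return "buffer"
    if not module_ok:
        return "module"
    # Assumed threshold: compression must shrink the data below max_ratio of its size.
    if compressed_len is not None and compressed_len > max_ratio * raw_len:
        return "ratio"
    return None

def select_write(raw, compressed, reason):
    """Choose the payload to persist and record the rollback flag in metadata."""
    if reason is None:
        return compressed, {"rollback": False}
    return raw, {"rollback": True, "reason": reason}

reason = rollback_reason(True, 0.5, 100, 50, True, raw_len=1000, compressed_len=990)
payload, metadata = select_write(b"raw-bytes", b"compressed", reason)
```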
[0024] In one embodiment of the present invention, the reordering operation performed by the near-memory reordering acceleration module includes hash value lookup, bin range positioning, Hamming distance filtering, and candidate result organization.
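By way of illustration only, the four reordering steps of [0024] can be sketched in software as follows. The sketch assumes that reads are binned by their first `k` bases (a Python dictionary stands in for the hash lookup and bin positioning) and that candidates have equal length; `k`, `max_dist`, and the function names are hypothetical, and the hardware module is not claimed to work this way.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))

def build_bins(reads, k=4):
    """Hash each read by its first k bases into a bin of read indices."""
    bins = defaultdict(list)
    for idx, read in enumerate(reads):
        bins[read[:k]].append(idx)
    return bins

def find_candidates(reads, bins, query_idx, k=4, max_dist=2):
    """Look up the query's bin, filter by Hamming distance, and sort the survivors."""
    query = reads[query_idx]
    candidates = []
    for idx in bins.get(query[:k], []):
        if idx == query_idx or len(reads[idx]) != len(query):
            continue
        dist = hamming(query, reads[idx])
        if dist <= max_dist:
            candidates.append((dist, idx))
    return sorted(candidates)  # closest candidates first

reads = ["ACGTACGT", "ACGTACGA", "ACGTTTTT", "TTTTACGT"]
bins = build_bins(reads)
candidates = find_candidates(reads, bins, query_idx=0)
```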
[0025] In one embodiment of the present invention, when the in-memory conversion encoding acceleration module performs matching encoding, it specifically completes fine-grained pattern comparison, matching length determination, position output, and symbol generation operations; when the in-memory conversion encoding acceleration module performs lookup table encoding, it specifically completes table entry matching, position extraction, symbol compression, and result merging operations.
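By way of illustration only, the two encoding modes of [0025] can be sketched as below. Matching encoding is shown as a comparison of a target read against an equal-length reference read; lookup table encoding is shown as 2-bit packing of bases with out-of-table symbols recorded by position. The table `BASE_CODES` and all function names are assumptions for this sketch.

```python
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit lookup table

def match_encode(reference, target):
    """Return (leading match length, [(position, symbol), ...]) for equal-length reads."""
    if len(reference) != len(target):
        raise ValueError("sketch assumes equal-length reads")
    mismatches = [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]
    match_len = mismatches[0][0] if mismatches else len(target)
    return match_len, mismatches

def match_decode(reference, mismatches):
    """Rebuild the target read from the reference and the mismatch list."""
    out = list(reference)
    for pos, sym in mismatches:
        out[pos] = sym
    return "".join(out)

def table_encode(seq):
    """Pack table symbols at 2 bits each; record out-of-table symbols as exceptions."""
    packed = 0
    exceptions = []
    for i, base in enumerate(seq):
        code = BASE_CODES.get(base)
        if code is None:
            exceptions.append((i, base))  # position extraction for unknown symbols
            code = 0                      # placeholder bits, fixed up from exceptions
        packed = (packed << 2) | code
    return packed, len(seq), exceptions

match_len, mismatches = match_encode("ACGTACGT", "ACGAACGT")
restored = match_decode("ACGTACGT", mismatches)
```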
[0026] In one embodiment of the present invention, the entropy coding operation performed by the dedicated entropy coding acceleration module includes context modeling, binarization, codeword update, and code stream output.
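By way of illustration only, the binarization, context modeling, and model update steps of [0026] can be sketched with an adaptive binary model. This is not an arithmetic coder and produces no bitstream: it only estimates the ideal code length that a context-adaptive binary coder would approach. The count-based probability model and the choice of context (the bits already coded within the same symbol) are assumptions for this sketch.

```python
import math

def binarize(symbol_code, bits=2):
    """Split a small integer into its bits, most significant first."""
    return [(symbol_code >> (bits - 1 - i)) & 1 for i in range(bits)]

class AdaptiveBit:
    """Count-based probability model for one binary context."""
    def __init__(self):
        self.counts = [1, 1]  # start from a uniform estimate

    def cost_and_update(self, bit):
        prob = self.counts[bit] / (self.counts[0] + self.counts[1])
        self.counts[bit] += 1          # adapt the model after coding the bit
        return -math.log2(prob)        # ideal code length in bits

def estimate_bits(codes, bits=2):
    """Sum ideal code lengths, using the bits coded so far in a symbol as context."""
    models = {}
    total = 0.0
    for code in codes:
        context = ()
        for bit in binarize(code, bits):
            model = models.setdefault(context, AdaptiveBit())
            total += model.cost_and_update(bit)
            context = context + (bit,)
    return total

cost = estimate_bits([0] * 32)  # a highly repetitive stream of 2-bit symbols
```

For this repetitive input the estimate is far below the 64 bits of a fixed 2-bit code, which is the effect the adaptive model is meant to capture.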
[0027] In one embodiment of the present invention, the main control processing module employs a single-core processor or multiple general-purpose processing cores, a lightweight programmable processing core, or an embedded control core; and / or,
[0028] The unified flow scheduling control module is implemented as an independent control logic, or it is integrated into the main control processing module, shared cache, memory controller, or device controller.
[0029] The near-memory reordering acceleration module is deployed near the main memory, the memory expansion module, the near-memory logic layer, or the DRAM side within the device, in a location close to large-capacity storage resources.
[0030] The in-memory conversion encoding acceleration module adopts a shared cache enhancement structure, an SRAM array computing structure, or other array structures that support in-memory computing, and is deployed in an array structure that supports in-memory computing primitives, such as a shared cache, an on-chip SRAM macrocell, or an accelerated cache.
[0031] The dedicated acceleration module for entropy coding is implemented using an ASIC, a dedicated coprocessor, or a restricted configurable coding unit.
[0032] In one embodiment of the present invention, the shared buffer and intermediate result storage module is divided into an input buffer, a reordering result area, a conversion encoding result area, an entropy encoding result area, and an encapsulation buffer area. Each area is allocated and its status is managed by the unified flow scheduling control module. Data is exchanged between modules using a low-copy or zero-copy method. The address, length, and type information of the output result of the previous stage are directly used as the input fields of the task object of the next stage.
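By way of illustration only, the zero-copy handover of [0032] can be mimicked in Python with a `memoryview` over one shared `bytearray`. The region layout, sizes, and the helper names `write_result` and `input_window` are assumptions for this sketch.

```python
pool = bytearray(64)          # stands in for the shared buffer
view = memoryview(pool)

# Assumed logical partitions: region name -> (base address, capacity)
REGIONS = {"reorder": (0, 32), "transform": (32, 32)}

def write_result(region, data, result_type):
    """Write a stage result into its region and describe it by address, length, type."""
    base, capacity = REGIONS[region]
    if len(data) > capacity:
        raise ValueError("result does not fit in the region")
    view[base:base + len(data)] = data
    return {"addr": base, "length": len(data), "type": result_type}

def input_window(result):
    """Expose the previous result as the next stage's input without copying it."""
    return view[result["addr"]:result["addr"] + result["length"]]

result = write_result("reorder", b"ACGTACGA", "reordered_reads")
window = input_window(result)
```

Because `window` is a view rather than a copy, the next stage reads the bytes in place; only the address, length, and type travel between stages.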
[0033] In one embodiment of the present invention, the shared buffer and intermediate result storage module is implemented using a circular buffer, a paged buffer, or a logical partitioned buffer.
[0034] In one embodiment of the present invention, it further includes: an encapsulation and indexing module, which is connected to the shared buffer and intermediate result storage module and the unified flow scheduling control module; the encapsulation and indexing module is used to organize the entropy encoding results, descriptor metadata, access unit boundaries, descriptor block boundaries and index information stored in the shared buffer and intermediate result storage module to generate a compressed file or compressed data object that conforms to the target standard syntax.
[0035] In one embodiment of the present invention, the system is deployed on a host-side platform or inside a storage device.
[0036] In another aspect, the present invention provides a storage device, comprising at least: a device front-end command parsing module, a heterogeneous in-memory computing acceleration system, and a non-volatile storage medium;
[0037] The device front-end command parsing module is connected to the host side to receive write requests with compression attributes sent by the host side, identify the write requests, and cache the gene sequencing data to be compressed.
[0038] Heterogeneous in-memory computing acceleration system, including:
[0039] The main control processing module is used to receive gene sequencing data to be compressed and generate multiple task objects;
[0040] A unified flow scheduling and control module, connected to the main control processing module, is used to generate a task descriptor for each task object and perform stage-aware mapping according to stage attributes, operator attributes and data attributes, and dispatch different task objects to the corresponding target execution modules.
[0041] The near-memory reordering acceleration module is connected to the unified flow scheduling control module and is configured to receive reordering-related task objects for global redundancy mining and perform reordering operations.
[0042] The in-memory conversion encoding acceleration module is connected to the unified flow scheduling control module to receive conversion encoding related task objects extracted for local patterns, and to perform matching encoding or lookup table encoding according to the task type.
[0043] An entropy coding-specific acceleration module is connected to the unified flow scheduling and control module to receive entropy coding-related task objects and perform entropy coding operations.
[0044] The shared buffer and intermediate result storage module is connected to the unified flow scheduling control module, the near-memory reordering acceleration module, the in-memory conversion encoding acceleration module, and the entropy encoding dedicated acceleration module, respectively, to buffer the output results of each module;
[0045] The encapsulation and indexing module is connected to the shared buffer and intermediate result storage module and the unified flow scheduling control module. The encapsulation and indexing module is used to organize the entropy encoding results, descriptor metadata, access unit boundaries, descriptor block boundaries and index information stored in the shared buffer and intermediate result storage module to generate compressed files or compressed data objects that conform to the target standard syntax.
[0046] A non-volatile storage medium is used to store the compressed file or compressed data object that is finally output by the heterogeneous in-memory computing acceleration system.
[0047] In another aspect, the present invention provides a heterogeneous in-memory computing acceleration method, comprising:
[0048] Receive gene sequencing data to be compressed and generate multiple task objects;
[0049] For each task object, a task descriptor is generated, and stage-aware mapping is performed according to stage attributes, operator attributes, and data attributes to dispatch different task objects to the corresponding target execution domain.
[0050] Each target execution domain completes the corresponding stage processing according to the task descriptor, writes the processing results back to the shared buffer storage area, and returns the result descriptor;
[0051] The next stage task is triggered based on the result descriptor and task dependencies to achieve multi-stage asynchronous pipeline;
[0052] The compression results from all stages of processing are encapsulated and indexed to generate output compressed files or compressed data objects.
[0053] In one embodiment of the present invention, dispatching different task objects to corresponding target execution domains includes:
[0054] Select reordering-related task objects for global redundancy mining and dispatch them to the near-memory reordering execution domain to perform reordering operations;
[0055] Select the conversion encoding related task objects extracted for local patterns and dispatch them to the in-memory conversion encoding execution domain to perform matching encoding or lookup table encoding according to the task type;
[0056] Select entropy-encoding related task objects and dispatch them to the entropy-encoding execution domain to perform entropy-encoding operations;
[0057] Select task objects related to control, organization, non-hotspot descriptor generation, or exception handling and dispatch them to the main control execution domain.
[0058] As can be seen from the above solutions, the advantages of the present invention are:
[0059] The heterogeneous in-memory computing acceleration system provided by this invention includes a main control processing module for receiving gene sequencing data to be compressed and generating multiple task objects; a unified flow scheduling control module for generating task descriptors for each task object and dispatching different task objects to corresponding target execution modules through stage-aware mapping; a near-memory reordering acceleration module for receiving reordering-related task objects for global redundancy mining and performing reordering operations; an in-memory conversion coding acceleration module for receiving conversion coding-related task objects for local pattern extraction and performing matching coding or lookup table coding according to task type; a dedicated entropy coding acceleration module for performing entropy coding operations; and a shared buffer and intermediate result storage module for buffering the results output by each module. This invention matches heterogeneous hardware to the task characteristics of the different hotspot stages in the standard compression process of gene sequencing data. It adopts a stage-aware architecture mapping method to deploy global redundancy mining tasks, local fine-grained comparison tasks, and entropy coding tasks in different execution locations, avoiding the resource mismatch problem caused by a single architecture simultaneously handling all stages. The unified flow scheduling and control module organizes the originally fragmented multi-stage compression process into a continuous task flow and data flow, which can reduce the overhead of repeatedly returning to the host side for fine-grained control and data replication between stages, and reduce system-level data migration costs and synchronization costs.
Attached Figure Description
[0060] Figure 1 is a general block diagram of a heterogeneous in-memory computing acceleration system provided in an embodiment of the present invention;
[0061] Figure 2 is the unified data / task control and pipeline scheduling diagram of the present invention;
[0062] Figure 3 is a stage-operator-execution location mapping diagram for the present invention;
[0063] Figure 4 is a general block diagram of a storage device provided in an embodiment of the present invention;
[0064] Figure 5 is a flowchart illustrating a heterogeneous in-memory computing acceleration method provided in an embodiment of the present invention.
[0065] The attached figures are labeled as follows:
[0066] 100: Heterogeneous in-memory computing acceleration system;
[0067] 10: Main control processing module;
[0068] 20: Unified flow scheduling control module;
[0069] 30: Near-memory reordering acceleration module;
[0070] 40: In-memory conversion encoding acceleration module;
[0071] 50: Dedicated acceleration module for entropy coding;
[0072] 60: Shared buffer and intermediate result storage module;
[0073] 70: Encapsulation and Indexing Module;
[0074] 80: Storage control and bus interface module;
[0075] 90: External or internal storage resources;
[0076] 200: Device front-end command parsing module;
[0077] 300: Non-volatile storage medium;
[0078] 400: Host.
Detailed Implementation
[0079] It should be noted that, in this invention, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus.
[0080] In the absence of further restrictions, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0081] Example 1:
[0082] Referring to Figure 1, this embodiment provides a heterogeneous in-memory computing acceleration system 100 deployed on a host-side platform. The system includes a main control processing module 10, a unified flow scheduling control module 20, a near-memory reordering acceleration module 30, an in-memory conversion encoding acceleration module 40, an entropy encoding dedicated acceleration module 50, a shared buffer and intermediate result storage module 60, an encapsulation and indexing module 70, and a storage control and bus interface module 80.
[0083] The main control processing module 10 is configured to perform input data reception, format recognition, field splitting, preprocessing, non-hotspot descriptor generation, flow control, exception handling, file organization, index organization, and output management.
[0084] In a preferred implementation, the main control processing module 10 may be composed of a single core or multiple general-purpose processing cores, lightweight programmable processing cores or embedded control cores, so as to provide a control basis for heterogeneous in-memory computing execution paths with minimal changes to the traditional processor architecture.
[0085] Specifically, the main control processing module 10 receives the gene sequencing data to be compressed and generates multiple task objects. First, it performs field splitting, dividing the gene sequencing data to be compressed into an identifier stream, a base sequence stream, a quality score stream, and a related control information stream. Then, the main control processing module 10 organizes the split fields into access units, descriptor blocks, field groups, or task blocks at different granularities according to preset compression rules. Here, an access unit can be a logical processing unit divided by the number of records, bytes, field type, or application semantics; a descriptor block can be a set of descriptors for a specific field within a corresponding access unit; and a task block can be a stage subtask within an access unit or a combination of several operations. Task objects are then generated according to different granularities such as access units, descriptor blocks, field groups, or task blocks. Furthermore, the gene sequencing data to be compressed is the data that the host side requests to write, preferably raw read data objects, which can be FASTQ files, record streams, or other equivalent input formats.
[0086] Referring to Figures 2 and 3, Figure 2 is the unified data / task control and pipeline scheduling diagram of the unified flow scheduling control module 20, and Figure 3 is the stage-operator-execution location mapping diagram of the present invention. The unified flow scheduling control module 20 is connected to the main control processing module 10, the near-memory reordering acceleration module 30, the in-memory conversion encoding acceleration module 40, the shared buffer and intermediate result storage module 60, the encapsulation and indexing module 70, and the storage control and bus interface module 80. The unified flow scheduling control module 20 is responsible for establishing continuous data flow, task flow, and control flow between each execution domain. It is configured to generate task descriptors and result descriptors according to the task characteristics of each stage and operator in the standard compression process, maintain the task status table, dependency table, and shared buffer status, and complete stage-aware mapping, task dispatch, dependency wake-up, pipeline scheduling, completion write-back, and abnormal rollback control. Among them, the task status table records the full lifecycle attributes of all generated task descriptors, including task number, compression stage, target execution module, priority, current status (pending dispatch / in execution / completed / abnormally terminated), execution time statistics, and other fields. The dependency table records the prerequisite dependencies between tasks, including fields such as the prerequisite task number, dependency satisfaction condition, and successor task number for each task. For example, the prerequisite dependency of a transformation encoding task within the same access unit is the reordering task of that access unit, and the prerequisite dependency of an entropy encoding task is the transformation encoding task of that access unit. 
The shared buffer status records the real-time status of each logical partition of the shared buffer and intermediate result storage module, including fields such as the starting address, occupied length, producer module number, consumer module number, validity status (writable / writing in progress / readable / consumed), and data alignment.
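By way of illustration only, the task status table, the dependency table, and the dependency wake-up described in [0086] can be sketched as below. The class name `Scheduler`, the status strings, and the restriction to a single prerequisite per task are assumptions for this sketch.

```python
from collections import deque

class Scheduler:
    def __init__(self):
        self.status = {}       # task status table: task id -> "pending" | "ready" | "done"
        self.depends_on = {}   # dependency table: task id -> prerequisite task id or None
        self.successors = {}   # prerequisite task id -> tasks waiting on it
        self.ready = deque()   # tasks that can be dispatched now

    def add(self, task_id, prerequisite=None):
        """Register a task; it is ready at once if it has no unmet prerequisite."""
        self.depends_on[task_id] = prerequisite
        if prerequisite is None or self.status.get(prerequisite) == "done":
            self.status[task_id] = "ready"
            self.ready.append(task_id)
        else:
            self.status[task_id] = "pending"
            self.successors.setdefault(prerequisite, []).append(task_id)

    def complete(self, task_id):
        """Mark a task done and wake the tasks that were waiting on it."""
        self.status[task_id] = "done"
        woken = self.successors.pop(task_id, [])
        for successor in woken:
            self.status[successor] = "ready"
            self.ready.append(successor)
        return woken

sched = Scheduler()
sched.add(("reorder", 0))
sched.add(("transform", 0), prerequisite=("reorder", 0))
sched.add(("entropy", 0), prerequisite=("transform", 0))
woken = sched.complete(("reorder", 0))
```

Completing the reordering task of access unit 0 wakes only its transformation encoding task; the entropy encoding task stays pending until that one completes in turn.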
[0087] In a preferred implementation, the unified flow scheduling control module 20 can be implemented as an independent control logic or integrated into the main control processing module 10, shared cache, memory controller, or device controller.
[0088] Specifically, the unified flow scheduling control module generates task descriptors for each task object to identify the task. Then, it performs stage-aware mapping according to stage attributes, operator attributes, and data attributes to generate near-memory task queues, in-memory task queues, and entropy coding task queues, which are then mapped to target execution modules such as the near-memory reordering acceleration module 30, the in-memory conversion coding acceleration module 40, and the entropy coding dedicated acceleration module 50. Specifically, if a task is a control task, an organization task, a non-hotspot descriptor generation task, or an exception handling task, it is executed by the main control processing module 10. If a task is a reordering-related task for global redundancy mining, it is executed by the near-memory reordering acceleration module 30. If a task is a conversion coding-related task for local pattern extraction, it is executed by the in-memory conversion coding acceleration module 40. If a task is an entropy coding-related task, it is executed by the entropy coding dedicated acceleration module 50.
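By way of illustration only, the mapping rules of [0088] can be written as a lookup from task kind to execution domain. The sketch keys only on a task-kind string, whereas the application maps on stage, operator, and data attributes together; the kind names, the queue names, and the default of sending unknown kinds to the main control domain are assumptions.

```python
from collections import defaultdict

KIND_TO_DOMAIN = {
    "control": "main",
    "organization": "main",
    "descriptor_generation": "main",
    "exception_handling": "main",
    "reorder": "near_memory",       # global redundancy mining
    "match_encode": "in_memory",    # local pattern extraction
    "table_encode": "in_memory",
    "entropy": "entropy",
}

def dispatch(tasks):
    """Place each (task id, kind) pair on the queue of its target execution domain."""
    queues = defaultdict(list)
    for task_id, kind in tasks:
        domain = KIND_TO_DOMAIN.get(kind, "main")  # assumed default for unknown kinds
        queues[domain].append(task_id)
    return dict(queues)

queues = dispatch([
    (1, "reorder"),
    (2, "match_encode"),
    (3, "entropy"),
    (4, "exception_handling"),
    (5, "table_encode"),
])
```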
[0089] In a preferred implementation, to ensure that the above mapping truly forms an end-to-end system, a unified task descriptor is defined. The task descriptor includes at least the task number, process stage number, operator type, input address, input length, output address, reserved output length, target execution domain identifier, context parameter pointer, dependency identifier, priority, completion status, exception status, and rollback flag. Preferably, it may also include access unit number, descriptor block number, field type, data alignment method, compression strategy identifier, and verification field. Correspondingly, the result descriptor includes at least the result address, result length, result type, stage completion identifier, suggested mapping position for the next stage, valid bits, and error code.
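For illustration, the minimum field sets of the two descriptors could be expressed as plain records. This is a sketch only: the field names, types, and default values are assumptions, and `fits_reserved_output` is a hypothetical consistency check that the text does not describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskDescriptor:
    task_id: int
    stage_id: int
    operator_type: str
    input_addr: int
    input_len: int
    output_addr: int
    reserved_output_len: int
    target_domain: int
    context_ptr: int = 0
    dependency_id: Optional[int] = None
    priority: int = 0
    completed: bool = False
    exception: bool = False
    rollback: bool = False


@dataclass
class ResultDescriptor:
    result_addr: int
    result_len: int
    result_type: str
    stage_done: bool
    next_mapping_hint: int     # suggested mapping position for the next stage
    valid: bool = True
    error_code: int = 0


def fits_reserved_output(task, result):
    """True if a valid result lies inside the output window reserved by its task."""
    window_end = task.output_addr + task.reserved_output_len
    return (
        result.valid
        and result.error_code == 0
        and task.output_addr <= result.result_addr
        and result.result_addr + result.result_len <= window_end
    )
```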
[0090] In a preferred implementation, the unified flow scheduling control module 20 maintains a task state table and a dependency graph. When the result of the preceding stage for a given access unit or descriptor block meets the requirements of the subsequent stage, the unified flow scheduling control module 20 can trigger the next-stage task without waiting for the entire file or the entire dataset to be processed. This forms the following asynchronous pipeline: while a reordering task is performed on the i-th access unit, a conversion encoding operation is performed on the (i-1)-th access unit, an entropy encoding operation is performed on the (i-2)-th access unit, and encapsulation and index organization operations are performed on the (i-3)-th access unit, thereby reducing the serial waiting time between stages.
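The overlap pattern described here can be tabulated with a short sketch. It assumes an idealized lockstep pipeline in which every stage takes one time step; the actual scheduling is asynchronous and driven by result descriptors, so this only illustrates which access units are in flight together. The stage labels and function name are hypothetical.

```python
STAGES = ["reorder", "convert", "entropy", "package"]


def pipeline_schedule(num_units):
    """Return, for each time step, a {stage: access_unit_index} map."""
    steps = []
    for t in range(num_units + len(STAGES) - 1):
        active = {}
        for depth, stage in enumerate(STAGES):
            unit = t - depth          # stage at depth d works on unit t - d
            if 0 <= unit < num_units:
                active[stage] = unit
        steps.append(active)
    return steps
```

For five access units, step 3 reorders unit 3 while unit 2 is in conversion encoding, unit 1 in entropy encoding, and unit 0 in encapsulation, which matches the i, i-1, i-2, i-3 pattern in the text.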
[0091] In a preferred implementation, the unified flow scheduling control module is further configured to: when it is detected that the input data does not conform to the gene sequencing data format, the device load is too high, the buffer is insufficient, the execution module is abnormal, or the compression result does not meet the preset threshold, execute the abnormal rollback mechanism, trigger the normal write rollback path, control the gene sequencing data to be compressed to be directly written to the non-volatile storage medium without compression, and record the rollback flag in the metadata.
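The rollback decision can be sketched as a sequence of checks. All thresholds (`max_load`, `max_ratio`), the parameter names, and the reason strings below are hypothetical; the text only lists the triggering conditions and does not say how they are measured or in what order they are evaluated.

```python
def fallback_reason(is_target_format, device_load, free_buffer, needed_buffer,
                    modules_ok, original_len, compressed_len=None,
                    max_load=0.9, max_ratio=0.95):
    """Return a reason string if the uncompressed write path should be used, else None."""
    if not is_target_format:
        return "format"
    if device_load > max_load:
        return "load"
    if free_buffer < needed_buffer:
        return "buffer"
    if not modules_ok:
        return "module"
    if compressed_len is not None and compressed_len > max_ratio * original_len:
        return "ratio"
    return None


def plan_write(raw, compressed, reason):
    """Choose the payload to store and record the rollback flag in the metadata."""
    if reason is None:
        return compressed, {"rollback": False, "original_len": len(raw)}
    return raw, {"rollback": True, "reason": reason, "original_len": len(raw)}
```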
[0092] In this embodiment, considering the heterogeneous characteristics of different hotspot stages in the standard compression process, a stage-aware architecture mapping approach is adopted. Global redundancy mining tasks, local fine-grained comparison tasks, and entropy coding tasks are deployed at different execution locations, avoiding resource mismatch issues caused by a single architecture handling all stages simultaneously. Simultaneously, a unified flow scheduling and control module organizes the originally fragmented multi-stage compression process into a continuous task and data flow, reducing the overhead of repeatedly returning to the host side for fine-grained control and data replication between stages, thus lowering system-level data migration and synchronization costs.
[0093] The near-memory reordering acceleration module 30 is connected to the unified flow scheduling control module 20 and the shared buffer and intermediate result storage module 60, and is configured to perform reordering-related tasks for global redundancy mining. These tasks are typically large working sets and irregular access tasks. Preferably, the near-memory reordering acceleration module 30 supports operations such as hash value lookup, bin range positioning, Hamming distance filtering, and candidate result organization, and is used to complete sequence reordering, candidate verification, and result output in a location close to large-capacity storage resources.
[0094] In a preferred implementation, the near-memory reordering acceleration module is deployed near the main memory, the memory expansion module, the near-memory logic layer, or the DRAM side within the device, close to high-capacity storage resources.
[0095] In a preferred implementation, after receiving the task descriptor related to the reordering stage, the near-memory reordering acceleration module 30 calls the hash value lookup subpath to determine the candidate range, then calls the bin range positioning and candidate filtering subpath to complete the sequence candidate identification, and outputs the reordering result, candidate association information, or intermediate data required for subsequent descriptor generation. To reduce host-side intervention, the near-memory reordering acceleration module 30 directly writes the result to the reordering result area in the shared buffer and intermediate result storage module 60, and simultaneously sends back the result descriptor, which is then determined by the unified flow scheduling control module 20 to decide whether to immediately trigger the subsequent conversion encoding task.
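A software analogue of the lookup, bin positioning, and filtering subpaths might look as follows. This is a sketch under assumptions: the text does not define the hash function, the bin layout, or the distance threshold, so the choice of the first `k` bases as the lookup key, the dictionary-based bins, and `max_dist` are all illustrative.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_bins(reads, k):
    """Group read indices by their first k bases; the dict hashes the key internally."""
    bins = defaultdict(list)
    for idx, read in enumerate(reads):
        bins[read[:k]].append(idx)
    return bins


def find_candidates(query_idx, reads, bins, k, max_dist):
    """Look up the query's bin, then keep same-length reads within max_dist mismatches."""
    query = reads[query_idx]
    bucket = bins.get(query[:k], [])
    return [
        i for i in bucket
        if i != query_idx
        and len(reads[i]) == len(query)
        and hamming(reads[i], query) <= max_dist
    ]
```

Reads that survive the filter are candidates for being placed next to the query read in the reordered output, so that later stages see similar sequences close together.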
[0096] The in-memory conversion encoding acceleration module 40 is connected to the unified flow scheduling control module 20 and the shared buffer and intermediate result storage module 60. It is configured to receive conversion encoding-related task objects for local pattern extraction and execute tasks related to conversion encoding, such as fine-grained comparison, position extraction, symbol matching, and lookup table processing. Preferably, the in-memory conversion encoding acceleration module 40 supports both matching encoding and lookup table encoding conversion operations, achieving in-situ or near-in-situ processing of high-frequency fine-grained operations through SRAM, cache arrays, or other storage array structures with computing capabilities.
[0097] In a preferred implementation, the in-memory conversion encoding acceleration module adopts a shared cache enhancement structure, an SRAM array computing structure, or other array structures that support in-memory computing, and is deployed in an array structure that supports in-memory computing primitives, such as a shared cache, an on-chip SRAM macrocell, or an accelerated cache.
[0098] In a preferred implementation, after receiving the conversion encoding task descriptor, the in-memory conversion encoding acceleration module 40 performs either match encoding or lookup table encoding based on the task type. For match encoding tasks, the in-memory conversion encoding acceleration module 40 performs fine-grained pattern comparison, match length determination, position output, and symbol generation; for lookup table encoding tasks, the in-memory conversion encoding acceleration module 40 performs table entry matching, position extraction, symbol compression, and result merging. During execution, the in-memory conversion encoding acceleration module 40 preferably completes high-frequency fine-grained operations through array comparison, position extraction, and local result collection, and writes the results to the conversion encoding result area in the shared buffer and intermediate result storage module 60.
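The two task types can be illustrated in software, with the caveat that the module described here performs them inside a memory array rather than in a loop. The sketch assumes a reference sequence to match against and a 2-bit code table for the four bases; neither the output format nor the table contents are specified in the text.

```python
def match_encode(read, ref):
    """Match encoding: matching prefix length, (position, symbol) mismatches, unmatched tail."""
    n = min(len(read), len(ref))
    prefix = 0
    while prefix < n and read[prefix] == ref[prefix]:
        prefix += 1
    mismatches = [(i, read[i]) for i in range(prefix, n) if read[i] != ref[i]]
    return prefix, mismatches, read[n:]


# Assumed lookup table: one 2-bit code per base.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def table_encode(seq):
    """Lookup table encoding: pack four bases per byte, lowest bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_CODE[base] << (2 * (i % 4))
    return len(seq), bytes(out)
```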
[0099] The dedicated entropy coding acceleration module 50 is connected to the shared buffer and intermediate result storage module 60 and configured to perform entropy coding tasks. Preferably, the dedicated entropy coding acceleration module 50 supports context modeling, binarization, codeword generation, and bitstream output, and can be implemented using fixed functional units, dedicated coprocessors, or restricted configurable coding units. The dedicated entropy coding acceleration module 50 receives descriptor data from the main control processing module 10 or the in-memory conversion coding acceleration module 40, and outputs compressed bitstream segments and related status information.
[0100] In a preferred implementation, the entropy coding dedicated acceleration module is implemented using an ASIC, a dedicated coprocessor, or a restricted configurable coding unit.
[0101] In a preferred implementation, the entropy coding dedicated acceleration module 50 receives the descriptor data stream from the main control processing module 10 or the in-memory conversion coding acceleration module 40, and performs binarization, context modeling, codeword updating, and code stream output according to context parameters. The code stream segments and related state information generated by the entropy coding dedicated acceleration module 50 are written to the entropy coding result area in the shared buffer and intermediate result storage module 60.
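To make the binarization and context modeling steps concrete, the sketch below binarizes small integers and estimates their coded size with adaptive per-context bit counts. It is not a bitstream generator: it computes the ideal code length that an arithmetic coder driven by these probabilities would approach. Unary binarization, count-based probabilities, and selecting the context by bin position are all assumptions, since the text does not fix any of them.

```python
import math


def unary_binarize(value):
    """Unary binarization: `value` ones followed by a terminating zero."""
    return [1] * value + [0]


class ContextModel:
    """Adaptive binary model with Laplace-smoothed counts."""

    def __init__(self):
        self.counts = [1, 1]   # occurrences of bit 0 and bit 1

    def cost_and_update(self, bit):
        p = self.counts[bit] / sum(self.counts)
        self.counts[bit] += 1  # adapt after coding the bit
        return -math.log2(p)   # ideal code length in bits


def estimate_bits(values, num_contexts=4):
    """Sum the ideal code length over all bins, one context per bin position."""
    models = [ContextModel() for _ in range(num_contexts)]
    total = 0.0
    for v in values:
        for pos, bit in enumerate(unary_binarize(v)):
            ctx = min(pos, num_contexts - 1)
            total += models[ctx].cost_and_update(bit)
    return total
```

Because the model adapts, a run of identical symbols costs progressively less: eight zeros cost about log2(9) bits in total rather than eight.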
[0102] The shared buffer and intermediate result storage module 60 is connected to the unified flow scheduling control module 20, the near-memory reordering acceleration module 30, the in-memory conversion encoding acceleration module 40, the entropy encoding dedicated acceleration module 50, the encapsulation and indexing module 70, and the storage control and bus interface module 80, and is used to cache input data objects, intermediate descriptor results, stage output results, code stream fragments, and encapsulation metadata. The shared buffer and intermediate result storage module 60 is preferably divided into an input buffer, a reordering result area, a conversion encoding result area, an entropy encoding result area, and an encapsulation buffer area. Each area is allocated and its status managed by the unified flow scheduling control module 20. Preferably, data is handed over between stages using a low-copy or zero-copy method, meaning the address, length, and type information of the previous stage's output result are directly used as the input fields of the task descriptor for the next stage.
[0103] To ensure efficient data transfer, the shared buffer and intermediate result storage module 60 preferably employs a circular buffer, a paged buffer, or a logically partitioned buffer. The unified flow scheduling control module 20 records the producer module identifier, consumer module identifier, occupancy status, data type, length, alignment, and valid bit information for each buffer block. After the previous stage ends, the result address and length required for the next stage task are simply passed to the unified flow scheduling control module 20 via the result descriptor, allowing the subsequent stage to directly read the corresponding data, reducing unnecessary intermediate copying.
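The handover by address and length can be mimicked in software with a `memoryview` over one shared byte pool. This is an analogy only: the dictionary keys and function names are hypothetical, and the hardware buffer is not assumed to behave like a Python object.

```python
def produce(pool, addr, payload):
    """Producer stage: write its output into the pool and return a result descriptor."""
    pool[addr:addr + len(payload)] = payload
    return {"result_addr": addr, "result_len": len(payload), "result_type": "reordered"}


def consume_view(pool, result):
    """Consumer stage: obtain its input as a view on the same memory, without copying."""
    start = result["result_addr"]
    return pool[start:start + result["result_len"]]
```

A usage example:

```python
buf = bytearray(64)
pool = memoryview(buf)
result = produce(pool, 8, b"ACGT")
view = consume_view(pool, result)   # shares storage with buf
```

Only the small descriptor travels between stages; the payload stays where the producer wrote it.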
[0104] In this embodiment, a shared buffer and intermediate result storage module 60, as well as a task descriptor and result descriptor mechanism, are adopted to enable different acceleration units to collaborate around the same set of stage semantics and task boundaries. This is beneficial for forming asynchronous pipelines across access units and descriptor blocks, thereby improving the throughput of the entire compression process.
[0105] The encapsulation and indexing module 70 is connected to the unified flow scheduling control module 20 and the shared buffer and intermediate result storage module 60. It is used to organize the entropy encoding results, related descriptor metadata, access unit boundaries, descriptor block boundaries and index information to generate compressed files, compressed objects or compressed data pages within the device that conform to the target standard syntax.
[0106] The storage control and bus interface module 80 is connected to the unified flow scheduling control module 20 and the external or internal storage resources 90. It is used to realize data exchange between the system and the external or internal storage resources 90 or the external bus, and is responsible for DMA transfer, command issuance, status readback and address mapping.
[0107] External or internal storage resources 90 are connected to storage control and bus interface module 80, shared buffer and intermediate result storage module 60, and encapsulation and indexing module 70 to store the full dataset of raw gene sequencing data without preprocessing, as well as the finally generated gene sequencing compressed file that conforms to the standard compression format, supporting long-term data retention and cross-device access.
[0108] In a preferred implementation, multiple task granularities can be selected without altering the standard compression semantics. When the data volume is large, scheduling can be performed at the access unit granularity; when hot tasks are concentrated in certain fields, scheduling can be performed at the field group granularity; when a hot link has stronger locality, scheduling can also be performed at a finer descriptor block granularity. The unified flow scheduling control module 20 can dynamically adjust the task granularity and priority based on the input size, buffer occupancy, and device load to improve overall throughput.
[0109] The storage control and bus interface module 80 is used to realize data exchange between the system and main memory, internal storage resources of the device or external bus, and is responsible for DMA transfer, command issuance, status readback and address mapping.
[0110] In a preferred implementation, to support integration into traditional processor architectures with few modifications, the near-memory reordering acceleration module 30, the in-memory conversion encoding acceleration module 40, and the entropy encoding dedicated acceleration module 50 are preferably connected to the platform where the main control processing module 10 resides through a coprocessor interface, shared buffer interface, shared cache interface, or memory controller extension interface. In this way, the system retains the stage semantics of the standard compression process and the main control execution path at the software level, while distributing the hotspot stages across the heterogeneous execution modules, which helps reduce modification costs and improve system compatibility.
[0111] Example 2:
[0112] In this embodiment, the aforementioned heterogeneous in-memory computing acceleration system is deployed inside a storage device. Refer to Figure 4, which is a structural diagram of the heterogeneous in-memory computing acceleration system deployed inside a storage device.
[0113] The storage device includes at least: a device front-end command parsing module 200, a heterogeneous in-memory computing acceleration system 100, and a non-volatile storage medium 300.
[0114] The device front-end command parsing module 200 is connected to the host 400 and is used to receive write requests with compression attributes sent by the host, identify the write requests, and cache the gene sequencing data to be compressed.
[0115] A heterogeneous in-memory computing acceleration system 100 includes: a main control processing module 10, used to receive gene sequencing data to be compressed and generate multiple task objects; a unified flow scheduling control module 20, connected to the main control processing module 10, used to generate a task descriptor for each task object and perform stage-aware mapping according to stage attributes, operator attributes, and data attributes, dispatching different task objects to the corresponding target execution modules; a near-memory reordering acceleration module 30, connected to the unified flow scheduling control module, configured to receive reordering-related task objects for global redundancy mining and perform reordering operations; an in-memory conversion coding acceleration module 40, connected to the unified flow scheduling control module, used to receive conversion coding-related task objects for local pattern extraction and perform matching coding or lookup table coding according to the task type; and an entropy coding dedicated acceleration module 50, connected to the unified flow scheduling control module, used to receive entropy coding-related task objects and perform entropy coding operations. A shared buffer and intermediate result storage module 60 is connected to the unified flow scheduling control module, the in-memory conversion encoding acceleration module, and the entropy encoding dedicated acceleration module, respectively, to buffer the output results of each module. 
An encapsulation and indexing module 70 is connected to both the shared buffer and intermediate result storage module and the unified flow scheduling control module; this module organizes the entropy encoding results, descriptor metadata, access unit boundaries, descriptor block boundaries, and index information stored in the shared buffer and intermediate result storage module to generate compressed files or compressed data objects conforming to the target standard syntax.
[0116] The non-volatile storage medium 300 is connected to the storage control and bus interface module 80 of the heterogeneous in-memory computing acceleration system 100, and is used to store the compressed file or compressed data object finally output by the heterogeneous in-memory computing acceleration system. The non-volatile storage medium can be NAND flash memory, 3D NAND flash memory, or other non-volatile media that can be used for large-capacity storage.
[0117] The host initiates a compression write request to the storage device. The compression write request preferably includes a logical write address, the length of the data to be written, a data type identifier, a compression strategy identifier, and integrity options. Upon receiving the request, the device front-end command parsing module 200 temporarily stores the gene sequencing data to be written in the device's internal cache and notifies the main control processing module 10 to start compression preprocessing. The main control processing module 10 then performs field splitting, access unit partitioning, and task construction. The unified flow scheduling control module 20 calls the near-memory reordering acceleration module 30, the in-memory conversion encoding acceleration module 40, and the entropy encoding dedicated acceleration module 50 to handle the corresponding hotspot stages according to the task descriptor. After reordering, conversion encoding, and entropy encoding are completed within the device, the encapsulation and indexing module 70 combines the compressed bitstream, access unit boundaries, descriptor block boundaries, metadata index, and verification information into a compressed write object within the device, which is then written to the non-volatile storage medium 300 by the storage control and bus interface module 80. During this process, the host does not need to be aware of the fine-grained execution details of each stage within the device; it only needs to wait for the device to return a completion status. In this way, the main execution of the standard compression process can be moved into the storage device, reducing the burden on the host side.
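The field splitting and access unit partitioning performed by the main control processing module can be sketched as follows. The sketch assumes the input is four-line FASTQ text, which the description does not state, and the stream names, the `reads_per_unit` parameter, and both function names are hypothetical.

```python
def split_fields(fastq_text):
    """Split four-line FASTQ records into identifier, sequence, and quality streams."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4:
        raise ValueError("incomplete FASTQ record")
    streams = {"identifier": [], "sequence": [], "quality": []}
    for i in range(0, len(lines), 4):
        ident, seq, plus, qual = lines[i:i + 4]
        if not ident.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed record at line {i + 1}")
        streams["identifier"].append(ident[1:])
        streams["sequence"].append(seq)
        streams["quality"].append(qual)
    return streams


def partition_access_units(streams, reads_per_unit):
    """Cut every stream into consecutive access units of at most reads_per_unit reads."""
    n = len(streams["sequence"])
    return [
        {name: values[start:start + reads_per_unit] for name, values in streams.items()}
        for start in range(0, n, reads_per_unit)
    ]
```

A malformed record raises `ValueError`, which a caller could treat as the "input data is not in the target format" condition that selects the uncompressed write path.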
[0118] It should be noted that the technical details of the heterogeneous in-memory computing acceleration system 100 described in the above embodiment still apply to this embodiment and are not repeated here.
[0119] In this embodiment, the metadata stored internally by the storage device preferably includes: original length, compressed length, access unit location table, descriptor block location table, compression parameters, checksum, write timestamp, format version number, and rollback flag. Preferably, the metadata can be stored separately in a metadata page, or it can be stored together with the compressed data object in the same logical mapping unit. When a subsequent read request occurs, the device can return the compressed data object according to the configuration, or return the original data after decompression within the storage device.
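A possible shape for part of this metadata, and for the read path that consults it, is sketched below. Python's `zlib` stands in for the device's compression pipeline purely for illustration; it is not the codec described in this document. The dictionary keys and function names are hypothetical, and only a subset of the listed metadata fields is shown.

```python
import time
import zlib


def make_metadata(raw, stored, rollback):
    """Record a subset of the metadata fields for one stored object."""
    return {
        "original_len": len(raw),
        "compressed_len": len(stored),
        "checksum": zlib.crc32(raw),
        "write_timestamp": time.time(),
        "format_version": 1,          # assumed version number
        "rollback": rollback,
    }


def read_object(stored, meta, return_original):
    """Return the stored object as is, or the verified original data after decompression."""
    if meta["rollback"] or not return_original:
        return stored                 # rollback objects were written uncompressed
    raw = zlib.decompress(stored)     # stand-in for in-device decompression
    if len(raw) != meta["original_len"] or zlib.crc32(raw) != meta["checksum"]:
        raise ValueError("integrity check failed")
    return raw
```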
[0120] This embodiment also includes an exception fallback path. When the device front-end command parsing module 200 detects that the input data is not in the target gene sequencing data format, or when the unified flow scheduling control module 20 determines that the current device load is too high, the buffer is insufficient, a certain module is abnormal, or the compression result does not meet the preset threshold, a normal write path can be triggered. In this case, the data is written directly to the non-volatile storage medium 300 without compression, and a fallback flag is recorded in the metadata to ensure system robustness and write request success rate.
[0121] In the form of a storage device, the compression of gene sequencing data can be moved to the data storage stage, and the compression can be completed inside the device before being written to non-volatile media. This reduces the processing burden on the host side, reduces the amount of bus transmission between the host and the storage device, and improves the utilization of storage space.
[0122] Example 3:
[0123] Refer to Figure 5, which is a flowchart of a heterogeneous in-memory computing acceleration method.
[0124] A heterogeneous in-memory computing acceleration method specifically includes the following steps:
[0125] Step S1: Receive the gene sequencing data to be compressed and generate multiple task objects.
[0126] Step S2: Generate a task descriptor for each task object, and perform stage-aware mapping according to stage attributes, operator attributes and data attributes, and dispatch different task objects to the corresponding target execution domain.
[0127] Step S3: Each target execution domain completes the corresponding stage processing according to the task descriptor, writes the processing results back to the shared buffer storage area, and returns the result descriptor.
[0128] Step S4: Trigger the next stage task based on the result descriptor and task dependency relationship to realize multi-stage asynchronous pipeline.
[0129] Step S5: Encapsulate and index the compression results after all stages of processing to generate an output compressed file or compressed data object.
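Steps S1 to S5 can be sketched as an event-driven loop in which each completed stage enqueues its successor. This is a single-threaded illustration under assumptions: the stage labels, the `handlers` mapping, and the first-in-first-out ready queue are hypothetical, and real execution domains would run concurrently.

```python
from collections import deque

# Hypothetical stage order; None marks the end of the chain.
NEXT_STAGE = {"reorder": "convert", "convert": "entropy", "entropy": "package", "package": None}


def run_pipeline(access_units, handlers):
    """Run every access unit through all stages; return packaged results and an execution trace."""
    # S1 and S2: one initial task per access unit, mapped to the reordering stage.
    ready = deque(("reorder", i, unit) for i, unit in enumerate(access_units))
    packaged, trace = {}, []
    while ready:
        stage, unit_idx, data = ready.popleft()
        result = handlers[stage](data)                 # S3: the execution domain does its stage
        trace.append((stage, unit_idx))
        successor = NEXT_STAGE[stage]
        if successor is None:
            packaged[unit_idx] = result                # S5: encapsulated output
        else:
            ready.append((successor, unit_idx, result))  # S4: trigger the next stage
    return [packaged[i] for i in range(len(access_units))], trace
```

The trace shows that conversion encoding of unit 0 is scheduled as soon as its own reordering result exists, without waiting for the whole input to pass through reordering first.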
[0130] In one embodiment of the present invention, dispatching different task objects to corresponding target execution domains includes:
[0131] Select reordering-related task objects for global redundancy mining and dispatch them to the near-memory reordering execution domain to perform reordering operations;
[0132] Select conversion encoding-related task objects for local pattern extraction and dispatch them to the in-memory conversion encoding execution domain to perform matching encoding or lookup table encoding according to the task type;
[0133] Select entropy encoding-related task objects and dispatch them to the entropy encoding execution domain to perform entropy encoding operations;
[0134] Select task objects related to control, organization, non-hotspot descriptor generation, or exception handling and dispatch them to the main control execution domain.
[0135] It should be noted that this method can be implemented in conjunction with the system embodiments described above. The near-memory reordering execution domain, the in-memory conversion encoding execution domain, the entropy encoding execution domain, and the main control execution domain correspond respectively to the near-memory reordering acceleration module 30, the in-memory conversion encoding acceleration module 40, the entropy encoding dedicated acceleration module 50, and the main control processing module 10 of the heterogeneous in-memory computing acceleration system shown in Figure 1. The technical details described in the system embodiments still apply to this method embodiment and are not repeated here.
[0136] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Under the guidance of the present invention, those skilled in the art can devise many other forms without departing from the spirit of the invention and the scope of the claims, and all of these forms fall within the protection scope of the present invention.
Claims
1. A heterogeneous in-memory computing acceleration system, characterized in that it includes: a main control processing module, used to receive gene sequencing data to be compressed and generate multiple task objects; a unified flow scheduling control module, connected to the main control processing module, used to generate a task descriptor for each task object, perform stage-aware mapping according to stage attributes, operator attributes and data attributes, and dispatch different task objects to the corresponding target execution modules; a near-memory reordering acceleration module, connected to the unified flow scheduling control module, used to receive reordering-related task objects for global redundancy mining and perform reordering operations; an in-memory conversion encoding acceleration module, connected to the unified flow scheduling control module, used to receive conversion encoding-related task objects for local pattern extraction and perform matching encoding or lookup table encoding according to the task type; an entropy encoding dedicated acceleration module, used to receive entropy encoding-related task objects and perform entropy encoding operations; and a shared buffer and intermediate result storage module, connected to the unified flow scheduling control module, the near-memory reordering acceleration module, the in-memory conversion encoding acceleration module, and the entropy encoding dedicated acceleration module, respectively, to buffer the output results of each module.
2. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that the main control processing module receiving the gene sequencing data to be compressed and generating multiple task objects includes: splitting the gene sequencing data to be compressed into fields, namely an identifier stream, a base sequence stream, a quality score stream, and a related control information stream; and, based on preset compression rules, organizing the split fields into different granularities such as access units, descriptor blocks, field groups, or task blocks, and generating multiple task objects according to the granularity types.
3. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that the main control processing module is also configured to perform tasks related to control, organization, non-hotspot descriptor generation, or exception handling.
4. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that, The unified flow scheduling control module is configured with a unified task descriptor, which includes at least the task number, process stage number, operator type, input address, input length, output address, reserved output length, target execution domain identifier, context parameter pointer, dependency identifier, priority, completion status, exception status, and rollback flag.
5. The heterogeneous in-memory computing acceleration system according to claim 4, characterized in that, The task descriptor also includes an access unit number, a descriptor block number, a field type, a data alignment method, a compression strategy identifier, and a verification field.
6. The heterogeneous in-memory computing acceleration system according to claim 4, characterized in that, The unified flow scheduling control module is also configured with a result descriptor, which includes at least the result address, result length, result type, stage completion identifier, next stage suggested mapping position, valid bits, and error code.
7. The heterogeneous in-memory computing acceleration system according to claim 4, characterized in that, The unified flow scheduling control module is also configured to perform pipeline scheduling. When the result of the preceding stage of the i-th access unit meets the requirements of the subsequent stage, the next stage task is triggered to form an asynchronous pipeline: while performing a reordering operation on the i-th access unit, a transformation encoding operation is performed on the (i-1)-th access unit, an entropy encoding operation is performed on the (i-2)-th access unit, and an encapsulation and index organization operation is performed on the (i-3)-th access unit. Based on the input size, buffer usage, and device load, task scheduling is dynamically selected to be performed at different granularities, such as access unit, descriptor block, field group, or task block.
8. The heterogeneous in-memory computing acceleration system according to claim 4, characterized in that, The unified flow scheduling control module is also used to: when it is detected that the input data does not conform to the data format, the device load is too high, the buffer is insufficient, the execution module is abnormal, or the compression result does not meet the preset threshold, execute the abnormal rollback mechanism, trigger the normal write rollback path, control the gene sequencing data to be compressed to be directly written to the non-volatile storage medium without compression, and record the rollback flag in the metadata.
9. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that, The reordering operations performed by the near-memory reordering acceleration module include hash value lookup, bin range positioning, Hamming distance filtering, and candidate result organization.
10. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that, When the in-memory conversion encoding acceleration module performs matching encoding, it specifically completes fine-grained pattern comparison, matching length determination, position output, and symbol generation operations; when the in-memory conversion encoding acceleration module performs lookup table encoding, it specifically completes table entry matching, position extraction, symbol compression, and result merging operations.
11. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that, The entropy coding acceleration module performs entropy coding operations including context modeling, binarization, codeword update, and code stream output.
12. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that, The main control processing module employs a single-core processor or multiple general-purpose processing cores, a lightweight programmable processing core, or an embedded control core; and / or, The unified flow scheduling control module is implemented as an independent control logic, or it is integrated into the main control processing module, shared cache, memory controller, or device controller. The near-memory reordering acceleration module is deployed near the main memory, the memory expansion module, the near-memory logic layer, or the DRAM side within the device, in a location close to large-capacity storage resources. The in-memory conversion encoding acceleration module adopts a shared cache enhancement structure, an SRAM array computing structure, or other array structures that support in-memory computing, and is deployed in an array structure that supports in-memory computing primitives, such as a shared cache, an on-chip SRAM macrocell, or an accelerated cache. The dedicated acceleration module for entropy coding is implemented using an ASIC, a dedicated coprocessor, or a restricted configurable coding unit.
13. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that, The shared buffer and intermediate result storage module is divided into an input buffer, a reordering result area, a conversion encoding result area, an entropy encoding result area, and a packaging buffer area. Each area is allocated and its status is managed by the unified flow scheduling control module. Data is exchanged between modules using a low-copy or zero-copy method. The address, length, and type information of the output result of the previous stage are directly used as the input fields of the task object of the next stage.
14. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that the shared buffer and intermediate result storage module is implemented as a circular buffer, a paged buffer, or a logically partitioned buffer.
15. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that it further comprises an encapsulation and indexing module connected to the shared buffer and intermediate result storage module and to the unified flow scheduling control module; the encapsulation and indexing module is used to organize the entropy encoding results, descriptor metadata, access unit boundaries, descriptor block boundaries, and index information stored in the shared buffer and intermediate result storage module, so as to generate compressed files or compressed data objects that conform to the target standard syntax.
16. The heterogeneous in-memory computing acceleration system according to claim 1, characterized in that the system is deployed on a host-side platform or inside a storage device.
17. A storage device, characterized in that it comprises at least: a device front-end command parsing module, the heterogeneous in-memory computing acceleration system according to any one of claims 1-16, and a non-volatile storage medium; the device front-end command parsing module is connected to the host side and is used to receive write requests with compression attributes sent by the host side, identify the write requests, and cache the gene sequencing data to be compressed; and the non-volatile storage medium is used to store the compressed file or compressed data object finally output by the heterogeneous in-memory computing acceleration system.
18. A heterogeneous in-memory computing acceleration method, characterized in that it comprises: receiving gene sequencing data to be compressed and generating multiple task objects; generating a task descriptor for each task object, and performing stage-aware mapping according to stage attributes, operator attributes, and data attributes to dispatch different task objects to the corresponding target execution domains; each target execution domain completing the corresponding stage processing according to the task descriptor, writing the processing result back to the shared buffer storage area, and returning a result descriptor; triggering the next-stage task based on the result descriptor and the task dependencies to form a multi-stage asynchronous pipeline; and encapsulating and indexing the compression results of all processing stages to generate an output compressed file or compressed data object.
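The multi-stage asynchronous pipeline of claim 18 can be sketched in software as a chain of workers connected by queues, where each completed result triggers the next stage. This is an illustration under assumptions, not the claimed hardware: the stage functions are stand-ins (sorting reads for reordering, an identity function for conversion encoding, `zlib` for entropy encoding), and `stage_worker` and `compress` are hypothetical names.

```python
import asyncio
import zlib


# Stand-in stage functions (assumptions, not the claimed accelerators).
def reorder(block: bytes) -> bytes:
    # Sort newline-separated reads so that similar reads become adjacent.
    return b"\n".join(sorted(block.split(b"\n")))


def transform(block: bytes) -> bytes:
    # Placeholder for matching encoding or lookup-table encoding.
    return block


def entropy(block: bytes) -> bytes:
    # zlib stands in for the dedicated entropy coder.
    return zlib.compress(block)


STAGES = (reorder, transform, entropy)


async def stage_worker(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Process tasks as they arrive; each output triggers the next stage."""
    while True:
        item = await inbox.get()
        if item is None:  # shutdown marker: forward it downstream and stop
            await outbox.put(None)
            return
        task_id, payload = item
        await outbox.put((task_id, fn(payload)))


async def compress(blocks):
    queues = [asyncio.Queue() for _ in range(len(STAGES) + 1)]
    workers = [
        asyncio.create_task(stage_worker(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(STAGES)
    ]
    # One task object per input block.
    for task_id, block in enumerate(blocks):
        await queues[0].put((task_id, block))
    await queues[0].put(None)

    results = {}
    while (item := await queues[-1].get()) is not None:
        task_id, payload = item
        results[task_id] = payload
    await asyncio.gather(*workers)

    # Encapsulation and indexing: concatenated payloads plus (offset, length) entries.
    body, index = bytearray(), []
    for task_id in sorted(results):
        index.append((len(body), len(results[task_id])))
        body += results[task_id]
    return bytes(body), index
```

Calling `asyncio.run(compress([b"TTT\nAAA", b"GGG\nCCC"]))` returns the packed bytes together with an index from which each block can be located and decompressed independently.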
19. The method according to claim 18, characterized in that dispatching different task objects to the corresponding target execution domains comprises: selecting reordering-related task objects directed at global redundancy mining and dispatching them to the near-memory reordering execution domain to perform reordering operations; selecting conversion-encoding-related task objects directed at local pattern extraction and dispatching them to the in-memory conversion encoding execution domain to perform matching encoding or lookup-table encoding according to the task type; selecting entropy-encoding-related task objects and dispatching them to the entropy encoding execution domain to perform entropy encoding operations; and selecting task objects related to control, organization, non-hotspot descriptor generation, or exception handling and dispatching them to the main execution domain.
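The dispatch rules of claim 19 amount to a mapping from task attributes to an execution domain. A minimal sketch, assuming string-valued stage and operator attributes, could look as follows. The `Domain` enum, the `TaskDescriptor` fields, the attribute values, and `map_task` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    NEAR_MEMORY_REORDER = "near-memory reordering execution domain"
    IN_MEMORY_CONVERSION = "in-memory conversion encoding execution domain"
    ENTROPY = "entropy encoding execution domain"
    MAIN = "main execution domain"


@dataclass(frozen=True)
class TaskDescriptor:
    stage: str           # assumed values: "reorder", "conversion", "entropy", or other
    operator: str = ""   # assumed values for conversion tasks: "match" or "lookup"
    data_kind: str = ""  # data attribute, unused by this simplified mapping


def map_task(task: TaskDescriptor):
    """Return (target domain, operation) for a task; anything else goes to MAIN."""
    if task.stage == "reorder":
        return Domain.NEAR_MEMORY_REORDER, "reorder"
    if task.stage == "conversion":
        # The task type selects matching encoding or lookup-table encoding.
        op = "lookup_table_encode" if task.operator == "lookup" else "match_encode"
        return Domain.IN_MEMORY_CONVERSION, op
    if task.stage == "entropy":
        return Domain.ENTROPY, "entropy_encode"
    # Control, organization, and exception-handling tasks stay on the main domain.
    return Domain.MAIN, task.stage
```

For example, `map_task(TaskDescriptor("conversion", operator="lookup"))` selects the in-memory conversion encoding domain with lookup-table encoding, while an exception-handling task falls through to the main execution domain.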