A block-level data layout optimization and compression method based on API pattern normalization
By normalizing API patterns in HTTP/API access logs and reordering records within each block of a log storage system, the method addresses the low compression efficiency of existing technologies, achieving higher compression ratios and storage efficiency while preserving the append order of data files. The method is suited to optimizing and compressing data in log storage systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN SHIXI TECH CO LTD
- Filing Date
- 2026-05-18
- Publication Date
- 2026-06-19
AI Technical Summary
In scenarios involving the persistence of massive HTTP/API access logs and full messages, existing technologies are insufficient in terms of compression efficiency and storage space utilization. In particular, the compression efficiency is low due to the increased LZ77 matching distance and uneven distribution of entropy coding symbols. Furthermore, it is difficult to optimize the data file while maintaining the order of appending and the continuity of blocks at the macro level.
By performing API pattern normalization on the received log records in the log storage system, generating pattern identifiers, and rearranging them in a single buffer, log records with the same API pattern are arranged consecutively. Then, the Zstandard compressor is used for compression, and a mapping relationship between virtual offsets and physical offsets is established to maintain the overall append order of the data file.
It significantly improves the compression ratio, maintains the macroscopic physical order of data files, and enhances compression efficiency without disrupting the overall storage structure, thereby reducing storage costs. It also features low engineering intrusion and robustness, adapting to the operational needs of different environments.
Figure: CN122240034A_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention relates to the field of data storage and compression technology, and specifically to a block-level data layout optimization and compression method based on API pattern normalization.
Background Technology
[0002] In scenarios involving the persistence of massive HTTP / API access logs and full messages, business applications typically need to prioritize high-throughput, low-overhead continuous reading and restoration of complete logs based on time or write order. Therefore, the industry generally adopts a row-based sequential write architecture: logs are aggregated into a memory buffer based on arrival time, then sequentially written to data files using append-only methods, with the index layer maintaining the mapping from logical keys to physical offsets.
[0003] To save storage space with controllable write latency and CPU usage, engineering practices often increase the size of single compressed blocks (e.g., from several MB to 16 MB) and use general compression algorithms such as Zstandard (ZSTD) for backend compression. The macroscopic pipeline of Zstandard can be understood as follows: first, searching for repeating substrings within a sliding window using a mechanism similar to LZ77, representing the repeating structure as sequence information such as offset, matching length, and literal; then, performing entropy encoding on the sequence and literal (often combined with Huffman coding and FSE based on the ANS approach), achieving a balance between compression ratio and decompression speed.
[0004] However, this "direct compression in arrival order" approach works against the way the compressor operates. Log entries that logically belong to the same API pattern often differ only slightly, in fields such as timestamp, session ID, and path parameters, yet in a row-based buffer they may be separated by numerous records from other interfaces. This lack of physical adjacency has two effects: LZ77 matching distances grow and matching chains are more likely to break; and the symbol distribution seen by the entropy coder becomes a near-uniform mixture, so the probability model cannot form a sharp distribution, which limits compression efficiency. Some systems try to alleviate this with higher compression levels or externally pre-trained dictionaries, but raising the compression level usually costs CPU time and latency, and in small-block storage the statistical sample may be too small for entropy coding to be efficient.
[0005] Meanwhile, business requirements often demand that data files maintain append order and block continuity at a macro level (for easy archiving, copying, and hierarchical storage). Simply relying on "global reordering of all historical logs" often conflicts with online write confirmation and the timing of offset return. Therefore, there is a need for a method that can aggregate and cluster data within blocks before compression without disrupting the overall ordered organization of logs in physical storage (typically block-based appending with stable inter-block order), thereby improving the matchability and predictability available to Zstandard from the data layout side and overcoming the shortcomings of current practice.
Summary of the Invention
[0006] The purpose of this invention is to provide a block-level data layout optimization and compression method based on API pattern normalization, so as to solve the problems mentioned in the background art.
[0007] To achieve the above objective, the present invention provides the following technical solution: a block-level data layout optimization and compression method based on API pattern normalization, applied to log storage systems. While maintaining the overall append order of the log file, the method performs the following steps within a single block buffer: receiving multiple log records; performing API pattern normalization on each log record to generate a corresponding pattern identifier; rearranging the log records in the block buffer according to the pattern identifier, so that log records belonging to the same API pattern are contiguous in the byte stream; and serializing the rearranged log records and sending them to the Zstandard compressor for compression and writing to disk.
[0008] As a further aspect of the present invention, the API pattern normalization of each log record includes: extracting the request path from the log record; mapping a request path containing dynamic parameters to the corresponding aggregation template according to preset aggregation rules; and using the aggregation template as the pattern identifier.
[0009] As a further aspect of the present invention: the sorting key for the intra-block rearrangement is a composite key, and the composite key includes at least the pattern identifier.
[0010] As a further aspect of the present invention, the method further includes: after compression and disk writing, establishing and storing the mapping relationship between the virtual offset (Virtual_Offset) of the log records in the block buffer before reordering and the physical offset (Physical_Offset) after reordering.
[0011] As a further aspect of the present invention, the method also includes a fallback mechanism: when the confidence level of the API pattern normalization is lower than a preset threshold or the processing times out, the normalization step is skipped, and the original path recorded in the log is used as the sorting key for intra-block rearrangement.
[0012] As a further aspect of the present invention, processing of a single block buffer is triggered when the amount of data in the buffer reaches a preset threshold or a timer expires.
[0013] This invention also provides a block-level data layout optimization and compression device based on API pattern normalization, deployed in a log storage system, for performing operations within a single block buffer while maintaining the overall append order of log files. The device includes: a log receiving module, used to receive multiple log records; a pattern normalization module, used to perform API pattern normalization on each log record and generate a corresponding pattern identifier; a block rearrangement module, used to rearrange the log records in the block buffer according to the pattern identifier, so that log records belonging to the same API pattern are contiguous in the byte stream; and a compression and disk write module, used to serialize the rearranged log records and send them to the Zstandard compressor for compression and writing to disk.
[0014] The present invention also provides a log storage system, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to cause the log storage system to perform the aforementioned block-level data layout optimization and compression method based on API pattern normalization.
[0015] Compared with the prior art, the beneficial effects of the present invention are: 1. Significantly improves compression ratio: By rearranging within blocks, homogeneous messages are made continuous or nearly continuous in the byte stream, providing smaller offsets and richer matching chains for Zstandard's LZ77 mechanism. At the same time, the symbol distribution of entropy coding is sharper, thus achieving better compression effect at the same compression level and effectively reducing storage costs.
[0016] 2. Maintain macroscopic physical order: All optimization operations are strictly limited to a single buffer, without changing the macroscopic form of data files being append-only, thus ensuring full compatibility with existing archiving, copying, tiered storage, and incremental backup processes.
[0017] 3. Low engineering invasiveness and robustness: This invention can be inserted as an intermediate layer in the compression preprocessing stage and provides a fallback scheme that degrades gracefully. When the pattern normalization module is unavailable or the samples are insufficient, the system can still operate normally and obtain part of the benefit.
Description of the Drawings
[0018] Figure 1 is a schematic diagram of a block-level data layout optimization and compression method based on API pattern normalization.
Detailed Implementation
[0019] The technical solution of this application will be further described in detail below with reference to specific embodiments.
[0020] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain this application, and should not be construed as limiting this application.
Example
[0021] This embodiment describes a block-level data layout optimization and compression method based on API pattern normalization, which is applied to the write path of a log storage system.
[0022] I. System Module Overview (see Figure 1)
As shown in Figure 1, the present invention logically comprises the following functional modules, which work together to achieve "intra-block compression ratio improvement under overall ordered constraints":
Log input module: Receives the incoming HTTP/API log stream.
[0023] Block buffer: Caches log records with a preset capacity (e.g., several MB to 16 MB, depending on hardware and SLA configuration), strictly adhering to append-only semantics and maintaining the original time order.
[0024] API pattern normalization module: Performs pattern normalization on URLs in logs (e.g., normalizes /user/123, /user/456, etc. to /user/id).
[0025] Intra-block rearrangement / clustering module: Clusters and rearranges records within a block according to the pattern identifier to improve data locality.
[0026] Serialization and header injection module: Adds an independent record header (which may include length, CRC, schema version, dictionary ID, etc.) to each log record, and adds a block header (which may include dictionary version ID, mapping table offset, length before and after compression, check field, etc.) to the entire block.
[0027] ZSTD compression module: Compresses the block using LZ77-style matching and entropy coding. Because clustering improves data locality, compression efficiency improves.
[0028] Index mapping layer: Maintains the mapping between virtual offset (Virtual_Offset) and physical offset (Physical_Offset) to ensure query consistency even though records are rearranged within physical blocks.
[0029] Read-only storage: Stores the compressed final data blocks.
[0030] II. Method and Flow (refer to Figure 1)
Step 1: Log Ingestion and Block Buffering
As shown in Figure 1, HTTP/API logs enter the system via the log input module. These log records are first written to a block buffer (memory buffer). The buffer strictly adheres to append-only semantics, maintaining the original chronological order. Processing of the buffer is triggered when the number of bytes in the buffer reaches a preset threshold (typically ranging from several MB to 16 MB) or when a timer expires (e.g., every few seconds).
[0031] During the writing process, each log record is assigned a virtual offset that increments in the order of being written into the block. This virtual offset corresponds to the logical write order perceived by the business side and is used to support the "write and return" query contract.
[0032] In abnormal scenarios (such as buffer backpressure or process crashes), this module can work in conjunction with Write-Ahead Log (WAL) or replica replication protocols to ensure that data loss does not exceed the agreed semantic boundaries. This module provides stable batch processing for subsequent API processing and serves as the timing boundary for in-block optimization.
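The buffering behavior of Step 1 can be sketched as follows. This is a minimal illustration only: the class name `BlockBuffer`, the 2-second timer, and the choice of the arrival index as the virtual offset are assumptions made for the example and are not specified by the embodiment.

```python
import time
from dataclasses import dataclass, field


@dataclass
class BlockBuffer:
    """Append-only in-memory block (illustrative names and defaults)."""
    max_bytes: int = 16 * 1024 * 1024   # size trigger, assumed 16 MB
    max_age_s: float = 2.0              # timer trigger, assumed 2 seconds
    records: list = field(default_factory=list)
    size: int = 0
    opened_at: float = field(default_factory=time.monotonic)

    def append(self, record: bytes) -> int:
        """Store a record and return its virtual offset (arrival index)."""
        virtual_offset = len(self.records)
        self.records.append(record)
        self.size += len(record)
        return virtual_offset

    def should_flush(self, now=None) -> bool:
        """True once the byte threshold is reached or the timer has expired."""
        now = time.monotonic() if now is None else now
        too_big = self.size >= self.max_bytes
        too_old = (now - self.opened_at) >= self.max_age_s
        return too_big or too_old
```

Returning the virtual offset from `append` mirrors the "write and return" contract described above; a real system would also persist the record to a WAL or replica before acknowledging it.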
[0033] Step 2: API Pattern Normalization
Before the compressor starts, the API pattern normalization module normalizes the request path of each record within the block. The example shown below the "API Pattern Specification" module in Figure 1 is as follows:
Input: URLs containing specific parameter values, such as /user/123 or /user/456.
[0034] Output: Normalized pattern template, such as /user/id.
[0035] The preset aggregation rules include one or more of the following heuristic rules:
Quantitative indicator rule: Count the number of distinct literals that appear in a path segment under the same path template. If the count exceeds a preset threshold (in practice, a threshold of around 16 is a reasonable starting point and can be relaxed slightly for larger data scales), the path segment is identified as a dynamic parameter and normalized to the corresponding parameter placeholder. For example, /user/123 and /user/456 are both normalized to /user/{number}.
[0036] Length / prefix / suffix type rules: Path segments that are continuous digit runs, UUIDs, or that match specific regular expressions (such as pure numbers or fixed-length hexadecimal strings) are identified as dynamic parameters and normalized.
[0037] Semantic auxiliary rules (optional): Used to handle boundary cases, such as identifying path segments that "appear to be parameters but have specific business semantics" (e.g., certain numeric strings represent fixed business types rather than user IDs). In this case, whitelists or custom rules can be combined to preserve their original meaning and avoid erroneous merging that could cause business ambiguity.
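The type-based rules and the optional whitelist can be sketched as below. The placeholder names, the regular expressions, and the contents of `KEEP_LITERAL` are hypothetical choices for illustration; the embodiment does not fix any of them.

```python
import re

# Hypothetical type rules: (placeholder, pattern), checked in order.
TYPE_RULES = [
    ("{uuid}", re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")),
    ("{number}", re.compile(r"\d+")),
    ("{hex}", re.compile(r"[0-9a-fA-F]{16,}")),
]

# Hypothetical whitelist: segments that look dynamic but carry business meaning.
KEEP_LITERAL = {"v1", "v2", "404"}


def normalize_path(path: str) -> str:
    """Replace path segments that match a type rule with a placeholder."""
    path = path.split("?", 1)[0]          # ignore the query string
    out = []
    for seg in path.split("/"):
        if seg and seg not in KEEP_LITERAL:
            for placeholder, rx in TYPE_RULES:
                if rx.fullmatch(seg):
                    seg = placeholder
                    break
        out.append(seg)
    return "/".join(out)
```

With these assumed rules, `normalize_path("/user/123")` and `normalize_path("/user/456")` both yield `/user/{number}`, while `/errors/404` is left unchanged because `404` is whitelisted.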
[0038] The output is a stable pattern identifier (Pattern_ID). The pattern identifier can take any of the following forms: (1) a normalized template string, such as /user/{number}; (2) a numeric or enumerated identifier, assigning a unique ID to each unique normalized template. Regardless of the specific form, the core function of the pattern identifier is to logically merge log records that were originally considered "different" due to different dynamic parameter values into the same category, thereby providing a stable sorting key for subsequent intra-block rearrangement.
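For the numeric form of the identifier, one possible approach is to intern each distinct template into a small integer. The sketch below assumes an in-memory registry; the class `PatternRegistry` and its methods are illustrative names, not part of the described system.

```python
class PatternRegistry:
    """Assigns a small integer Pattern_ID to each distinct template."""

    def __init__(self):
        self._ids = {}         # template string -> integer ID
        self._templates = []   # integer ID -> template string

    def pattern_id(self, template: str) -> int:
        """Return the existing ID for a template, or allocate the next one."""
        pid = self._ids.get(template)
        if pid is None:
            pid = len(self._templates)
            self._ids[template] = pid
            self._templates.append(template)
        return pid

    def template(self, pid: int) -> str:
        """Look up the template string for a previously issued ID."""
        return self._templates[pid]
```

Integer IDs make the sort key cheap to compare; if IDs are written into record headers, the registry itself would also need to be persisted, which this sketch does not cover.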
[0039] For APIs with extremely small sample sizes and insufficient statistics, the module can mark them as low confidence and pass through the original path to avoid erroneous merging that could cause business ambiguity.
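The quantitative indicator rule, together with the low-confidence pass-through, might look like the following sketch. The function name `infer_templates`, the `{param}` placeholder, and the `min_samples` cutoff are assumptions for illustration; only the starting threshold of about 16 comes from the text.

```python
from collections import defaultdict


def infer_templates(paths, threshold=16, min_samples=4):
    """Map each path to a template using a distinct-literal count per segment.

    A segment position is treated as dynamic when more than `threshold`
    different literals appear under the same normalized prefix. Groups with
    fewer than `min_samples` paths are passed through (low confidence).
    """
    split = [p.strip("/").split("/") for p in paths]
    norm = [[] for _ in split]                 # normalized prefix built so far
    depth = max((len(s) for s in split), default=0)
    for i in range(depth):
        groups = defaultdict(list)             # (length, prefix) -> path indexes
        for idx, segs in enumerate(split):
            if i < len(segs):
                groups[(len(segs), tuple(norm[idx]))].append(idx)
        for members in groups.values():
            literals = {split[idx][i] for idx in members}
            dynamic = len(members) >= min_samples and len(literals) > threshold
            for idx in members:
                norm[idx].append("{param}" if dynamic else split[idx][i])
    return ["/" + "/".join(n) for n in norm]
```

In this sketch, twenty distinct `/user/<n>` paths collapse to `/user/{param}`, whereas two `/order/<n>` paths are too few to judge and keep their original form.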
[0040] Step 3: Intra-block Rearrangement
This step is the core of the invention. The "Before" and "After" comparison shown in the "Intra-Block Rearrangement / Clustering Function" module of Figure 1 illustrates it:
Before rearrangement: Records belonging to the same pattern are scattered within the block, interspersed with records of other patterns, so homogeneous content is spatially discontinuous.
[0041] After rearrangement: all records belonging to the same pattern are grouped together and arranged consecutively; records belonging to other patterns are also grouped together accordingly.
[0042] Specifically, the intra-block rearrangement module sorts all records within the block. The sort key is a composite key, which may include, but is not limited to, pattern identifiers, HTTP methods, status codes, content types, and key header signatures. After sorting, all records belonging to the same API pattern are logically grouped together to form continuous or near-continuous byte stream segments.
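A composite-key sort of this kind can be sketched in a few lines. The `LogRecord` fields and the particular key order (pattern, then method, then status) are assumptions for illustration; the text leaves the exact composition of the key open.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    virtual_offset: int   # arrival index within the block
    pattern_id: str       # normalized template or other pattern identifier
    method: str
    status: int
    payload: bytes


def rearrange_block(records):
    """Sort by (pattern, method, status); ties keep arrival order.

    Python's sorted() is stable, so records with equal keys remain in
    their original write order inside each cluster.
    """
    return sorted(records, key=lambda r: (r.pattern_id, r.method, r.status))
```

Relying on a stable sort keeps each cluster in time order, which tends to place near-identical timestamps next to each other as well.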
[0043] This sort is a pure in-memory operation. Its overhead is generally acceptable relative to the subsequent Zstandard compression and disk I/O, especially when the number of records per block stays within a reasonable range and a cache-friendly comparison is used.
[0044] Step 4: Serialization and Header Injection
The clustered and rearranged log records enter the "Serialization and Header Information" module. This module adds an independent log record header (LogRecordHeader) to each individual log record, which may contain metadata such as length, verification information, schema version, and dictionary ID, as well as business fields such as HTTP method and status code. At the same time, a block header (BlockHeader) is added to the entire block, which may contain dictionary version ID, mapping table offset, length before and after compression, and verification fields.
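One possible byte layout for this step is sketched below. The field choices (a 10-byte record header with length, CRC-32, and schema version; a 12-byte block header with a magic value, record count, and body length) and the magic value `LBLK` are assumptions for the example; the embodiment lists candidate fields but does not define a format.

```python
import struct
import zlib

RECORD_HEADER = struct.Struct("<IIH")   # payload length, CRC-32, schema version
BLOCK_HEADER = struct.Struct("<4sII")   # magic, record count, body length


def serialize_block(payloads, schema_version=1):
    """Concatenate header-prefixed records behind a block header."""
    body = bytearray()
    for payload in payloads:
        body += RECORD_HEADER.pack(len(payload), zlib.crc32(payload), schema_version)
        body += payload
    return BLOCK_HEADER.pack(b"LBLK", len(payloads), len(body)) + bytes(body)


def parse_block(blob):
    """Inverse of serialize_block; raises ValueError on corruption."""
    magic, count, _body_len = BLOCK_HEADER.unpack_from(blob, 0)
    if magic != b"LBLK":
        raise ValueError("bad magic")
    pos, out = BLOCK_HEADER.size, []
    for _ in range(count):
        length, crc, _version = RECORD_HEADER.unpack_from(blob, pos)
        pos += RECORD_HEADER.size
        payload = blob[pos:pos + length]
        if zlib.crc32(payload) != crc:
            raise ValueError("CRC mismatch")
        out.append(payload)
        pos += length
    return out
```

The per-record length prefix is what later allows a reader to walk the decompressed block without any delimiter scanning.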
[0045] Step 5: Zstandard Compression
The serialized byte stream with header information added is then fed into the Zstandard compressor. As shown in Figure 1, the compressor internally employs LZ77-style matching and entropy coding (often combining Huffman coding with FSE, which is based on the ANS approach). Because the preceding intra-block rearrangement provides better data locality, the matcher can find repeated byte sequences within a smaller offset range, and entropy coding sees a sharper symbol distribution, which improves the compression ratio.
[0046] Step 6: Disk Writing and Index Mapping
The compressed data blocks are ultimately written to read-only storage (such as disk files). To resolve the conflict between "physical block rearrangement" and the "virtual order expected by the business", an index mapping layer is introduced, as shown in Figure 1.
[0047] Specifically, this embodiment establishes and stores a mapping relationship between the virtual offset (Virtual_Offset) of log records within the block buffer before reordering and the physical offset (Physical_Offset) after reordering and compression. This mapping relationship can be stored in an extended field of the block header or in a separate sidecar index file. During queries, if results need to be returned in the original write order, a lightweight offset transformation can be performed through this mapping table, thereby shielding the details of intra-block reordering from external query logic.
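The mapping table can be illustrated with the sketch below. It assumes that the physical offset is the byte position of a record inside the decompressed, rearranged block body and that virtual offsets are dense arrival indexes starting at 0; both are simplifying assumptions, and the function names are hypothetical.

```python
def build_offset_map(rearranged):
    """Build the virtual-to-physical table for one block.

    `rearranged` is a list of (virtual_offset, payload) in post-sort order.
    Entry v of the result holds (physical_offset, length) for the record
    whose virtual offset is v.
    """
    table = [None] * len(rearranged)
    pos = 0
    for virtual_offset, payload in rearranged:
        table[virtual_offset] = (pos, len(payload))
        pos += len(payload)
    return table


def read_in_write_order(body, table):
    """Yield payloads in original write order from the rearranged body."""
    for pos, length in table:
        yield body[pos:pos + length]
```

Because the table is indexed by virtual offset, restoring the original order is a single pass over the table with no sorting at read time.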
[0048] Through this mapping, the system still presents globally ordered append-write semantics to the outside world, while internally it enjoys the compression benefits brought by intra-block rearrangement.
[0049] III. Optional Implementation Methods and Alternative Solutions
Option A (Main Implementation): Full API pattern normalization + intra-block rearrangement (composite key sorting) + Zstandard compression. Suitable for scenarios with diverse log patterns and heavily parameterized paths, where the CPU cost of sorting at the end of each block can be tolerated; compression benefits are typically the highest.
[0050] Option B (Simplified Alternative): Instead of performing deep normalization on the entire URL, records are sorted with a stable lexicographic sort on metadata only, such as Path + Method + Status + Content-Type. This approach has low implementation costs and eliminates the need to maintain complex parameter identification state. It can still significantly improve locality when the number of interfaces is moderate and the path hierarchy is clear, but its benefits are lower than Option A in scenarios with "a large number of random IDs under the same route".
[0051] Fallback mechanism: When the confidence level of the API pattern normalization falls below a preset threshold or a processing timeout occurs (e.g., exceeding a preset millisecond-level SLA), the normalization step is skipped, and intra-block rearrangement is performed directly using the original path recorded in the log as the sorting key. Although this approach yields slightly less benefit than full API pattern normalization, it still provides better locality than unsorted arrival order, ensuring system robustness in complex environments. If Zstandard compression fails (e.g., under extreme memory pressure), the system can fall back to an "uncompressed block + flag bit" strategy or degrade to a lower compression level, prioritizing availability.
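Both fallbacks can be sketched as small wrappers. Everything named here is hypothetical: the normalizer is assumed to return a `(template, confidence)` pair, the time budget is checked after the call rather than enforced preemptively, and `zlib.compress` from the Python standard library stands in for the Zstandard compressor so the example stays self-contained.

```python
import time
import zlib

FLAG_RAW, FLAG_COMPRESSED = 0, 1


def sort_key(path, normalizer, min_confidence=0.8, budget_s=0.005):
    """Use the template when normalization is confident and fast enough;
    otherwise fall back to the raw path as the sorting key."""
    start = time.monotonic()
    try:
        template, confidence = normalizer(path)
    except Exception:
        return path                       # normalizer unavailable
    elapsed = time.monotonic() - start
    if confidence < min_confidence or elapsed > budget_s:
        return path                       # low confidence or over budget
    return template


def compress_or_passthrough(body, compress=zlib.compress):
    """Return (flag, data); store the block uncompressed if compression fails."""
    try:
        return FLAG_COMPRESSED, compress(body)
    except (MemoryError, zlib.error):
        return FLAG_RAW, body
```

The flag returned by `compress_or_passthrough` would be written into the block header so that readers know whether to decompress.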
[0052] IV. Compression Mechanism Analysis
At the structural encoding level (LZ77): Zstandard first searches within a sliding window for an earlier occurrence of the substring at the current position; if a match is found, the repeated bytes are replaced with a shorter structural description. If logs are interleaved by arrival time, records that share the same JSON key order, the same header template, and the same static URL segments are often separated by records from other interfaces, so matches require large offsets or fall outside the search depth altogether, and the proportion of literals rises under this unfavorable layout. After API pattern normalization and rearrangement of records with the same pattern, adjacent records differ only in a few fields (timestamp, ID, etc.), and long common prefixes recur. LZ77 can then encode them more cheaply, for example with small offsets and repeat offsets; structurally, this is equivalent to stripping a large part of the "skeleton" out of the literal pool.
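The effect of match distance can be demonstrated with a deliberately exaggerated experiment. The sketch below is not Zstandard: it uses `zlib` (DEFLATE) from the Python standard library as a stand-in, whose 32 KB window is far smaller than the windows Zstandard typically uses, and the synthetic records are constructed so that interleaved repeats fall outside that window. With real Zstandard and real logs the gain would come from shorter offsets and sharper statistics rather than from hard window misses, and its size will differ.

```python
import hashlib
import zlib


def make_record(pattern: int, seq: int) -> bytes:
    """Synthetic record: a 1024-char pattern-specific skeleton plus a small varying field."""
    skeleton = "".join(
        hashlib.sha256(f"{pattern}:{i}".encode()).hexdigest() for i in range(16)
    )
    return f"id={seq:06d} {skeleton}\n".encode()


patterns, rounds = 64, 4

# Arrival order: round-robin across patterns, so repeats are about 66 KB apart.
interleaved = [make_record(p, r) for r in range(rounds) for p in range(patterns)]

# Rearranged order: a stable sort on the skeleton makes repeats adjacent.
grouped = sorted(interleaved, key=lambda rec: rec.split(b" ", 1)[1])

size_interleaved = len(zlib.compress(b"".join(interleaved), 6))
size_grouped = len(zlib.compress(b"".join(grouped), 6))
print("interleaved:", size_interleaved, "grouped:", size_grouped)
```

Both inputs contain exactly the same records; only their order differs, which isolates the contribution of layout to compressed size.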
[0053] At the statistical coding level (entropy coding): After structural matching, the residual literals are entropy coded using FSE, Huffman, or similar coders. Entropy coding efficiency depends closely on how sharply peaked the conditional symbol distribution is. Rearrangement transforms the intra-block character / token distribution from a "multi-interface mixture with many randomly overlapping peaks" to a "strongly skewed distribution dominated by a single pattern" (e.g., long runs of the same punctuation marks, quotation marks, and key name fragments), which lowers the average number of coded bits per symbol. Even though Zstandard still uses units of a certain size (e.g., approximately 128 KB) as the actual coding unit, intra-block clustering increases the class purity within each coding unit, leading to more accurate local statistics and a measurable reduction in compressed size.
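As a rough way to inspect how peaked the byte distribution is within each coding unit, one can compute the order-0 Shannon entropy per unit. This is only a proxy under stated assumptions: it measures raw bytes rather than the literals left over after matching, it ignores context, and the 128 KB unit size is taken from the example figure in the text. The function names are illustrative.

```python
import math
from collections import Counter


def byte_entropy(chunk: bytes) -> float:
    """Order-0 Shannon entropy of a byte string, in bits per byte."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in Counter(chunk).values())


def entropy_per_unit(body: bytes, unit: int = 128 * 1024):
    """Entropy of each consecutive `unit`-sized slice of the block body."""
    return [byte_entropy(body[i:i + unit]) for i in range(0, len(body), unit)]
```

Comparing `entropy_per_unit` on a block body before and after rearrangement gives a quick, compressor-independent indication of whether clustering made the per-unit statistics more skewed.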
[0054] V. Compatibility with "Overall Physical Order"
In this invention, "overall order" means that the data file still grows sequentially block by block, and the order in which blocks are appended is never globally rearranged; each record can still be located by "block number + intra-block offset / mapping". Intra-block rearrangement does not change the coarser-grained physical ordering of the block sequence, so it is compatible with processes such as object storage synchronization, incremental backup, and daily rolling files. If the business strongly relies on strict time order within a block, a compromise is to use the mapping table or to output a time-ordered replica index in parallel.
[0055] The above are merely preferred embodiments of the present invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention, and these should also be considered within the scope of protection of the present invention. These will not affect the effectiveness of the implementation of the present invention or the practicality of the patent.
Claims
1. A block-level data layout optimization and compression method based on API pattern normalization, characterized in that the method is applied to log storage systems and, while maintaining the overall append order of the log file, performs the following steps within a single block buffer: receiving multiple log records; performing API pattern normalization on each log record to generate a corresponding pattern identifier; rearranging the log records in the block buffer according to the pattern identifier, so that log records belonging to the same API pattern are contiguous in the byte stream; and serializing the rearranged log records and sending them to the Zstandard compressor for compression and writing to disk.
2. The block-level data layout optimization and compression method based on API pattern normalization according to claim 1, characterized in that the API pattern normalization of each log record includes: extracting the request path from the log record; mapping a request path containing dynamic parameters to the corresponding aggregation template according to preset aggregation rules; and using the aggregation template as the pattern identifier.
3. The block-level data layout optimization and compression method based on API pattern normalization according to claim 1, characterized in that the sorting key for the intra-block rearrangement is a composite key, and the composite key includes at least the pattern identifier.
4. The block-level data layout optimization and compression method based on API pattern normalization according to claim 1, characterized in that the method further includes: after compression and writing to disk, establishing and storing a mapping relationship between the virtual offset of the log records in the block buffer before rearrangement and the physical offset after rearrangement.
5. The block-level data layout optimization and compression method based on API pattern normalization according to claim 4, characterized in that the method further includes a fallback mechanism: when the confidence level of the API pattern normalization is lower than a preset threshold or the processing times out, the normalization step is skipped and the original path recorded in the log is used as the sorting key for intra-block rearrangement.
6. The block-level data layout optimization and compression method based on API pattern normalization according to claim 1, characterized in that processing of a single block buffer is triggered when the amount of data in the buffer reaches a preset threshold or a timer expires.
7. A block-level data layout optimization and compression device based on API pattern normalization, characterized in that the device is deployed in a log storage system and is used to perform operations within a single block buffer while maintaining the overall append order of the log file, the device including: a log receiving module, used to receive multiple log records; a pattern normalization module, used to perform API pattern normalization on each log record and generate a corresponding pattern identifier; a block rearrangement module, used to rearrange the log records in the block buffer according to the pattern identifier, so that log records belonging to the same API pattern are contiguous in the byte stream; and a compression and disk write module, used to serialize the rearranged log records and send them to the Zstandard compressor for compression and writing to disk.
8. A log storage system, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the log storage system to perform the block-level data layout optimization and compression method based on API pattern normalization as described in any one of claims 1-6.