A method for fast parsing of large volume data
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- AIR FORCE UNIV PLA
- Filing Date
- 2026-02-10
- Publication Date
- 2026-06-19
Figure CN122242484A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a method for rapidly parsing large volumes of data.
Background Technology
[0002] With the popularization of big data technology, the amount of large-capacity structured frame data (with fixed frame header, frame body, and frame tail format) generated by business scenarios such as industrial IoT, satellite communication, and video surveillance is growing exponentially. The parsing efficiency of such data and the retrieval efficiency of parsing results have become the core bottleneck restricting the performance of data processing systems.
[0003] In existing technologies, the parsing and subsequent processing of large-capacity frame data mainly suffer from the following problems:
[0004] First, the processing mode is inefficient. Existing solutions mostly adopt a serial reading and serial parsing mode, which fails to fully utilize the parallel computing capabilities of multi-core CPUs and has low input/output (I/O) resource utilization. In addition, because dedicated parsing logic for frame-structured data is lacking, a generalized parsing method is often used, resulting in low efficiency in extracting key fields and poor identification and filtering of illegal or abnormal frames.
[0005] Second, there are flaws in the data reading and fragmentation methods. Traditional methods rely on conventional file I/O interfaces to read data, and frequent disk interactions lead to significant data loading delays. When parallel processing is attempted by fragmenting data, a simple fixed-size partitioning method is often used, which easily splits complete frames at fragment boundaries, forming cross-fragment frames. This can lead to subsequent parsing failures or the need for complex coordination mechanisms, ultimately reducing processing efficiency.
[0006] Third, the management of parsing rules and task scheduling are suboptimal. The formatting rules and validation rules required during parsing are typically loaded in real-time. When repeatedly processing the same type of frame data, external configuration files need to be read and parsed repeatedly, generating unnecessary overhead. Furthermore, because the fragmentation strategy and task allocation are not optimized based on the structural characteristics of the frame data, it can easily lead to an imbalance in multi-threaded environments, with threads either idling or experiencing excessive load.
[0007] Fourth, the result storage and index building mechanisms lag behind. Parsed result data is typically written to disk in a simple serial manner, without an efficient index structure being built at the same time. This forces subsequent query operations to perform a full scan of the massive result file, resulting in high query overhead and high response latency. Even when an index is built, a record-by-record writing mode is often used, generating numerous fine-grained disk I/O operations, which leads to slow index building and high disk pressure.
[0008] In summary, existing technical solutions generally suffer from drawbacks such as long parsing time, high system resource consumption, and low result retrieval efficiency when processing large-capacity frame data, making it difficult to meet current business needs for high throughput, low processing latency, and fast, easy retrieval.
Summary of the Invention
[0009] This invention addresses the problems of low parsing efficiency, high system resource consumption, and slow retrieval speed in existing large-capacity frame data parsing technologies by providing a method for fast parsing of large-capacity data. This method significantly improves the parsing speed and subsequent query efficiency of large-capacity structured frame data by optimizing key steps such as data sharding, parallel processing, memory mapping, and batch index construction.
[0010] To achieve the above objectives, the present invention adopts the following technical solution:
[0011] This invention proposes a method for fast parsing of large-capacity data, comprising the following steps:
[0012] S1. Read the basic metadata and frame-specific metadata of the large-capacity frame data file, calculate the optimal fragmentation granularity based on hardware adaptation and complete frame guarantee fragmentation rules, divide the large-capacity frame data into N data fragments, assign a unique identifier to each data fragment, and mark the start and end offsets of all complete frames in each data fragment.
[0013] S2. Organize the frame parsing rule set, which includes frame header check matching rules, frame body field extraction rules, and frame tail check judgment rules; classify it by frame type and cache it in system memory; and preload the index construction configuration for the parsing results.
[0014] S3. Map the large-capacity frame data file into the process virtual memory space, start a read thread pool sized to match the number of CPU cores, and have each read thread read, in parallel, the complete frame data in its bound data fragment into an independent memory buffer.
[0015] S4. Start a parsing thread pool with the same number of threads as the read thread pool. Based on the frame parsing rule set cached in S2, parse the complete frames in each data fragment frame by frame, identify legal frames and abnormal frames, temporarily store the parsing results of legal frames in a memory queue, and pre-sort them by index field. The parsing includes frame header verification, frame body parsing, and frame tail verification.
[0016] S5. Merge the parsing results of all legal frames in the order of fragment identifier and frame identifier. Based on the index construction configuration preloaded in S2, construct a secondary index using a batch processing strategy. At the same time, record and count abnormal frames. Finally, output the parsing result file, index file and summary log.
[0017] Furthermore, in S1, the basic metadata includes the total data size of the large-capacity frame data file and the frame type identifier; the frame-specific metadata includes the frame separator, frame header length, frame body format definition, and frame tail check identifier.
[0018] Furthermore, in S1, the hardware adaptation and complete frame guarantee fragmentation rule is as follows: the fragment size is equal to the product of the optimal processing data amount per thread and the number of CPU cores; wherein, the optimal processing data amount per thread is determined based on the CPU cache size of the target server, and when dividing the fragments, the fragment boundaries do not split any complete frames.
[0019] Furthermore, in S3, each reading thread is bound to a data segment in S1, and each reading thread reads the corresponding complete frame data according to the start and end offsets of each complete frame in the data segment it is bound to.
[0020] Furthermore, in S4, the specific steps of frame-by-frame parsing are as follows:
[0021] S401. Verify the frame identifier, frame length, and frame header checksum according to the frame header checksum matching rules;
[0022] S402. Locate and extract the index field within the frame body according to the frame body field extraction rules, and perform data type conversion;
[0023] S403. Verify the frame tail identifier and the overall checksum of the frame data according to the frame tail check judgment rules.
[0024] Furthermore, in S4, the secondary index is a two-level index comprising a first-level index keyed by frame identifier and a second-level index keyed by frame timestamp; in S5, the batch disk write threshold of the batch processing strategy is configured according to the system memory and disk I/O performance.
[0025] Furthermore, in S4, a one-to-one pairing relationship is established between the parsing thread and the reading thread; each parsing thread obtains complete frame data from the memory buffer corresponding to its paired reading thread, and performs frame-by-frame parsing based on the frame parsing rule set.
[0026] Furthermore, in S5, the recording and statistics of abnormal frames include: recording and classifying the abnormal frames identified in S4 independently of the parsing results of legal frames, and this process is executed in parallel with the merging of the parsing results of legal frames and the construction of the secondary index.
[0027] The present invention also proposes a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the above-mentioned method for fast parsing of large-capacity data.
[0028] The present invention also proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the above-mentioned method for fast parsing of large-capacity data.
[0029] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0030] (1) This invention replaces traditional file I/O reading with memory mapping, removing the latency bottleneck of frequent disk access; it performs sharding based on hardware parameters and the complete frame guarantee rule, and combines this with a multi-threaded parallel parsing architecture to make full use of multi-core CPU computing resources; at the same time, by preloading the core frame parsing rules into memory and adopting batch caching and segmented disk-write strategies in the index building stage, it avoids the overhead caused by repeated rule loading and fine-grained disk writes. Together, these measures significantly improve the parsing speed and index building speed for large-capacity frame data.
[0031] (2) The intelligent sharding strategy of this invention realizes load balancing among processing threads and improves CPU utilization. The integrated structured secondary index (with frame ID as the primary key and timestamp as the secondary key) eliminates the need for full scans in subsequent query operations on the parsed results, thereby significantly improving data retrieval efficiency.
[0032] (3) The present invention adopts a complete frame-guaranteed fragmentation mechanism to fundamentally avoid parsing failure caused by frame data splitting; the independent exception recording and handling mechanism ensures that local data errors will not interrupt the overall process, enhancing the robustness of the system in non-ideal data environments. In addition, the method is based on a general hardware and software architecture, the frame parsing rules are clearly and concisely defined, the index configuration is flexible, and it is easy to deploy and apply in different business scenarios.
Attached Figure Description
[0033] Figure 1 is a flowchart of the overall process of the method for fast parsing of large-capacity data provided in an embodiment of the present invention;
[0034] Figure 2 is a flowchart of the sub-process for frame data preprocessing and complete frame fragmentation in an embodiment of the present invention;
[0035] Figure 3 is a flowchart of the sub-process for parallel complete frame reading based on memory mapping in an embodiment of the present invention;
[0036] Figure 4 is a flowchart of the sub-process for multi-threaded parallel parsing with frame-structure-specific logic in an embodiment of the present invention;
[0037] Figure 5 is a flowchart of the sub-process for batch index construction and result merging in an embodiment of the present invention.
Detailed Implementation
[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0039] Example
[0040] Referring to Figure 1, this embodiment proposes a method for fast parsing of large-capacity data, implemented on a server configured with a multi-core CPU, large-capacity memory, and solid-state drives. The method includes the following steps:
[0041] S1. Referring to Figure 2, read the large-capacity frame data file, preprocess the frame data, and perform complete frame fragmentation through the following sub-steps:
[0042] S11. Read the large-capacity frame data file, obtain the basic metadata of the file, and load the predefined frame format configuration file to obtain the frame-specific metadata of the large-capacity frame data file. The basic metadata includes the total data size of the large-capacity frame data file and the frame type identifier. The frame-specific metadata includes the frame separator, frame header length, frame body format definition, and frame tail check identifier. The frame body format definition is used to describe the position and type of each field in the frame body.
[0043] In this embodiment, the large-capacity frame data file is the satellite telemetry data file telemetry.dat, with a total data size of 80 GB and a frame type identifier of SAT-V2; the frame separator is the 2-byte value 0xAA55, the frame header length is 14 bytes, and the frame tail check identifier is a 2-byte CRC16 checksum.
[0044] S12. Read the number of CPU cores of the server and calculate the optimal fragmentation granularity based on the hardware adaptation and complete frame guarantee fragmentation rules. The hardware adaptation and complete frame guarantee fragmentation rules are as follows: the fragmentation size is equal to the product of the optimal single-threaded data processing volume and the number of CPU cores. The optimal single-threaded data processing volume is determined based on the CPU cache size of the target server and is usually set to an empirical value between 64MB and 256MB.
[0045] In this embodiment, the number of CPU cores is 12. Based on the performance benchmark test of the server, it was determined that when a single thread processes about 256MB of data, the CPU cache utilization and processing throughput reach a better balance. Therefore, 256MB is set as the optimal amount of data to be processed by a single thread, and the calculated slice size is 3GB.
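The arithmetic of this embodiment can be checked with a short sketch. The function name `shard_size_bytes` is hypothetical, and binary units (1 MB = 1024 × 1024 bytes) are an assumption, since the text does not say which convention it uses.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def shard_size_bytes(per_thread_bytes: int, cpu_cores: int) -> int:
    """Nominal fragment size = per-thread data volume x number of CPU cores."""
    if per_thread_bytes <= 0 or cpu_cores <= 0:
        raise ValueError("both arguments must be positive")
    return per_thread_bytes * cpu_cores


# Values from the embodiment: 256 MB per thread, 12 CPU cores.
nominal = shard_size_bytes(256 * MIB, 12)
print(nominal // MIB, "MiB")  # 3072 MiB, i.e. 3 GiB

# Rough fragment count for an 80 GB file (ceiling division). The real count
# may differ slightly, because boundaries are later moved to frame starts.
approx_shards = -(-(80 * GIB) // nominal)
print(approx_shards)  # 27
```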
[0046] S13. Starting from the beginning of the file, scan the data forward. When the scan reaches the nominal 3GB boundary, do not cut the data immediately; instead, continue searching until the start separator 0xAA55 of the next complete frame is found. This position is taken as the actual end boundary of the first data fragment #0, thus ensuring the integrity of the last frame within this fragment. Record the start and end offsets of all complete frames within data fragment #0.
[0047] S14. Using the end offset of data fragment #0 as the start offset of data fragment #1, repeat S13 until the entire large-capacity frame data file is traversed. Finally, the system generates a fragment information list, which includes the unique ID of all data fragments, the global file offset range, and a complete frame offset table within each data fragment.
[0048] In this step, no complete frames are cut off at the fragment boundaries, which fundamentally avoids the problem of cross-fragment frame parsing failure that may occur during subsequent parallel parsing.
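Sub-steps S13 and S14 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Shard` and `split_into_shards` are hypothetical, and the sketch assumes that the byte pair 0xAA55 never appears inside a frame payload (in practice a false match would have to be rejected by the later header checks).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SEP = b"\xaa\x55"  # frame start separator from the embodiment


@dataclass
class Shard:
    shard_id: int
    start: int  # global file offset, inclusive
    end: int    # global file offset, exclusive
    frames: List[Tuple[int, int]] = field(default_factory=list)


def split_into_shards(data, nominal_size: int, sep: bytes = SEP) -> List[Shard]:
    """Cut `data` into fragments whose boundaries always fall on a separator."""
    if nominal_size <= 0:
        raise ValueError("nominal_size must be positive")
    shards: List[Shard] = []
    total = len(data)
    pos = data.find(sep)  # skip stray bytes before the first frame
    while pos != -1 and pos < total:
        target = pos + nominal_size
        if target >= total:
            end = total
        else:
            # Do not cut at the nominal boundary; move on to the next frame start.
            nxt = data.find(sep, target)
            end = total if nxt == -1 else nxt
        shard = Shard(len(shards), pos, end)
        # Build the complete-frame offset table for this fragment.
        f_start = pos
        while f_start < end:
            f_next = data.find(sep, f_start + len(sep), end)
            f_end = end if f_next == -1 else f_next
            shard.frames.append((f_start, f_end))
            f_start = f_end
        shards.append(shard)
        pos = end  # the next fragment starts where this one ended (S14)
    return shards


# Five 8-byte frames and a nominal fragment size of 10 bytes.
demo = b"".join(SEP + bytes([i]) * 6 for i in range(5))
for s in split_into_shards(demo, 10):
    print(s.shard_id, s.start, s.end, s.frames)
```

Because `find` is also available on `mmap` objects, the same function could in principle run directly over a memory-mapped file.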
[0049] S2. Organize the frame parsing rule set, classify it by frame type and cache it in system memory, and preload the index construction configuration for the parsing results.
[0050] The frame parsing rule set includes:
[0051] Frame header checksum matching rules: Define the legal format of the frame identifier, the valid range of frame length values, and the calculation rules and matching conditions of the frame header checksum, and generate a list of valid frame header judgments accordingly;
[0052] Frame body field extraction rules: Specify the starting offset, field length, and data type (such as integer, string, or binary) of each key field (such as frame ID, timestamp) in the frame body. Based on this, a list of frame body fields to be extracted is generated. This rule focuses on the accurate location and extraction of fields without involving complex logical conversions.
[0053] Frame tail check judgment rules: Define the fixed format of the frame tail identifier, the calculation algorithm of the overall checksum of the frame data and its legal threshold, and generate a list of valid frame tail judgments accordingly;
[0054] Index building configuration includes index fields (such as frame ID, timestamp), index storage format, building strategy parameters, batch disk write threshold, etc.
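One possible in-memory shape for the cached rule set and index configuration is sketched below. All class names, field offsets, and the threshold value are hypothetical; only the separator, the 14-byte header length, the 2-byte tail, and the SAT-V2 frame type come from the embodiment. `_load_rules` stands in for reading an external configuration file.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FieldRule:
    name: str
    offset: int  # byte offset inside the frame body (hypothetical values)
    length: int
    dtype: str   # e.g. "uint32", "uint64"


@dataclass(frozen=True)
class FrameRuleSet:
    frame_type: str
    separator: bytes
    header_length: int
    tail_length: int
    body_fields: Tuple[FieldRule, ...]


@dataclass(frozen=True)
class IndexConfig:
    index_fields: Tuple[str, ...]
    batch_write_threshold: int


_RULE_CACHE: Dict[str, FrameRuleSet] = {}
load_calls = 0  # counts how often the (simulated) config file is read


def _load_rules(frame_type: str) -> FrameRuleSet:
    """Stand-in for parsing an external frame format configuration file."""
    global load_calls
    load_calls += 1
    return FrameRuleSet(
        frame_type=frame_type,
        separator=b"\xaa\x55",
        header_length=14,
        tail_length=2,
        body_fields=(
            FieldRule("frame_id", 0, 4, "uint32"),
            FieldRule("timestamp", 4, 8, "uint64"),
        ),
    )


def get_rules(frame_type: str) -> FrameRuleSet:
    """Return the cached rule set, loading it only on first use."""
    rules = _RULE_CACHE.get(frame_type)
    if rules is None:
        rules = _load_rules(frame_type)
        _RULE_CACHE[frame_type] = rules
    return rules


INDEX_CONFIG = IndexConfig(("frame_id", "timestamp"), batch_write_threshold=10_000)

first = get_rules("SAT-V2")
second = get_rules("SAT-V2")  # served from memory, no second load
```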
[0055] S3. Referring to Figure 3, the large-capacity frame data file is mapped into the process virtual memory space, and a read thread pool matching the number of CPU cores is started. Each read thread reads, in parallel, the complete frame data within its bound data fragment into an independent memory buffer. Specifically, the following sub-steps are executed:
[0056] S31. The system calls the memory mapping function provided by the operating system (such as mmap() in Linux) to map the entire large frame data file to the virtual memory address space of the current process. Access to the large frame data file is converted into direct access to a specific memory address. The page cache management is handled by the underlying operating system, which greatly reduces the overhead of the traditional read() system call and its context switching and data copying.
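As an illustration of S31, the sketch below uses Python's standard `mmap` module, which wraps the operating system's mapping facility. The helper name `copy_span` and the tiny temporary file are assumptions made only so the example is self-contained.

```python
import mmap
import os
import tempfile


def copy_span(path: str, start: int, end: int) -> bytes:
    """Map the whole file read-only and copy out bytes [start, end)."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; slicing the mapping returns bytes.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:end]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "telemetry.dat")
    with open(path, "wb") as out:
        out.write(b"\xaa\x55" + b"frame-0!" + b"\xaa\x55" + b"frame-1!")
    first = copy_span(path, 0, 10)    # first 10-byte frame
    second = copy_span(path, 10, 20)  # second 10-byte frame

print(first, second)
```

In a real pipeline the mapping would be created once and shared by all read threads rather than reopened per call.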
[0057] S32. Create a read thread pool that matches the number of CPU cores. Dynamically allocate each data shard in the shard information list generated in S14 to idle read threads in the pool through the task scheduler. Preferably, allocate one or more complete data shards to each read thread for processing, establish the processing relationship between threads and data shards, and achieve lock-free parallelism.
[0058] S33. Each read thread, based on its assigned fragment ID, retrieves the complete frame offset table corresponding to that fragment from the fragment information list in memory. Then, using the memory address pointer obtained through memory mapping, and according to the offset table, copies the original data blocks of each complete frame from the mapping area to a pre-allocated, independent memory buffer for that thread. This step completely skips any unmarked fragmented data that may exist at the beginning or end of a fragment. All read threads operate in parallel, with independent workloads and no data contention, greatly improving the throughput of the data loading phase.
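The structure of S32 and S33 can be sketched with a thread pool in which each task copies the frames of one fragment into a buffer of its own. The function names and the shape of `shard_table` are hypothetical, and a `bytes` object stands in for the mapped region. Note that in standard CPython the global interpreter lock limits CPU parallelism for pure-Python work, so this shows the task layout rather than the achievable speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Offsets = List[Tuple[int, int]]


def read_shard(mapped, shard_id: int, frame_offsets: Offsets):
    """Copy every complete frame of one fragment into a task-private buffer."""
    buffer = [bytes(mapped[start:end]) for start, end in frame_offsets]
    return shard_id, buffer


def read_all_shards(mapped, shard_table: Dict[int, Offsets], workers=None):
    """Run one read task per fragment; no buffer is shared, so no locks are needed."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(read_shard, mapped, shard_id, offsets)
            for shard_id, offsets in shard_table.items()
        ]
        return dict(f.result() for f in futures)


# Three 8-byte frames spread over two fragments.
region = b"\xaa\x55AAAAAA" + b"\xaa\x55BBBBBB" + b"\xaa\x55CCCCCC"
table = {0: [(0, 8), (8, 16)], 1: [(16, 24)]}
buffers = read_all_shards(region, table, workers=2)
print(buffers)
```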
[0059] S4. Referring to Figure 4, the system starts a parsing thread pool with the same number of threads as the read thread pool, and parses the complete frames in each data fragment frame by frame based on the frame parsing rule set cached in S2, identifying legal and abnormal frames. Specifically, this includes:
[0060] S41. Create a parsing thread pool with the same number of threads as the reading thread pool. To maximize pipeline efficiency, establish a one-to-one pairing relationship between parsing threads and reading threads;
[0061] S42. Each parsing thread receives the address of the memory buffer filled with raw frame data from its corresponding reading thread, and based on the frame parsing rule set preloaded into memory in S2, performs the following parsing process sequentially for each complete frame data in the memory buffer:
[0062] S421. Verify the frame identifier, frame length, and frame header checksum according to the frame header check matching rules; if any check fails, mark the frame as having an invalid frame header, record its offset and error type, skip the subsequent parsing steps, and process the next frame.
[0063] S422. For frames that pass the frame header verification, locate and extract the index fields such as frame_id and timestamp in the frame body according to the frame body field extraction rules, perform data type conversion, and perform value range validity verification.
[0064] S423. Calculate the overall checksum of the frame data according to the frame tail check judgment rules, and compare it with the checksum field read from the frame tail. If they match, mark it as a valid frame; otherwise, mark it as a check failure.
[0065] In this step, frames marked as having an invalid frame header and frames that failed verification are all considered abnormal frames.
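The three checks S421 to S423 can be sketched on a made-up frame layout. Everything about the layout is an assumption: the text fixes only the 0xAA55 separator, the 14-byte header, and the 2-byte CRC16 tail. Here the header is taken to be separator (2) + frame length (4) + reserved (6) + header checksum (2), the header checksum is a simple byte sum, the body starts with `frame_id` (uint32) and `timestamp` (uint64), and the CRC16 variant is CRC-CCITT as provided by `binascii.crc_hqx`.

```python
import binascii
import struct

SEP = b"\xaa\x55"
HEADER = struct.Struct(">2sI6sH")  # hypothetical 14-byte header layout
BODY_INDEX = struct.Struct(">IQ")  # frame_id, timestamp (hypothetical)
TAIL = struct.Struct(">H")         # 2-byte CRC16


def header_checksum(first12: bytes) -> int:
    """Hypothetical header checksum: byte sum of the first 12 header bytes."""
    return sum(first12) & 0xFFFF


def build_frame(frame_id: int, timestamp: int, payload: bytes = b"") -> bytes:
    """Helper that produces a well-formed frame for the demonstration."""
    body = BODY_INDEX.pack(frame_id, timestamp) + payload
    frame_len = HEADER.size + len(body) + TAIL.size
    head12 = struct.pack(">2sI6s", SEP, frame_len, bytes(6))
    header = head12 + struct.pack(">H", header_checksum(head12))
    crc = binascii.crc_hqx(header + body, 0xFFFF)
    return header + body + TAIL.pack(crc)


def parse_frame(frame: bytes):
    """Return ("ok", fields) for a legal frame, else (error_type, None)."""
    if len(frame) < HEADER.size + BODY_INDEX.size + TAIL.size:
        return "invalid_header", None
    # S421: separator, frame length, and header checksum.
    sep, frame_len, _reserved, hsum = HEADER.unpack_from(frame, 0)
    if sep != SEP or frame_len != len(frame) or hsum != header_checksum(frame[:12]):
        return "invalid_header", None
    # S422: locate and convert the index fields in the frame body.
    frame_id, timestamp = BODY_INDEX.unpack_from(frame, HEADER.size)
    # S423: recompute the overall checksum and compare with the tail.
    (stored,) = TAIL.unpack_from(frame, len(frame) - TAIL.size)
    if binascii.crc_hqx(frame[: -TAIL.size], 0xFFFF) != stored:
        return "check_failure", None
    return "ok", {"frame_id": frame_id, "timestamp": timestamp}


good = build_frame(7, 1_700_000_000, b"abc")
bad_tail = good[:-1] + bytes([good[-1] ^ 0xFF])
bad_head = b"\x00" + good[1:]
print(parse_frame(good), parse_frame(bad_tail)[0], parse_frame(bad_head)[0])
```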
[0066] S43. The parsing results (structured field data) of the identified legitimate frames are temporarily stored in a dedicated memory queue associated with the parsing thread. Simultaneously, the records in the queue are initially sorted according to the index fields preloaded in S2, preparing for subsequent efficient index merging and construction. For abnormal frames, their key information (such as frame offset, error type, and fragment ID) is recorded in a separate, thread-local abnormal information list to avoid frequent synchronous disk writes during parsing, ensuring the efficient operation of the main process.
[0067] S5. Referring to Figure 5, the parsing results of all legal frames are merged, and a secondary index is built based on the index construction configuration preloaded in S2. Simultaneously, abnormal frames are recorded and counted. Finally, the parsing results are output. Specifically, the following sub-steps are executed:
[0068] S51. Merge the parsing results of all valid frames in the memory queue of the parsing thread according to the order of fragment identifier and frame identifier, generate a complete and ordered structured dataset, and output it as the final parsing result dataset.
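Because each parsing thread's queue is already sorted, the merge in S51 can be done as a k-way merge rather than a full re-sort. The sketch below uses `heapq.merge` from the standard library; the record keys `shard_id` and `frame_id` are hypothetical names for the fragment identifier and frame identifier.

```python
import heapq
from typing import Dict, Iterable, List

Record = Dict[str, int]


def sort_key(record: Record):
    """Order by fragment identifier first, then frame identifier."""
    return (record["shard_id"], record["frame_id"])


def merge_results(queues: Iterable[List[Record]]) -> List[Record]:
    """K-way merge of per-thread queues into one ordered result list."""
    # Sorting each queue again is cheap if it is already ordered and keeps
    # the sketch correct even if a queue was only partially pre-sorted.
    sorted_queues = [sorted(q, key=sort_key) for q in queues]
    return list(heapq.merge(*sorted_queues, key=sort_key))


queues = [
    [{"shard_id": 1, "frame_id": 5}, {"shard_id": 1, "frame_id": 2}],
    [{"shard_id": 0, "frame_id": 9}, {"shard_id": 0, "frame_id": 3}],
]
merged = merge_results(queues)
print([sort_key(r) for r in merged])  # [(0, 3), (0, 9), (1, 2), (1, 5)]
```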
[0069] S52. Based on the index building configuration preloaded in S2, a secondary index structure with frame ID as the first-level index and timestamp as the second-level index is built on the merged parsing result dataset using a batch caching and segmented disk-write strategy. This structure is then persisted to disk as a secondary index file. Specifically, index entries are cached in memory, and when their number reaches the configured batch disk write threshold, they are written to the secondary index file in batches, greatly reducing the number of disk I/O operations.
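The batching idea in S52 can be sketched as a small writer that buffers index entries and issues one write per batch. The class name `BatchIndexWriter` and the tab-separated entry layout are hypothetical; the sketch illustrates only the threshold-triggered flush, not the on-disk structure of the two-level index.

```python
import io


class BatchIndexWriter:
    """Buffer index entries in memory and write them out in batches."""

    def __init__(self, sink, threshold: int):
        if threshold <= 0:
            raise ValueError("threshold must be positive")
        self.sink = sink            # any object with a write(str) method
        self.threshold = threshold  # batch disk write threshold
        self.pending = []
        self.flush_count = 0

    def add(self, frame_id: int, timestamp: int, result_offset: int) -> None:
        self.pending.append(f"{frame_id}\t{timestamp}\t{result_offset}\n")
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        """Write all pending entries with a single write call."""
        if self.pending:
            self.sink.write("".join(self.pending))
            self.pending.clear()
            self.flush_count += 1


sink = io.StringIO()  # stands in for the index file
writer = BatchIndexWriter(sink, threshold=3)
for i in range(7):
    writer.add(frame_id=i, timestamp=1_700_000_000 + i, result_offset=i * 64)
writer.flush()  # write the final partial batch

print(writer.flush_count)  # 3 writes for 7 entries (batches of 3, 3, 1)
```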
[0070] S53. Under the premise that the main parsing and index building process is not affected, the background thread performs unified classification, counting and statistics on the list of abnormal information collected in S4, and finally generates a detailed abnormal report log. This ensures that errors in local data will not cause the interruption of the overall task, and meets the requirements of high-throughput continuous processing.
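A minimal sketch of S53 follows: a background thread classifies and counts the thread-local anomaly lists while the main flow continues. The record keys and the error type names `invalid_header` and `check_failure` are hypothetical labels for the two abnormal frame categories described in S4.

```python
import threading
from collections import Counter
from typing import Dict, List

Anomaly = Dict[str, object]


def summarize_anomalies(per_thread_lists: List[List[Anomaly]]) -> dict:
    """Classify and count abnormal frames collected by the parsing threads."""
    by_type: Counter = Counter()
    by_shard: Counter = Counter()
    for records in per_thread_lists:
        for rec in records:
            by_type[rec["error_type"]] += 1
            by_shard[rec["shard_id"]] += 1
    return {
        "total": sum(by_type.values()),
        "by_type": dict(by_type),
        "by_shard": dict(by_shard),
    }


anomaly_lists = [
    [{"shard_id": 0, "offset": 128, "error_type": "invalid_header"}],
    [
        {"shard_id": 1, "offset": 4096, "error_type": "check_failure"},
        {"shard_id": 1, "offset": 8192, "error_type": "check_failure"},
    ],
]

report: dict = {}
worker = threading.Thread(
    target=lambda: report.update(summarize_anomalies(anomaly_lists)),
    daemon=True,
)
worker.start()  # merging and index building would proceed here in parallel
worker.join()   # wait for the report before writing the exception log
print(report)
```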
[0071] S54. The system outputs a dataset of parsed results, a secondary index file efficiently built based on the parsed results, a summary log that records the overall task time, resource usage, and processing frame statistics, and an exception log that records all exception details, for monitoring and problem diagnosis.
[0072] As can be seen from the above embodiments, the present invention, through the synergy of technologies such as complete frame guarantee fragmentation, preloading of parsing rules and index configuration, parallel reading of memory mapping, parallel processing of dedicated parsing logic, and batch index construction, not only significantly improves the parsing speed and index construction speed, but also optimizes the system resource utilization and greatly improves the retrieval efficiency of subsequent data, thus achieving the goal of fast, stable, and easy-to-retrieve integrated processing of large-capacity structured frame data.
[0073] The specific embodiments of the present invention are provided to enable those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention.
[0074] It should be understood that the present invention is not limited to the content already described above, and various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
Claims
1. A method for fast parsing of large-capacity data, characterized in that it includes the following steps:
S1. Read the basic metadata and frame-specific metadata of the large-capacity frame data file, calculate the optimal fragmentation granularity based on hardware adaptation and complete frame guarantee fragmentation rules, divide the large-capacity frame data into N data fragments, assign a unique identifier to each data fragment, and mark the start and end offsets of all complete frames in each data fragment.
S2. Organize the frame parsing rule set, which includes frame header check matching rules, frame body field extraction rules, and frame tail check judgment rules; classify it by frame type and cache it in system memory; and preload the index construction configuration for the parsing results.
S3. Map the large-capacity frame data file into the process virtual memory space, start a read thread pool sized to match the number of CPU cores, and have each read thread read, in parallel, the complete frame data in its bound data fragment into an independent memory buffer.
S4. Start a parsing thread pool with the same number of threads as the read thread pool. Based on the frame parsing rule set cached in S2, parse the complete frames in each data fragment frame by frame, identify legal frames and abnormal frames, temporarily store the parsing results of legal frames in a memory queue, and pre-sort them by index field. The parsing includes frame header verification, frame body parsing, and frame tail verification.
S5. Merge the parsing results of all legal frames in the order of fragment identifier and frame identifier. Based on the index construction configuration preloaded in S2, construct a secondary index using a batch processing strategy. At the same time, record and count abnormal frames. Finally, output the parsing result file, index file, and summary log.
2. The method for fast parsing of large-capacity data according to claim 1, characterized in that, in S1, the basic metadata includes the total data size of the large-capacity frame data file and the frame type identifier; the frame-specific metadata includes the frame separator, frame header length, frame body format definition, and frame tail check identifier.
3. The method for fast parsing of large-capacity data according to claim 1, characterized in that, in S1, the hardware adaptation and complete frame guarantee fragmentation rules are as follows: the fragment size is equal to the product of the optimal processing data amount per thread and the number of CPU cores; wherein the optimal processing data amount per thread is determined based on the CPU cache size of the target server, and when dividing the fragments, the fragment boundaries do not split any complete frames.
4. The method for fast parsing of large-capacity data according to claim 1, characterized in that, in S3, each read thread is bound to a data fragment from S1, and each read thread reads the corresponding complete frame data according to the start and end offsets of each complete frame in the data fragment it is bound to.
5. The method for fast parsing of large-capacity data according to claim 1, characterized in that, in S4, the specific steps of frame-by-frame parsing are as follows:
S401. Verify the frame identifier, frame length, and frame header checksum according to the frame header checksum matching rules;
S402. Locate and extract the index field within the frame body according to the frame body field extraction rules, and perform data type conversion;
S403. Verify the frame tail identifier and the overall checksum of the frame data according to the frame tail check judgment rules.
6. The method for fast parsing of large-capacity data according to claim 1, characterized in that, in S4, the secondary index is a two-level index comprising a first-level index keyed by frame identifier and a second-level index keyed by frame timestamp; in S5, the batch disk write threshold of the batch processing strategy is configured according to the system memory and disk I/O performance.
7. The method for fast parsing of large-capacity data according to claim 1, characterized in that, in S4, a one-to-one pairing relationship is established between the parsing thread and the reading thread; each parsing thread obtains complete frame data from the memory buffer corresponding to its paired reading thread, and performs frame-by-frame parsing based on the frame parsing rule set.
8. The method for fast parsing of large-capacity data according to claim 7, characterized in that, in S5, the recording and statistics of abnormal frames include: recording and classifying the abnormal frames identified in S4 independently of the parsing results of legal frames, and this process is executed in parallel with the merging of the parsing results of legal frames and the construction of the secondary index.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the method for fast parsing of large-capacity data as described in any one of claims 1 to 8.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method for fast parsing of large-capacity data as described in any one of claims 1 to 8.