File processing method and device, electronic equipment and readable storage medium
By deduplicating and compressing kernel dump files, the redundancy problem of large kernel dump files is solved, resulting in smaller file sizes and higher compression efficiency, thus improving the efficiency of fault analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GODSON ZHONGKE (BEIJING) INFORMATION TECH CO LTD
- Filing Date
- 2026-02-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies generate a large amount of redundant data when processing large kernel dump files, leading to storage and transmission pressures and low compression efficiency.
The memory pages in the kernel dump file are deduplicated, and duplicate non-zero memory pages are filtered out. Then, LZO, Snappy, or Zstd algorithms are used for compression, and a mapping table between memory pages and compressed data blocks is established.
It reduces the size of the target file, improves compression efficiency, reduces storage and transmission pressure, and improves the efficiency of fault analysis.
Figure CN122240579A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a file processing method, apparatus, electronic device, and readable storage medium.
Background Technology
[0002] When the kernel crashes, the operating system generates a kernel dump file (Virtual Memory Core Dump, vmcore) to completely preserve the physical memory state at the moment of the kernel crash. This allows professional fault-finding tools to perform offline analysis and to accurately diagnose the root cause of the fault based on the analysis results.
[0003] If the operating system runs on a device with a large amount of memory, the kernel dump file will typically be large, putting pressure on storage and transmission. Current methods for handling large kernel dump files involve using existing tools (such as the kernel dump compression tool makedumpfile) to optimize certain memory pages within the kernel dump file, thereby reducing the final size of the kernel dump file.
[0004] Even with optimizations from existing tools, a large amount of redundant data still exists in the kernel dump file, so the file remains large, which significantly impacts its storage, transmission, and subsequent fault analysis.
Summary of the Invention
[0005] In view of the above problems, embodiments of the present invention provide a file processing method that overcomes or at least partially solves them. The method reduces the size of the target file and, because only the deduplicated memory pages need to be compressed, also improves the compression efficiency for the kernel dump file to be processed.
[0006] Accordingly, embodiments of the present invention also provide a file processing device, an electronic device, and a readable storage medium to ensure the implementation and application of the above methods.
[0007] In a first aspect, embodiments of the present invention disclose a file processing method, the method comprising: Based on the page content corresponding to each memory page in the kernel dump file to be processed, deduplication operations are performed on the memory pages respectively to obtain a set of target compressed pages; The target compressed pages are compressed separately to obtain compressed data blocks corresponding to each target compressed page; Based on the page content, obtain the correspondence between each memory page and the compressed data block; A target mapping table is established based on the compressed data blocks, and the target mapping table is stored in the target file corresponding to the kernel dump file; the target mapping table includes the mapping relationship between each memory page and the storage location of its corresponding compressed data block in the target file.
[0008] In a second aspect, embodiments of the present invention disclose a file processing apparatus, the file processing apparatus comprising: A deduplication module, used to perform deduplication operations on the memory pages according to the page content corresponding to each memory page in the kernel dump file to be processed, so as to obtain a set of target compressed pages; A compression module, used to compress the target compressed pages respectively to obtain compressed data blocks corresponding to each target compressed page; A mapping module, used to obtain the correspondence between each memory page and the compressed data block according to the page content; An establishing module, used to create a target mapping table based on the compressed data blocks and store the target mapping table in the target file corresponding to the kernel dump file; the target mapping table includes the mapping relationship between each memory page and the storage location of its corresponding compressed data block in the target file.
[0009] Thirdly, embodiments of the present invention disclose an electronic device, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, wherein the executable instruction causes the processor to perform the steps of any of the file processing methods described above.
[0010] Fourthly, embodiments of the present invention disclose a readable storage medium storing a program or instructions, which, when executed by a processor, can implement any of the file processing methods described above.
[0011] Fifthly, embodiments of the present invention disclose a computer program product, including a computer program, wherein when the computer program is executed by a processor, it performs the steps of any of the file processing methods described above.
[0012] The embodiments of the present invention have the following advantages: In related technologies, redundancy in kernel dump files is typically handled by using existing tools (such as makedumpfile) to filter and compress memory pages. This filtering process usually considers removing zero-filled pages. However, during normal system operation, kernel page merging and copy-on-write mechanisms still generate a large amount of identical data, resulting in numerous identical non-zero memory pages in the generated kernel dump file. Therefore, even after filtering zero-filled pages, redundancy remains. The solution provided in this application first performs deduplication on the memory pages in the kernel dump file, then compresses the deduplicated pages to obtain compressed data blocks, and finally constructs the target file based on these compressed data blocks. Because deduplication is performed on the memory pages in the kernel dump file, the size of the target file obtained by compressing the memory pages is reduced. Furthermore, during the compression process, since only deduplicated memory pages need to be compressed, the efficiency of the compression operation is improved.
Attached Figure Description
[0013] Figure 1 is a flowchart illustrating the steps of an embodiment of the file processing method of the present invention; Figure 2 is a structural block diagram of an embodiment of the file processing apparatus of the present invention; Figure 3 is a schematic diagram of the structure of the electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0014] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0015] When a system crashes, kdump can generate a kernel dump file (vmcore). kdump is a service that provides a crash dump mechanism. After a kernel crash, the kdump service captures kernel memory in the vmcore file and generates additional diagnostic files to aid in troubleshooting and post-crash analysis. The contents of the vmcore file can be analyzed using crash tools to determine the cause of the system crash. The vmcore file is the kernel dump file at the time of a system crash. Because its size is usually related to the size of the system's physical memory, devices with larger memory generally generate larger vmcore files.
[0016] This file may contain the following types of data records:
- Physical memory contents: all physical memory data at the time of the system crash;
- Process list: information about all processes at the time of the crash;
- CPU register status;
- Device status: the registers and status information of hardware devices;
- Cached data: data in the file system cache, which may include file content that has not been written to disk;
- Network connectivity: network sockets and connection status, and the contents of the network protocol stack buffers;
- Kernel modules: information and status of loaded kernel modules;
- Virtual memory mapping: the virtual memory layout of a process;
- Crash information: the type and location of the error that caused the crash.
[0017] Using crash analysis tools, the information carried in the aforementioned vmcore file can be analyzed in depth to determine the root cause of the system crash. For example, it can pinpoint defective drivers, problematic kernel patches, or specific abnormal system calls, allowing for targeted fixes later. Therefore, vmcore kernel dump files are crucial for system-level fault analysis.
[0018] If the system malfunctions and the device has a large amount of memory, it will typically result in a large vmcore file, putting pressure on storage and transmission. Related technologies take this into account, and typically use existing tools (such as the kernel dump compression tool makedumpfile) to process the memory pages storing the aforementioned data records in the vmcore file. This includes filtering out pages filled with zeros and compressing the filtered content page by page to reduce the size of the vmcore file.
[0019] However, even after zero-page removal, the vmcore may still contain a large number of identical non-zero memory pages, generated through mechanisms such as Kernel Samepage Merging (KSM) and copy-on-write (CoW).
[0020] Therefore, the vmcore processed by the makedumpfile tool in related technologies still contains a large amount of redundancy, resulting in a suboptimal file size. Furthermore, compressing each page independently means repeatedly compressing redundant pages with identical content, which not only fails to achieve the optimal file size but also wastes processing time and computing resources.
[0021] Referring to Figure 1, a flowchart of an embodiment of a file processing method according to the present invention is shown, the method including the following steps:
Step 101: Based on the page content corresponding to each memory page in the kernel dump file to be processed, perform deduplication operation on the memory pages respectively to obtain a set of target compressed pages;
Step 102: Compress the target compressed pages respectively to obtain compressed data blocks corresponding to each target compressed page;
Step 103: Based on the page content, obtain the correspondence between each memory page and the compressed data block;
Step 104: Establish a target mapping table based on the compressed data blocks, and store the target mapping table in the target file corresponding to the kernel dump file; the target mapping table includes the mapping relationship between each memory page and the storage location of its corresponding compressed data block in the target file.
[0022] The file processing method provided in this embodiment of the invention performs deduplication on the memory pages that need to be compressed in the kernel dump file, thus reducing the size of the target file; at the same time, since only the deduplicated memory pages need to be compressed, the compression efficiency can also be improved.
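As an illustration only, steps 101 to 104 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `process_dump`, the 4096-byte `PAGE_SIZE`, and the in-memory `dict` representation of the dump are hypothetical, exact page bytes are used as the deduplication key, and the standard-library `zlib` codec stands in for the LZO, Snappy, or Zstd algorithms named in this description.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def process_dump(pages):
    """pages: dict mapping page frame number -> page content (bytes).

    Returns (storage_area, mapping), where mapping maps each page frame
    number to the (starting offset, byte length) of its compressed block.
    """
    location_of_content = {}  # page content -> (offset, length)
    storage_area = bytearray()
    mapping = {}
    for pfn in sorted(pages):
        content = pages[pfn]
        location = location_of_content.get(content)
        if location is None:
            # Steps 101/102: first occurrence of this content, so the page
            # is a target compressed page and is compressed exactly once.
            block = zlib.compress(content)  # stand-in for LZO/Snappy/Zstd
            location = (len(storage_area), len(block))
            storage_area += block
            location_of_content[content] = location
        # Steps 103/104: every page, duplicate or not, gets a table entry.
        mapping[pfn] = location
    return bytes(storage_area), mapping


if __name__ == "__main__":
    same = bytes([7]) * PAGE_SIZE
    other = bytes([9]) * PAGE_SIZE
    storage, table = process_dump({10: same, 11: same, 12: other})
    print(len(set(table.values())), "compressed blocks for", len(table), "pages")
```

Duplicate pages cost one table entry each but no additional compression work or storage, which is the effect described in the paragraph above.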
[0023] Specifically, the kernel dump file is first deduplicated, removing non-zero memory pages with duplicate content to reduce its size and decrease the pressure on compression. After removing duplicate non-zero memory pages, the set of target compressed pages is obtained. Target compressed pages refer to the memory pages in the kernel dump file that need to be compressed; they contain no duplicate content and represent the result after redundancy removal.
[0024] Furthermore, the page content of the target compressed pages in the set is compressed page by page to obtain the compressed data block corresponding to each target compressed page, that is, the compressed data block corresponding to different page contents.
[0025] The target compressed page can be compressed using either the LZO (Lempel-Ziv-Oberhumer) real-time compression algorithm or the Snappy compression algorithm to obtain compressed data blocks. Both LZO and Snappy compression algorithms offer high compression / decompression speeds.
[0026] Optionally, in this embodiment, the target compressed page is compressed using the Zstd (Zstandard) compression algorithm to obtain compressed data blocks. Compared to compression algorithms like LZO and Snappy, the Zstd compression algorithm provides a higher compression ratio while maintaining comparably high compression / decompression speed, giving a favorable balance between speed and compression ratio. For large files such as vmcore, a higher compression ratio on the same input means a smaller compressed file and therefore greater storage savings. Selecting the Zstd compression algorithm thus preserves compression speed while further reducing the size of the compressed result.
[0027] Since the set of target compressed pages is the result of filtering out duplicate pages from all memory pages, the content of every memory page is represented in the set of target compressed pages. Furthermore, the page content of each target compressed page corresponds one-to-one with a generated compressed data block, so each memory page can find the compressed data block corresponding to its content. Even though the number of target compressed pages may be smaller than the number of memory pages, the correspondence between every memory page and a compressed data block can still be obtained from the page content.
[0028] The target compressed pages are compressed to generate compressed data blocks, which are stored in the target storage area for later retrieval. The target storage area is a component of the target file. A target mapping table is established to represent the correspondence between memory pages and compressed data blocks, so that the original kernel dump file to be processed can be recovered from the stored compressed data blocks. Specifically, once the compressed data blocks have been stored in the target storage area, a fault analysis tool that needs the kernel dump file to be processed must locate those blocks there. Different compressed data blocks are stored at different locations within the target storage area; therefore, the target mapping table includes the mapping relationship between each memory page and the storage location of its corresponding compressed data block in the target storage area. For example, the target mapping table stores the unique identifier of each memory page (such as the page frame number) together with the storage location of the corresponding compressed data block, expressed as its starting offset in the target storage area and its byte length. The table can then be queried with the unique identifier of a memory page to obtain its compressed data block, and decompressing that block restores the memory page.
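The lookup described above can be sketched as follows. The names `read_page`, `target_storage`, and `mapping_table` are hypothetical, the table is modeled as a plain Python `dict` rather than any on-disk format, and `zlib` again stands in for the codecs named in this description.

```python
import zlib


def read_page(target_storage: bytes, mapping_table: dict, pfn: int) -> bytes:
    """Restore one memory page from its page frame number."""
    offset, length = mapping_table[pfn]             # query by unique identifier
    block = target_storage[offset:offset + length]  # locate the compressed block
    return zlib.decompress(block)                   # decompress to restore the page


# Tiny demonstration: two page frame numbers share one compressed block.
page = b"crash-data" * 400
block = zlib.compress(page)
target_storage = bytes(16) + block  # the block starts at offset 16 here
mapping_table = {700: (16, len(block)), 702: (16, len(block))}
restored = read_page(target_storage, mapping_table, 702)
```

Because an entry holds only an offset and a length, any number of page frame numbers can point at the same block without duplicating its data.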
[0030] It is understandable that the target mapping table is constructed based on the compressed data blocks, that is, by constructing the target mapping table through the correspondence between the storage location of the compressed data blocks and the unique identifier of each memory page.
[0031] The valid content of each memory page in the kernel dump file to be processed can be obtained based on the target mapping table and the target storage area. For example, the kernel dump file to be processed can be recovered using the target mapping table and the compressed data blocks in the target storage area. The target mapping table contains the correspondence between each memory page in the kernel dump file and the compressed data blocks. Based on this correspondence, the compressed data blocks corresponding to the memory pages can be decompressed to obtain the kernel dump file to be processed.
[0032] For example, the kernel dump file to be processed includes memory pages a, b, c, d, e, and f, where memory pages a, b, and c have the same content, and memory pages e and f have the same content. During deduplication, one memory page is selected as the representative of each group with the same content; for example, memory page a, the first to appear among memory pages a, b, and c, is selected for that group. Similarly, memory page f is selected as the representative of memory pages e and f, resulting in the set of target compressed pages {a, d, f}. Each target compressed page is compressed to obtain compressed data blocks a', d', and f'. If compression is performed using the Zstd algorithm, then compressed data blocks a', d', and f' are all in Zstd compressed frame format; all compressed data blocks are stored in the target storage area of the target file corresponding to the kernel dump file. Since memory pages b and c have the same content as memory page a, they also correspond to compressed data block a' based on their page content. Similarly, since memory page e has the same content as memory page f, it also corresponds to compressed data block f' based on its page content. Therefore, the correspondence between memory pages and compressed data blocks includes: the correspondence between memory pages a, b, and c and compressed data block a'; the correspondence between memory page d and compressed data block d'; and the correspondence between memory pages e and f and compressed data block f'. The target mapping table stores the correspondence between each memory page and the storage location of its corresponding compressed data block.
[0033] The kernel dump file to be processed is recovered based on the target mapping table and the compressed data blocks in the target storage area. Specifically, when the fault analysis tool needs to perform fault analysis based on the kernel dump file, it obtains the storage location of compressed data block a' in the target storage area from the mapping relationship in the target mapping table and retrieves a' from that location. Because memory pages a, b, and c all correspond to a', the block is decompressed and the result is stored at the position of each of memory pages a, b, and c in the kernel dump file (or at their respective positions in the order of all memory pages). Alternatively, a' may be decompressed only once for memory page a, and the result copied to memory pages b and c. In this way memory pages a, b, and c are recovered. Following the same logic, the storage location of compressed data block d' is obtained, d' is retrieved and decompressed, and the result is stored at the position of memory page d to recover it. Finally, the storage location of compressed data block f' is obtained from the target mapping table and f' is retrieved. Because memory pages e and f both correspond to f', the block is decompressed and stored at the respective positions of memory pages e and f (or decompressed once and copied to both), restoring memory pages e and f. Fault analysis is then performed on the restored memory pages a to f to identify the system fault recorded in the original kernel dump file to be processed.
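The "decompress once, then copy" variant of this recovery can be sketched as below, using the a to f example. The function `restore_pages`, the single-letter page names used as keys, and the use of `zlib` in place of Zstd are all illustrative assumptions.

```python
import zlib


def restore_pages(storage: bytes, mapping: dict) -> dict:
    """Rebuild every page, decompressing each compressed block only once."""
    decompressed = {}  # (offset, length) -> page content, filled lazily
    restored = {}
    for page_id, location in mapping.items():
        if location not in decompressed:
            offset, length = location
            decompressed[location] = zlib.decompress(storage[offset:offset + length])
        restored[page_id] = decompressed[location]  # copy for duplicate pages
    return restored


# Build the target storage area holding blocks a', d' and f'.
contents = {"a": b"A" * 4096, "d": b"D" * 4096, "f": b"F" * 4096}
storage = bytearray()
location = {}
for name, data in contents.items():
    block = zlib.compress(data)
    location[name] = (len(storage), len(block))
    storage += block

# Pages b and c share a'; page e shares f'.
mapping = {"a": location["a"], "b": location["a"], "c": location["a"],
           "d": location["d"], "e": location["f"], "f": location["f"]}
restored = restore_pages(bytes(storage), mapping)
```

Six pages are restored from three decompression calls; the cache keyed by storage location is what avoids repeating the work for pages b, c, and e.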
[0034] Therefore, the target file with the target storage area is updated according to the target mapping table. The updated target file includes the data in the complete kernel dump file to be processed before deduplication and compression are performed. Moreover, the size of the target file is much smaller than that of the kernel dump file to be processed, which can reduce the data storage and transmission pressure and improve the user experience.
[0035] It should be understood that the representative of a given page content may also be chosen at random from the memory pages sharing that content. For example, among memory pages a, b, and c, page b or page c can be chosen instead of page a; it is sufficient that each distinct page content has exactly one representative. Similarly, either of memory pages e and f can be selected as the representative of that group.
[0036] The target file supports a custom format, which can be read by the final fault analysis tools. Furthermore, the content within the target file can be stored in either a preset location within the custom format or a user-specified location, without any restrictions.
[0037] Optionally, the method further includes: preprocessing the original kernel dump file to obtain the kernel dump file to be processed; wherein the preprocessing includes removing zero-filled pages.
[0038] In practice, a kernel dump file records the instantaneous state of the system after a crash. Therefore, the generated raw kernel dump file may include various types of pages, such as zero-filled pages, free pages, and cached pages that do not contain private pages and can be rebuilt after a system reboot. All of these can be handled in preprocessing using existing memory dumping tools. For example, the `makedumpfile` tool can be used to filter out zero-filled pages, free pages, and cached pages that can be rebuilt after a system reboot. Users can also choose to enable other processing tools to exclude user process data pages and all cached pages containing private pages.
[0039] After preprocessing the original kernel dump file to obtain the kernel dump file to be processed, deduplication and compression operations are performed on the kernel dump file. On the one hand, since at least zero-padding pages are removed, the deduplication pressure is reduced. During deduplication, it is not necessary to identify or process zero-padding pages, thus reducing the amount of data to be processed and optimizing data processing speed. On the other hand, the removal of zero-padding pages during preprocessing can alleviate the pressure of data compression to some extent. Furthermore, preprocessing also supports the removal of cached pages that can be rebuilt after a system reboot and free pages, further reducing the amount of data to be compressed in the kernel dump file, which can further reduce the computational resources and time required for data compression and improve processing efficiency.
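To make the effect of this preprocessing concrete, the following sketch drops zero-filled pages from an in-memory page set before any deduplication takes place. It is only an illustration of the filtering idea: in practice the description relies on existing tools such as makedumpfile for this step, and the function name `drop_zero_pages` and the `dict` representation are hypothetical.

```python
def drop_zero_pages(pages: dict) -> dict:
    """Keep only pages that contain at least one non-zero byte."""
    # any() over a bytes object iterates its integer byte values,
    # so it is False exactly when every byte is zero.
    return {pfn: data for pfn, data in pages.items() if any(data)}


raw_pages = {
    0: bytes(4096),                # zero-filled page, removed
    1: b"\x01" + bytes(4095),      # one non-zero byte, kept
    2: bytes(4096),                # zero-filled page, removed
}
pages_to_process = drop_zero_pages(raw_pages)
```

Only the pages that survive this filter need to be encoded, compared, and compressed in the later steps.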
[0040] In related technologies, redundancy in kernel dump files is typically handled by using existing tools (such as makedumpfile) to filter and compress memory pages within the kernel dump file. This filtering process usually only considers removing zero-filled pages. However, during normal system operation, kernel page merging, copy-on-write, and other mechanisms generate a large amount of identical data, resulting in a large number of identical non-zero memory pages in the generated kernel dump file. Therefore, even after filtering zero-filled pages, redundancy issues still exist.
[0041] In the solution provided in this application, considering the existence of identical non-zero memory pages, a deduplication operation is first performed on the memory pages in the kernel dump file, and the deduplicated memory pages (i.e., the target compressed pages) are then compressed to obtain the target file. Because the deduplication operation has already been completed on the memory pages in the kernel dump file, the size of the target file obtained by compressing the memory pages can be reduced; at the same time, during the compression process of memory pages, since only the deduplicated memory pages need to be compressed, i.e., only the target compressed pages are processed, the efficiency of the compression operation can also be improved.
[0042] Furthermore, by performing compression on each target compressed page separately, different compressed data blocks can be obtained, with each compressed data block corresponding one-to-one with the page content of the target compressed page. Based on the page content, a further correspondence between all memory pages and compressed data blocks can be established, and a target mapping table can be built according to this correspondence. When the fault analysis tool processes the final output target file of the kernel dump file, the correspondence between all memory pages and each compressed data block in this target mapping table can be used to decompress the compressed data blocks according to the position or order of each memory page to restore the complete kernel dump file to be processed, thus completing the fault analysis.
[0043] In one optional embodiment of this application, the step of performing deduplication operations on the memory pages respectively to obtain a set of target compressed pages may include:
Step S01: Based on the page content corresponding to each memory page in the kernel dump file to be processed, encode each memory page to obtain the encoding value corresponding to each memory page.
Step S02: Obtain one memory page corresponding to each encoded value to form a set of target compressed pages.
[0044] In this embodiment, the content of each memory page is first encoded, with each page content corresponding to a unique encoding value. If the content of memory pages is the same, they correspond to the same encoding value. Subsequently, the encoding values are used to identify whether duplicate pages exist, or to obtain the correspondence between each memory page and the compressed data block. By using the encoding values to perform the above-mentioned duplicate identification or relationship retrieval process, it is not necessary to read the complete page content, which reduces the amount of data that needs to be compared or identified, speeds up the identification or other processing of page content, and improves data processing efficiency.
[0045] Specifically, each memory page is first independently encoded based on its content, and the resulting encoded value corresponds to the page content. For example, checksum-based algorithms (such as CRC32) or hash algorithms (such as FarmHash, MurmurHash, etc.) can be used to encode the page content. Checksum-based algorithms are faster; hash algorithms produce fewer collisions than checksum-based algorithms, making it much less likely that different page contents are assigned the same encoded value and thereby reducing collision errors.
[0046] After encoding each memory page, duplicate memory pages are identified by their encoded values; that is, duplicate memory pages are removed using the encoded values. Specifically, memory pages with the same encoded value are grouped into a set, and any one of these memory pages is extracted as a representative, or the first memory page to appear in the set is selected as the representative. In other words, only one memory page is selected as the representative for each encoded value, and this memory page is the target compressed page corresponding to the current encoded value. Finally, the memory pages corresponding to all different encoded values are obtained, forming a set of target compressed pages.
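Steps S01 and S02 can be sketched with the CRC32 checksum mentioned above, which is available in Python's standard library as `zlib.crc32`. The function name `select_target_pages` is hypothetical, and the rule used here (the page with the lowest page frame number represents each encoded value) is just one of the selection options the description allows.

```python
import zlib


def select_target_pages(pages: dict) -> dict:
    """Return {encoded value: page frame number of the representative page}."""
    representatives = {}
    for pfn in sorted(pages):
        code = zlib.crc32(pages[pfn])          # S01: encode the page content
        representatives.setdefault(code, pfn)  # S02: keep one page per value
    return representatives


pages = {
    1: b"\x01" + bytes(4095),
    2: b"\x02" + bytes(4095),
    3: b"\x01" + bytes(4095),  # same content as page 1
}
target_pages = select_target_pages(pages)
```

Comparing 32-bit values instead of 4096-byte pages is what makes the duplicate check cheap; the price is that a checksum this short can collide on large dumps, which is why the description goes on to prefer hash algorithms.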
[0047] For example, encoding each memory page to obtain a corresponding encoded value for each memory page may include: Each memory page is encoded using a hash algorithm, and the hash value corresponding to each memory page is used as the encoding value of that memory page. That is, in practical use, a hash algorithm can be selected to encode the page content of the memory pages, and the resulting hash value is used as the encoding value of each memory page. After the hash value is obtained as the encoding value, each hash value is stored in the key field (i.e., the key area) of a hash table, and the value field (i.e., the value area) corresponding to each hash value stores the unique identifiers (such as page frame numbers) of all memory pages that have the current hash value.
[0048] Because of the uniqueness of the parameters in the hash table key field (key area), each hash value will be unique after being stored in the hash table. This means that the encoded values of memory pages stored in the hash table will not have duplicate values, further demonstrating that hash tables can be used to deduplicate memory pages. If multiple memory pages have the same page content, the hash value obtained by encoding that page content will correspond to a unique identifier (such as a page frame number) for each of the multiple memory pages.
[0049] In one example, consider the following memory pages to be processed, where Hash represents the hash value:
PFN 700: Page content A (Hash: 0x1A…)
PFN 701: Page content B (Hash: 0x9C…)
PFN 702: Page content A (Hash: 0x1A…) -> Duplicate page
PFN 703: Page content C (Hash: 0xE5…)
PFN 704: Page content A (Hash: 0x1A…) -> Duplicate page
The generated hash table is shown in Table 1:
Table 1
[0050] Because of the uniqueness of the hash table key field (key area), when storing hash values calculated based on page content, the same hash value is stored only once. That is, the keys of the hash table represent the page contents after deduplication. When compressed data blocks are subsequently computed, one identifier (such as a page frame number) is taken from the value field (value area) corresponding to each key field (key area), and the memory page corresponding to this identifier is a target compressed page. The set of all memory pages corresponding to the identifiers obtained in this way is the set of target compressed pages.
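The hash table for the PFN 700 to 704 example can be reproduced with a Python `dict`, whose keys are unique in the same way as the key field described above. This is a sketch under assumptions: the page contents are placeholder byte strings, and an 8-byte BLAKE2b digest from the standard library is used as a stand-in 64-bit hash, so the actual key values will differ from the truncated 0x1A…, 0x9C…, and 0xE5… values shown in the example.

```python
import hashlib


def encode(page: bytes) -> int:
    """64-bit encoding value (BLAKE2b stand-in, not the hash used in the source)."""
    return int.from_bytes(hashlib.blake2b(page, digest_size=8).digest(), "big")


pages = {
    700: b"A" * 4096,
    701: b"B" * 4096,
    702: b"A" * 4096,  # duplicate of PFN 700
    703: b"C" * 4096,
    704: b"A" * 4096,  # duplicate of PFN 700
}

hash_table = {}  # key: hash value, value: list of PFNs with that content
for pfn, content in pages.items():
    hash_table.setdefault(encode(content), []).append(pfn)

# One identifier is taken from each value field; here, the first one stored.
target_pfns = [pfns[0] for pfns in hash_table.values()]

for key, pfns in hash_table.items():
    print(f"0x{key:016X} -> {pfns}")
```

The key for content A ends up holding PFNs 700, 702, and 704, so the five pages collapse to three target compressed pages.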
[0051] Specifically, the xxHash algorithm, while maintaining the advantages of hash algorithms (i.e., a low collision rate), also possesses speed advantages and good portability. It can quickly generate hash values as encoding values and is not dependent on the platform's instruction set. Therefore, the xxHash algorithm can be chosen to encode page content.
[0052] Of the xxHash variants, the 64-bit version, namely XXH64, is preferred, as it has a lower probability of hash collisions than the 32-bit version.
[0053] In this example, the xxHash algorithm is used for encoding to obtain the encoded value corresponding to the page content. This improves the encoding speed while maintaining a low collision probability, and further enhances the comparison speed of the page content.
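The difference between 32-bit and 64-bit hash values can be made concrete with the standard birthday-bound estimate. The sketch below assumes hash outputs are uniformly distributed and assumes a workload of 64 GiB of distinct 4 KiB pages; both figures are illustrative and are not taken from this application.

```python
import math

def collision_probability(num_pages: int, hash_bits: int) -> float:
    """Birthday-bound estimate of at least one collision among num_pages hashes."""
    exponent = -num_pages * (num_pages - 1) / (2.0 * 2.0 ** hash_bits)
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values

# Assumed workload: 64 GiB of memory in 4 KiB pages = 2**24 distinct pages.
num_pages = (64 * 2**30) // 4096
p32 = collision_probability(num_pages, 32)
p64 = collision_probability(num_pages, 64)
print(f"32-bit: {p32:.6f}, 64-bit: {p64:.3e}")
```

Under these assumptions a 32-bit hash is practically certain to produce at least one collision, while the 64-bit estimate is on the order of one in a hundred thousand.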
[0054] In this embodiment, the page content is first encoded, and the encoded values are compared to determine whether memory pages are duplicates. Then, for each encoded value, one memory page is selected from the set of memory pages sharing that value as the target compressed page, and these selections together form the final set of target compressed pages. For example, after encoding with a hash algorithm, for each hash value stored in the hash table, the memory page represented by any one of its unique identifiers (such as a page frame number) is taken as the target compressed page for that hash value (encoded value). Compared with reading and comparing complete page contents to detect duplicates, this method greatly reduces the amount of data that must be repeatedly read and processed during comparison, thereby speeding up data processing and improving the efficiency of deduplication.
[0055] In an optional embodiment of the present invention, establishing a target mapping table based on the compressed data block may include:
Step S11: Obtain the page frame number corresponding to each memory page;
Step S12: Based on the correspondence between each memory page and the compressed data block, obtain the mapping relationship between the page frame number and the storage location of the compressed data block in the target file, wherein the storage location of the compressed data block in the target file includes the starting offset and the byte length occupied by the compressed data block in the target file;
Step S13: Establish the target mapping table according to the mapping relationship.
[0056] In this embodiment, the Page Frame Number (PFN) is used as the unique identifier for each memory page. A mapping relationship between memory pages and compressed data blocks is established through the PFN, and a target mapping table is built based on this mapping relationship. Each memory page has a unique page frame number, thus allowing different memory pages to be identified.
[0057] After compressed data blocks are stored in the target storage area, they need to be located based on their storage location. Therefore, directly establishing the correspondence between the storage location of the compressed data blocks and their page frame numbers allows us to locate the target storage area using the page frame number, obtain the corresponding compressed data block at its storage location within the target storage area, and further restore the kernel dump file. The storage location includes the starting offset and the number of bytes occupied by the compressed data block in the target storage area. The starting offset is used to pinpoint the exact location of the compressed data block, and the number of bytes occupied indicates how many bytes need to be read after the starting offset to obtain the complete compressed data block.
[0058] Understandably, offsets are typically relative values and require a reference point. In this scheme, the starting offset is the offset in bytes of the compressed data block within the target storage area relative to the starting position of the target storage area; that is, the reference point for the starting offset in this scheme is the starting position of the target storage area. By combining the starting position of the target storage area and the starting offset corresponding to the compressed data block, the absolute position of the compressed data block can be accurately determined.
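As a minimal sketch of this addressing scheme, the following Python snippet reads one compressed data block given the starting position of the target storage area, the starting offset, and the byte length. The helper name `read_block`, the 16-byte prefix, and the placeholder block contents are hypothetical and serve only to make the example self-contained.

```python
import io

def read_block(f, area_start: int, start_offset: int, size: int) -> bytes:
    """Read `size` bytes at absolute position area_start + start_offset."""
    f.seek(area_start + start_offset)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("compressed data block is truncated")
    return data

# Hypothetical layout: 16 bytes of other data, then the target storage area.
area_start = 16
storage_area = b"a" * 100 + b"b" * 2000 + b"c" * 500  # placeholder blocks
target_file = io.BytesIO(b"\x00" * area_start + storage_area)

# Block stored at starting offset 100 with a length of 2000 bytes.
block_b = read_block(target_file, area_start, 100, 2000)
print(len(block_b))
```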
[0059] For example, the kernel dump file to be processed includes memory pages a, b, c, d, e, and f, whose page frame numbers are Pa, Pb, Pc, Pd, Pe, and Pf, respectively. Each page is encoded (e.g., using a hash algorithm) to obtain its encoded value. Memory pages a, b, and c have the same encoded value, as do memory pages e and f. Based on these encoded values, it is determined that the page contents of memory pages a, b, and c are the same, and that the page contents of memory pages e and f are the same. Therefore, memory pages a, d, and e are selected as target compressed pages. A compression algorithm such as Zstd is called to compress each target compressed page, resulting in compressed data blocks a', d', and e', which are then stored in the target storage area of the target file. The page frame numbers Pa, Pb, Pc, Pd, Pe, and Pf and the storage locations of compressed data blocks a', d', and e' in the target storage area are then filled into the target mapping table according to the following mapping relationships: page frame numbers Pa, Pb, and Pc map to compressed data block a'; page frame number Pd maps to compressed data block d'; and page frame numbers Pe and Pf map to compressed data block e'. The table is filled by sequential, append-only writes, meaning that new data is written only to the end of the file without modifying or overwriting existing data. When the fault analysis tool performs fault analysis based on the target file, it needs to obtain memory pages a, b, c, d, e, and f. To do so, the decompression algorithm corresponding to the compression algorithm used (such as Zstd) is called to decompress compressed data block a' at the storage location corresponding to page frame number Pa. The decompression result of compressed data block a' at page frame number Pa is copied to the locations corresponding to page frame numbers Pb and Pc; compressed data block d' is decompressed at the location corresponding to page frame number Pd; compressed data block e' is decompressed at the location corresponding to page frame number Pe; and the decompression result of compressed data block e' is copied to the location corresponding to page frame number Pf. Finally, memory pages a, b, c, d, e, and f are obtained for the fault analysis tool to analyze and identify.
[0060] In this approach, since the hash table (constructed as in Table 1) shows that memory pages a, b, and c contain the same content, and page frame number Pa is the first of their page frame numbers to appear, the compressed data block a' is decompressed at page frame number Pa, and the subsequent page frame numbers Pb and Pc copy the decompression result from page frame number Pa. This avoids repeatedly decompressing the same compressed data block and improves data processing efficiency. In some examples, the check of whether the current page frame number is the first to appear for its page content can be skipped, and the compressed data block is decompressed directly at each page frame number according to the correspondence between page frame numbers and compressed data blocks. This reduces the complexity of the instructions to be written. In that case, when page frame numbers Pa, Pb, and Pc are processed, the compressed data block a' is decompressed at each of Pa, Pb, and Pc; there is no need to determine the relationship between Pb, Pc, and Pa, and the decompression result at Pa is not copied to Pb and Pc.
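The decompress-once-and-copy variant can be sketched as follows. This is an illustration under stated assumptions, not the method of this application: `zlib` from the Python standard library stands in for Zstd, the integers 1 to 6 stand in for page frame numbers Pa to Pf, and the function name `restore_pages` is hypothetical. The sketch recognizes repeated content by its storage location rather than by consulting the hash table, which is one simple way to obtain the same effect.

```python
import zlib

def restore_pages(mapping, storage_area: bytes):
    """Rebuild {pfn: page} from (pfn, offset, size) records.

    Each distinct storage location is decompressed once; later PFNs that
    point at the same location reuse (copy) the earlier result.
    """
    cache = {}
    pages = {}
    decompress_calls = 0
    for pfn, offset, size in mapping:
        location = (offset, size)
        if location not in cache:
            cache[location] = zlib.decompress(storage_area[offset:offset + size])
            decompress_calls += 1
        pages[pfn] = cache[location]
    return pages, decompress_calls

PAGE = 4096  # assumed page size
contents = {"a": b"\x11" * PAGE, "d": b"\x22" * PAGE, "e": b"\x33" * PAGE}

# Append blocks a', d', e' to the storage area and remember where each landed.
storage_area = b""
location_of = {}
for name, content in contents.items():
    block = zlib.compress(content)
    location_of[name] = (len(storage_area), len(block))
    storage_area += block

# PFNs 1..6 stand in for Pa..Pf; a, b, c share block a' and e, f share block e'.
owners = [(1, "a"), (2, "a"), (3, "a"), (4, "d"), (5, "e"), (6, "e")]
mapping = [(pfn, *location_of[name]) for pfn, name in owners]

pages, calls = restore_pages(mapping, storage_area)
print(len(pages), calls)
```

Dropping the cache gives the simpler variant in which every page frame number triggers its own decompression.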
[0061] For example, the final target mapping table can be obtained by using the structure `struct PageMapEntry` to create page mapping entries. The fields involved include: `uint64_t pfn`, which defines the PFN as an unsigned 64-bit integer; `uint64_t data_offset`, which defines the starting offset of the compressed data block's storage location as an unsigned 64-bit integer; and `uint32_t data_size`, which defines the length in bytes of the compressed data block as an unsigned 32-bit integer.
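For illustration, the same three fields can be serialized with Python's `struct` module. The little-endian, unpadded layout chosen here (`"<QQI"`, 20 bytes per entry) is an assumption; the text does not specify byte order or padding, and a C compiler using natural alignment would typically pad such a structure to 24 bytes.

```python
import struct
from typing import NamedTuple

# Assumed on-disk layout: little-endian, no padding.
# Q = uint64 pfn, Q = uint64 data_offset, I = uint32 data_size
ENTRY_FORMAT = "<QQI"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)

class PageMapEntry(NamedTuple):
    pfn: int
    data_offset: int
    data_size: int

    def pack(self) -> bytes:
        """Serialize the entry into ENTRY_SIZE bytes."""
        return struct.pack(ENTRY_FORMAT, self.pfn, self.data_offset, self.data_size)

    @classmethod
    def unpack(cls, raw: bytes) -> "PageMapEntry":
        """Rebuild an entry from its serialized form."""
        return cls(*struct.unpack(ENTRY_FORMAT, raw))

entry = PageMapEntry(pfn=700, data_offset=0, data_size=100)
raw = entry.pack()
print(ENTRY_SIZE, PageMapEntry.unpack(raw))
```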
[0062] Take the memory pages contained in the hash table in Table 1 above as an example. Compressing memory page PFN 700 (i.e., page content A) yields a compressed data block that is stored at a starting offset of 0 and occupies 100 bytes. Compressing memory page PFN 701 (i.e., page content B) yields a compressed data block that is stored at a starting offset of 100 and occupies 2000 bytes. Compressing memory page PFN 703 (i.e., page content C) yields a compressed data block that is stored at a starting offset of 2100 and occupies 500 bytes.
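[0063] Therefore, a target mapping table is shown in Table 2:
Table 2
Row | PFN | Data Offset | Data Size (bytes)
0 | 700 | 0 | 100
1 | 701 | 100 | 2000
2 | 702 | 0 | 100
3 | 703 | 2100 | 500
4 | 704 | 0 | 100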
[0064] In one example, the target mapping table is written sequentially in an append-only manner, generating a corresponding mapping record for each memory page detected in a kernel dump file to be processed (regardless of whether it is a duplicate).
[0065] For example, when establishing the target mapping table, if the page content of a memory page is detected to be appearing for the first time, the storage location of the compressed data block corresponding to this new page content is filled into the entry for the unique identifier (such as the page frame number, PFN) of the current memory page. If the page content is not appearing for the first time, the storage location recorded for the page frame number at which this page content first appeared is copied into the entry. Whether a memory page is the first one with its page content can be confirmed by looking up its page frame number in the hash table constructed above, where "first" can be determined by the size of the page frame number within the set of page frame numbers corresponding to that page content.
[0066] For example, according to Table 1, memory page PFN 700 is the first memory page, in order of page frame number, in the set corresponding to page content A, so the storage location of the compressed data block corresponding to page content A is filled into its entry. Memory page PFN 704, according to Table 1, is not the first memory page in the set corresponding to page content A. Therefore, the storage location recorded for a page frame number that has already appeared in that set, such as PFN 700 or PFN 702, is copied into its entry.
[0067] In another example, the target mapping table can be built without identifying whether the current page content is appearing for the first time, i.e., without executing the copy step of the previous example. The storage location of the compressed data block is determined separately for each page frame number and recorded independently. It is not necessary to confirm whether the current page frame number is the first page frame number at which the same page content appears; the target mapping table is built directly from the correspondence between page frame numbers and compressed data blocks, which reduces the complexity of the instructions to be written. Each entry still records the storage location of the compressed data block, namely the starting offset (Data Offset) and the occupied byte length (Data Size).
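The first-occurrence rule can be sketched as below, using the PFN 700 to 704 example. To reproduce the block sizes quoted above (100, 2000, and 500 bytes), the sketch uses a fake compressor that simply emits a block of the stated length; this stand-in, the BLAKE2b digest used as the content hash, and the name `build_mapping_table` are all assumptions made for illustration.

```python
import hashlib

def build_mapping_table(pages, compress):
    """Return (mapping, storage_area), appending blocks and records in order."""
    first_location = {}         # content hash -> (offset, size) of the stored block
    storage_area = bytearray()  # append-only target storage area
    mapping = []                # one (pfn, offset, size) record per memory page
    for pfn, content in pages:
        key = hashlib.blake2b(content, digest_size=8).digest()
        if key not in first_location:
            # First appearance of this content: compress and store it once.
            block = compress(content)
            first_location[key] = (len(storage_area), len(block))
            storage_area += block
        # Later appearances copy the location recorded for the first one.
        mapping.append((pfn, *first_location[key]))
    return mapping, bytes(storage_area)

# Fake compressor that reproduces the block sizes of the worked example.
FAKE_SIZES = {b"A": 100, b"B": 2000, b"C": 500}

def fake_compress(content: bytes) -> bytes:
    return content[:1] * FAKE_SIZES[content[:1]]

PAGE = 4096  # assumed page size
pages = [
    (700, b"A" * PAGE),
    (701, b"B" * PAGE),
    (702, b"A" * PAGE),
    (703, b"C" * PAGE),
    (704, b"A" * PAGE),
]
mapping, area = build_mapping_table(pages, fake_compress)
for record in mapping:
    print(record)
```

Because the pages are visited in ascending page frame number order, the first appearance of a content is also the one with the smallest page frame number.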
[0068] In this embodiment, different memory pages are identified by page frame numbers, which are stored in a target mapping table. The target mapping table also includes the storage locations of compressed data blocks and stores the correspondence between each page frame number and the storage location of the compressed data blocks. The fault analysis tool can read the information from each memory page in the kernel dump file to be processed through the target mapping table, and decompress the corresponding compressed data blocks according to the page frame number order. After decompression, the restored memory pages are obtained, thereby restoring the original kernel dump file to be processed and completing the system fault analysis.
[0069] In an optional embodiment of the present invention, the target file is in Executable and Linkable Format (ELF), the target mapping table is stored in a first custom ELF segment of the ELF file, and the compressed data blocks are stored in a second custom ELF segment of the ELF file. That is, after deduplication and compression operations are performed on the kernel dump file, the generated target file is stored in ELF format. Most tools for fault analysis of kernel dump files can read ELF format; therefore, outputting the target file in ELF format broadens the range of tools that can use it.
[0070] In one example, the target file includes a file header and segment content; the segment content includes a first custom ELF segment and a second custom ELF segment; the file header includes an identifier of the compression algorithm used to compress the target compressed page, the location of the target mapping table, and the location of the target storage area. The first and second custom ELF segments are fixed areas within the segment content, used only to store the target mapping table and the target storage area, ensuring that the target mapping table and the target storage area have storage space and that this storage space is not occupied or contaminated by other data.
[0071] Specifically, the final constructed ELF file, serving as the target file, is optimized. The ELF file contains a header and segment content. The segment content stores information requiring significant storage space for saving and transmission, while the header stores metadata. Metadata is information about the organization of data, data fields, and their relationships. The metadata in the header includes descriptions of the data in the segment content, as well as parameters that aid in using the data, such as the location of the target storage area within the segment content and the name or identifier of the compression algorithm used to compress the data blocks in the target storage area.
[0072] In this embodiment, the data in the file header is used by the fault analysis application to accurately locate the target mapping table and the target storage area after obtaining the target file. Based on the target mapping table, the application finds the compressed data blocks corresponding to each memory page in the target storage area, and selects the corresponding decompression algorithm according to the compression algorithm corresponding to the identifier stored in the file header to decompress the compressed data blocks, so as to restore the complete kernel dump file to be processed and complete the fault analysis.
[0073] In an optional embodiment of the present invention, an operating system kernel dump file optimization method based on content deduplication and compression is provided, comprising: receiving a non-zero memory page data stream; calculating a hash value for each memory page using the xxHash algorithm, wherein the hash value is a content signature of the current page content; identifying memory pages with the same content based on the hash value, and forming a set; selecting a representative memory page for each set as a target compression page; compressing the page content of the target compression page using a compression algorithm such as Zstd; storing the compressed data block obtained in the compressed page data area (i.e., the target storage area); recording the mapping relationship between the page frame number of the target compressed page and the storage location (including the starting offset and the occupied byte length) of the corresponding compressed data block in a target mapping table; and constructing a target file containing the target mapping table and the compressed page data area (i.e., the target storage area). By eliminating non-zero page redundancy and compressing the page content of unique memory pages, the kernel dump file size is significantly reduced, and the dump efficiency is improved.
[0074] In one example, the processing flow of this application specifically includes the following steps: Step S21: Obtain the kernel dump file to be processed. Specifically, obtain the data stream representing the non-zero memory pages of the operating system kernel, i.e., the kernel dump file to be processed. This data stream can be data that has been preprocessed using existing mechanisms (such as the filtering options of the makedumpfile tool), for example, data obtained by filtering out zero pages or other pages of specified exclusion types from the original kernel dump file.
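As a small illustration of the preprocessing in step S21, the following sketch drops zero-filled pages from a stream of (page frame number, content) pairs. It does not reproduce how makedumpfile filters pages; the generator name `nonzero_pages` and the sample data are hypothetical.

```python
def nonzero_pages(pages):
    """Yield (pfn, content) pairs whose content has at least one non-zero byte."""
    for pfn, content in pages:
        if any(content):  # iterating bytes yields ints; any() is true if one is non-zero
            yield pfn, content

PAGE = 4096  # assumed page size
raw_dump = [
    (700, b"A" * PAGE),
    (701, bytes(PAGE)),                      # zero-filled page, dropped
    (702, b"\x00" * (PAGE - 1) + b"\x01"),   # a single non-zero byte, kept
]
kept = [pfn for pfn, _ in nonzero_pages(raw_dump)]
print(kept)
```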
[0075] Step S22: Calculate the content signature. Use the xxHash algorithm (preferably its 64-bit version, XXH64) to calculate a hash value as a content signature for each memory page in the kernel dump file obtained in step S21. The xxHash algorithm is preferred because of its high computation speed and sufficiently low hash collision probability, making it suitable for handling large-scale memory data.
[0076] Step S23: Identify duplicate pages. For example, a hash table data structure can be established and maintained. Maintenance refers to the process of building the table from the first data entry and continuously adding new data. This hash table uses the hash values calculated in step S22 as keys, and each hash value is associated with a list of page frame numbers (PFNs) containing all memory pages with that hash value as values. In other words, it constructs and stores the correspondence between all hash value keys and page frame number values calculated in step S22. By updating this hash table, memory pages with identical content (i.e., identical hash values) can be efficiently identified. The hash table is detailed in Table 1 above. Updating the hash table is as follows: When memory page PFN 701 is identified, and it is confirmed to be a completely new hash value (i.e., completely new page content), the new hash value is entered into the key area, and PFN 701 is entered into the corresponding value area. When memory page PFN 704 is identified, and its hash value confirms that its page content is consistent with that of PFN 700 and PFN 702 (all being page content A), PFN 704 is entered into the value area corresponding to page content A.
[0077] Step S24: Compress and store the unique content. Specifically, for each set of memory pages with the same page content represented by each hash value in the hash table maintained in step S23, select any one memory page content from each set as the representative of the current set. This representative is called the target compressed page. That is, for each key in the hash table, select any one (or the first) of its multiple corresponding values as the representative, obtain the page content of the memory page corresponding to the page frame number identified by the representative, and compress the page content using the Zstd compression algorithm to obtain the compressed data block corresponding to the representative (i.e., the target compressed page).
[0078] Each compressed data block corresponding to a target compressed page only needs to be stored once, in the target storage area within the target file. This target storage area is a contiguous, specifically designated, unique compressed page data area. "Specifically designated" describes the storage format, meaning that all compressed data blocks are stored together in a well-defined structure or space. In this embodiment, each compressed data block is stored in the target storage area specified within the segment content of the target file. "Unique" describes the storage content, meaning that the stored compressed data blocks are a deduplicated, non-repeating set.
[0079] Step S25: Create a target mapping table. Specifically, a Page Map Table is created as the target mapping table. This table is responsible for recording the mapping relationship between the page frame number (PFN) of each memory page and the storage state of its corresponding compressed data block. Specifically, for each PFN, the table stores the starting offset and the actual size (i.e., the byte length occupied by the compressed data block) of the compressed data block representing the page content in the target storage area (i.e., the unique compressed page data area).
[0080] Step S26: Update the final optimized kernel dump file, i.e., update the target file. The target file preferably uses the extended ELF format, which contains the target mapping table generated in step S25 and the target storage area generated in step S24. Optionally, the target mapping table is stored in a first custom ELF segment named .vmcore_pagemap, and the target storage area is stored in a second custom ELF segment named .vmcore_czdata.
[0081] An ELF file consists of two parts: a header and segment content. The header identifies the specific locations of different data within the segment content, while the segment content actually stores the data that needs to be saved or transmitted. Both the first and second custom ELF segments are located within the segment content of the ELF file.
[0082] Specifically, the target file originally included at least the target storage area. In this step, the target mapping table is written into the target file, thereby updating the target file. Finally, the updated target file is output for the fault analysis tool to identify and handle faults.
[0083] In addition, the target file may contain metadata necessary for reading or using relevant parameters, stored in the file header. Metadata includes identifiers recognizable by fault analysis tools, explicitly indicating the hash algorithm (xxHash, 64-bit version), compression algorithm (Zstd), target mapping table, and the specific location of the target storage area used by the current target file. For example, target files are shown in Table 3: Table 3:
[0084] For example, the starting address of the current target file is 0x0000. The area before the first custom ELF segment is the file header location. That is, the first four areas in the table are the file header locations, used to store metadata content; Section: .vmcore_pagemap and Section: .vmcore_czdata are the segment content areas of the current target file.
[0085] In the example above, the hash algorithm (XXH64) and the compression algorithm (Zstd) used by the current target file are stored in the 64 bytes of the initial ELF Header. The name of the first custom ELF segment is stored in the third region to indicate the storage location of the target mapping table in the current file; likewise, the name of the second custom ELF segment is stored to indicate the storage location of the compressed data blocks, i.e., the target storage area, in the current file.
[0086] It can be understood that Section .vmcore_pagemap in Table 3 is a simplified representation of the target mapping table in the first custom ELF segment. For instance, PFN 700 Entry (700, 0, 100) denotes memory page PFN 700: the page frame number is 700, and the corresponding compressed data block is stored at a starting offset of 0 with a length of 100 bytes. That is, PFN 700 Entry (700, 0, 100) is shorthand for the parameters in row 0 of Table 2, and so on.
[0087] Section .vmcore_czdata in Table 3, which is the second custom ELF segment, displays brief information about each compressed data block. For example, [Compressed Block A (100 bytes)] Offset: 0 indicates that the compressed data block A corresponding to page content A has a starting offset of 0 and occupies 100 bytes, and so on.
[0088] Specifically, the parameters stored in the file header are collectively referred to as metadata. For the hash and compression algorithms, the metadata can store either the algorithm name or a related algorithm identifier. When specific hash and compression algorithms are called during the process of obtaining the target file using this solution, identifiers for them are recorded. These recorded identifiers can be written directly into the file header as metadata, or they can first be converted into a format recognizable by the fault analysis tool and then written into the file header, so as to provide hash and compression algorithm information to the fault analysis tool.
[0089] Regarding the location information of the target mapping table and target storage area in the metadata, the target storage area is a pre-allocated block within the segment content used to store compressed data blocks. Therefore, when the target storage area is partitioned within the segment content, the offset of the target storage area relative to the start position of the segment content area can be written in the file header, which is the location information of the target storage area. Alternatively, the name of the second custom ELF segment partitioned for the target storage area can be used as the location information of the target storage area, as shown in Table 3. The second custom ELF segment only includes the target storage area, so the name of the second custom ELF segment can also be used as the location identifier of the target storage area. By querying the second custom ELF segment in the segment content by name, the location of the target storage area is confirmed. The location of the target mapping table can be obtained by retrieving its specific storage location after storing it in the segment content; or, since the target mapping table is stored in the first custom ELF segment, i.e., the starting position of the target mapping table and the first custom ELF segment is the same, the name of the first custom ELF segment can be used as the location of the target mapping table.
[0090] The stored metadata identifies the hash algorithm used (or directly stores the name of the hash algorithm). This allows fault analysis tools to interpret the key value in the target mapping table using the format of the corresponding hash value of the hash algorithm, avoiding problems such as insufficient bytes read leading to incorrect hash value retrieval and ensuring efficient hash value reading. Simultaneously, it improves the scalability of the target file format, supporting the addition of new features for the hash algorithm during future technological evolution.
[0091] The identifier (or name) of the compression algorithm in the metadata is used to provide the fault analysis tool with the decompression method for the compressed data block. This allows the fault analysis tool to query the target storage area by the page frame number of each memory page in the target mapping table, obtain the corresponding compressed data block, and then accurately decompress the compressed data block to restore the page content of the memory page.
[0092] Optionally, the metadata may also store the name of the target mapping table and the target storage area. This allows developers or tools to understand the file structure of the target file and supports debugging.
[0093] The locations of the target mapping table and the target storage area in the metadata provide the fault analysis tool with accurate target locations, so that the tool does not have to blindly traverse the entire target file in search of the target data, which would be both time-consuming and unreliable. For example, the names recorded in the metadata may be .vmcore_pagemap for the first custom ELF segment, which stores the target mapping table, and .vmcore_czdata for the second custom ELF segment, which stores the target storage area.
[0094] For example, since metadata is stored in the file header, after the fault analysis tool obtains the target file, it first identifies the file header in sequence, obtains the compression algorithm used, and gets the location of the target mapping table and the target storage area in the segment content; based on the aforementioned locations, it reads the target mapping table and the data in the target storage area from the segment content. Further, based on the mapping relationship between page frame numbers and compressed data block locations in the target mapping table, it searches for the compressed data block corresponding to the page frame number in the target storage area, and decompresses the compressed data block according to the decompression method corresponding to the compression algorithm obtained in the file header to obtain the memory page corresponding to the page frame number.
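The header-driven read flow can be sketched with a deliberately simplified container. This is not an ELF file: the fixed 40-byte header, the magic value `VMCZ`, the algorithm identifier, and the function names are all hypothetical, and `zlib` stands in for the compression algorithm. The sketch only shows the order of operations: read the header, locate the mapping table and the storage area, look up the page frame number, then decompress with the algorithm named in the header.

```python
import struct
import zlib

MAGIC = b"VMCZ"                       # hypothetical magic value
# magic, algorithm id, map offset, map size, data offset, data size
HEADER = struct.Struct("<4sIQQQQ")
ENTRY = struct.Struct("<QQI")         # pfn, data_offset, data_size
ALGO_ZLIB = 1                         # hypothetical identifier (zlib stand-in)
DECOMPRESSORS = {ALGO_ZLIB: zlib.decompress}

def write_target(mapping, storage_area: bytes) -> bytes:
    """Lay out header, mapping table, and storage area back to back."""
    table = b"".join(ENTRY.pack(*record) for record in mapping)
    map_off = HEADER.size
    data_off = map_off + len(table)
    header = HEADER.pack(MAGIC, ALGO_ZLIB, map_off, len(table),
                         data_off, len(storage_area))
    return header + table + storage_area

def read_page(target: bytes, wanted_pfn: int) -> bytes:
    """Restore one memory page using only information found in the file."""
    magic, algo, map_off, map_size, data_off, _ = HEADER.unpack_from(target, 0)
    if magic != MAGIC:
        raise ValueError("not a recognised target file")
    decompress = DECOMPRESSORS[algo]   # chosen from the header's identifier
    for pos in range(map_off, map_off + map_size, ENTRY.size):
        pfn, offset, size = ENTRY.unpack_from(target, pos)
        if pfn == wanted_pfn:
            start = data_off + offset  # offset is relative to the storage area
            return decompress(target[start:start + size])
    raise KeyError(wanted_pfn)

PAGE = 4096  # assumed page size
block_a = zlib.compress(b"A" * PAGE)
block_b = zlib.compress(b"B" * PAGE)
storage_area = block_a + block_b
mapping = [
    (700, 0, len(block_a)),
    (701, len(block_a), len(block_b)),
    (702, 0, len(block_a)),            # duplicate of PFN 700, same location
]
target = write_target(mapping, storage_area)
print(len(read_page(target, 702)))
```

A real implementation would instead locate the two custom segments through the ELF header and its tables; the linear scan of the mapping table is kept here only for brevity.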
[0095] In summary, this application provides a file processing method that, during the kernel dump file generation process, accurately identifies and eliminates redundancy in non-zero pages with identical content, and then performs targeted compression of the page content of the unique memory pages that remain after deduplication. The beneficial effects include the following. First, file size is significantly reduced: content deduplication avoids storing redundant data, and compression further reduces the size of the unique memory pages. The combination of these two methods results in a final target file far smaller than a kernel dump file that undergoes only page-by-page compression or only page filtering. Second, because compression is performed only once on each unique page after deduplication, redundant compression operations on large amounts of duplicate content are avoided, which saves CPU resources and shortens the overall dump file generation time.
[0096] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.
[0097] Reference Figure 2 The diagram shows a structural block diagram of a document processing apparatus according to the present invention, which may include: The deduplication module 201 is used to perform deduplication operations on the memory pages according to the page content corresponding to each memory page in the kernel dump file to be processed, so as to obtain a set of target compressed pages; Compression module 202 is used to compress the target compressed pages respectively to obtain compressed data blocks corresponding to each target compressed page; The mapping module 203 is used to obtain the correspondence between each memory page and the compressed data block according to the page content; The module 204 is used to establish a target mapping table based on the compressed data block and store the target mapping table in the target file corresponding to the kernel dump file; the target mapping table includes the mapping relationship between each memory page and the storage location of its corresponding compressed data block in the target file.
[0098] Optionally, the deduplication module 201 includes: The encoding module is used to encode each memory page according to the page content corresponding to each memory page in the kernel dump file to be processed, so as to obtain the encoding value corresponding to each memory page. The acquisition module is used to acquire a memory page corresponding to each of the encoded values, forming a set of target compressed pages. Optionally, the encoding module is specifically used to: encode each memory page using a hash algorithm, and use the hash value corresponding to each memory page as the encoding value of the memory page.
[0099] Optionally, the establishment module 204 includes: The number acquisition module, used to acquire the page frame number corresponding to each memory page. The mapping acquisition module, used to obtain the mapping relationship between the page frame number and the storage location of the compressed data block in the target file according to the correspondence between each memory page and the compressed data block; wherein the storage location of the compressed data block in the target file includes the starting offset and the byte length occupied by the compressed data block in the target file. The target establishment module, used to establish the target mapping table based on the mapping relationship.
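As a rough illustration of the mapping table described above, the following sketch maps each page frame number to the starting offset and byte length of its compressed data block. It makes several assumptions: `zlib` from the standard library stands in for LZO, Snappy, or Zstd; SHA-256 identifies identical page content; offsets are relative to the start of a single data region; and all names are hypothetical.

```python
import hashlib
import zlib


def build_mapping_table(pages_by_pfn):
    """Return (blob, table), where table maps pfn -> (offset, length) in blob."""
    blob = bytearray()
    location_by_digest = {}
    table = {}
    for pfn, page in sorted(pages_by_pfn.items()):
        digest = hashlib.sha256(page).digest()
        if digest not in location_by_digest:
            # zlib is only a stand-in for the compression algorithms
            # named in the source (LZO, Snappy, Zstd).
            block = zlib.compress(page)
            location_by_digest[digest] = (len(blob), len(block))
            blob += block
        # Pages with identical content share one compressed data block.
        table[pfn] = location_by_digest[digest]
    return bytes(blob), table


pages_by_pfn = {0x10: b"A" * 4096, 0x11: b"B" * 4096, 0x25: b"A" * 4096}
blob, table = build_mapping_table(pages_by_pfn)
```

Here page frame numbers 0x10 and 0x25 resolve to the same offset and length, so only two compressed data blocks are written for three memory pages.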
[0100] Optionally, the file processing apparatus further includes: A preprocessing module, used to preprocess the original kernel dump file to obtain the kernel dump file to be processed; wherein the preprocessing includes removing zero-filled pages.
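The zero-filled page removal can be sketched as a simple filter. This is only an illustration under the assumption that pages are available as byte strings keyed by page frame number; the function name is hypothetical, and real tools such as makedumpfile work on the dump file's own structures.

```python
def drop_zero_pages(pages_by_pfn):
    """Return only the pages that contain at least one non-zero byte."""
    # any() over a bytes object is True as soon as one byte is non-zero.
    return {pfn: page for pfn, page in pages_by_pfn.items() if any(page)}


original = {1: bytes(4096), 2: b"\x00" * 4095 + b"\x01", 3: b"A" * 4096}
filtered = drop_zero_pages(original)
```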
[0101] Optionally, the target file is in the Executable and Linkable Format (ELF); the target mapping table is stored in a first custom ELF segment of the ELF file, and the compressed data block is stored in a second custom ELF segment of the ELF file.
[0102] Optionally, the target file includes a file header and segment content; The segment content includes the first custom ELF segment and the second custom ELF segment; The file header includes the identifier of the compression algorithm used to compress the target compressed page, the location of the target mapping table, and the location of the target storage area.
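The three header fields listed above can be illustrated by packing them into a fixed-size record. This sketch is not the standard ELF header layout, and the source gives no field widths: the magic value, the numeric algorithm identifiers, the 64-bit offsets, and the reading of "target storage area" as the region holding the compressed data blocks are all assumptions made for illustration.

```python
import struct

# Hypothetical layout: 4-byte magic, 16-bit algorithm identifier,
# 64-bit mapping table offset, 64-bit storage area offset (little-endian).
HEADER = struct.Struct("<4sHQQ")
MAGIC = b"KDMP"  # hypothetical magic value
ALGO_IDS = {"lzo": 1, "snappy": 2, "zstd": 3}  # hypothetical identifiers
ALGO_NAMES = {value: name for name, value in ALGO_IDS.items()}


def pack_header(algorithm, table_offset, data_offset):
    """Serialize the algorithm identifier and the two locations."""
    return HEADER.pack(MAGIC, ALGO_IDS[algorithm], table_offset, data_offset)


def unpack_header(raw):
    """Parse a header and return (algorithm, table_offset, data_offset)."""
    magic, algo_id, table_offset, data_offset = HEADER.unpack(raw[:HEADER.size])
    if magic != MAGIC:
        raise ValueError("unexpected magic value")
    return ALGO_NAMES[algo_id], table_offset, data_offset


header = pack_header("zstd", 64, 4096)
parsed = unpack_header(header)
```

Recording the algorithm identifier in the header lets a reader pick the matching decompressor before it touches the mapping table or the compressed data blocks.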
[0103] This invention provides a file processing apparatus. In related technologies, redundancy processing in kernel dump files typically involves using existing tools (such as makedumpfile) to filter and compress memory pages in the kernel dump file. The filtering process usually only removes zero-filled pages from the memory pages. However, during normal system operation, kernel page merging, copy-on-write, and other mechanisms still generate a large amount of identical data, resulting in a large number of identical non-zero memory pages in the generated kernel dump file. Therefore, even after zero-filled pages are filtered out, redundancy remains. The solution provided in this application first performs deduplication on the memory pages in the kernel dump file, then compresses the deduplicated memory pages to obtain compressed data blocks, and then constructs the target file based on the compressed data blocks. Because the memory pages in the kernel dump file are deduplicated before compression, the size of the resulting target file can be reduced. At the same time, since only the deduplicated memory pages need to be compressed, the efficiency of the compression operation can be improved.
[0104] Furthermore, by performing compression on each target compressed page (i.e., the deduplicated memory page), different compressed data blocks can be obtained, with each compressed data block corresponding one-to-one with the page content of the target compressed page. Based on the page content, a further correspondence between all memory pages and compressed data blocks can be established, and a target mapping table can be built according to this correspondence. When the fault analysis tool processes the final target file obtained from this solution, it can use the correspondence between all memory pages and each compressed data block in the target mapping table to decompress the compressed data blocks according to the position or order of each memory page to restore the complete kernel dump file to be processed, thus completing the fault analysis.
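The restoration step described above can be sketched as follows. The sketch assumes the mapping table is available as a dictionary from page frame number to (offset, length) within one region of compressed data blocks, and uses `zlib` as a stand-in for LZO, Snappy, or Zstd; the names are hypothetical.

```python
import zlib


def restore_pages(blob, table):
    """Rebuild every memory page, in page frame number order."""
    decompressed = {}
    restored = {}
    for pfn in sorted(table):
        location = table[pfn]
        if location not in decompressed:
            offset, length = location
            # Each shared compressed data block is decompressed only once.
            decompressed[location] = zlib.decompress(blob[offset:offset + length])
        restored[pfn] = decompressed[location]
    return restored


block_a = zlib.compress(b"A" * 4096)
block_b = zlib.compress(b"B" * 4096)
blob = block_a + block_b
table = {
    0x10: (0, len(block_a)),
    0x11: (len(block_a), len(block_b)),
    0x25: (0, len(block_a)),  # duplicate of page 0x10
}
restored = restore_pages(blob, table)
```

Because duplicate pages point at the same storage location, the reader recovers three pages from two compressed data blocks.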
[0105] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.
[0106] Referring to Figure 3, which is a schematic diagram of the structure of the electronic device provided in an embodiment of the present invention. As shown in Figure 3, the electronic device includes: a processor, a memory, a communication interface, and a communication bus. The processor, the memory, and the communication interface communicate with each other through the communication bus. The memory is used to store at least one executable instruction, which causes the processor to perform the steps of the file processing method of the aforementioned embodiment.
[0107] This invention provides a non-transitory computer-readable storage medium that, when the instructions in the storage medium are executed by a terminal's program or processor, enables the terminal to perform the steps of the file processing method described in the foregoing embodiments. The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0108] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, embodiments of the present invention can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of the present invention can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0109] Embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0110] These computer program instructions may also be stored in a computer-readable storage medium capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0111] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0112] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0113] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A file processing method, characterized in that, The method includes: Based on the page content corresponding to each memory page in the kernel dump file to be processed, deduplication operations are performed on the memory pages respectively to obtain a set of target compressed pages; The target compressed pages are compressed separately to obtain compressed data blocks corresponding to each target compressed page; Based on the page content, obtain the correspondence between each memory page and the compressed data block; A target mapping table is established based on the compressed data blocks, and the target mapping table is stored in the target file corresponding to the kernel dump file; the target mapping table includes the mapping relationship between each memory page and the storage location of its corresponding compressed data block in the target file.
2. The file processing method according to claim 1, characterized in that, the step of performing deduplication operations on the memory pages to obtain a set of target compressed pages includes: Based on the page content corresponding to each memory page in the kernel dump file to be processed, each memory page is encoded to obtain the encoded value corresponding to each memory page; One memory page corresponding to each encoded value is obtained to form the set of target compressed pages.
3. The file processing method according to claim 2, characterized in that, The step of encoding each memory page to obtain a corresponding encoded value for each memory page includes: Each memory page is encoded using a hash algorithm, and the hash value corresponding to each memory page is used as the encoded value of the memory page.
4. The file processing method according to claim 1, characterized in that, The step of establishing a target mapping table based on the compressed data blocks includes: Obtain the page frame number corresponding to each memory page; Based on the correspondence between each memory page and the compressed data block, a mapping relationship is obtained between the page frame number and the storage location of the compressed data block in the target file; wherein, the storage location of the compressed data block in the target file includes: the starting offset and the byte length occupied by the compressed data block in the target file; The target mapping table is established based on the mapping relationship.
5. The file processing method according to claim 1, characterized in that, The method further includes: The original kernel dump file is preprocessed to obtain the kernel dump file to be processed; wherein, the preprocessing includes removing zero-filled pages.
6. The file processing method according to claim 1 or 4, characterized in that, the target file is in the Executable and Linkable Format (ELF); the target mapping table is stored in a first custom ELF segment of the ELF file, and the compressed data block is stored in a second custom ELF segment of the ELF file.
7. The file processing method according to claim 6, characterized in that, the target file includes a file header and segment content; The segment content includes the first custom ELF segment and the second custom ELF segment; The file header includes the identifier of the compression algorithm used to compress the target compressed page, the location of the target mapping table, and the location of the target storage area.
8. A file processing apparatus, characterized in that, the apparatus includes: A deduplication module, used to perform deduplication operations on the memory pages according to the page content corresponding to each memory page in the kernel dump file to be processed, so as to obtain a set of target compressed pages; A compression module, used to compress the target compressed pages respectively to obtain compressed data blocks corresponding to each target compressed page; A mapping module, used to obtain the correspondence between each memory page and the compressed data block according to the page content; An establishment module, used to establish a target mapping table based on the compressed data blocks and store the target mapping table in the target file corresponding to the kernel dump file; the target mapping table includes the mapping relationship between each memory page and the storage location of its corresponding compressed data block in the target file.
9. An electronic device, characterized in that, it includes: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other via the communication bus; The memory is used to store at least one executable instruction that causes the processor to perform the steps of the file processing method as described in any one of claims 1 to 7.
10. A readable storage medium, characterized in that, The readable storage medium stores a program or instructions that, when executed by a processor, implement the steps of the file processing method as described in any one of claims 1 to 7.
11. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the steps of the file processing method as described in any one of claims 1 to 7.