Method and system for merging Delta small files based on Spark
A Spark-based technique for merging small files in Delta tables, classified under file systems, file system functions, and file access structures. It addresses the unsatisfactory data processing performance caused by large numbers of small files; merging them improves read efficiency and makes the data easier to process.
Examples
Embodiment 1
[0054] As shown in Figure 1, the Spark-based method of the present invention merges small files in Delta. The method proceeds as follows:
[0055] S1. Use Spark to read the DeltaLog files and parse them to obtain the metadata of each data file;
[0056] S2. Following the merge strategy, Spark counts the number of small files and the total number of files based on the sizes of all files;
[0057] S3. Compute statistics over the files and, from those statistics, generate the metadata that describes the merge, CompactionMetadata;
[0058] S4. Spark decides from the merge metadata and the file merge strategy whether the files need to be merged:
[0059] ① if so, determine the merge rules and proceed to step S5;
[0060] ② if not, exit;
[0061] S5. Spark determines the size and number of the small files to be merged and of the target files according to the merge policy and me...
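Steps S1 through S4, and the target-file count mentioned in S5, can be sketched in plain Python without Spark to show the bookkeeping involved. This is a minimal sketch under stated assumptions, not the patented implementation: the `CompactionMetadata` fields, the 128 MiB small-file threshold, and the `min_small_files` policy are hypothetical. Only the `add` and `remove` action layout of the log entries follows the open Delta transaction log format.

```python
import json
import math
from dataclasses import dataclass

SMALL_FILE_BYTES = 128 * 1024 * 1024  # assumed threshold, not from the source


@dataclass
class CompactionMetadata:  # hypothetical field layout
    total_files: int
    small_files: int
    small_bytes: int


def live_files(log_lines):
    """S1: replay log actions and return {path: size} for files still live."""
    files = {}
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            files[action["add"]["path"]] = action["add"]["size"]
        elif "remove" in action:
            files.pop(action["remove"]["path"], None)
    return files


def build_metadata(files, threshold=SMALL_FILE_BYTES):
    """S2 and S3: count small files and total files, then summarize them."""
    small = [size for size in files.values() if size < threshold]
    return CompactionMetadata(len(files), len(small), sum(small))


def should_merge(meta, min_small_files=2):
    """S4: decide whether merging is worthwhile (assumed policy)."""
    return meta.small_files >= min_small_files


def target_file_count(meta, target_bytes=SMALL_FILE_BYTES):
    """Part of S5: number of output files needed to hold the small files."""
    return max(1, math.ceil(meta.small_bytes / target_bytes))


# Illustrative log entries; the paths and sizes are made up.
log = [
    '{"add": {"path": "part-0.parquet", "size": 1048576}}',
    '{"add": {"path": "part-1.parquet", "size": 2097152}}',
    '{"add": {"path": "part-2.parquet", "size": 268435456}}',
    '{"remove": {"path": "part-0.parquet"}}',
    '{"add": {"path": "part-3.parquet", "size": 4194304}}',
]
meta = build_metadata(live_files(log))
```

In this example three files remain live after the `remove` action is replayed, two of them fall below the assumed threshold, and `should_merge(meta)` is true. In a real deployment the same counting would run as a Spark aggregation over the parsed log rather than over an in-memory list.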
Embodiment 2
[0087] As shown in Figure 2, the Spark-based system of the present invention merges small files in Delta. The system comprises:
[0088] The acquisition module, which obtains the absolute path and size of the files of the current table and directory through DeltaLog; specifically, DeltaLog retrieves all files of the current snapshot and records the absolute path, file size, and recording time of each file;
[0089] The partition module, which uses Spark to derive the partition from the absolute path by splitting on the path separator, and then obtains the CompactionMetadata of the current partition, thereby converting DeltaLog information into merge metadata; the key code is as follows:
[0090]
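The key code referenced in the source is not reproduced in this text. As an illustration only, the partition module's idea (splitting an absolute path on the separator to recover the partition, then accumulating per-partition statistics) might look like the following sketch. It assumes POSIX-style paths and Hive-style `key=value` partition directories; the `CompactionMetadata` fields and both function names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class CompactionMetadata:  # hypothetical field layout
    partition: str
    file_count: int = 0
    total_bytes: int = 0


def partition_of(table_root, file_path):
    """Return the key=value directory segments between table root and file."""
    relative = PurePosixPath(file_path).relative_to(table_root)
    # Drop the file name, keep only directories that look like partitions.
    segments = [part for part in relative.parts[:-1] if "=" in part]
    return "/".join(segments)


def metadata_by_partition(table_root, files):
    """Group (absolute_path, size) records by partition and sum their sizes."""
    result = {}
    for path, size in files:
        key = partition_of(table_root, path)
        meta = result.setdefault(key, CompactionMetadata(key))
        meta.file_count += 1
        meta.total_bytes += size
    return result


# Illustrative records; the table root, paths, and sizes are made up.
root = "/warehouse/events"
records = [
    ("/warehouse/events/date=2024-01-01/part-0.parquet", 100),
    ("/warehouse/events/date=2024-01-01/part-1.parquet", 250),
    ("/warehouse/events/date=2024-01-02/part-0.parquet", 400),
]
by_partition = metadata_by_partition(root, records)
```

An unpartitioned table maps every file to the empty partition key, so the same grouping still yields a single `CompactionMetadata` entry for the whole table.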
[0091] The selection module, which uses Spark to select a merge algorithm according to the CompactionMetadata and to perform the merge. Spark provides different merge algori...
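The specific merge algorithms are cut off in this text, so the following is not one of them. It is a generic first-fit-decreasing bin-packing sketch, shown only to illustrate what a size-driven merge plan can look like: small files are grouped so that each group stays at or below a target size. The function name `plan_merge_groups` and its parameters are hypothetical.

```python
def plan_merge_groups(files, target_bytes):
    """Pack small files into groups of at most target_bytes each.

    files is a list of (path, size) pairs. Files already at or above the
    target size are left alone, and groups that would contain a single
    file are dropped because rewriting one file merges nothing.
    """
    groups = []  # each entry: [remaining_capacity, [paths]]
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if size >= target_bytes:
            continue  # already large enough
        for group in groups:
            if group[0] >= size:  # first group with enough room
                group[0] -= size
                group[1].append(path)
                break
        else:
            groups.append([target_bytes - size, [path]])
    return [paths for _, paths in groups if len(paths) > 1]


# Illustrative input; names and sizes are made up.
plan = plan_merge_groups(
    [("a", 60), ("b", 50), ("c", 40), ("d", 30), ("e", 200)],
    target_bytes=100,
)
```

Sorting by size first tends to fill groups more tightly than taking files in arrival order, at the cost of an extra sort. Each resulting group would then be read and rewritten as one output file.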
Embodiment 3
[0103] An embodiment of the present invention also provides an electronic device comprising a memory and at least one processor;
[0104] wherein the memory stores computer-executable instructions;
[0105] and the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the Spark-based method for merging small Delta files of any of the embodiments.