Method and system for merging Delta small files based on Spark

Small-file merging technology in the field of file systems, file system functions, and file access structures. It addresses unsatisfactory data processing performance, is easy to apply, improves read efficiency, and has good promotion and use value.

Active Publication Date: 2021-01-19
SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

The method mainly applies when a system migrates from a traditional relational database to a big data platform, or when, as business volume grows, the steadily increasing historical data means the traditional database can no longer meet data processing performance requirements, so the data must be moved to Delta with Spark as the computing engine.
Compared with Hive, Delta provides HDFS-based update and delete functions for big data, but because of the limitations of Delta's own design for updates, the continuous insertion of data generates a large number of small files.

Examples

Embodiment 1

[0054] As shown in Figure 1, the Spark-based method of the present invention for merging Delta small files proceeds as follows:

[0055] S1. Use Spark to read the DeltaLog files and parse the DeltaLog to obtain the metadata of each data file;

[0056] S2. Based on the merge strategy, Spark counts the number of small files and the total number of files according to the sizes of all files;

[0057] S3. Perform statistical operations on the files and, from the resulting statistics, generate the metadata CompactionMetadata that describes the merge;

[0058] S4. Spark judges whether the files need to be merged according to the merge metadata and the file merge strategy:

[0059] ① if yes, determine the merge rules and execute step S5;

[0060] ② if not, exit;

[0061] S5. Spark determines the size and quantity of small files to be merged and target files according to the merge policy and me...
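
Steps S2 to S5 can be sketched in plain Python, without Spark, to make the decision logic concrete. This is a minimal sketch under stated assumptions: the `MergeStrategy` thresholds, the fields of `CompactionMetadata`, and all function names are hypothetical, since the text above names CompactionMetadata but gives neither its contents nor the values used by the merge strategy.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class MergeStrategy:
    """Hypothetical merge strategy; every threshold here is an assumption."""
    small_file_bytes: int = 32 * 1024 * 1024    # files below this count as "small"
    target_file_bytes: int = 128 * 1024 * 1024  # desired size of each merged file
    min_small_files: int = 2                    # merge only with at least this many small files


@dataclass(frozen=True)
class CompactionMetadata:
    """Hypothetical statistics describing a possible merge."""
    total_files: int
    small_files: int
    small_bytes: int


def collect_metadata(file_sizes, strategy):
    """S2/S3: count small files and all files, then summarise the result."""
    small = [size for size in file_sizes if size < strategy.small_file_bytes]
    return CompactionMetadata(len(file_sizes), len(small), sum(small))


def needs_merge(meta, strategy):
    """S4: decide whether a merge is worthwhile."""
    return meta.small_files >= strategy.min_small_files


def target_file_count(meta, strategy):
    """S5: number of target files needed to hold the small files' bytes."""
    return max(1, ceil(meta.small_bytes / strategy.target_file_bytes))


if __name__ == "__main__":
    MB = 1024 * 1024
    strategy = MergeStrategy()
    meta = collect_metadata([4 * MB, 8 * MB, 200 * MB, 16 * MB], strategy)
    if needs_merge(meta, strategy):
        print(meta, target_file_count(meta, strategy))
```

In a Spark job, a count like the one returned by `target_file_count` would typically be used to repartition the rows read from the small files before writing them back, but that wiring is outside this sketch.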

Embodiment 2

[0087] As shown in Figure 2, the Spark-based system of the present invention for merging Delta small files comprises:

[0088] The acquisition module obtains the absolute path and size of the files in the current table and directory through DeltaLog; specifically, DeltaLog provides all files of the current snapshot, and the absolute path, file size, and recording time of each file are recorded;
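
As an illustration of what the acquisition module reads: each commit file in a Delta table's `_delta_log` directory is a sequence of JSON actions, one per line, where an `add` action carries a data file's `path`, `size`, and `modificationTime`, and a `remove` action retires a file. The sketch below replays such lines in plain Python to recover the files of the current snapshot. It is not the patent's implementation; the function name `snapshot_files` is hypothetical, and the sketch assumes that paths are relative to the table root, that lines are supplied in commit order, and that checkpoint files are ignored.

```python
import json


def snapshot_files(log_lines, table_root):
    """Replay Delta log actions; return {absolute_path: (size, modification_time)}."""
    live = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "add" in action:
            add = action["add"]
            # A later add for the same path overwrites the earlier entry.
            live[add["path"]] = (add["size"], add["modificationTime"])
        elif "remove" in action:
            # A remove action drops the file from the current snapshot.
            live.pop(action["remove"]["path"], None)
        # Other actions (metaData, protocol, commitInfo) carry no file entries.
    root = table_root.rstrip("/")
    return {f"{root}/{path}": info for path, info in live.items()}
```

Replaying `add` and `remove` in order is what makes the result a snapshot rather than a list of every file ever written, which matters because merged-away small files must not be counted again.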

[0089] The partition module uses Spark to derive the partition from the absolute path, splitting the path on the path separator to obtain the partition information, and uses Spark to obtain the CompactionMetadata of the current partition, thereby converting the DeltaLog information into merge metadata; the key code is as follows:

[0090]
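
The key code is not reproduced in this text. Purely as an illustration of the idea in [0089], the following plain-Python sketch derives a partition from an absolute path by splitting on the path separator and aggregates per-partition statistics. Everything named here is an assumption: the `CompactionMetadata` fields, the `SMALL_FILE_BYTES` threshold, the helper names, and the Hive-style `key=value` directory layout are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass

SMALL_FILE_BYTES = 32 * 1024 * 1024  # assumed small-file threshold


@dataclass
class CompactionMetadata:
    """Hypothetical per-partition merge statistics."""
    partition: str
    total_files: int = 0
    small_files: int = 0
    total_bytes: int = 0


def partition_of(absolute_path, table_root):
    """Return the partition directories, e.g. 'year=2021/month=01' ('' if unpartitioned)."""
    if not absolute_path.startswith(table_root):
        raise ValueError(f"{absolute_path} is not under {table_root}")
    relative = absolute_path[len(table_root):].strip("/")
    directories = relative.split("/")[:-1]  # drop the file name itself
    return "/".join(d for d in directories if "=" in d)


def metadata_by_partition(files, table_root):
    """Group {absolute_path: size} entries into one CompactionMetadata per partition."""
    result = {}
    for path, size in files.items():
        part = partition_of(path, table_root)
        meta = result.setdefault(part, CompactionMetadata(part))
        meta.total_files += 1
        meta.total_bytes += size
        if size < SMALL_FILE_BYTES:
            meta.small_files += 1
    return result
```

Keeping the statistics per partition means a merge decision can be made for each partition separately, so a partition with only large files is left untouched.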

[0091] The selection module uses Spark to select a merge algorithm according to the CompactionMetadata and to perform the merge. Spark provides different merge algori...

Embodiment 3

[0103] An embodiment of the present invention also provides an electronic device, including a memory and at least one processor;

[0104] wherein the memory stores computer-executable instructions;

[0105] the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor carries out the Spark-based method for merging small files as in any embodiment.

Abstract

The invention discloses a method and a system for merging Delta small files based on Spark, belonging to the field of big data storage and computing applications. The technical problem to be solved is how to combine Spark with Delta to quickly locate and merge small files. The adopted technical scheme is as follows: S1, Spark reads the DeltaLog files, stores them in a database, and parses the DeltaLog to obtain the metadata of each data file; S2, based on a merge strategy, Spark counts the number of small files and the total number of files according to the sizes of all files; S3, statistical operations are performed on the files, and the metadata CompactionMetadata describing the merge is generated from the statistics; S4, Spark judges whether the files need to be merged according to the merge metadata and the file merge strategy: (1) if yes, the merge rules are determined and step S5 is executed; (2) if not, the method exits; S5, Spark determines the sizes and numbers of the small files to be merged and of the target files according to the merge strategy and the metadata.

Description

Technical field

[0001] The invention relates to the field of big data storage and computing applications, in particular to a Spark-based method and system for merging Delta small files.

Background technology

[0002] The big data strategy refers to taking big data as a basic strategic resource, fully implementing actions to promote the development of big data, accelerating the sharing, opening up, development, and application of data resources, and supporting industrial transformation and upgrading and social governance innovation. The most valuable resource in the future will therefore be data, and how to collect, store, and compute on data has become a hot topic. Delta and Spark are currently the most widely used technologies for storage and computing in the field of big data, and have attracted widespread attention from the industry.

[0003] Spark is a memory-based distributed computing framework and has a high degree of support for the Hadoop ecosystem, such as supporting readin...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/17, G06F16/13, G06F16/182
CPC: G06F16/1724, G06F16/13, G06F16/182, Y02D10/00
Inventor: 周永进, 刘传涛, 张晖, 高传集
Owner: SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD