Spark-based Cassandra data import method, device, equipment and medium

A data import technology in the field of data processing, which achieves the effects of uniform data distribution, reduced impact on the Cassandra cluster, and a reduced number of small files.

Active Publication Date: 2022-07-05
同盾(广州)科技有限公司

AI Technical Summary

Problems solved by technology

[0005] In order to overcome the deficiencies of the prior art, one objective of the present invention is to provide a Spark-based Cassandra data import method that calculates the number of partitions according to the SSTable single-file size and the total data volume, and divides the data equally according to token values, so as to prevent the problem of data imbalance across partitions when importing into Cassandra.



Examples


Embodiment 1

[0050] Embodiment 1 provides a Spark-based Cassandra data import method, which aims to prevent data imbalance and data skew by dividing the data to be imported equally among the partitions, and thereby to reduce the probability of memory overflow.

[0051] Spark is a unified analysis engine for large-scale data processing. Spark provides a comprehensive and unified framework for big data processing, managing datasets and data sources of different natures (text data, graph data, streaming data, etc.), whether batch or real-time.

[0052] Referring to Figure 1, a Spark-based Cassandra data import method includes the following steps:

[0053] S110, obtain the data volume of the data to be imported and the SSTable single file size, and calculate the required number of partitions N according to the data volume and the SSTable single file size;

[0054] The SSTable in S110 is the basic storage unit of Cassandra, and the size of a single SSTable file is se...
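As an illustration of step S110, a minimal sketch of the partition-count calculation follows. The variable names and the ceiling-based rounding rule are assumptions for illustration; the excerpt does not give the exact formula.

```scala
// Sketch of step S110: derive the number of partitions N from the total volume
// of the data to be imported and the configured size of a single SSTable file.
// Names and the rounding strategy are illustrative assumptions.
object PartitionCount {
  def requiredPartitions(totalDataBytes: Long, sstableSingleFileBytes: Long): Int = {
    require(sstableSingleFileBytes > 0, "SSTable single-file size must be positive")
    // Round up so the last, partially filled interval still gets its own partition.
    math.max(1, math.ceil(totalDataBytes.toDouble / sstableSingleFileBytes).toInt)
  }
}

// Example: 500 GiB of input and a 1 GiB target SSTable size give N = 500.
val n = PartitionCount.requiredPartitions(500L << 30, 1L << 30)
```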

Embodiment 2

[0081] Embodiment 2 builds on Embodiment 1 and mainly improves the parallel processing procedure.

[0082] After Spark completes the partition interval calculation and the shuffle partition sorting, i.e., after steps S110-S130, directly using CQLSSTableWriter+SSTableLoader to import the data into Cassandra makes the parallelism and traffic difficult to control, which affects the performance of the Cassandra cluster.

[0083] Therefore, in this embodiment, after the SSTable files are generated in step S140 of Embodiment 1, a step of copying the SSTable files to a distributed file system is added and the copy path is recorded, so that the degree of parallelism can be controlled.

[0084] In this embodiment, HDFS is selected as the distributed file system.
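A minimal sketch of this extra copy step, assuming the standard Hadoop FileSystem client API and an illustrative output layout (the directory names are not from the patent):

```scala
// After CQLSSTableWriter has written a partition's SSTable files into a local
// directory on an executor, copy that directory to HDFS and return the copy
// path so the driver can record it and later control how many directories are
// loaded in parallel.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def copySSTableDirToHdfs(localSSTableDir: String, hdfsBaseDir: String, partitionId: Int): String = {
  val fs   = FileSystem.get(new Configuration()) // assumes HDFS config is on the classpath
  val dest = new Path(s"$hdfsBaseDir/partition-$partitionId")
  fs.copyFromLocalFile(new Path(localSSTableDir), dest) // upload the generated SSTable files
  dest.toString                                         // the recorded copy path
}
```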

[0085] Referring to Figure 4, this embodiment includes the following steps:

[0086] S210. Calculate the parallel number M according to the number of Cassandra nodes;

[0087] Cass...
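The excerpt is truncated here, so the exact rule relating M to the node count is not visible; the sketch below assumes a simple proportional rule and a hypothetical loader helper, purely for illustration.

```scala
// S210 (illustrative): derive the degree of parallelism M from the Cassandra
// node count, then load the recorded HDFS copy paths in batches of at most M,
// so that no more than M SSTableLoader streams hit the cluster at once.
def parallelism(cassandraNodes: Int, streamsPerNode: Int = 1): Int =
  math.max(1, cassandraNodes * streamsPerNode)

// `loadOneSSTableDir` stands in for invoking SSTableLoader / the sstableloader
// tool on a single SSTable directory; it is a hypothetical helper.
def loadInBatches(sstableDirs: Seq[String], m: Int)(loadOneSSTableDir: String => Unit): Unit =
  sstableDirs.grouped(m).foreach { batch =>
    batch.par.foreach(loadOneSSTableDir) // at most M concurrent loads per batch (Scala 2.12 parallel collections)
  }
```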

Embodiment 3

[0096] Embodiment 3 discloses a device corresponding to the Spark-based Cassandra data import method of the above embodiments, i.e., the virtual device structure of those embodiments. Referring to Figure 5, the device includes:

[0097] The partition calculation module 310 is used to obtain the data volume and the SSTable single file size, and calculate the required number of partitions N according to the data volume and the SSTable single file size;

[0098] The partition allocation module 320 is used to read the data and calculate a token value according to the key of the data, to allocate the data to the N partitions according to the token value, and to sort the data within each partition;

[0099] The file generation module 330 is used to read the sorted data using CQLSSTableWriter to generate an SSTable file;

[0100] The file import module 340 is configured to import the SSTable files into the Cassandra cluster through SSTableLoader.
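Putting the four modules together, a minimal sketch of the corresponding Spark pipeline might look as follows. The token computation and the row type are placeholders; the equal-width split assumes Cassandra's Murmur3Partitioner, whose token space is the signed 64-bit range.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partition allocation module (320): split the signed 64-bit token range into
// N equal-width intervals and route each row to the interval its token falls in.
class TokenRangePartitioner(n: Int) extends Partitioner {
  private val width = (BigInt(Long.MaxValue) - BigInt(Long.MinValue) + 1) / n
  override def numPartitions: Int = n
  override def getPartition(key: Any): Int = {
    val token = key.asInstanceOf[Long]
    (((BigInt(token) - BigInt(Long.MinValue)) / width).toInt) min (n - 1)
  }
}

// `tokenOf` is a hypothetical helper that computes the Cassandra token of a
// row key (e.g. via Murmur3Partitioner); `Row` is a placeholder record type.
case class Row(key: String /* ... other columns ... */)

def pipeline(input: RDD[Row], n: Int, tokenOf: String => Long): Unit = {
  val byToken = input.map(row => (tokenOf(row.key), row))
  // Sort within each partition during the shuffle, as required before writing SSTables.
  val sorted = byToken.repartitionAndSortWithinPartitions(new TokenRangePartitioner(n))
  sorted.foreachPartition { rows =>
    // File generation module (330): feed the sorted rows of this partition to
    // CQLSSTableWriter to produce a local SSTable directory (schema omitted here).
    // File import module (340): the directory is then streamed into the cluster
    // with SSTableLoader / the sstableloader tool.
    rows.foreach(_ => ())
  }
}
```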



Abstract

The invention discloses a Spark-based Cassandra data import method, which relates to the technical field of data processing and is used to solve the problem that the performance of Cassandra is degraded when data is imported into Cassandra through Spark. The method includes the following steps: obtaining the data volume of the data to be imported and the size of a single SSTable file, and calculating the required number of partitions N according to the data volume and the single-file size; calculating a token value according to the key of the data; allocating the data to the N partitions according to the token value, and sorting the data within each partition; reading the sorted data with CQLSSTableWriter to generate SSTable files; and processing the SSTable files in parallel and importing them into the Cassandra cluster through SSTableLoader. The invention also discloses a Spark-based Cassandra data import device, an electronic device and a computer storage medium. The present invention partitions data through Spark, thereby improving the processing performance of Cassandra during data import.

Description

Technical field

[0001] The invention relates to the technical field of data processing, and in particular to a Spark-based Cassandra data import method, device, equipment and medium.

Background technique

[0002] In recent years, products and applications related to the Internet of Things, artificial intelligence, and smart cities have emerged one after another, and they have promoted the vigorous development of big data technology. With the exponential growth of data scale, data processing and storage methods have become a main research direction for the companies involved. At present, Cassandra, as an open source distributed NoSQL database storage system, is used by more and more companies as a data storage system due to its high write and read performance.

[0003] As a distributed data storage system, Cassandra provides relatively complete data reading, writing and management functions. Cassandra stores data in the form of SSTable...

Claims


Application Information

Patent Type & Authority: Patent (China)
IPC(8): G06F16/25; G06F16/27
CPC: G06F16/258; G06F16/27
Inventor: 程万胜
Owner: 同盾(广州)科技有限公司