Data acquisition method and device based on Spark computing framework

A computing-framework and data acquisition technology in the computer field. It addresses the limited functionality and performance of JdbcRDD, which degrade Spark data import performance, and aims to improve data import performance, reduce data transmission overhead, and improve computing performance.

Active Publication Date: 2018-09-14
NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT +1

AI Technical Summary

Problems solved by technology

[0006] The technical problem to be solved by the present invention is to provide a data acquisition method and device based on the Spark computing framework, addressing the prior-art problem that Spark's data import performance suffers because of JdbcRDD's limited functionality and performance.

Examples

Embodiment Construction

[0023] The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0024] This embodiment provides a data acquisition method based on the Spark computing framework and is executed on the Spark side. Figure 1 is a flowchart of a data acquisition method based on the Spark computing framework according to an embodiment of the present invention.

[0025] Step S110: after receiving a table object access request, obtain the computing resource information of Spark and the data distribution information of the data tables to be accessed in the MPP cluster (also called the MPP database cluster).

[0026] The table object access request asks for access to the data tables stored in the MPP cluster. According to the table object access ...
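
As an illustration of Step S110, the following minimal Python sketch models the two pieces of information gathered when a table object access request arrives. All names here (`TableAccessRequest`, `AcquisitionInputs`, `handle_request`, and the dictionary-based catalog) are hypothetical and are not taken from the patent or from Spark; a real implementation would query the Spark scheduler and the MPP cluster's metadata instead of plain dictionaries.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TableAccessRequest:
    """Hypothetical request naming the MPP table to be accessed."""
    table_name: str


@dataclass(frozen=True)
class AcquisitionInputs:
    """The two inputs Step S110 collects (field names are assumptions)."""
    cores_by_host: Dict[str, int]  # computing resource information of Spark
    rows_by_node: Dict[str, int]   # data distribution of the table in the MPP cluster


def handle_request(
    request: TableAccessRequest,
    spark_cores_by_host: Dict[str, int],
    mpp_catalog: Dict[str, Dict[str, int]],
) -> AcquisitionInputs:
    """Look up both kinds of information for the requested table.

    `mpp_catalog` is a stand-in for MPP cluster metadata: it maps a table
    name to the number of rows each storage node holds for that table.
    """
    if request.table_name not in mpp_catalog:
        raise KeyError(f"unknown table: {request.table_name}")
    return AcquisitionInputs(
        cores_by_host=dict(spark_cores_by_host),
        rows_by_node=dict(mpp_catalog[request.table_name]),
    )


if __name__ == "__main__":
    catalog = {"orders": {"host-a": 1000, "host-b": 900}}
    inputs = handle_request(TableAccessRequest("orders"), {"host-a": 4}, catalog)
    print(inputs)
```

Keeping the two inputs in one immutable record makes it explicit that the later partitioning step depends on both of them and on nothing else.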

Abstract

The invention discloses a data acquisition method and device based on the Spark computing framework. The method includes: receiving a table object access request, and acquiring the computing resource information of Spark and the data distribution information of a data table to be accessed in an MPP cluster; generating a plurality of Partitions according to the computing resource information and the data distribution information, wherein each Partition corresponds to part of the data in the data table; and obtaining the data table from the MPP cluster through the generated Partitions. The method makes full use of the data storage characteristics of the MPP cluster and uses the multiple Partitions to acquire the data set quickly and directly from the MPP storage nodes. When computing resources are sufficient, the portion of the data table held by a storage node may be split further to increase parallelism and improve data import performance. Based on the data distribution in the MPP cluster, data can be obtained preferentially from local storage, which reduces data transmission overhead, saves network bandwidth, lowers network delay, and improves computing performance.
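
The partition generation described in the abstract can be sketched in plain Python as follows. This is only one possible reading, under stated assumptions: the splitting rule (total cores divided by the number of storage nodes), the row-offset ranges, and all names (`Partition`, `generate_partitions`, `rows_by_node`, `cores_by_host`) are hypothetical and are not specified by the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Partition:
    index: int
    node_host: str                 # MPP storage node that holds the rows
    first_row: int                 # inclusive offset within that node's data
    last_row: int                  # inclusive offset within that node's data
    preferred_host: Optional[str]  # Spark host co-located with the data, if any


def generate_partitions(
    rows_by_node: Dict[str, int],
    cores_by_host: Dict[str, int],
) -> List[Partition]:
    """Create at least one partition per storage node that holds rows.

    Assumed rule: if Spark has more cores than there are storage nodes,
    each node's data is split further to raise parallelism.
    """
    nodes = [node for node, rows in sorted(rows_by_node.items()) if rows > 0]
    if not nodes:
        return []
    total_cores = sum(cores_by_host.values())
    splits_per_node = max(1, total_cores // len(nodes))

    partitions: List[Partition] = []
    for node in nodes:
        rows = rows_by_node[node]
        splits = min(splits_per_node, rows)  # avoid empty partitions
        # Prefer local reads when Spark has cores on the storage node itself.
        preferred = node if cores_by_host.get(node, 0) > 0 else None
        for i in range(splits):
            start = (i * rows) // splits
            end = ((i + 1) * rows) // splits - 1
            partitions.append(Partition(len(partitions), node, start, end, preferred))
    return partitions


if __name__ == "__main__":
    for p in generate_partitions({"mpp1": 100, "mpp2": 50}, {"mpp1": 2, "mpp2": 2}):
        print(p)
```

With two storage nodes and four cores, each node's data is split in two, and every partition prefers the host that stores its rows. With a single core, the sketch falls back to one partition per storage node.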

Description

Technical field

[0001] The present invention relates to the field of computer technology, in particular to a data acquisition method and device based on the Spark computing framework.

Background

[0002] Apache Spark (Spark for short) is a fast, general-purpose computing engine designed for large-scale data processing. Spark natively provides access interfaces to file systems, including HDFS (Hadoop Distributed File System), and is also commonly used to compute on and analyze structured data stored in databases.

[0003] Specifically, the data of the target database can be exported to a file with a database export tool, and Spark then computes on and analyzes the data through that file. However, this approach is cumbersome, time-consuming, error-prone, and heavily constrained by the environment.

[0004] In addition, the database system can also be accessed directly through the JdbcRDD provided by Spark. JdbcRDD is a general-purpose database access in...
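
For context on JdbcRDD, the sketch below reproduces in plain Python the arithmetic Spark's JdbcRDD uses to divide an inclusive key range into partitions, given a lower bound, an upper bound, and a partition count. The function name `jdbc_rdd_ranges` is hypothetical, and the sketch does not call Spark itself. Note that this split looks only at key values and takes no account of which storage node holds the rows.

```python
from typing import List, Tuple


def jdbc_rdd_ranges(
    lower_bound: int, upper_bound: int, num_partitions: int
) -> List[Tuple[int, int]]:
    """Split the inclusive range [lower_bound, upper_bound] into key ranges.

    Each returned (start, end) pair would be bound to the two `?`
    placeholders of a query such as
    `SELECT ... WHERE id >= ? AND id <= ?`.
    """
    if upper_bound < lower_bound:
        raise ValueError("upper_bound must be >= lower_bound")
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    length = 1 + upper_bound - lower_bound
    ranges = []
    for i in range(num_partitions):
        start = lower_bound + (i * length) // num_partitions
        end = lower_bound + ((i + 1) * length) // num_partitions - 1
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    print(jdbc_rdd_ranges(1, 100, 4))
```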

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 吕雁飞, 刘欣然, 张鸿, 蒋旭, 马秉楠, 惠榛, 朱亚南
Owner: NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT