A background refresh method based on the Spark-SQL big data processing platform

A technology in the field of big data processing and background refresh, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses wasted control node resources, low query and import efficiency, and degraded performance, with the effect of improving system resource utilization and shortening query time.

Active Publication Date: 2018-01-16
深圳市华讯方舟光电技术有限公司 +1

AI Technical Summary

Problems solved by technology

[0019] 2. The existing data import programs based on Hive or Spark-SQL are written in Scala and run on the JVM, which leads to problems such as low efficiency, slow speed, and a tendency to run out of memory.
Scala is a pure object-oriented programming language whose compiler, scalac, compiles source files into Java class files (that is, bytecode that runs on the JVM). The patent therefore characterizes it as an interpreted language and attributes the low query and import efficiency to this.
[0020] 3. In the Standalone mode of the Spark big data processing platform, control node resources are wasted.
While the cluster is running, external data import and real-time data queries usually proceed at the same time, so the resources of the machines in the cluster are allocated to the data import program and the data query program simultaneously. At the application level, the two conflict to a greater or lesser degree, and in severe cases the performance of both drops sharply.




Embodiment Construction

[0042] As shown in Figures 2 and 3, the background refresh method based on the Spark-SQL big data processing platform in this embodiment creates a refresh process in the entry function of Spark-SQL and sets a timed refresh mechanism that periodically scans the directory structure of the specified tablespace files in the distributed file system HDFS. Preferably, the refresh result is stored in memory to serve query requests on the table data.
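
The mechanism in [0042] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it uses a Python daemon thread instead of a process inside the Spark-SQL entry function, and the local file system as a stand-in for HDFS. The names `scan_tree`, `DirectoryCache`, `interval_seconds`, and `demo` are hypothetical.

```python
import os
import tempfile
import threading


def scan_tree(root):
    """Walk root and return {directory path: sorted list of file names}."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        listing[dirpath] = sorted(filenames)
    return listing


class DirectoryCache:
    """In-memory snapshot of a directory tree, refreshed on a timer.

    Assumption: a local directory stands in for an HDFS tablespace.
    """

    def __init__(self, root, interval_seconds):
        self.root = root
        self.interval_seconds = interval_seconds
        self._snapshot = None  # stays None until the first refresh completes
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Start the background refresh loop."""
        self._thread.start()

    def stop(self):
        """Ask the loop to exit and wait for it."""
        self._stop.set()
        self._thread.join()

    def refresh(self):
        """Scan once and swap in the new snapshot."""
        listing = scan_tree(self.root)
        with self._lock:
            self._snapshot = listing

    def snapshot(self):
        """Return the cached listing, or None if no refresh has finished."""
        with self._lock:
            return self._snapshot

    def _run(self):
        # Refresh, then sleep for the interval (or until stop is requested).
        while not self._stop.is_set():
            self.refresh()
            self._stop.wait(self.interval_seconds)


def demo():
    """Build a tiny partitioned layout and read it back through the cache."""
    with tempfile.TemporaryDirectory() as root:
        part = os.path.join(root, "hour=00", "prefix=138")
        os.makedirs(part)
        open(os.path.join(part, "part-0.parquet"), "w").close()
        cache = DirectoryCache(root, interval_seconds=60)
        cache.refresh()  # one synchronous scan keeps the demo deterministic
        return cache.snapshot()[part]
```

In this sketch, a caller would normally use `start()` and read `snapshot()` from query-handling code; a result of `None` means no refresh has completed yet.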

[0043] Configuration items are added to hive-site.xml under the conf folder of the Spark installation directory. They let you configure whether the background refresh process is enabled, the refresh interval, and the set of big data tablespaces to be refreshed.
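
As an illustration of [0043], the sketch below reads three such settings from a hive-site.xml-style document. The `<configuration>/<property>/<name>/<value>` layout is the standard Hadoop configuration format, but the property names (`example.background.refresh.*`), the sample table names, and the function `load_refresh_settings` are assumptions made for this example; the source does not give the actual keys.

```python
import xml.etree.ElementTree as ET

# Hypothetical property names and values, for illustration only.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>example.background.refresh.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>example.background.refresh.interval.seconds</name>
    <value>600</value>
  </property>
  <property>
    <name>example.background.refresh.tables</name>
    <value>db1.calls,db1.messages</value>
  </property>
</configuration>
"""


def load_refresh_settings(xml_text):
    """Extract the three refresh settings, with defaults when a key is absent."""
    root = ET.fromstring(xml_text)
    props = {
        p.findtext("name"): (p.findtext("value") or "")
        for p in root.iter("property")
    }
    enabled_raw = props.get("example.background.refresh.enabled", "false")
    interval_raw = props.get("example.background.refresh.interval.seconds", "0")
    tables_raw = props.get("example.background.refresh.tables", "")
    return {
        "enabled": enabled_raw.strip().lower() == "true",
        "interval_seconds": int(interval_raw or "0"),
        "tables": [t.strip() for t in tables_raw.split(",") if t.strip()],
    }
```

Defaulting to disabled when the keys are missing is a design choice of this sketch, so that an unmodified hive-site.xml leaves behavior unchanged.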

[0044] If the refresh process is enabled, the directory structure information of the specified tablespace is not in memory until the first refresh completes. If Spark-SQL receives a query statement during this window, it uses the original...



Abstract

The present invention discloses a background refresh method based on the Spark-SQL big data processing platform. A new process is created and a timed refresh mechanism is set in the entry function of Spark-SQL, and the directory structure of the specified tablespace files in the Hadoop Distributed File System (HDFS) is scanned periodically. Configuration items are added to hive-site.xml under the conf folder of the Spark installation directory, so that whether the refresh process is enabled, the refresh interval, and the set of big data tablespaces to be refreshed can all be customized. With big data volumes, the invention greatly reduces the first query time of the Spark-SQL big data processing platform. Taking 20 TB of data as an example, a big data table is divided into 25 first-level partitions by hour and 1001 second-level partitions by the first three digits of the mobile phone number, and is stored compressed in Parquet format. For a query that counts all records for a given number segment over a given time period, the original first query time is approximately 20 minutes; with the background refresh method of the invention, the first query time is reduced to approximately 45 seconds.
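
The figures quoted in the abstract can be put side by side with a little arithmetic. The sketch below assumes that every combination of first-level and second-level partition exists as a directory, which the source does not state, so the directory count is an upper bound; the function names are hypothetical.

```python
HOUR_PARTITIONS = 25       # first-level partitions, from the abstract
PREFIX_PARTITIONS = 1001   # second-level partitions, from the abstract


def leaf_directory_count(first_level, second_level):
    """Upper bound on leaf directories if every combination exists."""
    return first_level * second_level


def speedup(before_seconds, after_seconds):
    """Ratio of the original first query time to the improved one."""
    return before_seconds / after_seconds


LEAF_DIRS = leaf_directory_count(HOUR_PARTITIONS, PREFIX_PARTITIONS)  # 25025
FIRST_QUERY_SPEEDUP = speedup(20 * 60, 45)  # roughly 26.7
```

Under that assumption there are up to 25,025 leaf directories to list, and the reported first query times correspond to a speedup of roughly 27 times.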

Description

Technical field

[0001] The invention relates to a background refresh method for a big data processing platform, in particular to a background refresh method based on the Spark-SQL big data processing platform.

Background technique

[0002] With the development of the Internet, the mobile Internet, and the Internet of Things, we have entered an era of big data, and processing and analyzing this data has become an important and urgent need.

[0003] As technology has developed, big data processing platforms have progressed from the initial Hadoop and HBase to the later SQL-based Hive, Shark, and others. Key-value processing platforms such as HBase have also gradually emerged. Today, the rise of the SQL-on-Hadoop concept has driven the growth of the Spark ecosystem, which has gradually become the most popular, most widely used, and most efficient big data processing platform.

[0004] No matter which big data processing platform is adopt...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/00
Inventors: 王成, 冯骏, 范丛明, 赵术开
Owner: 深圳市华讯方舟光电技术有限公司