Spark workflow scheduling method and system with privacy protection

A privacy-protecting scheduling method applied to Spark workflow scheduling. It addresses problems such as increased computing overhead and the inability to guarantee data privacy and security.

Pending Publication Date: 2020-10-30

NANJING COLLEGE OF INFORMATION TECH

AI Technical Summary

Problems solved by technology

Encrypting and decrypting data inevitably increases computing overhead, and if the key is leaked, the privacy and security of the data can no longer be guaranteed.

Examples

Embodiment 1

[0018] A Spark workflow scheduling method with privacy protection, including: judging and marking input data according to privacy rules, marking the input data that conforms to the privacy rules as private data and the rest of the data as common data; marking the data for privacy in units of partitions, where partitions containing private data are marked as private partitions and the rest are common partitions; scheduling the common partitions, and the Spark ready tasks that need common partitions as input, to the nodes of the common data center in the Spark cluster for processing to obtain the first output data; scheduling the private partitions, and the Spark ready tasks that need private partitions as input, to the nodes of the designated private data center in the Spark cluster for processing to obtain the second output data; judging whether the first output data and the second output data are the final result or an intermediate result; if it is the final result, the corresponding wo...
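The first two steps above (marking records against privacy rules, then marking whole partitions) can be sketched in plain Python. This is an illustrative sketch only, not the patent's implementation: the `Partition` class, the `mark_partitions` helper, and the example rule (a record is private if it has an `ssn` field) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]
PrivacyRule = Callable[[Record], bool]  # hypothetical rule type: True means "private"


@dataclass
class Partition:
    index: int
    records: List[Record]
    private: bool = False  # set by mark_partitions


def is_private_record(record: Record, rules: List[PrivacyRule]) -> bool:
    # A record is private data if it conforms to at least one privacy rule.
    return any(rule(record) for rule in rules)


def mark_partitions(partitions: List[Partition], rules: List[PrivacyRule]) -> List[Partition]:
    # A partition is private as soon as it contains one private record;
    # all remaining partitions are common partitions.
    for part in partitions:
        part.private = any(is_private_record(r, rules) for r in part.records)
    return partitions


# Assumed example rule: any record carrying an "ssn" field is private.
rules = [lambda r: "ssn" in r]
parts = mark_partitions(
    [
        Partition(0, [{"name": "a"}, {"name": "b", "ssn": "000-00-0000"}]),
        Partition(1, [{"name": "c"}]),
    ],
    rules,
)
labels = [(p.index, "private" if p.private else "common") for p in parts]
```

Marking at partition granularity means one private record is enough to route the whole partition to the private data center, which keeps the scheduling decision simple at the cost of sending some common records along with it.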

Embodiment 2

[0035] Based on the privacy-protected Spark workflow scheduling method described in Embodiment 1, this embodiment provides a privacy-protected Spark workflow scheduling system, including:

[0036] The first module is used to judge and mark the input data according to the privacy rules, marking the input data that conforms to the privacy rules as private data and the rest of the data as common data;

[0037] The second module is used to mark private data and common data in units of partitions, where the partitions containing private data are marked as private partitions and the rest of the partitions are common partitions;

[0038] The third module is used to schedule the common partitions, and the Spark ready tasks that need common partitions as input, to the nodes of the common data center in the Spark cluster for processing to obtain the first output data; the Spark ready tasks that need private partitions as input are scheduled to the nodes of the designated private data center in the Spark cluster for processing, and the s...
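The routing performed by the third module can be illustrated with a small placement function. This is a sketch under stated assumptions, not the patented scheduler: the node names, the round-robin choice of node within a data center, and the rule that a task with any private input partition goes to the private data center are all assumptions made for the example.

```python
from itertools import cycle
from typing import Dict, List, Set, Tuple

ReadyTask = Tuple[str, List[int]]  # (task id, indexes of its input partitions)


def schedule_ready_tasks(
    ready_tasks: List[ReadyTask],
    private_partitions: Set[int],
    common_nodes: List[str],
    private_nodes: List[str],
) -> Dict[str, str]:
    # Round-robin over the nodes of each data center (an assumed policy).
    common_rr = cycle(common_nodes)
    private_rr = cycle(private_nodes)
    placement: Dict[str, str] = {}
    for task_id, inputs in ready_tasks:
        # Assumption: one private input is enough to pin the task
        # to the designated private data center.
        needs_private = any(i in private_partitions for i in inputs)
        placement[task_id] = next(private_rr if needs_private else common_rr)
    return placement


placement = schedule_ready_tasks(
    ready_tasks=[("t0", [0]), ("t1", [1]), ("t2", [2]), ("t3", [1, 2])],
    private_partitions={0},
    common_nodes=["common-1", "common-2"],   # hypothetical node names
    private_nodes=["private-1"],             # hypothetical node name
)
```

Here only `t0` reads a private partition, so it is the only task placed on `private-1`; the other three tasks are spread over the common nodes.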

Embodiment 3

[0041] Based on the privacy-protected Spark workflow scheduling method described in Embodiment 1, this embodiment provides a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a computer, it implements the method described in Embodiment 1.

[0042] Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Abstract

The invention discloses a Spark workflow scheduling method and system with privacy protection, and belongs to the technical field of Spark big data processing. With this technical scheme, private data can be processed in a designated data center, which meets the privacy protection requirement while improving the processing efficiency of the whole Spark workflow and shortening its execution time. The method comprises: dividing the input data into private data and common data according to a privacy rule; carrying out privacy marking in units of partitions, dividing the data into private partitions and common partitions; scheduling the Spark ready tasks that take a common partition as input to a common data center for processing to obtain first output data; scheduling the Spark ready tasks that take a private partition as input to a private data center for processing to obtain second output data; and judging whether the first output data and the second output data are final results or intermediate results, and performing privacy confirmation, marking and partitioning again on intermediate results until all Spark ready tasks in all Spark ready queues are completely processed.
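The iterative part of the abstract (re-checking outputs and repeating until every ready task is processed) can be sketched as a loop. This is a simplified illustration, not the patent's algorithm: the workflow is reduced to a linear list of per-record stage functions instead of Spark ready queues, the privacy rule is the assumed "has an `ssn` field" check, and re-marking each intermediate result before the next stage is this sketch's reading of the text.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Stage = Callable[[Record], Record]


def is_private(record: Record) -> bool:
    # Hypothetical privacy rule used only for this example.
    return "ssn" in record


def run_workflow(
    partitions: List[List[Record]], stages: List[Stage]
) -> Tuple[List[List[Record]], List[Tuple[int, int, str]]]:
    trace: List[Tuple[int, int, str]] = []  # (stage number, partition index, data center)
    for stage_no, stage in enumerate(stages):
        outputs: List[List[Record]] = []
        for idx, part in enumerate(partitions):
            # Privacy is re-evaluated on every pass, so intermediate
            # results are marked again before the next stage runs.
            center = "private" if any(is_private(r) for r in part) else "common"
            trace.append((stage_no, idx, center))
            outputs.append([stage(r) for r in part])
        # Outputs of the last stage are final; earlier ones are intermediate.
        partitions = outputs
    return partitions, trace


def drop_ssn(r: Record) -> Record:
    return {k: v for k, v in r.items() if k != "ssn"}


def upper_name(r: Record) -> Record:
    return {**r, "name": r["name"].upper()}


result, trace = run_workflow(
    [[{"name": "ann", "ssn": "000-00-0000"}], [{"name": "bob"}]],
    [drop_ssn, upper_name],
)
```

In this run the first stage removes the private field inside the private data center, so the second stage can process both partitions in the common data center. That matches the scenario the document targets, where only part of the input is private and the results can be shared.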

Description

Technical field

[0001] The invention belongs to the technical field of Spark big data processing, and in particular relates to a Spark workflow scheduling method and system with privacy protection.

Background technique

[0002] Spark is a relatively new distributed computing framework based on parallel computing technology. The core of Spark uses a data structure called the RDD (Resilient Distributed Dataset) to provide a unified view of distributed data. However, data represented in RDDs may leak private data processed by the application, and Spark's two built-in scheduling modes, FIFO (the default) and FAIR, cannot effectively protect private data. As a result, the Spark framework cannot flexibly handle scenarios in which only a small part of the input data requires privacy protection while the processing results can be shared with the outside world. A Spark application usually contains a set of jobs with a partial order relationship, and a job can b...

Application Information

IPC(8): G06F21/62, G06F21/53, G06F9/50

CPC: G06F21/6245, G06F21/53, G06F9/5061

Inventor: 顾海花, 张霞, 孙仁鹏, 傅婧

Owner: NANJING COLLEGE OF INFORMATION TECH
