A big data time series analysis method based on the fusion of PySpark and Pandas

A time series analysis method in the field of big data analysis. It addresses the inability of existing tools to process large-scale time series data, and improves both the efficiency and speed of analysis and the runtime performance.

Active Publication Date: 2022-07-19
NANJING INST OF RAILWAY TECH

AI Technical Summary

Problems solved by technology

Spark, the current mainstream platform for big data processing, provides only coarse-grained and limited methods for time series analysis. Pandas, the current mainstream time series analysis library, provides a rich toolkit of time series algorithms, but it runs only on a single machine and therefore cannot handle large-scale time series data.

Examples

Embodiment

[0083] In this embodiment, time series analysis is performed on the alarm information of a security cloud platform. The platform aggregates security alarm event data from multiple companies. The data are collected in real time from each company's local security products, sent to the cloud, and aggregated and stored there. Because the data cover all security alarm events of every company, the volume is large.

[0084] Security analysis requires global, cross-company data analysis, in particular time series analysis of security events, such as the temporal pattern of a given attacker IP's attacks on different companies, or the industry-wide trend of a certain type of attack. Such analyses are difficult to complete on the Spark platform alone, whereas the method proposed in this application completes them efficiently.
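As a rough illustration of the "temporal pattern of an attacker IP" analysis mentioned above, the sketch below uses Pandas on data that has already been reduced to a manageable size. The column names `ts`, `src_ip` and `company`, and both helper functions, are hypothetical and not taken from the patent.

```python
import pandas as pd


def hourly_attack_profile(alarms):
    """Count alarms per attacker IP and hour of day, across all companies.

    `alarms` is assumed to have the hypothetical columns `ts` and `src_ip`.
    """
    hours = pd.to_datetime(alarms["ts"]).dt.hour
    return (
        alarms.assign(hour=hours)
        .groupby(["src_ip", "hour"])
        .size()                  # number of alarms in each (IP, hour) cell
        .unstack(fill_value=0)   # rows: IPs, columns: hours of the day
    )


def companies_hit(alarms):
    """Number of distinct companies each attacker IP has targeted."""
    return alarms.groupby("src_ip")["company"].nunique()
```

Rows of the resulting profile can be compared to see whether an IP concentrates its activity in particular hours, and `companies_hit` gives a simple cross-company view of the same data.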

[0085] Table ...

Abstract

The invention discloses a big data time series analysis method based on the fusion of PySpark and Pandas. The method uses Spark to perform transformation operations such as filtering and aggregation on large-scale data, and converts the large-scale time series data into a small-scale, uniform, equidistant time series by downsampling. It then uses PySpark's toPandas method to convert the data into a Pandas DataFrame, and finally applies the time series analysis algorithms provided by the Pandas library. The method achieves better operating efficiency and performance in practical applications.
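The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the column names `event_time` and `severity`, the one-hour bucket, and the rolling mean used as a stand-in for "time series analysis" are all assumptions.

```python
import pandas as pd


def downsample_with_spark(events, bucket="1 hour"):
    """Spark stage: filter, then aggregate into fixed-width time buckets.

    `events` is assumed to be a PySpark DataFrame with the hypothetical
    columns `event_time` (timestamp) and `severity` (string).
    """
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    return (
        events
        .filter(F.col("severity") == "high")        # transformation: filter
        .groupBy(F.window("event_time", bucket))    # equal-width time buckets
        .agg(F.count("*").alias("alarm_count"))     # transformation: aggregate
        .select(F.col("window.start").alias("ts"), "alarm_count")
    )


def analyze_with_pandas(pdf, freq="60min", window=3):
    """Pandas stage: make the series strictly equidistant, then analyze it.

    Spark only emits buckets that contain rows, so empty buckets are
    filled with 0 here before the rolling mean is computed.
    """
    series = (
        pdf.assign(ts=pd.to_datetime(pdf["ts"]))
        .set_index("ts")["alarm_count"]
        .sort_index()
        .asfreq(freq, fill_value=0)
    )
    return series.rolling(window).mean()
```

With a live Spark session the two stages would be joined by `pdf = downsample_with_spark(events).toPandas()` followed by `analyze_with_pandas(pdf)`. Because `toPandas` collects the result onto the driver, the downsampled series must be small enough to fit in the driver's memory, which is why the aggregation happens on the Spark side first.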

Description

Technical field

[0001] The invention belongs to the technical field of big data analysis, and in particular relates to a big data time series analysis method based on the fusion of PySpark and Pandas.

Background

[0002] Python and R are the mainstream programming languages for data analysis, and Python has become especially popular for this purpose thanks to its Pandas library. Pandas is suited to small-scale data analysis on a single machine, but cannot meet large-scale data processing and computing requirements. Spark is a mainstream computing platform for big data processing and iterative computing. It supports Python: PySpark is the Python interface to the Spark API.

[0003] Native Spark provides few functions and algorithms for time series analysis. Although the third-party library spark-timeseries provides a Spark-based time series analysis ...

Application Information

Patent Type & Authority Patents(China)
IPC IPC(8): G06F16/2458, G06F16/27
CPC G06F16/2462, G06F16/2474, G06F16/27
Inventor 黄必栋
Owner NANJING INST OF RAILWAY TECH