A data optimization method based on cluster pod scheduling combined with a data lake

By deploying data pools and storage containers in a central server and distributed network, and combining them with quadratic interpolation technology, the problems of low data analysis efficiency in data lakes and large data migration volume after Kubernetes cluster Pod scheduling are solved, achieving efficient data storage and analysis.

CN115509693B · Active · Publication Date: 2026-06-23 · GUANGXI PUBLIC INFORMATION IND CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGXI PUBLIC INFORMATION IND CO LTD
Filing Date
2022-11-02
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies are inefficient for data analysis in data lakes, have high data comparison costs, and are inefficient for data analysis after Kubernetes cluster Pod scheduling, resulting in a large workload for data migration.

Method used

By deploying a central initial data pool on a central server and combining it with initial data storage tanks deployed on local core nodes in a distributed network, data is classified, stored, and analyzed. Quadratic interpolation technology is used to optimize special data, enabling cross-data pool data analysis and computation.

Benefits of technology

Improves data analysis efficiency, reduces data migration workload, optimizes data storage and analysis processes, and reduces resource consumption.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN115509693B_ABST

Abstract

The application discloses a data optimization method based on cluster Pod scheduling combined with a data lake, comprising the following steps: S11, building a distributed data pool and a distributed cluster and performing data arrangement; S12, binding typed data pool data to Pods and performing data analysis and processing during Pod scheduling. The method mainly comprises: deploying a central initial data pool on a central server; deploying an initial data storage container corresponding to the central initial data pool on each local core Node of a distributed network to collect local Pod data; and handling two scenarios with different methods, namely joint analysis of a scheduled Pod's data from before and after scheduling, and analysis of data migrated from outside the cluster together with scheduled Pods. The method thereby solves the problems of low data analysis efficiency after Kubernetes (k8s) cluster Pod scheduling and large data migration workload when non-cluster business is transferred to Kubernetes.

Description

Technical Field

[0001] This invention belongs to the field of big data and AI technology, specifically involving a data optimization method based on cluster Pod scheduling combined with a data lake.

Background Technology

[0002] With the development of science and technology and the internet, the era of big data has arrived. Every day, various industries generate massive amounts of data fragments, and data measurement units have evolved from Byte, KB, MB, GB, TB to PB, EB, ZB, YB, and even BB, NB, DB. In the big data era, data collection is no longer the problem; the current technical challenge is how to find the inherent patterns within such a vast amount of data. Data lake architecture is information storage for multiple data sources, including the Internet of Things (IoT). Big data analysis or archiving can be achieved by accessing the data lake to process or deliver subsets of data to requesting users. However, a data lake architecture is not simply a giant disk. Data persistence and security are priorities. Many options can deliver a reasonable cost, but not all can meet the long-term storage needs of a data lake. The challenge lies in the fact that much data in a data lake will never be deleted. The value of this data lies in its analysis and comparison with data from year to year, which will offset its capacity cost. Therefore, we need to optimize the data.

[0003] There are already reports on existing technologies for data analysis, processing, and optimization.

[0004] For example, Chinese invention patent CN202010809326.4 discloses a method and apparatus for integrating heterogeneous data sources based on a data lake. The method includes the following steps: a) Based on information from a user's call to the write data interface, determine the operation identifier, data, and timestamp of the current write request. The operation identifier includes three types: append, update, and delete. The timestamp is the time the write request was completed. This information is appended to a specific file in the data lake. b) The data written to the specific file in the previous step is combined with the operation identifier and timestamp to perform data merging processing, resulting in the final data. This invention solves the problems of existing data lake integration technologies, such as the inability to support data update operations, the inability to maintain consistency between the data lake data and the original data, and the inability to effectively address the inefficiency in query performance caused by a large number of small files in a big data cluster.

[0005] For example, Chinese invention patent CN202210189508.5 discloses a data lake file system based on object storage, including a local file storage component, a file management component, and a local metadata storage component. The file management component includes an operation transaction management component and a file version management component. The local file storage component is controlled by the file management component and is responsible for saving business data storage object files locally and calling the local metadata storage component to save the metadata corresponding to the business data target object. The operation transaction management component controls the entire lifecycle of transactions in the local file storage component and links with the file version management component during transaction commit and rollback operations. This invention allows component users to achieve caching effects without needing to understand the underlying file system principles. Users don't need to worry about data governance details; they can focus only on the upper-level user interface to improve data governance effectiveness and accuracy, reducing the difficulty of data application and increasing its flexibility.

[0006] However, existing technologies involve large amounts of data analysis, are inefficient, and have high data comparison costs.

Summary of the Invention

[0007] This invention addresses the shortcomings of existing technologies by providing a data optimization method based on cluster Pod scheduling combined with a data lake. The invention primarily involves deploying a central initial data pool on a central server and deploying corresponding initial data storage containers on local core nodes in a distributed network to collect local Pod data. It analyzes two scenarios: joint analysis of scheduled Pods with pre-scheduling data, and analysis of data migrated from outside the cluster with scheduled Pods, employing different processing methods. This solves the problems of low data analysis efficiency after Pod scheduling in existing Kubernetes clusters and the large workload of data migration when non-cluster services transition to Kubernetes.

[0008] To achieve the above objectives, the present invention adopts the following technical solution:

[0009] A data optimization method based on cluster Pod scheduling combined with a data lake includes the following steps:

[0010] S11. Build a distributed data pool and distributed cluster and perform data processing;

[0011] S12. Bind the typed data pool data to Pods, and perform data analysis, processing, and optimization during Pod scheduling.

[0012] A further description of the present invention indicates that step S11 includes the following steps:

[0013] S111. Deploy a central initial data pool and a Kubernetes-based central cluster on the central server. The data generated by the central cluster is stored in the central initial data pool. At the same time, create multiple types of data pools to classify and store the stored data, and create corresponding multiple types of data storage tanks on each local node to store the data generated by the local pods.

[0014] S112. Deploy the initial data storage container corresponding to the central initial data pool on the core nodes in the distributed network to collect the local Pod data and perform preliminary sorting. Put the low-value data into the miscellaneous data pool allocated by the central initial data pool, and put the remaining data into the data storage containers of the different data pools according to data type.

[0015] S113. The collected physical device and network data, application computing data, and log text data are placed into the central initial data pool. At the same time, the metadata corresponding to the collected data is captured. The metadata associated with the collected data, the meta-process data, and the three-way relationship among the collected data, the metadata, and the meta-process data associated with the Pod are mapped into metadata identifiers and passed to the corresponding type of data pool for processing.

[0016] The initial data pool serves as a storage unit for data, which is organized according to its characteristics in preparation for entering the different types of data pools in the next step. Kubernetes is an open-source system used to manage containerized applications on multiple hosts in a cloud platform. Low-value data is data with little fluctuation and a large amount of repetition, which offers little from a value-analysis perspective, such as monitoring data collected under normal conditions. The three types of data pools obtain organized data from the central initial data pool and classify and store it. The three types of data storage containers store data generated by local Pods.

[0017] Further explanation of the present invention: the data analysis and processing in step S12 of the pod scheduling process includes two pod data analysis scenarios, specifically:

[0018] S121. When a Pod is scheduled to a new Node and the Pod generates new business data, it is necessary to analyze the data before and after the Pod is scheduled.

[0019] S122. Some services have not been integrated into the cluster, and their external service data needs to be integrated with the scheduled Pods.

[0020] Further explanation of the present invention: the processing method for Pod data analysis scenario S121 specifically includes the following steps:

[0021] S1211. Analyze the received metadata identifier format;

[0022] S1212. Obtain the metadata associated with each piece of data and make a unified declaration;

[0023] S1213. Perform cross-data pool analysis and calculation on the data before and after Pod scheduling.

[0024] The Pod data analysis and computation are equivalent to performing data analysis and computation between data storage tanks in multiple different data pools. After the metadata associated with each piece of data is uniformly declared, analysis and computation can be performed across data pools without the need for Pod data in the data pool to be transferred with Pod scheduling. Furthermore, when Pod queries data before and after scheduling and performs analysis and computation together, it avoids the tedious computation and low efficiency problem caused by database data migration, which requires data to be stored in different database tables and then aggregated and analyzed.

[0025] Further explanation of the present invention: the processing method for Pod data analysis scenario S122 specifically includes the following steps:

[0026] S1221. Deploy a new special data storage container, corresponding to the central initial data pool, on the local core node;

[0027] S1222. Place the external business data to be integrated into a special data storage container to obtain special data;

[0028] S1223. The special data, the metadata and meta-process data migrated together with it, and their three-way relationship with the associated Pod are mapped into metadata identifiers.

[0029] The special data refers specifically to all non-local data, data that requires cross-regional data correlation and computation, or business data outside the cluster.

[0030] Further explanation of the present invention: Step S12, data optimization, specifically involves: when the special data storage container needs to perform association calculations with the data stored in the three types of data pools deployed at each network node, a quadratic interpolation technique is used to optimize the special data; the quadratic interpolation technique specifically involves: performing interpolation processing on data with uneven sampling from different nodes, and then using a quadratic interpolation method, interpolating at every 3 adjacent points to obtain the quadratic interpolation value; the quadratic interpolation formula is:

[0031]

[0032] In the formula: x is the current value of the classified object, y is the three adjacent points of the classified object, and i is the sequence number.
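The formula itself is not reproduced in this text. As a point of reference only, the standard three-point (Lagrange) form of quadratic interpolation, written with the symbols defined above, is given below. Treating it as the formula intended here is an assumption, as is the notation x_{i-1}, x_i, x_{i+1} for the sampling positions of the three adjacent points and y_{i-1}, y_i, y_{i+1} for their values.

```latex
y(x) = y_{i-1}\,\frac{(x - x_i)(x - x_{i+1})}{(x_{i-1} - x_i)(x_{i-1} - x_{i+1})}
     + y_i\,\frac{(x - x_{i-1})(x - x_{i+1})}{(x_i - x_{i-1})(x_i - x_{i+1})}
     + y_{i+1}\,\frac{(x - x_{i-1})(x - x_i)}{(x_{i+1} - x_{i-1})(x_{i+1} - x_i)}
```

Each term vanishes at the other two sampling positions, so the curve passes exactly through all three adjacent points.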

[0033] The aforementioned quadratic interpolation technique optimizes special data to make the data intervals more uniform, which is more compatible with Transformer time-series processing and can also more realistically restore missing data in special data scenarios. The aforementioned quadratic interpolation technique mainly takes one data point every few data points, which is an optimization for large data with low value or small numerical fluctuations, and can reduce the number of model calculations and resource consumption.

[0034] A further explanation of the present invention is that the special data optimized by the quadratic interpolation technique and the metadata IDs of the data associated with the local network node can be placed in one data set for computation.

[0035] Further explanation of the present invention: the various types of data pools include analog signal data pools, application data pools, and text data pools; the various types of data storage containers include analog signal data storage containers, application data storage containers, and text data storage containers; the three types of data storage containers correspond to and belong to the three types of data pools; the data pool is composed of multiple data storage containers, and each data storage container corresponds to a Node of a cluster.

[0036] As further explained in this invention, the Kubernetes-based cluster includes a Master, Nodes, and Pods.

[0037] A further explanation of the present invention is that the metadata corresponding to the collected data includes descriptions of data records, indexes, key values, and relationships between different data attributes; the meta-process data includes the recorded date, location, responsible person, and other ancillary information; the metadata identifier format is numeric###metadataID###meta-process dataID.

[0038] The present invention has the following beneficial effects:

[0039] 1. This invention establishes a distributed data pool and a distributed cluster, enabling the initial data to be classified and stored when it enters various types of data pools, facilitating analysis.

[0040] 2. This invention analyzes two scenarios: joint analysis of scheduled Pods and pre-scheduling data, and analysis of data migrated from outside the cluster and scheduled Pods. Different methods are used to process these scenarios, thereby solving the problem of low efficiency in data analysis after Pod scheduling in existing Kubernetes clusters. At the same time, it solves the problem of huge workload in the transition of non-cluster businesses to Kubernetes, especially in data migration.

Attached Figure Description

[0041] Figure 1 is a flowchart of a data optimization method based on cluster Pod scheduling combined with a data lake.

[0042] Figure 2 is a model diagram of a data optimization method based on cluster Pod scheduling combined with a data lake.

Detailed Implementation

[0043] The invention will now be further described with reference to the accompanying drawings.

[0044] A data optimization method based on cluster Pod scheduling combined with a data lake, whose process is shown in Figure 1 and whose model is shown in Figure 2, includes the following steps:

[0045] S11. Build a distributed data pool and a distributed cluster and perform data processing.

[0046] S111. Deploy a central initial data pool and a Kubernetes-based central cluster on the central server. A Kubernetes-based cluster mainly includes three objects: Master, Node, and Pod. Data generated by the central cluster is stored in the central initial data pool. Simultaneously, create multiple types of data pools, including an analog signal data pool, an application data pool, and a text data pool, to obtain processed data from the central initial data pool and classify and store the stored data. Create corresponding three types of data storage containers on each local Node to store data generated by local pods, including analog signal data containers, application data containers, and text data containers. These three types of data storage containers correspond to and belong to the three types of data pools. Each data pool consists of multiple data storage containers, and each data storage container corresponds to a Node in the cluster.
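The pool and container layout described in S111 can be sketched as a small data model. This is a minimal illustration only: the class names PoolType, StorageContainer, and DataPool, the function build_pools, and the Node names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum


class PoolType(Enum):
    ANALOG_SIGNAL = "analog_signal"
    APPLICATION = "application"
    TEXT = "text"


@dataclass
class StorageContainer:
    """One container per (pool type, Node); holds data produced by local Pods."""
    pool_type: PoolType
    node: str
    records: list = field(default_factory=list)


@dataclass
class DataPool:
    """A typed pool made up of one storage container per Node."""
    pool_type: PoolType
    containers: dict = field(default_factory=dict)  # Node name -> StorageContainer

    def container_for(self, node: str) -> StorageContainer:
        # Create the container the first time a Node is seen.
        if node not in self.containers:
            self.containers[node] = StorageContainer(self.pool_type, node)
        return self.containers[node]


def build_pools(nodes):
    """Create the three typed pools, each with one container per Node."""
    pools = {pool_type: DataPool(pool_type) for pool_type in PoolType}
    for pool in pools.values():
        for node in nodes:
            pool.container_for(node)
    return pools


pools = build_pools(["node-a", "node-b"])  # hypothetical Node names
```

Keying containers by Node name mirrors the statement that each data storage container corresponds to one Node of the cluster.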

[0047] S112. Deploy the initial data storage container corresponding to the central initial data pool on the core nodes of each local area of the distributed network to collect local Pod data and perform preliminary sorting. Put the low-value data into the miscellaneous data pool allocated by the central initial data pool, and put the remaining data into the data storage containers of the different data pools according to data type.
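The preliminary sorting in S112 might look like the sketch below. The patent describes low-value data only qualitatively (little fluctuation, heavy repetition), so the distinct-value ratio test, its 0.2 threshold, the batch layout, and the names is_low_value and route are all assumptions made for illustration.

```python
def is_low_value(values, min_distinct_ratio=0.2):
    """Hypothetical heuristic: a batch counts as low-value when most of its
    readings repeat, i.e. it has few distinct values relative to its length."""
    if not values:
        return True
    return len(set(values)) / len(values) < min_distinct_ratio


def route(batch):
    """Return the name of the pool a batch should be sent to."""
    if is_low_value(batch["values"]):
        return "miscellaneous"
    return batch["type"]  # "analog_signal", "application", or "text"


# A flat monitoring signal versus a varied application trace (made-up data).
heartbeat = {"type": "analog_signal", "values": [1.0] * 20}
trace = {"type": "application", "values": [3, 9, 4, 7, 1, 8]}
```

With these inputs, route(heartbeat) selects the miscellaneous pool and route(trace) selects the application pool.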

[0048] S113. The collected physical device and network data, application computation data, and log text data are placed into the central initial data pool. Simultaneously, the metadata corresponding to the collected data is captured, including descriptions of data records, indexes, key values, and relationships between different data attributes. The purpose of setting up the initial data pool is to act as a data storage unit and prepare for the next step of data entering different types of data pools based on data characteristics. The metadata associated with the collected data, the meta-process data, and the three-way relationship among the collected data, the metadata, and the meta-process data associated with the Pod are mapped into metadata identifiers and passed to the corresponding type of data pool for processing. The meta-process data includes recorded dates, locations, responsible persons, and other ancillary information; it usually contains richer information than the collected data and has more analytical value. The metadata identifier format is: numeric###metadataID###meta-process dataID.
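A minimal sketch of reading and writing the three-field identifier format follows. Only the "###" separator and the order of the three fields come from the text; the field names, the choice to keep the leading numeric field as text, and the sample identifier are assumptions.

```python
from typing import NamedTuple

SEPARATOR = "###"


class MetadataIdentifier(NamedTuple):
    value: str            # the leading "numeric" field, kept as text here
    metadata_id: str
    meta_process_id: str

    def encode(self) -> str:
        """Join the three fields back into the identifier string."""
        return SEPARATOR.join(self)


def parse_identifier(text: str) -> MetadataIdentifier:
    """Split an identifier string into its three fields, rejecting other shapes."""
    parts = text.split(SEPARATOR)
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 fields separated by {SEPARATOR!r}, got {len(parts)}"
        )
    return MetadataIdentifier(*parts)


ident = parse_identifier("42###meta-0017###proc-0003")  # made-up sample
```

Parsing and then calling encode() returns the original string, which makes the format easy to validate at pool boundaries.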

[0049] S12. Bind the typed data pool data to Pods, and perform data analysis, processing, and optimization during Pod scheduling.

[0050] The data analysis and processing in step S12 of the pod scheduling process includes two scenarios for pod data analysis, specifically:

[0051] S121. When a Pod is scheduled to a new Node and generates new business data, it is necessary to analyze the data before and after the Pod scheduling. The specific processing method includes the following steps:

[0052] S1211. Analyze the received metadata identifier format;

[0053] S1212. Obtain the metadata associated with each piece of data and make a unified declaration;

[0054] S1213. Perform cross-data pool analysis and calculation on the data before and after Pod scheduling.

[0055] Data analysis and computation between Pods in a cluster is equivalent to performing data analysis and computation across multiple data storage containers in different data pools. First, the received metadata identifier format is analyzed and the metadata associated with each piece of data is uniformly declared; analysis and computation can then be performed across data pools. This eliminates the need for Pod data in the data pool to move when the Pod is rescheduled. Furthermore, analyzing data from before and after Pod scheduling together avoids the cumbersome, inefficient computation associated with database data migration, where data stored in different database tables must be aggregated before further analysis.
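The idea of analyzing a rescheduled Pod's data in place, without moving it between Nodes, can be sketched as follows. The record layout, the Pod and Node names, the latency_ms field, and the index structure are all hypothetical; the index only loosely stands in for the unified declaration of metadata.

```python
from collections import defaultdict

# Hypothetical containers keyed by (pool type, Node); records stay where they were written.
containers = {
    ("application", "node-a"): [
        {"pod": "orders-7f", "metadata_id": "m1", "latency_ms": 120},
        {"pod": "orders-7f", "metadata_id": "m2", "latency_ms": 140},
    ],
    ("application", "node-b"): [
        {"pod": "orders-7f", "metadata_id": "m3", "latency_ms": 90},
    ],
    ("text", "node-b"): [
        {"pod": "billing-2c", "metadata_id": "m4", "latency_ms": 300},
    ],
}


def index_by_pod(containers):
    """Map each Pod to the (container, position) locations of its records."""
    index = defaultdict(list)
    for location, records in containers.items():
        for position, record in enumerate(records):
            index[record["pod"]].append((location, position))
    return index


def mean_latency(pod, containers, index):
    """Read a Pod's records in place, across Nodes, and average one field."""
    values = [containers[loc][pos]["latency_ms"] for loc, pos in index[pod]]
    return sum(values) / len(values)


index = index_by_pod(containers)
# "orders-7f" ran on node-a before scheduling and on node-b afterwards.
overall = mean_latency("orders-7f", containers, index)
```

The records written on node-a before scheduling are never copied to node-b; only the index spans both containers.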

[0056] The existing cluster technology approach is to first obtain the data before Pod scheduling through the image server, and then perform joint analysis with the data after scheduling to obtain the full-process analysis data of Pod. Frequent Pod switching will cause excessive resource consumption of the image server and low analysis efficiency.

[0057] S122. When some services are not integrated into the cluster, and external service data needs to be integrated with the scheduled Pods, the specific handling method includes the following steps:

[0058] S1221. Deploy a new special data storage container, corresponding to the central initial data pool, on the local core node;

[0059] S1222. Place the external business data to be integrated into a special data storage container to obtain special data;

[0060] S1223. The special data, the metadata and meta-process data migrated together with it, and their three-way relationship with the associated Pod are mapped into metadata identifiers.
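Steps S1221 to S1223 can be illustrated with a small sketch that stores an external record in a special container and tags it with an identifier in the three-field format. The ID scheme ("ext-" and "proc-" prefixes with a counter), the record layout, and the function name ingest_external are assumptions, not part of the patent.

```python
import itertools

_counter = itertools.count(1)
special_container = []  # hypothetical special data storage container on a core node


def ingest_external(record, pod, meta_process):
    """Store an external record and tag it with a metadata identifier string."""
    n = next(_counter)
    metadata_id = f"ext-{n:04d}"       # made-up ID scheme
    meta_process_id = f"proc-{n:04d}"  # made-up ID scheme
    entry = {
        "data": record,
        "pod": pod,
        "meta_process": meta_process,
        "identifier": "###".join(
            [str(record["value"]), metadata_id, meta_process_id]
        ),
    }
    special_container.append(entry)
    return entry


entry = ingest_external(
    {"value": 17.5},
    pod="orders-7f",
    meta_process={"location": "site-1", "responsible": "ops"},
)
```

Keeping the Pod name and meta-process data alongside the record is one simple way to preserve the relationship that the identifier is meant to encode.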

[0061] The data optimization in step S12 specifically involves the following: When a special data storage container corresponding to the initial data pool needs to perform correlation operations with data stored in the various types of data pools deployed at each network node, a quadratic interpolation technique is used to optimize the special data, in order to achieve optimal data analysis efficiency and avoid excessive resource consumption by large amounts of repetitive, low-value data. The quadratic interpolation technique specifically involves: first, to suit model processing, interpolating data sampled unevenly by different nodes; then, using a quadratic interpolation method, interpolating over every three adjacent points to obtain the quadratic interpolation value, which is the data optimized by the artificial intelligence algorithm. The quadratic interpolation formula is:

[0062]

[0063] In the formula: x is the current value of the classified object, y is the three adjacent points of the classified object, and i is the sequence number.
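As a hedged illustration of interpolating over three adjacent points, the sketch below resamples unevenly spaced samples onto a uniform grid using the standard three-point Lagrange form. Whether this matches the patent's own formula is an assumption, since the formula is not reproduced in this text; the function names and the sample data are hypothetical.

```python
from bisect import bisect_left


def lagrange3(xs, ys, x):
    """Evaluate at x the quadratic through the three points (xs[k], ys[k])."""
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))


def resample_uniform(xs, ys, step):
    """Resample strictly increasing, unevenly spaced samples onto a uniform grid."""
    if len(xs) < 3:
        raise ValueError("need at least three samples")
    out = []
    n_steps = int((xs[-1] - xs[0]) / step)
    for k in range(n_steps + 1):
        x = xs[0] + k * step
        # Pick a window of three adjacent samples around x, clamped at the ends.
        i = min(max(bisect_left(xs, x), 1), len(xs) - 2)
        out.append((x, lagrange3(xs[i - 1:i + 2], ys[i - 1:i + 2], x)))
    return out


# Uneven samples of y = x**2; a quadratic is recovered up to rounding error.
xs = [0.0, 0.5, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]
uniform = resample_uniform(xs, ys, step=1.0)
```

Choosing a larger step yields fewer output points, which corresponds to the thinning of low-fluctuation data described earlier in the summary.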

[0064] Optimized special data using quadratic interpolation techniques can be processed in a dataset with metadata IDs associated with local network nodes. This enables cross-data pool data interaction and solves some issues associated with transitioning non-clustered services to Kubernetes. In particular, current data migration practices require the application to run smoothly on Kubernetes for a period of time before large-scale migration, making it difficult to estimate the workload, as it largely depends on the software (e.g., whether it has been containerized, which programming language is used, etc.).

[0065] The above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the present invention. The scope of protection of the present invention is defined by the claims. Those skilled in the art can make various modifications or equivalent substitutions to the present invention within its spirit and scope of protection, and such modifications or equivalent substitutions should also be considered to fall within the scope of protection of the present invention.

Claims

1. A data optimization method based on cluster Pod scheduling combined with a data lake, characterized in that... Includes the following steps: S11. Build a distributed data pool and distributed cluster, and perform data processing; step S11 includes the following steps: S111. Deploy a central initial data pool and a Kubernetes-based central cluster on the central server. The data generated by the central cluster is stored in the central initial data pool. At the same time, create multiple types of data pools to classify and store the stored data, and create corresponding multiple types of data storage tanks on each local node to store the data generated by the local pods. S112. Deploy the initial data storage tank corresponding to the central initial data pool on the core nodes in the distributed network to collect the local Pod data and perform preliminary sorting. Put the small amount of data into the miscellaneous data pool allocated by the central initial data pool, and put the remaining data into the data storage tank of different data pools according to the data type. S113. The collected physical device and network data, application computing data, and log text data are put into the central initial data pool. At the same time, the metadata corresponding to the collected data is captured. The metadata associated with the collected data, the meta-process data, and the three-way relationship between the collected data and the metadata and meta-process data associated with the Pod are mapped into metadata identifiers and passed to the corresponding type of data pool for processing. S12, Perform data analysis, processing, and optimization during the binding of type data pool data with Pods and pod scheduling; the data analysis and processing during pod scheduling in step S12 includes two pod data analysis scenarios, specifically: S121. When a Pod is scheduled to a new Node and the Pod generates new business data, it is necessary to analyze the data before and after the Pod is scheduled. S122. When some services are not integrated into the cluster, and external service data needs to be integrated with the scheduled Pods.

2. The data optimization method based on cluster Pod scheduling combined with data lake as described in claim 1, characterized in that: The processing method for the pod data analysis S121 specifically includes the following steps: S1211. Analyze the received metadata identifier format; S1212. Obtain the metadata associated with each piece of data and make a unified declaration; S1213. Perform cross-data pool analysis and calculation on the data before and after Pod scheduling.

3. The data optimization method based on cluster Pod scheduling combined with data lake as described in claim 1, characterized in that: The processing method for the pod data analysis S122 specifically includes the following steps: S1221. Special data storage tanks corresponding to the central initial data pool for newly deployed local core nodes; S1222. Place the external business data to be integrated into a special data storage container to obtain special data; S1223. Special data, along with the metadata, metaprocess data, and third-party relationship mappings associated with the Pod that were migrated together, are identified as metadata identifiers.

4. The data optimization method based on cluster Pod scheduling combined with data lake as described in claim 3, characterized in that: The data optimization in step S12 specifically involves: when the special data storage tank needs to perform association operations with the data stored in the three types of data pools deployed at each network node, a quadratic interpolation technique is used to optimize the special data; the quadratic interpolation technique specifically involves: performing interpolation processing on the data with uneven sampling from different nodes, and then using a quadratic interpolation method, interpolating at every 3 adjacent points to obtain the quadratic interpolation value; the quadratic interpolation formula is: ; In the formula: x is the current value of the classified object, y is the three adjacent points of the classified object, and i is the sequence number.

5. The data optimization method based on cluster Pod scheduling combined with data lake as described in claim 4, characterized in that: The metadata IDs of the special data optimized by the quadratic interpolation technique and the local network node associated data are placed in a data set for computation.

6. The data optimization method based on cluster Pod scheduling combined with data lake as described in claim 1, characterized in that: The various types of data pools include analog signal data pools, application data pools, and text data pools; the various types of data storage containers include analog signal data storage containers, application data storage containers, and text data storage containers; the three types of data storage containers correspond to and belong to the three types of data pools; the data pool consists of multiple data storage containers, and each data storage container corresponds to a Node in the cluster.

7. The data optimization method based on cluster Pod scheduling combined with data lake as described in claim 6, characterized in that: A Kubernetes-based cluster includes a Master, Nodes, and Pods.

8. The data optimization method based on cluster Pod scheduling combined with data lake as described in claim 7, characterized in that: The metadata corresponding to the collected data includes descriptions of data records, indexes, key values, and relationships between different data attributes; the meta-process data includes the recorded date, location, responsible person, and other ancillary information. The metadata identifier format is: numeric###metadataID###metaprocess dataID.