Data processing methods, apparatus, electronic devices and storage media

By using configurable data refresh scripts and time files, and leveraging the parallel execution of data refresh across multiple nodes, the problems of low efficiency and insufficient accuracy in data refresh for large partitioned tables are solved, achieving efficient and accurate data refresh.

CN117632982B · Active · Publication Date: 2026-06-30 · WEBANK (CHINA)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
WEBANK (CHINA)
Filing Date
2023-11-28
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies have low data refresh efficiency and low availability and accuracy of refreshed tables. In particular, the data refresh process for large partitioned tables involves a large number of manual operations, which leads to low efficiency and is prone to errors.

Method used

By acquiring the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file, the data refresh script and time can be configured. Multiple nodes can be used to execute data refresh tasks in parallel, decoupling nodes from business logic and reducing manual operations.

Benefits of technology

It automates the data refresh process, significantly reducing refresh time, improving efficiency, and ensuring the availability and accuracy of the table after refresh.

Abstract

This application discloses a data processing method, apparatus, electronic device, and storage medium, relating to the field of financial technology (Fintech). The data processing method includes the following steps: obtaining a data table to be refreshed, a data refresh time configuration file, and a data refresh script configuration file; and refreshing the data table to be refreshed through multiple nodes based on the data refresh time configuration file and the data refresh script configuration file. This application solves the technical problems of low data refresh efficiency and low availability and accuracy of the refreshed table in related technologies.

Description

Technical Field

[0001] This application relates to the field of financial technology (Fintech), and more particularly to a data processing method, apparatus, electronic device, and storage medium.

Background Technology

[0002] With the continuous development of fintech, especially internet fintech, more and more technologies (such as distributed systems and artificial intelligence) are being applied in the financial field, but the financial industry is also placing higher demands on technology.

[0003] In the construction of data warehouses, we often encounter large partitioned tables, where each partition is a full slice of data corresponding to a specific time period. As business grows rapidly, the number of partitions increases, and the data in each partition becomes larger. When data needs to be added or modified, such as changing the definition of some existing fields or adding new fields for new business needs, it is necessary to refresh all or part of the historical partitions.

[0004] However, current data refresh operations involve a large amount of manual work, resulting in low efficiency and unreliable availability and accuracy of the table after refresh.

Summary of the Invention

[0005] The main objective of this application is to provide a data processing method, apparatus, electronic device, and storage medium, which aims to solve the technical problems of low data refresh efficiency and low availability and accuracy of tables after data refresh in related technologies.

[0006] To achieve the above objectives, this application provides a data processing method, which includes the following steps:

[0007] Retrieve the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file;

[0008] The data table to be refreshed is refreshed by multiple nodes based on the data refresh time configuration file and the data refresh script configuration file.

[0009] This application also provides a data processing apparatus, the data processing apparatus comprising:

[0010] The acquisition module is used to acquire the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file.

[0011] The data refresh module is used to refresh the data table to be refreshed by multiple nodes based on the data refresh time configuration file and the data refresh script configuration file.

[0012] This application also provides an electronic device, which is a physical device, comprising: a memory, a processor, and a program of the data processing method stored in the memory and executable on the processor. When the program of the data processing method is executed by the processor, it can implement the steps of the data processing method as described above.

[0013] This application also provides a storage medium, which is a computer-readable storage medium, on which a program implementing a data processing method is stored. When the program implementing the data processing method is executed by a processor, it implements the steps of the data processing method as described above.

[0014] This application provides a data processing method, apparatus, electronic device, and storage medium. By acquiring the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file, the data refresh script and data refresh time are configurable. Then, multiple nodes refresh the data table based on the data refresh time configuration file and the data refresh script configuration file, enabling parallel execution of the data refresh task by multiple nodes, which can significantly reduce the time required for data refresh. On the one hand, in the case of multiple nodes executing data refresh tasks in parallel, each node's script contains a large amount of repetitive code. Compared to manually copying the data refresh script, modifying time parameters, and controlling the entire data refresh process, the configurable processing of the data refresh script and data refresh time allows only the data refresh time configuration file and data refresh script configuration file to be maintained. During data refresh, nodes can automatically execute the entire data refresh process by calling the maintained data refresh time configuration file and data refresh script configuration file, eliminating the need for extensive manual copying, pasting, and modification. On the other hand, data refresh time configuration files enable decoupling between nodes and business logic. This means that during data refresh, task allocation can be based on the refresh time, rather than relying on business logic. By maintaining these configuration files, tasks can be assigned to each node according to the refresh time. This effectively avoids situations where different tasks need to process the same data simultaneously during parallel refresh execution, thus reducing refresh errors. 
Therefore, it overcomes the current technical shortcomings of data refresh operations, which involve a large amount of manual work, resulting in low efficiency and compromised availability and accuracy after refresh. Data from different times in the table to be refreshed can be processed in parallel, significantly reducing refresh time and improving efficiency while ensuring the availability and accuracy of the refreshed table.

Attached Figure Description

[0015] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0016] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart illustrating the first embodiment of the data processing method in this application;

[0018] Figure 2 is a flowchart illustrating one possible implementation of the data processing method involved in the embodiments of this application;

[0019] Figure 3 is a flowchart illustrating the second embodiment of the data processing method in this application;

[0020] Figure 4 is a schematic diagram of the structure of one embodiment of the data processing apparatus in this application;

[0021] Figure 5 is a schematic diagram of the device structure of the hardware operating environment involved in the data processing method in the embodiments of this application.

[0022] The purpose, features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0023] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0024] In the construction of data warehouses, large partitioned tables are frequently encountered, where each partition represents a full slice of data at a corresponding point in time. As business grows rapidly, the number of partitions increases, and the data in each partition becomes larger. When data needs to be added or modified, such as changing the definition of some existing fields or adding new fields for new business needs, it is necessary to refresh all or part of the historical partitions, resulting in significant operational pressure.

[0025] However, because the specific circumstances of each data addition or modification are different, the required data refresh operations also vary. Furthermore, there may be logical relationships between data in different partitions. Therefore, if data refresh tasks are executed in parallel according to partitions, multiple nodes may simultaneously need to access the same data. In this case, waiting or inability to access the data may occur, leading to reduced data refresh efficiency or accuracy, and potentially causing logical inconsistencies. For example, if a previous task requires data A, but a subsequent task modifies data A, the refreshed data will be incorrect because the previous task used the unmodified data A. Therefore, current data refresh operations involve a significant amount of manual work. However, manual operations are time-consuming, inefficient, and prone to errors, which can compromise the availability and accuracy of the refreshed table.

[0026] Therefore, this application provides a data processing method, apparatus, electronic device, and storage medium. By acquiring the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file, the data refresh script and data refresh time are configurable. Then, multiple nodes refresh the data table to be refreshed based on the data refresh time configuration file and the data refresh script configuration file, enabling parallel execution of the data refresh task by multiple nodes, which can significantly reduce the time required for data refresh. On the one hand, in the case of multiple nodes executing data refresh tasks in parallel, each node's script contains a large amount of repetitive code. Compared to manually copying the data refresh script, modifying time parameters, and controlling the entire data refresh process, the configurable processing of the data refresh script and data refresh time allows only the data refresh time configuration file and data refresh script configuration file to be maintained. During data refresh, nodes can automatically execute the entire data refresh process by calling the maintained data refresh time configuration file and data refresh script configuration file, eliminating the need for extensive manual copying, pasting, and modification. On the other hand, data refresh time configuration files enable decoupling between nodes and business logic. This means that during data refresh, task allocation can be based on the refresh time, rather than relying on business logic. By maintaining these configuration files, tasks can be assigned to each node according to the refresh time. This effectively avoids situations where different tasks need to process the same data simultaneously during parallel refresh execution, thus reducing refresh errors. 
Therefore, it overcomes the current technical shortcomings of data refresh operations, which involve a large amount of manual work, resulting in low efficiency and compromised availability and accuracy after refresh. Data from different times in the table to be refreshed can be processed in parallel, significantly reducing refresh time and improving efficiency while ensuring the availability and accuracy of the refreshed table.

[0027] Embodiment 1

[0028] Embodiment 1 of this application provides a data processing method. Referring to Figure 1, the data processing method includes the following steps:

[0029] Step S10: Obtain the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file;

[0030] The execution subject of the method in this embodiment can be a data processing device, a data processing terminal device, or a server. This embodiment takes a data processing device as an example, which can be integrated into terminal devices such as smartphones and computers with data processing functions.

[0031] In this embodiment, it should be noted that the data table to be refreshed is a partitioned table that requires data refresh. When the definitions of some existing fields need to be modified, or when new business needs to add new fields, it is necessary to refresh the data in all or part of the historical partitions. Each partition in the partitioned table is a full slice of data corresponding to a specific time period. The specific partitioning method can be determined according to actual needs, and this embodiment does not impose any restrictions. However, the data in different partitions are logically interconnected. Therefore, refreshing data in different partitions through different nodes may affect the data in other partitions, potentially leading to errors in the data refresh process and incorrect data after refresh. It should be noted that the "node" refers to the engine that executes the script.

[0032] In the actual data refresh process, two important steps are involved: writing the data refresh script and controlling the data refresh time. The data refresh script is the code that implements the refresh logic, and the data refresh time is the generation time of the data to be refreshed in the data table. The data refresh time configuration file is the configuration file for the data refresh time, and it can include the data time information corresponding to each node. If the data refresh script is copied manually and its time parameters are modified by hand to control the entire data refresh process, a large number of copy, paste, and modification operations are required, and problems such as data errors and missed partitions may occur. Furthermore, if the refresh logic changes, the entire process needs to be re-executed, which is time-consuming and laborious. By making these two key steps configurable, the entire data refresh process can be controlled by maintaining only two configuration files, and partitions can be refreshed using a dynamic partitioning approach. Configuring the data refresh script allows changes in business logic to be controlled. Configuring the data refresh time decouples the allocation of data refresh tasks from business logic, that is, it decouples nodes from business logic, which better enables multiple nodes to perform data refresh in parallel. Configuring the data refresh time also allows the workload of each node to be controlled, thereby effectively improving data refresh efficiency.

[0033] In one feasible approach, the data time range corresponding to the data table to be refreshed can be segmented according to the number of nodes, and each data time segment can be matched one-to-one with each node. This way, the data time information corresponding to each node can be determined. This ensures the comprehensiveness of data refresh of the data table to be refreshed and avoids the problem of cross-calling when data refresh is performed in parallel.
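The two properties named above (full coverage and no overlap between nodes) can be checked mechanically before any task is published. The following is a minimal sketch, not the method claimed in this application; the function name `check_segments`, the half-open `[start, end)` convention, and the sample dates are assumptions made for illustration.

```python
from datetime import date

def check_segments(segments, start, end):
    """Return a list of problems found in half-open [start, end) segments.

    `segments` maps a node name to a (start, end) pair of dates.
    An empty list means the segments tile [start, end) exactly.
    """
    problems = []
    ordered = sorted(segments.items(), key=lambda item: item[1][0])
    cursor = start
    for node, (seg_start, seg_end) in ordered:
        if seg_start >= seg_end:
            problems.append(f"{node}: empty or reversed range")
            continue
        if seg_start < cursor:
            problems.append(f"{node}: starts before {cursor}")
        elif seg_start > cursor:
            problems.append(f"{node}: gap before {seg_start}")
        cursor = max(cursor, seg_end)
    if cursor < end:
        problems.append(f"range not covered after {cursor}")
    elif cursor > end:
        problems.append(f"segments extend past {end}")
    return problems

# Hypothetical assignment: two nodes covering November and December 2022.
good = {
    "node1": (date(2022, 12, 1), date(2023, 1, 1)),
    "node2": (date(2022, 11, 1), date(2022, 12, 1)),
}
problems = check_segments(good, date(2022, 11, 1), date(2023, 1, 1))
```

Running such a check against the time configuration file before scheduling would catch overlapping ranges, which are the situation in which two nodes could touch the same partition.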

[0034] Before refreshing the data table, a corresponding data refresh time configuration file and data refresh script configuration file need to be written. The data refresh script configuration file can be written based on changes in actual business requirements, thus maintaining the data refresh code within it. Subsequent nodes can read the code from the configuration file, achieving the effect of synchronously modifying the execution scripts of all running nodes by only modifying the data refresh script configuration file. In the data refresh script configuration file, the time parameter of the data to be refreshed can be represented by time placeholders. This allows subsequent calls to the data refresh script configuration file to insert the corresponding data time information from the configuration file at the time placeholder positions, enabling separate control and combined execution of the data refresh script and refresh time, achieving partitioned refresh using dynamic partitioning. The data refresh time configuration file can be dynamically allocated according to actual conditions and resources, for example, by month or week, to balance operational efficiency and resources.

[0035] For example, the data refresh script configuration file can be:

[0036] "insert overwrite table XXX partition(ds)

[0037] select

[0038] --Add specific field information

[0039] ,t1.ds

[0040] from(

[0041] --Data manipulation logic

[0042] )t1

[0043] where ds>='%1$s'

[0044] and ds<'%2$s';".
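To illustrate how the time placeholders might be filled, here is a minimal sketch. It assumes the two placeholders are written as the Java-style positional specifiers `%1$s` (start) and `%2$s` (end); the function name `render_refresh_script`, the table `source_table`, and the column `col_a` are hypothetical stand-ins, not names from this application.

```python
def render_refresh_script(template: str, start: str, end: str) -> str:
    """Substitute the two time placeholders in a refresh script template."""
    # Plain string replacement; Python has no built-in `%1$s` formatting.
    return template.replace("%1$s", start).replace("%2$s", end)

# Hypothetical template following the shape of the example above.
TEMPLATE = (
    "insert overwrite table XXX partition(ds)\n"
    "select t1.col_a, t1.ds  -- col_a stands in for the real field list\n"
    "from (select col_a, ds from source_table) t1  -- stand-in logic\n"
    "where ds >= '%1$s'\n"
    "  and ds < '%2$s';"
)

script = render_refresh_script(TEMPLATE, "20221201", "20230101")
```

Because the template is shared, a change to the refresh logic is made once in the template and reaches every node the next time its script is rendered.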

[0045] For example, the data refresh time configuration file can be:

[0046] Node 1:

[0047] S1_dt:20221201 Start date

[0048] e1_dt:20230101 End date

[0049] node1_active:1 Whether to start

[0050] Node 2:

[0051] S2_dt:2022-11-01 Start date

[0052] e2_dt:2023-12-01 End date

[0053] node2_active:1 Whether to start

[0054] ...

[0055] Node N:

[0056] SN_dt:yyyy-MM-dd Start date

[0057] eN_dt:yyyy-MM-dd End date

[0058] nodeN_active:1 Whether to start
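A file in this shape can be read into one record per node. The sketch below is illustrative only: it assumes each setting is written as `key:value` followed by an optional comment, that keys follow the `s<i>_dt`, `e<i>_dt`, and `node<i>_active` pattern, and that the sample values in `CONFIG_TEXT` are hypothetical.

```python
import re

# Hypothetical configuration text in the key:value layout shown above.
CONFIG_TEXT = """\
Node 1:
s1_dt:20221201 start date
e1_dt:20230101 end date
node1_active:1 whether to start
Node 2:
s2_dt:20221101 start date
e2_dt:20221201 end date
node2_active:0 whether to start
"""

LINE = re.compile(
    r"^(?:(s|e)(\d+)_dt|node(\d+)_active)\s*:\s*(\S+)", re.IGNORECASE
)

def parse_time_config(text):
    """Return {node_index: {"start": ..., "end": ..., "active": bool}}."""
    nodes = {}
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip headers such as "Node 1:" and blank lines
        kind, dt_index, active_index, value = match.groups()
        entry = nodes.setdefault(int(dt_index or active_index), {})
        if kind is None:
            entry["active"] = value == "1"
        elif kind.lower() == "s":
            entry["start"] = value
        else:
            entry["end"] = value
    return nodes

config = parse_time_config(CONFIG_TEXT)
```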

[0059] As an example, step S10 includes: obtaining the data table to be refreshed, and a data refresh time configuration file and a data refresh script configuration file pre-written for the data table to be refreshed.

[0060] Step S20: The data table to be refreshed is refreshed by multiple nodes based on the data refresh time configuration file and the data refresh script configuration file.

[0061] In this embodiment, it should be noted that the node refers to the engine that executes scripts to complete data refresh tasks. Therefore, assigning different scripts to different nodes enables them to complete corresponding data refresh tasks. Since different data within the same time period have logical dependencies, such as data A1 being calculated based on data B1 or related to data B2, it is difficult to completely separate all the data to be refreshed in the data table and identify multiple independent data processing logics that do not affect each other if multiple nodes process different data within the same time period. Therefore, to ensure data accuracy and availability, parallel data refresh is usually not possible, resulting in low data refresh efficiency. However, this embodiment separates the data refresh time configuration file and allocates all data refresh tasks according to the time the data was generated. This decouples the nodes from the business logic, allowing parallel data refresh to be performed while ensuring data accuracy and availability, effectively improving data refresh efficiency.

[0062] As an example, step S20 includes: setting up multiple nodes according to actual resource conditions, generating corresponding data refresh task scripts by calling the data refresh script configuration file and the data refresh time configuration file through each node, and publishing them. After publication, each node can call its corresponding data refresh task script to refresh the data table to be refreshed in parallel.

[0063] Furthermore, the step of refreshing the data table to be refreshed by multiple nodes based on the data refresh time configuration file and the data refresh script configuration file includes:

[0064] Step S21: Based on the node information of each node, search the data refresh time configuration file to determine the data time information corresponding to each node.

[0065] Step S22: Each node refreshes the data table to be refreshed according to its corresponding data refresh time information and the data refresh script configuration file.

[0066] As an example, steps S21-S22 include: for each node, the data refresh time configuration file can be searched based on the node information to find the data time information corresponding to the node information, wherein the node information can be node identification information such as number and name. Then, a pre-written program automatically assembles the data time information corresponding to each node and the data refresh script configuration file into a data refresh task script corresponding to each node, and publishes it. After each node publishes its data refresh task script, the corresponding data refresh task script can be called by each node to refresh the data table to be refreshed in parallel.
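Steps S21 and S22 can be pictured as a lookup followed by concurrent execution. The sketch below uses a thread pool as a stand-in for independent execution engines; `TIME_CONFIG`, `refresh_range`, and `refresh_all` are hypothetical names, and a real node would submit its task script to a SQL engine rather than return a string.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical time configuration, keyed by node identifier.
TIME_CONFIG = {
    "node1": ("20221201", "20230101"),
    "node2": ("20221101", "20221201"),
}

def refresh_range(node, start, end):
    """Stand-in for a node running its refresh task script."""
    return f"{node} refreshed ds in [{start}, {end})"

def refresh_all(node_ids, time_config):
    """Look up each node's time range (S21), then refresh in parallel (S22)."""
    with ThreadPoolExecutor(max_workers=max(1, len(node_ids))) as pool:
        futures = {}
        for node in node_ids:
            start, end = time_config[node]  # lookup by node information
            futures[node] = pool.submit(refresh_range, node, start, end)
        # result() blocks until that node's task has finished.
        return {node: future.result() for node, future in futures.items()}

results = refresh_all(["node1", "node2"], TIME_CONFIG)
```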

[0067] Further, the data time information includes the refresh partition start time and refresh partition end time; the step of refreshing the data table to be refreshed by each node according to its corresponding data refresh time information and the data refresh script configuration file includes:

[0068] Step S221: Write the start time and end time of the refresh partition corresponding to each node into the start and end time field of the data to be refreshed in the data refresh script configuration file to generate the data refresh task script corresponding to each node.

[0069] Step S222: Each node calls its corresponding data refresh task script to refresh the data table to be refreshed.

[0070] In this embodiment, it should be noted that the data time information includes the refresh partition start time and refresh partition end time. The refresh partition start time and refresh partition end time can be used to determine the refresh data allocated to each node in the refresh data table.

[0071] As an example, steps S221-S222 include: For each node, a pre-written program can read the data refresh script configuration file. When the start and end time fields of the data to be refreshed are detected, the start and end times of the refresh partition corresponding to each node are written into the start and end time fields of the data to be refreshed in the data refresh script configuration file. This assembles the data refresh task script corresponding to each node and publishes it. After each node publishes its data refresh task script, the corresponding data refresh task script can be called by each node to refresh the data table in parallel.

[0072] Furthermore, the data time information also includes node startup control parameters; the step of writing the start time and end time of the refresh partition corresponding to each node into the start and end time field of the data to be refreshed in the data refresh script configuration file to generate the data refresh task script corresponding to each node includes:

[0073] Step S2211: Determine at least one target node to be started from each of the nodes according to the node start control parameters corresponding to each node.

[0074] Step S2212: Write the start time and end time of the refresh partition corresponding to each target node into the start and end time field of the data to be refreshed in the data refresh script configuration file, and generate the data refresh task script corresponding to each target node.

[0075] In this embodiment, it should be noted that the data time information also includes node startup control parameters. These parameters control whether a node calls the data refresh task script to execute the data refresh task. For example, if node i detects that the node startup control parameter is 0, it will not execute the data refresh task script; if it detects that the node startup control parameter is 1, it will execute the data refresh task script. Since the amount of data in different tables to be refreshed may vary, setting the node startup control parameters improves the flexibility of allocating data refresh tasks to nodes without modifying the number of nodes or the data refresh time parameters, further reducing the workload of manual operations.

[0076] As an example, steps S2211-S2212 include: for each node, after obtaining the data time information, the node startup control parameters can be extracted first. Based on the node startup control parameters, it is determined whether to start. The nodes determined to start are identified as target nodes, and the nodes determined not to start are identified as non-target nodes. Then, by reading the data refresh script configuration file through a pre-written program, when the start and end time fields of the data to be refreshed are detected, the start and end times of the refresh partitions corresponding to each target node are written into the start and end time fields of the data to be refreshed in the data refresh script configuration file, thus assembling the data refresh task scripts corresponding to each target node.
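The selection of target nodes reduces to a filter on the startup control parameter. This is a minimal sketch under the assumption that each node's time information has already been loaded into a dictionary with `start`, `end`, and `active` keys; those key names and the function `select_target_ranges` are illustrative.

```python
def select_target_ranges(time_info):
    """Keep only nodes whose startup control parameter is 1.

    Returns {node: (start, end)} for the target nodes; nodes with any
    other parameter value are treated as non-target and skipped.
    """
    return {
        node: (info["start"], info["end"])
        for node, info in time_info.items()
        if info.get("active") == 1
    }

# Hypothetical time information for three nodes; node2 is switched off.
TIME_INFO = {
    "node1": {"start": "20221201", "end": "20230101", "active": 1},
    "node2": {"start": "20221101", "end": "20221201", "active": 0},
    "node3": {"start": "20221001", "end": "20221101", "active": 1},
}

targets = select_target_ranges(TIME_INFO)
```

Only the ranges in `targets` would then be written into the script template, so switching a node off requires changing one flag rather than removing the node or editing its dates.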

[0077] Furthermore, before the step of obtaining the data refresh time configuration file, the following steps are also included:

[0078] Step A10: Obtain the data table to be refreshed, and determine the time range of the data to be refreshed based on the data table to be refreshed;

[0079] Step A20: Divide the time range of the data to be refreshed into segments according to the preset number of nodes, and determine the start time and end time of the refresh partition for each node.

[0080] In this embodiment, it should be noted that, in order to avoid the problem of cross-calling of data, the time range covered by the data table to be refreshed can be segmented according to the number of nodes. The start time and end time of the refresh partition corresponding to each node are determined according to the segmented time period. In this way, the comprehensiveness of data refresh of the data table to be refreshed can be guaranteed, and the problem of cross-calling when data refresh is performed in parallel can be avoided.

[0081] As an example, steps A10-A20 include: obtaining a data table to be refreshed; determining the time range of data to be refreshed covered by the data table based on the time information of each piece of data to be refreshed in the data table; then segmenting the time range of data to be refreshed according to a preset number of nodes, dividing the time range of data to be refreshed into a number of time periods equal to the number of nodes, assigning a corresponding time period to each node, and the start and end times of the time period corresponding to each node are the start and end times of the refresh partition corresponding to each node.
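One way to carry out step A20 is to divide the range into contiguous day-based segments of near-equal length. The sketch below is an assumption about how the split could be done, not the only possible rule; `split_time_range` is a hypothetical name and the ranges are half-open, so one node's end date is the next node's start date.

```python
from datetime import date, timedelta

def split_time_range(start: date, end: date, node_count: int):
    """Split the half-open range [start, end) into node_count contiguous parts."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    total_days = (end - start).days
    if total_days < node_count:
        raise ValueError("range has fewer days than nodes")
    # Spread the remainder over the first `extra` segments.
    base, extra = divmod(total_days, node_count)
    segments = []
    cursor = start
    for index in range(node_count):
        length = base + (1 if index < extra else 0)
        next_cursor = cursor + timedelta(days=length)
        segments.append((cursor, next_cursor))
        cursor = next_cursor
    return segments

segments = split_time_range(date(2022, 11, 1), date(2023, 1, 1), 2)
```

Splitting by equal day counts balances partitions per node but not necessarily data volume; a deployment with uneven partition sizes might prefer month or week boundaries, as the description suggests.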

[0082] In one feasible approach, referring to Figure 2, the data processing method comprises three parts: rapid copying of large tables, configuration file design, and scheduling design and deployment. First, a backup data table is created. Then, the HDFS directories of the original data table and the backup data table are looked up, and the HDFS files in the original data table's HDFS directory are copied to the backup data table's directory. Next, the partitions of the backup table are repaired, yielding the data table to be refreshed. Configuration file design can be completed in advance. The data refresh time configuration file is obtained by configuring the time parameters, namely the start time, the end time, and whether to execute. The data refresh script configuration file is obtained by writing the data refresh logic and parameterizing its time parameters. After the configuration files are designed, the workflow is configured to generate a data refresh task script for each node. These scripts are then scheduled and published, after which multiple nodes can call their respective data refresh task scripts in parallel to refresh the data table to be refreshed.
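The rapid-copy part can be sketched as three commands generated for a given table. This assumes the warehouse is Apache Hive on HDFS, which the description implies but does not state; the statements `CREATE TABLE ... LIKE`, `hdfs dfs -cp`, and `MSCK REPAIR TABLE` are standard Hive and HDFS commands, while the function name, table names, and directory paths are hypothetical.

```python
def backup_commands(original: str, backup: str, src_dir: str, dst_dir: str) -> list:
    """Return the commands for a file-level copy of a partitioned table."""
    return [
        f"CREATE TABLE {backup} LIKE {original};",  # same structure, no data
        f"hdfs dfs -cp {src_dir}/* {dst_dir}/",     # copy partition directories
        f"MSCK REPAIR TABLE {backup};",             # register copied partitions
    ]

# Hypothetical table names and warehouse paths.
commands = backup_commands(
    "db.orders",
    "db.orders_bak",
    "/warehouse/db/orders",
    "/warehouse/db/orders_bak",
)
```

Copying files and repairing partitions avoids running a query over every historical partition, which is presumably why the description calls this copy "rapid".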

[0083] In this embodiment, by acquiring the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file, the data refresh script and data refresh time are configurable. Then, multiple nodes refresh the data table based on the data refresh time configuration file and the data refresh script configuration file, enabling parallel execution of the data refresh task by multiple nodes, which can significantly reduce the time required for data refresh. On the one hand, when multiple nodes execute data refresh tasks in parallel, each node's script contains a large amount of repetitive code. Compared to manually copying the data refresh script, modifying time parameters, and controlling the entire refresh process, the configurable processing of the data refresh script and data refresh time allows for automated execution of the entire data refresh process by simply maintaining the data refresh time configuration file and the data refresh script configuration file, eliminating the need for extensive manual copying, pasting, and modification. On the other hand, data refresh time configuration files enable decoupling between nodes and business logic. This means that during data refresh, task allocation can be based on the refresh time, rather than relying on business logic. By maintaining these configuration files, tasks can be assigned to each node according to the refresh time. This effectively avoids situations where different tasks need to process the same data simultaneously during parallel refresh execution, thus reducing refresh errors. Therefore, it overcomes the current technical shortcomings of data refresh operations, which involve a large amount of manual work, resulting in low efficiency and compromised availability and accuracy after refresh. 
Data from different times in the table to be refreshed can be processed in parallel, significantly reducing refresh time and improving efficiency while ensuring the availability and accuracy of the refreshed table.

[0084] Embodiment 2

[0085] Furthermore, referring to Figure 3, the second embodiment of this application is based on the above embodiments. Content that is the same as or similar to the above embodiments can be found in the above description and will not be repeated here. On this basis, the step of obtaining the data table to be refreshed includes:

[0086] Step B10: Obtain the original data table;

[0087] Step B20: Copy the original data table to obtain the data table to be refreshed;

[0088] After the step of refreshing the data table to be refreshed by multiple nodes based on the data refresh time configuration file and the data refresh script configuration file, the method further includes:

[0089] Step B30: Replace the original data table with the data table to be refreshed.

[0090] In this embodiment, it should be noted that currently, data refresh typically involves directly modifying the original data table. However, due to the gradually increasing data volume and the large number of historical partitions requiring refresh, manual operation is labor-intensive and prone to errors. If problems arise during the data refresh process, the difficulty and time required for problem detection, troubleshooting, or repair are significant, resulting in a substantial reduction in data refresh efficiency. Furthermore, data refresh usually requires modifying the table structure. When the table structure is modified, the entire process of refreshing historical partitions must be completed on the same day; otherwise, it will affect the scheduling of the following day.

[0091] As an example, steps B10-B30 include: obtaining the original data table, creating a new data table, copying the table structure, data, and metadata of the original data table to the new data table, resulting in a data table to be refreshed that is identical to the original data table. This is equivalent to backing up the original data table, and the entire data refresh process operates on the data table to be refreshed without affecting the original data table. Then, after refreshing the data table to be refreshed, the table name of the data table to be refreshed is switched with the name of the original data table, thereby replacing the original data table with the data table to be refreshed, resulting in a new original data table.
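The copy, refresh, and name-switch flow of steps B10-B30 can be sketched as follows. This is a minimal illustration under assumptions: SQLite (via Python's standard `sqlite3` module) stands in for the data warehouse, the `_to_refresh` and `_old` suffixes are a hypothetical naming convention, and the `UPDATE` is a placeholder for the real refresh logic.

```python
import sqlite3


def refresh_via_copy(conn, table):
    """Copy `table`, refresh the copy, then switch names.

    Returns the name under which the untouched original is kept.
    """
    work = f"{table}_to_refresh"  # hypothetical naming convention
    old = f"{table}_old"
    cur = conn.cursor()
    # B10/B20: copy the original table. In SQLite, CREATE TABLE AS SELECT
    # copies column names and data only, not constraints or indexes.
    cur.execute(f"CREATE TABLE {work} AS SELECT * FROM {table}")
    # Refresh step (placeholder): real logic would come from the refresh script.
    cur.execute(f"UPDATE {work} SET amount = amount * 100")
    # B30: switch names so the refreshed copy becomes the original table.
    cur.execute(f"ALTER TABLE {table} RENAME TO {old}")
    cur.execute(f"ALTER TABLE {work} RENAME TO {table}")
    conn.commit()
    return old


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (dt TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2023-01-01", 1), ("2023-01-02", 2)],
)
old_name = refresh_via_copy(conn, "orders")
refreshed = conn.execute("SELECT amount FROM orders ORDER BY dt").fetchall()
original = conn.execute(f"SELECT amount FROM {old_name} ORDER BY dt").fetchall()
```

Readers of `orders` see the old data until the two renames run, and the pre-refresh data stays available under the returned name in case a rollback is needed.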

[0092] In one feasible approach, a data table can be created with an explicit table creation statement (create table), by creating a table from a query result (create table as select), or by copying the table structure of an existing table (create table like).

[0093] Further, the step of copying the original data table to obtain the data table to be refreshed includes:

[0094] Step B21: Create a backup data table with the same table structure as the original data table;

[0095] Step B22: Determine the storage path of the original data in the original data table according to the original data table configuration file of the original data table;

[0096] Step B23: Locate the original data file based on the storage path, and copy the original data file to the backup table directory of the backup table partition corresponding to the backup data table;

[0097] Step B24: Perform partition repair on the backup table partition to obtain the data table to be refreshed.

[0098] As an example, steps B21-B24 include: creating a backup data table whose structure is identical to that of the original data table by copying the table structure of an existing table (create table like); obtaining the original data table configuration file of the original data table and determining from it the storage path of the original data in the original data table; and locating the original data file along the storage path and copying it to the backup table directory of the backup table partition corresponding to the backup data table, thereby copying the data. However, after the original data file is copied to the backup table directory, its information has not yet been synchronized to the metadata store, so the copied data cannot be obtained by direct query. Partition repair is therefore also required, which includes checking the continuity of the table partitions, detecting the files of the table on the distributed file storage system, and writing the partition information that has not yet been written to the metadata store, to obtain the data table to be refreshed.

[0099] In this embodiment, backing up the original data table by copying the original data file can not only effectively improve efficiency, but also effectively reduce the problem of data inaccuracy caused by manual backup.

[0100] For example, the underlying storage system for Hive (a data warehouse tool based on Hadoop) tables is the Hadoop Distributed File System (HDFS). Each table has a corresponding directory for storing its data, and the default root of these directories can be configured via the `hive.metastore.warehouse.dir` property in the configuration file. To obtain the data file storage directories of the original and backup tables, `desc formatted` can be run on each table to view its details and find the corresponding storage path "Location:XXX". Then, using Hive's syntax, the `dfs -cp` command copies the original data files from the original table directory to the backup table directory. Finally, `msck repair table` detects the table's files on HDFS and writes any partition information that has not yet been written to the metastore.
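The Hive sequence in this paragraph can be sketched as a small helper that reads the "Location:" row and emits the statements in order. This is only an illustration under assumptions: the helper names `parse_location` and `backup_statements`, the table names, the HDFS paths, and the sample `desc formatted` text are all hypothetical, the code only builds statement strings and does not contact a cluster, and the `/*` glob in the copy assumes the partition directories sit directly under the table directory.

```python
import re


def parse_location(desc_formatted_output):
    """Extract the path from the 'Location:' row of `desc formatted` output."""
    match = re.search(r"^\s*Location:\s*(\S+)", desc_formatted_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Location:' row found")
    return match.group(1)


def backup_statements(original, backup, original_dir, backup_dir):
    """Return Hive CLI statements for steps B21, B23, and B24, in order."""
    return [
        f"CREATE TABLE {backup} LIKE {original};",   # B21: same table structure
        f"dfs -cp {original_dir}/* {backup_dir}/;",  # B23: copy the data files
        f"MSCK REPAIR TABLE {backup};",              # B24: register partitions
    ]


# Hypothetical excerpt of `desc formatted db.orders` output.
sample = (
    "Owner:              \thive\n"
    "Location:           \thdfs://ns/warehouse/db.db/orders\n"
    "Table Type:         \tMANAGED_TABLE\n"
)
src = parse_location(sample)  # step B22: storage path of the original data
stmts = backup_statements(
    "db.orders", "db.orders_bak", src, "hdfs://ns/warehouse/db.db/orders_bak"
)
```

The backup table directory would in practice be read the same way, by running `desc formatted` on the backup table after it is created.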

[0101] In this embodiment, the original data table is backed up, and operations such as adding, deleting, and updating definitions, as well as data updates, are performed on the backed-up data table to be refreshed. This does not affect the use or accuracy of the production environment and is imperceptible to users, and it avoids the data instability and unavailability caused by directly manipulating the original data table. Furthermore, after the operations on the data table to be refreshed are completed, cross-validation against the original table data can be performed to ensure accuracy before the table name is switched and the table goes live. Finally, because the operations are performed on the data table to be refreshed, the duration of the entire data refresh process can be controlled flexibly. This avoids the limitation, which arises when the original data table is manipulated directly, that the data refresh process must be completed within one day so as not to affect the scheduling of the next day.
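One possible form of the cross-validation mentioned above is a per-partition row count comparison before the name switch. The sketch below is an assumption, since the embodiment does not specify how validation is done: it presumes the refresh should preserve the number of rows in each partition, and the function name and sample counts are hypothetical.

```python
def partition_mismatches(original_counts, refreshed_counts):
    """Return partitions whose row counts differ or that exist in only one table."""
    partitions = set(original_counts) | set(refreshed_counts)
    return sorted(
        p for p in partitions
        if original_counts.get(p) != refreshed_counts.get(p)
    )


# Hypothetical row counts per partition, e.g. collected with COUNT(*) queries.
original_counts = {"2023-01-01": 10, "2023-01-02": 12, "2023-01-03": 7}
refreshed_counts = {"2023-01-01": 10, "2023-01-02": 11}

mismatches = partition_mismatches(original_counts, refreshed_counts)
safe_to_switch = not mismatches
```

A non-empty result would block the name switch until the listed partitions are re-run or investigated.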

[0102] Example 3

[0103] Furthermore, embodiments of this application also provide a data processing apparatus. Referring to Figure 4, the data processing apparatus includes:

[0104] The acquisition module 10 is used to obtain the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file.

[0105] The data refresh module 20 is used to refresh the data table to be refreshed by multiple nodes based on the data refresh time configuration file and the data refresh script configuration file.

[0106] Furthermore, the acquisition module 10 is also used for:

[0107] Obtain the original data table;

[0108] Copy the original data table to obtain the data table to be refreshed;

[0109] After the data refresh operation is performed on the data table to be refreshed via multiple nodes based on the data refresh time configuration file and the data refresh script configuration file, the data processing device further includes a replacement module, which is used for:

[0110] Replace the original data table with the data table to be refreshed.

[0111] Furthermore, the acquisition module 10 is also used for:

[0112] Create a backup data table with the same table structure as the original data table;

[0113] Based on the original data table configuration file of the original data table, determine the storage path of the original data in the original data table;

[0114] Based on the storage path, locate the original data file and copy the original data file to the backup table directory of the backup table partition corresponding to the backup data table;

[0115] The backup table partition is repaired to obtain the data table to be refreshed.

[0116] Furthermore, the data refresh module 20 is also used for:

[0117] Based on the node information of each node, the data refresh time configuration file is searched to determine the data time information corresponding to each node.

[0118] Each node refreshes the data table to be refreshed according to its corresponding data time information and the data refresh script configuration file.

[0119] Furthermore, the data time information includes the refresh partition start time and refresh partition end time; the data refresh module 20 is also used for:

[0120] Write the start time and end time of the refresh partition corresponding to each node into the start and end time field of the data to be refreshed in the data refresh script configuration file to generate the data refresh task script corresponding to each node.

[0121] Each node invokes its corresponding data refresh task script to refresh the data table to be refreshed.

[0122] Furthermore, the data time information also includes node startup control parameters; the data refresh module 20 is also used for:

[0123] Based on the node startup control parameters corresponding to each of the nodes, at least one target node to be started is determined from each of the nodes;

[0124] Write the start time and end time of the refresh partition corresponding to each target node into the start and end time field of the data to be refreshed in the data refresh script configuration file to generate the data refresh task script corresponding to each target node.
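The lookup, startup control, and script generation described in paragraphs [0117] to [0124] can be sketched as follows. This is a minimal illustration under assumptions: the layout of `TIME_CONFIG`, the `enabled` flag used as the node startup control parameter, the `$start` and `$end` placeholders, and the SQL in `SCRIPT_TEMPLATE` are all hypothetical and are not taken from the patent.

```python
from string import Template

# Hypothetical data refresh time configuration: node information mapped to
# the refresh partition start time, end time, and a startup control flag.
TIME_CONFIG = {
    "node1": {"start": "2023-01-01", "end": "2023-01-31", "enabled": True},
    "node2": {"start": "2023-02-01", "end": "2023-02-28", "enabled": False},
    "node3": {"start": "2023-03-01", "end": "2023-03-31", "enabled": True},
}

# Hypothetical data refresh script configuration with start and end time fields.
SCRIPT_TEMPLATE = Template(
    "INSERT OVERWRITE TABLE t_refresh PARTITION (ds) "
    "SELECT * FROM t_source WHERE ds BETWEEN '$start' AND '$end';"
)


def build_task_scripts(time_config, template):
    """Generate one refresh task script per node that is set to start."""
    scripts = {}
    for node, info in time_config.items():
        if not info["enabled"]:
            continue  # node startup control parameter says: do not start
        scripts[node] = template.substitute(start=info["start"], end=info["end"])
    return scripts


scripts = build_task_scripts(TIME_CONFIG, SCRIPT_TEMPLATE)
```

Only the target nodes receive a script, so re-running a single failed range is a matter of flipping flags in the time configuration rather than editing any script.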

[0125] Furthermore, prior to obtaining the data refresh time configuration file, the data processing device further includes a determining module, which is used to:

[0126] Obtain the data table to be refreshed, and determine the time range of the data to be refreshed based on the data table to be refreshed;

[0127] The time range of the data to be refreshed is segmented according to the preset number of nodes, and the start time and end time of the refresh partition corresponding to each node are determined.
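The segmentation performed by the determining module can be sketched as follows. This is a minimal illustration under assumptions: the patent only states that the time range is segmented according to the preset number of nodes, so the daily granularity, the near-even split, and the function name `split_date_range` are hypothetical choices.

```python
from datetime import date, timedelta


def split_date_range(start, end, node_count):
    """Split the inclusive range [start, end] into contiguous per-node ranges."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    total_days = (end - start).days + 1
    if total_days < 1:
        raise ValueError("end must not be before start")
    base, extra = divmod(total_days, node_count)
    ranges = []
    cursor = start
    for i in range(node_count):
        # The first `extra` nodes take one additional day each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            break  # more nodes than days: remaining nodes get no range
        segment_end = cursor + timedelta(days=length - 1)
        ranges.append((cursor, segment_end))
        cursor = segment_end + timedelta(days=1)
    return ranges


ranges = split_date_range(date(2023, 1, 1), date(2023, 1, 10), 3)
```

Each returned pair supplies the refresh partition start time and end time for one node, and the ranges neither overlap nor leave gaps.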

[0128] The data processing apparatus provided by this invention, employing the data processing method described in the above embodiments, solves the technical problems of low data refresh efficiency and low availability and accuracy of tables after data refresh in related technologies. Compared with related technologies, the beneficial effects of the data processing apparatus provided by the embodiments of this invention are the same as those of the data processing method described in the above embodiments, and other technical features in this data processing apparatus are the same as those disclosed in the methods of the above embodiments, and will not be repeated here.

[0129] Example 4

[0130] Furthermore, embodiments of the present invention provide an electronic device, the electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the data processing method described above.

[0131] Referring now to Figure 5, a structural schematic of an electronic device suitable for implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as Bluetooth headsets, mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 5 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.

[0132] As shown in Figure 5, an electronic device may include a processing unit (such as a central processing unit, graphics processing unit, etc.) that can perform various appropriate actions and processes based on a program stored in read-only memory (ROM) or a program loaded from a storage device into random access memory (RAM). The RAM also stores various programs and data required for the operation of the electronic device. The processing unit, ROM, and RAM are interconnected via a bus. An input/output (I/O) interface is also connected to the bus.

[0133] Typically, the following devices can be connected to the I/O interface: input devices including, for example, touchscreens, touchpads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, etc.; output devices including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices including, for example, magnetic tapes, hard disks, etc.; and communication devices. The communication devices allow the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although an electronic device with various devices is shown in the figures, it should be understood that not all of the devices shown need to be implemented or present. More or fewer devices may be implemented instead.

[0134] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device, or installed from a storage device, or installed from a ROM. When the computer program is executed by a processing device, it performs the functions defined above in the methods of embodiments of this disclosure.

[0135] The electronic device provided by this invention, employing the data processing method described in the above embodiments, solves the technical problems of low data refresh efficiency and low availability and accuracy of the table after data refresh in related technologies. Compared with related technologies, the beneficial effects of the electronic device provided by the embodiments of this invention are the same as those of the data processing method provided in the above embodiments, and other technical features of this electronic device are the same as those disclosed in the methods of the above embodiments, and will not be repeated here.

[0136] It should be understood that various parts of this disclosure can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics may be combined in any suitable manner in one or more embodiments or examples.

[0137] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

[0138] Example 5

[0139] Furthermore, this embodiment provides a computer-readable storage medium having computer-readable program instructions stored thereon, which are used to execute the data processing method described in the above embodiment.

[0140] The computer-readable storage medium provided in this embodiment of the invention may be, for example, a USB flash drive, but is not limited thereto; it may be an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this embodiment, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination thereof.

[0141] The aforementioned computer-readable storage medium may be included in an electronic device or may exist independently without being assembled into an electronic device.

[0142] The aforementioned computer-readable storage medium carries one or more programs that, when executed by an electronic device, cause the electronic device to: acquire a data table to be refreshed, a data refresh time configuration file, and a data refresh script configuration file; and refresh the data table to be refreshed through multiple nodes based on the data refresh time configuration file and the data refresh script configuration file.

[0143] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0144] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0145] The modules described in the embodiments of this disclosure can be implemented in software or hardware. The name of a module does not necessarily limit the module itself.

[0146] The computer-readable storage medium provided by this invention stores computer-readable program instructions for performing the above-described data processing method, solving the technical problems of low data refresh efficiency and low availability and accuracy of tables after data refresh in related technologies. Compared with related technologies, the beneficial effects of the computer-readable storage medium provided in the embodiments of this invention are the same as the beneficial effects of the data processing method provided in the above embodiments, and will not be repeated here.

[0147] Example 6

[0148] Furthermore, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the data processing method described above.

[0149] The computer program product provided in this application solves the technical problems of low data refresh efficiency and low availability and accuracy of tables after data refresh in related technologies. Compared with related technologies, the beneficial effects of the computer program product provided in the embodiments of this invention are the same as the beneficial effects of the data processing methods provided in the above embodiments, and will not be repeated here.

[0150] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent scope of this application.

Claims

1. A data processing method, characterized in that the data processing method includes the following steps:
obtaining the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file;
refreshing the data table to be refreshed through multiple nodes based on the data refresh time configuration file and the data refresh script configuration file;
wherein the step of obtaining the data table to be refreshed includes:
obtaining the original data table;
creating a backup data table with the same table structure as the original data table;
determining, based on the original data table configuration file of the original data table, the storage path of the original data in the original data table;
locating the original data file based on the storage path, and copying the original data file to the backup table directory of the backup table partition corresponding to the backup data table;
performing partition repair on the backup table partition to obtain the data table to be refreshed;
wherein the data time information includes the refresh partition start time and the refresh partition end time, and the step of refreshing the data table to be refreshed through multiple nodes based on the data refresh time configuration file and the data refresh script configuration file includes:
searching the data refresh time configuration file based on the node information of each node to determine the data time information corresponding to each node;
writing the refresh partition start time and refresh partition end time corresponding to each node into the start and end time fields of the data to be refreshed in the data refresh script configuration file, to generate the data refresh task script corresponding to each node;
invoking, by each node, its corresponding data refresh task script to refresh the data table to be refreshed;
and wherein, after the step of refreshing the data table to be refreshed through multiple nodes based on the data refresh time configuration file and the data refresh script configuration file, the method further includes:
replacing the original data table with the data table to be refreshed.

2. The data processing method as described in claim 1, characterized in that the data time information also includes node startup control parameters;
the step of writing the refresh partition start time and refresh partition end time corresponding to each node into the start and end time fields of the data to be refreshed in the data refresh script configuration file, to generate the data refresh task script corresponding to each node, includes:
determining, based on the node startup control parameters corresponding to each of the nodes, at least one target node to be started from among the nodes;
writing the refresh partition start time and refresh partition end time corresponding to each target node into the start and end time fields of the data to be refreshed in the data refresh script configuration file, to generate the data refresh task script corresponding to each target node.

3. The data processing method as described in claim 1, characterized in that, before the step of obtaining the data refresh time configuration file, the method further includes:
obtaining the data table to be refreshed, and determining the time range of the data to be refreshed based on the data table to be refreshed;
segmenting the time range of the data to be refreshed according to the preset number of nodes, and determining the refresh partition start time and refresh partition end time corresponding to each node.

4. A data processing apparatus, characterized in that the data processing apparatus includes:
an acquisition module, used to acquire the data table to be refreshed, the data refresh time configuration file, and the data refresh script configuration file;
a data refresh module, used to refresh the data table to be refreshed through multiple nodes based on the data refresh time configuration file and the data refresh script configuration file;
wherein the acquisition module is further configured to: acquire the original data table; create a backup data table with the same table structure as the original data table; determine the storage path of the original data in the original data table according to the original data table configuration file of the original data table; locate the original data file based on the storage path and copy the original data file to the backup table directory of the backup table partition corresponding to the backup data table; and perform partition repair on the backup table partition to obtain the data table to be refreshed;
wherein the data time information includes the refresh partition start time and the refresh partition end time, and the data refresh module is further configured to: search the data refresh time configuration file based on the node information of each node to determine the data time information corresponding to each node; write the refresh partition start time and refresh partition end time corresponding to each node into the start and end time fields of the data to be refreshed in the data refresh script configuration file, to generate a data refresh task script corresponding to each node; and refresh the data table to be refreshed by having each node invoke its corresponding data refresh task script;
a replacement module, used to replace the original data table with the data table to be refreshed.

5. An electronic device, characterized in that the electronic device includes:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the steps of the data processing method according to any one of claims 1 to 3.

6. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a program implementing the data processing method is stored, and the program implementing the data processing method is executed by a processor to implement the steps of the data processing method as described in any one of claims 1 to 3.