A dynamic load balancing distributed data migration method and device
By optimizing task allocation through multi-dimensional evaluation and dynamic granularity splitting algorithms, combined with dynamic load balancing and encrypted connections, the problems of node overload and resource idleness in distributed database data migration are solved, achieving efficient and secure data migration and business continuity.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA ZHESHANG BANK
- Filing Date
- 2026-02-12
- Publication Date
- 2026-06-19
AI Technical Summary
The lack of dynamic load balancing strategies in distributed database data migration leads to node overload or resource idleness. The data importance assessment system is one-sided and cannot ensure the priority processing and transmission security of core business data. The data verification capability is insufficient and cannot accurately locate and repair problematic data blocks.
The system employs multi-dimensional evaluation metrics to calculate data importance scores, optimizes task allocation through a dynamic granularity splitting algorithm, combines two-way authentication encrypted connections and dynamic load balancing algorithms to achieve priority transmission of data blocks, and introduces a multi-level verification and anomaly backtracking and repair mechanism.
The method improves data migration efficiency, ensures data integrity and security, enhances business continuity, avoids node overload and resource idleness, and guarantees the priority processing and secure transmission of core business data.
Figure CN122240294A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of database migration technology, and in particular to a distributed data migration method and apparatus with dynamic load balancing.
Background Technology
[0002] Distributed database data migration refers to the core technology of transferring data from a source database cluster across nodes, architectures, and storage media to a target database cluster in a distributed architecture environment. This technology can effectively support key scenarios such as system upgrades and reconstructions, system expansion, disaster recovery deployments, and database version upgrades.
[0003] However, distributed database data migration faces several drawbacks during implementation: First, the target host load balancing strategy lacks dynamic adaptability. Traditional static task allocation mechanisms cannot detect dynamic load changes such as host CPU utilization, memory usage, and network bandwidth in real time, leading to overload and congestion on some nodes and idle resources on others, resulting in slow migration speeds. Second, the data importance assessment system is one-sided. Existing models rely on single indicators such as data access frequency or data size, failing to consider key indicators such as the correlation between data access frequency and core business, and data security. This results in insufficient priority for core business data migration, potentially causing business interruption risks and transmission security issues. Furthermore, the fault tolerance capability of the verification process during data transmission is weak, making it difficult to accurately locate and repair problematic data blocks after anomaly detection.
[0004] Therefore, constructing a distributed database data migration solution that integrates dynamic load balancing, multi-dimensional data importance evaluation, and migration data verification to ensure migration efficiency, data security, and business continuity is a technical problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0005] This invention provides a dynamic load-balanced distributed data migration method and apparatus, which has the advantages of improving migration efficiency, ensuring data integrity and security, and enhancing business continuity. This invention provides the following technical solution:
[0006] This invention discloses a distributed data migration method with dynamic load balancing, comprising the following steps:
[0007] Based on the system functional module division and business requirement analysis, the data objects to be migrated and their scope are identified.
[0008] Based on a defined data range, establish a data structure mapping model from the source database (old structure) to the target database (new structure);
[0009] Multi-dimensional evaluation metrics are collected from the source database. Based on these metrics, the importance score of the original data is calculated. Data priorities are assigned according to the original data importance scores. Based on all the original data importance scores and the size of the original data, the migration task is divided into several independent subtasks using a dynamic granularity splitting algorithm.
[0010] Establish a two-way authentication and encrypted connection mechanism with the target host.
[0011] Based on the maximum capacity and real-time load status of each target host, the split subtasks are allocated using a dynamic load balancing algorithm. The original data blocks corresponding to the allocated subtasks are transmitted sequentially according to the priority order of the data blocks.
[0012] The target data is validated sequentially at the block level, field level, global consistency, and business logic levels. If any validation step detects an anomaly, the anomaly backtracking and repair mechanism is triggered.
[0013] While adopting the above technical solutions, the present invention may also adopt or combine the following technical solutions: the multi-dimensional evaluation indicators include access frequency, data scale, and relevance to core business; the original data importance score includes the importance of the data itself and the potential access frequency of the data; and the importance of the data itself includes a business impact factor and a system impact factor.
[0014] As a preferred embodiment of the present invention, the dynamic granularity splitting algorithm includes the following steps:
[0015] Based on the importance scores and size of all raw data, the task granularity baseline value is dynamically calculated using an adaptive task granularity calculation formula.
[0016] Based on the task granularity baseline value, the migration task is divided into several independent sub-tasks;
[0017] It is determined that the task granularity baseline value corresponding to high-priority data is smaller than that corresponding to low-priority data, and that the transmission and allocation of high-priority data blocks take precedence over those of low-priority data blocks.
[0018] As a preferred technical solution of the present invention, the two-way authentication encryption connection mechanism sequentially performs key negotiation, two-way identity authentication, and data transmission encryption operations.
[0019] As a preferred embodiment of the present invention, the maximum carrying capacity of each target host is measured and calculated using a sliding window probing method. The sliding window probing method adopts dynamic adaptive adjustment: the window duration of the sliding window is adjusted dynamically based on the traffic fluctuation coefficient of the target host.
[0020] As a preferred embodiment of the present invention, the real-time load status of each target host is obtained by collecting real-time load status parameters of each target host and constructing a load status quantification model; the load status parameters include CPU utilization, memory usage, and network bandwidth usage of the target host. The expression of the load status quantification model is:
[0021]
[0022] where the terms denote, in order: the maximum load state variable; the CPU utilization; the memory usage; and the network bandwidth utilization.
[0023] As a preferred embodiment of the present invention, based on the maximum carrying capacity and real-time load status of each target host, the sub-tasks are allocated using a dynamic load balancing algorithm, including the following steps:
[0024] Based on the maximum traffic capacity and real-time load status of each target host, the initial task allocation coefficient is calculated. The expression for the initial task allocation coefficient is as follows:
[0025]
[0026] where the terms denote, in order: the baseline load threshold set for calculating the migration task allocation coefficient; the load status variable; the time decay factor; the migration duration; the natural constant; a preset attenuation constant; the maximum traffic capacity of the target host; and the average maximum traffic capacity across all target hosts;
[0027] After correcting the initial task allocation coefficients, the final task allocation coefficients are calculated, and the expression for the final task allocation coefficients is obtained as follows:
[0028]
[0029]
[0030] where the terms denote, in order: the load deviation feedback adjustment factor; the load status variable; the average load, calculated in real time for load feedback adjustment; and the initial allocation coefficient;
[0031] Migration subtasks are assigned based on the final task allocation coefficient, with subtasks having higher final task allocation coefficients being assigned priority over subtasks with lower final task allocation coefficients.
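The two-stage allocation described in [0024] to [0031] can be sketched as follows. The original expressions are not reproduced in this text, so the way the terms are combined here (load headroom times exponential time decay times relative capacity, then a linear correction toward the average load) is an assumption for illustration, not the patent's formula. The parameter names and default values (`base_threshold`, `decay`, `feedback`) are hypothetical.

```python
import math
from typing import List


def initial_coefficient(load: float, capacity: float, mean_capacity: float,
                        elapsed_s: float, base_threshold: float = 0.8,
                        decay: float = 0.001) -> float:
    """Initial allocation coefficient for one target host (assumed form)."""
    # Headroom below the baseline load threshold; an overloaded host gets 0.
    headroom = max(base_threshold - load, 0.0)
    # Time decay factor e^(-decay * t), built from the natural constant.
    time_decay = math.exp(-decay * elapsed_s)
    # Scale by the host's capacity relative to the cluster average.
    return headroom * time_decay * capacity / mean_capacity


def final_coefficients(initial: List[float], loads: List[float],
                       feedback: float = 0.5) -> List[float]:
    """Correct the initial coefficients with load-deviation feedback (assumed form)."""
    mean_load = sum(loads) / len(loads)
    # Hosts below the average load are boosted, hosts above it are damped.
    adjusted = [a * max(1.0 + feedback * (mean_load - load), 0.0)
                for a, load in zip(initial, loads)]
    total = sum(adjusted)
    if total == 0:
        # Every host is saturated: fall back to an even split.
        return [1.0 / len(adjusted)] * len(adjusted)
    return [x / total for x in adjusted]


loads = [0.2, 0.6]
initial = [initial_coefficient(load, 100.0, 100.0, 0.0) for load in loads]
final = final_coefficients(initial, loads)
```

In this sketch the lightly loaded host ends up with the larger share, which matches the stated goal of avoiding both overload and idle resources.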
[0032] As a preferred technical solution of the present invention, the anomaly backtracking and repair mechanism includes at least one of the following: re-migrating the corresponding abnormal original data block, locating and repairing the specific abnormal field, and tracing the abnormal original data block and retransmitting for verification until all verification steps pass.
[0033] This invention provides another technical solution: a dynamic load-balanced distributed data migration device, the device comprising:
[0034] The data filtering module is used for system functional module division and business requirement analysis to clarify the data objects to be migrated and their scope.
[0035] The data mapping module is used to define the data range and establish a data structure mapping model from the source database to the target database.
[0036] The evaluation module is used to collect multi-dimensional evaluation indicators from the source database, calculate the importance score of the raw data based on the multi-dimensional evaluation indicators, and divide the data priority according to the importance score of the raw data. Based on all the importance scores of the raw data and the size of the raw data, the migration task is divided into several independent sub-tasks through a dynamic granularity splitting algorithm.
[0037] The authentication module is used to establish a two-way authenticated encrypted connection mechanism with the target host;
[0038] The transmission module is used to allocate the split subtasks according to the maximum carrying capacity and real-time load status of each target host through a dynamic load balancing algorithm, and to transmit the original data blocks corresponding to the allocated subtasks in order of data priority.
[0039] The verification module is used to sequentially perform block-level verification, field-level verification, global consistency verification, and business logic verification of the target data. If an anomaly is detected in any verification step, the anomaly backtracking and repair mechanism is triggered.
[0040] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0041] This invention effectively addresses the limitations of traditional evaluation systems by collecting multi-dimensional evaluation indicators and calculating the importance score of raw data, ensuring the priority processing of core business data. Based on the raw data importance score, a dynamic granularity splitting algorithm optimizes priority task allocation, improving data migration efficiency. Simultaneously, based on the maximum throughput and real-time load status of each target host, a dynamic load balancing algorithm allocates sub-tasks, avoiding node overload or resource idleness and improving transmission resource utilization. A two-way authentication and encrypted connection mechanism ensures the security and integrity of data transmission between the source database and the target host. The introduction of multi-level verification and anomaly backtracking and repair mechanisms comprehensively guarantees the integrity, accuracy, and business continuity of data migration.
Attached Figure Description
[0042] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0043] Figure 1 is a flowchart illustrating a dynamic load balancing distributed data migration method according to an embodiment of the present invention.
[0044] Figure 2 is a schematic diagram of the structure of a distributed data migration device with dynamic load balancing provided in an embodiment of the present invention;
[0045] Figure 3 is a schematic diagram of the structure of a medium provided in an embodiment of the present invention.
Detailed Implementation
[0046] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0047] Example 1
[0048] Traditional static task allocation mechanisms in distributed database data migration fail to detect the dynamic load status of hosts in real time, leading to overload or idle resources on some nodes and severely restricting migration efficiency. Meanwhile, data importance assessment systems are one-sided, ignoring key factors such as data access frequency and relevance to core business, resulting in insufficient migration of core business data and potential business interruption risks. Furthermore, data verification is inadequate, failing to fully guarantee data integrity and security.
[0049] To address this, this application proposes a dynamic load-balanced distributed data migration method. As shown in Figure 1, the method includes the following steps:
[0050] Step 101: Based on the system functional module division and business requirement analysis, clarify the data objects to be migrated and their scope, including but not limited to core business tables, historical archived data, and related configuration information and metadata.
[0051] Step 102 establishes a data structure mapping model from the source database (old structure) to the target database (new structure) based on the data scope determined in Step 101. This mapping defines how data is transformed from the old structure and carried into the newly designed structure during migration; the transformation may involve table splitting, field merging, type conversion, and encoding or format changes.
[0052] Step 103 collects multi-dimensional evaluation indicators from the source database, calculates the importance score of the original data based on the multi-dimensional evaluation indicators, and divides the data priority according to the importance score of the original data. Based on the importance scores of all the original data and the size of the original data, the migration task is divided into several independent sub-tasks through a dynamic granularity splitting algorithm.
[0053] Step 104 generates a session key through a key negotiation algorithm, establishes an encrypted transmission connection with the target host cluster using a two-way authentication encryption mechanism, and then encrypts the original data for transmission.
[0054] Step 105: Based on the maximum carrying capacity and real-time load status of each target host, the split subtasks are allocated using a dynamic load balancing algorithm. The original data blocks corresponding to the allocated subtasks are transmitted sequentially according to the priority order of the data blocks.
[0055] Step 106 sequentially performs block-level verification, field-level verification, global consistency verification, and business logic verification of the target data. If any verification step detects an anomaly, the anomaly backtracking and repair mechanism is triggered.
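The four verification levels in Step 106, together with a simple repair loop, can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 stands in for the block checksum (the text does not name one), the global consistency check is reduced to a block count comparison, and `field_check` and `business_check` are hypothetical caller-supplied predicates.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

Failure = Tuple[str, Optional[int]]


def block_checksum(data: bytes) -> str:
    # SHA-256 is an assumed stand-in for the block-level checksum.
    return hashlib.sha256(data).hexdigest()


def verify_migration(source: List[bytes], target: List[bytes],
                     field_check: Callable[[bytes], bool],
                     business_check: Callable[[List[bytes]], bool]) -> Optional[Failure]:
    """Run the four levels in order; return the first failure or None."""
    # Level 1: block-level comparison of checksums.
    for index, (src, dst) in enumerate(zip(source, target)):
        if block_checksum(src) != block_checksum(dst):
            return ("block", index)
    # Level 2: field-level validation of each target block.
    for index, dst in enumerate(target):
        if not field_check(dst):
            return ("field", index)
    # Level 3: global consistency, reduced here to a block count check.
    if len(source) != len(target):
        return ("global", None)
    # Level 4: business logic validation over the whole target data set.
    if not business_check(target):
        return ("business", None)
    return None


def migrate_with_repair(source: List[bytes], target: List[bytes],
                        field_check: Callable[[bytes], bool],
                        business_check: Callable[[List[bytes]], bool],
                        max_rounds: int = 3) -> bool:
    """Backtrack to the failing block and re-migrate it until all levels pass."""
    for _ in range(max_rounds):
        failure = verify_migration(source, target, field_check, business_check)
        if failure is None:
            return True
        _, index = failure
        if index is None:
            return False  # not attributable to a single block
        target[index] = source[index]  # re-migrate the abnormal block
    return verify_migration(source, target, field_check, business_check) is None


source_blocks = [b"row-1", b"row-2"]
target_blocks = [b"row-1", b"row-X"]  # second block corrupted in transit
first_failure = verify_migration(source_blocks, target_blocks,
                                 lambda b: len(b) > 0, lambda bs: True)
repaired = migrate_with_repair(source_blocks, target_blocks,
                               lambda b: len(b) > 0, lambda bs: True)
```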
[0056] In this embodiment, the multi-dimensional evaluation indicators in step 103 refer to quantitative or qualitative parameters of multiple different dimensions used to comprehensively measure the importance of data, including access frequency, data size, and core business relevance. Access frequency refers to the number of times data is read, queried, or accessed within a specific time period. It can be calculated by parsing the query logs of the source database or the access logs of the application system to count the number of accesses to a specific data block or table. Data size refers to the storage space occupied by the data or the number of data records. It can be measured by querying the metadata view of the source database to obtain the physical storage size of the data table or a specific data block, or by counting the number of rows in the data table or the number of records contained in a specific data block. Core business relevance refers to the closeness and scope of influence of data with the enterprise's core business processes, key functions, or important business transactions. It can be achieved through business analysis, mapping data items to key business processes to identify which data are necessary conditions for supporting the operation of core business functions.
[0057] Specifically, the raw data importance score includes both the importance of the data itself and the potential access frequency of the data. It can be represented as:
[0058]
[0059] where the terms denote, in order: the importance of the data itself; the potential access frequency of the data; and the data item to be scored.
[0060] The importance of the data itself includes two components: business impact factor and system impact factor.
[0061] Specifically, the importance of the data itself comprises the business impact factor and the system impact factor, and is obtained through data monitoring. It can be represented as:
[0062]
[0063] where the business impact factor represents the impact of the data item on the business; the system impact factor represents the impact of the data item on the system; and the two remaining terms are weighting coefficients, satisfying... The business impact factor and the system impact factor can be represented as:
[0064]
[0065]
[0066] where the terms denote, in order: the number of key business operations that depend on this data item; the total number of all businesses that depend on this data item; the daily average number of occurrences of the core business associated with this data item; the average daily total number of occurrences of core business operations; the number of system core modules that depend on this data item; and the total number of modules in the system.
[0067] It should be noted that the business impact factor is used to characterize the degree of impact of data on business transactions, business processes, and business decisions, while the system impact factor is used to characterize the degree of impact of data on system functions, system stability, and system module operation.
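The following sketch shows one way the quantities defined in [0066] could be combined. The original expressions are not reproduced in this text, so treating the business impact factor as a product of two ratios, the system impact factor as a single ratio, and the importance as a weighted sum are assumptions. The weights `w_business` and `w_system` are hypothetical placeholders; the constraint the source places on them is elided and is not enforced here.

```python
def ratio(part: float, whole: float) -> float:
    # Guard against an empty denominator.
    return part / whole if whole else 0.0


def business_impact(key_business_count: int, total_business_count: int,
                    related_daily_volume: float, core_daily_volume: float) -> float:
    """Business impact factor (assumed form: product of two ratios)."""
    dependency_share = ratio(key_business_count, total_business_count)
    volume_share = ratio(related_daily_volume, core_daily_volume)
    return dependency_share * volume_share


def system_impact(core_module_count: int, total_module_count: int) -> float:
    """System impact factor (assumed form: share of core modules that depend on the item)."""
    return ratio(core_module_count, total_module_count)


def intrinsic_importance(business: float, system: float,
                         w_business: float = 0.6, w_system: float = 0.4) -> float:
    """Importance of the data itself as a weighted sum (weights are placeholders)."""
    return w_business * business + w_system * system


b = business_impact(2, 4, 50.0, 100.0)
s = system_impact(1, 4)
importance = intrinsic_importance(b, s)
```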
[0068] In this embodiment, the potential access frequency of data refers to the theoretical upper limit of the number of times data may be accessed or used within a certain period of time in the future, reflecting the future activity of the data.
[0069] Specifically, the potential access frequency of a data item refers to the theoretical upper limit of the number of times that data item is accessed within the statistical period. It can be represented as:
[0070]
[0071] where the terms denote, in order: the number of business types associated with the core business; the total volume of each such business type within the statistical period; the number of times a single transaction of that business type accesses the target data item; the number of associated batch task types; the number of times each batch task type is executed within the statistical period; the number of times a single run of that batch task accesses the target data item; and the duration of the statistical period.
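Read together, the definitions in [0071] suggest summing the accesses generated by online business and by batch tasks, then normalizing by the period length. The sketch below follows that reading; the exact expression is not reproduced in this text, so the division by the period duration is an assumption.

```python
from typing import List, Tuple


def potential_access_frequency(business: List[Tuple[float, float]],
                               batch: List[Tuple[float, float]],
                               period: float) -> float:
    """Upper-bound access rate for one data item.

    business: (volume in period, accesses per transaction) per business type
    batch:    (runs in period, accesses per run) per batch task type
    period:   duration of the statistical period (unit chosen by the caller)
    """
    if period <= 0:
        raise ValueError("period must be positive")
    business_accesses = sum(volume * per_txn for volume, per_txn in business)
    batch_accesses = sum(runs * per_run for runs, per_run in batch)
    return (business_accesses + batch_accesses) / period


# Two business types and one batch task type over a 30-day period.
frequency = potential_access_frequency([(100, 2), (50, 1)], [(10, 5)], 30)
```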
[0072] Based on Embodiment 1 above, in order to optimize priority task allocation and improve data migration efficiency, the dynamic granularity splitting algorithm in step 103 of some of the above solutions in this application includes the following steps: based on the importance scores of all original data and the original data size, dynamically calculate the task granularity baseline value through an adaptive task granularity calculation formula; according to the task granularity baseline value, split the migration task into several independent sub-tasks; and, having determined that the task granularity baseline value corresponding to high-priority data is smaller than that corresponding to low-priority data, allocate and transmit high-priority data blocks ahead of low-priority data blocks.
[0073] In this embodiment, based on the importance scores and size of all original data, a task granularity baseline value is dynamically calculated using an adaptive task granularity calculation formula. The fineness of task splitting is dynamically adjusted according to data attributes (such as original data importance scores and data size). The expression for the task granularity baseline value is:
[0074]
[0075] where the terms denote, in order: the task granularity baseline value; the granularity adjustment coefficient; the data size of the data item, expressed in KB; and ε, a constant that prevents a zero denominator.
[0076] It should be noted that the higher the importance of the original data, the smaller the corresponding task granularity baseline value, thus achieving a more refined split. Furthermore, by using preset mapping rules or decision tree models, the corresponding task granularity baseline value can be dynamically looked up or calculated based on a combination range of the original data importance scores and the original data size. This ensures that the task splitting granularity can flexibly adapt to the characteristics of different data, forming the basis for subsequent priority transmission.
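The inverse relationship between importance and granularity can be illustrated as below. The original expression is not reproduced in this text; the form used here (adjustment coefficient times data size, divided by importance score plus ε) is an assumption chosen only because it matches the listed terms and the stated behavior. The defaults for `coefficient` and `epsilon` are hypothetical.

```python
def granularity_baseline(importance_score: float, size_kb: float,
                         coefficient: float = 0.01,
                         epsilon: float = 1e-6) -> float:
    """Task granularity baseline in KB (assumed form).

    A higher importance score yields a smaller baseline, so important data
    is split into finer sub-tasks. epsilon keeps the denominator non-zero.
    """
    if importance_score < 0 or size_kb < 0:
        raise ValueError("score and size must be non-negative")
    return coefficient * size_kb / (importance_score + epsilon)


high_priority = granularity_baseline(0.9, 1_000_000)
low_priority = granularity_baseline(0.1, 1_000_000)
```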
[0077] In this embodiment, based on the task granularity baseline, the migration task is divided into several independent subtasks, breaking down the massive data migration work into smaller modules that can be independently scheduled and executed. The expression for the number of tasks is:
[0078]
[0079] where the terms denote, in order: the total data size; the task granularity baseline value; and the number of subtasks after splitting.
[0080] It should be noted that each subtask can be defined based on the starting offset and length of the data block, and a unique identifier can be assigned to each subtask. Each subtask corresponds to a unique task ID and data block index, which facilitates scheduling and backtracking after transmission. This splitting method makes the migration process more controllable and facilitates parallel processing and resource allocation.
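A minimal sketch of the splitting step follows. It assumes the subtask count is the total size divided by the granularity baseline, rounded up so that the final partial block is not lost (the rounding is an assumption; the expression is not reproduced in this text). The `Subtask` type and the `task-NNNNNN` identifier scheme are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Subtask:
    task_id: str      # unique identifier used for scheduling and backtracking
    block_index: int  # position of the data block within the migration task
    offset: int       # starting offset of the block
    length: int       # block length


def split_task(total_size: int, granularity: int) -> List[Subtask]:
    """Split a migration task into fixed-granularity subtasks."""
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    count = math.ceil(total_size / granularity)
    subtasks = []
    for index in range(count):
        offset = index * granularity
        # The last block may be shorter than the granularity baseline.
        length = min(granularity, total_size - offset)
        subtasks.append(Subtask(f"task-{index:06d}", index, offset, length))
    return subtasks


subtasks = split_task(total_size=10, granularity=4)
```

Because each subtask carries its own offset and length, an abnormal block found during verification can be re-read from the source without touching its neighbors.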
[0081] In this embodiment, the task granularity baseline values corresponding to data of different priorities are compared. Original data with higher importance typically yields a smaller task granularity baseline value, meaning it is split into finer segments. Accordingly, when data blocks are allocated for transmission, blocks with smaller task granularity baseline values (that is, blocks corresponding to high-priority data) are processed first. The dynamic granularity splitting algorithm ensures that core business data is prioritized and transmitted at finer granularity, thereby improving the efficiency of data migration, reducing the risk of business interruption due to data congestion, and ensuring business continuity and timely data transmission during distributed data migration.
[0082] Based on Embodiment 1 above, in order to achieve secure data transmission between the source and target ends before data transmission, a two-way authentication encrypted connection mechanism is proposed in step 104 of some of the solutions described in this application to sequentially perform key negotiation, two-way authentication, and data transmission encryption operations.
[0083] Specifically, the key negotiation mainly uses the SM2 elliptic curve cryptography algorithm to negotiate the key, and the session key is generated based on the negotiated shared key. It can be represented as:
[0084]
[0085] where the terms denote, in order: the source public key; the target host's private key; and the SM2 key derivation function.
[0086] Two-way authentication includes public key certificates, random numbers, and signature verification to complete the identity and legitimacy verification with the target host. The process is as follows:
[0087] (1) The source sends an authentication request to the target host, carrying the source's public key certificate and a random number;
[0088] (2) The target host verifies the certificate; after validation, it returns a random number and a signature computed with SM3 over a concatenation of values (∥ represents string concatenation; one of the concatenated values is a unique identifier for the target host);
[0089] (3) After the source verifies the signature, it sends its own signature, computed with SM3 over a concatenation of values. If the target host verifies this signature, the connection is successfully established.
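The message flow of steps (1) to (3) can be sketched with the Python standard library. This is not an SM2/SM3 implementation: HMAC-SHA256 over pre-shared keys is swapped in for certificate-based signatures, purely to show the nonce exchange and mutual verification. The exact fields each party signs are not recoverable from this text, so the choice of both nonces plus the signer's identifier is an assumption, and all names are hypothetical.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, *parts: bytes) -> bytes:
    # Stand-in for a signature over the concatenated parts.
    return hmac.new(key, b"|".join(parts), hashlib.sha256).digest()


def handshake(source_key: bytes, target_key: bytes,
              source_copy_of_target_key: bytes,
              target_copy_of_source_key: bytes,
              source_id: bytes = b"source",
              target_id: bytes = b"target-01") -> bool:
    """Return True only if each side proves its identity to the other."""
    # (1) The source opens with a fresh random number.
    nonce_source = secrets.token_hex(16).encode()
    # (2) The target answers with its own random number and a signature
    #     binding both random numbers to its unique identifier.
    nonce_target = secrets.token_hex(16).encode()
    target_sig = sign(target_key, nonce_source, nonce_target, target_id)
    expected = sign(source_copy_of_target_key, nonce_source, nonce_target, target_id)
    if not hmac.compare_digest(target_sig, expected):
        return False  # the source rejects the target
    # (3) The source proves itself over the same random numbers.
    source_sig = sign(source_key, nonce_target, nonce_source, source_id)
    expected = sign(target_copy_of_source_key, nonce_target, nonce_source, source_id)
    return hmac.compare_digest(source_sig, expected)


ok = handshake(b"src-key", b"dst-key", b"dst-key", b"src-key")
rejected = handshake(b"src-key", b"dst-key", b"forged", b"src-key")
```

Fresh random numbers on both sides mean a recorded signature cannot be replayed in a later session.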
[0090] As a preferred implementation, in scenarios involving the transmission of sensitive and important business data, the SM4 block cipher algorithm can be used to encrypt the migrated data, with the encryption mode being CBC. The definition of sensitive data is based on relevant laws, regulations, and system business specifications; important data is determined according to a preset original data importance score model. This is achieved by ranking the original data importance scores and setting dynamic encryption thresholds. For example, the top 10% of the most important data can be designated for encrypted transmission. This optimizes overall processing performance while ensuring the security of core data. Since encryption consumes computing power and time, and transmits additional redundant information, most data does not require encryption; only sensitive fields need to be encrypted to prevent data leakage during migration. The encryption algorithm is as follows:
[0091]
[0092] where the terms denote, in order: the original data block; the encrypted data block; and the initial vector, which is generated randomly.
[0093] It should be noted that a data block checksum is appended during transmission to ensure correct decryption. Two-way authentication encryption is a security mechanism in which the two communicating parties verify each other's identity and establish an encrypted channel when they set up a connection. This significantly improves the security and reliability of the entire distributed data migration process, ensuring the confidentiality and integrity of data during transmission and effectively preventing data from being eavesdropped on or tampered with.
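The dynamic encryption threshold described above (for example, the top 10% of data by importance score) can be sketched as a selection step that runs before transmission. The function name, the block identifiers, and the rule that sensitive blocks are always included are assumptions for illustration; the encryption call itself is outside this sketch.

```python
import math
from typing import Dict, Set


def select_for_encryption(scores: Dict[str, float], sensitive: Set[str],
                          top_fraction: float = 0.10) -> Set[str]:
    """Choose which data blocks to encrypt before transmission.

    scores:       importance score per block identifier
    sensitive:    blocks classified as sensitive by regulation or policy
    top_fraction: share of highest-scoring blocks to encrypt
    """
    if not 0.0 <= top_fraction <= 1.0:
        raise ValueError("top_fraction must be within [0, 1]")
    ranked = sorted(scores, key=lambda block: scores[block], reverse=True)
    cutoff = math.ceil(len(ranked) * top_fraction)
    important = set(ranked[:cutoff])
    # Sensitive blocks are encrypted regardless of their rank.
    return important | {block for block in sensitive if block in scores}


block_scores = {f"b{i}": float(i) for i in range(10)}
to_encrypt = select_for_encryption(block_scores, sensitive={"b0"})
```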
[0094] Based on the above embodiment 1, in order to accurately detect the maximum carrying capacity during data transmission, to evaluate the transmission capacity of the target host under dynamic environmental changes, and to improve the rationality and efficiency of task allocation, some of the above solutions in this application propose that the maximum carrying capacity of each target host in step 105 be tested and calculated by the sliding window detection method. The sliding window detection method adopts dynamic adaptive adjustment, and the window duration of the sliding window is dynamically adjusted based on the traffic fluctuation coefficient of the target host to calculate the maximum carrying capacity of each target host.
[0095] In this embodiment, the sliding window probing method employs dynamic adaptive adjustment: it automatically adjusts its internal parameters according to changes in system operating status or environment to optimize the probing effect, making the probing mechanism more flexible and robust. The maximum load capacity of each target host is determined through the sliding window probing method and the dynamic adaptive adjustment mechanism, ultimately yielding the target host's maximum data processing capacity under current or predicted conditions. Specifically, the adaptive sliding window method is used to test the maximum load capacity of each target host, identified by its target host number. The expression is:
[0096]
[0097]
[0098]
[0099]
[0100] where the terms denote, in order: the traffic fluctuation coefficient; the average traffic within the window; the detection duration; the real-time data transmission volume in a given second; the stability weight; and the traffic reservation coefficient.
[0101] It should be noted that the traffic fluctuation coefficient quantifies the stability of traffic within a sliding window. The stability weight ω corrects for the impact of traffic fluctuations on the host's carrying capacity. The smaller the traffic fluctuation, the closer ω is to 1, and the closer the detection result is to the host's carrying capacity. Conversely, the larger the traffic fluctuation, the smaller ω is, proactively lowering the predicted traffic limit. The traffic reservation coefficient likewise holds back part of the measured capacity to avoid sudden traffic overload.
[0102] Preferably, the reservation coefficient is set to 0.9 as a preliminary estimate of the maximum carrying capacity. When the traffic fluctuation coefficient exceeds its threshold (traffic fluctuates drastically), the window duration is automatically reduced to 30 seconds to improve detection response speed and avoid misjudging the traffic limit because of an excessively long detection cycle; when the coefficient is below the threshold (traffic is stable), the window duration T remains at 60 or 90 seconds.
[0103] Furthermore, by presetting a series of thresholds, the sliding window parameters can be adjusted automatically when certain monitored indicators (such as traffic fluctuations, CPU utilization, and memory utilization) reach the set thresholds. The traffic fluctuation coefficient measures how drastically the traffic of a target host changes over a period of time. Specifically, a traffic fluctuation coefficient threshold can be set. When the traffic fluctuation coefficient is higher than a certain high threshold, it indicates drastic traffic changes. In this case, the window duration should be shortened to improve the real-time performance and response speed of the detection, avoiding data distortion caused by an excessively long window during periods of drastic fluctuation. When the traffic fluctuation coefficient is lower than a certain low threshold, it indicates relatively stable traffic. In this case, the window duration can be appropriately extended to increase the amount of sampled data, thereby improving the accuracy of the detection results.
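The adaptive probing loop can be sketched as follows. The original expressions are not reproduced in this text, so several choices are assumptions: the fluctuation coefficient is taken as the coefficient of variation of per-second throughput, the stability weight as 1 / (1 + coefficient), and the capacity estimate as weight times reservation coefficient times window average. The thresholds `high` and `low` are hypothetical; only the 0.9 reservation coefficient and the 30, 60, and 90 second window lengths come from the text.

```python
import statistics
from typing import Sequence, Tuple


def probe_capacity(samples: Sequence[float], reserve: float = 0.9) -> Tuple[float, float]:
    """Estimate capacity from per-second throughput samples in one window.

    Returns (fluctuation coefficient, capacity estimate).
    """
    mean = statistics.fmean(samples)
    if mean <= 0:
        return 0.0, 0.0
    # Assumed fluctuation measure: population std deviation over the mean.
    fluctuation = statistics.pstdev(samples) / mean
    # Assumed stability weight: equals 1 for flat traffic, shrinks as it varies.
    weight = 1.0 / (1.0 + fluctuation)
    return fluctuation, weight * reserve * mean


def next_window_seconds(fluctuation: float, high: float = 0.3, low: float = 0.1) -> int:
    """Pick the next window duration from the fluctuation coefficient."""
    if fluctuation > high:
        return 30  # drastic fluctuation: shorten the window to react faster
    if fluctuation < low:
        return 90  # stable traffic: lengthen the window to sample more data
    return 60


steady_fluct, steady_cap = probe_capacity([100.0] * 10)
bursty_fluct, bursty_cap = probe_capacity([20.0, 180.0] * 5)
```

In this sketch the two sample sets have the same average throughput, but the bursty one receives a lower capacity estimate and a shorter next window.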
[0104] Based on Embodiment 1 above, the collected multi-dimensional load status parameters are integrated into a unified, quantifiable value for subsequent comparison and decision-making in dynamic load balancing algorithms. In some of the solutions described above, the real-time load status of each target host is obtained by collecting real-time load status parameters of each target host and constructing a load status quantification model. The load status parameters include the target host's CPU utilization, memory usage, and network bandwidth usage. Specifically, the expression for the load status quantification model is:
[0105]
[0106] where the load status variable is the maximum load state variable of the target host, and its inputs are the CPU utilization, the memory utilization, and the network bandwidth utilization.
[0107] It's important to note that CPU utilization refers to the percentage of the target host's central processing unit (CPU) occupied by programs within a specific time period. It's a key indicator of the host's computing power load and is typically obtained by reading performance counters or system call interfaces provided by the operating system. Memory utilization refers to the percentage of the target host's physical memory used by programs and the operating system. It's a key indicator of the host's memory resource load and is typically obtained by reading memory information or performance counters provided by the operating system. Network bandwidth utilization refers to the ratio of the actual amount of data transmitted through the target host's network interface to its maximum available bandwidth within a specific time period. It's a key indicator of the host's network I / O load and is typically obtained by reading network interface statistics.
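As an illustration of folding the three load parameters into one value, the sketch below takes the largest of the three utilizations. Treating the "maximum load state variable" as a plain maximum is an assumption based on its name; the function name `load_state` and the convention that each input is a fraction between 0 and 1 are also hypothetical.

```python
def load_state(cpu, mem, net):
    """Return one load value for a host from three utilization fractions.

    Assumption: the most saturated resource determines the host's load.
    """
    for name, value in (("cpu", cpu), ("mem", mem), ("net", net)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} utilization must be within [0, 1]")
    return max(cpu, mem, net)
```

Taking the maximum means a host whose network link is saturated is treated as busy even when its CPU is idle, which is the conservative choice for a transfer-heavy workload.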
[0108] Based on the above embodiment 1, in order to dynamically adapt to the differences in carrying capacity of different hosts and the load changes during the migration process in data transmission, some of the solutions in this application propose to allocate the split subtasks based on the maximum carrying capacity and real-time load status of each target host through a dynamic load balancing algorithm, including the following steps:
[0109] Based on the maximum traffic capacity and real-time load status of each target host, the initial task allocation coefficient is calculated. The expression for the initial task allocation coefficient is as follows:
[0110]
[0111] where the terms are, in order: the preset baseline load threshold; the load status variable; the time decay factor; t, the migration duration; e, the natural constant; k, the preset decay constant; the maximum traffic capacity of target host m; and the average maximum traffic capacity of all target hosts.
[0112] It should be noted that the preset baseline load threshold is a constant that serves as a benchmark for calculating task allocation coefficients and can be set based on historical operational data. The time decay factor is an adjustment coefficient that changes dynamically with the migration duration t; it adapts task allocation over time and reflects the weight the system gives to load conditions during migration. The time decay factor can be implemented as an exponential decay function, where e is the natural constant, k is the preset decay constant, and t is the migration duration. The maximum traffic capacity of target host m is the maximum data transfer rate the host sustains without overload; it is obtained through the sliding window probing method, which dynamically adjusts the window duration based on the traffic fluctuation coefficient to achieve adaptive probing. The average maximum traffic capacity is the mean of the maximum traffic capacity of all target hosts; it serves as a reference for the overall carrying capacity and is used in calculating the initial task allocation coefficient.
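The quantities defined above can be combined in many ways, and the exact expression is not recoverable from this text. The sketch below shows one plausible form only: the time decay factor is taken as e^(-kt), and the initial coefficient scales the host's capacity ratio by its headroom relative to the baseline threshold. The function names, the default baseline 0.7, the default k of 0.01, and the way the terms are combined are all assumptions.

```python
import math


def time_decay(t, k):
    """Exponential decay factor e^(-k*t); t is the migration duration."""
    return math.exp(-k * t)


def initial_coefficient(load, capacity, avg_capacity, t, baseline=0.7, k=0.01):
    """Hypothetical initial task allocation coefficient for one host.

    capacity / avg_capacity rewards hosts with above-average capacity;
    (baseline - load) rewards hosts loaded below the baseline threshold;
    the decay factor reduces the weight of that load term as migration ages.
    """
    headroom = baseline - load
    return (capacity / avg_capacity) * (1.0 + time_decay(t, k) * headroom)
```

With these defaults and a load between 0 and 1, the coefficient stays positive, so every host remains eligible for at least some work.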
[0113] To further optimize the accuracy and dynamism of task allocation, the initial task allocation coefficients are revised, and the final task allocation coefficients are calculated. This is done by comparing the deviation between the current load and the average load, adjusting the initial allocation coefficients to more finely adapt to real-time load fluctuations. The expression for the final task allocation coefficients is:
[0114]
[0115]
[0116] where the terms are, in order: the load deviation feedback adjustment factor; the load status variable; the average load, calculated in real time for load feedback adjustment; and the initial task allocation coefficient.
[0117] It should be noted that the load deviation feedback adjustment factor quantifies the deviation between the current load of target host m and the average load of all target hosts, and adjusts the initial task allocation coefficient accordingly to correct any deviation in the initial allocation. When the current load of the host is higher than the average load, the adjustment factor is negative and the final task allocation coefficient decreases. When the current load of the host is lower than the average load, the adjustment factor is positive and the final task allocation coefficient increases. The average load used for feedback adjustment is the mean of the quantified real-time load status of all target hosts at the current moment. It serves as a benchmark for the overall system load level and is used to assess whether the load of an individual host deviates from the average.
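The sign behaviour of the feedback adjustment can be illustrated as follows. The sketch assumes the adjustment factor is simply the average load minus the host's load and that it scales the initial coefficient multiplicatively; both choices, and the function name `final_coefficients`, are hypothetical stand-ins for the expression that is not reproduced here.

```python
from statistics import mean


def final_coefficients(initial, loads):
    """Correct each host's initial coefficient by its load deviation.

    initial: host -> initial task allocation coefficient
    loads:   host -> load status value in [0, 1]
    """
    avg_load = mean(list(loads.values()))
    result = {}
    for host, alpha0 in initial.items():
        delta = avg_load - loads[host]  # negative when the host is above average
        result[host] = max(0.0, alpha0 * (1.0 + delta))
    return result
```

A host that sits exactly at the average load keeps its initial coefficient, while hosts above or below the average are nudged down or up in proportion to their deviation.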
[0118] In this embodiment, migration subtasks are allocated according to the final task allocation coefficient. The calculated final task allocation coefficient determines how the split subtasks are assigned to each target host. The aim is to prioritize tasks for hosts with high capacity and low current load, thereby optimizing resource utilization and improving migration efficiency. Scheduling strategies such as round-robin, weighted round-robin, or minimum connection count can be used in combination for task allocation. For example, this can be implemented using a task queue and a scheduler. The scheduler periodically evaluates the final task allocation coefficient of each host and retrieves subtasks to be allocated from the task queue, prioritizing their allocation to the host with the highest current final task allocation coefficient until that host's load reaches a preset limit or all subtasks are allocated. This improves overall migration efficiency, avoids overloading a single node, and fully utilizes cluster resources.
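The task-queue-and-scheduler example above can be sketched as a greedy loop. This is an illustrative simplification: loads are whole percentages, each subtask is assumed to add a fixed amount of load, and coefficients are held constant during one scheduling pass, whereas the text has the scheduler re-evaluate them periodically. The names `allocate`, `load_limit`, and `load_per_task` are hypothetical.

```python
from collections import deque


def allocate(subtasks, hosts, load_limit=80, load_per_task=10):
    """Assign queued subtasks to the eligible host with the highest coefficient.

    hosts: name -> {"coef": final coefficient, "load": current load in percent}
    Returns (plan, leftover): plan maps host name to its subtasks; leftover
    holds subtasks that could not be placed without exceeding load_limit.
    """
    queue = deque(subtasks)
    plan = {name: [] for name in hosts}
    load = {name: info["load"] for name, info in hosts.items()}
    while queue:
        eligible = [n for n in hosts if load[n] + load_per_task <= load_limit]
        if not eligible:
            break  # every host is at its limit; keep the rest queued
        best = max(eligible, key=lambda n: hosts[n]["coef"])
        plan[best].append(queue.popleft())
        load[best] += load_per_task
    return plan, list(queue)
```

For example, with host A at coefficient 1.4 and 60% load and host B at coefficient 0.8 and 30% load, the first two subtasks go to A until it reaches the 80% limit, and the remainder fall through to B.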
[0119] Based on the above embodiment 1, in order to ensure the accuracy of the data transmission result after the data transmission is completed, the above-mentioned solutions in this application propose an anomaly backtracking and repair mechanism including at least one of the following: re-migrating the corresponding abnormal original data block, locating and repairing the specific abnormal field, and tracing the abnormal original data block and retransmitting for verification until all verification steps pass.
[0120] In this embodiment, the block-level verification in step 106 refers to a method for performing integrity checks on the migrated data on a block-by-block basis. For example, methods such as hash verification or garbled character detection can be used to verify whether the data blocks have been damaged or tampered with during transmission.
[0121] The hash check divides the migrated data blocks into fixed-size blocks and calculates the check value using the Mersenne prime hash function as follows:
[0122]
[0123] where the terms are, in order: the data block number; the original data block; and the Mersenne-prime-based hash function. After migration, the same check value is calculated for the target data block. If the two check values are consistent, the block-level verification passes; if they are inconsistent, the corresponding abnormal original data block is migrated again.
[0124] In addition, the garbled character detection uses Unicode encoding validity verification and regular expression matching of key business fields. If the encoding is invalid or the format of a key field is abnormal, the data is judged to be garbled, and the corresponding abnormal original data block is re-migrated.
[0125] Re-migrating the corresponding abnormal original data block applies when a data block is found to lack integrity after transmission (for example, when block-level verification fails). The system identifies the unique identifier of the abnormal data block and instructs the source database or source cache to reread and retransmit that data block.
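A block-level check of this kind might look like the sketch below. The concrete hash is an assumption: a polynomial rolling hash reduced modulo the Mersenne prime 2^61 - 1, with a hypothetical base of 257 and a hypothetical block size of 4096 bytes. The original expression is not reproduced in this text, so this is only one way to build a Mersenne-prime-based check value.

```python
MERSENNE_P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1
BASE = 257                  # hypothetical polynomial base


def split_blocks(data, size=4096):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def mersenne_hash(block):
    """Polynomial hash of a block, reduced modulo a Mersenne prime."""
    h = 0
    for byte in block:
        h = (h * BASE + byte + 1) % MERSENNE_P
    return h


def find_bad_blocks(source_blocks, target_blocks):
    """Return the numbers of the blocks whose check values differ."""
    if len(source_blocks) != len(target_blocks):
        raise ValueError("block counts differ; compare after re-splitting")
    return [
        i for i, (src, dst) in enumerate(zip(source_blocks, target_blocks))
        if mersenne_hash(src) != mersenne_hash(dst)
    ]
```

The returned block numbers are what a re-migration step would use to request only the damaged blocks from the source.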
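The two garbled-character tests can be sketched as below, assuming the migrated fields arrive as UTF-8 bytes. The field names `account_no` and `open_date` and their patterns are hypothetical examples of key business fields; a real system would load its own rules.

```python
import re

# Hypothetical key-field formats, for illustration only.
KEY_FIELD_PATTERNS = {
    "account_no": re.compile(r"\d{16}"),
    "open_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def is_garbled(record):
    """Return True if any field is invalid UTF-8 or breaks its key-field format.

    record: field name -> raw bytes as received by the target host
    """
    for name, raw in record.items():
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return True  # encoding validity check failed
        if "\ufffd" in text:
            return True  # replacement character left by an earlier lossy decode
        pattern = KEY_FIELD_PATTERNS.get(name)
        if pattern is not None and not pattern.fullmatch(text):
            return True  # key business field has an abnormal format
    return False
```

Fields without a registered pattern are only checked for encoding validity, so free-text columns do not trigger false alarms.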
[0126] In this embodiment, the process of locating and repairing specific abnormal fields in step 106 involves field-level verification, which refers to a method for checking the integrity and accuracy of key fields in the migrated data. Specifically, CRC32 XOR verification is used as follows:
[0127]
[0128] where the field term is the original value of the i-th core field and ⨁ represents the XOR operation. After migration, the same check value is calculated over the corresponding target fields. If the two check values are equal, the field-level verification passes; if they differ, the specific abnormal field is located and repaired.
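A CRC32 XOR check can be sketched with the standard `zlib.crc32` function. The assumption here is that the record-level value is the XOR of the CRC32 of each core field encoded as UTF-8; the function names and the per-field comparison used to locate the abnormal field are hypothetical.

```python
import zlib
from functools import reduce


def field_checksum(values):
    """XOR together the CRC32 of every core field value."""
    return reduce(lambda acc, v: acc ^ zlib.crc32(v.encode("utf-8")), values, 0)


def locate_abnormal_fields(source, target):
    """Return the names of core fields whose CRC32 differs after migration.

    source, target: field name -> field value (str) for the same record
    """
    if field_checksum(source.values()) == field_checksum(target.values()):
        return []
    return [
        name for name, value in source.items()
        if zlib.crc32(value.encode("utf-8"))
        != zlib.crc32(target.get(name, "").encode("utf-8"))
    ]
```

One property worth keeping in mind: XOR is order-insensitive and two identical field values cancel each other, so the combined value is a quick screening step and the per-field comparison is what actually pinpoints the field to repair.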
[0129] For finer-grained validation anomalies, such as field-level validation finding that a specific field's data does not conform to specifications, the system can pinpoint the record and field name containing the anomalous field. The repair mechanism can then update the corresponding anomalous field in the target database directly based on preset repair rules. This could involve retrieving the correct value of the field from the source database, or applying default values or performing format conversions according to business rules. For instance, when field-level validation finds a format error in a critical field, the system can automatically retrieve the correct value from the source database based on the field's business rules, correcting only the anomalous field in the target database instead of retransmitting the entire data block.
[0130] In this embodiment, the global consistency check in step 106 refers to performing a full consistency check on the migrated data, and the global hash value of the data is calculated as follows:
[0131]
[0132] where the count term is the total number of data blocks. After migration, the global hash value of the target data is calculated in the same way. If the two global hash values are equal, the global consistency check passes; if they differ, the abnormal original data block is traced back by combining the transaction log and the task ID, and is retransmitted for verification.
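One way to fold per-block check values into a global value and then trace a mismatch back to a task is sketched below. The folding rule, the multiplier 1,000,003, and the shape of `task_log` (block number mapped to task ID) are assumptions; the original global hash expression is not reproduced in this text.

```python
MERSENNE_P = (1 << 61) - 1
MIX = 1_000_003  # hypothetical multiplier


def global_hash(block_hashes):
    """Fold N per-block check values into one order-sensitive global value."""
    g = len(block_hashes)  # seed with the total number of data blocks
    for h in block_hashes:
        g = (g * MIX + h) % MERSENNE_P
    return g


def trace_abnormal_blocks(src_hashes, tgt_hashes, task_log):
    """If the global values differ, list (block number, task ID) to retransmit."""
    if global_hash(src_hashes) == global_hash(tgt_hashes):
        return []
    count = max(len(src_hashes), len(tgt_hashes))
    bad = []
    for i in range(count):
        src = src_hashes[i] if i < len(src_hashes) else None
        tgt = tgt_hashes[i] if i < len(tgt_hashes) else None
        if src != tgt:
            bad.append((i, task_log.get(i)))
    return bad
```

Seeding with the block count and multiplying at each step makes the global value sensitive to missing, extra, and reordered blocks, which a plain sum or XOR of block values would not catch.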
[0133] As a preferred implementation, the business logic verification in step 106 refers to a method for performing consistency checks on the migrated data at the business level. This verification aims to confirm whether the migrated data conforms to the expected business rules and logic. The business logic verification rules can be set according to the different business systems, such as: whether the values of specific fields are within the valid range required by the business; whether the relationships between different data entities are still valid and consistent; and whether the key business values calculated from multiple fields are accurate.
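The three kinds of business rule listed above (value range, relationship between entities, and a value derived from several fields) can be illustrated with a small checker. The record layout, the field names, and the range limit are hypothetical; amounts are integers in the smallest currency unit to avoid floating-point comparison issues.

```python
MAX_BALANCE = 10**12  # hypothetical upper bound, in the smallest currency unit


def check_business_rules(account, known_customer_ids):
    """Return a list of rule violations for one migrated account record."""
    errors = []
    # Rule 1: field value within the valid business range.
    if not 0 <= account["balance"] <= MAX_BALANCE:
        errors.append("balance out of range")
    # Rule 2: relationship between entities is still valid.
    if account["customer_id"] not in known_customer_ids:
        errors.append("unknown customer")
    # Rule 3: value derived from multiple fields is still accurate.
    if account["principal"] + account["interest"] != account["total_due"]:
        errors.append("total_due mismatch")
    return errors
```

Returning a list rather than a single pass/fail flag lets the repair step distinguish a mapping problem (several records failing the same rule) from an isolated bad record.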
[0134] It should be noted that passing all verification steps ensures that the repair process ultimately meets data integrity standards, providing a multi-layered and flexible repair strategy. Based on the original data importance scores, data is categorized as important or unimportant, and the system implements a differentiated verification failure handling mechanism:
[0135] Important data: If the global consistency check fails during the verification process, it indicates that the data transmission process has integrity damage, triggering the real-time retransmission process to immediately retransmit and review the relevant data blocks to ensure the integrity and reliability of the core data; if the business logic check fails, it indicates that the data may have errors due to improper mapping or rule definition during the conversion process, so the data migration process is interrupted and the mapping rules are analyzed and modified. After correction, the conversion and transmission are re-executed.
[0136] Non-critical data: When validation fails, the overall process is not interrupted; the system marks the abnormal records and aggregates them into the pending queue. Data blocks that fail consistency validation are retransmitted in batches after transmission is complete. For data that fails business logic validation, the mapping rules are adjusted after the migration task is completed, and the migration task of the corresponding data table is then re-executed to overwrite the erroneous data.
[0137] For highly important data, anomalies in business logic validation can severely impact the overall data migration results and subsequent business operations, potentially triggering a chain reaction of data anomalies. Therefore, after correcting the business mapping rules, the migration and transmission of this type of important data must be re-executed. For less important data, anomalies only affect the validity of a localized portion and do not impact the overall migration process or core business operations. These anomalies can be repaired separately by retransmitting the anomaly portion after the overall migration is complete. This mechanism ensures both data migration efficiency and the accuracy and reliability of important data, achieving a reasonable balance between migration efficiency and data precision.
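The differentiated handling described in the preceding paragraphs amounts to a small decision table, sketched below. The enum names and the importance threshold of 0.7 are hypothetical; the four outcomes follow the text.

```python
from enum import Enum


class Check(Enum):
    GLOBAL_CONSISTENCY = "global_consistency"
    BUSINESS_LOGIC = "business_logic"


class Action(Enum):
    RETRANSMIT_NOW = "retransmit and re-verify immediately"
    INTERRUPT_AND_FIX_MAPPING = "interrupt migration, fix mapping, re-run"
    QUEUE_BATCH_RETRANSMIT = "queue for batch retransmission"
    QUEUE_REMAP_AFTER_MIGRATION = "fix mapping after migration, re-run table"


IMPORTANCE_THRESHOLD = 0.7  # hypothetical cut-off between important and non-critical


def on_verification_failure(importance_score, failed_check):
    """Choose the handling path for a failed check on one data block."""
    important = importance_score >= IMPORTANCE_THRESHOLD
    if failed_check is Check.GLOBAL_CONSISTENCY:
        return Action.RETRANSMIT_NOW if important else Action.QUEUE_BATCH_RETRANSMIT
    if failed_check is Check.BUSINESS_LOGIC:
        return (Action.INTERRUPT_AND_FIX_MAPPING if important
                else Action.QUEUE_REMAP_AFTER_MIGRATION)
    raise ValueError(f"unsupported check: {failed_check}")
```

Only the important path can stop the migration; everything else is deferred, which is how the mechanism keeps throughput high without relaxing guarantees for core data.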
[0138] This invention combines multi-dimensional data importance assessment with a dynamic granularity splitting algorithm to accurately identify core business data and optimize task allocation granularity. Simultaneously, it introduces a dynamic load balancing transmission mechanism based on maximum capacity and real-time load status, dynamically adjusting task allocation strategies to adapt to changes in host resources. Furthermore, through a two-way authentication encryption mechanism and a multi-level verification system, it comprehensively ensures the security and integrity of data transmission. Due to the synergistic effect of these steps, it achieves a significant improvement in migration efficiency, reliable assurance of business continuity, and comprehensive enhancement of data security.
[0139] This application also provides a distributed data migration device with dynamic load balancing. As shown in Figure 2, the device includes:
[0140] The data filtering module 201 is used for system functional module division and business requirement analysis to clarify the data objects to be migrated and their scope.
[0141] Data mapping module 202 is used to define the data range and establish a data structure mapping model from the source database to the target database;
[0142] Evaluation module 203 is used to collect multi-dimensional evaluation indicators from the source database, calculate the importance score of the original data based on the multi-dimensional evaluation indicators, divide the data priority according to the importance score of the original data, and divide the migration task into several independent sub-tasks based on all the importance scores of the original data and the size of the original data through a dynamic granularity splitting algorithm.
[0143] Authentication module 204 is used to establish a two-way authentication and encrypted connection mechanism between the source database and the target host;
[0144] The transmission module 205 is used to allocate the split subtasks according to the maximum carrying capacity and real-time load status of each target host through a dynamic load balancing algorithm, and to transmit the original data blocks corresponding to the allocated subtasks in order of data priority.
[0145] The verification module 206 is used to sequentially perform block-level verification, field-level verification, global consistency verification, and business logic verification of the target data. If an anomaly is detected in any verification step, an anomaly backtracking and repair mechanism is triggered. Figure 3 shows a structural schematic diagram of an electronic device 300 suitable for implementing some embodiments of the present invention.
[0146] As shown in Figure 3, the electronic device 300 may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the electronic device 300. The processing unit 301, ROM 302, and RAM 303 are interconnected via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
[0147] Through the above technical solution, the electronic device 300 can effectively address the problems in distributed database data migration in which load balancing across target hosts lacks dynamic adaptation capability and the data importance assessment system is one-sided, which lead to overload of some nodes, idle resources, low migration efficiency, insufficient priority for migrating core business data, and the resulting risk of business interruption. Having the processing unit 301 call the program avoids the delay and error of manual intervention, realizes intelligent execution of the method, and improves the balance of resource utilization and the comprehensiveness of the assessment.
[0148] The storage medium in this application embodiment stores program instructions capable of implementing all the above methods. These program instructions can be stored in the storage medium in the form of a software product, including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
[0149] In the description of this specification, references to terms such as "an embodiment," "example," "specific example," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0150] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims
1. A distributed data migration method with dynamic load balancing, characterized in that: The method includes the following steps: Based on the system functional module division and business requirement analysis, the data objects to be migrated and their scope are identified. Based on a defined data range, establish a data structure mapping model from the source database to the target database; Multi-dimensional evaluation metrics are collected from the source database. Based on these metrics, the importance score of the original data is calculated. Data priorities are assigned according to the original data importance scores. Based on all the original data importance scores and the size of the original data, the migration task is divided into several independent subtasks using a dynamic granularity splitting algorithm. Establish a two-way authenticated encrypted connection mechanism with the target host; Based on the maximum capacity and real-time load status of each target host, the split subtasks are allocated using a dynamic load balancing algorithm. The original data blocks corresponding to the allocated subtasks are transmitted sequentially according to the priority order of the data blocks. The target data is validated sequentially at the block level, field level, global consistency, and business logic levels. If any validation step detects an anomaly, the anomaly backtracking and repair mechanism is triggered.
2. The distributed data migration method according to claim 1, characterized in that: The multi-dimensional evaluation indicators include access frequency, data scale, and relevance to core business; the original data importance score includes the importance of the data itself and the potential access frequency of the data, wherein the importance of the data itself includes business impact factors and system impact factors.
3. The distributed data migration method according to claim 2, characterized in that: The dynamic granularity splitting algorithm includes the following steps: Based on the importance scores and size of all raw data, the task granularity baseline value is dynamically calculated using an adaptive task granularity calculation formula. Based on the task granularity baseline value, the migration task is divided into several independent sub-tasks; wherein the task granularity baseline value corresponding to high-priority data is smaller than that corresponding to low-priority data, and the transmission and allocation of high-priority data blocks takes precedence over low-priority data blocks.
4. The distributed data migration method according to claim 1, characterized in that: The two-way authentication encryption connection mechanism sequentially performs key negotiation, two-way identity authentication, and data encryption transmission based on the importance score of the original data.
5. The distributed data migration method according to claim 1, characterized in that: The maximum carrying capacity of each target host is calculated by testing the maximum carrying capacity of each target host using the sliding window probing method. The sliding window probing method adopts dynamic adaptive adjustment, which dynamically adjusts the window duration of the sliding window based on the traffic fluctuation coefficient of the target host, and calculates the maximum carrying capacity of each target host.
6. The distributed data migration method according to claim 5, characterized in that: The real-time load status of each target host is obtained by collecting real-time load status parameters of each target host and constructing a load status quantification model. The load status parameters include the CPU utilization, memory utilization, and network bandwidth utilization of the target host. The expression for the load status quantification model is as follows: wherein the load status variable is the maximum load state variable, and its inputs are the CPU utilization, the memory utilization, and the network bandwidth utilization.
7. The distributed data migration method according to claim 6, characterized in that: Based on the maximum traffic capacity and real-time load status of each target host, the split subtasks are allocated using a dynamic load balancing algorithm, including the following steps: Based on the maximum traffic capacity and real-time load status of each target host, the initial task allocation coefficient is calculated. The expression for the initial task allocation coefficient is as follows: wherein the terms are, in order: the preset baseline load threshold used for calculating the migration task allocation coefficient; the load status variable; the time decay factor; the migration duration; the natural constant; the preset decay constant; the maximum traffic capacity of the target host; and the average maximum traffic capacity of all target hosts; After correcting the initial task allocation coefficient, the final task allocation coefficient is calculated, and the expression for the final task allocation coefficient is as follows: wherein the terms are, in order: the load deviation feedback adjustment factor; the load status variable; the average load, calculated in real time for load feedback adjustment; and the initial task allocation coefficient; Migration subtasks are assigned based on the final task allocation coefficient, with subtasks having higher final task allocation coefficients being assigned priority over subtasks with lower final task allocation coefficients.
8. The distributed data migration method according to any one of claims 1, characterized in that: The anomaly backtracking and repair mechanism includes at least one of the following: re-migrating the corresponding abnormal original data block, locating and repairing the specific abnormal field, and tracing the abnormal original data block and retransmitting it for verification until all verification steps pass.
9. A distributed data migration device with dynamic load balancing, characterized in that: The device includes: The data filtering module is used for system functional module division and business requirement analysis to clarify the data objects to be migrated and their scope. The data mapping module is used to define the data range and establish a data structure mapping model from the source database to the target database. The evaluation module is used to collect multi-dimensional evaluation indicators from the source database, calculate the importance score of the raw data based on the multi-dimensional evaluation indicators, and divide the data priority according to the importance score of the raw data. Based on all the importance scores of the raw data and the size of the raw data, the migration task is divided into several independent sub-tasks through a dynamic granularity splitting algorithm. The authentication module is used to establish a two-way authenticated encrypted connection mechanism with the target host; The transmission module is used to allocate the split subtasks according to the maximum carrying capacity and real-time load status of each target host through a dynamic load balancing algorithm, and to transmit the original data blocks corresponding to the allocated subtasks in order of data priority. The verification module is used to sequentially perform block-level verification, field-level verification, global consistency verification, and business logic verification of the target data. If an anomaly is detected in any verification step, the anomaly backtracking and repair mechanism is triggered.
10. A computing device, characterized in that, The computing device includes: At least one processor, memory, and input / output module; The memory is used to store computer programs, and the processor is used to call the computer programs stored in the memory to execute the method as described in any one of claims 1 to 6.