Parallel archiving method for mass data
By constructing a dynamic performance index set and a resource scoring matrix, an ordered list of preferred paths is generated and a capacity-constrained load allocation is performed, which solves the problem of uneven resource utilization in parallel archiving and achieves efficient and stable parallel archiving of massive amounts of data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NEIJIANG NORMAL UNIV
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-19
AI Technical Summary
Existing parallel archiving methods suffer from reduced system throughput and long-tail effects when dealing with extremely large and unevenly distributed source data scenarios. This is due to competition for underlying shared I/O and network resources between parallel tasks. Consequently, they cannot achieve efficient and stable parallel acceleration.
By acquiring performance metrics of each source storage node, archive storage node, and network link, a dynamic performance metric set is constructed, a resource scoring matrix is calculated, an ordered preferred path list is generated, and an archive scheduling instruction is generated through a capacity-constrained global load allocation algorithm to achieve parallel data archiving.
It significantly improves the overall execution efficiency and success rate of archiving tasks, reduces task completion time, and enhances the overall throughput and operational stability of the storage system.
Figure CN121785997B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data archiving technology, and in particular to a parallel archiving method for massive amounts of data.
Background Technology
[0002] With the rapid development of cloud computing, the Internet of Things, and various sensing technologies, the amount of data accumulated by enterprises and institutions is growing exponentially, ushering in the era of massive data. Of this data, only a small portion requires frequent access and online processing, while the majority decreases in access frequency over time, becoming "cold data" or "warm data." To effectively reduce the load and total cost of ownership on high-speed, high-cost primary storage systems (such as all-flash arrays) while ensuring data traceability and compliance, automatically and efficiently migrating this low-access-frequency data to lower-cost dedicated archival storage systems has become a key common requirement in the field of data management.
[0003] Against this backdrop, leveraging parallel processing techniques to accelerate the archiving process of massive amounts of data is an intuitive and widely researched approach. Existing parallel archiving methods typically divide the dataset to be archived into multiple independent or semi-independent data task units, and then distribute them to multiple computing nodes or processes for concurrent execution, aiming to shorten the overall archiving time.
[0004] However, existing parallel archiving methods suffer from a significant problem when dealing with extremely large datasets and potentially highly unevenly distributed source data: the disordered competition among parallel tasks for shared I / O (input / output) and network resources leads to a sharp drop in overall system throughput. This results in a pronounced "long-tail effect" in the archiving job, failing to achieve the expected parallel speedup and potentially even performing worse than optimized serial archiving. Specifically, when multiple parallel tasks simultaneously read data from the source storage system or write data to the archive storage system, a large number of random I / O requests and network connection conflicts are generated. This causes frequent disk head seeks or network bandwidth to be consumed by numerous small concurrent flows, severely reducing the stable, high-throughput serialized read / write capabilities that storage devices and network links should provide. This not only wastes hardware resources but also limits the completion time of the entire archiving job to the slowest tasks, creating a "long-tail task" that severely restricts further improvements in archiving efficiency.
[0005] Therefore, how to design a method that effectively coordinates the access of parallel archiving tasks to underlying shared I / O and network resources, maximizes serialized data streams, and thereby achieves efficient and stable parallel acceleration of massive data archiving has become an urgent problem in this field.
Summary of the Invention
[0006] This invention provides a parallel archiving method for massive amounts of data, including:
[0007] Step S1: Obtain the performance metrics of each source storage node, archive storage node, and network link, and construct a dynamic performance metric set;
[0008] Step S2: Based on the dynamic performance index set, calculate the comprehensive resource score between each source storage node and the archive storage node to form a resource score matrix;
[0009] Step S3: Based on the resource scoring matrix, calculate and sort the path priority values of each archiving storage node of the source data to be archived, and generate an ordered preferred path list;
[0010] Step S4: Based on the ordered preferred path list, the global optimal archiving path is solved for the source data to be archived using a capacity-constrained global load allocation algorithm, and an archiving scheduling instruction is generated.
[0011] Step S5: Based on the archiving scheduling instructions, archive the source data to be archived in parallel.
[0012] The parallel archiving method for massive data described above includes the following sub-steps for obtaining performance metrics of each source storage node, archive storage node, and network link:
[0013] Step S11: Send probe commands to the source storage node and the archive storage node to obtain the read I / O throughput of the source storage node, the write I / O throughput of the archive storage node, and their available remaining storage capacity.
[0014] Step S12: Measure the current available network bandwidth between each pair of source storage nodes and archive storage nodes, and integrate it with each I / O throughput and the proportion of available remaining storage capacity to construct a dynamic performance index set.
[0015] In the parallel archiving method for massive data described above, calculating the comprehensive resource score between each source storage node and the archive storage node based on the dynamic performance index set to form a resource score matrix specifically includes the following sub-steps:
[0016] Step S21: Obtain historical archiving execution data when historical archiving tasks are executed, and calculate the competition coefficient of each performance indicator through the historical competition situation learning algorithm;
[0017] Step S22: Based on the dynamic performance index set and the competition coefficient of each performance index, calculate the comprehensive resource score between each source storage node and the archive storage node to form a resource score matrix.
[0018] In the parallel archiving method for massive data described above, calculating and sorting the path priority values of each archiving storage node of the source data to be archived based on the resource scoring matrix, and generating an ordered preferred path list, specifically includes the following sub-steps:
[0019] Step S31: Based on the resource scoring matrix, calculate the path priority value of each archiving storage node of the source data to be archived using a path optimization algorithm;
[0020] Step S32: Sort the archive storage nodes in descending order based on path priority values to form an ordered preferred path list of the source data to be archived.
[0021] In the parallel archiving method for massive data described above, solving the globally optimal archiving path for the source data to be archived with a capacity-constrained global load allocation algorithm based on the ordered preferred path list, and generating archiving scheduling instructions, specifically includes the following sub-steps:
[0022] Step S41: Based on the ordered preferred path list, construct a global archive path planning model using a capacity-constrained global load allocation algorithm;
[0023] Step S42: Solve the global optimal archiving path for the source data to be archived using the global archiving path planning model, and generate archiving scheduling instructions.
[0024] The parallel archiving method for massive data described above includes the following sub-steps for archiving source data in parallel based on archiving scheduling instructions:
[0025] Step S51: Based on the archiving scheduling instructions, archive each data block of the source data to be archived in parallel;
[0026] Step S52: Monitor the data archiving process.
[0027] This invention also provides a parallel archiving system for massive amounts of data, comprising:
[0028] The performance metrics acquisition module acquires the performance metrics of each source storage node, archive storage node, and network link, and constructs a dynamic performance metric set.
[0029] The resource scoring matrix generation module calculates the comprehensive resource score between each source storage node and the archive storage node based on a dynamic performance index set, forming a resource scoring matrix.
[0030] The preferred path selection module calculates and sorts the path priority values of each archiving storage node of the source data to be archived based on the resource scoring matrix, and generates an ordered preferred path list.
[0031] The archiving scheduling instruction generation module, based on an ordered preferred path list, uses a capacity-constrained global load allocation algorithm to solve for the globally optimal archiving path for the source data to be archived, and generates archiving scheduling instructions.
[0032] The data archiving execution module archives the source data to be archived in parallel based on the archiving scheduling instructions.
[0033] As described above, in a parallel archiving system for massive amounts of data, the performance metric acquisition module specifically includes:
[0034] The storage node performance metric acquisition submodule sends probe commands to the source storage node and the archive storage node to obtain the read I / O throughput of the source storage node, the write I / O throughput of the archive storage node, and their available remaining storage capacity.
[0035] The dynamic performance metric set construction submodule measures the current available network bandwidth between each pair of source storage nodes and archive storage nodes, and integrates it with each I / O throughput and the proportion of available remaining storage capacity to construct a dynamic performance metric set.
[0036] As described above, in a parallel archiving system for massive amounts of data, the resource scoring matrix generation module specifically includes:
[0037] The competition coefficient generation submodule obtains historical archiving execution data when historical archiving tasks are executed, and calculates the competition coefficient of each performance indicator through a historical competition situation learning algorithm.
[0038] The comprehensive resource score calculation submodule calculates the comprehensive resource score between each source storage node and the archive storage node based on the dynamic performance index set and the competition coefficient of each performance index, forming a resource score matrix.
[0039] As described above, in a parallel archiving system for massive amounts of data, the preferred path selection module specifically includes:
[0040] The path priority value calculation submodule calculates the path priority value of each archiving storage node of the source data to be archived based on the resource scoring matrix and through the path optimization algorithm.
[0041] The ordered preferred path list generation submodule sorts the archive storage nodes in descending order based on path priority values to form an ordered preferred path list of the source data to be archived.
[0042] As described above, in a parallel archiving system for massive amounts of data, the archiving scheduling instruction generation module specifically includes:
[0043] The global archive path planning model construction submodule constructs a global archive path planning model based on an ordered preferred path list and a capacity-constrained global load allocation algorithm.
[0044] The global optimal archiving path solution submodule uses a global archiving path planning model to solve for the global optimal archiving path of the source data to be archived and generates archiving scheduling instructions.
[0045] As described above, in a parallel archiving system for massive amounts of data, the data archiving execution module specifically includes:
[0046] The parallel data archiving execution submodule performs parallel archiving of each data block of the source data to be archived based on the archiving scheduling instructions.
[0047] The execution process monitoring submodule monitors the data archiving execution process.
[0048] The beneficial effects achieved by this invention are as follows: This invention can systematically solve the problems of uneven resource utilization, frequent bottleneck nodes, and limited overall throughput that are common in the process of massive data archiving. Under the premise of ensuring the capacity security of each archiving storage node, it significantly improves the overall execution efficiency and success rate of archiving tasks, reduces task completion time, enables the system to adaptively cope with constantly changing workloads, and enhances the overall throughput and operational stability of the storage system.
Attached Figure Description
[0049] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention. For those skilled in the art, other drawings can be obtained based on these drawings.
[0050] Figure 1 is a flowchart of the parallel archiving method for massive data provided in Embodiment 1 of this application;
[0051] Figure 2 is a schematic diagram of the parallel archiving system for massive data provided in Embodiment 2 of this application.
Detailed Implementation
[0052] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0053] Embodiment 1
[0054] As shown in Figure 1, Embodiment 1 of this application provides a parallel archiving method for massive amounts of data, which includes the following steps:
[0055] Step S1: Obtain the performance metrics of each source storage node, archive storage node, and network link, and construct a dynamic performance metric set;
[0056] Furthermore, obtaining performance metrics for each source storage node, archive storage node, and network link includes the following sub-steps:
[0057] Step S11: Send probe commands to the source storage node and the archive storage node to obtain the read I / O throughput of the source storage node, the write I / O throughput of the archive storage node, and their available remaining storage capacity.
[0058] Specifically, read probe commands are sent in parallel to all source storage nodes to measure the throughput of a read operation on a preset-size test data block, thus obtaining each node's current read I / O throughput. Write probe commands are sent to all archive storage nodes to measure their throughput in receiving and writing a preset-size test data block, thus obtaining each node's current write I / O throughput. At the same time, a storage status probe command is sent to all archive storage nodes to obtain each node's current available remaining storage capacity and total storage capacity. Each source storage node is identified by an index whose range is bounded by M, the number of source storage nodes, and each archive storage node is identified by an index whose range is bounded by N, the number of archive storage nodes. The specific definition of each node type varies depending on the application environment. For example, in a cloud computing environment, source storage nodes are standard storage buckets in cloud object storage services, and archive storage nodes are instances of archive storage services; in a high-performance computing environment, source storage nodes are storage servers in parallel file systems, and archive storage nodes are tape libraries or optical disc libraries in hierarchical storage management systems; in a medical imaging system, source storage nodes are online storage servers, and archive storage nodes are long-term archival storage devices for medical images; in a media asset management system, source storage nodes are non-linear editing storage arrays, and archive storage nodes are media asset archival storage systems.
[0059] Step S12: Measure the current available network bandwidth between each pair of source storage nodes and archive storage nodes, and integrate it with each I / O throughput and the proportion of available remaining storage capacity to construct a dynamic performance index set;
[0060] Specifically, test packets are sent over the network link between each pair of source storage node and archive storage node to measure the link's current available network bandwidth. All obtained read I / O throughputs, write I / O throughputs, available remaining storage capacities, total storage capacities, and available network bandwidths are then integrated to construct the dynamic performance index set.
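The probing results of steps S11 and S12 can be gathered into a single container. The following is a minimal sketch, assuming the probe values have already been measured; the class name `PerformanceIndexSet`, the function `build_index_set`, and all field names are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceIndexSet:
    """Hypothetical container for the dynamic performance index set."""
    read_tp: list[float]          # read I/O throughput, one per source node (length M)
    write_tp: list[float]         # write I/O throughput, one per archive node (length N)
    free_cap: list[float]         # available remaining capacity per archive node
    total_cap: list[float]        # total capacity per archive node
    bandwidth: list[list[float]]  # available bandwidth per (source, archive) pair, M x N

    def free_ratio(self, j: int) -> float:
        """Proportion of available remaining capacity of archive node j."""
        return self.free_cap[j] / self.total_cap[j]


def build_index_set(read_tp, write_tp, free_cap, total_cap, bandwidth):
    """Validate the probe results and integrate them into one index set."""
    m, n = len(read_tp), len(write_tp)
    if len(free_cap) != n or len(total_cap) != n:
        raise ValueError("capacity lists need one entry per archive storage node")
    if len(bandwidth) != m or any(len(row) != n for row in bandwidth):
        raise ValueError("bandwidth must be an M x N table")
    return PerformanceIndexSet(
        list(read_tp), list(write_tp), list(free_cap), list(total_cap),
        [list(row) for row in bandwidth],
    )
```

Keeping the bandwidth as an M x N table mirrors the per-pair link measurement described in step S12 and lines up with the shape of the resource scoring matrix built later.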
[0061] Step S2: Based on the dynamic performance index set, calculate the comprehensive resource score between each source storage node and the archive storage node to form a resource score matrix;
[0062] Furthermore, based on the dynamic performance index set, the comprehensive resource score between each source storage node and the archive storage node is calculated to form a resource score matrix, including the following sub-steps:
[0063] Step S21: Obtain historical archiving execution data when historical archiving tasks are executed, and calculate the competition coefficient of each performance indicator through the historical competition situation learning algorithm;
[0064] Specifically, historical archiving task execution data is monitored to obtain a historical performance dataset within a historical time period T. The dataset includes, synchronously collected at each historical sampling point t, the actual read I / O throughput of each source storage node together with its current read I / O throughput, the actual write I / O throughput of each archive storage node together with its current write I / O throughput, and the actual available network bandwidth of each network link together with its current available network bandwidth.
The performance utilization formula is applied separately to each source storage node, archive storage node, and network link. At sampling point t, the read utilization of a source storage node is calculated from its actual read I / O throughput and its current read I / O throughput; the write utilization of an archive storage node is calculated from its actual write I / O throughput and its current write I / O throughput; and the bandwidth utilization of a network link is calculated from its actual available network bandwidth and its current available network bandwidth.
A read utilization threshold, a write utilization threshold, and a bandwidth utilization threshold are set based on the execution data of historical archiving tasks. When a utilization exceeds its threshold, the corresponding performance is considered to be in a bottleneck state at that sampling point.
The performance bottleneck probability of each performance parameter within the historical time period T is then calculated. The probability that read performance becomes the performance bottleneck is computed over the total number of sampling points in T and over the M source storage nodes, using an indicator function that takes 1 when the read utilization of a source storage node at a sampling point is greater than the read utilization threshold. The probability that write performance becomes the performance bottleneck is computed in the same way over the N archive storage nodes, using an indicator function that takes 1 when the write utilization is greater than the write utilization threshold. The probability that network performance becomes the performance bottleneck is computed over the network links, using an indicator function that takes 1 when the bandwidth utilization is greater than the bandwidth utilization threshold.
Next, the average degree of exceeding the threshold is calculated. The average degree to which read performance exceeds the threshold is computed over the source storage nodes and the sampling points that are in a bottleneck state, from the read utilization at those sampling points, the read utilization threshold, and the total number of sampling points at which read performance is the bottleneck. The average degree to which write performance exceeds the threshold is computed likewise from the write utilization, the write utilization threshold, and the total number of sampling points at which write performance is the bottleneck. The average degree to which network performance exceeds the threshold is computed from the bandwidth utilization, the bandwidth utilization threshold, and the total number of sampling points at which network performance is the bottleneck.
Finally, the competition coefficient formula calculates a comprehensive bottleneck metric for read, write, and network performance. Each comprehensive bottleneck metric is obtained from the corresponding bottleneck probability within the historical time period T, the corresponding average degree of exceeding the threshold, and a smoothing factor. The three comprehensive bottleneck metrics are then normalized to obtain the read competition coefficient, the write competition coefficient, and the network competition coefficient.
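The chain from utilization samples to normalized competition coefficients can be illustrated as follows. This is only a sketch under stated assumptions: the utilization samples are taken as already computed, the comprehensive bottleneck metric is assumed to be the bottleneck probability multiplied by the mean excess plus a smoothing term, and normalization is assumed to divide by the sum of the three metrics. The function name, the parameter names, and this particular way of combining the quantities are hypothetical.

```python
def competition_coefficients(utilization, thresholds, smoothing=1e-6):
    """Return normalized competition coefficients per performance type.

    utilization: {"read": [...], "write": [...], "network": [...]} with one
                 utilization sample per (node or link, sampling point).
    thresholds:  utilization threshold per performance type.
    """
    metrics = {}
    for name, samples in utilization.items():
        theta = thresholds[name]
        # Samples in a bottleneck state, expressed as excess over the threshold.
        excess = [u - theta for u in samples if u > theta]
        probability = len(excess) / len(samples)                  # indicator average
        mean_excess = sum(excess) / len(excess) if excess else 0.0
        # Assumed combination of probability, mean excess and smoothing factor.
        metrics[name] = probability * (mean_excess + smoothing)
    total = sum(metrics.values())
    if total == 0.0:
        # No bottleneck observed anywhere: fall back to equal weights.
        return {name: 1.0 / len(metrics) for name in metrics}
    return {name: value / total for name, value in metrics.items()}
```

With this form, a resource that was never over its threshold in the historical window receives a coefficient of zero, and the coefficients always sum to one.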
[0065] Step S22: Based on the dynamic performance index set and the competition coefficient of each performance index, calculate the comprehensive resource score between each source storage node and the archive storage node to form a resource score matrix;
[0066] Specifically, based on the dynamic performance index set, the read competition coefficient, the write competition coefficient, and the network competition coefficient, the resource scoring calculation formula is used to calculate the comprehensive resource score of each archive path from each source storage node to each archive storage node. The inputs for a given archive path are the read I / O throughput of its source storage node, the write I / O throughput of its archive storage node, the available network bandwidth of the network link between them, the three competition coefficients, and the available remaining storage capacity and total storage capacity of the archive storage node. The comprehensive resource scores of all archive paths are filled into an M×N matrix to form the resource scoring matrix.
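To make the shape of the resource scoring matrix concrete, the sketch below fills an M x N table from the listed inputs. The scoring expression itself is an assumption chosen for illustration: each throughput or bandwidth value is normalized by its maximum, weighted by its competition coefficient, and the weighted sum is scaled by the proportion of remaining capacity. The function name `score_matrix` and its parameters are hypothetical.

```python
def score_matrix(read_tp, write_tp, bandwidth, free_cap, total_cap,
                 alpha, beta, gamma):
    """Build an M x N table of comprehensive resource scores.

    alpha, beta, gamma: read, write and network competition coefficients.
    The weighted-sum form below is an illustrative assumption.
    """
    r_max = max(read_tp)
    w_max = max(write_tp)
    b_max = max(max(row) for row in bandwidth)
    matrix = []
    for i, r in enumerate(read_tp):            # one row per source storage node
        row = []
        for j, w in enumerate(write_tp):       # one column per archive storage node
            performance = (alpha * r / r_max
                           + beta * w / w_max
                           + gamma * bandwidth[i][j] / b_max)
            capacity_ratio = free_cap[j] / total_cap[j]
            row.append(performance * capacity_ratio)
        matrix.append(row)
    return matrix
```

Scaling by the capacity ratio means an archive storage node with no remaining capacity scores zero on every path, whatever its throughput.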
[0067] Step S3: Based on the resource scoring matrix, calculate and sort the path priority values of each archiving storage node of the source data to be archived, and generate an ordered preferred path list;
[0068] Furthermore, based on the resource scoring matrix, the path priority values of each archiving storage node for the source data to be archived are calculated and sorted to generate an ordered list of preferred paths, including the following sub-steps:
[0069] Step S31: Based on the resource scoring matrix, calculate the path priority value of each archiving storage node of the source data to be archived using a path optimization algorithm;
[0070] Specifically, for each data block in the source data to be archived, the source storage node on which the data block resides is determined, and the comprehensive resource scores of all archive paths originating from that source storage node are extracted from the resource scoring matrix. The path optimization algorithm then calculates the path priority value of each archive storage node. Its inputs are the comprehensive resource score of the archive path from the data block's source storage node to the archive storage node, the maximum comprehensive resource score among all archive paths originating from that source storage node, taken over the set of all such archive paths, the natural constant e, and the number of times the archive path from the source storage node to the archive storage node has historically been assigned an archiving task.
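One way to combine the listed inputs is to normalize each score by the row maximum and damp it exponentially with the historical assignment count, so that heavily used paths are gradually deprioritized. The sketch below assumes exactly that form; the `decay` parameter and the function name are hypothetical, and the expression is not claimed to be the path optimization algorithm of the patent.

```python
import math


def path_priorities(scores, history_counts, decay=0.1):
    """Priority of each archive storage node for one source storage node.

    scores:         comprehensive resource scores of the paths leaving the
                    source node (one row of the scoring matrix, assumed > 0
                    for at least one entry).
    history_counts: times each path was historically assigned an archiving task.
    decay:          hypothetical damping rate per historical assignment.
    """
    best = max(scores)
    return [
        (score / best) * math.exp(-decay * count)
        for score, count in zip(scores, history_counts)
    ]
```

For example, two paths with the same score but assignment counts of 0 and 5 receive priorities of 1.0 and e^(-0.5) under the default `decay`.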
[0071] Step S32: Sort the archive storage nodes in descending order based on path priority values to form an ordered list of preferred paths for the source data to be archived;
[0072] Specifically, based on the path priority value of each archive storage node, all archive storage nodes are sorted in descending order and the first Q archive storage nodes are selected, where the value of Q is determined by the scale of the archiving task. These nodes form the ordered preferred path list of the data block in the source data to be archived. The ordered preferred path lists of all data blocks are then output.
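The descending sort and top-Q selection of step S32 reduce to a few lines. In this sketch the mapping from archive storage node to priority value is a plain dictionary, and the function name is hypothetical.

```python
def preferred_path_list(priorities, q):
    """Return the Q archive storage nodes with the highest path priority.

    priorities: {archive_node_id: path priority value} for one data block.
    q:          list length, chosen according to the scale of the archiving task.
    """
    ranked = sorted(priorities, key=priorities.get, reverse=True)
    return ranked[:q]
```

Because Python's sort is stable, nodes with equal priority keep their original dictionary order, which makes the resulting list deterministic.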
[0073] Step S4: Based on the ordered preferred path list, the global optimal archiving path is solved for the source data to be archived using a capacity-constrained global load allocation algorithm, and an archiving scheduling instruction is generated.
[0074] Furthermore, based on the ordered preferred path list, a global load allocation algorithm with capacity constraints is used to solve for the globally optimal archiving path for the source data to be archived, generating archiving scheduling instructions, including the following sub-steps:
[0075] Step S41: Based on the ordered preferred path list, construct a global archive path planning model using a capacity-constrained global load allocation algorithm;
[0076] Specifically, the global archive path planning model is constructed with the capacity-constrained global load allocation algorithm as follows.
A binary decision variable is defined for each data block in the source data to be archived and each archive storage node in that data block's ordered preferred path list. The variable equals 1 when the data block is assigned to the archive storage node and 0 when it is not.
A uniqueness constraint is constructed for each data block over all archive storage nodes in its ordered preferred path list.
An absolute capacity constraint is constructed for each archive storage node based on its available remaining storage capacity. The constraint covers all data blocks whose ordered preferred path list contains the archive storage node, and involves the size of each such data block, its binary decision variable, a capacity safety factor, and the available remaining storage capacity of the archive storage node.
A performance guidance (bootstrapping) constraint is constructed for each archive storage node based on its available remaining storage capacity and path priority values. It covers the same data blocks and involves the size of each data block, its binary decision variable, a global adjustment coefficient, the average of all path priority values of the archive storage node, the available remaining storage capacity of the archive storage node, the number of archive storage nodes, and the number of data blocks in the source data to be archived.
Finally, the priority value maximization formula maximizes the sum of global path priority values. The sum is taken over all data blocks in the source data to be archived and, for each data block, over all archive storage nodes in its ordered preferred path list, with each path priority value multiplied by the corresponding binary decision variable.
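The core of the planning model can be written as a 0-1 integer program. The statement below is a sketch under assumed notation, none of which is taken from the patent: $K$ is the number of data blocks, $L_k$ the ordered preferred path list of block $k$, $x_{kj}$ the binary decision variable, $V_{kj}$ the path priority value, $d_k$ the block size, $\lambda$ the capacity safety factor, and $C_j$ the available remaining capacity of archive storage node $j$. The performance guidance constraint is omitted because its exact form depends on the global adjustment coefficient.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{k=1}^{K} \sum_{j \in L_k} V_{kj}\, x_{kj} \\
\text{s.t.}\quad & \sum_{j \in L_k} x_{kj} = 1, && k = 1, \dots, K \\
& \sum_{k \,:\, j \in L_k} d_k\, x_{kj} \le \lambda\, C_j, && j = 1, \dots, N \\
& x_{kj} \in \{0, 1\}
\end{aligned}
```

The first constraint assigns every data block to exactly one archive storage node from its own list, and the second keeps the total size routed to a node within a safety-scaled share of its remaining capacity.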
[0077] Step S42: Solve the global optimal archiving path for the source data to be archived using the global archiving path planning model, and generate archiving scheduling instructions;
[0078] Specifically, the global archive path planning model is solved to obtain the optimal value of each decision variable. Based on these optimal values, the archive storage node whose decision variable equals 1 is selected from each data block's ordered preferred path list as the block's final target archive storage node, and a data block-archive storage node allocation pair is constructed. For each allocation pair, an archiving scheduling instruction is generated that contains the data block identifier, the address of the data block's source storage node, and the address of the target archive storage node.
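For a small instance, solving the model and emitting scheduling instructions can be sketched with exhaustive search. This is an illustration only: it enforces the uniqueness and capacity constraints, leaves out the performance guidance constraint, and enumerates every combination, which is feasible only for a handful of data blocks. A production system would more plausibly hand the model to an integer programming solver. All function and parameter names are hypothetical.

```python
from itertools import product


def solve_assignment(block_size, preferred, priority, free_cap, safety=0.9):
    """Pick one archive node per block, maximizing total path priority.

    block_size: {block_id: size}
    preferred:  {block_id: [archive_node_id, ...]} ordered preferred path lists
    priority:   {(block_id, archive_node_id): path priority value}
    free_cap:   {archive_node_id: available remaining capacity}
    safety:     capacity safety factor (assumed multiplicative)
    """
    blocks = list(block_size)
    best, best_value = None, float("-inf")
    # Choosing exactly one node per block satisfies the uniqueness constraint.
    for choice in product(*(preferred[b] for b in blocks)):
        load = {}
        for b, node in zip(blocks, choice):
            load[node] = load.get(node, 0) + block_size[b]
        if any(load[n] > safety * free_cap[n] for n in load):
            continue  # violates the capacity constraint
        value = sum(priority[b, n] for b, n in zip(blocks, choice))
        if value > best_value:
            best, best_value = dict(zip(blocks, choice)), value
    return best  # None when no feasible assignment exists


def make_instructions(assignment, block_source, archive_addr):
    """Turn allocation pairs into archiving scheduling instructions."""
    return [
        {"block": b, "source": block_source[b], "target": archive_addr[node]}
        for b, node in assignment.items()
    ]
```

Returning `None` for an infeasible instance lets the caller decide whether to lengthen the preferred path lists or re-probe capacity.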
[0079] Step S5: Based on the archiving scheduling instruction, archive the source data to be archived in parallel;
[0080] Furthermore, based on the archiving schedule instructions, the parallel archiving of the source data to be archived includes the following sub-steps:
[0081] Step S51: Based on the archiving scheduling instructions, archive each data block of the source data to be archived in parallel;
[0082] Specifically, based on the parallel archiving degree P that the system can execute simultaneously, archiving scheduling instructions with the same combination of source storage nodes and target archiving storage nodes are merged and encapsulated into several independent archiving task units. The archiving task units are assigned to P parallel execution worker threads. Each worker thread reads the corresponding data block from the source storage node specified by the archiving scheduling instruction in sequence according to the archiving task unit it is assigned, and writes the data to the target archiving storage node specified by the archiving scheduling instruction through the network link.
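The merging and dispatch described in step S51 can be sketched with a thread pool. Instructions that share the same source and target are grouped into one archiving task unit, and each unit copies its blocks in sequence on one of P worker threads. The instruction layout and the `copy_block` callback are assumptions made for this example; the actual data transfer is left to the caller.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_task_units(instructions):
    """Merge instructions with the same (source, target) pair into one unit."""
    units = defaultdict(list)
    for ins in instructions:
        units[(ins["source"], ins["target"])].append(ins["block"])
    return dict(units)


def run_task_units(units, copy_block, parallelism):
    """Run the units on `parallelism` worker threads; return blocks archived.

    copy_block(block, source, target) is a caller-supplied (hypothetical)
    function that reads the block from the source and writes it to the target.
    """
    def run_unit(item):
        (source, target), blocks = item
        for block in blocks:  # sequential within a unit to keep I/O streaming
            copy_block(block, source, target)
        return len(blocks)

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return sum(pool.map(run_unit, units.items()))
```

Keeping the blocks of one unit on a single thread means each source-target pair sees one sequential stream rather than many small concurrent flows, which is the contention pattern the background section identifies as the problem.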
[0083] Step S52: Monitor the data archiving process;
[0084] Specifically, during the archiving process, the system performance indicators and the available remaining capacity of each archiving storage node are monitored in real time. When the performance deviation exceeds the threshold or the remaining capacity is lower than the warning line, a feedback signal is triggered, the distribution of new tasks is suspended, and the process returns to step S1 to start a new round of scheduling optimization.
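A minimal Python sketch of such a feedback check is given below. The text does not define how performance deviation is measured, so the relative difference between expected and observed values used here is an assumption, as are the function name and the default threshold values.

```python
def needs_rescheduling(expected, observed, remaining_ratio,
                       deviation_threshold=0.3, capacity_warning=0.1):
    """Return True when a new scheduling round should be triggered.

    expected:        {metric: value used when the schedule was computed}
    observed:        {metric: value measured during archiving}
    remaining_ratio: {archive_node: remaining capacity / total capacity}
    Both thresholds are hypothetical defaults.
    """
    for name, planned in expected.items():
        # Assumed deviation measure: relative difference from the plan.
        if planned > 0 and abs(observed[name] - planned) / planned > deviation_threshold:
            return True
    # Capacity check: any node below the warning line triggers feedback.
    return any(r < capacity_warning for r in remaining_ratio.values())
```

When the function returns True, a caller would suspend the distribution of new tasks and restart from step S1.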
[0085] Embodiment 2
[0086] As shown in Figure 2, Embodiment 2 of this application provides a parallel archiving system for massive data, including:
[0087] The performance metric acquisition module 21 acquires the performance metrics of each source storage node, archive storage node, and network link, and constructs a dynamic performance metric set.
[0088] Furthermore, the performance metric acquisition module 21 includes the following sub-modules:
[0089] The storage node performance metric acquisition submodule sends probe commands to the source storage node and the archive storage node to obtain the read I / O throughput of the source storage node, the write I / O throughput of the archive storage node, and their available remaining storage capacity.
[0090] The dynamic performance metric set construction submodule measures the current available network bandwidth between each pair of source storage nodes and archive storage nodes, and integrates it with each I / O throughput and the proportion of available remaining storage capacity to construct a dynamic performance metric set.
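One possible in-memory shape for the dynamic performance metric set is sketched below in Python. The class `PathMetrics`, its field names, and the keying by source and archive node pair are assumptions for illustration. The probe results are passed in as plain dictionaries rather than collected from real nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathMetrics:
    read_throughput: float   # read I/O throughput of the source node
    write_throughput: float  # write I/O throughput of the archive node
    bandwidth: float         # available bandwidth between the two nodes
    free_ratio: float        # remaining / total capacity of the archive node


def build_metric_set(read_tp, write_tp, free, total, bandwidth):
    """Combine per-node and per-link probe results into one metric set.

    read_tp:   {source_node: read throughput}
    write_tp:  {archive_node: write throughput}
    free:      {archive_node: available remaining capacity}
    total:     {archive_node: total capacity}
    bandwidth: {(source_node, archive_node): available bandwidth}
    """
    metrics = {}
    for src, r in read_tp.items():
        for dst, w in write_tp.items():
            metrics[(src, dst)] = PathMetrics(
                r, w, bandwidth[(src, dst)], free[dst] / total[dst])
    return metrics
```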
[0091] The resource scoring matrix generation module 22 calculates the comprehensive resource score between each source storage node and each archive storage node based on the dynamic performance metric set, and forms a resource scoring matrix.
[0092] Furthermore, the resource scoring matrix generation module 22 includes the following sub-modules:
[0093] The competition coefficient generation submodule obtains historical archiving execution data when historical archiving tasks are executed, and calculates the competition coefficient of each performance indicator through a historical competition situation learning algorithm.
[0094] The comprehensive resource score calculation submodule calculates the comprehensive resource score between each source storage node and each archive storage node based on the dynamic performance metric set and the competition coefficient of each performance metric, forming a resource scoring matrix.
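Two ingredients of the competition coefficient calculation can be sketched in Python under stated assumptions. The bottleneck probability is interpreted here as the fraction of (sampling point, node) utilization samples above the threshold, and normalization is interpreted as division by the sum of the three comprehensive bottleneck metrics. Both readings are assumptions. How probability and average exceedance are combined into a comprehensive bottleneck metric is not reproduced here.

```python
def bottleneck_probability(utilization, threshold):
    """Fraction of utilization samples above the threshold.

    utilization: list over sampling points, each a list over nodes.
    """
    samples = [u for point in utilization for u in point]
    # The comparison plays the role of the indicator function.
    return sum(1 for u in samples if u > threshold) / len(samples)


def normalize(read_metric, write_metric, net_metric):
    """Scale three bottleneck metrics so the coefficients sum to 1."""
    total = read_metric + write_metric + net_metric
    return read_metric / total, write_metric / total, net_metric / total
```

For instance, `normalize(1.0, 2.0, 1.0)` gives the write side twice the weight of the read and network sides.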
[0095] The preferred path selection module 23 calculates and sorts the path priority values of each archiving storage node of the source data to be archived based on the resource scoring matrix, and generates an ordered preferred path list.
[0096] Furthermore, the preferred path selection module 23 includes the following sub-modules:
[0097] The path priority value calculation submodule calculates the path priority value of each archiving storage node of the source data to be archived based on the resource scoring matrix and through the path optimization algorithm.
[0098] The ordered preferred path list generation submodule sorts the archive storage nodes in descending order based on path priority values to form an ordered preferred path list of the source data to be archived;
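The sorting step can be sketched in a few lines of Python. The path priority values are taken as given inputs, since the path optimization algorithm that produces them is not shown here, and the function name is hypothetical.

```python
def ordered_preferred_paths(priority):
    """Return archive nodes sorted by path priority value, highest first.

    priority: {archive_node: path priority value}
    """
    return sorted(priority, key=lambda node: priority[node], reverse=True)
```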
[0099] The archiving scheduling instruction generation module 24, based on an ordered preferred path list, uses a capacity-constrained global load allocation algorithm to solve for the globally optimal archiving path for the source data to be archived and generates archiving scheduling instructions.
[0100] Furthermore, the archiving scheduling instruction generation module 24 includes the following sub-modules:
[0101] The global archive path planning model construction submodule constructs a global archive path planning model based on an ordered preferred path list and a capacity-constrained global load allocation algorithm.
[0102] The global optimal archiving path solution submodule uses a global archiving path planning model to solve for the global optimal archiving path of the source data to be archived and generates archiving scheduling instructions.
[0103] The data archiving execution module 25 archives the source data to be archived in parallel based on the archiving scheduling instructions;
[0104] Furthermore, the data archiving execution module 25 includes the following sub-modules:
[0105] The parallel data archiving execution submodule performs parallel archiving of each data block of the source data to be archived based on the archiving scheduling instructions.
[0106] The execution process monitoring submodule monitors the data archiving execution process.
[0107] Corresponding to the above embodiments, the present invention provides a computer device, including at least one memory and at least one processor;
[0108] The memory is used to store one or more program instructions;
[0109] The processor is used to run the one or more program instructions to execute the parallel archiving method for massive data.
[0110] Corresponding to the above embodiments, an embodiment of the invention provides a computer-readable storage medium containing one or more program instructions, which are executed by a processor to perform the parallel archiving method for massive data.
[0111] The embodiments disclosed in this invention provide a computer-readable storage medium storing computer program instructions that, when executed on a computer, cause the computer to perform the aforementioned parallel archiving method for massive amounts of data.
[0112] In this embodiment of the invention, the processor can be an integrated circuit chip with signal processing capabilities. The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0113] The various methods, steps, and logic diagrams disclosed in the embodiments of this invention can be implemented or executed. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this invention can be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. The processor reads information from the storage medium and, in conjunction with its hardware, completes the steps of the above methods.
[0114] The storage medium can be memory, such as volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
[0115] Among them, non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
[0116] Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), Synchlink dynamic random access memory (SLDRAM), and Direct Rambus random access memory (DR RAM).
[0117] The storage media described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.
[0118] Those skilled in the art will recognize that, in one or more of the examples above, the functions described in this invention can be implemented using a combination of hardware and software. When applied as software, the corresponding functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, wherein communication media include any medium that facilitates the transmission of computer programs from one place to another. Storage media can be any available medium that can be accessed by a general-purpose or special-purpose computer.
[0119] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solution of the present invention should be included within the scope of protection of the present invention.
Claims
1. A parallel archiving method for mass data, characterized by comprising:
Step S1: obtaining the performance metrics of each source storage node, archive storage node, and network link, and constructing a dynamic performance metric set;
Step S2: based on the dynamic performance metric set, calculating the comprehensive resource score between each source storage node and each archive storage node to form a resource scoring matrix, including the following sub-steps:
Step S21: obtaining historical archiving execution data recorded when historical archiving tasks were executed, and calculating the competition coefficient of each performance metric through a historical competition situation learning algorithm; specifically:
monitoring historical archiving task execution data to obtain a historical performance dataset within a historical time period T, the dataset including, synchronously collected at each historical sampling point t, the actual read I/O throughput of each source storage node together with its current read I/O throughput, the actual write I/O throughput of each archive storage node together with its current write I/O throughput, and the actual available network bandwidth of each network link together with its current available network bandwidth;
calculating, through a performance utilization formula, the performance utilization of each source storage node, archive storage node, and network link separately, namely the read utilization of each source storage node at sampling point t from its actual and current read I/O throughput, the write utilization of each archive storage node at sampling point t from its actual and current write I/O throughput, and the bandwidth utilization of each network link at sampling point t from its actual and current available network bandwidth;
setting utilization thresholds based on the historical archiving task execution data, wherein, when a utilization exceeds its threshold, the corresponding performance is considered to be in a bottleneck state at that sampling point;
calculating the performance bottleneck probability of each performance parameter within the historical time period T, wherein the probability that read performance becomes the performance bottleneck is calculated over the sampling points within T and the source storage nodes using an indicator function that takes 1 when the read utilization of a source storage node at a sampling point is greater than the read utilization threshold; the probability that write performance becomes the performance bottleneck is calculated over the sampling points within T and the archive storage nodes using an indicator function that takes 1 when the write utilization is greater than the write utilization threshold; and the probability that network performance becomes the performance bottleneck is calculated using an indicator function that takes 1 when the bandwidth utilization of a network link is greater than the bandwidth utilization threshold;
calculating the average degree of exceeding the threshold, wherein the average degree to which read performance exceeds the threshold is calculated from the read utilization of each source storage node at each sampling point in a bottleneck state, the read utilization threshold, and the total number of sampling points at which read performance is the bottleneck; the average degree to which write performance exceeds the threshold is calculated from the write utilization of each archive storage node at each sampling point in a bottleneck state, the write utilization threshold, and the total number of sampling points at which write performance is the bottleneck; and the average degree to which network performance exceeds the threshold is calculated from the bandwidth utilization of each network link at each sampling point in a bottleneck state, the bandwidth utilization threshold, and the total number of sampling points at which network performance is the bottleneck;
calculating, through a competition coefficient formula, a comprehensive bottleneck metric for each of read, write, and network performance from the corresponding bottleneck probability within T, the corresponding average degree of exceeding the threshold, and a smoothing factor, and normalizing the three comprehensive bottleneck metrics to obtain the read competition coefficient, the write competition coefficient, and the network competition coefficient;
Step S22: based on the dynamic performance metric set and the competition coefficient of each performance metric, calculating the comprehensive resource score between each source storage node and each archive storage node to form the resource scoring matrix; specifically, based on the dynamic performance metric set, the read competition coefficient, the write competition coefficient, and the network competition coefficient, calculating, through a resource scoring formula, the comprehensive resource score of each archive path from each source storage node to each archive storage node, the formula taking as inputs the read I/O throughput of the source storage node, the write I/O throughput of the archive storage node, the available network bandwidth of the network link from the source storage node to the archive storage node, the three competition coefficients, and the available remaining storage capacity and total storage capacity of the archive storage node; and filling the comprehensive resource scores of all archive paths into an M×N matrix to form the resource scoring matrix;
Step S3: based on the resource scoring matrix, calculating and sorting the path priority values of each archive storage node for the source data to be archived, and generating an ordered preferred path list;
Step S4: based on the ordered preferred path list, solving for the globally optimal archiving path of the source data to be archived using a capacity-constrained global load allocation algorithm, and generating archiving scheduling instructions;
Step S5: based on the archiving scheduling instructions, archiving the source data to be archived in parallel.
2. The parallel archiving method for mass data according to claim 1, wherein obtaining the performance metrics of each source storage node, archive storage node, and network link includes the following sub-steps: Step S11: sending probe commands to the source storage nodes and the archive storage nodes to obtain the read I/O throughput of each source storage node, the write I/O throughput of each archive storage node, and their available remaining storage capacity; Step S12: measuring the current available network bandwidth between each pair of source storage node and archive storage node, and integrating it with each I/O throughput and the proportion of available remaining storage capacity to construct the dynamic performance metric set.
3. The parallel archiving method for mass data according to claim 1, wherein calculating and sorting, based on the resource scoring matrix, the path priority values of each archive storage node for the source data to be archived to generate the ordered preferred path list includes the following sub-steps: Step S31: based on the resource scoring matrix, calculating the path priority value of each archive storage node for the source data to be archived using a path optimization algorithm; Step S32: sorting the archive storage nodes in descending order of path priority value to form the ordered preferred path list of the source data to be archived.
4. The parallel archiving method for mass data according to claim 1, wherein solving, based on the ordered preferred path list, for the globally optimal archiving path of the source data to be archived using the capacity-constrained global load allocation algorithm and generating the archiving scheduling instructions includes the following sub-steps: Step S41: based on the ordered preferred path list, constructing a global archiving path planning model using the capacity-constrained global load allocation algorithm; Step S42: solving for the globally optimal archiving path of the source data to be archived using the global archiving path planning model, and generating the archiving scheduling instructions.
5. A parallel archiving system for mass data, performing the parallel archiving method for mass data according to any one of claims 1 to 4, characterized by comprising: a performance metric acquisition module, which acquires the performance metrics of each source storage node, archive storage node, and network link, and constructs a dynamic performance metric set; a resource scoring matrix generation module, which calculates the comprehensive resource score between each source storage node and each archive storage node based on the dynamic performance metric set, forming a resource scoring matrix; a preferred path selection module, which calculates and sorts the path priority values of each archive storage node for the source data to be archived based on the resource scoring matrix, and generates an ordered preferred path list; an archiving scheduling instruction generation module, which, based on the ordered preferred path list, uses a capacity-constrained global load allocation algorithm to solve for the globally optimal archiving path of the source data to be archived, and generates archiving scheduling instructions; and a data archiving execution module, which archives the source data to be archived in parallel based on the archiving scheduling instructions.
6. The parallel archiving system for mass data according to claim 5, wherein the performance metric acquisition module specifically includes: a storage node performance metric acquisition submodule, which sends probe commands to the source storage nodes and the archive storage nodes to obtain the read I/O throughput of each source storage node, the write I/O throughput of each archive storage node, and their available remaining storage capacity; and a dynamic performance metric set construction submodule, which measures the current available network bandwidth between each pair of source storage node and archive storage node, and integrates it with each I/O throughput and the proportion of available remaining storage capacity to construct the dynamic performance metric set.
7. The parallel archiving system for mass data according to claim 5, wherein the resource scoring matrix generation module specifically includes: a competition coefficient generation submodule, which obtains historical archiving execution data recorded when historical archiving tasks were executed, and calculates the competition coefficient of each performance metric through a historical competition situation learning algorithm; and a comprehensive resource score calculation submodule, which calculates the comprehensive resource score between each source storage node and each archive storage node based on the dynamic performance metric set and the competition coefficient of each performance metric, forming the resource scoring matrix.
8. The parallel archiving system for mass data according to claim 5, wherein the preferred path selection module specifically includes: a path priority value calculation submodule, which calculates the path priority value of each archive storage node for the source data to be archived based on the resource scoring matrix using a path optimization algorithm; and an ordered preferred path list generation submodule, which sorts the archive storage nodes in descending order of path priority value to form the ordered preferred path list of the source data to be archived.
9. The parallel archiving system for mass data according to claim 5, wherein the archiving scheduling instruction generation module specifically includes: a global archiving path planning model construction submodule, which constructs a global archiving path planning model based on the ordered preferred path list and the capacity-constrained global load allocation algorithm; and a globally optimal archiving path solution submodule, which uses the global archiving path planning model to solve for the globally optimal archiving path of the source data to be archived and generates the archiving scheduling instructions.