A method for accessing and retrieving internal enterprise data
By calculating the storage load rate and correlation of data units and optimizing node allocation, the problem of load imbalance in distributed storage systems is solved, and data retrieval efficiency and system stability are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU QINGSHAN SOFTWARE CO LTD
- Filing Date
- 2025-06-06
- Publication Date
- 2026-06-26
AI Technical Summary
When storing internal enterprise data using distributed hash algorithms, uneven load distribution across different storage nodes can lead to poor performance, prolonged response times, and may cause node crashes or failures, reducing data retrieval efficiency and increasing the risk of data loss.
By calculating the storage load rate, correlation, and node performance indicators of data units, the data allocation scheme is optimized, and the node allocation is iteratively adjusted to ensure load balance.
It improves data retrieval efficiency, reduces the risk of node crashes, decreases the possibility of data loss, and optimizes the performance of distributed storage systems.
Description
Technical Field
[0001] This application relates to the field of data retrieval technology, specifically to a method for accessing and retrieving internal enterprise data.
Background Technology
[0002] With the rapid advancement of information technology, enterprises are constantly generating massive amounts of data. How to quickly retrieve data from this massive amount of data is a problem that needs to be solved. By storing the data generated within an enterprise through a distributed hash algorithm and quickly retrieving the data through a hash index, the enterprise's data can be protected while improving the efficiency of data retrieval.
[0003] However, when data generated within an enterprise is stored using a distributed hash algorithm, the amount of data stored on different storage nodes varies, so the storage nodes carry different loads. Storage nodes with poor performance may be assigned excessive amounts of data, which slows their response during data retrieval and causes latency and reduced throughput in the distributed system. Furthermore, overloaded storage nodes may crash or fail, increasing the risk of data loss and reducing the efficiency of data retrieval.
Summary of the Invention
[0004] To address the aforementioned technical issues, a method for accessing and retrieving internal enterprise data is provided to resolve the existing problems.
[0005] The solution to the technical problem addressed in this application is to provide a method for accessing and retrieving internal enterprise data, comprising the following steps:
[0006] Divide the various types of data within the enterprise into data units, and obtain all data points, all historical retrieval records, and performance data of various performance indicators of each node in the distributed storage system for each data unit;
[0007] The storage load rate of each data unit is calculated by taking into account the sampling frequency of the data points in each data unit and the total number of data points.
[0008] Analyze the historical retrieval records for cases where any two data units are retrieved simultaneously, and calculate the first correlation degree between the two data units.
[0009] Based on the correlation of changes in adjacent data points in the same sampling period in any two data units, and the differences in changes in data points in any two data units, a second correlation degree is determined for any two data units; the first correlation degree and the second correlation degree are fused to determine the retrieval correlation degree for any two data units.
[0010] The nodes are sorted according to the performance data of various performance indicators of each node to obtain the node performance sequence; the data units are initially allocated based on the distance relationship between the collection location of each data unit and the node; and the nodes are sorted according to the storage load rate of all data units allocated to each node to obtain the node load sequence.
[0011] Analyze the matching of elements in the node performance sequence and the node load sequence to obtain each matching node and each non-matching node. Based on the retrieval correlation between any data unit allocated to each non-matching node and the other data units, and in conjunction with the storage load rate, calculate the schedulability of any data unit. Based on the number of matching nodes, evaluate the initial allocation scheme. In conjunction with the schedulability, iteratively adjust the allocation scheme and allocate nodes to the data units.
[0012] Preferably, the calculation of the storage load rate for each data unit includes:
[0013] Calculate the average time interval between any two adjacent data points in each data unit, and denote it as the average sampling interval.
[0014] The storage load factor is the ratio between the number of all data points in each data unit and the average sampling interval.
[0015] Preferably, the first correlation degree $R1_{r,h}$ between the r-th data unit and the h-th data unit is calculated by a formula in which $U_{r,h}$ is the number of historical retrieval records that retrieve both the r-th and the h-th data units, $U_r$ is the number of historical retrieval records that retrieve the r-th data unit, $U_h$ is the number of historical retrieval records that retrieve the h-th data unit, and $\varepsilon$ is a preset value greater than 0.
[0016] Preferably, determining the second correlation degree between any two data units includes:
[0017] Select the data unit with the largest average sampling interval among any two data units and denote it as the target unit; then denote the other data unit as the reference unit.
[0018] Two adjacent data points are denoted as a pair of adjacent data points, and the relative change of each pair of adjacent data points in the target unit is calculated.
[0019] Based on the sampling time interval between each pair of adjacent data points in the target unit, all data points within the sampling time interval are selected from the reference unit to form a reference data sequence;
[0020] Calculate the sum of the relative changes between all adjacent data points in the reference data sequence, and denote it as the reference change.
[0021] The relative changes of all adjacent data points in the target unit are used to form a first change sequence; the reference changes of all reference data sequences in the reference unit are used to form a second change sequence.
[0022] Calculate the correlation between the first change sequence and the second change sequence;
[0023] Calculate the difference between each element in the first change sequence and the element at the same position in the second change sequence, and calculate the sum of the differences of all elements in the first change sequence;
[0024] The second correlation degree is the ratio of the absolute value of the correlation degree to the summation.
[0025] Preferably, the further process of obtaining the node performance sequence is as follows:
[0026] For each performance metric, the performance data of all nodes is positively processed. All nodes are arranged in descending order of performance data. The sum of the ranking of each node in all performance metrics is calculated. All nodes are arranged in ascending order according to the sum to form a node performance sequence.
[0027] Preferably, the further process of obtaining the node load sequence is as follows:
[0028] Calculate the distance between the acquisition location of each data unit and the location of each node;
[0029] All data units are pre-assigned to the nearest node as the initial allocation scheme;
[0030] The sum of the storage load rates of all data units allocated to each node in the initial allocation scheme is taken as the total load rate of each node in the initial allocation scheme.
[0031] Arrange all nodes in descending order of the total load rate to form a node load sequence.
[0032] Preferably, obtaining each matching node and each non-matching node includes:
[0033] If the node at a given position in the node load sequence is the same as the node at that position in the node performance sequence, then that node is a matching node; otherwise, it is a non-matching node.
[0034] Preferably, the schedulability of any data unit is calculated by a formula in which $T_{n,r}$ is the schedulability of the r-th data unit allocated to the n-th non-matching node, $L_{n,r}$ is the storage load rate of the r-th data unit allocated to the n-th non-matching node, $RJ_{r,h}$ is the retrieval correlation degree between the r-th data unit allocated to the n-th non-matching node and the h-th data unit, $H_n$ is the number of data units allocated to the n-th non-matching node, and $\epsilon$ is a preset value greater than 0.
[0035] Preferably, the evaluation of the initial allocation scheme includes:
[0036] The total number of matching nodes is counted and used as the node matching degree of the initial allocation scheme.
[0037] If the node matching degree is less than a preset threshold, the allocation scheme is iteratively adjusted; otherwise, the data unit is stored according to the current allocation scheme.
[0038] Preferably, the iterative adjustment of the allocation scheme includes:
[0039] Allocate the data unit with the largest schedulability under the non-matching nodes to its second-nearest node, and take the result as the allocation scheme for the next iteration;
[0040] By continuously iterating and adjusting the allocation scheme, the optimal allocation scheme is obtained, and data units are allocated nodes according to the optimal allocation scheme.
[0041] This application has at least the following beneficial effects:
[0042] This application calculates the storage load rate of each data unit by measuring the sampling frequency of data within each data unit. Its advantage lies in considering the rate of data growth within the corresponding data unit, thus reflecting the data load level of that data unit. Secondly, by analyzing the simultaneous retrieval of any two data units in historical search records, a first correlation degree is calculated between these two data units to reflect their association and indicate the degree to which they should be assigned to the same node. Thirdly, by analyzing the changes in adjacent data points within these two data units, a second correlation degree is calculated. Its advantage lies in considering the correlation of changes in data points within the two data units, further evaluating the association between them. Finally, by combining the first and second correlation degrees, the retrieval correlation degree of these two data units is determined. Its advantage lies in comprehensively evaluating the association between the two data units by integrating the correlation between historical search records and changes in data points, thus reflecting the degree to which they should be assigned to the same node. The likelihood of accessing multiple nodes during data retrieval is reduced, thus improving data retrieval efficiency. Nodes are sorted based on their performance data to obtain a node performance sequence. Data units are pre-allocated based on the distance between the data unit's collection location and the node. Nodes are then sorted again based on the storage load rate of the pre-allocated data units to obtain a node load sequence. Matching nodes and non-matching nodes are identified by comparing the node performance sequence with the node load sequence at the same location. This approach considers both node performance and pre-allocation load, reflecting the degree to which high-load data units are allocated to high-performance nodes. 
For data units allocated to non-matching nodes, schedulability is calculated. The allocation scheme is iteratively adjusted using schedulability to obtain the optimal allocation scheme. Data units are then allocated to nodes according to this optimal scheme. This approach balances the load on different nodes, reduces the risk of data loss due to node crashes, lowers node response time, and improves data retrieval efficiency.
Attached Figure Description
[0043] The following description, in conjunction with the accompanying drawings, provides a more detailed explanation of an enterprise internal data access and retrieval method based on this application.
[0044] Figure 1 is a flowchart illustrating the steps of an enterprise internal data access and retrieval method provided in this application embodiment;
[0045] Figure 2 is a flowchart illustrating the steps of the method for obtaining the optimal allocation scheme provided in this application embodiment.
Detailed Implementation
[0046] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description of an enterprise internal data access and retrieval method, in conjunction with the accompanying drawings and implementation examples, is provided. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0047] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains.
[0048] Please see Figure 1, which shows a flowchart of an enterprise internal data access and retrieval method according to an embodiment of this application. The method includes the following steps:
[0049] Step 1: Divide the various types of data within the enterprise into data units, and obtain all data points, all historical retrieval records, and performance data of various performance indicators of each node in the distributed storage system for each data unit.
[0050] In a distributed storage system, multiple nodes collaborate to store and manage data, achieving high availability, high scalability, and load balancing. By distributing data across multiple nodes, overloading a single node is avoided, improving the overall performance of the distributed storage system. To enhance data retrieval efficiency, distributed hash storage is used. To address the issue of poorly performing nodes being overloaded, leading to node crashes or failures, the data allocation and storage scheme is optimized. This ensures that different data is stored on appropriate nodes, improving the storage efficiency of nodes in the distributed hash algorithm. Consequently, this improves node load balancing, reduces the risk of data loss, and guarantees efficient data retrieval.
[0051] The enterprise continuously monitors production with different monitoring devices, which generate various types of real-time data that are stored in the distributed storage system. The data is categorized by type, and each type is treated as one data unit, thereby obtaining the data units.
[0052] Therefore, for each data unit, the data points are sorted according to their sampling time to obtain all data points for each data unit, and all historical retrieval records can be obtained through the system log; the performance data of each node's performance indicators can be obtained through the system monitoring tool; thus, it is necessary to select appropriate nodes for storage for different data units.
[0053] At this point, all data points for each data unit are obtained, all historical retrieval records are acquired, and performance data for various performance metrics of each node are also obtained.
[0054] Step 2: Calculate the storage load rate of each data unit by taking into account the sampling frequency of the data points in each data unit and the total number of data points; analyze the cases in historical retrieval records where any two data units are retrieved simultaneously, and calculate the first correlation degree of the two data units.
[0055] Because the growth rate of the number of data points in different data units is different, the amount of data in the data units varies. Data units with a faster growth rate of data points have a higher storage load. If a high-load data unit is assigned to a low-performance node, it will cause the node to be overloaded, affecting the node's response speed and thus reducing the data retrieval efficiency.
[0056] Therefore, the storage load rate of each data unit is calculated from the number of data points in the data unit and the sampling interval between adjacent data points, to reflect the data load of the corresponding data unit. Specifically:
[0057] Calculate the average time interval between any two adjacent data points in each data unit, and denote it as the average sampling interval.
[0058] The ratio of the number of all data points in each data unit to the average sampling interval is used as the storage load rate of each data unit;
[0059] It should be noted that the larger the number of data points, the larger the amount of data contained in the data unit, and the smaller the average sampling interval, the faster the data grows. In this case, the storage load rate is larger, indicating that the corresponding data unit places a higher load on the node.
[0060] Secondly, each node in a distributed storage system can store multiple data units, and different data units are often related to one another. A single retrieval therefore often touches multiple data units and multiple types of data. If strongly related data units are distributed to different nodes, multiple nodes must be accessed during retrieval, which reduces the efficiency of data retrieval.
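As an illustrative sketch of paragraphs [0057] and [0058], the storage load rate can be computed from the sorted sampling times of one data unit. The function names and the use of plain numeric timestamps are assumptions made for this example and do not come from the application.

```python
from typing import Sequence


def average_sampling_interval(timestamps: Sequence[float]) -> float:
    """Mean gap between consecutive sampling times (timestamps sorted ascending)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two data points")
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)


def storage_load_rate(timestamps: Sequence[float]) -> float:
    """Number of data points divided by the average sampling interval."""
    return len(timestamps) / average_sampling_interval(timestamps)


# Two units with the same number of points: the one sampled faster has the higher rate.
slow_unit = [0.0, 2.0, 4.0, 6.0]
fast_unit = [0.0, 1.0, 2.0, 3.0]
print(storage_load_rate(slow_unit), storage_load_rate(fast_unit))
```

The sketch does not guard against an average interval of zero (all points sharing one timestamp); a real implementation would need to decide how to handle that case.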
[0061] Therefore, by analyzing the association between different data units in all historical search records, a first degree of association is calculated to reflect the association between different data units in historical searches. Specifically:
[0062] The formula for calculating the first correlation degree between any two data units is:
[0063]
[0064] Here, $R1_{r,h}$ is the first correlation degree between the r-th data unit and the h-th data unit; $U_{r,h}$ is the number of historical retrieval records that retrieve both the r-th and the h-th data units; $U_r$ is the number of historical retrieval records that retrieve the r-th data unit; $U_h$ is the number of historical retrieval records that retrieve the h-th data unit; and $\varepsilon$ is a preset value greater than 0 that prevents a zero denominator. The value of $\varepsilon$ lies in (0,1). In this embodiment, $\varepsilon$ is 0.1. In other implementations, the implementer can set it according to the actual situation.
[0065] It should be noted that the greater the first correlation, the greater the correlation between the data category features of the r-th data unit and the h-th data unit. When searching the data, the greater the possibility that the data in these two data units will be retrieved together, which reflects that the r-th data unit and the h-th data unit should be allocated to the same node for storage.
[0066] Thus, the first correlation degree between any two data units is obtained.
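The counts $U_{r,h}$, $U_r$ and $U_h$ can be gathered from the retrieval log as sketched below. The log format (one set of data-unit identifiers per retrieval record) is an assumption. The exact formula of the application is not available in this text, so `first_correlation` uses a Jaccard-style ratio purely as a stand-in; it should not be read as the claimed formula.

```python
from typing import Iterable, Set, Tuple


def co_retrieval_counts(records: Iterable[Set[str]], r: str, h: str) -> Tuple[int, int, int]:
    """Return (U_rh, U_r, U_h) for data units r and h over all retrieval records."""
    u_rh = u_r = u_h = 0
    for units in records:
        in_r, in_h = r in units, h in units
        u_r += int(in_r)
        u_h += int(in_h)
        u_rh += int(in_r and in_h)
    return u_rh, u_r, u_h


def first_correlation(u_rh: int, u_r: int, u_h: int, eps: float = 0.1) -> float:
    # ASSUMPTION: Jaccard-style ratio chosen for illustration only.
    # eps plays the role of the preset value that prevents a zero denominator.
    return u_rh / (u_r + u_h - u_rh + eps)


log = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
counts = co_retrieval_counts(log, "a", "b")
print(counts, first_correlation(*counts))
```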
[0067] Step 3: Based on the correlation of changes in adjacent data points in the same sampling period in any two data units, and the differences in changes in data points in any two data units, determine the second correlation degree of any two data units; combine the first correlation degree and the second correlation degree to determine the retrieval correlation degree of any two data units.
[0068] Furthermore, the first correlation degree reflects the possibility that two data units are jointly retrieved, thus indicating the correlation between data units; however, it does not involve the correlation of data point changes between different data units. Since the average sampling intervals of different data units are different, the number of data points contained in different data units is inconsistent under the same sampling duration. Therefore, by analyzing the correlation changes of data points between different data units under the same sampling duration, a second correlation degree is calculated to more accurately reflect the correlation between different data units. Specifically:
[0069] Select the data unit with the largest average sampling interval among any two data units and denote it as the target unit; then denote the other data unit as the reference unit.
[0070] In this embodiment, it is assumed that for the r-th data unit and the h-th data unit, the average sampling interval of the r-th data unit is the largest. Therefore, the r-th data unit is the target unit and the h-th data unit is the reference unit.
[0071] Two adjacent data points are denoted as a pair of adjacent data points. Based on the sampling time period between each pair of adjacent data points in the target unit, all data points within the sampling time period are selected from the reference unit to form a reference data sequence.
[0072] It should be noted that, for ease of understanding, let's assume a pair of adjacent data points in the target unit, namely a1 and a2. Then, the sampling time corresponding to data point a1 is t1, and the sampling time corresponding to data point a2 is t2. The sampling time interval between data points a1 and a2 is [t1, t2]. Data points collected within the time interval [t1, t2] are selected from the reference unit. Assuming that a total of 5 data points are collected in the reference unit within the time interval [t1, t2], namely b1, b2, b3, b4, and b5, the reference data sequence is {b1, b2, b3, b4, b5}. Thus, data points a1 and a2 correspond to one reference data sequence in the reference unit, and correspondingly, data points a2 and a3 correspond to one reference data sequence in the reference unit.
[0073] Calculate the relative change of each pair of adjacent data points in the target unit;
[0074] In this embodiment, taking a pair of adjacent data points a1 and a2 in the target unit as an example, the relative change between a1 and a2 is calculated from the value of data point a1 and the value of data point a2 in the target unit.
[0075] Calculate the sum of the relative changes between all adjacent data points in the reference data sequence, and denote it as the reference change.
[0076] In this embodiment, taking the reference data sequence corresponding to a pair of adjacent data points a1 and a2 in the target unit as an example, the reference change of that sequence is calculated from the value of data point $b_i$ and the value of data point $b_{i-1}$ in the reference data sequence, summed over all data points in the reference data sequence corresponding to a1 and a2.
[0077] It should be noted that if the value of a data point in a data unit is not numerical, the non-numerical data is encoded and converted into numerical data for calculation. The method of encoding non-numerical data is a well-known technique and will not be elaborated here.
[0078] The relative changes of all adjacent data points in the target unit are used to form a first change sequence;
[0079] The reference change amounts of all reference data sequences in the reference unit are combined to form a second change sequence;
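The construction of the two change sequences in paragraphs [0071] to [0079] can be sketched as follows. Data points are assumed to be `(sampling_time, value)` pairs sorted by time. The relative-change formula is not recoverable from this text, so `relative_change` assumes the absolute difference divided by the magnitude of the earlier value; that choice and all names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (sampling_time, value)


def relative_change(prev: float, curr: float, tiny: float = 1e-9) -> float:
    # ASSUMPTION: |curr - prev| / |prev|; `tiny` only guards against division by zero.
    return abs(curr - prev) / (abs(prev) + tiny)


def change_sequences(target: Sequence[Point],
                     reference: Sequence[Point]) -> Tuple[List[float], List[float]]:
    """Build the first change sequence (target) and the second change sequence (reference)."""
    first: List[float] = []
    second: List[float] = []
    for (t1, v1), (t2, v2) in zip(target, target[1:]):
        first.append(relative_change(v1, v2))
        # Reference data sequence: reference points sampled within [t1, t2].
        window = [v for t, v in reference if t1 <= t <= t2]
        # Reference change: sum of relative changes between adjacent points in the window.
        second.append(sum(relative_change(a, b) for a, b in zip(window, window[1:])))
    return first, second


target_unit = [(0, 10.0), (4, 20.0), (8, 10.0)]
reference_unit = [(0, 1.0), (2, 2.0), (4, 4.0), (6, 2.0), (8, 1.0)]
print(change_sequences(target_unit, reference_unit))
```

Both sequences have one element per pair of adjacent target points, so elements at the same position refer to the same sampling period.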
[0080] Calculate the correlation between the first change sequence and the second change sequence;
[0081] In this embodiment, the degree of correlation is measured by calculating the Pearson correlation coefficient between the first change sequence and the second change sequence. The calculation method of the Pearson correlation coefficient is a well-known technique. As other implementations, implementers may use other methods of the prior art, such as the Spearman correlation coefficient, etc. This embodiment does not impose any special restrictions on this.
[0082] It should be noted that elements at the same position in the first and second change sequences correspond to the same pair of adjacent data points in the target unit. For example, the first element in the first change sequence and the first element in the second change sequence both correspond to the same pair of adjacent data points a1 and a2 in the target unit. Secondly, the larger the absolute value of the correlation, the stronger the correlation between the data changes of the target unit and the reference unit.
[0083] Furthermore, the second correlation is analyzed based on the differences between elements in the first and second changed sequences, specifically as follows:
[0084] Calculate the difference between each element in the first change sequence and the element at the same position in the second change sequence, and calculate the sum of the differences of all elements in the first change sequence;
[0085] The ratio of the absolute value of the correlation degree to the summation is taken as the second correlation degree between any two data units.
[0086] In this embodiment, the formula for calculating the second correlation degree between any two data units is:
[0087] $$R2_{r,h} = \frac{E_{r,h}}{\sum_{q=1}^{Q} \left| M_q - C_q \right| + \tau}$$
[0088] Here, $R2_{r,h}$ is the second correlation degree between the r-th data unit and the h-th data unit; $E_{r,h}$ is the absolute value of the correlation between the r-th data unit and the h-th data unit; $M_q$ is the value of the q-th element in the first change sequence; $C_q$ is the value of the q-th element in the second change sequence; $Q$ is the number of elements in the first change sequence; and $\tau$ is a preset value greater than 0 that prevents a zero denominator. The value range of $\tau$ is (0,1]. In this embodiment, $\tau$ is 1. In other implementations, the implementer can set it according to the actual situation.
[0089] It should be noted that the smaller the sum, the smaller the difference in the magnitude of element changes between the first change sequence and the second change sequence, and the larger the second correlation degree, indicating that the correlation between the r-th data unit and the h-th data unit is higher. In this case, the r-th data unit and the h-th data unit should be assigned to the same node.
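A minimal sketch of paragraphs [0080] to [0085] follows: the Pearson coefficient of the two change sequences, taken in absolute value, divided by the summed element-wise differences plus $\tau$. Taking each difference in absolute value and returning 0 for a constant sequence are assumptions of this example.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # ASSUMPTION: a constant sequence is treated as uncorrelated
    return cov / (spread_x * spread_y)


def second_correlation(first: Sequence[float], second: Sequence[float],
                       tau: float = 1.0) -> float:
    """|correlation| divided by (sum of element-wise absolute differences + tau)."""
    e = abs(pearson(first, second))
    diff_sum = sum(abs(m - c) for m, c in zip(first, second))
    return e / (diff_sum + tau)


# Identical change patterns score higher than proportional ones of different magnitude.
print(second_correlation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(second_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```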
[0090] Based on the first relevance and the second relevance, the retrieval relevance is determined as follows:
[0091] The normalized result of the product of the first correlation degree and the second correlation degree is used as the retrieval correlation degree of any two data units;
[0092] In this embodiment, the sigmoid function is used for normalization. The sigmoid function is a well-known technique and will not be described in detail here. As other implementation methods, implementers may use other methods, such as the tanh function, etc. This embodiment does not impose any special restrictions on this.
[0093] It should be noted that the greater the retrieval relevance, the higher the correlation between the two corresponding data units. In this case, the two corresponding data units should be assigned to the same node for storage.
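The fusion in paragraphs [0091] and [0092] amounts to passing the product of the two correlation degrees through a sigmoid. The sketch below assumes the standard logistic function; the function name is illustrative.

```python
import math


def retrieval_relevance(first_degree: float, second_degree: float) -> float:
    """Sigmoid-normalized product of the first and second correlation degrees."""
    product = first_degree * second_degree
    return 1.0 / (1.0 + math.exp(-product))


print(retrieval_relevance(0.0, 0.5))  # a zero product maps to 0.5
print(retrieval_relevance(0.8, 0.9))
```

If both degrees are non-negative, the product is non-negative and the result lies in [0.5, 1), so relevance values are best compared with one another rather than read on an absolute scale.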
[0094] Thus, the retrieval relevance of any two data units is obtained.
[0095] Step 4: Sort the nodes according to the performance data of various performance indicators of each node to obtain the node performance sequence; based on the distance relationship between the collection location of each data unit and the node, perform initial allocation of data units; sort the nodes according to the storage load rate of all data units allocated to each node to obtain the node load sequence.
[0096] Data collection locations for different data units may differ. The greater the distance between the collection location of each data unit and the node, the lower the data transmission efficiency. Therefore, by analyzing the distance relationship between each data unit and the node, all data units are initially allocated to obtain the node performance sequence and node load sequence, specifically:
[0097] Calculate the distance between the acquisition location of each data unit and the location of each node;
[0098] It should be noted that the location of nodes in the distributed storage system is obtained through geographic information system (GIS) technology, which is a well-known technology and will not be elaborated upon here.
[0099] It should be noted that the smaller the distance, the closer the corresponding data unit is to the node.
[0100] All data units are pre-assigned to the nearest node as the initial allocation scheme;
[0101] The sum of the storage load rates of all data units allocated to each node in the initial allocation scheme is taken as the total load rate of each node in the initial allocation scheme.
[0102] Furthermore, the performance metrics of a node do not all point in the same direction: for some metrics a larger value indicates better performance, while for others a smaller value indicates better performance. It is therefore necessary to apply positive processing to the performance data of the different metrics so that the direction of change is consistent across metrics. Specifically:
[0103] For each performance metric, the performance data of all nodes is positively processed. All nodes are arranged in descending order of performance data. The sum of the ranking of each node in all performance metrics is calculated. All nodes are arranged in ascending order of the sum to form a node performance sequence.
[0104] It should be noted that the forward processing method is a well-known technology and will not be elaborated here; secondly, the smaller the sum, the better the performance of the corresponding node and the higher the data processing capability of the node. Therefore, the earlier the node in the node performance sequence, the better its performance.
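The rank-sum ordering of paragraphs [0103] and [0104] can be sketched as below. The application does not specify the positive processing method, so this example assumes simple negation of metrics for which smaller is better; the metric names and the `lower_is_better` parameter are hypothetical.

```python
from typing import Dict, List, Set


def node_performance_sequence(metrics: Dict[str, Dict[str, float]],
                              lower_is_better: Set[str]) -> List[str]:
    """Order nodes from best to worst by the sum of their per-metric ranks."""
    nodes = list(next(iter(metrics.values())).keys())
    rank_sum = {node: 0 for node in nodes}
    for name, values in metrics.items():
        # ASSUMPTION: positive processing by negation, so larger always means better.
        sign = -1.0 if name in lower_is_better else 1.0
        ordered = sorted(nodes, key=lambda node: sign * values[node], reverse=True)
        for rank, node in enumerate(ordered, start=1):
            rank_sum[node] += rank
    # Smaller rank sum means better overall performance.
    return sorted(nodes, key=lambda node: rank_sum[node])


metrics = {
    "throughput": {"g1": 100.0, "g2": 300.0, "g3": 200.0},
    "latency": {"g1": 30.0, "g2": 10.0, "g3": 20.0},
}
print(node_performance_sequence(metrics, lower_is_better={"latency"}))
```

Ties in a metric or in the rank sum are broken by the original node order because Python's sort is stable; the application does not say how ties should be handled.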
[0105] Arrange all nodes in descending order of the total load rate to form a node load sequence;
[0106] It should be noted that the earlier a node appears in the node load sequence, the higher its load.
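The initial allocation and the node load sequence of paragraphs [0097] to [0105] can be sketched as follows. Locations are assumed to be planar coordinates compared by Euclidean distance, which is a simplification of the GIS-based positions mentioned in the text; all names are illustrative.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]


def initial_allocation(unit_locations: Dict[str, Coord],
                       node_locations: Dict[str, Coord]) -> Dict[str, str]:
    """Pre-assign every data unit to its nearest node."""
    return {
        unit: min(node_locations, key=lambda node: math.dist(loc, node_locations[node]))
        for unit, loc in unit_locations.items()
    }


def node_load_sequence(allocation: Dict[str, str],
                       load_rates: Dict[str, float],
                       nodes: List[str]) -> List[str]:
    """Order nodes by descending total storage load rate of their allocated units."""
    total = {node: 0.0 for node in nodes}
    for unit, node in allocation.items():
        total[node] += load_rates[unit]
    return sorted(nodes, key=lambda node: total[node], reverse=True)


nodes = {"g1": (0.0, 0.0), "g2": (10.0, 0.0)}
units = {"u1": (1.0, 0.0), "u2": (9.0, 0.0), "u3": (8.0, 1.0)}
rates = {"u1": 5.0, "u2": 1.0, "u3": 2.0}
plan = initial_allocation(units, nodes)
print(plan, node_load_sequence(plan, rates, list(nodes)))
```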
[0107] Step 5: Analyze the matching of elements in the node performance sequence and the node load sequence to obtain each matching node and each non-matching node. Based on the retrieval correlation degree between any data unit allocated to each non-matching node and the other data units, and in conjunction with the storage load rate, calculate the schedulability of any data unit. Based on the number of matching nodes, evaluate the initial allocation scheme. In conjunction with the schedulability, iteratively adjust the allocation scheme and allocate nodes to the data units.
[0108] Furthermore, the matching between the node performance sequence and the node load sequence is analyzed to evaluate the allocation scheme, and the allocation scheme is continuously iterated and adjusted to obtain the optimal allocation scheme, specifically as follows:
[0109] If the node at a given position in the node performance sequence is the same as the node at that position in the node load sequence, then that node is a matching node; otherwise, it is a non-matching node.
[0110] The total number of matching nodes is counted and used as the node matching degree of the initial allocation scheme.
[0111] It should be noted that, for ease of understanding, we assume there are 5 nodes in the distributed storage system, where the node performance sequence is {g2, g4, g1, g5, g3} and the node load sequence is {g3, g4, g2, g5, g1}. Then, the second position in both the node performance sequence and the node load sequence is node g4, and the fourth position in both the node performance sequence and the node load sequence is node g5. Therefore, the node matching degree is 2.
[0112] It should be noted that the lower the node matching degree, the more unreasonable the allocation of data units is, and the more necessary it is to readjust and reassign the data units.
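The position-wise comparison can be sketched as follows, using the five-node example given in this embodiment. The function name `split_matching` is an illustrative assumption.

```python
from typing import List, Tuple


def split_matching(perf_seq: List[str], load_seq: List[str]) -> Tuple[List[str], List[str]]:
    """Split nodes into matching and non-matching nodes.

    A node is matching when it occupies the same position in the node
    performance sequence and in the node load sequence.
    """
    matching = [p for p, l in zip(perf_seq, load_seq) if p == l]
    non_matching = [p for p in perf_seq if p not in matching]
    return matching, non_matching


perf = ["g2", "g4", "g1", "g5", "g3"]
load = ["g3", "g4", "g2", "g5", "g1"]
matching, non_matching = split_matching(perf, load)
print(len(matching))  # node matching degree: 2
```

Only g4 (second position) and g5 (fourth position) coincide, so the node matching degree is 2 and the remaining three nodes are non-matching.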
[0113] If the node matching degree is less than a preset threshold, the allocation scheme is iteratively adjusted; otherwise, the data unit is stored according to the current allocation scheme.
[0114] In this embodiment, the preset threshold is computed from N using a rounding function, where N is the total number of nodes in the distributed storage system; N is set to 11 in this embodiment. In other implementations, the implementer can set the value according to the actual situation.
[0115] The process of iteratively adjusting the allocation scheme is as follows:
[0116] The schedulability of any data unit allocated to each mismatched node is calculated using the following formula:
[0117]
[0118] Here, $T_{n,r}$ is the schedulability of the r-th data unit allocated to the n-th non-matching node; $L_{n,r}$ is the storage load rate of the r-th data unit allocated to the n-th non-matching node; $RJ_{r,h}$ is the retrieval correlation degree between the r-th data unit allocated to the n-th non-matching node and the h-th data unit; $H_n$ is the number of all data units allocated to the n-th non-matching node; and $\epsilon$ is a preset value greater than 0 that prevents the denominator from being 0. The value range of $\epsilon$ is (0,1], and $\epsilon$ is 0.1 in this embodiment. In other implementations, the implementer can set it according to the actual situation.
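The formula itself is not reproduced in this text, so the following is only a hedged sketch of one form that is consistent with the symbol definitions: the storage load rate divided by the summed retrieval correlation degree with the other data units on the same node, plus ε. This ratio form is an assumption, as is the function name `schedulability`; the actual formula of this application may differ.

```python
from typing import Iterable


def schedulability(load_rate: float, relevance_to_others: Iterable[float], eps: float = 0.1) -> float:
    """Hypothetical schedulability score for one data unit.

    ASSUMPTION (not confirmed by the source): T = L / (sum of RJ + eps).
    Under this form, a unit that is heavily loaded and weakly correlated
    with the other units on its node gets a high score.
    eps > 0 keeps the denominator from being 0.
    """
    if eps <= 0:
        raise ValueError("eps must be greater than 0")
    return load_rate / (sum(relevance_to_others) + eps)


# A unit with load rate 2.0 and two co-located units (illustrative values).
print(schedulability(2.0, [0.25, 0.25], eps=0.5))
```

Under this assumed form, moving a high-scoring unit relieves the most load while separating the fewest units that tend to be retrieved together.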
[0119] For each non-matching node, the data unit with the highest schedulability is reallocated to its second-nearest neighbor node, and the result is taken as the next allocation scheme. This process is iterated until the node matching degree meets the preset threshold, which yields the optimal allocation scheme. Data units are then allocated to nodes according to the optimal allocation scheme, achieving load balancing across the nodes and improving data retrieval efficiency. The flowchart of the method for obtaining the optimal allocation scheme provided in this embodiment is shown in Figure 2.
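The iteration can be sketched as a loop. This is a simplified illustration under several assumptions: `rebalance`, `second_nearest`, `score`, and `max_rounds` are hypothetical names; the second-nearest node of each data unit is precomputed and fixed; and a round limit is added because the text does not state what happens if the threshold is never reached.

```python
from typing import Callable, Dict, List


def rebalance(
    allocation: Dict[str, str],
    load_rate: Dict[str, float],
    perf_seq: List[str],
    second_nearest: Dict[str, str],
    score: Callable[[str, List[str]], float],
    threshold: int,
    max_rounds: int = 20,
) -> Dict[str, str]:
    """Iteratively adjust an allocation until enough nodes match."""
    allocation = dict(allocation)  # do not mutate the caller's scheme
    for _ in range(max_rounds):
        # Total load rate per node, then the node load sequence.
        total = {node: 0.0 for node in perf_seq}
        for unit, node in allocation.items():
            total[node] += load_rate[unit]
        load_seq = sorted(perf_seq, key=lambda n: total[n], reverse=True)
        non_matching = [p for p, l in zip(perf_seq, load_seq) if p != l]
        if len(perf_seq) - len(non_matching) >= threshold:
            break  # node matching degree meets the threshold
        moves = {}
        for node in non_matching:
            units = [u for u, n in allocation.items() if n == node]
            if units:
                # Move the most schedulable unit of this node.
                best = max(units, key=lambda u: score(u, units))
                moves[best] = second_nearest[best]
        if not moves:
            break
        allocation.update(moves)
    return allocation


# Illustrative data: g1 performs best but initially carries the least load.
rates = {"u1": 4.0, "u2": 2.0, "u3": 1.0}
result = rebalance(
    allocation={"u1": "g2", "u2": "g2", "u3": "g1"},
    load_rate=rates,
    perf_seq=["g1", "g2"],
    second_nearest={"u1": "g1", "u2": "g1", "u3": "g2"},
    score=lambda unit, units: rates[unit],  # placeholder schedulability
    threshold=2,
)
print(result)
```

In this example one round swaps u1 and u3, after which the best-performing node g1 also carries the highest load and both nodes match.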
[0120] The hash value is calculated for each data unit allocated to each node. By storing the index relationship between the hash value and the data storage location, the data can be retrieved quickly, thus improving the efficiency of data retrieval.
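A minimal sketch of such a hash index follows. The text does not specify the hash function or what is hashed, so the sketch assumes SHA-256 over a data unit identifier; the names `build_hash_index` and `lookup` are likewise illustrative.

```python
import hashlib
from typing import Dict, Optional


def _digest(unit_id: str) -> str:
    # ASSUMPTION: SHA-256 of the unit identifier serves as the hash value.
    return hashlib.sha256(unit_id.encode("utf-8")).hexdigest()


def build_hash_index(allocation: Dict[str, str]) -> Dict[str, str]:
    """Map each data unit's hash value to its storage location (node id)."""
    return {_digest(unit_id): node for unit_id, node in allocation.items()}


def lookup(index: Dict[str, str], unit_id: str) -> Optional[str]:
    """Return the storage location of a data unit, or None if not indexed."""
    return index.get(_digest(unit_id))


index = build_hash_index({"u1": "g1", "u2": "g2"})
print(lookup(index, "u2"))  # g2
```

A lookup is a single dictionary access on the hash value, so the storage location is found without scanning the nodes.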
[0121] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. At least some of the steps in Figure 1 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. Their execution order is not necessarily sequential; they can be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
[0122] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0123] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application. Therefore, any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of this application, without departing from the content of the technical solution of this application, shall fall within the protection scope of the technical solution of this application.
Claims
1. A method for accessing and retrieving internal enterprise data, characterized in that the method includes the following steps:
Divide the various types of data within the enterprise into data units, and obtain all data points and all historical retrieval records of each data unit, together with the performance data of the various performance indicators of each node in the distributed storage system;
Calculate the storage load rate of each data unit based on the sampling frequency and the number of data points in each data unit;
Analyze the historical retrieval records for cases where any two data units are retrieved simultaneously, and calculate the first correlation degree between the two data units; based on the correlation of the changes of adjacent data points within the same sampling period in any two data units, and the differences in the changes of the data points in the two data units, determine the second correlation degree between the two data units; fuse the first correlation degree and the second correlation degree to determine the retrieval correlation degree between the two data units;
Sort the nodes according to the performance data of the various performance indicators of each node to obtain the node performance sequence; initially allocate the data units based on the distance between the collection location of each data unit and each node; and sort the nodes according to the storage load rates of all data units allocated to each node to obtain the node load sequence;
Analyze the matching of elements in the node performance sequence and the node load sequence to obtain each matching node and each non-matching node; based on the retrieval correlation degree between any data unit allocated to each non-matching node and the other data units, and in conjunction with the storage load rate, calculate the schedulability of any data unit; based on the number of matching nodes, evaluate the initial allocation scheme; in conjunction with the schedulability, iteratively adjust the allocation scheme and allocate nodes to the data units;
The first correlation degree between any two data units is calculated by a formula whose terms are: the number of all historical retrieval records in which the two data units are retrieved simultaneously; the number of all historical retrieval records in which the first of the two data units is retrieved; the number of all historical retrieval records in which the second of the two data units is retrieved; and a preset value greater than 0;
Determining the second correlation degree between any two data units includes:
Calculate the average time interval between any two adjacent data points in each data unit, and denote it as the average sampling interval; of the two data units, select the one with the larger average sampling interval and denote it as the target unit, the other data unit being denoted as the reference unit;
Denote two adjacent data points as a pair of adjacent data points, and calculate the relative change of each pair of adjacent data points in the target unit;
Based on the sampling time interval between each pair of adjacent data points in the target unit, select from the reference unit all data points within that sampling time interval to form a reference data sequence;
Calculate the sum of the relative changes between all adjacent data points in the reference data sequence, and denote it as the reference change;
Form a first change sequence from the relative changes of all pairs of adjacent data points in the target unit, and form a second change sequence from the reference changes of all reference data sequences in the reference unit;
Calculate the correlation between the first change sequence and the second change sequence;
Calculate the difference between each element in the first change sequence and the element at the same position in the second change sequence, and sum these differences over all elements of the first change sequence to obtain the cumulative sum;
The second correlation degree is the ratio of the absolute value of the correlation to the cumulative sum;
The retrieval correlation degree is the normalized result of the product of the first correlation degree and the second correlation degree.
2. The enterprise internal data access and retrieval method as described in claim 1, characterized in that the calculation of the storage load rate of each data unit includes: the storage load rate is the ratio of the number of all data points in each data unit to the average sampling interval.
3. The enterprise internal data access and retrieval method as described in claim 1, characterized in that the process for obtaining the node performance sequence is as follows: for each performance indicator, apply forward processing to the performance data of all nodes and arrange all nodes in descending order of the performance data; calculate the sum of the rankings of each node over all performance indicators; arrange all nodes in ascending order of the sum to form the node performance sequence.
4. The enterprise internal data access and retrieval method as described in claim 1, characterized in that the process for obtaining the node load sequence is as follows: calculate the distance between the collection location of each data unit and the location of each node; pre-assign each data unit to its nearest node as the initial allocation scheme; take the sum of the storage load rates of all data units allocated to each node in the initial allocation scheme as the total load rate of that node; arrange all nodes in descending order of the total load rate to form the node load sequence.
5. The enterprise internal data access and retrieval method as described in claim 1, characterized in that the process of obtaining each matching node and each non-matching node includes: if the node at a given position in the node load sequence is the same as the node at the same position in the node performance sequence, then that node is a matching node; otherwise, it is a non-matching node.
6. The enterprise internal data access and retrieval method as described in claim 1, characterized in that the schedulability of any data unit is calculated by a formula in which: $T_{n,r}$ is the schedulability of the r-th data unit allocated to the n-th non-matching node; $L_{n,r}$ is the storage load rate of the r-th data unit allocated to the n-th non-matching node; $RJ_{r,h}$ is the retrieval correlation degree between the r-th data unit allocated to the n-th non-matching node and the h-th data unit; $H_n$ is the number of all data units allocated to the n-th non-matching node; and $\epsilon$ is a preset value greater than 0.
7. The enterprise internal data access and retrieval method as described in claim 1, characterized in that the evaluation of the initial allocation scheme includes: count the total number of matching nodes and use it as the node matching degree of the initial allocation scheme; if the node matching degree is less than a preset threshold, iteratively adjust the allocation scheme; otherwise, store the data units according to the current allocation scheme.
8. The enterprise internal data access and retrieval method as described in claim 7, characterized in that the iterative adjustment of the allocation scheme includes: for each non-matching node, allocate the data unit with the highest schedulability to the second-nearest neighbor node, and take the result as the next allocation scheme; by continuously iterating and adjusting the allocation scheme, obtain the optimal allocation scheme, and allocate nodes to the data units according to the optimal allocation scheme.