A data-driven enterprise business management data analysis method

By analyzing the overlap between the query range and the cluster coverage and the matching degree of data distribution, and combining Euclidean distance to obtain a corrected distance metric, the problems of insufficient analysis efficiency and inaccurate decision-making in existing technologies are solved, and more efficient data analysis and decision support are achieved.

CN121301475B · Active · Publication Date: 2026-06-23 · JILIN AGRI SCI & TECH COLLEGE

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: JILIN AGRI SCI & TECH COLLEGE
Filing Date: 2025-09-28
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing enterprise business management data analysis methods fail to comprehensively consider the overlap of query scope, scope size, and data distribution matching degree, resulting in insufficient analysis efficiency and inaccurate decision support, making it difficult to support enterprises in rapid response and scientific decision-making under multi-source business data.

Method used

By collecting and preprocessing query logs and data block metadata, a standardized dataset is obtained. The overlap between the query range and cluster coverage and the matching relationship of data distribution are analyzed. Combined with Euclidean distance, a corrected distance metric is obtained to guide data prefetching strategies and business decisions.

Benefits of technology

It significantly improves the accuracy and real-time performance of data analysis, reduces the cross-node access overhead of distributed databases, increases the prefetch hit rate, and enhances enterprises' decision-making capabilities under complex and ever-changing query and analysis needs.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to the field of data analysis, and more particularly to a data-driven enterprise business management data analysis method. The method includes: obtaining a standardized dataset by collecting and preprocessing query logs and data block metadata; obtaining a range overlap reward factor by analyzing the overlap relationship between the query range and the cluster coverage range; obtaining a range size and data distribution matching degree reward factor by analyzing the matching relationship between the query range size and the data distribution; obtaining a corrected distance metric by combining the range overlap reward factor and the range size and data distribution matching degree reward factor with the Euclidean distance; and obtaining a data prefetching strategy and a business decision support basis by analyzing the clustering results based on the corrected distance metric. The method solves the problem that existing methods cannot comprehensively consider query range overlap, range size, and data distribution matching degree, which leads to insufficient analysis efficiency and inaccurate decision support.

Description

Technical Field

[0001] This invention relates to the field of data analysis technology, and in particular to a data-driven enterprise business management data analysis method.

Background Technology

[0002] In the business management process of modern enterprises, daily operations constantly generate massive amounts of data from multiple areas such as sales, supply chain, finance, and customer service. To meet the needs of parallel processing and large-scale analysis by multiple departments, enterprises typically use distributed databases or data warehouses to store this business data. While a distributed architecture provides good scalability and high availability, it also brings efficiency challenges when enterprises perform large-scale data retrieval and analysis. In particular, when range-based analysis is required on order intervals for a certain period, inventory distribution in a certain region, or behavioral characteristics of a certain customer group, existing methods often need to scan or load a large number of data blocks because the data is partitioned and stored on different physical nodes, resulting in increased latency and wasted resources. To solve this problem, existing technologies typically use data pre-partitioning, indexing, and data prefetching to improve analysis efficiency. Data pre-partitioning, which divides the data range before writing, can reduce some scanning costs, but it is prone to data skew and access hotspots when the data distribution is uneven or the analysis range changes frequently. Indexing, which creates key-value or attribute indexes, can speed up data location, but it still incurs high network and storage overhead when a query crosses multiple partitions. Data prefetching attempts to preload potentially needed data blocks while the current data is being processed, in order to reduce subsequent waiting time. However, existing prefetching strategies often rely on fixed rules, such as prefetching adjacent data blocks or a limited number of data blocks, which adapt poorly to complex and ever-changing enterprise analysis request patterns and often result in low prefetching hit rates or excessive redundant data.

[0003] To further enhance the intelligence of enterprise-level data analytics, some studies have introduced clustering methods, such as K-means clustering, to model historical analysis requests and discover common interval patterns to guide data prefetching. However, these methods typically rely on Euclidean distance to measure the similarity between the request range and the cluster centers. Euclidean distance only considers the numerical difference between the start and end values of an interval, ignoring several key factors. For example, when an analysis request is numerically close to a cluster center, but their actual covered intervals have almost no overlap, the prefetched data blocks may not meet the analysis requirements. When the request range differs significantly in size from the cluster's coverage, even if the intervals overlap, the amount of data loaded may not match the actual needs, leading to over-fetching or omissions. When the data distribution within a cluster is inconsistent with the data-dense area of interest in the request, even if the ranges appear similar, the analysis results may deviate from business requirements. These issues demonstrate that existing Euclidean distance-based metrics are insufficient to accurately reflect the true correlation between analysis requests and historical patterns, making it difficult to support the urgent need for efficient and accurate analysis in enterprise business management. Therefore, existing enterprise business management data analysis methods still suffer from insufficient granularity and inaccurate decision support, lacking an analytical mechanism that can simultaneously consider interval overlap, range size, and data distribution matching. This directly restricts enterprises' rapid response and scientific decision-making under multi-source business data, and has become a pressing technical problem to be solved.

Summary of the Invention

[0004] In view of this, the present invention aims to propose a data-driven enterprise business management data analysis method to solve the problems of insufficient analysis efficiency and inaccurate decision support caused by the failure of existing methods to comprehensively consider the overlap of query scope, scope size and data distribution matching degree.

[0005] To achieve the above objectives, the technical solution of the present invention is implemented as follows:

[0006] A data-driven enterprise business management data analysis method, the method comprising:

[0007] Step S1: Obtain a standardized dataset by collecting and preprocessing query logs and data block metadata;

[0008] Step S2: Obtain the range overlap reward factor by analyzing the overlap relationship between the query range and the cluster coverage range in the standardized dataset;

[0009] Step S3: Obtain the reward factor for the matching degree of the range size and data distribution by analyzing the matching relationship between the query range size and data distribution in the standardized dataset;

[0010] Step S4: Obtain the corrected distance metric by combining the range overlap reward factor with the range size and data distribution matching degree reward factor using Euclidean distance;

[0011] Step S5: Analyze the clustering results based on the modified distance metric to obtain data prefetching strategies and business decision support basis;

[0012] The method of obtaining a reward factor for range size and data distribution matching degree by analyzing the matching relationship between query range size and data distribution in a standardized dataset includes: obtaining range size similarity by performing ratio processing on query range length data and cluster coverage range length data; obtaining the intersection interval and union interval between any target query range and target cluster coverage range in the standardized dataset; obtaining intersection density data by traversing the data blocks involved in the intersection interval and performing ratio processing on the data volume of the data blocks involved in the intersection interval and the length of the intersection interval; obtaining union density data by traversing the data blocks involved in the union interval and performing ratio processing on the data volume of the data blocks involved in the union interval and the length of the union interval; obtaining data distribution matching degree by performing ratio processing on the intersection density data and the union density data; and obtaining a reward factor for range size and data distribution matching degree by fusing the range size similarity and data distribution matching degree.

[0013] The process of obtaining range size similarity by performing ratio processing on query range length data and cluster coverage range length data includes: for any target query range and target cluster coverage range in the standardized dataset, obtaining the length of the target query range and the length of the target cluster coverage range respectively; dividing the length of the target query range by the length of the target cluster coverage range to obtain length ratio data; performing logarithmic operation on the length ratio data and taking the absolute value to obtain range length difference data; using the range length difference data as the negative exponent input of an exponential function, and using the corresponding exponential mapping result as the range size similarity between the target query range and the target cluster coverage range;

[0014] The process of obtaining the data distribution matching degree by performing ratio processing on intersection density data and union density data includes: for any target query range and target cluster coverage range in the standardized dataset, obtaining the intersection interval and union interval between them; by traversing the data blocks involved in the intersection interval and union interval, calculating the total amount of data in the intersection interval and the total amount of data in the union interval respectively; dividing the total amount of data in the intersection interval by the length of the intersection interval to obtain the intersection density data; dividing the total amount of data in the union interval by the length of the union interval to obtain the union density data; and using the ratio processing of the intersection density data and union density data, along with the calculation result of logarithmic operation and exponential mapping, as the data distribution matching degree between the target query range and the target cluster coverage range.

[0015] The step of fusing range size similarity and data distribution matching degree to obtain a range size and data distribution matching degree reward factor includes: for any target query range and target cluster coverage range in the standardized dataset, the mean of the range size similarity and data distribution matching degree of the target query range and target cluster coverage range is used as the range size and data distribution matching degree reward factor of the target query range and target cluster coverage range.

[0016] Furthermore, the process of acquiring a standardized dataset by collecting and preprocessing query logs and data block metadata includes:

[0017] Query logs from a distributed database system are collected, and sensitive information in the logs is anonymized. The query logs include the query time, the query range, the data tables and fields involved in the query, and the amount of data returned. The start key and end key of each query range are extracted from the query logs, and each query range is represented as a two-dimensional vector composed of its start key and end key. Meta-information of the data blocks in the distributed database is collected, and each data block is represented by three values: the start key value of the data block, the end key value of the data block, and the data volume of the data block. Routine data cleaning and standardization processing is performed on the query logs and data block metadata to obtain a standardized dataset for analysis.

[0018] Furthermore, the step of obtaining the range overlap reward factor by analyzing the overlap relationship between the query range and the cluster coverage range includes:

[0019] By performing intersection analysis on the query range data and the cluster coverage range data, effective overlapping interval data is obtained. Then, by performing a ratio calculation on the effective overlapping interval data and the interval length data, range overlap evaluation data is obtained. Finally, by performing exponential amplification processing on the range overlap evaluation data, the range overlap reward factor is obtained.

[0020] Furthermore, the step of obtaining effective overlapping interval data by performing intersection analysis on the query range data and cluster coverage range data, and obtaining range overlap evaluation data by performing a ratio calculation on the effective overlapping interval data and interval length data, includes:

[0021] For any target query range in the standardized dataset and any target cluster coverage range in the clustering process, the larger of the minimum value of the target query range and the minimum value of the target cluster coverage range is taken as the intersection starting point, and the smaller of the maximum value of the target query range and the maximum value of the target cluster coverage range is taken as the intersection ending point. The range formed by the intersection starting point and the intersection ending point is taken as the effective overlap interval data between the target query range and the target cluster. The larger value between the result of subtracting the intersection starting point from the intersection ending point and 0 is taken as the interval length of the effective overlap interval between the target query range and the target cluster. The length of the target query range and the length of the target cluster coverage range are obtained accordingly.

[0022] The length of the effective overlapping interval between the target query range and the target cluster coverage range is used as the numerator, and the smaller of the length of the target query range and the length of the target cluster coverage range is used as the denominator. The calculation result of the corresponding fraction is used as the range overlap evaluation data between the target query range and the target cluster coverage range.

[0023] Furthermore, the step of obtaining the range overlap reward factor by exponentially amplifying the range overlap evaluation data includes:

[0024] The result of subtracting the constant 1 from the range overlap evaluation data is subjected to an exponential mapping with the natural constant as the base, and the corresponding mapping result is used as the range overlap exponential transformation result; the result of multiplying the range overlap evaluation data and the range overlap exponential transformation result is used as the range overlap reward factor.

[0025] Furthermore, the step of obtaining a corrected distance metric by combining the range overlap reward factor with the range size and data distribution matching degree reward factor using Euclidean distance includes:

[0026] For any target query range and target cluster coverage range in the standardized dataset, the sum of the constant 1, the range overlap reward factor, and the range size and data distribution matching degree reward factor is used as the denominator, the constant 1 is used as the numerator, and the resulting fraction is used as the distance correction factor. The distance correction factor is used as the weight applied to the Euclidean distance between the target query range and the target cluster center, and the weighted result is the corrected distance metric between the target query range and the target cluster center.
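As an illustration only, the correction described in paragraphs [0015] and [0026] can be sketched in Python as follows. The function names `matching_reward` and `corrected_distance`, and the representation of a query range and a cluster center as `(start, end)` tuples, are assumptions made for this sketch and do not appear in the original text.

```python
import math


def matching_reward(size_similarity, distribution_match):
    """Mean of the two sub-scores, as described in paragraph [0015]."""
    return (size_similarity + distribution_match) / 2.0


def corrected_distance(query, center, overlap_reward, match_reward):
    """Euclidean distance weighted by 1 / (1 + overlap reward + matching reward).

    query and center are (start, end) tuples; this layout is an assumption.
    """
    euclid = math.hypot(query[0] - center[0], query[1] - center[1])
    correction = 1.0 / (1.0 + overlap_reward + match_reward)
    return correction * euclid
```

With both reward factors equal to 0 the corrected distance reduces to the plain Euclidean distance; larger rewards shrink the distance and therefore raise the cluster's priority.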

[0027] Compared with the prior art, the present invention has the following advantages:

[0028] This invention presents a data-driven enterprise business management data analysis method. By introducing a joint modeling mechanism that considers range overlap, range size, and data distribution matching during the data analysis process, it can more accurately characterize the correlation between analysis requests and historical data patterns. Compared to existing methods that rely on simple rules or on Euclidean distance alone, this invention effectively avoids data loading redundancy and analysis bias caused by non-overlapping intervals or excessive range differences, thereby significantly improving the accuracy and real-time performance of data analysis. For enterprises, this means obtaining more efficient response results under the same computing and storage resource conditions when facing complex and ever-changing query and analysis needs. Simultaneously, this invention achieves better identification of typical query patterns through a modified similarity metric, thereby supporting the collaborative optimization of data prefetching and business management decisions. In practical applications, it not only reduces cross-node access overhead in distributed databases, improves the prefetch hit rate, and reduces invalid I/O operations, but also helps enterprises gain insights into data hotspots and trends, guiding the optimized distribution of data blocks among storage nodes. This not only improves the operational efficiency of the underlying system, but also provides enterprises with more reliable data analysis support, enhancing their decision-making capabilities in various business scenarios such as sales forecasting, inventory control, and financial risk management.

Attached Figure Description

[0029] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:

[0030] Figure 1 is a flowchart illustrating a data-driven enterprise business management data analysis method according to an embodiment of the present invention.

Detailed Implementation

[0031] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0032] Figure 1 is a flowchart of a data-driven enterprise business management data analysis method provided in Embodiment 1 of the present invention. As shown in Figure 1, the method may include:

[0033] Step S1: Obtain a standardized dataset by collecting and preprocessing query logs and data block metadata.

[0034] The raw data used in this invention originates from the query logs of a distributed database system. These logs record detailed information about each query request submitted by the user, including the query time, the query range, the data tables and fields involved, and the amount of data returned. To protect user privacy and data security, sensitive information (such as user ID and specific query content) is anonymized when collecting query logs.

[0035] The query logs are collected from the distributed database system, and sensitive information in the logs is anonymized. The query logs include the query time, the query range, the data tables and fields involved in the query, and the amount of data returned. The start key and end key of each query range are extracted from the query logs, and each query range is represented as a two-dimensional vector composed of its start key and end key. Meta-information of the data blocks in the distributed database is collected, and each data block is represented by three values: the start key value of the data block, the end key value of the data block, and the data volume of the data block. Routine data cleaning and standardization processing is performed on the query logs and data block metadata to obtain a standardized dataset for analysis.

[0036] This completes the process of obtaining a standardized dataset by collecting and preprocessing query logs and data block metadata.
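As an illustration only, the data representation of step S1 might be sketched in Python as follows. The class names `QueryRange` and `DataBlock`, the field names, and the choice of min-max scaling as the "standardization processing" are all assumptions made for this sketch; the original text does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRange:
    start_key: float
    end_key: float

    def as_vector(self):
        """Two-dimensional vector (start key, end key) of the query range."""
        return (self.start_key, self.end_key)


@dataclass(frozen=True)
class DataBlock:
    start_key: float
    end_key: float
    volume: int  # data volume of the block


def normalize_ranges(ranges):
    """Drop empty or inverted ranges, then min-max scale keys to [0, 1].

    Min-max scaling is an assumed form of standardization, not the patent's.
    """
    valid = [r for r in ranges if r.end_key > r.start_key]
    if not valid:
        return []
    lo = min(r.start_key for r in valid)
    hi = max(r.end_key for r in valid)
    span = hi - lo
    return [QueryRange((r.start_key - lo) / span, (r.end_key - lo) / span)
            for r in valid]
```

Data block metadata would be scaled with the same `lo` and `span` so that query ranges and block boundaries remain comparable.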

[0037] Step S2: Obtain the range overlap reward factor by analyzing the overlap relationship between the query range and the cluster coverage range.

[0038] In existing prefetching strategies based on K-means clustering, Euclidean distance is typically used to measure the distance between the query range and the cluster centroids. Specifically, the query range is represented as a two-dimensional vector $q = (q_s, q_e)$ of its start and end keys, the cluster center point is represented as $c = (c_s, c_e)$, and the Euclidean distance between the two is calculated as $d(q, c) = \sqrt{(q_s - c_s)^2 + (q_e - c_e)^2}$. This metric, which relies solely on Euclidean distance, has a significant drawback: it fails to consider the overlap between the query range and the actual coverage of the cluster. In the first case, the Euclidean distance between the query range and the cluster centroid is small, but the query range and the cluster's coverage have almost no overlap. In this situation, even if the query range is close to the cluster centroid, the data blocks within the cluster may not contain data that satisfies the query range, or may contain only a very small amount of data. If the cluster is prioritized too high based on Euclidean distance and its data blocks are prefetched, it will result in a large number of invalid prefetch operations. In the second case, the Euclidean distance between the query range and the cluster centroid is large, but the query range and the cluster's coverage have a significant overlap. In this situation, although the query range is not so close to the cluster centroid, the cluster may contain a large amount of data that satisfies the query range. If the cluster is prioritized too low simply because of its large Euclidean distance, the opportunity to prefetch a large amount of relevant data will be missed. These two cases demonstrate that using only Euclidean distance cannot accurately assess the relevance between the query range and the cluster because it ignores the crucial factor of the overlap in their coverage. To address this issue, a reward mechanism needs to be introduced into the distance metric. This mechanism is based on the degree of overlap between the query range and the cluster's coverage.
Specifically, when the overlap between the query range and the cluster's coverage is high, a larger reward is given, significantly reducing the adjusted distance and increasing the cluster's prefetch priority. When the overlap between the query range and the cluster's coverage is low, a smaller reward (or even zero) is given, so the adjusted distance is not significantly reduced, thus relatively lowering the cluster's prefetch priority. This approach can more accurately identify clusters that truly contain a large amount of data satisfying the query range. Even if their initial Euclidean distance is slightly greater, they can still obtain higher priority through the high overlap reward, thereby improving the prefetch hit rate and reducing unnecessary prefetch overhead.
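The two cases described above can be illustrated with a small Python sketch. All numeric values below are hypothetical and were chosen only to reproduce the situation in which the nearer cluster center belongs to a cluster that does not overlap the query at all; the function name `euclidean_distance` is likewise an assumption.

```python
import math


def euclidean_distance(query, center):
    """Euclidean distance between a (start, end) query vector and a cluster center."""
    return math.hypot(query[0] - center[0], query[1] - center[1])


# Hypothetical query range and two hypothetical clusters.
query = (100.0, 110.0)
near_center, near_coverage = (112.0, 122.0), (111.0, 125.0)  # close center, no overlap
far_center, far_coverage = (90.0, 150.0), (60.0, 180.0)      # distant center, contains query

d_near = euclidean_distance(query, near_center)
d_far = euclidean_distance(query, far_center)
```

Here `d_near` is smaller than `d_far`, so a purely Euclidean ranking would prefetch the first cluster even though its coverage starts after the query range ends, while the second cluster covers the whole query range.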

[0039] In summary, this invention first obtains effective overlapping interval data by performing intersection analysis on the query range data and the cluster coverage range data, and then obtains range overlap evaluation data by performing a ratio calculation on the effective overlapping interval data and the interval length data. Specifically, for any target query range in the standardized dataset and any target cluster coverage range in the clustering process, the larger of the minimum value of the target query range and the minimum value of the target cluster coverage range is taken as the intersection starting point, and the smaller of the maximum value of the target query range and the maximum value of the target cluster coverage range is taken as the intersection ending point. The range formed by the intersection starting point and the intersection ending point is taken as the effective overlap interval data between the target query range and the target cluster coverage range. The larger of 0 and the result of subtracting the intersection starting point from the intersection ending point is taken as the interval length of the effective overlap interval, so that non-overlapping ranges yield a length of 0. The length of the target query range and the length of the target cluster coverage range are obtained accordingly. The interval length of the effective overlap interval is used as the numerator, and the smaller of the length of the target query range and the length of the target cluster coverage range is used as the denominator; the value of the resulting fraction is used as the range overlap evaluation data between the target query range and the target cluster coverage range.
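As an illustration only, the intersection and ratio calculation of this paragraph can be sketched in Python as follows. The function name `overlap_ratio`, the `(min, max)` tuple representation, and the handling of zero-length ranges are assumptions made for this sketch.

```python
def overlap_ratio(query, cluster):
    """Range overlap evaluation data for (min, max) tuples.

    Overlap length divided by the shorter of the two range lengths.
    """
    q_min, q_max = query
    c_min, c_max = cluster
    start = max(q_min, c_min)          # intersection starting point
    end = min(q_max, c_max)            # intersection ending point
    overlap = max(end - start, 0.0)    # 0 when the ranges do not overlap
    shorter = min(q_max - q_min, c_max - c_min)
    if shorter <= 0:
        return 0.0                     # assumption: degenerate ranges score 0
    return overlap / shorter
```

For example, a query range `(10, 20)` and a cluster coverage range `(15, 40)` overlap on `(15, 20)`, giving a ratio of 5 / 10 = 0.5.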

[0040] After obtaining the range overlap evaluation data between the target query range and the target cluster coverage range, the range overlap evaluation data is further exponentially amplified to obtain the range overlap reward factor. Specifically, the result of subtracting the range overlap evaluation data from the constant 1 is negated and subjected to an exponential mapping with the natural constant as the base, and the corresponding mapping result is used as the range overlap exponential transformation result; the result of multiplying the range overlap evaluation data by the range overlap exponential transformation result is used as the range overlap reward factor.

[0041] In one implementation, assume that the length of the $i$-th query range is $L_i^{q}$, the length of the $j$-th cluster coverage range is $L_j^{c}$, and the effective overlap interval length between the $i$-th query range and the $j$-th cluster coverage range is $O_{ij}$. Then the range overlap reward factor between the $i$-th query range and the $j$-th cluster coverage range is calculated as:

[0042] $$R_{ij} = \frac{O_{ij}}{\min\left(L_i^{q}, L_j^{c}\right)} \cdot \exp\left(-\left(1 - \frac{O_{ij}}{\min\left(L_i^{q}, L_j^{c}\right)}\right)\right)$$

[0043] where $R_{ij}$ denotes the range overlap reward factor between the $i$-th query range and the $j$-th cluster coverage range; $O_{ij}$ denotes the length of the effective overlap interval between the $i$-th query range and the $j$-th cluster coverage range; $L_i^{q}$ denotes the length of the $i$-th query range; and $L_j^{c}$ denotes the length of the $j$-th cluster coverage range.

[0044] It should be noted that the range overlap reward factor aims to reward clusters that overlap with the query range to a certain degree. The first part of the calculation formula, $O_{ij} / \min(L_i^{q}, L_j^{c})$, calculates the overlap ratio between the query range and the cluster coverage range; the denominator here uses the smaller of the two lengths, rather than simply the length of the overlapping portion. This addresses situations where a small range is completely contained within a large range, resulting in a high overlap ratio but a very small actual overlap. For example, if one of the two ranges has length 1 and lies entirely within the other, much longer range, then although the overlap ratio is 100%, the actual meaningful overlap is only 1. Using $\min(L_i^{q}, L_j^{c})$ as the denominator more accurately reflects the proportion of effective overlap. The second part of the calculation formula, $\exp\left(-\left(1 - O_{ij} / \min(L_i^{q}, L_j^{c})\right)\right)$, applies an exponential transformation to the overlap ratio. First, the non-overlapping ratio is calculated, which is the constant 1 minus the overlap ratio. Then, the negative of the non-overlapping ratio is passed through the exponential function. When the overlap ratio is close to 1, the result of the exponential mapping is also close to 1, resulting in a larger overall reward; when the overlap ratio is close to 0, the result of the exponential mapping decreases toward $e^{-1}$ and the product approaches 0, resulting in a smaller reward. This exponential amplification effect makes the range overlap reward factor more sensitive to the range overlap degree, enabling it to more effectively distinguish clusters with low overlap.
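As an illustration only, the reward factor can be sketched in Python as follows, assuming (as stated in paragraph [0024]) that it equals the overlap ratio multiplied by the natural exponential of the overlap ratio minus 1. The function name `range_overlap_reward` and the input validation are assumptions made for this sketch.

```python
import math


def range_overlap_reward(overlap_ratio):
    """Overlap ratio times exp(-(1 - overlap ratio)); expects a ratio in [0, 1]."""
    if not 0.0 <= overlap_ratio <= 1.0:
        raise ValueError("overlap_ratio must lie in [0, 1]")
    non_overlap = 1.0 - overlap_ratio
    return overlap_ratio * math.exp(-non_overlap)
```

Under this reading the reward is 0 for disjoint ranges, 1 for full overlap, and grows faster than linearly in between, so partially overlapping clusters are penalized more strongly than their raw overlap ratio alone would suggest.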

[0045] This completes the process of obtaining the range overlap reward factor by analyzing the overlap between the query range and the cluster coverage range.

[0046] Step S3: Obtain the reward factor for the matching degree of the range size and data distribution by analyzing the matching relationship between the query range size and the data distribution.

[0047] In step S2, a range overlap reward factor is introduced to correct the deficiency of ignoring range overlap when only Euclidean distance is considered. However, even if the query range and the cluster coverage have a high degree of overlap, it does not necessarily mean that prefetching the data blocks of that cluster is the optimal choice. Suppose the query range is small, while the cluster coverage is very large. Even if there is a high degree of overlap, most data blocks within the cluster are still irrelevant to the query range, and prefetching the entire cluster's data blocks would result in a large amount of I/O waste. Conversely, if the query range is large, while the cluster coverage is small, even if there is a high degree of overlap, the data blocks within the cluster may only contain a small portion of the data that satisfies the query range, and prefetching only the cluster's data blocks would miss a large amount of other relevant data. Therefore, it is necessary to consider the similarity between the size of the query range and the size of the cluster coverage. Ideally, the size of the cluster coverage to be prefetched should be close to the size of the query range, covering as much relevant data as possible while avoiding prefetching too much irrelevant data. On the other hand, even if the query range and the cluster coverage are similar in size and have a high degree of overlap, if the data distribution within them does not match, the prefetching effect may still be unsatisfactory. For example, suppose the query range primarily focuses on the portion with smaller key values, while the data within the cluster is mainly concentrated on the portion with larger key values. Even if the ranges overlap and are similar in size, most of the data within the cluster is not what the query needs. Therefore, it is necessary to consider the degree of matching between the expected data distribution within the query range and the actual data distribution within the cluster.
Ideally, the prefetched data distribution within the cluster should be as consistent as possible with the expected data distribution within the query scope.

[0048] In summary, this invention first obtains the range size similarity by performing ratio processing on the query range length data and the cluster coverage range length data. Specifically, for any target query range and target cluster coverage range in the standardized dataset, the length of the target query range and the length of the target cluster coverage range are obtained respectively; the length of the target query range is divided by the length of the target cluster coverage range to obtain the length ratio data; the logarithm of the length ratio data is taken, followed by the absolute value, to obtain the range length difference data; the negative of the range length difference data is used as the exponent of the exponential function, and the resulting exponential mapping is used as the range size similarity between the target query range and the target cluster coverage range.
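
The range size similarity described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `range_size_similarity` and the rejection of non-positive lengths are assumptions made for this sketch.

```python
import math


def range_size_similarity(query_length: float, cluster_length: float) -> float:
    """Return exp(-|ln(query_length / cluster_length)|)."""
    if query_length <= 0 or cluster_length <= 0:
        # Assumption of this sketch: degenerate (zero-length) ranges are rejected.
        raise ValueError("range lengths must be positive")
    # Length ratio data: query range length divided by cluster coverage length.
    length_ratio = query_length / cluster_length
    # Range length difference data: absolute value of the logarithm of the ratio.
    range_length_difference = abs(math.log(length_ratio))
    # Exponential mapping with the negated difference as the exponent.
    return math.exp(-range_length_difference)
```

Because exp(-|ln r|) equals the smaller of r and 1/r, a query range half as long as the cluster coverage and a query range twice as long both give a similarity of about 0.5, and equal lengths give exactly 1.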

[0049] After obtaining the similarity in size between the target query range and the target cluster coverage range, the data distribution matching degree is further obtained by performing ratio processing on the intersection density data and the union density data. Specifically, for any target query range and target cluster coverage range in the standardized dataset, the intersection interval and the union interval between them are obtained. By traversing the data blocks involved in the intersection interval and the union interval, the total amount of data in the intersection interval and the total amount of data in the union interval are calculated respectively. The total amount of data in the intersection interval is divided by the length of the intersection interval to obtain the intersection density data. The total amount of data in the union interval is divided by the length of the union interval to obtain the union density data. The result of the ratio processing of the intersection density data and the union density data, followed by logarithmic operation and exponential mapping, is used as the data distribution matching degree between the target query range and the target cluster coverage range.
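
A possible Python sketch of this density comparison is shown below. It assumes that the absolute value is applied to the logarithm of the density ratio (mirroring the range size similarity), and that an empty intersection yields a matching degree of 0; both choices, like the function names, are assumptions of this sketch and are not stated in the text.

```python
import math


def density(total_data: float, interval_length: float) -> float:
    """Total amount of data in an interval divided by the interval length."""
    if interval_length <= 0:
        raise ValueError("interval length must be positive")
    return total_data / interval_length


def distribution_matching_degree(
    intersection_data: float,
    intersection_length: float,
    union_data: float,
    union_length: float,
) -> float:
    """Map the intersection/union density ratio to a matching degree."""
    if intersection_length <= 0 or intersection_data <= 0:
        # Assumption: no overlap or no data in the overlap means no match.
        return 0.0
    intersection_density = density(intersection_data, intersection_length)
    union_density = density(union_data, union_length)
    density_ratio = intersection_density / union_density
    # Assumption: absolute value of the logarithm, then exponential mapping.
    return math.exp(-abs(math.log(density_ratio)))
```

For example, an intersection holding 300 records over a key length of 10 (density 30) and a union holding 800 records over a key length of 40 (density 20) give a ratio of 1.5 and a matching degree of about 0.667.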

[0050] Finally, by fusing the range size similarity and data distribution matching degree, a data distribution matching degree reward factor is obtained. Specifically, for any target query range and target cluster coverage range in the standardized dataset, the mean of the range size similarity and data distribution matching degree of the target query range and target cluster coverage range is used as the data distribution matching degree reward factor of the target query range and target cluster coverage range.
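
As a worked illustration with hypothetical numbers (not taken from the patent), and assuming the absolute value is applied to both logarithms, take a query range of length 20, a cluster coverage range of length 40, an intersection density of 30, and a union density of 20:

$$
\begin{aligned}
\text{range size similarity} &= e^{-\left|\ln(20/40)\right|} = e^{-\ln 2} = 0.5 \\
\text{data distribution matching degree} &= e^{-\left|\ln(30/20)\right|} = e^{-\ln 1.5} \approx 0.667 \\
\text{reward factor} &= \tfrac{1}{2}\,(0.5 + 0.667) \approx 0.583
\end{aligned}
$$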

[0051] In one implementation, assume that the intersection density data of the $i$-th query range and the $j$-th cluster coverage range is $\rho^{\cap}_{ij}$, and that the union density data of the $i$-th query range and the $j$-th cluster coverage range is $\rho^{\cup}_{ij}$. Then the formula for calculating the data distribution matching degree reward factor of the $i$-th query range and the $j$-th cluster coverage range is:

[0052] $$R_{ij} = \frac{1}{2}\left( e^{-\left|\ln\frac{L^{q}_{i}}{L^{c}_{j}}\right|} + e^{-\left|\ln\frac{\rho^{\cap}_{ij}}{\rho^{\cup}_{ij}}\right|} \right)$$

[0053] where $R_{ij}$ denotes the data distribution matching degree reward factor of the $i$-th query range and the $j$-th cluster coverage range; $L^{q}_{i}$ denotes the length of the $i$-th query range; $L^{c}_{j}$ denotes the length of the $j$-th cluster coverage range; $\rho^{\cap}_{ij}$ denotes the intersection density data of the $i$-th query range and the $j$-th cluster coverage range; $\rho^{\cup}_{ij}$ denotes the union density data of the $i$-th query range and the $j$-th cluster coverage range; $e$ denotes the natural constant; and $|\cdot|$ denotes the absolute value operation.

[0054] It should be noted that computing the union for the $i$-th query range and the $j$-th cluster coverage range requires traversing all data blocks within the target cluster coverage range and the query range. When calculating the data volume within the union, in addition to traversing the data blocks in the current cluster, it is also necessary to consider whether the query range spans multiple clusters. If the query range is not fully contained within the coverage range of the $j$-th cluster, there may be data blocks that do not belong to that cluster but still intersect the union; these data blocks fall within the query range and belong to other clusters. It is therefore necessary to traverse all data blocks within the cluster coverage range and, within the query range, to find all data blocks that intersect the union. For any such data block, if it is completely contained within the intersection or union, its data volume is accumulated into the corresponding total; if only part of it is contained within the intersection or union, the amount of data within that intersection or union is estimated and accumulated.
The data distribution matching degree reward factor aims to reward clusters that are similar in size to the query range and match it in data distribution, thereby improving prefetching accuracy. Its calculation formula consists of two parts. The first part, $e^{-|\ln(L^{q}_{i}/L^{c}_{j})|}$, represents the range size similarity, which measures how similar the query range length $L^{q}_{i}$ and the cluster coverage length $L^{c}_{j}$ are. First, the ratio of the two lengths is calculated. This ratio intuitively reflects the relative relationship between the two range sizes: whether the query range or the cluster coverage is larger, and by what factor. Then the natural logarithm of this ratio is taken, followed by the absolute value.
Taking the logarithm serves two purposes. The first is symmetry: because the focus is on the degree of difference between the two range sizes rather than on which range is larger, the similarity obtained when the query range is several times larger than the cluster range (for example, twice as large) must equal the similarity obtained when the cluster range is larger than the query range by the same factor. This holds because the logarithm satisfies $\ln(1/x) = -\ln x$; that is, the logarithms of a number and of its reciprocal are opposites. Therefore, after the absolute value is taken, $|\ln(L^{q}_{i}/L^{c}_{j})|$ and $|\ln(L^{c}_{j}/L^{q}_{i})|$ are exactly equal. This symmetry ensures the fairness of the similarity measure and avoids bias caused by which range is larger. The second purpose is to compress the numerical range. The ratio of the two lengths can be extremely large or extremely small: if the cluster coverage is much larger than the query range, the ratio is close to 0; if the query range is much larger than the cluster coverage, the ratio is very large. Using these extreme values directly may make numerical calculations unstable or, when the value is averaged with other metrics, lead to unreasonable weighting because of the large differences in numerical range. The logarithm compresses the values into a more reasonable interval.
In application scenarios, the similarity metric needs non-linear sensitivity to differences in range size. When two ranges are very close in size, the similarity should be close to 1 and change relatively smoothly, so that a slight difference in size does not cause a sharp drop. As the difference between the two ranges grows, the similarity should fall quickly, and when the difference is very large, the similarity should be close to 0. Measuring similarity directly by the ratio or difference of the range lengths changes linearly and cannot meet these requirements. For example, suppose two ranges have lengths of 10 and 11, and a third range has a length of 1000. Using the difference directly, the first pair (10 and 11) differs by 1, while a pair formed with the third range differs by 990 or 989, so the second difference is far larger than the first; what is needed, however, is a similarity value that is much higher for the first pair than for the second. The exponential function provides this non-linear behavior: its value stays close to 1 around a length ratio of 1 (that is, when the two ranges are of equal size) and decreases toward 0 as the ratio moves away from 1. The similarity reaches its maximum value of 1 when the two ranges are exactly the same length; as the difference in length increases and the ratio deviates from 1, the range size similarity decreases and approaches 0.
The second part, $e^{-|\ln(\rho^{\cap}_{ij}/\rho^{\cup}_{ij})|}$, is the data distribution matching degree, where $\rho^{\cap}_{ij}$ is the intersection density and $\rho^{\cup}_{ij}$ is the union density. It measures how well the data density within the intersection of the query range and the cluster coverage range matches the average data density within the union. Similarly, non-linear logarithmic and exponential transformations map the degree of difference in the density ratio to a matching degree between 0 and 1.
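
The accumulation of data volume over fully and partially contained data blocks can be sketched as follows. The text does not say how the data amount of a partially contained block is estimated; this sketch assumes data is spread uniformly over a block's key range, so the estimate is proportional to the overlapping key length. The `Block` tuple layout and the function name are likewise hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical block layout: (start_key, end_key, data_volume).
Block = Tuple[float, float, float]


def data_volume_in_interval(blocks: Iterable[Block], low: float, high: float) -> float:
    """Accumulate the data volume of all blocks that fall within [low, high]."""
    if high <= low:
        return 0.0
    total = 0.0
    for start, end, volume in blocks:
        overlap = min(end, high) - max(start, low)
        if overlap <= 0:
            # Block does not intersect the interval (zero-length blocks are
            # also skipped in this sketch).
            continue
        if start >= low and end <= high:
            # Block completely contained: accumulate its full data volume.
            total += volume
        else:
            # Block partially contained: estimate under the uniform assumption.
            total += volume * overlap / (end - start)
    return total
```

With blocks `(0, 10, 100)`, `(10, 20, 200)`, and `(20, 30, 300)` and the interval `[5, 25]`, the first and third blocks contribute half of their volume and the second contributes all of it, giving 50 + 200 + 150 = 400. The blocks passed in should include blocks from other clusters that intersect the interval, as noted above.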

[0055] This completes the process of obtaining the reward factor for the matching degree between the range size and data distribution by analyzing the matching relationship between the query range size and data distribution.

[0056] Step S4: Obtain the corrected distance metric by combining the range overlap reward factor with the data distribution matching reward factor using Euclidean distance.

[0057] After obtaining the range overlap reward factor and the data distribution matching reward factor, the corrected distance metric is obtained by combining the range overlap reward factor and the data distribution matching reward factor with Euclidean distance. Specifically, for any target query range and target cluster coverage range in the standardized dataset, the result of adding the constant 1, the range overlap reward factor, and the data distribution matching reward factor is used as the denominator, and the constant 1 is used as the numerator. The resulting fraction is used as the distance correction factor. The distance correction factor is used as the weight to perform weighted optimization on the Euclidean distance between the target query range and the target cluster center to obtain the corrected distance metric between the target query range and the target cluster center.

[0058] In one implementation, assume the $j$-th cluster center is $c_{j}$. Then the expression for calculating the corrected distance metric between the $i$-th query range and the $j$-th cluster center is:

[0059] $$D'_{ij} = \frac{D_{ij}}{1 + O_{ij} + R_{ij}}$$

[0060] where $D'_{ij}$ denotes the corrected distance metric between the $i$-th query range and the $j$-th cluster center; $D_{ij}$ denotes the Euclidean distance between the $i$-th query range and the $j$-th cluster center $c_{j}$; $O_{ij}$ denotes the range overlap reward factor of the $i$-th query range and the $j$-th cluster coverage range; and $R_{ij}$ denotes the data distribution matching degree reward factor of the $i$-th query range and the $j$-th cluster coverage range.

[0061] It should be noted that, for the reward mechanism in the above formula to work, the range overlap reward factor and the data distribution matching degree reward factor are both greater than or equal to 0. The sum of the two reward factors is added to the baseline value of 1 to form the comprehensive reward coefficient, and its reciprocal, the correction coefficient, is then multiplied by the Euclidean distance. The logic of this construction is straightforward: the larger the overall reward, the larger the denominator and the smaller the final corrected distance. When the range overlap reward factor is 0 (no overlap reward) and the data distribution matching degree reward factor is 0 (no matching reward), the comprehensive reward coefficient is 1 and the correction coefficient is also 1. In this case the corrected distance equals the original Euclidean distance, which ensures that the correction leaves the original metric unchanged when there is no reward. When the overlap between the query range and the cluster is low, or the data distribution match is low, the correction coefficient is close to 1, so the reduction in Euclidean distance is insignificant, which amounts to a relative penalty for poor matches. When the overlap between the query range and the cluster is high and the data distribution match is high, the comprehensive reward coefficient is large and the correction coefficient is noticeably less than 1, so the Euclidean distance is reduced significantly, which effectively rewards good matches. In this way, by jointly considering range overlap, range size, and data distribution match, the Euclidean distance is corrected so that the relevance between the query range and the cluster is assessed more accurately.
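
The correction described above can be sketched in Python as follows. The overlap evaluation and overlap reward follow the wording of claims 4 and 5; the function names, the `(low, high)` argument layout, and the input check are assumptions made for this sketch.

```python
import math


def overlap_evaluation(q_low: float, q_high: float, c_low: float, c_high: float) -> float:
    """Overlap length divided by the shorter of the two range lengths."""
    start = max(q_low, c_low)            # intersection starting point
    end = min(q_high, c_high)            # intersection ending point
    overlap_length = max(end - start, 0.0)
    return overlap_length / min(q_high - q_low, c_high - c_low)


def range_overlap_reward(evaluation: float) -> float:
    """Exponential amplification: evaluation * e^(evaluation - 1)."""
    return evaluation * math.exp(evaluation - 1.0)


def corrected_distance(euclidean: float, overlap_reward: float, matching_reward: float) -> float:
    """Weight the Euclidean distance by 1 / (1 + overlap_reward + matching_reward)."""
    if overlap_reward < 0 or matching_reward < 0:
        raise ValueError("reward factors must be non-negative")
    correction_coefficient = 1.0 / (1.0 + overlap_reward + matching_reward)
    return euclidean * correction_coefficient
```

With both rewards at 0 the corrected distance equals the Euclidean distance, and with both rewards at 1 the distance is divided by 3; for instance, a Euclidean distance of 6 becomes 2.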

[0062] After obtaining the corrected distance metric between the query range and the cluster center, K-means clustering is performed using the corrected distance metric to obtain the corresponding clustering results. It should be noted that in this embodiment of the invention, the number of clusters is determined by the elbow method.

[0063] This completes the process of obtaining a corrected distance metric by combining the range overlap reward factor with the data distribution matching reward factor using Euclidean distance.

[0064] Step S5 involves analyzing the clustering results based on the modified distance metric to obtain data prefetching strategies and business decision support.

[0065] After K-means clustering is performed based on the corrected distance metric, a set of clusters representing historical query ranges is obtained. Each cluster represents a typical query pattern, including the centroid of the query ranges, its coverage range, and the distribution characteristics of the data blocks within the cluster. Based on these clustering results, a more refined data prefetching strategy is developed. Specifically, when a new query range is received, the corrected distance between it and every cluster center is first calculated, and the clusters closest to it are selected. Then, based on the coverage range and data distribution characteristics of these clusters, the data blocks to prefetch are determined. For example, priority is given to prefetching the data blocks of clusters that overlap heavily with the query range, are similar to it in range size, and match it in data distribution. This prefetching strategy hits the data required by the query more accurately and reduces unnecessary I/O operations.
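
The cluster selection step for a new query can be sketched as follows. The `Cluster` class, the `reward_fn` callback that returns the two reward factors, and the toy values are hypothetical; a real system would compute the reward factors from the cluster coverage range and the data block metadata.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, float]  # query range as (start_key, end_key)


@dataclass(frozen=True)
class Cluster:
    name: str
    center: Vector


# Returns (range overlap reward, data distribution matching degree reward).
RewardFn = Callable[[Vector, Cluster], Tuple[float, float]]


def corrected_distance(query: Vector, cluster: Cluster, reward_fn: RewardFn) -> float:
    overlap_reward, matching_reward = reward_fn(query, cluster)
    euclidean = math.dist(query, cluster.center)
    return euclidean / (1.0 + overlap_reward + matching_reward)


def select_prefetch_clusters(
    query: Vector, clusters: Sequence[Cluster], reward_fn: RewardFn, top_n: int = 1
) -> List[Cluster]:
    """Rank clusters by corrected distance and keep the closest top_n."""
    ranked = sorted(clusters, key=lambda c: corrected_distance(query, c, reward_fn))
    return ranked[:top_n]


# Toy example: cluster B is farther in Euclidean terms but earns full rewards.
clusters = [Cluster("A", (12.0, 22.0)), Cluster("B", (14.0, 24.0))]
toy_rewards = {"A": (0.0, 0.0), "B": (1.0, 1.0)}


def lookup_rewards(query: Vector, cluster: Cluster) -> Tuple[float, float]:
    return toy_rewards[cluster.name]


selected = select_prefetch_clusters((10.0, 20.0), clusters, lookup_rewards)
```

In this toy case cluster A has the smaller Euclidean distance (about 2.83 versus 5.66), but the rewards of cluster B reduce its corrected distance to about 1.89, so B is selected for prefetching.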

[0066] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A data-driven enterprise business management data analysis method, characterized in that the method includes:
Step S1: obtaining a standardized dataset by collecting and preprocessing query logs and data block metadata;
Step S2: obtaining the range overlap reward factor by analyzing the overlap relationship between the query range and the cluster coverage range in the standardized dataset;
Step S3: obtaining the range size and data distribution matching degree reward factor by analyzing the matching relationship between the query range size and data distribution in the standardized dataset;
Step S4: obtaining the corrected distance metric by combining the range overlap reward factor and the range size and data distribution matching degree reward factor with the Euclidean distance;
Step S5: analyzing the clustering results based on the corrected distance metric to obtain data prefetching strategies and a business decision support basis.
The method of obtaining the range size and data distribution matching degree reward factor by analyzing the matching relationship between the query range size and data distribution in the standardized dataset includes: obtaining the range size similarity by performing ratio processing on the query range length data and the cluster coverage range length data; obtaining the intersection interval and union interval between any target query range and target cluster coverage range in the standardized dataset; obtaining the intersection density data by traversing the data blocks involved in the intersection interval and performing ratio processing on the data volume of those data blocks and the length of the intersection interval; obtaining the union density data by traversing the data blocks involved in the union interval and performing ratio processing on the data volume of those data blocks and the length of the union interval; obtaining the data distribution matching degree by performing ratio processing on the intersection density data and the union density data; and obtaining the range size and data distribution matching degree reward factor by fusing the range size similarity and the data distribution matching degree.
The process of obtaining the range size similarity by performing ratio processing on the query range length data and the cluster coverage range length data includes: for any target query range and target cluster coverage range in the standardized dataset, obtaining the length of the target query range and the length of the target cluster coverage range respectively; dividing the length of the target query range by the length of the target cluster coverage range to obtain the length ratio data; performing a logarithmic operation on the length ratio data and taking the absolute value to obtain the range length difference data; and using the range length difference data as the negative exponent input of an exponential function, and using the corresponding exponential mapping result as the range size similarity between the target query range and the target cluster coverage range.
The process of obtaining the data distribution matching degree by performing ratio processing on the intersection density data and the union density data includes: for any target query range and target cluster coverage range in the standardized dataset, obtaining the intersection interval and union interval between them; by traversing the data blocks involved in the intersection interval and the union interval, calculating the total amount of data in the intersection interval and the total amount of data in the union interval respectively; dividing the total amount of data in the intersection interval by the length of the intersection interval to obtain the intersection density data; dividing the total amount of data in the union interval by the length of the union interval to obtain the union density data; and using the result of the ratio processing of the intersection density data and the union density data, followed by logarithmic operation and exponential mapping, as the data distribution matching degree between the target query range and the target cluster coverage range.
The step of fusing the range size similarity and the data distribution matching degree to obtain the range size and data distribution matching degree reward factor includes: for any target query range and target cluster coverage range in the standardized dataset, using the mean of the range size similarity and the data distribution matching degree of the target query range and the target cluster coverage range as the range size and data distribution matching degree reward factor of the target query range and the target cluster coverage range.

2. The data-driven enterprise business management data analysis method according to claim 1, characterized in that the process of obtaining a standardized dataset by collecting and preprocessing query logs and data block metadata includes: collecting query logs from a distributed database system and anonymizing sensitive information in the logs, the query logs including the query time, the query range, the data tables and fields involved in the query, and the amount of data returned; extracting the start and end keys of the query ranges from the query logs and representing each query range as a two-dimensional vector; collecting meta-information of the data blocks in the distributed database and representing each data block by the starting key value of the data block, the ending key value of the data block, and the data volume of the data block; and performing routine data cleaning and standardization processing on the query logs and data block metadata to obtain a standardized dataset for analysis.

3. The data-driven enterprise business management data analysis method according to claim 2, characterized in that, The step of obtaining the range overlap reward factor by analyzing the overlap relationship between the query range and the cluster coverage range includes: By performing intersection analysis on the query range data and the cluster coverage range data, effective overlapping interval data is obtained. Then, by performing a ratio calculation on the effective overlapping interval data and the interval length data, range overlap evaluation data is obtained. Finally, by performing exponential amplification processing on the range overlap evaluation data, range overlap reward factor is obtained.

4. The data-driven enterprise business management data analysis method according to claim 3, characterized in that, The process involves performing intersection analysis on the query range data and the cluster coverage range data to obtain effective overlapping interval data, and then performing a ratio calculation between the effective overlapping interval data and the interval length data to obtain range overlap evaluation data, including: For any target query range in the standardized dataset and any target cluster coverage range in the clustering process, the larger of the minimum value of the target query range and the minimum value of the target cluster coverage range is taken as the intersection starting point, and the smaller of the maximum value of the target query range and the maximum value of the target cluster coverage range is taken as the intersection ending point. The range formed by the intersection starting point and the intersection ending point is taken as the effective overlap interval data between the target query range and the target cluster. The larger value between the result of subtracting the intersection starting point from the intersection ending point and 0 is taken as the interval length of the effective overlap interval between the target query range and the target cluster. The length of the target query range and the length of the target cluster coverage range are obtained accordingly. The length of the effective overlapping interval between the target query range and the target cluster coverage range is used as the numerator, and the smaller of the length of the target query range and the length of the target cluster coverage range is used as the denominator. The calculation result of the corresponding fraction is used as the range overlap evaluation data between the target query range and the target cluster coverage range.

5. The data-driven enterprise business management data analysis method according to claim 3, characterized in that, The process of obtaining a range overlap reward factor by exponentially amplifying the range overlap evaluation data includes: The result of subtracting the constant 1 from the range overlap evaluation data is subjected to an exponential mapping with the natural constant as the base, and the corresponding mapping result is used as the range overlap exponential transformation result; the result of multiplying the range overlap evaluation data and the range overlap exponential transformation result is used as the range overlap reward factor.

6. The data-driven enterprise business management data analysis method according to claim 1, characterized in that, The method of obtaining a corrected distance metric by combining the range overlap reward factor with the range size and data distribution matching degree reward factor using Euclidean distance includes: For any target query range and target cluster coverage range in the standardized dataset, the result of adding constant 1, the range overlap reward factor, and the range size and data distribution matching degree reward factor is used as the denominator, constant 1 is used as the numerator, and the corresponding fraction is used as the distance correction factor. The distance correction factor is used as the weight to perform weighted optimization on the Euclidean distance between the target query range and the target cluster center, and the corrected distance metric between the target query range and the target cluster center is obtained.