Intelligent routing method, device and electronic equipment for data cluster

By introducing a management service module into the HDFS RBF federated architecture, automated routing path creation and verification are achieved, solving the problems of low efficiency and error-prone routing configuration in existing technologies. This improves the efficiency and accuracy of load balancing and routing adjustment, and reduces operation and maintenance costs and the risk of business interruption.

CN122247905A · Pending · Publication Date: 2026-06-19 · DUXIAOMAN TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DUXIAOMAN TECH (BEIJING) CO LTD
Filing Date
2026-03-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In the existing HDFS RBF federated architecture, the user group lifecycle is not linked to routing rules, so routing configuration is slow and error-prone. Load is assessed along a single dimension, so balancing is ineffective. Load balancing and routing adjustment are disconnected, so route changes lag behind load changes and carry high risk, with no safety guarantees.

Method used

A management service module is introduced, which connects with the routing layer, configuration storage layer, and monitoring and alarm layer to periodically obtain the load status of user groups and sub-clusters, automatically create and verify routing paths, realize multi-dimensional load assessment and optimal sub-cluster allocation, support automated verification and rollback, and form a closed-loop process.

Benefits of technology

It improves the efficiency and accuracy of routing configuration in the HDFS RBF federated architecture, reduces configuration time and error rate, shortens the time to discover load issues, and reduces operation and maintenance costs and business interruption risks.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122247905A_ABST

Abstract

This application provides an intelligent routing method, apparatus, and electronic device for data clusters. By adding a management service module to the HDFS RBF federated architecture data cluster, this module connects to the routing layer, configuration storage layer, and monitoring and alarm layer of the HDFS RBF federated architecture. It periodically obtains the load status of all user groups and all sub-clusters. When a new user group exists, it determines an available target sub-cluster for the new user group based on the load status of each sub-cluster. It then creates a new routing path for the new user group according to preset routing rules and verifies its validity. If valid, the newly created routing path is updated in the routing layer and configuration storage layer of the HDFS RBF federated architecture. This eliminates the need for manual configuration of routing rules and automates the verification of the validity of created routing paths, effectively improving the efficiency and accuracy of routing configuration in the HDFS RBF federated architecture.

Description

Technical Field

[0001] This application relates to the field of data cluster management, and in particular to an intelligent routing method, apparatus and electronic device for data clusters.

Background Technology

[0002] HDFS (Hadoop Distributed File System) is a system specifically designed for distributed storage of massive amounts of data. In large-scale HDFS distributed storage scenarios, when a single cluster cannot meet the requirements of multi-tenant isolation and storage scalability, the industry generally adopts the HDFS RBF (Router-Based Federation) federated architecture to achieve unified management of multiple namespaces.

[0003] The HDFS RBF architecture consists of a client, a router component, multiple independent NameNode sub-clusters (each managing its own namespace and data blocks), and a DataNode cluster. All HDFS requests from the client no longer directly access the NameNode sub-clusters; instead, they are uniformly sent to the router component. The router component forwards the HDFS requests to the corresponding backend NameNode sub-clusters according to predefined routing rules, processes them, and then returns the results to the client. Therefore, compared to traditional Block Based Federation, the HDFS RBF federated architecture introduces a centralized routing component (Router) as the unified entry point for the client. This router component maps, directs, and schedules user requests through configured routing rules.

[0004] However, in the existing HDFS RBF architecture, the lifecycle of user groups is not linked to routing rules. Whenever a new user group is created (such as a new team in a business department), HDFS cluster administrators need to manually configure routing rules for the users in that group. Manually configuring routing rules is time-consuming and prone to errors.

Summary of the Invention

[0005] In view of this, embodiments of this application provide an intelligent routing method, apparatus, and electronic device for data clusters to improve the routing configuration efficiency and accuracy of the HDFS RBF federated architecture.

[0006] In a first aspect, embodiments of this application provide an intelligent routing method for a data cluster. The method is applied to a pre-built management service module that is connected to the routing layer, the configuration storage layer, and the monitoring and alarm layer in the HDFS RBF federated architecture. The method includes: obtaining all user groups of the HDFS RBF at preset time intervals, and obtaining the load status of each sub-cluster monitored by the monitoring and alarm layer; determining new user groups by comparing all the user groups with the user groups with configured routes stored in the configuration storage layer; determining a target sub-cluster based on the load status of each sub-cluster, creating a new routing path for the new user group based on the target sub-cluster according to preset routing setting rules, and verifying whether the new routing path is valid; and, if it is valid, updating the new routing path in the routing layer and the configuration storage layer.

[0007] In a second aspect, embodiments of this application provide an intelligent routing apparatus for a data cluster. The apparatus is used to execute the intelligent routing method for the data cluster described in the first aspect, and includes: a monitoring module, used to obtain all user groups of the HDFS RBF at preset time intervals and to obtain the load status of each sub-cluster monitored by the monitoring and alarm layer; an analysis module, used to compare all user groups with the user groups with configured routes stored in the configuration storage layer to determine new user groups; a route creation module, used to determine the target sub-cluster based on the load status of each sub-cluster, create a new routing path for the new user group based on the target sub-cluster according to the preset routing setting rules, and verify whether the new routing path is valid; and a route update module, used to update the new routing path in the routing layer and the configuration storage layer if it is valid.

[0008] In a third aspect, embodiments of this application provide an electronic device, wherein the electronic device includes: a processor; and a memory storing a program; wherein the program includes instructions which, when executed by the processor, cause the processor to perform the intelligent routing method for the data cluster described in the first aspect.

[0009] In a fourth aspect, embodiments of this application provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the intelligent routing method for the data cluster described in the first aspect.

[0010] The beneficial effects of this application are as follows. This application provides an intelligent routing method, apparatus, and electronic device for data clusters. By adding a management service module to the HDFS RBF federated architecture data cluster, this module connects to the routing layer, configuration storage layer, and monitoring and alarm layer of the HDFS RBF federated architecture. It periodically obtains the load status of all user groups and all sub-clusters. When a new user group is created, it determines an available target sub-cluster for the new user group based on the load status of each sub-cluster. Then, according to preset routing rules, it creates a new routing path for the new user group based on the target sub-cluster and verifies the validity of the new routing path. If valid, the newly created routing path is updated in the routing layer and configuration storage layer of the HDFS RBF federated architecture. Thus, when a new user group appears, there is no need for manual configuration of routing rules, and the validity of the created routing path is automatically verified, effectively improving the routing configuration efficiency and accuracy of the HDFS RBF federated architecture.

Attached Figure Description

[0011] Further details, features, and advantages of this application are disclosed in the following description of exemplary embodiments in conjunction with the accompanying drawings, in which:

Figure 1 is a flowchart of an intelligent routing method for data clusters provided in this application;
Figure 2 illustrates the relationship between the management service module provided in this application and the HDFS RBF federated architecture;
Figure 3 is a functional diagram of the management service module provided in this application;
Figure 4 is a schematic diagram of an intelligent routing device for a data cluster provided in this application;
Figure 5 is a structural block diagram of an exemplary electronic device that can be used to implement embodiments of this application.

Detailed Implementation

[0012] Embodiments of this application will now be described in more detail with reference to the accompanying drawings. While some embodiments of this application are shown in the drawings, it should be understood that this application can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this application. It should be understood that the drawings and embodiments of this application are for illustrative purposes only and are not intended to limit the scope of protection of this application.

[0013] It should be understood that the steps described in the method embodiments of this application may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this application is not limited in this respect.

[0014] The term "comprising" and its variations as used herein are open-ended, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the following description. It should be noted that the concepts of "first", "second", etc., mentioned in this application are used only to distinguish different devices, modules, or units, and are not intended to limit the order of functions performed by these devices, modules, or units or their interdependencies.

[0015] It should be noted that the terms "a" and "a plurality of" used in this application are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0016] As described in the background section, compared to traditional Block Based Federation (BPF) architectures, the HDFS RBF federation architecture introduces a centralized routing component (Router) as a unified entry point for clients. This Router component maps, directs, and schedules user requests through configured routing rules. Routine operation and maintenance of the HDFS RBF federation architecture typically involves configuring routing rules using the following methods:

Method 1: HDFS native RBF routing configuration. This method relies on the administrator manually creating routing rules for user groups and sub-cluster namespaces using the command `hdfs dfsadmin -router`. Routing information is stored in ZooKeeper or a local configuration file. ZooKeeper serves as the core coordination hub in the entire federated architecture.

[0017] Method 2: Configure routing using the Fedbalance tool. The Fedbalance tool is used for load balancing of the HDFS federated cluster. An administrator must manually execute Fedbalance -analyze to analyze the load and Fedbalance -execute to perform data migration. However, the tool only supports balancing based on storage capacity and is not linked to routing rules.

[0018] Method 3: Configure routing using the native HDFS Balancer tools. However, HDFS Balancer only supports storage capacity balancing between nodes within a single cluster and cannot adapt to cross-namespace balancing of multiple sub-clusters under a federated architecture.

[0019] Method 4: Manual operation and maintenance monitoring. For example, an administrator manually checks the cluster load through the HDFS Web UI (such as the Datanodes page of the NameNode UI) or the `hdfs dfsadmin -report` command. This approach lacks automated alerting and dynamic adjustment capabilities.

[0020] When the above four methods are used to operate and maintain the HDFS RBF federation architecture, the following four drawbacks typically occur:

Disadvantage 1: Manual route configuration is inefficient and error-prone. Specifically, when a new team is added to a business unit, a new user group is created. For this new user group, the HDFS administrator needs to manually execute 3-5 commands to configure routes. When the number of user groups exceeds 500, a single configuration can take over an hour, and route failures are easily caused by incorrect command parameters (such as incorrect sub-cluster names). The underlying reason is that the existing RBF architecture does not link the user group lifecycle (creation / deletion) with routing rules: it lacks a linkage mechanism of "user group creation → automatic route generation", and it lacks the ability to verify route configurations and clean up redundant ones (such as residual routing rules after user group deletion).

[0021] Disadvantage 2: Limited load assessment dimensions and poor balancing. Specifically, Fedbalance and HDFS Balancer assess load solely based on storage capacity utilization. This often results in balanced capacity but unbalanced performance. For example, a sub-cluster might have normal capacity utilization, but an excessive number of small files (over 1 million) could cause NameNode JVM memory usage to exceed 90%, leading to increased read / write latency. The underlying reason is that existing tools do not incorporate key metrics such as file count, read / write QPS, and NameNode memory utilization into their load monitoring. They therefore cannot assess the performance load of sub-clusters: they address uneven capacity distribution but not the actual load pain points in business operations.

[0022] Disadvantage 3: Load balancing and routing adjustments are disconnected, resulting in a lag. Specifically, after Fedbalance performs data migration, administrators need to manually modify routing rules (e.g., migrating user groups from high-load sub-clusters to low-load sub-clusters). The entire process takes at least 2 hours, during which time services still access the original high-load cluster, failing to alleviate pressure in real time. The underlying reason is that current load balancing tools and routing management are independent, lacking an automated linkage mechanism. The results of balancing operations (such as changes in sub-cluster load) cannot automatically trigger routing adjustments, relying on manual intervention.

[0023] Disadvantage 4: High risk of route changes and lack of security guarantees. Specifically, when administrators manually modify routes, operational errors (such as accidentally deleting routes for critical user groups) can lead to service interruptions. Furthermore, if the old routes cached by clients are not updated promptly after a route change, inconsistencies can arise because the route has been modified while services are still accessing the old cluster. The underlying reason is that the existing HDFS RBF federated architecture lacks atomicity verification (i.e., a route change has only two states: either fully modified or not modified at all) and consistency synchronization mechanisms (such as actively notifying clients to refresh their caches). Operational risks rely entirely on administrator experience.

[0024] In view of this, this application provides an intelligent routing method, apparatus, and electronic device for data clusters, specifically applied to the HDFS RBF federated architecture, for routing management of data clusters under the HDFS RBF federated architecture, so as to improve the routing management efficiency and accuracy of data clusters under the HDFS RBF federated architecture.

[0025] Firstly, this application provides an intelligent routing method for data clusters. This method is applicable to any electronic device equipped with intelligent routing functionality for data clusters, including but not limited to personal mobile terminals, computers, or servers. In some possible embodiments, the method is applied to a pre-built management service module, which is connected to the routing layer, configuration storage layer, and monitoring and alarm layer in the HDFS RBF federated architecture, respectively. As shown in Figure 1, the method includes the following steps:

S11. Obtain all user groups of the HDFS RBF according to a preset time interval, and obtain the load status of each sub-cluster monitored by the monitoring and alarm layer;
S12. Compare all user groups with the user groups with configured routes stored in the configuration storage layer to determine new user groups;
S13. Based on the load status of each sub-cluster, determine the target sub-cluster, create a new routing path for the new user group based on the target sub-cluster according to the preset routing setting rules, and verify whether the new routing path is valid;
S14. If valid, update the new routing path to the routing layer and the configuration storage layer.

[0026] This application adds a management service module to the data cluster of the HDFS RBF federated architecture. This management service module connects to the routing layer, configuration storage layer, and monitoring and alarm layer in the HDFS RBF federated architecture. It periodically obtains the load status of all user groups and all sub-clusters. When a new user group exists, it determines the available target sub-cluster for the new user group based on the load status of each sub-cluster. Then, according to the preset routing rules, it creates a new routing path for the new user group based on the target sub-cluster and verifies the validity of the new routing path. If valid, the newly created routing path is updated in the routing layer and configuration storage layer of the HDFS RBF federated architecture.

[0027] Thus, by using the embodiments of this application, when a new user group appears, there is no need to manually configure routing rules, and the validity of the created routing path can be automatically verified, which can effectively improve the routing configuration efficiency and accuracy of the HDFS RBF federation architecture.

[0028] Steps S11 to S14 are explained in detail below with specific examples. The management service module of this application is a functional module specifically designed for the HDFS RBF federated architecture. It can be integrated into the HDFS RBF federated architecture data cluster to assist HDFS administrators in managing the cluster, especially in routing management and load management. In some possible embodiments, as shown in Figure 2, the HDFS RBF federated architecture includes: an HDFS RBF client (hereinafter referred to as the client), a routing layer, a configuration storage layer, a monitoring and alarm layer, and a sub-cluster layer. The client can be understood as a client application layer: business applications (such as Spark and Hive) can access HDFS through the HDFS client without being aware of the specific sub-cluster.

[0029] The HDFS RBF routing layer receives client requests and forwards them to the corresponding sub-clusters according to the routing rules recorded in the routing layer. Within the HDFS RBF federated architecture, a Namespace is the smallest independent metadata management unit with complete metadata management capabilities. Each Namespace corresponds to a dedicated NameNode cluster. A NameNode cluster can be understood as a sub-cluster, so managing a sub-cluster only requires managing one Namespace. Therefore, in the HDFS RBF federated architecture, one routing path corresponds to one sub-cluster, which is also one Namespace. When the HDFS RBF routing layer forwards client requests, it forwards the client's request to the corresponding Namespace according to the routing path. Each sub-cluster layer is a cluster of nodes that actually store the data.

[0030] The configuration storage layer uniformly stores data related to the distributed coordination of the HDFS RBF federated architecture data cluster, including configuration data, status data, node information, etc. Although the amount of data stored is small, it is a core layer. A common configuration storage layer is a functional layer built on ZooKeeper and its distributed consensus protocol. After each router in the routing layer starts, it automatically registers its own node information (such as IP, port, status) with ZooKeeper. The configuration storage layer then maintains a global state list of router nodes, so that the routing layer can perceive the status of other nodes in the cluster in real time based on this list.

[0031] Because HDFS RBF is a large-scale distributed architecture composed of multiple routers, multiple namespace sub-clusters (NNs), a global DataNode pool, and a ZooKeeper cluster, manual troubleshooting alone cannot provide real-time monitoring of the cluster status. Therefore, setting up a monitoring and alerting layer in the HDFS RBF federated architecture to provide real-time monitoring and alerting for all components and dimensions is the core operational support for ensuring the stable operation of the RBF cluster and quickly locating faults. As one implementation method, this monitoring and alerting layer can be implemented using a combination of Prometheus and Grafana. This layer collects operational metrics for each module, such as route creation success rate and load balancing time, and issues alert messages when the monitored values exceed preset thresholds.

[0032] As shown in Figure 2, the management service module sits between the routing layer on one side and the configuration storage layer, monitoring and alarm layer, and sub-cluster layer on the other. It receives routing rules sent by the routing layer and manages these rules based on data feedback from the configuration storage layer, monitoring and alarm layer, and sub-cluster layer. It can also modify and update routing rules and send the updated rules back to the routing layer to change the routing rules recorded there. In one implementation, this management service module can also be called the management service layer, which is independent of the other layers in HDFS RBF and interacts with them to manage routing rules, requests, and device load.

[0033] As one implementation method, as shown in Figure 3, the management service module comprises four functional units: a user group monitoring unit, a route management unit, a load analysis unit, and an execution engine unit. The user group monitoring unit is used to execute the part of step S11 that retrieves all user groups of the HDFS RBF at preset time intervals, and to execute step S12. That is, by periodically scanning the system user groups of the HDFS RBF federated architecture data cluster, it identifies new user groups and triggers automatic route creation. Different operating systems use different methods to identify new user groups. For example, on Linux systems, all user groups can be retrieved using the Linux command `getent group` and compared with the user groups with configured routes stored in ZooKeeper to identify new user groups (the default scan interval is 5 minutes, which can be adjusted through configuration). When identifying new user groups, the built-in user groups of the HDFS RBF federated architecture data cluster system (such as the administrator user group and the root user group) must be excluded, so that only business user groups are identified. Specifically, on Linux systems, the exclusion list can be set by configuring `rbf.manager.ignore-groups`.

[0034] The operation code (pseudocode, Java) for the user group monitoring unit can be implemented with reference to the code snippet shown below:

    // Core logic for periodically scanning for new user groups
    public void scanNewUserGroups() {
        // 1. Get all user groups in the system
        List<String> allGroups = executeSystemCommand("getent group | cut -d: -f1");
        // 2. Retrieve the user groups with configured routes (read from ZooKeeper)
        List<String> routedGroups = zkClient.getChildren("/rbf/routes");
        // 3. Filter new user groups (exclude system groups and already-routed groups)
        List<String> newGroups = allGroups.stream()
                .filter(group -> !ignoreGroups.contains(group))
                .filter(group -> !routedGroups.contains(group))
                .collect(Collectors.toList());
        // 4. Trigger route creation
        if (!newGroups.isEmpty()) {
            routingManager.createRoutesForNewGroups(newGroups);
            log.info("{} new user groups were found, triggering route creation: {}", newGroups.size(), newGroups);
        }
    }

Furthermore, after the user group monitoring unit identifies a new user group, it sends the list of new user groups to the routing management unit, triggering the routing management unit to execute step S13 to allocate routes to the new user group, that is, to create a new routing path for the new user group.
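The filtering step in the scan pseudocode above is a set difference: all groups, minus ignored groups, minus groups that already have a route. A minimal self-contained sketch follows, assuming the three group lists have already been fetched. The class name `NewGroupFilter` and the method `findNewGroups` are hypothetical and do not appear in the application.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper: computes which user groups still need a route. */
final class NewGroupFilter {

    /**
     * Returns the groups in allGroups that are neither ignored nor already routed,
     * preserving the order of allGroups and dropping duplicates.
     */
    static List<String> findNewGroups(Collection<String> allGroups,
                                      Collection<String> routedGroups,
                                      Collection<String> ignoreGroups) {
        // Hash sets make each membership test constant-time on average.
        Set<String> routed = new HashSet<>(routedGroups);
        Set<String> ignored = new HashSet<>(ignoreGroups);
        return allGroups.stream()
                .filter(group -> !ignored.contains(group))
                .filter(group -> !routed.contains(group))
                .distinct()
                .collect(Collectors.toList());
    }
}
```

Passing the lists in as arguments, rather than reading them inside the method, keeps the comparison testable without a running ZooKeeper or shell.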

[0035] When the routing management unit executes step S13, it first needs to determine a suitable sub-cluster and then bind the suitable sub-cluster to the new user group. At this time, the load analysis unit needs to execute step S11 to obtain the load status of each sub-cluster and determine a suitable or optimal sub-cluster based on the load status of each sub-cluster. Then, the routing management unit will allocate the optimal sub-cluster to the new user group.

[0036] In some possible embodiments, the load analysis unit can determine the load status of each sub-cluster, and allocate the optimal sub-cluster based on that load status, by the following steps: sending a capacity assessment metric request to the monitoring and alarm layer, so that the monitoring and alarm layer obtains the capacity utilization rate, number of files, and memory utilization rate of each sub-cluster based on the request; and determining the load status of each sub-cluster according to a preset load assessment algorithm, based on the capacity utilization rate, number of files, and memory utilization rate of each sub-cluster.

[0037] After receiving the new user group list from the user group monitoring unit, the load analysis unit sends a capacity assessment indicator request to the monitoring and alarm layer, requesting the layer to obtain and upload indicators such as the capacity utilization rate, file quantity, and NameNode memory utilization rate of each sub-cluster in real time. The load analysis unit then determines the load status of each sub-cluster based on these three indicators. As one implementation method, a weighted sum can be performed on the capacity utilization rate, file quantity, and memory utilization rate of each sub-cluster to obtain a comprehensive weighted score for each sub-cluster, i.e., Comprehensive Weighted Score = α * Capacity Utilization Rate + β * File Quantity + θ * Sub-Cluster Memory Utilization Rate. This comprehensive weighted score is then used to determine the load status of each sub-cluster. Here, α is the weighting coefficient for capacity utilization rate, β is the weighting coefficient for file quantity, and θ is the weighting coefficient for sub-cluster memory utilization rate; all three can be set based on practical experience.
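The weighted sum can be sketched as follows. This is only an illustration under stated assumptions: the weight values, the class name `LoadScore`, and the `maxFiles` ceiling used to scale the raw file count into the range 0 to 1 are all hypothetical, since the application leaves the coefficients to practical experience and does not specify how the file quantity is scaled.

```java
/** Hypothetical sketch of the comprehensive weighted score; all constants are assumptions. */
final class LoadScore {
    // Assumed weights for alpha, beta, and theta; not taken from the application.
    static final double ALPHA = 0.4;
    static final double BETA = 0.3;
    static final double THETA = 0.3;

    /**
     * @param capacityUtil used capacity / total capacity, in [0, 1]
     * @param fileCount    number of files in the sub-cluster
     * @param maxFiles     assumed file-count ceiling used to scale fileCount into [0, 1]
     * @param memUtil      NameNode memory utilization, in [0, 1]
     * @return weighted score; higher means more heavily loaded
     */
    static double score(double capacityUtil, long fileCount, long maxFiles, double memUtil) {
        if (maxFiles <= 0) {
            throw new IllegalArgumentException("maxFiles must be positive");
        }
        // Clamp so a sub-cluster above the assumed ceiling cannot push the term past 1.
        double fileRatio = Math.min(1.0, (double) fileCount / maxFiles);
        return ALPHA * capacityUtil + BETA * fileRatio + THETA * memUtil;
    }
}
```

Scaling the file count before weighting keeps the three terms comparable; without it, a raw count in the millions would swamp the two utilization rates.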

[0038] The embodiments of this application go beyond existing tools, which assess load conditions based on capacity alone. They introduce three core indicators: capacity utilization rate, file quantity ratio, and sub-cluster memory utilization rate. By calculating the comprehensive load of each sub-cluster from standardized scores, they avoid the problem of balanced capacity but unbalanced performance. This shortens the discovery time of sub-cluster performance problems from 2 hours to 1 minute, and reduces business read and write latency by 30%.

[0039] As another implementation, after obtaining the real-time load of each sub-cluster, the load analysis unit can determine the optimal sub-cluster according to the following priority. First, sub-clusters whose performance load exceeds the threshold are excluded. For example, if the NameNode memory utilization rate in a sub-cluster is greater than 85% (the set memory utilization threshold), that sub-cluster is excluded and not considered as the target sub-cluster. Next, among the remaining sub-clusters, the one with the lowest capacity load is selected, where capacity utilization rate = used capacity / total capacity, and lower is better. In other words, the sub-cluster with the lowest capacity load is determined as the target sub-cluster, and routing paths are assigned to the new user group on it.
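The two-stage priority above (exclude by memory threshold, then pick the lowest capacity utilization) can be sketched as a filter followed by a minimum. The class and field names below are hypothetical; the 85% threshold is passed in as a parameter rather than hard-coded.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of the priority-based target sub-cluster selection. */
final class TargetSelector {

    /** Snapshot of one sub-cluster's load; field names are illustrative. */
    static final class SubClusterLoad {
        final String namespace;
        final double capacityUtil;   // used capacity / total capacity, 0..1
        final double nameNodeMemPct; // NameNode memory utilization, 0..100

        SubClusterLoad(String namespace, double capacityUtil, double nameNodeMemPct) {
            this.namespace = namespace;
            this.capacityUtil = capacityUtil;
            this.nameNodeMemPct = nameNodeMemPct;
        }
    }

    /**
     * Drops sub-clusters whose NameNode memory utilization is greater than the threshold,
     * then returns the remaining one with the lowest capacity utilization.
     * The result is empty if every sub-cluster is over the threshold.
     */
    static Optional<SubClusterLoad> selectTarget(List<SubClusterLoad> loads, double memThresholdPct) {
        return loads.stream()
                .filter(load -> load.nameNodeMemPct <= memThresholdPct)
                .min(Comparator.comparingDouble((SubClusterLoad load) -> load.capacityUtil));
    }
}
```

Returning an `Optional` forces the caller to handle the case where no sub-cluster qualifies, which the application does not describe.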

[0040] Furthermore, the routing management unit creates new routing paths for the new user group based on the storage path of the target sub-cluster. Specifically, the routing path can be created by following these steps: Step 1: Write routing information (e.g., {"namespace":"ns2","path":"/"}) under the /rbf/routes/<user group> node path in the ZooKeeper configuration storage layer.

[0041] Step 2: Invoke the refreshRouterConfig() interface of the HDFS RBF routing layer to refresh the routing cache of the routing layer.

[0042] Step 3: Verify that the route takes effect (for example, by accessing it with the command "hdfs dfs -ls hdfs://rbf/<user group>/test"). This is the verification part of step S13: the validity of the generated routing path is checked to determine whether the route takes effect. If it takes effect, the route is updated in the routing layer and the configuration storage layer (step S14); in particular, the information of each node in the routing path is updated in the configuration storage layer. In some possible embodiments, if the new routing path is verified to be invalid, the new routing nodes recorded in the configuration storage layer are deleted and the routing rules in the configuration storage layer are restored to the state before the new routing path was created. In this way, the routing nodes in the ZooKeeper configuration storage layer are automatically deleted and the routing path in the routing layer is restored to its state before creation. This ensures automatic rollback of the route and prevents an invalid new routing path from interfering with the request scheduling of the HDFS RBF routing layer.
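The three steps and the rollback on failure can be sketched as a single method. This is a minimal illustration, not the application's implementation: an in-memory map stands in for the ZooKeeper configuration storage layer, a `Runnable` stands in for the router cache refresh, and a `Predicate` stands in for the verification command. The class name `RouteCreator` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of create, refresh, verify, and rollback for one user group. */
final class RouteCreator {
    // Stand-in for the configuration storage layer: node path -> routing information.
    private final Map<String, String> store = new HashMap<>();
    private final Runnable refreshRouter;     // stand-in for refreshing the router cache
    private final Predicate<String> verifier; // returns true if the group's route works

    RouteCreator(Runnable refreshRouter, Predicate<String> verifier) {
        this.refreshRouter = refreshRouter;
        this.verifier = verifier;
    }

    /** Returns true if the route was created and verified, false if it was rolled back. */
    boolean createRoute(String group, String namespace) {
        String node = "/rbf/routes/" + group;
        String previous = store.get(node); // remember the prior state for rollback
        // Step 1: write the routing information.
        store.put(node, "{\"namespace\":\"" + namespace + "\",\"path\":\"/\"}");
        // Step 2: refresh the routing layer so it picks up the new entry.
        refreshRouter.run();
        // Step 3: verify; keep the route only if it takes effect.
        if (verifier.test(group)) {
            return true;
        }
        // Rollback: restore the stored state from before creation, then refresh again.
        if (previous == null) {
            store.remove(node);
        } else {
            store.put(node, previous);
        }
        refreshRouter.run();
        return false;
    }

    boolean hasRoute(String group) {
        return store.containsKey("/rbf/routes/" + group);
    }
}
```

Capturing the previous value before the write is what makes the change behave as all-or-nothing from the caller's point of view: after `createRoute` returns, the store holds either the verified new route or exactly what it held before.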

[0043] By adopting the embodiments of the present application, the automated process of system user group scanning, multi-dimensional load evaluation, optimal sub-cluster allocation, route creation, validity verification, and failure rollback can effectively replace manual configuration. The optimal sub-cluster allocation combines the capacity load and the performance load of the sub-clusters, so that route allocation balances both capacity and performance. The routing configuration time for a new user group can be reduced from 1 hour per user group to 10 seconds per user group, and the configuration error rate can be reduced to 0.

[0044] The routing management unit can be configured according to the following rules, which determine its scanning interval, capacity monitoring threshold, and so on:

<!-- Core configuration of the routing management unit -->
<property>
  <name>rbf.manager.route.scan.interval</name>
  <value>300000</value> <!-- Scanning interval: 5 minutes (milliseconds) -->
</property>
<property>
  <name>rbf.manager.ignore-groups</name>
  <value>HDFS, Root, Spark</value> <!-- Excluded system user groups -->
</property>
<property>
  <name>rbf.manager.perf.threshold.namenode.mem</name>
  <value>85</value> <!-- Sub-cluster memory usage threshold (%); if exceeded, the sub-cluster is excluded. -->
</property>

The function of the load analysis unit is to collect multi-dimensional load indicators of each sub-cluster to obtain the load status of each sub-cluster, and then to perform load balancing analysis on each sub-cluster based on this load status, that is, to evaluate from the collected load indicators whether load balancing is required and to judge whether there are risky sub-clusters with load imbalance. A risky sub-cluster is one whose load has exceeded the set threshold. For example, if the capacity utilization of a sub-cluster is greater than 85%, it can be determined that this sub-cluster is a risky sub-cluster with load imbalance. In this document, the other sub-clusters, whose load is balanced, are defined as non-risky sub-clusters. When a risky sub-cluster exists, the resources of the sub-clusters need to be adjusted to reduce the load of the risky sub-clusters and to avoid the reduced service life caused by overloading them.

[0045] In some possible embodiments, the configuration storage layer stores a load evaluation metric configuration file, which specifies the weights of each load metric. Based on this, the method further includes: The system receives and parses the load assessment metric configuration file, and determines the load status of each sub-cluster based on the load metric weights specified in the configuration file.

[0046] As one implementation, the load analysis unit can obtain the used capacity, total capacity, and capacity utilization through the command "hdfs dfsadmin -report <subcluster>", thereby obtaining the capacity metrics of each sub-cluster. The load analysis unit can also obtain the file count, read/write QPS, and JVM memory usage through the NameNode JMX interface (http://<nn-host>:50070/jmx), thereby obtaining the performance metrics of each sub-cluster. Furthermore, data can be collected at the collection frequency set for the load analysis unit in the load assessment metric configuration file. For example, setting the collection frequency to once every 5 minutes for capacity metrics and once every 1 minute for performance metrics allows performance changes to be monitored at high frequency.

[0047] Furthermore, based on the weights specified in the load assessment metric configuration file, the comprehensive load score of each sub-cluster is calculated using a pre-set load assessment algorithm. Specifically, the comprehensive load score can be calculated using the formula: Comprehensive Load Score = 0.4 * Capacity Utilization + 0.3 * File Quantity Ratio + 0.3 * Sub-cluster Memory Utilization. The weights in the formula are for illustrative purposes only, and the corresponding weight coefficients in the load assessment metric configuration file can be flexibly set according to actual needs.
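As a worked illustration with assumed input values (they are not taken from the application), suppose a sub-cluster has a capacity utilization of 70, a file quantity ratio of 50, and a sub-cluster memory utilization of 60, each expressed on a 0-100 scale. With the illustrative weights above, the comprehensive load score S works out to:

```latex
\[
S = 0.4 \times 70 + 0.3 \times 50 + 0.3 \times 60 = 28 + 15 + 18 = 61
\]
```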

[0048] By using the embodiments of this application, the weights of the indicators in the comprehensive load assessment can be adjusted through the load assessment metric configuration file (e.g., when the business is more focused on performance, the NameNode memory weight can be increased to 50%), and the excluded user groups can be customized (e.g., system groups and core business groups are excluded from automatic routing adjustment). The method can therefore adapt to different business scenarios, such as big data analysis scenarios and real-time computing scenarios, and flexibility is significantly improved.
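A minimal sketch of configurable weights follows, assuming they are read from a java.util.Properties object under the rbf.manager.load.weight.* keys used in the code of this application. The LoadWeights record is hypothetical, and rescaling the weights so that they sum to 1 is an assumption of this sketch, not something the application specifies. Records require Java 16 or later.

```java
import java.util.Properties;

/** Hypothetical holder for the three weights of the comprehensive load score. */
record LoadWeights(double capacity, double fileCount, double nnMemory) {

    /** Reads the three weights (defaults 0.4 / 0.3 / 0.3) and rescales them to sum to 1. */
    static LoadWeights from(Properties props) {
        double c = read(props, "rbf.manager.load.weight.capacity", "0.4");
        double f = read(props, "rbf.manager.load.weight.file", "0.3");
        double m = read(props, "rbf.manager.load.weight.nnmem", "0.3");
        double sum = c + f + m;
        if (c < 0 || f < 0 || m < 0 || sum <= 0) {
            throw new IllegalArgumentException("weights must be non-negative and not all zero");
        }
        // Assumption of this sketch: rescale so that the score stays on the 0-100 scale.
        return new LoadWeights(c / sum, f / sum, m / sum);
    }

    private static double read(Properties props, String key, String defaultValue) {
        return Double.parseDouble(props.getProperty(key, defaultValue).trim());
    }

    /** Weighted sum of three indicator scores, each expected on a 0-100 scale. */
    double score(double capacityScore, double fileScore, double nnMemScore) {
        return capacity * capacityScore + fileCount * fileScore + nnMemory * nnMemScore;
    }
}
```

For a performance-focused scenario, setting the nnmem key to 0.5 and the other two keys to 0.25 shifts the score toward NameNode memory without changing any code.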

[0049] If the difference between the comprehensive load scores of the sub-clusters exceeds the set difference threshold, it can be determined that a load imbalance exists, that is, there is at least one risky sub-cluster with an unbalanced load. Sub-clusters whose comprehensive load score is greater than the set score threshold are then identified as risky sub-clusters, and the remaining sub-clusters are identified as non-risky sub-clusters.

[0050] The load analysis unit can calculate the comprehensive load score and determine whether the load is unbalanced using the following code:

// Requires: java.util.Collections, java.util.List, java.util.stream.Collectors
// Calculate the comprehensive load score of the sub-cluster
public double calculateComprehensiveLoad(NamespaceMetrics metrics) {
    // 1. Normalize each indicator (convert to a score of 0-100)
    double capacityScore = metrics.getCapacityUsedRatio() * 100; // Capacity utilization (0-100)
    // Cast to double so the division is not truncated if both counts are integers
    double fileRatio = (double) metrics.getFileCount() / totalFileCount; // Share of the total file count
    double fileScore = fileRatio * 100; // File count score (0-100)
    double nnMemScore = metrics.getNnJvmMemUsedRatio() * 100; // NameNode memory score (0-100)
    // 2. Calculate the comprehensive score based on weights (weights can be adjusted through configuration)
    double weightCapacity = conf.getDouble("rbf.manager.load.weight.capacity", 0.4);
    double weightFile = conf.getDouble("rbf.manager.load.weight.file", 0.3);
    double weightNnMem = conf.getDouble("rbf.manager.load.weight.nnmem", 0.3);
    return capacityScore * weightCapacity + fileScore * weightFile + nnMemScore * weightNnMem;
}

// Determine whether there is a load imbalance
public boolean isImbalanced(List<NamespaceMetrics> allMetrics) {
    if (allMetrics.isEmpty()) {
        return false; // No sub-clusters to compare
    }
    List<Double> scores = allMetrics.stream()
            .map(this::calculateComprehensiveLoad)
            .collect(Collectors.toList());
    double maxScore = Collections.max(scores);
    double minScore = Collections.min(scores);
    // Score difference = (highest score - lowest score) / average score; exceeding the threshold indicates imbalance
    double avgScore = scores.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    if (avgScore == 0) {
        return false; // All scores are zero, so there is nothing to balance
    }
    double diffRatio = (maxScore - minScore) / avgScore;
    return diffRatio > conf.getDouble("rbf.manager.load.imbalance.threshold", 0.2);
}

Furthermore, the risky sub-clusters are aggregated into an unbalanced sub-cluster list, and this list, together with the user groups belonging to the sub-clusters in it, is sent to the execution engine unit in the management service module. The execution engine unit then performs data migration and adjusts the load of each sub-cluster.
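The aggregation of risky sub-clusters into a list can be sketched as follows. The RiskClassifier class, the map of namespace to comprehensive load score, and the strict "greater than" comparison against the score threshold are illustrative assumptions; the application does not show this step in code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RiskClassifier {
    private RiskClassifier() {}

    /**
     * Returns the namespaces whose comprehensive load score is greater than the
     * score threshold, sorted by namespace name so the result is deterministic.
     */
    static List<String> riskySubClusters(Map<String, Double> scoreByNamespace, double scoreThreshold) {
        List<String> risky = new ArrayList<>();
        for (Map.Entry<String, Double> entry : new TreeMap<>(scoreByNamespace).entrySet()) {
            if (entry.getValue() > scoreThreshold) {
                risky.add(entry.getKey());
            }
        }
        return risky;
    }
}
```

Every namespace not returned by riskySubClusters is treated as non-risky and is therefore a candidate migration target.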

[0051] In some possible implementations, the HDFS RBF federated architecture also includes a Fedbalance tool. This tool can be invoked to perform load balancing operations and automatically update routing rules, thus forming a closed loop of load balancing and routing adjustment. Specifically, the Fedbalance tool can be invoked to migrate some data from risky sub-clusters to non-risky sub-clusters based on the load status of each sub-cluster.

[0052] Specifically, the Fedbalance tool can generate data migration suggestions based on the load imbalance notification messages issued by the load analysis unit, and verify the rationality of the generated suggestions, for example whether the target sub-cluster has sufficient capacity and how the load status of the target sub-cluster will change after the data migration. Invalid migration suggestions are filtered out, and valid ones are executed. For example, the execution engine unit can achieve load balancing by performing the following steps: Step A: Invoke the Fedbalance tool (its path is specified by the configuration item "rbf.manager.fedbalance.path") and execute fedbalance -analyze to obtain migration suggestions (such as "migrate user group group1 from ns1 to ns2").

[0053] Step B: Verify the rationality of the migration suggestions (e.g., whether the target sub-cluster has sufficient space) and filter out invalid suggestions.

[0054] Step C: Execute fedbalance -execute to complete the data migration and monitor the migration progress (via fedbalance -status).
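Step B can be sketched as follows. This is an assumption-laden illustration: the MigrationSuggestion and NamespaceCapacity records, the bytesToMove field, and the rule "projected utilization of the target must not exceed a maximum" are hypothetical, since the application only states that the target sub-cluster must have sufficient space. Records require Java 16 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical migration suggestion: move one user group's data between namespaces. */
record MigrationSuggestion(String userGroup, String fromNs, String toNs, long bytesToMove) {}

/** Hypothetical capacity snapshot of one namespace. */
record NamespaceCapacity(long usedBytes, long totalBytes) {}

final class SuggestionFilter {
    private SuggestionFilter() {}

    /**
     * Keeps only suggestions whose target namespace is known and would stay at or
     * below maxUtilizationAfter once the data has arrived. Accepted suggestions
     * are accumulated, so several moves into one target are checked together.
     */
    static List<MigrationSuggestion> filterValid(List<MigrationSuggestion> suggestions,
                                                 Map<String, NamespaceCapacity> capacityByNs,
                                                 double maxUtilizationAfter) {
        Map<String, Long> projectedUsed = new HashMap<>();
        List<MigrationSuggestion> valid = new ArrayList<>();
        for (MigrationSuggestion s : suggestions) {
            NamespaceCapacity target = capacityByNs.get(s.toNs());
            if (target == null || target.totalBytes() <= 0) {
                continue; // unknown or empty target: treat the suggestion as invalid
            }
            long used = projectedUsed.getOrDefault(s.toNs(), target.usedBytes()) + s.bytesToMove();
            if ((double) used / target.totalBytes() <= maxUtilizationAfter) {
                projectedUsed.put(s.toNs(), used);
                valid.add(s);
            }
        }
        return valid;
    }
}
```

Tracking the projected usage per target prevents two individually acceptable suggestions from jointly overfilling the same sub-cluster.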

[0055] After data migration is complete, the routes for the migrated user group can be updated from the initial sub-cluster to the destination sub-cluster by calling the updateRoute() interface of the routing management unit. The migration results are then fed back, including whether the migration was successful, the amount of data migrated, the number of route updates, and route update details. If the migration fails, an alarm notification message is triggered, allowing administrators to intervene manually in a timely manner.

[0056] The execution code of the execution engine unit can be referenced as follows to achieve load balancing:

// Requires: java.io.IOException, java.nio.charset.StandardCharsets, java.util.ArrayList, java.util.List
// Call Fedbalance to perform the balancing operation
public boolean executeFedBalance(List<RebalanceRecommendation> recommendations)
        throws IOException, InterruptedException {
    // 1. Build the fedbalance command (migrating only the recommended user groups)
    List<String> command = new ArrayList<>();
    command.add(conf.get("rbf.manager.fedbalance.path", "/usr/bin/fedbalance"));
    command.add("-execute");
    // Add the recommended migration rules (e.g., --migrate group1:ns1→ns2)
    for (RebalanceRecommendation rec : recommendations) {
        command.add("--migrate");
        command.add(String.format("%s:%s→%s", rec.getUserGroup(), rec.getOldNs(), rec.getNewNs()));
    }
    // 2. Execute the command and capture its output (stderr is merged into stdout)
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.redirectErrorStream(true);
    Process process = pb.start();
    String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
    int exitCode = process.waitFor();
    // 3. Verify the execution results
    if (exitCode == 0 && output.contains("Migration completed successfully")) {
        log.info("Fedbalance operation successful, output: {}", output);
        // 4. Automatic route updates
        for (RebalanceRecommendation rec : recommendations) {
            routingManager.updateRoute(rec.getUserGroup(), rec.getNewNs());
        }
        return true;
    } else {
        log.error("Fedbalance operation failed, exit code: {}, output: {}", exitCode, output);
        // Trigger alarm
        alertService.sendAlert("Fedbalance execution failed", output);
        return false;
    }
}

In this way, end-to-end automated processing of load balancing analysis, data migration, route updates, and cache refresh can be achieved without manual intervention. The total time from problem discovery to problem resolution in load balancing is significantly reduced, and the manpower cost of operation and maintenance is reduced accordingly.

[0057] In some possible embodiments, the method provided in this application further includes: when a route is created and/or updated, the node lock of the configuration storage layer is invoked to lock the target route of the operation; once the route is created and/or updated, the route configuration interface in the routing layer is called to synchronously refresh the client's cached data.

[0058] Specifically, during route creation and/or updates, the target route can be locked by means of node locks in the ZooKeeper configuration storage layer to ensure the atomicity of operations on the target route, that is, to ensure that only one operation is allowed on the same user group's route at a time. A complete set of processing logic (pre-creation backup, in-creation verification, post-creation validation, and failure rollback) guarantees route security. In addition, after a route change, the refreshRouterConfig() interface of the RBF routing layer is proactively called to synchronously refresh the client cache, avoiding inconsistencies. In this way, the service interruption rate caused by route changes can be reduced from 5% to 0%, and the cache inconsistency problem is completely resolved.
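The "one operation per user group route at a time" rule can be illustrated with the sketch below. It deliberately swaps in an in-process java.util.concurrent lock per user group as a stand-in for the ZooKeeper node lock, so it only serializes threads inside a single JVM; the RouteLocks class and its method name are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** In-process stand-in for the node lock of the configuration storage layer. */
final class RouteLocks {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /**
     * Runs the operation while holding the lock of the given user group's route,
     * so that at most one create or update runs on that route at a time.
     * Operations on different user groups do not block each other.
     */
    <T> T withRouteLock(String userGroup, Supplier<T> operation) {
        ReentrantLock lock = locks.computeIfAbsent(userGroup, g -> new ReentrantLock());
        lock.lock();
        try {
            return operation.get();
        } finally {
            lock.unlock(); // always release, even if the operation throws
        }
    }
}
```

A route change would be wrapped as locks.withRouteLock("group1", () -> ...), with the cache refresh performed inside the same critical section; in a multi-process deployment a distributed lock held in ZooKeeper would have to replace the in-process lock.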

[0059] Through the above process, a complete closed-loop workflow can be achieved. The load analysis unit evaluates the load status of each sub-cluster at intervals t. If an imbalance is detected, a list of unbalanced sub-clusters is generated together with a migration instruction (for example, migrate user groups group1 and group2 from Namespace1 to Namespace2), and the migration instruction is sent to the execution engine unit. The execution engine unit then calls the Fedbalance tool to perform the data migration. After the migration is complete, the routing management unit is notified. The routing management unit updates the routing rules of the corresponding user groups, changing the original routing rule Namespace1 to the new routing rule Namespace2, and refreshes the routing layer cache. Subsequently, when a client requests access to the data of group1 or group2, the request is automatically forwarded to Namespace2 according to the updated routing rules, thus alleviating the load on Namespace1.

[0060] Secondly, this application provides an intelligent routing device for a data cluster, which is used to execute the method described in the first aspect and can be integrated into a management service module. As shown in Figure 4, the device 40 includes:
The monitoring module 401, which is used to obtain all user groups of the HDFS RBF at preset time intervals and to obtain the load status of each sub-cluster monitored by the monitoring and alarm layer;
The analysis module 402, which is used to compare all user groups with the user groups of configured routes stored in the configuration storage layer to determine new user groups;
The route creation module 403, which is used to determine the target sub-cluster based on the load status of each sub-cluster, create a new route path for the new user group based on the target sub-cluster according to the preset route setting rules, and verify whether the new route path is valid;
The route update module 404, which is used to update the new route path to the routing layer and the configuration storage layer if it is valid.
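The comparison performed by the analysis module amounts to a set difference, which can be sketched as follows. The NewGroupDetector class is hypothetical, and both the exact (case-sensitive) name matching and the removal of ignored system groups at this stage are assumptions of the sketch.

```java
import java.util.Set;
import java.util.TreeSet;

final class NewGroupDetector {
    private NewGroupDetector() {}

    /**
     * Returns the user groups that exist in the cluster but have no configured
     * route and are not on the ignore list, sorted by name.
     */
    static Set<String> findNewGroups(Set<String> allGroups, Set<String> routedGroups, Set<String> ignoredGroups) {
        Set<String> result = new TreeSet<>(allGroups);
        result.removeAll(routedGroups);  // groups that already have a route
        result.removeAll(ignoredGroups); // excluded system user groups
        return result;
    }
}
```

Each group returned here would then go through target sub-cluster selection and route creation.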

[0061] In some possible embodiments, the route update module is further configured to: If the new routing path is invalid, delete the new routing node recorded in the configuration storage layer, and restore the routing rules in the configuration storage layer to the state before the creation of the new routing path.

[0062] In some possible embodiments, the monitoring module is further configured to: Send a capacity assessment metric request to the monitoring and alarm layer so that the monitoring and alarm layer can obtain the capacity utilization rate, number of files, and memory utilization rate of each sub-cluster based on the capacity assessment metric request; Based on the capacity utilization, number of files, and memory utilization of each sub-cluster, the load status of each sub-cluster is determined according to a preset load assessment algorithm.

[0063] In some possible embodiments, the HDFS RBF federated architecture also includes the Fedbalance tool, and the analysis module is further used for: Based on the load status of each sub-cluster, a load balancing analysis is performed on each sub-cluster to determine whether there are any risky sub-clusters with unbalanced loads. If so: The Fedbalance tool is invoked to migrate data from the risky sub-clusters to the non-risky sub-clusters where there is no risk imbalance, based on the load status of each sub-cluster. The initial routes recorded in the routing layer and the configuration storage layer are updated to destination routes, wherein the initial routes are the routes of the migrated data before the migration, and the destination routes are the routes of the migrated data after the migration.

[0064] In some possible embodiments, the HDFS RBF federation architecture also includes a client, and the route creation module is further configured to: When a route is created, the node lock of the configuration storage layer is invoked to lock the target route of the operation; Once the route is created, the route configuration interface in the routing layer is called to synchronously refresh the cached data on the client.

[0065] In some possible embodiments, the route update module is further configured to: During route updates, the node lock of the configuration storage layer is invoked to lock the target route of the operation; Once the route update is complete, the route configuration interface in the routing layer is called to synchronously refresh the cached data on the client.

[0066] In some possible embodiments, the analysis module is further configured to: The system receives and parses the load assessment metric configuration file, and determines the load status of each sub-cluster based on the load metric weights specified in the configuration file.

[0067] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in this application comply with relevant laws and regulations and do not violate public order and good morals.

[0068] The names of the messages or information exchanged between multiple devices in the embodiments of this application are for illustrative purposes only and are not intended to limit the scope of these messages or information.

[0069] Thirdly, exemplary embodiments of this application also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to cause the electronic device to perform a method according to an embodiment of this application.

[0070] An exemplary embodiment of this application also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a computer's processor, is used to cause the computer to perform a method according to an embodiment of this application.

[0071] An exemplary embodiment of this application also provides a computer program product, including a computer program, wherein, when executed by a computer's processor, the computer program is used to cause the computer to perform a method according to an embodiment of this application.

[0072] Referring to Figure 5, a structural block diagram of an electronic device 500 that can serve as a server or client of this application is now described; it is an example of a hardware device that can be applied to various aspects of this application. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the application described and/or claimed herein.

[0073] As shown in Figure 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes based on a computer program stored in a read-only memory (ROM) 502 or a computer program loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the electronic device 500. The computing unit 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

[0074] Multiple components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506, an output unit 507, a storage unit 508, and a communication unit 509. The input unit 506 can be any type of device capable of inputting information to the electronic device 500. The input unit 506 can receive input digital or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 507 can be any type of device capable of presenting information and may include, but is not limited to, a display, speaker, video/audio output terminal, vibrator, and/or printer. The storage unit 508 may include, but is not limited to, a disk and an optical disk. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth™ devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

[0075] The computing unit 501 can be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above. For example, in some embodiments, the aforementioned intelligent routing method for data clusters can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program can be loaded and/or installed on the electronic device 500 via ROM 502 and/or communication unit 509. In some embodiments, the computing unit 501 can be configured to perform the aforementioned intelligent routing method for data clusters by any other suitable means (e.g., by means of firmware).

[0076] The program code used to implement the methods of this application may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0077] In the context of this application, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0078] As used in this application, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (e.g., disk, optical disk, memory, programmable logic device (PLD)) for providing machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal for providing machine instructions and/or data to a programmable processor.

[0079] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0080] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0081] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other.

Claims

1. An intelligent routing method for a data cluster, characterized in that the method is applied to a pre-built management service module, which is connected to the routing layer, the configuration storage layer, and the monitoring and alarm layer in the HDFS RBF federated architecture, and the method includes:
obtaining all user groups of the HDFS RBF at preset time intervals, and obtaining the load status of each sub-cluster monitored by the monitoring and alarm layer;
determining new user groups by comparing all the user groups with the user groups with configured routes stored in the configuration storage layer;
determining the target sub-cluster based on the load status of each sub-cluster, creating a new routing path for the new user group based on the target sub-cluster according to the preset routing setting rules, and verifying the validity of the new routing path;
if the new routing path is valid, updating the new routing path to the routing layer and the configuration storage layer.

2. The intelligent routing method according to claim 1, characterized in that, The method further includes: If the new routing path is invalid, delete the new routing node recorded in the configuration storage layer, and restore the routing rules in the configuration storage layer to the state before the creation of the new routing path.

3. The intelligent routing method according to claim 1, characterized in that, The method further includes: Send a capacity assessment metric request to the monitoring and alarm layer so that the monitoring and alarm layer can obtain the capacity utilization rate, number of files, and memory utilization rate of each sub-cluster based on the capacity assessment metric request; Based on the capacity utilization, number of files, and memory utilization of each sub-cluster, the load status of each sub-cluster is determined according to a preset load assessment algorithm.

4. The intelligent routing method according to claim 1, characterized in that, The HDFS RBF federated architecture also includes the Fedbalance tool, and the method further includes: Based on the load status of each sub-cluster, a load balancing analysis is performed on each sub-cluster to determine whether there are any risky sub-clusters with unbalanced loads. If so: The Fedbalance tool is invoked to migrate data from the risky sub-clusters to the non-risky sub-clusters where there is no risk imbalance, based on the load status of each sub-cluster. The initial routes recorded in the routing layer and the configuration storage layer are updated to destination routes, wherein the initial routes are the routes of the migrated data before the migration, and the destination routes are the routes of the migrated data after the migration.

5. The intelligent routing method according to claim 1, characterized in that, The HDFS RBF federated architecture also includes a client, and the method further includes: When a route is created and / or updated, the node lock of the configuration storage layer is invoked to lock the target route of the operation; Once the route is created and / or updated, the route configuration interface in the routing layer is called to synchronously refresh the client's cached data.

6. The intelligent routing method according to claim 1, characterized in that, The method further includes: The system receives and parses the load assessment metric configuration file, and determines the load status of each sub-cluster based on the load metric weights specified in the configuration file.

7. An intelligent routing device for a data cluster, characterized in that the apparatus is used to perform the method as described in any one of claims 1 to 6, the apparatus comprising:
a monitoring module, used to obtain all user groups of the HDFS RBF at preset time intervals and to obtain the load status of each sub-cluster monitored by the monitoring and alarm layer;
an analysis module, used to compare all user groups with the user groups of configured routes stored in the configuration storage layer to determine new user groups;
a route creation module, used to determine the target sub-cluster based on the load status of each sub-cluster, create a new route path for the new user group based on the target sub-cluster according to the preset route setting rules, and verify whether the new route path is valid;
a route update module, used to update the new route path to the routing layer and the configuration storage layer if it is valid.

8. The apparatus according to claim 7, characterized in that, The routing update module is also used for: If the new routing path is invalid, delete the new routing node recorded in the configuration storage layer, and restore the routing rules in the configuration storage layer to the state before the creation of the new routing path.

9. An electronic device, characterized in that, The electronic device includes: a processor and a memory storing a program; wherein the program includes instructions that, when executed by the processor, cause the processor to perform the method according to any one of claims 1-6.

10. A non-transitory computer-readable storage medium storing computer instructions, characterized in that, The computer instructions are used to cause the computer to perform the method according to any one of claims 1-6.