A disk cleaning method and device, electronic equipment and storage medium
By obtaining disk usage information from Kafka cluster nodes, determining the target state, and performing partition migration, the zombie state problem caused by insufficient disk space on server nodes was resolved, and the stable operation of the cluster was achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING KINGSOFT CLOUD NETWORK TECH CO LTD
- Filing Date
- 2024-12-30
- Publication Date
- 2026-06-30
AI Technical Summary
In a Kafka cluster, if a server node has a large number of partitions, even if the partitions do not exceed the threshold in terms of space usage and duration, insufficient disk space may prevent the node from writing data, potentially leading to a zombie state and cluster failure.
By obtaining disk usage information from each server node, the nodes in the target non-transfer state are identified, and their partitions are transferred to the disks of the nodes in the target transfer state to complete the cleanup. Specifically, this includes priority transfer from replica partitions and emergency cleanup in case of emergencies.
It effectively avoids the zombie state of server nodes, prevents Kafka cluster failures, and enables timely disk cleanup and data transfer.
Figure CN122308705A_ABST
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a disk cleaning method, apparatus, electronic device, and storage medium.
Background Technology
[0002] In a distributed messaging system (Kafka cluster), the partitions of all message sets (topics) on each server node (broker) are persisted to the server node's disk. However, since the server node's disk space is limited, and considering that the Kafka cluster is primarily a message queue component with message data having inherent expiration characteristics, it is necessary to frequently clean up the server node's disk to prevent it from filling up.
[0003] Currently, the default disk cleanup policy for server nodes in a Kafka cluster is based on time and the disk space occupied by partitions. That is, if the disk space occupied by a partition on a server node exceeds a certain threshold or the duration of a partition on the server node's disk exceeds a certain threshold, the cleanup policy of that server node will be triggered, thereby cleaning up the partitions on that server node's disk.
[0004] However, a particular server node may hold a large number of partitions, none of which has yet exceeded the space threshold or the duration threshold. If the remaining disk space of such a server node becomes insufficient, the node may be unable to write data while still being able to receive external requests. This may leave the server node in a zombie state, which in turn may lead to the failure of the entire Kafka cluster.
Summary of the Invention
[0005] In view of this, in order to solve the above-mentioned technical problems or some of the technical problems, embodiments of this application provide a disk cleanup method, apparatus, electronic device and storage medium.
[0006] In a first aspect, this application provides a disk cleanup method applied to a Kafka cluster, the Kafka cluster comprising a set of server nodes, the method comprising:
[0007] Obtain disk usage information for the disks of each server node in the server node set;
[0008] For each server node in the server node set, the target state of the server node is determined based on the disk usage information corresponding to the disk of the server node;
[0009] When the server node set contains a first server node whose target state is the non-transfer state and at least one second server node whose target state is the transfer state, the partitions of the first server node are transferred from the disk of the first server node to the disks of the second server nodes to complete the cleanup of the disk of the first server node.
[0010] The non-transfer state indicates that other partitions are not allowed to be transferred to the disk of the first server node, while the transfer state indicates that other partitions are allowed to be transferred to the disk of the second server node.
[0011] In an optional implementation, the step of transferring the partitions of the first server node from the disk of the first server node to the disks of each of the second server nodes to complete the cleanup of the disks of the first server node includes:
[0012] At least one first partition is determined from all partitions of the first server node, wherein each first partition is a follower (slave) replica;
[0013] The first partitions of the first server node are transferred from the disk of the first server node to the disk of the second server node to complete the cleanup of the disk of the first server node.
[0014] In an optional implementation, the step of transferring each of the first partitions of the first server node from the disk of the first server node to the disk of each of the second server nodes to complete the disk cleanup of the first server node includes:
[0015] For each of the first partitions of the first server node, determine the first occupied space of the disk of the first server node occupied by the first partition;
[0016] The following steps are performed on each of the first partitions in the first server node in descending order of the first occupied space:
[0017] The first partition is transferred from the disk of the first server node to the disks of each of the second server nodes;
[0018] After the transfer of the first partition is completed, if it is determined that the target state of the first server node is a transfer state, then it is determined that the disk cleanup of the first server node is completed.
[0019] After the transfer of the first partition is completed, if it is determined that the target state of the first server node is still in a non-transfer state and there are still first partitions to be transferred in the first server node, the transfer process for the next first partition continues until the disk cleanup of the first server node is completed.
[0020] In an optional implementation, determining that the target state of the first server node is the transfer state includes:
[0021] Obtain the disk usage information corresponding to the disk of the first server node, wherein the disk usage information includes disk utilization rate;
[0022] If the disk utilization rate corresponding to the disk of the first server node is less than the preset utilization rate threshold, and is still less than the preset utilization rate threshold after a first preset time period, the target state of the first server node is determined to be the transfer state.
[0023] The preset utilization threshold is used to switch the target state of the server node.
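As an illustration only, the delayed check described in [0022] could be sketched in Python as follows. The function name, the injected `read_utilization` and `sleep` callables, and the use of seconds for the first preset time period are assumptions made for this sketch and are not taken from the application.

```python
import time
from typing import Callable


def has_returned_to_transfer_state(
    read_utilization: Callable[[], float],
    threshold: float,
    first_preset_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if utilization is below the threshold now
    and is still below it after the first preset time period."""
    if read_utilization() >= threshold:
        return False  # still at or above the threshold: stay in the non-transfer state
    sleep(first_preset_seconds)  # wait for the first preset time period
    return read_utilization() < threshold  # second sample confirms the switch
```

Requiring two samples separated by a waiting period keeps a node from flipping back to the transfer state on a momentary dip in utilization.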
[0024] In an optional implementation, after the transfer of the first partition is completed, the method further includes:
[0025] If it is determined that the target state of the first server node is still a non-transfer state and there is no first partition to be transferred in the first server node, obtain the remaining disk space corresponding to the disk of the first server node;
[0026] When the remaining disk space is less than or equal to a first space threshold, write operations on the first server node are blocked. The first space threshold represents the upper limit of the remaining disk space at which emergency cleanup of the server node's disk is required.
[0027] When it is determined that the Kafka cluster has started emergency cleanup mode, emergency cleanup is performed on the partition of the first server node in the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0028] In an optional implementation, after performing the step of determining the target state of each server node in the server node set based on the disk usage information corresponding to the disk of the server node, the method further includes:
[0029] When the server node set contains a first server node whose target state is the non-transfer state and no second server node whose target state is the transfer state, the following operations are performed for each first server node:
[0030] Obtain the remaining disk space corresponding to the disk of the first server node;
[0031] When the remaining disk space is less than or equal to a first space threshold, write operations on the first server node are blocked. The first space threshold represents the upper limit of the remaining disk space at which emergency cleanup of the server node's disk is required.
[0032] When it is determined that the Kafka cluster has started emergency cleanup mode, emergency cleanup is performed on the partition of the first server node in the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0033] In an optional implementation, the step of performing emergency cleanup on the partition of the first server node's disk to ensure that the remaining disk space corresponding to the first server node's disk is greater than or equal to the first space threshold includes:
[0034] Perform the following operations for each partition in the first server node:
[0035] Determine the target occupancy duration and the second occupied space of the partition on the disk of the first server node;
[0036] When the target occupancy duration exceeds the second preset duration and / or the second occupied space exceeds the second space threshold, the partition of the first server node on the disk of the first server node is urgently cleaned up so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0037] Wherein, the second preset duration is less than a third preset duration, and the second space threshold is less than a third space threshold. The third preset duration represents the lower limit of the target occupancy duration that triggers disk cleanup when the Kafka cluster has not enabled the emergency cleanup mode, and the third space threshold represents the lower limit of the second occupied space that triggers disk cleanup when the Kafka cluster has not enabled the emergency cleanup mode.
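The relationship between the normal and emergency thresholds in [0036] and [0037] can be illustrated with a small Python sketch. The class name, function name, and all numeric values below are hypothetical and chosen only for illustration; the application does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanupLimits:
    max_age_seconds: float  # occupancy duration above which a partition is cleaned
    max_bytes: int          # occupied space above which a partition is cleaned


def should_clean(age_seconds: float, size_bytes: int, limits: CleanupLimits) -> bool:
    """A partition qualifies when either limit is exceeded (the 'and / or' condition)."""
    return age_seconds > limits.max_age_seconds or size_bytes > limits.max_bytes


GIB = 1024 ** 3
# Hypothetical values: the third duration/space threshold (normal mode).
NORMAL = CleanupLimits(max_age_seconds=7 * 24 * 3600, max_bytes=10 * GIB)
# Hypothetical values: the second duration/space threshold (emergency mode), strictly smaller.
EMERGENCY = CleanupLimits(max_age_seconds=24 * 3600, max_bytes=1 * GIB)
```

Because the emergency limits are lower than the normal ones, a partition that the normal policy would keep (for example, two days old and 2 GiB in size under these assumed values) becomes eligible for cleanup once emergency cleanup mode is enabled.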
[0038] In one optional implementation, the disk usage information includes disk utilization rate;
[0039] Determining the target state of the server node based on the disk usage information corresponding to the disk of the server node includes:
[0040] When the disk utilization rate corresponding to the disk of the server node is greater than or equal to a preset utilization rate threshold, the target state of the server node is determined to be a non-transfer state.
[0041] When the disk utilization rate corresponding to the disk of the server node is less than the preset utilization rate threshold, the target state of the server node is determined to be a transfer state;
[0042] The preset utilization threshold is used to switch the target state of the server node.
[0043] In a second aspect, this application provides a disk cleanup apparatus, comprising:
[0044] The acquisition module is used to obtain disk usage information for each server node in the server node set included in the Kafka cluster.
[0045] The determination module is used to determine the target state of each server node in the server node set based on the disk usage information corresponding to the disk of the server node;
[0046] The cleanup module is used to transfer the partitions of the first server node from the disk of the first server node to the disks of the second server nodes when the server node set contains a first server node whose target state is the non-transfer state and at least one second server node whose target state is the transfer state, so as to complete the cleanup of the disk of the first server node.
[0047] The non-transfer state indicates that other partitions are not allowed to be transferred to the disk of the first server node, while the transfer state indicates that other partitions are allowed to be transferred to the disk of the second server node.
[0048] In a third aspect, this application provides an electronic device, including a processor and a memory, wherein the processor is configured to execute a disk cleanup program stored in the memory to implement the disk cleanup method described above.
[0049] In a fourth aspect, this application also provides a storage medium storing one or more programs that can be executed by one or more processors to implement the disk cleanup method described above.
[0050] Compared with the prior art, the technical solution provided in this application has the following advantages. The method provided in this application includes: obtaining disk usage information corresponding to the disks of each server node in the server node set included in the Kafka cluster; for each server node in the server node set, determining the target state of the server node based on the disk usage information corresponding to the server node's disk; when there is a first server node in the server node set with a target state of non-transfer and at least one second server node with a target state of transfer, transferring the partitions of the first server node from the disk of the first server node to the disks of the second server nodes to complete the cleanup of the disk of the first server node; wherein the non-transfer state indicates that other partitions are not allowed to be transferred to the disk of the first server node, and the transfer state indicates that other partitions are allowed to be transferred to the disks of the second server nodes. In this embodiment, the disk usage information of each server node in the Kafka cluster is obtained through the above method, and the target state of each server node is determined from the obtained disk usage information. Based on the target states, if there is a first server node that does not allow other partitions to be transferred in and a second server node that allows other partitions to be transferred in, the partitions of the first server node can be transferred from the disk of the first server node to the disk of the second server node. This cleans up the disk of the first server node when it is about to be full, prevents the first server node from entering a zombie state, and thus avoids the failure of the Kafka cluster.
Attached Figure Description
[0051] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
[0052] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0053] One or more embodiments are illustrated by way of example with reference numerals in the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings are denoted as similar elements. Unless otherwise stated, the figures in the drawings are not to be limited by scale.
[0054] Figure 1 is a schematic flowchart of a disk cleanup method provided in an embodiment of this application;
[0055] Figure 2 is a flowchart of another disk cleanup method provided in an embodiment of this application;
[0056] Figure 3 is a schematic flowchart of yet another disk cleanup method provided in an embodiment of this application;
[0057] Figure 4 is a schematic diagram of a disk cleanup process using the automatic redistribution mode, provided in an embodiment of this application;
[0058] Figure 5 is a schematic diagram of a disk cleanup process using the emergency cleanup mode, provided in an embodiment of this application;
[0059] Figure 6 is a schematic diagram of the structure of a disk cleanup apparatus provided in an embodiment of this application;
[0060] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0061] In the drawings above:
[0062] 10. Acquisition module; 20. Determination module; 30. Cleanup module;
[0063] 700. Electronic device; 701. Processor; 702. Memory; 7021. Operating system; 7022. Application program; 703. User interface; 704. Network interface; 705. Bus system.
Detailed Implementation
[0064] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0065] The following disclosure provides numerous different embodiments or examples for implementing various structures of the invention. To simplify the disclosure, specific examples of components and arrangements are described below. These are merely examples and are not intended to limit the scope of the invention. Furthermore, reference numerals and / or letters may be repeated in different examples. Such repetition is for simplification and clarity and does not in itself indicate a relationship between the various embodiments and / or arrangements discussed.
[0066] Before introducing the specific content of the embodiments of this application, the relevant technical terms of the embodiments of this application will be introduced first.
[0067] Kafka cluster: A high-throughput distributed messaging system that provides message persistence through a disk-based data structure. This structure maintains stable performance for extended periods, even with terabytes of message storage. It can provide millions of messages per second on very ordinary server hardware and offers a good producer-consumer interface, enabling effective consumer isolation and other capabilities.
[0068] Broker: All server nodes in a Kafka cluster are called brokers. Brokers are responsible for handling client production and consumption requests and storing all data.
[0069] Topic: The collection of messages sent to the Kafka cluster. It is a logical concept. A Kafka cluster can theoretically create any number of topics, and all messages sent to Kafka must specify which topic they belong to.
[0070] Partition: A partition of a topic. It is a physical concept. Every topic must contain at least one partition. A topic can create multiple partitions, each acting as a message commit log. Messages are appended to the partition, and the order of messages within a single partition is guaranteed. Each partition is assigned to, and can only be assigned to, one broker. All partitions of all topics are evenly distributed across all brokers in the cluster. Partitions can have multiple replicas (one primary replica and at least one secondary replica). These replicas serve as data backups for the partition. Typically, when using Kafka, the number of replicas for a partition can be customized, but each partition must have at least one replica.
[0071] ISR (In-Sync Replicas): In a Kafka cluster, the ISR mechanism is a crucial mechanism for ensuring data reliability and consistency. It consists of a group of synchronized replicas maintained by the partition leader. Each partition's ISR contains the broker IDs of the multiple replicas for that partition. These brokers form an ISR. Assuming a partition has three replicas, one broker acts as the replica leader, and the other two brokers act as followers. The leader is responsible for writing data, while the followers are responsible for synchronizing data with the leader. Whenever a follower in the ISR completes synchronization, the leader sends an ACK acknowledgment to the follower, indicating that the synchronization is complete. If a follower fails to synchronize data with the leader within a certain time period, that follower is removed from the ISR, and another broker is elected as a new follower. When the leader fails, a new leader is elected from the followers of the ISR.
[0072] An embodiment of this application provides a disk cleanup method. Referring to Figure 1, the method specifically includes the following steps:
[0073] S101: Obtain disk usage information for the disks of each server node in the server node set.
[0074] In this embodiment, the disk cleanup method described above is applied to a Kafka cluster, which includes a set of server nodes, and the server node set includes multiple server nodes. Disk usage information includes disk utilization rate. In this embodiment, to avoid the server nodes in the Kafka cluster being in a zombie state, the disk usage information corresponding to the disks of each server node in the Kafka cluster is obtained. Based on the obtained disk usage information, the target state of each server node is determined, and the disks of each server node are cleaned up according to the target state of each server node.
[0075] In this context, the disk utilization of a server node's disk can be understood as the ratio of the used space on that disk to the disk's total space, where the total space is the space available on the disk before any data is written to it.
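As a minimal sketch of this ratio, the Python standard library call `shutil.disk_usage` reports total, used, and free bytes for the filesystem containing a given path. The function name `disk_utilization` and the idea of passing a broker's data directory as `path` are assumptions for illustration; the application does not state how the disk usage information is collected.

```python
import shutil


def disk_utilization(path: str) -> float:
    """Return used space divided by total space for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    return usage.used / usage.total
```

A monitoring component could call this for the directory in which a broker stores its partitions and report the result as that node's disk utilization rate.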
[0076] S102: For each server node in the server node set, determine the target state of the server node based on the disk usage information corresponding to the disk of the server node.
[0077] In this embodiment, after obtaining the disk usage information corresponding to each server node in the Kafka cluster, the disk usage information can be compared with a preset utilization threshold. Based on the comparison result, the target state of each server node can be determined. The preset utilization threshold can be set according to actual needs; this embodiment does not limit the specific value of the preset utilization threshold. For example, the preset utilization threshold could be 30%.
[0078] S103: When there is a first server node in the server node set whose target state is non-transfer state and at least one second server node whose target state is transfer state, the partition of the first server node is transferred from the disk of the first server node to the disk of each second server node to complete the cleanup of the disk of the first server node.
[0079] In this embodiment, a non-transfer state indicates that other partitions are not allowed to be transferred to the disk of the first server node, while a transfer state indicates that other partitions are allowed to be transferred to the disk of the second server node. After obtaining the target state of the disks of each server node in the Kafka cluster, it is determined whether there is a first server node in the Kafka cluster with a target state of non-transfer. If there is a first server node in the Kafka cluster with a target state of non-transfer and at least one second server node in the Kafka cluster with a transfer state, it indicates that the first server node may be in a zombie state. Therefore, in order to avoid the first server node being in a zombie state, the partitions of the first server node are transferred from the disk of the first server node to the disks of each of the second server nodes, thereby completing the cleanup of the disk of the first server node.
[0080] Specifically, any one of the at least one second server nodes can be selected to transfer the partitions of the first server node from the disk of the first server node to the disk of the selected second server node, thus avoiding the first server node being in a zombie state.
[0081] This embodiment provides a disk cleanup method that obtains disk usage information for each server node in a Kafka cluster. Based on the obtained disk usage information, the target state of each server node is determined. If the target state indicates that there is a first server node in the Kafka cluster that does not allow other partitions to be transferred in, and a second server node that does allow other partitions to be transferred in, then the partitions of the first server node can be transferred from the disk of the first server node to the disk of the second server node. This cleanup is achieved when the disk of the first server node is about to be full, thus preventing the first server node from being in a zombie state and preventing the Kafka cluster from failing.
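The decision made in S103 can be sketched as a small planning function. This is an illustrative Python sketch, not the application's implementation: the function name, the broker identifiers, and the string labels "transfer" and "non-transfer" are assumptions.

```python
def plan_cleanup(states: dict[str, str]) -> dict[str, list[str]]:
    """Map each non-transfer broker (first server node) to the brokers
    that may receive its partitions (second server nodes)."""
    sources = [broker for broker, state in states.items() if state == "non-transfer"]
    targets = [broker for broker, state in states.items() if state == "transfer"]
    # An empty target list means no second server node exists for that source.
    return {source: list(targets) for source in sources}
```

For example, `plan_cleanup({"b1": "non-transfer", "b2": "transfer", "b3": "transfer"})` yields `{"b1": ["b2", "b3"]}`, from which any one destination can be selected. An empty dictionary means that no node currently needs cleanup.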
[0082] Referring to Figure 2, which is a flowchart illustrating another disk cleanup method provided in an embodiment of this application, the disk cleanup method provided in this embodiment includes the following steps:
[0083] S201: Obtain disk usage information for the disks of each server node in the server node set.
[0084] In this embodiment, step S201 is the same as step S101 described above. For details, please refer to step S101 described above. This embodiment will not repeat the details here.
[0085] S202: For each server node in the server node set, determine the target state of the server node based on the disk usage information corresponding to the disk of the server node.
[0086] In this embodiment, disk usage information includes disk utilization rate. The step S202 above, which determines the target state of the server node based on the disk usage information corresponding to the server node's disks, includes:
[0087] When the disk utilization rate of the disk corresponding to the server node is greater than or equal to the preset utilization rate threshold, the target state of the server node is determined to be a non-transfer state.
[0088] When the disk utilization rate of a server node's disk is less than the preset utilization threshold, the target state of the server node is determined to be a transfer state.
[0089] The preset utilization threshold is used to switch the target state of server nodes. When the disk utilization of a server node's disk is greater than or equal to the preset utilization threshold, the server node may be about to become a zombie node. To avoid this, the target state of the server node is set to the non-transfer state, which prevents other partitions from being transferred to the disk of this server node; the partitions already on the disk of this server node can still be written to. When the disk utilization of a server node's disk is less than the preset utilization threshold, the server node is not at risk of becoming a zombie node. To keep other server nodes from becoming zombie nodes, partitions of server nodes that are about to become zombie nodes can be transferred from the disks of those server nodes to the disk of the server node whose disk utilization is less than the preset utilization threshold, thereby cleaning up the disks of the server nodes that are about to become zombie nodes and preventing them from becoming zombie nodes.
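The threshold comparison in S202 can be written as a short Python sketch. The enum and function names are assumptions introduced for illustration only.

```python
from enum import Enum


class TargetState(Enum):
    TRANSFER = "transfer"          # other partitions may be transferred onto this node's disk
    NON_TRANSFER = "non-transfer"  # other partitions may not be transferred onto this node's disk


def target_state(utilization: float, threshold: float) -> TargetState:
    """Classify a node by comparing its disk utilization with the preset threshold."""
    if utilization >= threshold:
        return TargetState.NON_TRANSFER
    return TargetState.TRANSFER
```

Note that a utilization exactly equal to the threshold yields the non-transfer state, matching the "greater than or equal to" condition in the text.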
[0090] S203: When there is a first server node in the server node set whose target state is the non-transfer state and at least one second server node whose target state is the transfer state, at least one first partition is determined from all partitions of the first server node.
[0091] In this embodiment, the first partition is a follower (slave) replica. Since the partitions of a server node include primary replicas and follower replicas, only the follower replica partitions of the first server node are transferred, not the primary replica partitions, so that client write requests are not blocked during partition transfer. Therefore, when transferring the partitions of the first server node, the first partitions, which are follower replicas, are first determined from all partitions of the first server node, and then each of the first partitions is transferred from the disk of the first server node to the disks of the second server nodes to complete the cleanup of the disk of the first server node.
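Selecting the first partitions in S203 amounts to filtering out the primary replicas. The following Python sketch assumes a hypothetical `PartitionReplica` record with an `is_leader` flag; neither the record nor its fields come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionReplica:
    topic: str
    partition: int
    is_leader: bool   # True for the primary replica hosted on this node
    size_bytes: int   # space this replica occupies on the node's disk


def first_partitions(replicas: list[PartitionReplica]) -> list[PartitionReplica]:
    """Keep only follower (slave) replicas, so primary replicas stay in place
    and client writes are not interrupted by the transfer."""
    return [replica for replica in replicas if not replica.is_leader]
```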
[0092] Specifically, when the server node set contains a first server node whose target state is the non-transfer state and at least one second server node whose target state is the transfer state, there are server nodes in the Kafka cluster whose disk utilization has reached the warning threshold, and an alarm is output. Furthermore, if the Kafka cluster has automatic redistribution mode enabled, any one of the at least one second server node can be selected, and the first partitions are transferred from the first server node's disk to the selected second server node's disk. This prevents the first server node from being in a zombie state. Automatic redistribution mode is a mode in which partitions are redistributed among server nodes by transferring them.
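The application does not say which mechanism performs the transfer. One possible way to express such a move, shown here purely as an assumption, is the JSON reassignment plan accepted by Kafka's stock `kafka-reassign-partitions` tool, in which the replica list of a partition is rewritten so that the selected second server node replaces the first server node. The helper name and the example broker IDs are hypothetical.

```python
import json


def reassignment_json(topic: str, partition: int, replicas: list[int],
                      source_broker: int, target_broker: int) -> str:
    """Build a reassignment plan that swaps source_broker for target_broker
    in the replica list of one partition."""
    if source_broker not in replicas:
        raise ValueError("source broker does not host a replica of this partition")
    if target_broker in replicas:
        raise ValueError("target broker already hosts a replica of this partition")
    new_replicas = [target_broker if b == source_broker else b for b in replicas]
    plan = {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        ],
    }
    return json.dumps(plan)
```

For instance, moving the follower replica of `orders-0` from broker 3 to broker 4 with replicas `[1, 2, 3]` produces a plan whose replica list is `[1, 2, 4]`; the leader (first entry) is left untouched.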
[0093] S204: Transfer each of the first partitions of the first server node from the disk of the first server node to the disk of each of the second server nodes to complete the cleanup of the disk of the first server node.
[0094] In this embodiment, when transferring each first partition of the first server node from the disk of the first server node to the disk of each second server node, the primary replica partition in the first server node requests to take itself offline from the ISR, and the primary replica partition completes the transfer of each first partition from the disk of the first server node to the disk of each second server node. After the transfer of the first partition is completed, the data of the first partition of the first server node in the disk of the first server node will be cleaned up, thereby realizing the transfer of each first partition and thus realizing the cleanup of the disk of the server node.
[0095] In step S204, each first partition of the first server node is transferred from the disk of the first server node to the disk of each second server node to complete the disk cleanup of the first server node. Specifically, this includes:
[0096] S2041: For each first partition of the first server node, determine the first occupied space of the disk of the first server node occupied by the first partition.
[0097] S2042: Transfer each of the first partitions of the first server node from the disk of the first server node to the disk of each of the second server nodes in descending order of the first occupied space.
[0098] S2043: After the transfer of the first partition is completed, if it is determined that the target state of the first server node is the transfer state, then it is determined that the disk cleanup of the first server node is completed.
[0099] S2044: After each first partition transfer is completed, if it is determined that the target state of the first server node is still the non-transfer state and there is still a first partition to be transferred in the first server node, execute step S2042 to transfer the next first partition until the disk cleanup of the first server node is completed.
[0100] S2045: After each first partition transfer is completed, if it is determined that the target state of the first server node is still the non-transfer state and there is no first partition to be transferred in the first server node, obtain the remaining disk space corresponding to the disk of the first server node.
[0101] S2046: When the remaining disk space is greater than the first space threshold, continuously output alarm prompts.
[0102] S2047: When the remaining disk space is less than or equal to the first space threshold, the write operation on the first server node is blocked.
[0103] S2048: After confirming that the Kafka cluster has started emergency cleanup mode, perform emergency cleanup on the partition of the first server node to ensure that the remaining disk space of the disk corresponding to the first server node is greater than the first space threshold.
[0104] Regarding steps S2041 to S2048 above, the first space threshold represents the upper limit of the remaining disk space corresponding to the need for emergency cleanup of the server node's disk. The first space threshold can be set according to actual needs, and the specific value of the first space threshold is not limited in this embodiment. For example, the first space threshold can be 100GB.
[0105] Specifically, when the first partitions of the first server node need to be transferred from the disk of the first server node to the disks of the second server nodes, the first occupied space of each first partition on the disk of the first server node is determined, and all first partitions are sorted in descending order of first occupied space, so that the disk of the first server node is cleaned up as quickly as possible. The first partition with the largest first occupied space is transferred first. After each transfer is completed, the target state of the first server node is determined. If the target state is the transfer state, the cleanup of the disk of the first server node has been achieved by the transfers performed so far. If the target state is still the non-transfer state and there are still first partitions to be transferred in the first server node, the cleanup has not yet been completed, and the next first partition is transferred in descending order of first occupied space. If the target state is still the non-transfer state and there is no first partition left to be transferred, transferring all first partitions of the first server node has failed to clean up the disk of the first server node. In that case, an alarm message is continuously output, awaiting manual intervention.
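The largest-first loop of steps S2041 to S2044 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `transfer_largest_first` and the two callbacks `move` and `is_transfer_state` are hypothetical stand-ins for the actual partition transfer and for the target state check.

```python
from typing import Callable, Dict, List


def transfer_largest_first(
    first_partitions: Dict[str, int],
    move: Callable[[str], None],
    is_transfer_state: Callable[[], bool],
) -> List[str]:
    """Move first partitions off a node, largest first, until it recovers.

    first_partitions maps a partition name to its first occupied space.
    Returns the names that were moved, in the order they were moved.
    """
    moved: List[str] = []
    # S2042: descending order of first occupied space.
    order = sorted(first_partitions, key=lambda n: first_partitions[n], reverse=True)
    for name in order:
        move(name)  # hypothetical transfer to the second server nodes
        moved.append(name)
        # S2043: stop as soon as the node is back in the transfer state.
        if is_transfer_state():
            break
    # If the loop ends without recovery, the caller falls through to S2045.
    return moved


# Hypothetical node: capacity 100 units, 95 used, 80% utilization threshold.
sizes = {"P1": 30, "P2": 20, "P3": 10}
disk = {"used": 95}
moved = transfer_largest_first(
    sizes,
    move=lambda n: disk.__setitem__("used", disk["used"] - sizes[n]),
    is_transfer_state=lambda: disk["used"] / 100 < 0.80,
)
print(moved)  # ['P1']
```

In this toy run, moving P1 alone brings utilization from 95% to 65%, so P2 and P3 stay where they are.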
[0106] More specifically, since partitions on the first server node can still be written to while the target state of the first server node is the non-transfer state, in order to further prevent the first server node from entering a zombie state, when it is determined that the target state of the first server node is still the non-transfer state and there is no first partition to be transferred in the first server node, the remaining disk space corresponding to the disk of the first server node is obtained and compared with the first space threshold. When the remaining disk space is greater than the first space threshold, no emergency cleanup of the disk of the first server node is required, and alarm prompts continue to be output. When the remaining disk space is less than or equal to the first space threshold, emergency cleanup of the disk of the first server node is required, and write operations on the first server node are blocked. When it is determined that the Kafka cluster has enabled emergency cleanup mode, emergency cleanup is performed on the partitions of the first server node on the disk of the first server node. Once the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold, the blocking of write operations on the first server node ends.
[0107] It's important to note that remaining disk space can be understood as the space on the disk that has not yet been written to. Before using a Kafka cluster, it's advisable to agree on whether to enable the Kafka cluster's emergency cleanup mode. When cleaning up the disks of server nodes in the Kafka cluster, emergency cleanup is not allowed if the Kafka cluster's emergency cleanup mode is not enabled. Emergency cleanup is only permitted when the Kafka cluster's emergency cleanup mode is enabled.
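The decision in steps S2045 to S2048 depends on two inputs only: the remaining disk space and whether emergency cleanup mode was enabled for the cluster. A minimal sketch follows; the names `Action` and `fallback_action` are hypothetical, and the 100GB figure is the example value from this embodiment (taken here as binary gigabytes, which is an assumption).

```python
from enum import Enum


class Action(Enum):
    ALARM_ONLY = "alarm_only"
    BLOCK_WRITES = "block_writes"
    BLOCK_AND_EMERGENCY_CLEANUP = "block_and_emergency_cleanup"


GB = 1024 ** 3  # assumption: binary gigabytes


def fallback_action(
    remaining_bytes: int,
    first_space_threshold: int,
    emergency_mode_enabled: bool,
) -> Action:
    """Pick the fallback once no first partition is left to transfer."""
    # S2046: enough space remains, so only the alarm keeps firing.
    if remaining_bytes > first_space_threshold:
        return Action.ALARM_ONLY
    # S2047: at or below the threshold, writes are blocked in every case.
    # S2048: cleanup itself is only permitted if the mode was enabled.
    if emergency_mode_enabled:
        return Action.BLOCK_AND_EMERGENCY_CLEANUP
    return Action.BLOCK_WRITES


print(fallback_action(80 * GB, 100 * GB, emergency_mode_enabled=True))
```

Keeping the mode flag as an explicit argument mirrors the requirement that emergency cleanup is never performed unless it was agreed on in advance.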
[0108] As an example, the following describes the process of cleaning up the partitions on the disk of the first server node. The partitions of the first server node include the first partition P1, the first partition P2 and the first partition P3. The first occupied space corresponding to the first partition P1 is O1, the first occupied space corresponding to the first partition P2 is O2 and the first occupied space corresponding to the first partition P3 is O3, where O1 is greater than O2 and O2 is greater than O3.
[0109] The first partition P1 is transferred from the disk of the first server node to the disks of the second server nodes. After the transfer of P1 is completed, if the target state of the first server node is determined to be the transfer state, the disk cleanup of the first server node is completed. If the target state is determined to be the non-transfer state, then, since the first partitions P2 and P3 remain to be transferred, the first partition P2 is transferred from the disk of the first server node to the disks of the second server nodes. After the transfer of P2 is completed, if the target state of the first server node is determined to be the transfer state, the disk cleanup of the first server node is completed. If the target state is determined to be the non-transfer state, then, since the first partition P3 remains to be transferred, the first partition P3 is transferred from the disk of the first server node to the disks of the second server nodes. After the transfer of P3 is completed, if the target state of the first server node is determined to be the transfer state, the disk cleanup of the first server node is completed. If, after the transfer of P3 is completed, the target state is still the non-transfer state, then, since P1, P2, and P3 have all been transferred, an alarm message is continuously output, awaiting manual intervention. During the continuous output of alarm messages, the remaining disk space corresponding to the disk of the first server node is obtained. If the remaining disk space is greater than the first space threshold, alarm messages continue to be output, awaiting manual intervention. If the remaining disk space is less than or equal to the first space threshold, write operations on the first server node are blocked. If the Kafka cluster has enabled emergency cleanup mode, emergency cleanup is performed on the partitions of the first server node on the disk of the first server node, and the blocking of write operations on the first server node ends once the remaining disk space is greater than the first space threshold.
[0110] In this embodiment, step S2048 above, performing emergency cleanup on the partitions of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold, includes:
[0111] Perform the following operations on each partition in the first server node:
[0112] Determine the target occupied duration and the second occupied space of the partition on the disk of the first server node;
[0113] When the target occupied duration exceeds the second preset duration and / or the second occupied space exceeds the second space threshold, emergency cleanup is performed on the partition of the first server node on the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0114] In this embodiment, the second preset duration is less than the third preset duration, and the second space threshold is less than the third space threshold. The third preset duration represents the lower limit of the target occupied duration at which disk cleanup is performed when the Kafka cluster is not in emergency cleanup mode, and the third space threshold represents the lower limit of the second occupied space at which disk cleanup is performed when the Kafka cluster is not in emergency cleanup mode. The Kafka cluster not being in emergency cleanup mode can be understood as the Kafka cluster's disk cleanup mode being the default cleanup mode. The third space threshold and the third preset duration can be set according to actual needs; this embodiment does not limit their specific values. For example, the third preset duration can be 7 days, and the third space threshold can be 1GB. Because a remaining disk space less than or equal to the first space threshold indicates that the first server node is about to enter a zombie state, in order to resolve the crisis of the first server node quickly, the second preset duration can be set to 50% of the third preset duration, and the second space threshold can be set to 50% of the third space threshold. Of course, the second preset duration and the second space threshold should not be set too small. For example, the second preset duration should not be less than 1 day and the second space threshold should not be less than 100MB. Setting them too small would make emergency cleanup meaningless and would also cause the loss of additional data.
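As a worked illustration of the example values above, the emergency thresholds can be derived from the default ones by halving and then applying the suggested floors. This is a sketch under stated assumptions: `CleanupThresholds`, `emergency_thresholds`, `MIN_AGE`, and `MIN_BYTES` are hypothetical names, and sizes are taken as binary units.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CleanupThresholds:
    max_age: timedelta  # preset duration
    max_bytes: int      # space threshold


# Floors suggested in this embodiment: at least 1 day and at least 100MB.
MIN_AGE = timedelta(days=1)
MIN_BYTES = 100 * 1024 ** 2


def emergency_thresholds(
    default: CleanupThresholds, ratio: float = 0.5
) -> CleanupThresholds:
    """Derive the second (emergency) thresholds from the third (default) ones."""
    return CleanupThresholds(
        max_age=max(default.max_age * ratio, MIN_AGE),
        max_bytes=max(int(default.max_bytes * ratio), MIN_BYTES),
    )


# Example defaults from this embodiment: 7 days and 1GB.
default = CleanupThresholds(max_age=timedelta(days=7), max_bytes=1024 ** 3)
print(emergency_thresholds(default))
```

With the 7 day and 1GB defaults this yields 3.5 days and 512MB, both above the floors. A caller would still need to confirm that the result stays strictly below the default values when the defaults themselves are close to the floors.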
[0115] Specifically, performing emergency cleanup on the partition of the first server node on the disk of the first server node can be understood as performing emergency cleanup on the data in the partition whose target occupied duration exceeds the second preset duration and / or whose second occupied space exceeds the second space threshold.
[0116] It should be noted that when the target occupied duration is less than or equal to the second preset duration and the second occupied space is less than or equal to the second space threshold, there is no need to perform emergency cleanup on that partition. After traversing all partitions of the first server node, if the remaining disk space corresponding to the disk of the first server node is still less than or equal to the first space threshold, an alarm message will be continuously output, awaiting manual intervention.
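The per-partition test described above is a plain "either condition" check. The following sketch shows one way to select the partitions that qualify; `PartitionUsage`, `needs_emergency_cleanup`, and `select_for_emergency_cleanup` are hypothetical names, and measuring the target occupied duration in seconds is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PartitionUsage:
    name: str
    occupied_seconds: float  # target occupied duration
    occupied_bytes: int      # second occupied space


def needs_emergency_cleanup(
    p: PartitionUsage, second_preset_seconds: float, second_space_threshold: int
) -> bool:
    """True if either limit is exceeded (the 'and / or' condition)."""
    return (
        p.occupied_seconds > second_preset_seconds
        or p.occupied_bytes > second_space_threshold
    )


def select_for_emergency_cleanup(
    partitions: Iterable[PartitionUsage],
    second_preset_seconds: float,
    second_space_threshold: int,
) -> List[str]:
    """Traverse all partitions of the node and keep those that qualify."""
    return [
        p.name
        for p in partitions
        if needs_emergency_cleanup(p, second_preset_seconds, second_space_threshold)
    ]


DAY = 86_400
MB = 1024 ** 2
partitions = [
    PartitionUsage("old-small", 5 * DAY, 10 * MB),
    PartitionUsage("new-large", 1 * DAY, 900 * MB),
    PartitionUsage("new-small", 1 * DAY, 10 * MB),
]
print(select_for_emergency_cleanup(partitions, 3.5 * DAY, 512 * MB))
```

Values exactly equal to a limit do not qualify, matching the "less than or equal to" wording of the no-cleanup case.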
[0117] In the above, in step S2043, after completing the transfer of the first partition, if it is determined that the target state of the first server node is a transfer state, then it is determined that the disk cleanup of the first server node is completed, including:
[0118] S2043a: After completing the transfer of the first partition, obtain the disk usage information corresponding to the disk of the first server node.
[0119] S2043b: When the disk utilization rate corresponding to the disk of the first server node is less than the preset utilization rate threshold, and is still less than the preset utilization rate threshold after a first preset time period, the target state of the first server node is determined to be the transfer state.
[0120] Specifically, disk usage information includes the disk utilization rate, and the preset utilization rate threshold is used to switch the target state of a server node. When the disk utilization rate corresponding to the disk of the first server node is less than the preset utilization rate threshold, the condition that placed the first server node in the non-transfer state no longer holds. In this embodiment, in order to avoid the situation where the data volume of a certain partition of a server node in the Kafka cluster is too large, resulting in the continuous transfer of partitions back and forth between two server nodes, when it is determined that the disk utilization rate corresponding to the disk of the first server node is less than the preset utilization rate threshold, a first preset time period is waited. If the disk utilization rate corresponding to the disk of the first server node is still less than the preset utilization rate threshold after the first preset time period, the target state of the first server node is switched from the non-transfer state to the transfer state. From then on, the first server node allows other partitions to be transferred to its disk for subsequent disk cleanup.
[0121] More specifically, when the disk utilization rate corresponding to the disk of the first server node is less than the preset utilization rate threshold, and is still less than the preset utilization rate threshold after the first preset time period, the target state of the first server node is determined to be the transfer state. When the disk utilization rate corresponding to the disk of the first server node is greater than or equal to the preset utilization rate threshold, the target state of the first server node is determined to be the non-transfer state.
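The wait-and-recheck rule of steps S2043a and S2043b can be sketched as below. The function `state_after_transfer` and its parameters are hypothetical; `read_utilization` stands in for however the disk utilization rate is actually sampled, and `sleep` is injectable so the rule can be exercised without really waiting.

```python
import time
from typing import Callable

TRANSFER = "transfer"
NON_TRANSFER = "non-transfer"


def state_after_transfer(
    read_utilization: Callable[[], float],
    threshold: float,
    first_preset_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the target state of a node after a partition transfer."""
    # At or above the threshold: the node stays in the non-transfer state.
    if read_utilization() >= threshold:
        return NON_TRANSFER
    # Below the threshold: wait the first preset time period, then recheck,
    # so a brief dip does not cause partitions to bounce between nodes.
    sleep(first_preset_seconds)
    if read_utilization() < threshold:
        return TRANSFER
    return NON_TRANSFER


# Two hypothetical samples, both below an 80% threshold.
samples = iter([0.70, 0.75])
print(state_after_transfer(lambda: next(samples), 0.80, 60, sleep=lambda s: None))
```

The second reading is what actually flips the state, which is why a node that climbs back above the threshold during the wait remains closed to incoming partitions.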
[0122] This embodiment provides a disk cleanup method that obtains disk usage information for each server node in a Kafka cluster. Based on the obtained disk usage information, the target state of each server node is determined. If the target state indicates that there is a first server node in the Kafka cluster that does not allow other partitions to be transferred in, and a second server node that does allow other partitions to be transferred in, then the partitions of the first server node can be transferred from the disk of the first server node to the disk of the second server node. This cleanup is achieved when the disk of the first server node is about to be full, thus preventing the first server node from being in a zombie state and preventing the Kafka cluster from failing.
[0123] Referring to Figure 3, Figure 3 is a flowchart illustrating another disk cleanup method provided in an embodiment of this application. The disk cleanup method provided in this embodiment includes the following steps:
[0124] S301: Obtain disk usage information for the disks of each server node in the server node set.
[0125] S302: For each server node in the server node set, determine the target state of the server node based on the disk usage information corresponding to the disk of the server node.
[0126] S303: When there is a first server node in the server node set whose target state is non-transfer state and at least one second server node whose target state is transfer state, the partition of the first server node is transferred from the disk of the first server node to the disk of each second server node to complete the cleanup of the disk of the first server node.
[0127] For steps S301 to S303, step S301 is the same as step S101 above, step S302 is the same as step S102 above, and step S303 is the same as step S103 above. For details, please refer to the description of steps S101 to S103 above. This embodiment will not repeat it here.
[0128] S304: When there is a first server node in the server node set whose target state is non-transfer state and there is no second server node in the target state of transfer state, for each first server node, obtain the remaining disk space corresponding to the disk of the first server node.
[0129] S305: When the remaining disk space is less than or equal to the first space threshold, the write operation on the first server node is blocked.
[0130] S306: When it is determined that the Kafka cluster has started emergency cleanup mode, perform emergency cleanup on the partition of the first server node in the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0131] Regarding steps S304 to S306 above, the first space threshold is as described above and will not be repeated here in this embodiment.
[0132] In this scenario, if there is a first server node in the server node set with a target state of non-transfer and no second server node with a target state of transfer, it indicates that the Kafka cluster as a whole has entered a state of excessive disk capacity. At this point, partition transfer within the server nodes cannot be achieved; only continuous alarm output is possible, awaiting manual intervention. Since partitions within each first server node can still be written to when the target state is non-transfer, to further prevent first server nodes from being in a zombie state, the following operation is performed on each first server node after confirming that all server nodes in the server node set are first server nodes with a target state of non-transfer:
[0133] The remaining disk space corresponding to the disk of the first server node is obtained and compared with a first space threshold to determine whether the remaining disk space of the current disk of the first server node is less than or equal to the first space threshold corresponding to the need for emergency cleanup of the disk of the first server node. If the remaining disk space is greater than the first space threshold, it indicates that no emergency cleanup of the disk of the first server node is needed, and alarm prompts can continue to be output. If the remaining disk space is less than or equal to the first space threshold, it indicates that emergency cleanup of the disk of the first server node is needed. At this time, write operations on the first server node are blocked. If it is determined that the Kafka cluster has enabled emergency cleanup mode, emergency cleanup is performed on the partition of the first server node in the disk of the first server node. When the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold, the blocking of write operations on the first server node ends.
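The branch between step S303 and steps S304 to S306 depends only on which target states are present in the server node set. A minimal sketch, assuming the utilization threshold rule described earlier; `plan_cleanup` and its return labels are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple


def plan_cleanup(
    utilization: Dict[str, float], threshold: float
) -> Tuple[str, List[str], List[str]]:
    """Split nodes by target state and pick the branch to follow.

    Returns (branch, first_nodes, second_nodes), where first_nodes are in
    the non-transfer state and second_nodes are in the transfer state.
    """
    first = sorted(n for n, u in utilization.items() if u >= threshold)
    second = sorted(n for n, u in utilization.items() if u < threshold)
    if not first:
        return "nothing_to_do", first, second
    if second:
        # S303: there is somewhere to move partitions to.
        return "transfer_partitions", first, second
    # S304: every node is full, so check remaining space on each first node.
    return "check_remaining_space", first, second


print(plan_cleanup({"broker-1": 0.92, "broker-2": 0.40}, 0.80))
print(plan_cleanup({"broker-1": 0.92, "broker-2": 0.85}, 0.80))
```

In the second call no node can accept partitions, which corresponds to the cluster-wide case handled by alarms and, if enabled, emergency cleanup.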
[0134] This embodiment provides a disk cleanup method that obtains disk usage information for each server node in a Kafka cluster. Based on the obtained disk usage information, the target state of each server node is determined. If the target state indicates that there is a first server node in the Kafka cluster that does not allow other partitions to be transferred in, and a second server node that does allow other partitions to be transferred in, then the partitions of the first server node can be transferred from the disk of the first server node to the disk of the second server node. This cleanup is achieved when the disk of the first server node is about to be full, thus preventing the first server node from being in a zombie state and preventing the Kafka cluster from failing.
[0135] The following is an example of how to perform disk cleanup using the automatic redistribution mode and the emergency cleanup mode.
[0136] Referring to Figure 4, Figure 4 is a schematic diagram of a disk cleanup process using the automatic redistribution mode, provided in an embodiment of this application. The entire process of disk cleanup using the automatic redistribution mode is as follows:
[0137] Get the disk usage of each server node in the Kafka cluster;
[0138] Determine whether the utilization rate of each disk is greater than or equal to the preset utilization rate threshold; when there is a first server node whose disk utilization rate is greater than or equal to the preset utilization rate threshold and a second server node whose disk utilization rate is less than the preset utilization rate threshold, output an alarm message;
[0139] When the usage rate of all disks is greater than or equal to the preset usage rate threshold, an alarm will be continuously output, waiting for manual intervention.
[0140] Determine if the Kafka cluster has automatic redistribution mode enabled; if the Kafka cluster has not enabled automatic redistribution mode, end the disk cleanup process; if the Kafka cluster has enabled automatic redistribution mode, set the target state of the first server node to non-transfer state and the target state of the second server node to transfer state. The first server node in non-transfer state does not accept other partitions to be transferred to this server node, but partitions already held by the first server node can still be written to. The second server node in transfer state accepts other partitions to be transferred to this server node.
[0141] Identify at least one first partition from all partitions of the first server node as a replica and determine the first occupied space of the disk of the first server node occupied by each first partition;
[0142] In descending order of the first space occupied, the first partitions of the first server node are transferred from the disk of the first server node to the disk of the second server node.
[0143] After the transfer of the first partition is completed, obtain the disk usage rate corresponding to the disk of the first server node;
[0144] If the disk usage rate of the disk corresponding to the disk of the first server node is less than the preset usage rate threshold, and if the disk usage rate of the disk corresponding to the disk of the first server node is still less than the preset usage rate threshold after a first preset time period, then the disk cleanup of the first server node is determined to be completed.
[0145] If the disk usage rate of the disk corresponding to the disk of the first server node is greater than or equal to the preset usage rate threshold and there are still first partitions to be transferred in the first server node, continue to execute the step of transferring each first partition in the first server node from the disk of the first server node to the disk of each second server node in descending order of the first occupied space, so as to transfer the next partition until the disk of the first server node is cleaned up.
[0146] If the disk usage rate of the disk corresponding to the disk of the first server node is greater than or equal to the preset usage rate threshold and there is no first partition to be transferred in the first server node, an alarm message will be continuously output, waiting for manual intervention.
[0147] Referring to Figure 5, Figure 5 is a schematic diagram of a disk cleanup process using the emergency cleanup mode, provided in an embodiment of this application. The emergency cleanup mode addresses the case where automatic redistribution fails to resolve the disk capacity problem and new data continues to be written to the server node, so that the remaining disk space may fall to the first space threshold. In this emergency, the automatic redistribution mode is switched to the emergency cleanup mode, and the server node's disk is cleaned up immediately. This ensures that, even under the most severe disk conditions, the server node does not enter a zombie state due to a full disk.
[0148] When the disk usage of all disks is greater than or equal to the preset usage threshold, or when the disk usage of the disks corresponding to the disks of the first server node is greater than or equal to the preset usage threshold and there is no first partition to be transferred in the first server node, an emergency cleanup mode can be executed on each first server node, as follows:
[0149] Get the remaining disk space of the disk corresponding to the disk of the first server node;
[0150] Determine if the remaining disk space is greater than the first space threshold;
[0151] When the remaining disk space exceeds the first space threshold, an alarm message will be continuously output.
[0152] Write operations on the first server node are blocked when the remaining disk space is less than or equal to the first space threshold.
[0153] Determine if the Kafka cluster has enabled emergency cleanup mode. If it is determined that the Kafka cluster has not enabled emergency cleanup mode, end the emergency cleanup process.
[0154] When it is determined that the Kafka cluster has started emergency cleanup mode, determine the target duration and second space occupied by each partition of the first server node on the disk of the first server node.
[0155] For each partition, when the target occupied duration exceeds the second preset duration and / or the second occupied space exceeds the second space threshold, the partition of the first server in the disk of the first server node is urgently cleaned up.
[0156] After completing the emergency cleanup of all partitions, determine whether the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold. If the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold, end the blocking of write operations on the first server node.
[0157] The emergency cleanup process ends when the remaining disk space corresponding to the disk of the first server node is less than or equal to the first space threshold.
[0158] Figure 6 is a schematic diagram of a disk cleanup device provided in an embodiment of this application. The disk cleanup device provided in this embodiment includes an acquisition module 10, a determination module 20, and a cleanup module 30. The acquisition module 10 is used to acquire disk usage information corresponding to the disks of each server node in a server node set included in a Kafka cluster. The determination module 20 is used to determine, for each server node in the server node set, the target state of that server node based on the disk usage information corresponding to its disk. The cleanup module 30 is used to, when there is a first server node in the server node set whose target state is the non-transfer state and at least one second server node in the server node set whose target state is the transfer state, transfer the partitions of the first server node from the disk of the first server node to the disks of the second server nodes, thereby completing the cleanup of the disk of the first server node. The non-transfer state indicates that other partitions are not allowed to be transferred to the disk of the first server node, and the transfer state indicates that other partitions are allowed to be transferred to the disk of the second server node.
[0159] In this embodiment, the cleanup module 30 is further configured to:
[0160] At least one first partition is determined from all partitions of the first server node, wherein the first partition is a replica;
[0161] The first partitions of the first server node are transferred from the disk of the first server node to the disk of the second server node to complete the cleanup of the disk of the first server node.
[0162] In this embodiment, the cleanup module 30 is further configured to:
[0163] For each of the first partitions of the first server node, determine the first occupied space of the disk of the first server node occupied by the first partition;
[0164] The following steps are performed on each of the first partitions in the first server node in descending order of the first occupied space:
[0165] The first partition is transferred from the disk of the first server node to the disks of each of the second server nodes;
[0166] After the transfer of the first partition is completed, if it is determined that the target state of the first server node is a transfer state, then it is determined that the disk cleanup of the first server node is completed.
[0167] After the transfer of the first partition is completed, if it is determined that the target state of the first server node is still in a non-transfer state and there are still first partitions to be transferred in the first server node, the transfer process for the next first partition continues until the disk cleanup of the first server node is completed.
[0168] In this embodiment, the cleanup module 30 is further configured to:
[0169] Obtain the disk usage information corresponding to the disk of the first server node, wherein the disk usage information includes disk utilization rate;
[0170] If the disk utilization rate corresponding to the disk of the first server node is less than the preset utilization rate threshold, and is still less than the preset utilization rate threshold after a first preset time period, then the target state of the first server node is determined to be the transfer state.
[0171] The preset utilization threshold is used to switch the target state of the server node.
[0172] In this embodiment, the cleanup module 30 is further configured to:
[0173] After the transfer of the first partition is completed, if it is determined that the target state of the first server node is still non-transfer state and there is no first partition to be transferred in the first server node, the remaining disk space corresponding to the disk of the first server node is obtained.
[0174] When the remaining disk space is less than or equal to a first space threshold, write operations on the first server node are blocked. The first space threshold represents the upper limit of the remaining disk space corresponding to the need for emergency cleanup of the server node's disk.
[0175] When it is determined that the Kafka cluster has started emergency cleanup mode, emergency cleanup is performed on the partition of the first server node in the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0176] In this embodiment, the cleanup module 30 is further configured to:
[0177] When there is a first server node in the server node set whose target state is the non-transfer state and there is no second server node whose target state is the transfer state, the following operations are performed for each first server node:
[0178] Obtain the remaining disk space corresponding to the disk of the first server node;
[0179] When the remaining disk space is less than or equal to a first space threshold, write operations on the first server node are blocked. The first space threshold represents the upper limit of the remaining disk space corresponding to the need for emergency cleanup of the server node's disk.
[0180] When it is determined that the Kafka cluster has started emergency cleanup mode, emergency cleanup is performed on the partition of the first server node in the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0181] In this embodiment, the cleanup module 30 is further configured to:
[0182] Perform the following operations for each partition of the first server node:
[0183] Determine the target occupied duration and the second occupied space of the partition on the disk of the first server node;
[0184] When the target occupied duration exceeds the second preset duration and / or the second occupied space exceeds the second space threshold, emergency cleanup is performed on the partition of the first server node on the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
[0185] Wherein, the second preset duration is less than the third preset duration, the second space threshold is less than the third space threshold, the third preset duration represents the lower limit of the target occupied duration corresponding to disk cleanup when the Kafka cluster does not enable the emergency cleanup mode, and the third space threshold represents the lower limit of the second occupied space corresponding to disk cleanup when the Kafka cluster does not enable the emergency cleanup mode.
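The relationship between the emergency thresholds (second preset duration, second space threshold) and the normal thresholds (third preset duration, third space threshold) can be illustrated with a short sketch. The `Partition` class, its field names, and the function `partitions_to_clean` are hypothetical; the sketch only selects which partitions qualify for cleanup and does not model how data is actually deleted.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical view of one partition on a node's disk."""
    name: str
    age_seconds: float   # target occupied duration
    size_bytes: int      # second occupied space


def partitions_to_clean(partitions, emergency,
                        second_duration, second_space,
                        third_duration, third_space):
    """Return the names of partitions that exceed the active thresholds."""
    # The text requires the emergency thresholds to be stricter than the normal ones.
    if not (second_duration < third_duration and second_space < third_space):
        raise ValueError("emergency thresholds must be below normal thresholds")

    if emergency:
        max_age, max_size = second_duration, second_space
    else:
        max_age, max_size = third_duration, third_space

    # "and / or" in the text: exceeding either limit is enough.
    return [p.name for p in partitions
            if p.age_seconds > max_age or p.size_bytes > max_size]
```

Because the emergency thresholds are lower, a partition that is left alone in normal mode can become eligible for cleanup once emergency cleanup mode is started.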
[0186] In this embodiment, the disk usage information includes a disk utilization rate.
[0187] In this embodiment, the determining module 20 is further configured to:
[0188] When the disk utilization rate corresponding to the disk of the server node is greater than or equal to a preset utilization rate threshold, the target state of the server node is determined to be a non-transfer state;
[0189] When the disk utilization rate corresponding to the disk of the server node is less than the preset utilization rate threshold, the target state of the server node is determined to be a transfer state;
[0190] The preset utilization threshold is used to switch the target state of the server node.
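The threshold rule above reduces to a single comparison. In the sketch below, the function name `target_state` and the state strings are hypothetical labels chosen for illustration, and the utilization rate is assumed to be a fraction between 0 and 1.

```python
def target_state(disk_utilization: float, threshold: float) -> str:
    """Classify a server node by its disk utilization rate.

    At or above the threshold the node must not receive other partitions
    (non-transfer state); below it the node may receive them (transfer state).
    """
    return "non_transfer" if disk_utilization >= threshold else "transfer"
```

Note that a utilization rate exactly equal to the threshold yields the non-transfer state, matching the "greater than or equal to" wording.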
[0191] This embodiment provides a disk cleanup device that obtains disk usage information corresponding to the disks of each server node in a Kafka cluster and determines the target state of each server node from that information. If, based on the target states, the Kafka cluster is found to contain a first server node that does not allow other partitions to be transferred in and a second server node that does allow other partitions to be transferred in, the partitions of the first server node can be transferred from the disk of the first server node to the disk of the second server node. In this way, the disk of the first server node is cleaned up when it is about to become full, which prevents the first server node from entering a zombie state and thus avoids failure of the Kafka cluster.
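The overall flow summarized above can be sketched end to end. This is an illustrative model only: the `Broker` class and `rebalance` function are hypothetical, partition sizes are plain integers, and three choices are assumptions rather than statements of the embodiment, namely moving the largest partition first, sending it to the least-utilized eligible node, and skipping a destination that already holds a partition with the same name.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical model of a server node and the partitions on its disk."""
    name: str
    capacity: int
    partitions: dict = field(default_factory=dict)  # partition name -> size

    @property
    def utilization(self) -> float:
        return sum(self.partitions.values()) / self.capacity


def rebalance(brokers, threshold):
    """Move partitions off non-transfer nodes onto transfer-state nodes.

    Returns a list of (partition, source, destination) moves.
    """
    moves = []
    for src in brokers:
        # src is a "first server node" while it stays at or above the threshold.
        while src.utilization >= threshold and src.partitions:
            name = max(src.partitions, key=src.partitions.get)  # largest first
            targets = [b for b in brokers
                       if b is not src
                       and b.utilization < threshold      # transfer state
                       and name not in b.partitions]
            if not targets:
                break  # no "second server node" available for this partition
            dst = min(targets, key=lambda b: b.utilization)
            dst.partitions[name] = src.partitions.pop(name)
            moves.append((name, src.name, dst.name))
    return moves
```

In a real cluster the move itself would be a partition reassignment that copies data over the network; the sketch only tracks bookkeeping so that the state re-evaluation after each move is visible.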
[0192] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device 700 illustrated in Figure 7 includes at least one processor 701, a memory 702, at least one network interface 704, and other user interfaces 703. The various components in the electronic device 700 are coupled together via a bus system 705. It is understood that the bus system 705 is used to implement communication between these components. In addition to a data bus, the bus system 705 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all of these buses are labeled as bus system 705 in Figure 7.
[0193] The user interface 703 may include a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touchpad, or touchscreen).
[0194] It is understood that the memory 702 in the embodiments of this application can be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 702 described herein is intended to include, but is not limited to, these and any other suitable types of memory.
[0195] In some implementations, memory 702 stores the following elements, namely executable units or data structures, or subsets or extended sets thereof: an operating system 7021 and an application program 7022.
[0196] The operating system 7021 includes various system programs, such as the framework layer, core library layer, and driver layer, used to implement various basic business functions and handle hardware-based tasks. The application program 7022 includes various applications, such as a media player and a browser, used to implement various application functions. The program implementing the method of the embodiments of this application can be included in the application program 7022.
[0197] In the embodiments of this application, the processor 701 executes the method steps provided in each method embodiment by calling the program or instructions stored in the memory 702, specifically the program or instructions stored in the application program 7022.
[0198] The methods disclosed in the embodiments of this application can be applied to or implemented by processor 701. Processor 701 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the integrated logic circuit of the hardware in processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor. The software units may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. The storage medium is located in memory 702. Processor 701 reads the information in memory 702 and, in conjunction with its hardware, completes the steps of the above method.
[0199] It is understood that the embodiments described herein can be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For hardware implementation, the processing unit can be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or combinations thereof.
[0200] For software implementation, the techniques described herein can be implemented by units that perform the functions described herein. The software code can be stored in memory and executed by a processor. The memory can be implemented in the processor or external to the processor.
[0201] The electronic device provided in this embodiment may be the electronic device shown in Figure 7, and can perform all steps of the disk cleanup method shown in Figures 1-5, thereby achieving the technical effects of the disk cleanup method shown in Figures 1-5. For details, please refer to the descriptions relating to Figures 1-5, which are not repeated here for the sake of brevity.
[0202] This application also provides a storage medium (computer-readable storage medium). This storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, hard disk, or solid-state drive; and it may also include combinations of the above types of memory.
[0203] The one or more programs in the storage medium can be executed by one or more processors to implement the disk cleanup method described above, as executed on the disk cleanup device side.
[0204] The processor is used to execute a disk cleanup program stored in memory to implement the steps of the disk cleanup method executed on the disk cleanup device side.
[0205] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0206] It should be noted that the terms "one implementation," "embodiment," "exemplary embodiment," and "some embodiments" used in the specification indicate that the described embodiment may include a specific feature, structure, or characteristic, but not every embodiment necessarily includes that specific feature, structure, or characteristic. Furthermore, such phrases do not necessarily refer to the same embodiment. Moreover, when a specific feature, structure, or characteristic is described in connection with an embodiment, implementing such a feature, structure, or characteristic in conjunction with other embodiments, whether explicitly described or not, is within the knowledge scope of those skilled in the art.
[0207] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0208] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A disk cleanup method, characterized in that it is applied to a Kafka cluster, wherein the Kafka cluster comprises a server node set, and the method includes: obtaining disk usage information corresponding to the disk of each server node in the server node set; for each server node in the server node set, determining the target state of the server node based on the disk usage information corresponding to the disk of the server node; when the server node set contains a first server node whose target state is a non-transfer state and at least one second server node whose target state is a transfer state, transferring the partitions of the first server node from the disk of the first server node to the disk of each of the second server nodes to complete the cleanup of the disk of the first server node; wherein the non-transfer state indicates that other partitions are not allowed to be transferred to the disk of the first server node, and the transfer state indicates that other partitions are allowed to be transferred to the disk of the second server node.
2. The method according to claim 1, characterized in that, The step of transferring the partitions of the first server node from the disk of the first server node to the disks of each of the second server nodes to complete the disk cleanup of the first server node includes: At least one first partition is determined from all partitions of the first server node, wherein the first partition is a replica; The first partitions of the first server node are transferred from the disk of the first server node to the disk of the second server node to complete the cleanup of the disk of the first server node.
3. The method according to claim 2, characterized in that, The step of transferring each of the first partitions of the first server node from the disk of the first server node to the disk of each of the second server nodes to complete the disk cleanup of the first server node includes: For each of the first partitions of the first server node, determine the first occupied space of the disk of the first server node occupied by the first partition; The following operations are performed on each of the first partitions in the first server node in descending order of the first occupied space: The first partition is transferred from the disk of the first server node to the disks of each of the second server nodes; After the transfer of the first partition is completed, if it is determined that the target state of the first server node is a transfer state, then it is determined that the disk cleanup of the first server node is completed. After the transfer of the first partition is completed, if it is determined that the target state of the first server node is still in a non-transfer state and there are still first partitions to be transferred in the first server node, the transfer process for the next first partition continues until the disk cleanup of the first server node is completed.
4. The method according to claim 3, characterized in that determining that the target state of the first server node is a transfer state includes: obtaining the disk usage information corresponding to the disk of the first server node, wherein the disk usage information includes a disk utilization rate; if the disk utilization rate corresponding to the disk of the first server node is less than a preset utilization rate threshold, and is still less than the preset utilization rate threshold after a first preset duration, determining that the target state of the first server node is a transfer state; wherein the preset utilization rate threshold is used to switch the target state of the server node.
5. The method according to claim 3, characterized in that, After completing the transfer of the first partition, the method further includes: If it is determined that the target state of the first server node is still a non-transfer state and there is no first partition to be transferred in the first server node, obtain the remaining disk space corresponding to the disk of the first server node; When the remaining disk space is less than or equal to a first space threshold, write operations on the first server node are blocked. The first space threshold represents the upper limit of the remaining disk space corresponding to the need for emergency cleanup of the server node's disk. When it is determined that the Kafka cluster has started emergency cleanup mode, emergency cleanup is performed on the partition of the first server node in the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
6. The method according to claim 1, characterized in that, after performing the step of determining the target state of each server node in the server node set based on the disk usage information corresponding to the disk of the server node, the method further includes: when the server node set contains a first server node whose target state is a non-transfer state and contains no second server node whose target state is a transfer state, performing the following operations for each first server node: obtaining the remaining disk space corresponding to the disk of the first server node; when the remaining disk space is less than or equal to a first space threshold, blocking write operations on the first server node, wherein the first space threshold is the upper limit of remaining disk space at or below which the server node's disk requires emergency cleanup; when it is determined that the Kafka cluster has started emergency cleanup mode, performing emergency cleanup on the partitions of the first server node on the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold.
7. The method according to claim 5 or 6, characterized in that the step of performing emergency cleanup on the partitions of the first server node on the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than or equal to the first space threshold includes: performing the following operations for each partition of the first server node: determining the target occupied duration and the second occupied space of the partition on the disk of the first server node; when the target occupied duration exceeds the second preset duration and / or the second occupied space exceeds the second space threshold, urgently cleaning up the partition of the first server node on the disk of the first server node so that the remaining disk space corresponding to the disk of the first server node is greater than the first space threshold; wherein the second preset duration is less than the third preset duration, the second space threshold is less than the third space threshold, the third preset duration represents the lower limit of the target occupied duration corresponding to disk cleanup when the Kafka cluster does not enable the emergency cleanup mode, and the third space threshold represents the lower limit of the second occupied space corresponding to disk cleanup when the Kafka cluster does not enable the emergency cleanup mode.
8. The method according to claim 1, characterized in that the disk usage information includes a disk utilization rate; and determining the target state of the server node based on the disk usage information corresponding to the disk of the server node includes: when the disk utilization rate corresponding to the disk of the server node is greater than or equal to a preset utilization rate threshold, determining that the target state of the server node is a non-transfer state; when the disk utilization rate corresponding to the disk of the server node is less than the preset utilization rate threshold, determining that the target state of the server node is a transfer state; wherein the preset utilization rate threshold is used to switch the target state of the server node.
9. A disk cleanup device, characterized in that it comprises: an acquisition module, used to obtain disk usage information corresponding to the disk of each server node in the server node set included in the Kafka cluster; a determination module, used to determine the target state of each server node in the server node set based on the disk usage information corresponding to the disk of the server node; and a cleanup module, used to transfer the partitions of the first server node from the disk of the first server node to the disk of each of the second server nodes when the server node set contains a first server node whose target state is a non-transfer state and at least one second server node whose target state is a transfer state, so as to complete the cleanup of the disk of the first server node; wherein the non-transfer state indicates that other partitions are not allowed to be transferred to the disk of the first server node, and the transfer state indicates that other partitions are allowed to be transferred to the disk of the second server node.
10. An electronic device, characterized in that it comprises: a processor and a memory, the processor being configured to execute a disk cleanup program stored in the memory to implement the disk cleanup method according to any one of claims 1 to 8.
11. A storage medium, characterized in that, The storage medium stores one or more programs, which can be executed by one or more processors to implement the disk cleanup method according to any one of claims 1 to 8.