[0042] The present invention will be described in detail below with reference to specific embodiments.
[0043] In this embodiment, the method of the present invention is described in detail using as an example a two-machine cluster whose shared storage consists of 24 disks.
[0044] One: Take one of the two machines as the master node and create on it a RAID storage pool over the 24 shared disks, then create a raw device volume and a file system data set on the pool. The raw device volume is mapped externally, for example over Fibre Channel or iSCSI as in the experiment described below (not limited to these two methods), and the file system data set provides external access through the CIFS or NFS protocol (not limited to these two methods). When the RAID storage pool is created, a region is reserved on each disk selected for heartbeat communication; this region serves as the heartbeat disk and stores the heartbeat information. In the present embodiment, a 1 GB space beginning 4 MB from the start of the disk is set aside as the heartbeat information space of the heartbeat disk;
[0045] Divide the space of the heartbeat disk into two parts. As shown in Figure 1, the first 500 MB is the information area of node 1 and stores the heartbeat information of node 1, and the last 500 MB is the information area of node 2 and stores the heartbeat information of node 2. The heartbeat information itself occupies only part of each area; the extra reserved space is used mainly when a bad block in the current heartbeat information area can no longer be read or written, in which case a new space is carved out of the reserve to serve as the valid heartbeat information area. As shown in Figure 2, each information area includes a cluster super block, a write super block, a message area, and a reserved area. Node 1 and node 2 each read the other's heartbeat information to determine whether the other party has really failed.
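The layout described above can be sketched as a small offset calculation. This is only an illustration: the constant and function names are hypothetical, and reading "MB" and "GB" as binary units is an assumption not stated in the embodiment.

```python
MIB = 1024 * 1024

# Sizes taken from the embodiment; binary units are an assumption.
HEARTBEAT_OFFSET = 4 * MIB      # heartbeat space begins 4 MB into the disk
HEARTBEAT_SPACE = 1024 * MIB    # 1 GB set aside for heartbeat information
NODE_AREA_SIZE = 500 * MIB      # size of each node's information area


def info_area_offset(node: int) -> int:
    """Return the absolute byte offset of a node's information area."""
    if node == 1:
        # Node 1 uses the first 500 MB of the heartbeat space.
        return HEARTBEAT_OFFSET
    if node == 2:
        # Node 2 uses the last 500 MB of the heartbeat space.
        return HEARTBEAT_OFFSET + HEARTBEAT_SPACE - NODE_AREA_SIZE
    raise ValueError("this sketch covers a two-node cluster only")
```

Under these assumptions the two areas do not overlap, and a small gap remains between them because two 500 MB areas do not fill the whole 1 GB space.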
[0046] Two: Selection of the heartbeat disks. Because the disk on which a heartbeat area resides may itself fail, every system disk could be selected as a heartbeat disk to ensure high availability of the system. Considering space utilization, however, the initial design creates heartbeat areas on at least two disks. The business access volume of a storage system is usually large, so to keep business IO and heartbeat IO from affecting each other, the design preferentially selects disks with low business load, such as idle disks or hot spare disks;
[0047] Three: IO optimization of the heartbeat information. To avoid interfering with business IO and to maximize system performance, heartbeat information is written to disk only when the heartbeat network fails. Heartbeat information is written to the first heartbeat disk; if the first heartbeat disk is abnormal (damaged, manually unplugged, or replaced), it is written to the second heartbeat disk, and so on through the last heartbeat disk. Reading mirrors writing: only when the heartbeat network fails does each node read heartbeat information, first from the first heartbeat disk, then, if that disk fails, from the second, and so on. The order in which the heartbeat disks are enabled can follow either an order preset by the user or an algorithm based on current disk IO. When the heartbeat network is restored, the nodes resume transmitting heartbeat messages over the heartbeat network in the original working mode of the two-machine cluster and stop using the heartbeat disk, that is, they stop writing messages to and reading messages from the heartbeat disk.
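The fall-through order among heartbeat disks can be illustrated with a minimal sketch. The function name and the use of `OSError` to signal an abnormal disk are assumptions made for this example; the embodiment does not specify how a disk fault is detected.

```python
from typing import Callable, Iterable, Optional


def write_with_failover(
    disks: Iterable[str],
    write: Callable[[str], None],
) -> Optional[str]:
    """Try each heartbeat disk in the enabled order.

    Returns the disk that accepted the write, or None if every disk failed.
    `write` is a caller-supplied function (hypothetical) that raises OSError
    when the disk is damaged, unplugged, or being replaced.
    """
    for disk in disks:
        try:
            write(disk)
            return disk
        except OSError:
            # This disk is abnormal; fall through to the next one.
            continue
    return None
```

The same loop shape applies to reading. The order of `disks` would come from the user's preset order or from whatever IO-based selection algorithm is in use.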
[0048] Four: Information area design. As shown in Figure 2, each information area stores the following:
[0049] The cluster super block stores the cluster identification information, which indicates whether this information area belongs to the current cluster. Before reading a message, each node first uses this information to determine whether the disk is one of its own heartbeat disks. In this embodiment, the cluster super block contains three items: the heartbeat disk label, the cluster name, and the cluster UUID. A version number can also be added to improve the backward compatibility of the heartbeat disk.
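As an illustration of the cluster super block check, the following sketch packs the three items and the optional version number into a fixed-size record and compares them on read. The field widths, byte order, and all names are hypothetical; the embodiment does not define an on-disk encoding.

```python
import struct
import uuid
from dataclasses import dataclass

# Hypothetical layout: 16-byte label, 32-byte name, 16-byte UUID, 4-byte version.
_FORMAT = "<16s32s16sI"


@dataclass(frozen=True)
class ClusterSuperBlock:
    label: bytes
    cluster_name: str
    cluster_uuid: uuid.UUID
    version: int = 1

    def pack(self) -> bytes:
        return struct.pack(
            _FORMAT,
            self.label,
            self.cluster_name.encode("utf-8"),
            self.cluster_uuid.bytes,
            self.version,
        )

    @classmethod
    def unpack(cls, raw: bytes) -> "ClusterSuperBlock":
        size = struct.calcsize(_FORMAT)
        label, name, uid, version = struct.unpack(_FORMAT, raw[:size])
        return cls(
            label.rstrip(b"\0"),
            name.rstrip(b"\0").decode("utf-8"),
            uuid.UUID(bytes=uid),
            version,
        )


def is_own_heartbeat_disk(raw: bytes, expected: ClusterSuperBlock) -> bool:
    """Return True only if label, cluster name, and UUID all match."""
    try:
        found = ClusterSuperBlock.unpack(raw)
    except (struct.error, UnicodeDecodeError):
        return False  # too short or unreadable: not a valid super block
    return (found.label, found.cluster_name, found.cluster_uuid) == (
        expected.label,
        expected.cluster_name,
        expected.cluster_uuid,
    )
```

The version number is deliberately left out of the comparison here so that a newer node could still recognise an older disk; whether that is the desired compatibility policy is a design choice outside this sketch.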
[0050] The write super block records the number and time of the heartbeat messages written so far and includes at least two items, the count SEQ and a timestamp. SEQ is the total number of heartbeat messages written to date. The message storage area in the heartbeat disk has a fixed size; once heartbeat messages fill it, writing wraps around to the beginning. The SEQ recorded in the write super block does not restart from 0 at that point but keeps increasing. Let M be the number of messages the area holds when full. The position index of the heartbeat message to be written is the remainder of SEQ divided by M, that is, SEQ % M, so whenever SEQ is an integer multiple of M the message is written at the beginning again. The timestamp records the time at which the last heartbeat message was written. Whenever a node writes a message, it obtains the write position from SEQ and, after the write succeeds, updates SEQ to SEQ + 1 and sets the timestamp to the current time;
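The wrap-around rule can be shown with a short sketch. The class and function names are hypothetical; only the SEQ % M rule and the update of SEQ and the timestamp after a successful write come from the description.

```python
from dataclasses import dataclass


@dataclass
class WriteSuperBlock:
    seq: int = 0            # total number of messages written so far (SEQ)
    timestamp: float = 0.0  # time of the last successful write


def next_slot(sb: WriteSuperBlock, capacity: int) -> int:
    """Slot index for the next message: SEQ % M, where M is `capacity`."""
    return sb.seq % capacity


def record_write(sb: WriteSuperBlock, now: float) -> None:
    """Call only after the message write succeeded."""
    sb.seq += 1
    sb.timestamp = now
```

With a capacity of 4, six consecutive writes land in slots 0, 1, 2, 3, 0, 1 while SEQ grows monotonically to 6, which lets a reader tell new messages from old ones even after the area has wrapped.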
[0051] The message area stores the actual messages, each consisting of a check code, a timestamp, and the message content. The check code is a digest of the message content and is used on reading to check the integrity of the content, ensuring that the information has not been altered after being written. The timestamp is the system time at the moment of writing and provides a second check, ensuring that the information has not expired. The message content records in detail the heartbeat information written by the node, such as the current state of the node, the node name, and the running time of the node. In this embodiment, the check code is obtained with a CRC check algorithm.
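A message record of this kind might be encoded and validated as follows. This is a sketch under stated assumptions: CRC-32 (via Python's `zlib.crc32`) is chosen as one possible CRC, and the header layout and function names are hypothetical.

```python
import struct
import zlib
from typing import Optional

# Hypothetical header: CRC-32 of the content, write timestamp, content length.
_HEADER = "<IdH"
_HEADER_SIZE = struct.calcsize(_HEADER)


def pack_message(content: bytes, timestamp: float) -> bytes:
    """Encapsulate content into a message record."""
    crc = zlib.crc32(content)
    return struct.pack(_HEADER, crc, timestamp, len(content)) + content


def unpack_message(raw: bytes, now: float, max_age: float) -> Optional[bytes]:
    """Return the content if the record is intact and fresh, else None."""
    if len(raw) < _HEADER_SIZE:
        return None
    crc, timestamp, length = struct.unpack(_HEADER, raw[:_HEADER_SIZE])
    content = raw[_HEADER_SIZE:_HEADER_SIZE + length]
    if len(content) != length or zlib.crc32(content) != crc:
        return None  # integrity check failed
    if abs(now - timestamp) > max_age:
        return None  # message has expired
    return content
```

Here `max_age` plays the role of the allowed time difference between the message timestamp and the current system time; both checks must pass before the content is used.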
[0052] A disk has a finite lifespan. Continuously reading and writing the same area of a disk can easily damage it, producing bad blocks in that area that can no longer be read or written. The reserved area is set aside so that the information area remains usable when a bad block occurs. For each node, the initial size of the reserved area is 496 MB. Once a bad block occurs in the space currently used for writing messages, a second 4 MB space is taken from the reserved area and used instead, and so on until the last block of space has been used, thereby extending the availability of the heartbeat disk.
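The use of the reserved area can be sketched as a simple allocator over 4 MB blocks. The assumptions here are that each 500 MB information area is treated as 125 consecutive 4 MB blocks, that the first block is the one initially in use, and that a failed block is never reused; the class and method names are hypothetical.

```python
MIB = 1024 * 1024
AREA_SIZE = 500 * MIB                    # one node's information area
BLOCK_SIZE = 4 * MIB                     # space in use at any one time
TOTAL_BLOCKS = AREA_SIZE // BLOCK_SIZE   # 125: 1 active + 124 in reserve


class ActiveBlock:
    """Tracks which 4 MB block of the information area is in use."""

    def __init__(self) -> None:
        self.index = 0  # block 0 is used first; the other 496 MB are reserve

    def offset(self) -> int:
        """Byte offset of the active block within the information area."""
        return self.index * BLOCK_SIZE

    def retire(self) -> bool:
        """Move to the next reserved block after a bad block is detected.

        Returns False when the reserve is exhausted.
        """
        if self.index + 1 >= TOTAL_BLOCKS:
            return False
        self.index += 1
        return True
```

Under these assumptions the reserve allows 124 replacements before the heartbeat area on that disk can no longer be used.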
[0053] When the heartbeat network cable fails, the heartbeat disk starts to work. As shown in Figure 4, the node first writes the cluster super block according to the information of the cluster to which it belongs and initializes the write super block. It then writes a message to its own information area at intervals of the preset write time threshold T1. Before each write, the message to be sent is packed, that is, encapsulated into the message structure, and then written to the message area. The write position is determined by SEQ in the write super block and the size M of the message area; when SEQ is an integer multiple of M, writing starts over from the beginning and overwrites the old messages. After the write, SEQ in the write super block is updated to SEQ + 1;
[0054] To read the other party's messages, a node first reads the cluster super block of the information area belonging to the other node. Once that block is verified as valid, the node initializes its local count of messages read, sequence, to 0 and then polls the other party's write super block. If SEQ in that super block is greater than the node's local sequence, the node reads the messages numbered sequence through SEQ-1 from the other party's message area and updates sequence to SEQ. For each heartbeat message read, the check code and timestamp are verified; if the message fails the integrity check, or if its timestamp differs from the current system time by more than a set time difference T, the message is invalid and is discarded. If no valid message from the other party is read within the preset reading time interval T2, the other node is judged to have failed and the node begins taking over its resources and services; otherwise the other node is judged to be valid and the node continues reading heartbeat information.
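The reading procedure can be sketched as a small polling state machine. All names are hypothetical. The sketch assumes a caller-supplied `read_slot` function that returns a message's timestamp when the integrity check passes and `None` otherwise, and it skips messages that have already been overwritten when the reader falls more than M messages behind, a detail the description does not address.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PeerReader:
    capacity: int          # M: number of message slots in the peer's area
    max_age: float         # T: allowed gap between message time and now
    takeover_after: float  # T2: time without a valid message before takeover
    last_valid: float      # time monitoring started, or of the last valid message
    sequence: int = 0      # number of peer messages already consumed

    def poll(
        self,
        peer_seq: int,
        read_slot: Callable[[int], Optional[float]],
        now: float,
    ) -> bool:
        """Process new peer messages; return True while the peer counts as alive."""
        if peer_seq > self.sequence:
            # Assumption: messages older than one full wrap are gone, skip them.
            start = max(self.sequence, peer_seq - self.capacity)
            for n in range(start, peer_seq):
                ts = read_slot(n % self.capacity)
                if ts is not None and abs(now - ts) <= self.max_age:
                    self.last_valid = now
            self.sequence = peer_seq
        # A False result means the caller should begin taking over resources.
        return (now - self.last_valid) <= self.takeover_after
```

A caller would invoke `poll` at each polling interval with the SEQ value read from the peer's write super block and start the takeover once it returns `False`.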
[0055] Experimental results:
[0056] In this experiment, a two-machine cluster is built as shown in Figure 3. The back end is a SAS JBOD with 24 SATA hard disks that supports link redundancy. The two front-end nodes are each connected to the SAS JBOD through a SAS card with one SAS cable, forming the heartbeat disk link. The 24 disks are used for reading and writing, and the tests are run on them. The two nodes transmit heartbeat information over a heartbeat network cable. On each node, 11 disks are selected to create a RAID5 storage pool, and the remaining two disks serve as hot spare disks, one for each storage pool. The disk space that the heartbeat disk needs to write is reserved when the RAID is created, and a reminder is given if it conflicts with other reserved space. The storage pools are POOLA and POOLB, respectively. The two hot spare disks are selected as heartbeat disks, and a raw device and a file system data set are created on each storage pool. The raw devices are mapped externally, POOLA on node 1 and POOLB on node 2, over one or more of Fibre Channel and iSCSI. The file system data sets provide external access, on the node where the storage pool is located, through one or more of the CIFS, NFS, HTTP, HTTPS, and FTP protocols.
[0057] When the heartbeat network works normally, the storage pool POOLA belongs to node 1 and the storage pool POOLB belongs to node 2. Unplugging the heartbeat network cable causes the heartbeat network to fail. The heartbeat control module then starts the heartbeat information writing module on nodes 1 and 2, which writes heartbeat information to the heartbeat disk at the interval T1, and starts the heartbeat information reading module on each node, which polls the heartbeat disk for heartbeat information at the interval T2. The two nodes continue to communicate through the heartbeat disk, and IO on the service host side is not interrupted. If one heartbeat disk is then pulled out, the system still runs normally and heartbeat information is still transmitted successfully through the heartbeat disk, which enhances the availability of the cluster and demonstrates the effectiveness of the present invention.
[0058] The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art can, without departing from the principles of the present invention, make several improvements or replace some technical features with equivalents, and these improvements and replacements should also be regarded as falling within the protection scope of the present invention.