The invention provides a storage system, and a method for operating a storage system, that enables relatively rapid and reliable takeover among a plurality of independent file servers. Each
file server maintains a reliable communication path to the others. Each
file server maintains its own state in reliable memory. Each
file server regularly confirms the state of the other file servers. Each file
server labels messages on the redundant communication paths, so as to allow other file servers to combine the redundant communication paths into a single ordered
stream of messages. Each file
server maintains its own state in its persistent memory and compares that state with the ordered
stream of messages, so as to determine whether other file servers have progressed beyond the file
server's own last known state. Each file server uses the shared resources (such as
magnetic disks) themselves as part of the redundant communication paths, so as to prevent mutual attempts at takeover of resources when each file server believes the other to have failed. Each file server provides a
status report to the others when recovering from an error, so as to prevent the possibility of multiple file servers each repeatedly failing and attempting to seize the resources of the others.
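
As an illustration only, the message labeling and state comparison described above might be sketched as follows. The sketch assumes, hypothetically, that each label is a per-sender sequence number, so that copies of a message arriving on different paths can be recognized and collapsed. The names Message, merge_paths, and peer_has_progressed are illustrative and are not taken from the invention's description.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Message:
    sender: str   # hypothetical identifier of the sending file server
    seq: int      # label: per-sender sequence number (assumed scheme)
    payload: str


def merge_paths(*paths: Iterable[Message]) -> List[Message]:
    """Combine redundant paths into a single ordered stream.

    Messages carrying the same (sender, seq) label are treated as
    copies of one message; only the first copy seen is kept.
    """
    seen = {}
    for path in paths:
        for msg in path:
            seen.setdefault((msg.sender, msg.seq), msg)
    # Order by sender, then by sequence number within each sender.
    return [seen[key] for key in sorted(seen)]


def peer_has_progressed(stream: List[Message], peer: str,
                        last_known_seq: int) -> bool:
    """Return True if the stream holds a message from `peer` that is
    newer than the sequence number recorded in persistent memory."""
    return any(m.sender == peer and m.seq > last_known_seq for m in stream)


if __name__ == "__main__":
    # Two redundant paths: a direct interconnect and a shared-disk path.
    interconnect = [Message("B", 1, "up"), Message("B", 2, "up")]
    shared_disk = [Message("B", 2, "up"), Message("B", 3, "taking over")]

    stream = merge_paths(interconnect, shared_disk)
    print([m.seq for m in stream])               # [1, 2, 3]
    print(peer_has_progressed(stream, "B", 2))   # True
```

In this sketch the shared disk is modeled as just another message source, so a server that has lost the direct interconnect can still observe that its peer is alive and has moved past the state the server last recorded.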