An apparatus for and method of enhancing reliability within a cluster lock
processing system having a relatively large number of commodity cluster instruction processors which are managed by a cluster
lock manager. Because the commodity processors have virtually no
system viability features, such as memory protection and failure
recovery, the cluster/lock processors assume responsibility for providing these functions. The low cost of the commodity cluster instruction processors makes the
system almost linearly scalable. The cluster/locking, caching, and
mass storage accessing functions are fully integrated into a single hardware platform, which performs the role of the cluster/lock master. Upon failure of this hardware platform, a second, redundant hardware platform converts from the slave role to the master role. The logic for failure detection and role swapping is placed within
software, which can run as an application under a commonly available
operating system. Furthermore, the
recovery is accomplished entirely without the assistance of the Host computer(s) or the ultimate user(s) coupled via the Host computer(s). Following repair, the failed
server is restarted in an orderly fashion to resume a slave role. For the
server to be completely restored, coherent memory must be copied from master to slave. Because cluster lock
processing must be paused throughout the system while the copy is transferred, the transfer time must be kept short to limit the
impact on system throughput.
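
The slave-to-master role swap and the later memory resynchronization described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not say how failure is detected, so the heartbeat timeout used here is an assumption, and the names `LockServer`, `check_peer`, and `resync_slave` are hypothetical, as is modeling coherent memory as a dictionary.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"


@dataclass
class LockServer:
    """Hypothetical cluster/lock server that tracks its peer's liveness."""
    name: str
    role: Role
    last_heartbeat_seen: float = 0.0  # time the peer's last heartbeat arrived

    def record_peer_heartbeat(self, now: float) -> None:
        self.last_heartbeat_seen = now

    def check_peer(self, now: float, timeout: float) -> bool:
        """Promote a slave to master when the peer has been silent too long.

        Assumption: failure is detected by a missed-heartbeat timeout.
        Returns True only if a role change happened.
        """
        peer_silent = (now - self.last_heartbeat_seen) > timeout
        if self.role is Role.SLAVE and peer_silent:
            self.role = Role.MASTER
            return True
        return False


def resync_slave(master_memory: dict, slave_memory: dict) -> int:
    """Overwrite the repaired slave's state with the master's coherent state.

    Lock processing is assumed to be paused by the caller for the duration
    of this copy. Returns the number of entries copied.
    """
    slave_memory.clear()
    slave_memory.update(master_memory)
    return len(master_memory)
```

In this sketch the role change is decided entirely by the surviving server, which mirrors the statement that recovery needs no assistance from the Host computer(s). The pause required by `resync_slave` grows with the amount of state copied, which is why the transfer time matters for throughput.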