A method and system for achieving highly available, fault-tolerant execution of components in a
distributed computing system, without requiring the writer of these components to explicitly write code (such as entity beans or
database transactions) to make component state persistent. This is achieved by converting the intrinsically non-deterministic behavior of the distributed
system to a deterministic behavior, thus enabling state
recovery to be achieved by efficient checkpoint-replay techniques. The method comprises: adapting the execution environment to enable message communication among the components; and automatically associating a deterministic
timestamp with each message to be communicated from a sender component to a
receiver component during program execution, the
timestamp being representative of the
estimated time of arrival of the message at the
receiver component. The method further comprises, at each component, tracking the state of that component during program execution and periodically checkpointing the state to a local storage device. Upon failure of a component, the component state is restored by recovering a recent stored checkpoint and re-executing the events occurring since that checkpoint. The system is deterministic because the repeated execution of the receiving component
processes the messages in the same order as their associated timestamps.
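
The checkpoint-replay scheme described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `Message`, `stamp`, and `Receiver` are hypothetical, the timestamp is assumed to be a virtual send time plus a fixed estimated delay, the checkpoint is held in memory rather than on a local storage device, and the full message log is assumed to remain available for replay.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass(order=True, frozen=True)
class Message:
    timestamp: int                       # deterministic estimated arrival time
    sender: str                          # assumed tie-breaker for equal timestamps
    payload: int = field(compare=False)  # excluded from ordering


def stamp(sender: str, virtual_send_time: int, estimated_delay: int, payload: int) -> Message:
    """Attach a timestamp computed only from deterministic inputs (no wall clock)."""
    return Message(virtual_send_time + estimated_delay, sender, payload)


class Receiver:
    """Hypothetical receiver whose state depends only on timestamp order."""

    def __init__(self, checkpoint_every: int = 2) -> None:
        self.state: Dict[str, Any] = {"value": 0}
        self.processed = 0
        self.checkpoint_every = checkpoint_every
        # Assumption: an in-memory copy stands in for the local storage device.
        self.checkpoint: Tuple[Dict[str, Any], int] = (copy.deepcopy(self.state), 0)

    def _apply(self, msg: Message) -> None:
        # Order-sensitive update: a different processing order gives a different state.
        self.state["value"] = self.state["value"] * 10 + msg.payload
        self.processed += 1

    def process(self, log: List[Message]) -> None:
        # Process in timestamp order, skipping messages already reflected in the state.
        for msg in sorted(log)[self.processed:]:
            self._apply(msg)
            if self.processed % self.checkpoint_every == 0:
                self.checkpoint = (copy.deepcopy(self.state), self.processed)

    @classmethod
    def recover(cls, checkpoint: Tuple[Dict[str, Any], int], log: List[Message],
                checkpoint_every: int = 2) -> "Receiver":
        # Restore the saved state, then replay only the events after the checkpoint.
        restored = cls(checkpoint_every)
        restored.state = copy.deepcopy(checkpoint[0])
        restored.processed = checkpoint[1]
        restored.checkpoint = checkpoint
        restored.process(log)
        return restored


# Messages listed in an arbitrary physical arrival order.
log = [
    stamp("B", 5, 2, 3),
    stamp("A", 1, 2, 1),
    stamp("A", 4, 3, 4),
    stamp("B", 2, 3, 2),
    stamp("A", 8, 1, 5),
]

primary = Receiver()
primary.process(log)

# Simulated failure: rebuild the component from its last checkpoint and the log.
recovered = Receiver.recover(primary.checkpoint, log)
print(recovered.state == primary.state)
```

Because the processing order is derived from the timestamps rather than from physical arrival order, replaying the log after restoring the checkpoint reproduces the state the component held before the failure.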