Implementations are provided herein for non-disruptive upgrade, including rollback capabilities, for a
distributed file system within a cluster of nodes. To maintain availability of the file system to external clients during the upgrade process, nodes can be upgraded piecemeal; in one implementation, for example, nodes are upgraded one at a time. When a node is undergoing certain stages of the
upgrade process, external clients can be directed toward the remaining nodes of the
file system that are not currently being upgraded, including already upgraded nodes, to perform
client activity. During the upgrade process, a first subset of nodes can be running in an upgraded state while a second subset of nodes can be in a non-upgraded state, with both subsets providing access to external clients in a seamless manner. During the upgrade process, an administrator can decide to
roll back any upgrades and return the
distributed file system to its previous version (e.g., the version of the
file system prior to starting the non-disruptive upgrade process). Hooks can be provided prior to, during, and after various stages of the upgrade or rollback process; the hooks can allow services of the distributed file system to be notified of certain events of the process, or to execute service-specific processes at distinct times during the process. At the conclusion of an upgrade or rollback process, the distributed file
system can enter a committed state that finalizes the process and cements an upgrade or a rollback to a more permanent state.
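
The flow described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `Cluster`, `NodeState`, `upgrade_node`, `rollback`, `commit`, and the hook event names (`pre_upgrade`, `during_upgrade`, `post_upgrade`, `pre_rollback`, `post_rollback`, `commit`) are hypothetical and chosen only for illustration. The sketch assumes an in-memory model in which a node is upgraded one at a time, clients are directed only to nodes that are not mid-upgrade, hooks fire around each stage, and a commit blocks further upgrade or rollback.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class NodeState(Enum):
    NON_UPGRADED = "non_upgraded"
    UPGRADING = "upgrading"
    UPGRADED = "upgraded"


# A hook receives the event name and the node name (hypothetical signature).
Hook = Callable[[str, str], None]


@dataclass
class Cluster:
    nodes: Dict[str, NodeState]
    committed: bool = False
    hooks: Dict[str, List[Hook]] = field(default_factory=dict)

    def on(self, event: str, callback: Hook) -> None:
        """Register a service callback for a named upgrade/rollback event."""
        self.hooks.setdefault(event, []).append(callback)

    def _fire(self, event: str, node: str) -> None:
        for callback in self.hooks.get(event, []):
            callback(event, node)

    def available_nodes(self) -> List[str]:
        """Nodes that external clients may be directed to right now.

        Both upgraded and non-upgraded nodes serve clients; only a node
        that is mid-upgrade is excluded.
        """
        return [n for n, s in self.nodes.items() if s is not NodeState.UPGRADING]

    def upgrade_node(self, name: str) -> None:
        """Upgrade a single node while the rest of the cluster keeps serving."""
        if self.committed:
            raise RuntimeError("process already committed")
        self._fire("pre_upgrade", name)
        self.nodes[name] = NodeState.UPGRADING
        self._fire("during_upgrade", name)
        self.nodes[name] = NodeState.UPGRADED
        self._fire("post_upgrade", name)

    def rollback(self) -> None:
        """Return every upgraded node to the previous version."""
        if self.committed:
            raise RuntimeError("process already committed")
        for name, state in self.nodes.items():
            if state is NodeState.UPGRADED:
                self._fire("pre_rollback", name)
                self.nodes[name] = NodeState.NON_UPGRADED
                self._fire("post_rollback", name)

    def commit(self) -> None:
        """Finalize the current state; no further upgrade or rollback."""
        self.committed = True
        self._fire("commit", "")
```

In this sketch, a service would call `cluster.on("pre_upgrade", callback)` to be notified before a node leaves service, and a client-facing component would consult `available_nodes()` to choose where to direct requests. For brevity, the rollback step here flips node state in one step; a fuller model would likely take each node out of service during rollback as well.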