From mgbaker@allspice.Berkeley.EDU Wed Aug 25 16:11:04 1993
.ls 2
.na
.LP
Fast Crash Recovery in Distributed File Systems
Mary Baker
(Professor J. K. Ousterhout)
(ARPA/NASA) NAG-2-591 and (NSF) CCR-89-00029

This project focuses on fast crash recovery for improving distributed system availability. The traditional way to improve availability is to use various forms of redundancy to mask failures; unfortunately, this approach can result in higher cost, increased complexity, and reduced performance. Using fast crash recovery, I assume that critical resources will fail, but I design them to recover so quickly that nobody is inconvenienced. This method is simple and inexpensive and has little performance overhead.

In particular, I examine the recovery of distributed state such as the file caching information maintained on file servers. Using the Sprite distributed file system, I compare the recovery speed, performance overhead, and complexity of three different fast crash recovery techniques for file servers. In the first technique, client-driven recovery, clients of the file server send their distributed state information to the server after a crash. In the second technique, server-driven recovery, the server initiates recovery with each of the clients. The last technique, transparent recovery, employs non-volatile memory on servers so that clients and servers need not communicate at all during recovery. This technique provides the fastest recovery, allowing a Sprite file server with 40 clients to recover from a crash in 20 seconds.
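
The difference between the three techniques can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not Sprite's implementation: the names `Client`, `Server`, `client_driven`, `server_driven`, `transparent`, and `simulate` are hypothetical, the distributed state is reduced to a set of open files per client, and the server-driven variant assumes the server keeps a list of its clients somewhere that survives the crash. The message counter reflects only this model's bookkeeping (one message per state transfer, plus one per server request); it is not a measurement.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    name: str
    open_files: frozenset  # this client's share of the distributed state


@dataclass
class Server:
    volatile: dict = field(default_factory=dict)     # main memory, lost in a crash
    nvram: dict = field(default_factory=dict)        # assumed to survive a crash
    client_list: list = field(default_factory=list)  # assumed to survive (e.g. on disk)
    messages: int = 0                                # recovery messages exchanged

    def open(self, client):
        """Normal operation: record the client's state in every store."""
        self.volatile[client.name] = client.open_files
        self.nvram[client.name] = client.open_files
        self.client_list.append(client.name)

    def crash(self):
        """Lose the volatile state and reset the recovery message counter."""
        self.volatile = {}
        self.messages = 0


def client_driven(server, clients):
    # Each client notices the reboot and pushes its own state to the server.
    for c in clients:
        server.messages += 1
        server.volatile[c.name] = c.open_files


def server_driven(server, clients):
    # The server contacts each client it knows about and the client replies.
    by_name = {c.name: c for c in clients}
    for name in server.client_list:
        server.messages += 2  # one request, one reply
        server.volatile[name] = by_name[name].open_files


def transparent(server):
    # State is reloaded from non-volatile memory; no client is contacted.
    server.volatile = dict(server.nvram)


def simulate(technique, n_clients):
    """Crash a server with n_clients and recover it with the given technique.

    Returns (state_fully_recovered, recovery_messages).
    """
    clients = [Client(f"c{i}", frozenset({f"/file{i}"})) for i in range(n_clients)]
    server = Server()
    for c in clients:
        server.open(c)
    expected = dict(server.volatile)
    server.crash()
    if technique == "client":
        client_driven(server, clients)
    elif technique == "server":
        server_driven(server, clients)
    elif technique == "transparent":
        transparent(server)
    else:
        raise ValueError(f"unknown technique: {technique}")
    return server.volatile == expected, server.messages


if __name__ == "__main__":
    for technique in ("client", "server", "transparent"):
        print(technique, simulate(technique, 40))
```

In this model all three techniques rebuild the same server state; they differ only in who starts the exchange and in how many messages cross the network, which is zero for the transparent variant because the state never left the server.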