"If we knew what it was we were doing, it would not be called research, would it?"

Albert Einstein
As a graduate student at the University of Illinois, I have been working in the area of software fault tolerance since 1997. We are involved in a project called Chameleon, a software infrastructure that provides fault tolerance services in a distributed environment to a wide range of applications - from off-the-shelf applications to applications customized for the Chameleon environment. The infrastructure is adaptive: it can meet the different availability requirements of different applications, as well as the changing requirements of a single application. The software runs on standard workstations or PCs with off-the-shelf operating systems such as Sun Solaris or Windows NT, and does not require any specialized hardware. The architecture of Chameleon allows it to provide its services in a network of computing nodes that may be heterogeneous and may not be intrinsically fault tolerant.

The project is sponsored by the Jet Propulsion Laboratory, a NASA laboratory managed by Caltech and based in Pasadena, CA. For my MS thesis, I built the first prototype of the system and demonstrated its ability to support multiple classes of applications. For my Ph.D. thesis, I explored the error detection protocols in Chameleon.

Error detection is a key ingredient in building reliable distributed systems, where the errors to be tolerated can be as varied as hardware errors, system software errors, application errors, errors in the software infrastructure itself, or malicious security attacks. In the networked environment, application error detection has been fairly well studied, but infrastructure error detection is less well understood. The currently employed techniques, timeouts or exceptions raised by the operating system, often cannot provide fault containment to a reasonable level. Existing systems often simply assume fail silence of nodes and processes, though field studies and our fault injection studies have repeatedly shown that in a distributed environment executing on off-the-shelf hardware components, the fail silence assumption is often violated. Hence, there is a need to explore more sophisticated detection techniques for the software implemented fault tolerance (SIFT) infrastructure.

As part of my doctoral thesis work, I proposed a hierarchy of error detection techniques that can be applied in a distributed SIFT environment. Within this hierarchy, I explored novel techniques for making a process self-checking and a node fail-silent. The levels in the hierarchy were designed to enforce different fault containment boundaries - a process, a node, or a replication group. The protocols for escalating error conditions from one level to another, which optimize the detection scheme and minimize error propagation, were defined.
The error detection framework was implemented and demonstrated on the Chameleon testbed, though the principles are of general applicability in a message-passing-based distributed system.
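
As a rough illustration only, the idea of escalating an error through the containment boundaries named above (process, node, replication group) can be sketched as follows. This is not the Chameleon protocol: the names `Level`, `ErrorReport`, `HANDLERS`, and `escalate`, as well as the example error descriptions, are hypothetical and were chosen just for this sketch.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, Dict, List, Optional


class Level(IntEnum):
    """Fault containment boundaries from the text, innermost first."""
    PROCESS = 1
    NODE = 2
    REPLICATION_GROUP = 3


@dataclass
class ErrorReport:
    source: str        # hypothetical identifier of the failing entity
    description: str   # hypothetical free-form error description
    trail: List[Level] = field(default_factory=list)  # levels that saw the error


# A handler returns True if it contained the error at its own level.
Handler = Callable[[ErrorReport], bool]

# Illustrative handlers (assumptions): each level contains one kind of error.
HANDLERS: Dict[Level, Handler] = {
    Level.PROCESS: lambda r: r.description == "assertion failed",
    Level.NODE: lambda r: r.description == "process crashed",
    Level.REPLICATION_GROUP: lambda r: r.description == "node unresponsive",
}


def escalate(report: ErrorReport,
             handlers: Dict[Level, Handler]) -> Optional[Level]:
    """Offer the error to each level in order, innermost first.

    Returns the first level that contains the error, or None if the
    error escapes every level.
    """
    for level in sorted(handlers):
        report.trail.append(level)
        if handlers[level](report):
            return level
    return None


if __name__ == "__main__":
    report = ErrorReport(source="p1", description="process crashed")
    contained_at = escalate(report, HANDLERS)
    print(contained_at, report.trail)
```

In this sketch an error is passed outward only when the inner level cannot contain it, which is one simple way to keep the containment boundary as small as possible.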

If you would like to obtain a distribution of Chameleon, please send us an email by clicking here.
The project is led by Professor Ravishankar K. Iyer. Dr. Zbigniew Kalbarczyk has been involved with the project since its inception in 1997 and continues to be a driving force. Various current and past graduate students have been involved with the project. Current: Keith Whisnant, Dheeraj Ahuja, and Claudio Basile.

Chameleon links

Publications (The links are in chronological order, from earliest to latest. All links are PostScript files unless otherwise mentioned.)

Presentations

1. 24th April, 1997

2. 4th February, 1998

3. 8th April, 1998 (ICAP, UIUC)

4. 16th October, 1998 (Jet Propulsion Lab, Pasadena, California)

5. 25th October, 1998 (Symposium on Reliable Distributed Systems 98, Purdue University, Indiana)

6. 17th November, 1998 (CRHC Seminar, UIUC) Powerpoint file

7. 16th February, 1999 Powerpoint file (Research Group Seminar given to Prof. Algirdas Avizienis, Center for Reliable & High Performance Computing, U. of Illinois)

8. 22nd March, 1999 Powerpoint file

9. 6th April, 1999 Powerpoint file Posters presented at ICAP (Industrial Computer Affiliates Program), University of Illinois, April, 1999.

10. 21st July, 1999 Powerpoint file Presentation given to Motorola internally at CRHC, UIUC. It contains preliminary availability and coverage measures from the error detection protocols, obtained through fault injection experiments.

11. 17th April, 2000 Talk given by Keith Whisnant to Jim Gray of Microsoft Research that describes the high-level philosophy of Chameleon and the latest work on micro-checkpointing.
17th April, 2000 Talk given by me that presents results of the evaluation of the pre-emptive control flow signature technique. It contains a table that compares the detection techniques available in existing SIFT systems.

12. 5th May, 2000 Presentation that contains the latest work on control flow signatures and presents results from the evaluation of the signature schemes.

13. 13th September, 2000 A short summary of the rationale for error detection, with the latest fault injection and performance results from the pre-emptive control flow signature (PECOS) scheme.

14. 1st November, 2000 The slides from my final Ph.D. thesis defense. In the talk, I focus on the work on control flow error detection. It presents the motivation, the techniques, and a detailed evaluation on two large-scale commercial applications - a Dynamic Host Configuration Protocol (DHCP) application and a call processing application for a mobile environment. Here are the backup slides, which contain useful material that could not be accommodated in the 90-minute presentation.

Thesis

The topic of my Ph.D. thesis is Hierarchical Error Detection Protocols in a Software Implemented Fault Tolerance (SIFT) Environment. I defended my thesis on November 1st, 2000. I will be submitting the thesis shortly (end-November, early-December). Here is an abstract of the thesis. And here is the full PostScript version.

I defended the proposal at my preliminary examination on 10th June, 1999. Here is the proposal. And here are the slides from my prelim talk.

The topic of my Master's thesis is Chameleon: A Software Infrastructure for Adaptive Fault Tolerance in Distributed Systems. Here is an abstract of the thesis. And here is the full PostScript version.

My presentations for the Qualifying examination (September 1997) are here:

  1. Methodology for Adapting to Patterns of Faults. By Gul Agha and Daniel Sturman. From Foundations of Ultradependable Computing, 1994.
  2. Refinement for Fault-Tolerance: An Aircraft Hand-off Protocol. By Keith Marzullo, Fred Schneider & Jon Dehn. From Foundations of Ultradependable Computing, 1994.

Saurabh Bagchi
Last modified: January 26, 2003