Research (UIUC)

"If we knew what it was we were doing, it would not be called research, would it?"

Albert Einstein

As a graduate student at the University of Illinois, I have been working in the area of software fault tolerance since 1997. We are involved in a project called Chameleon, a software infrastructure that provides fault tolerance services in a distributed environment to a wide range of applications, from off-the-shelf applications to applications customized for the Chameleon environment. The infrastructure is adaptive: it can meet the different availability requirements of different applications, as well as the changing requirements of a single application. The software runs on standard workstations or PCs with off-the-shelf operating systems such as Sun Solaris or Windows NT, and does not require any specialized hardware. Chameleon's architecture allows it to provide its services in a network of computing nodes that may be heterogeneous and may not be intrinsically fault tolerant.

The project is sponsored by the Jet Propulsion Laboratory, which is an organizational wing of NASA and is based at Caltech, Pasadena, CA. For my MS thesis, I built the first prototype of the system and demonstrated its capability to support multiple classes of applications. For my Ph.D. thesis, I explored the error detection protocols in Chameleon. Error detection is a key ingredient in building reliable distributed systems, where the errors to be tolerated can be as varied as hardware errors, system software errors, application errors, errors in the software infrastructure itself, or malicious security attacks. In the networked environment, application error detection has been fairly well studied, but infrastructure error detection is less well understood. The techniques currently employed, timeouts and exceptions raised by the operating system, often cannot provide fault containment to a reasonable level. Existing systems often simply assume fail silence of nodes and processes, although field studies and our fault injection studies have repeatedly shown that this assumption is often violated in a distributed environment executing on off-the-shelf hardware components. Hence, there is a need to explore more sophisticated detection techniques for a software implemented fault tolerance (SIFT) infrastructure. As part of my doctoral thesis work, I proposed a hierarchy of error detection techniques that can be applied in a distributed SIFT environment. Within this hierarchy, I explored novel techniques for making a process self-checking and a node fail-silent. The levels in the hierarchy were designed to enforce different fault containment boundaries: a process, a node, or a replication group. I also defined the protocols for escalating error conditions from one level to another, in order to optimize the detection scheme and minimize error propagation.
The error detection framework was implemented and demonstrated on the Chameleon testbed, though the principles apply generally to message-passing-based distributed systems.
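
The idea of escalating an error from one containment boundary to the next can be illustrated with a minimal sketch. This is not the Chameleon implementation: the three levels come from the description above, but the error kinds, the DETECTABLE table, and the escalate function are hypothetical names invented for illustration, and the real protocols are described in the papers listed below.

```python
from enum import IntEnum


class Level(IntEnum):
    """Fault containment boundaries named in the text, lowest first."""
    PROCESS = 1
    NODE = 2
    REPLICATION_GROUP = 3


# Hypothetical mapping from each level to the error kinds its detector can
# contain. The error kinds are invented for illustration only.
DETECTABLE = {
    Level.PROCESS: {"bad_control_flow", "assertion_failure"},
    Level.NODE: {"process_crash", "process_hang"},
    Level.REPLICATION_GROUP: {"node_crash", "replica_mismatch"},
}


def escalate(error_kind):
    """Return the escalation path and the level that contains the error.

    Levels are tried from the smallest boundary upward; an error that a
    level cannot contain is passed to the next level. The containing level
    is None if no level recognizes the error.
    """
    path = []
    for level in sorted(DETECTABLE):
        path.append(level)
        if error_kind in DETECTABLE[level]:
            return path, level
    return path, None


if __name__ == "__main__":
    for kind in ("assertion_failure", "process_hang", "node_crash", "unknown"):
        path, handled_at = escalate(kind)
        names = " -> ".join(level.name for level in path)
        where = handled_at.name if handled_at else "no level"
        print(f"{kind}: tried {names}; contained at {where}")
```

Trying the smallest boundary first reflects the stated goal of minimizing error propagation: an error is handed to a wider boundary only when the narrower one cannot contain it.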

If you would like to obtain a distribution of Chameleon, please send us an email by clicking here
The project is led by Professor Ravishankar K. Iyer. Dr. Zbigniew Kalbarczyk has been involved with the project since its inception in 1997 and continues to be a driving force. Various current and past graduate students have been involved with the project. Current: Keith Whisnant, Dheeraj Ahuja, and Claudio Basile.

Chameleon links

Publications (The links are in chronological order, from earliest to latest. All links are PostScript files unless otherwise mentioned.)

Chameleon: A Software Infrastructure and Testbed for High-Speed Networked Computing. This technical report from July 1997 presents some of the early design philosophy and the initial issues that were targeted in the research.
Fast Abstracts, 28th International Symposium on Fault Tolerant Computing, 1998. This is an abstract that can serve as a brief and concise introduction to the environment.
Chameleon: A Software Infrastructure for Adaptive Fault Tolerance. Published at the 18th Symposium on Reliable Distributed Systems, Purdue University, 1998. This paper provides details of the Chameleon architecture and some results from an early implementation and simulation. The full length paper is here.
A Flexible Software Architecture for High Availability Computing. Published at the 3rd IEEE Conference on High Assurance Systems Engineering, 1998.
Chameleon: A Software Architecture for Adaptive Fault Tolerance. IEEE Transactions on Parallel and Distributed Systems, Special Issue on Real-Time Systems, June, 1999. This paper provides a fairly comprehensive discussion of the Chameleon architecture with emphasis on its real-time aspects.
Hierarchical Error Detection and Recovery in a SIFT Environment. Abstract. This paper explores the issues involved in error detection and recovery in Chameleon, focusing on error handling for the Chameleon components called ARMORs. It was submitted to the 29th International Symposium on Fault-Tolerant Computing (FTCS-29), held in Wisconsin, June 15-18, 1999.
Providing Adaptive Fault Tolerance through the Reconfigurable ARMOR Architecture of Chameleon. This paper describes the reconfigurable ARMOR architecture in Chameleon and explains how it is used to provide adaptive fault tolerance in a distributed environment. It was submitted to the 29th International Symposium on Fault-Tolerant Computing (FTCS-29), held in Wisconsin, June 15-18, 1999.
Software-based Signaturing in Distributed Systems. Published as a Fast Abstract at the 29th International Symposium on Fault-Tolerant Computing (FTCS-29), Wisconsin, June 15-18, 1999. It gives a brief overview of our recent work on software signaturing, which is proposed as one of the techniques to make the Chameleon ARMORs more robust. The discussion, however, applies generally to distributed systems.
Hierarchical Error Detection in a SIFT Environment. Published in IEEE Transactions on Knowledge and Data Engineering, April 2000. This paper details the detection algorithms in Chameleon, along with the optimization framework. It gives fairly detailed results from the latest implementation and simulation of the environment.
Fault Injection Based Assessment of Fail-Silence Provided by Process Duplication versus Internal Error Detection. Submitted to the 30th International Symposium on Fault-Tolerant Computing, 2000. This paper provides the results of fault injection into Chameleon and into Voltan, another distributed SIFT middleware, from the University of Newcastle. The goal is to compare the fail-silence provided by the internal ARMOR self-checking techniques with that provided by the full duplication in Voltan. Results from both direct and random fault injection campaigns are presented.
Design and Evaluation of Preemptive Control Signature (PECOS) Checking for Distributed Applications. Revised and submitted to IEEE Transactions on Computers in January 2002. This paper presents the design of PECOS, a technique for protecting against control flow faults. The main advantages of PECOS are that it is preemptive, which reduces recovery effort and system downtime, and that it can handle control structures determined at runtime (e.g., dynamic library calls). PECOS is evaluated on a client-server application, DHCP, using software-based fault injection. Detailed fault-injection-based evaluation and performance measures are provided.
A Framework for Database Audit and Control Flow Checking for a Wireless Telephone Network Controller. Published at the IEEE Dependable Systems and Networks Conference, Sweden, July 2001. It provides a dependability assessment of a controller for a wireless telephone network. Two classes of techniques for protecting against data and control errors in such an environment are presented. The resulting improvement in system reliability is evaluated through detailed fault injection. The full paper is here.
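
The preemptive aspect of control flow checking mentioned in the PECOS entry above can be illustrated with a small sketch. This is not the PECOS implementation, and it makes no claim about how PECOS embeds or evaluates its checks; it only shows the general idea of validating a runtime-determined call target before control is transferred, so that an invalid target never executes. All names here (checked_call, ControlFlowError, the lease handlers, VALID_TARGETS) are hypothetical.

```python
class ControlFlowError(Exception):
    """Raised when a control transfer target fails the check."""


def checked_call(target, valid_targets, *args):
    """Call target only if it is one of the valid targets.

    The check runs before the call, so an invalid target never executes
    and there are no side effects to undo afterwards.
    """
    if target not in valid_targets:
        raise ControlFlowError(f"unexpected call target: {target!r}")
    return target(*args)


# Side-effect log used to show which handlers actually ran.
log = []


def renew_lease(client_id):
    log.append(("renew", client_id))
    return "renewed"


def release_lease(client_id):
    log.append(("release", client_id))
    return "released"


def corrupted_target(client_id):
    # Stands in for a target reached through a corrupted function pointer.
    log.append(("corrupted", client_id))
    return "wrong"


# Hypothetical set of targets that are legal at this call site.
VALID_TARGETS = frozenset({renew_lease, release_lease})
```

Calling checked_call(renew_lease, VALID_TARGETS, 7) runs the handler normally, while checked_call(corrupted_target, VALID_TARGETS, 7) raises ControlFlowError before the handler runs and leaves log untouched. A check placed after the transfer would detect the error only once its side effects had already occurred.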

Presentations

1. 24th April, 1997

2. 4th February, 1998

3. 8th April, 1998 (ICAP, UIUC)

4. 16th October, 1998 (Jet Propulsion Lab, Pasadena, California)

5. 25th October, 1998 (Symposium on Reliable Distributed Systems 98, Purdue University, Indiana)

6. 17th November, 1998 (CRHC Seminar, UIUC) Powerpoint file

7. 16th February, 1999 (powerpoint file) (Research Group Seminar given to Prof. Algirdas Avizienis, Center for Reliable & High Performance Computing, U. of Illinois.)

8. 22nd March, 1999 Powerpoint file

9. 6th April, 1999 Powerpoint file Posters presented at ICAP (Industrial Computer Affiliates Program), University of Illinois, April, 1999.

10. 21st July, 1999 Powerpoint file Presentation given to Motorola internally at CRHC, UIUC. It contains preliminary availability and coverage measures from the error detection protocols obtained through fault injection experiments.

11. 17th April, 2000 Talk given by Keith Whisnant to Jim Gray of Microsoft Research that describes the high-level philosophy of Chameleon, and the latest work on micro-checkpointing.
17th April, 2000 Talk given by me that presents results of the evaluation of the pre-emptive control flow signature technique. It contains a table that compares the detection techniques available in existing SIFT systems.

12. 5th May, 2000 Presentation that contains the latest work on control flow signatures and presents results from the evaluation of the signature schemes.

13. 13th September, 2000 A short summary of the raison d'être behind error detection and the latest fault injection and performance results from the pre-emptive control flow signature (PECOS) scheme.

14. 1st November, 2000 The slides from my final Ph.D. thesis defense. In the talk, I focus on the work on control flow error detection. It presents the motivation, the techniques, and a detailed evaluation on two large-scale commercial applications - a Dynamic Host Configuration Protocol (DHCP) application, and a call processing application for a mobile environment. Here are the backup slides that contain useful material but could not be accommodated in the 90-minute presentation.

Thesis

The topic of my Ph.D. thesis is Hierarchical Error Detection Protocols in a Software Implemented Fault Tolerance (SIFT) Environment. I defended my thesis on November 1st, 2000. I will be submitting the thesis shortly (end of November or early December). Here is an abstract of the thesis. And here is the full PostScript version.

I defended the proposal at my Preliminary examination on 10th June, 1999. Here is the proposal. And here are the slides from my prelim talk.

The topic of my Master's thesis is Chameleon: A Software Infrastructure for Adaptive Fault Tolerance in Distributed Systems. Here is an abstract of the thesis. And here is the full PostScript version.

My presentations for the Qualifying examination (September 1997) are here:

Methodology for Adapting to Patterns of Faults. By Gul Agha and Daniel Sturman. From Foundations of Ultradependable Computing, 1994.
Refinement for Fault-Tolerance: An Aircraft Hand-off Protocol. By Keith Marzullo, Fred Schneider & Jon Dehn. From Foundations of Ultradependable Computing, 1994.


Saurabh Bagchi
Last modified: January 26, 2003