"If we knew what it was we were doing, it would
not be called research, would it?"
Albert Einstein
As a graduate student at the University of Illinois,
I have been working in the area of software fault tolerance since 1997.
We are involved in a project called Chameleon, a software infrastructure
that provides fault tolerance services in a distributed environment to a
wide range of applications, from off-the-shelf applications to applications
customized for the Chameleon environment. The infrastructure is adaptive:
it can meet the different availability requirements of different
applications, as well as the changing requirements of a single application. The
software runs on standard workstations or PCs with off-the-shelf
operating systems such as Sun Solaris or Windows NT, and does not require
any specialized hardware. The architecture of Chameleon allows it to
provide its services in a network of computing nodes that may be
heterogeneous and may not be intrinsically fault tolerant.
The project is sponsored by the Jet Propulsion
Laboratory, an organizational wing of NASA based at Caltech in
Pasadena, CA. For my MS thesis, I built the first prototype of the system
and demonstrated its capability to support multiple classes
of applications. For my Ph.D. thesis, I explored the error detection protocols
in Chameleon. Error detection is a key ingredient in building reliable
distributed systems where the errors to be tolerated can be as varied as
hardware errors, system software errors, application errors, errors in
the software infrastructure itself, or malicious security attacks. In the
networked environment, while the issue of application error detection has
been fairly well studied, infrastructure error detection is less well understood.
The currently employed techniques, timeouts and exceptions raised by the
operating system, often cannot provide adequate fault containment.
Existing systems often simply assume fail silence of nodes and processes,
though field studies and our fault injection studies have
repeatedly shown that in a distributed environment executing on off-the-shelf
hardware components, the fail silence assumption is often violated. Hence,
there is a need to explore more sophisticated detection techniques for
the software implemented fault tolerance (SIFT) infrastructure. As part of
my doctoral thesis work, I proposed
a hierarchy of error detection techniques that can be applied in a distributed
SIFT environment. Within this hierarchy,
novel techniques for making a process self-checking and a node fail-silent
were explored. The levels in the hierarchy were designed to enforce different
fault containment boundaries - a process, a node, or a replication group.
The protocols for escalating the error conditions from one level to another
to optimize the detection scheme and minimize error propagation were clearly
defined. The error detection framework was implemented and demonstrated
on the Chameleon testbed, though the principles are of general applicability
in a message-passing-based distributed system.
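The escalation idea described above can be illustrated with a minimal Python sketch. This is not the Chameleon implementation: the class names, the `ErrorReport` fields, and the handler functions are all hypothetical, and the only thing taken from the text is the ordering of containment boundaries (process, node, replication group) and the rule that an error not contained at one level is passed to the next.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, Dict, List, Optional


class Level(IntEnum):
    """Fault containment boundaries named in the text, innermost first."""
    PROCESS = 1
    NODE = 2
    REPLICATION_GROUP = 3


@dataclass
class ErrorReport:
    source: str   # hypothetical identifier of the failing entity
    kind: str     # hypothetical error class, e.g. "process_crash"
    trail: List[Level] = field(default_factory=list)  # levels that saw the error


# A handler returns True if the error was contained at its level.
Handler = Callable[[ErrorReport], bool]


class DetectionHierarchy:
    def __init__(self) -> None:
        self._handlers: Dict[Level, Handler] = {}

    def register(self, level: Level, handler: Handler) -> None:
        self._handlers[level] = handler

    def report(self, error: ErrorReport,
               start: Level = Level.PROCESS) -> Optional[Level]:
        """Offer the error to each level from `start` outward.

        Returns the level that contained the error, or None if the
        error escaped every boundary.
        """
        for level in Level:
            if level < start:
                continue
            error.trail.append(level)
            handler = self._handlers.get(level)
            if handler is not None and handler(error):
                return level
        return None


# Usage with made-up handlers: each level contains only some error kinds.
hierarchy = DetectionHierarchy()
hierarchy.register(Level.PROCESS, lambda e: e.kind == "assertion_failed")
hierarchy.register(Level.NODE, lambda e: e.kind == "process_crash")
hierarchy.register(Level.REPLICATION_GROUP, lambda e: True)

err = ErrorReport(source="p1", kind="process_crash")
contained_at = hierarchy.report(err)
```

In this sketch the process-level handler declines the crash report, so it is escalated and contained at the node level; `err.trail` records which levels were consulted along the way.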
If you would like to obtain a distribution of
Chameleon, please send us an email by clicking here.
The project is led by Professor Ravishankar K. Iyer. Dr. Zbigniew
Kalbarczyk has been involved with the project since its inception
in 1997 and continues to be a driving force. Various current and past
graduate students have been involved with the project. Current: Keith
Whisnant, Dheeraj Ahuja, and Claudio Basile.
Chameleon links
Publications (The links are in chronological
order, from earliest to latest. All links are PostScript files unless
otherwise mentioned.)
- Chameleon: A Software Infrastructure and Testbed for High-Speed
Networked Computing.
This technical report from July 1997 presents some of the early design
philosophy and the initial issues that were targeted in the research.
- Fast Abstracts, 28th International Symposium on Fault Tolerant Computing, 1998.
This abstract can serve as a brief introduction to the environment.
- Chameleon: A Software Infrastructure for Adaptive Fault Tolerance. Published
at the 18th Symposium on Reliable Distributed Systems, Purdue University,
1998. This paper provides details of the Chameleon architecture and some
results from an early implementation and simulation. The full length paper
is here.
- A Flexible Software Architecture for High Availability Computing. Published
at the 3rd IEEE Conference on High Assurance Systems Engineering, 1998.
- Chameleon: A Software Architecture for Adaptive Fault Tolerance. IEEE Transactions
on Parallel and Distributed Systems, Special Issue on Real-Time Systems,
June 1999. This paper provides a fairly comprehensive discussion of the
Chameleon architecture with emphasis on its real-time aspects.
- Hierarchical Error Detection and Recovery in a SIFT Environment. Abstract.
This paper explores the issues involved in error detection and recovery
in Chameleon, focusing on error handling for the Chameleon
components called ARMORs. It was submitted to the 29th Fault-Tolerant
Computing Symposium (FTCS-29), held in Wisconsin, June 15-18, 1999.
- Providing Adaptive Fault Tolerance through the Reconfigurable ARMOR Architecture
of Chameleon. This paper describes the reconfigurable ARMOR architecture
in Chameleon and explains how it is used to provide adaptive fault tolerance
in a distributed environment. It was submitted to the 29th Fault-Tolerant
Computing Symposium (FTCS-29), held in Wisconsin, June 15-18, 1999.
- Software-based Signaturing in Distributed Systems. Published as a Fast Abstract at
the 29th Fault-Tolerant Computing Symposium (FTCS-29), Wisconsin, June
15-18, 1999. It gives a brief overview of our recent work on software
signaturing, which is proposed as one of the techniques to make the Chameleon
ARMORs more robust. The discussion, however, applies to distributed systems
in general.
- Hierarchical Error Detection in a SIFT Environment. Published in IEEE Transactions
on Knowledge and Data Engineering, April 2000. This paper details the
detection algorithms in Chameleon, along with the optimization framework.
It gives fairly detailed results from the latest implementation and simulation
of the environment.
- Fault Injection Based Assessment of Fail-Silence Provided by Process Duplication
versus Internal Error Detection. Submitted to the 30th International
Symposium on Fault-Tolerant Computing, 2000. This paper provides the results
of fault injection into Chameleon and into Voltan, another distributed SIFT
middleware, from the University of Newcastle. The goal is to compare
the fail-silence provided by the internal ARMOR self-checking techniques
with that of the full duplication provided by Voltan. Results from both direct and
random fault injection campaigns are presented.
- Design and Evaluation of Preemptive Control Signature (PECOS) Checking for Distributed
Applications. Revised and submitted to IEEE Transactions on Computers
in January 2002. It presents the design of PECOS, a technique for protecting
against control flow faults. The main advantages of PECOS are that
it is pre-emptive, thereby reducing recovery effort and system
downtime, and that it can handle control structures determined at runtime (e.g.,
dynamic library calls). PECOS is evaluated on a client-server
application, DHCP, using software-based fault injection. Detailed
fault-injection-based evaluation and performance measures are provided.
- A Framework for Database Audit and Control Flow Checking for a Wireless Telephone
Network Controller. Published at the IEEE Dependable Systems and Networks
Conference, Sweden, July 2001. It provides the dependability assessment
of a controller for a wireless telephone network. Two classes of techniques
for protecting against data and control errors in such an environment are
presented. The resultant improvement in system reliability is evaluated
through detailed fault injection. The full paper is here.
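The pre-emptive aspect of control flow checking mentioned in the PECOS entry above can be illustrated with a toy Python sketch. It is only an illustration of the general idea (validate a control transfer against a set of legal targets before taking it, so that an illegal transfer is caught before the wrong code runs). It is not the PECOS implementation described in the paper; the block names, the `VALID_TARGETS` table, and the helper functions are all hypothetical.

```python
from typing import Dict, List, Set


class ControlFlowError(Exception):
    """Raised when a transfer is not in the set of legal targets."""


# Hypothetical control-flow graph: block name -> legal successor blocks.
VALID_TARGETS: Dict[str, Set[str]] = {
    "entry": {"parse"},
    "parse": {"reply", "reject"},
    "reply": {"exit"},
    "reject": {"exit"},
    "exit": set(),
}


def checked_jump(current: str, target: str, executed: List[str]) -> str:
    """Check the transfer before taking it; record the block only if legal."""
    if target not in VALID_TARGETS.get(current, set()):
        raise ControlFlowError(f"illegal transfer {current} -> {target}")
    executed.append(target)  # stands in for actually running the block
    return target


def run(path: List[str], executed: List[str]) -> None:
    """Follow `path` block by block, checking every transfer first."""
    current = path[0]
    executed.append(current)
    for target in path[1:]:
        current = checked_jump(current, target, executed)


# A legal path runs to completion.
executed: List[str] = []
run(["entry", "parse", "reply", "exit"], executed)
```

If a corrupted path such as `["entry", "reply"]` is supplied, `ControlFlowError` is raised before the `reply` block is recorded as executed, which is the property the word "pre-emptive" refers to in this sketch.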
Presentations
1. 24th April, 1997
2. 4th February, 1998
3. 8th April, 1998 (ICAP, UIUC)
4. 16th October, 1998 (Jet Propulsion Lab, Pasadena, California)
5. 25th October, 1998 (Symposium on Reliable Distributed Systems 98, Purdue
University, Indiana)
6. 17th November, 1998 (CRHC Seminar, UIUC) PowerPoint file
7. 16th February, 1999 (PowerPoint file) (Research Group Seminar given to Prof.
Algirdas Avizienis, Center for Reliable & High Performance Computing,
U. of Illinois)
8. 22nd March, 1999 PowerPoint file
9. 6th April, 1999 PowerPoint file. Posters presented at ICAP (Industrial
Computer Affiliates Program), University of Illinois, April 1999.
10. 21st July, 1999 PowerPoint file. Presentation given to Motorola internally
at CRHC, UIUC. It contains
preliminary availability and coverage measures from the error detection
protocols obtained through fault injection experiments.
11. 17th April, 2000 Talk given by Keith Whisnant to Jim Gray of Microsoft Research
that describes the high-level philosophy of Chameleon, and the latest work
on micro-checkpointing.
12. 17th April, 2000 Talk given by me that presents results of the evaluation
of the pre-emptive control flow signature technique. It contains a table
that compares the detection techniques available in existing SIFT systems.
13. 5th May, 2000 Presentation that contains the latest work on control flow
signatures and presents results from the evaluation of the signature
schemes.
14. 13th September, 2000 A short summary of the raison d'être behind error detection
and the latest fault injection and performance results from the pre-emptive
control flow signature (PECOS) scheme.
15. 1st November, 2000 The slides from my final Ph.D. thesis defense. In the
talk, I focus on the work on control flow error detection. It presents
the motivation, the techniques, and a detailed evaluation on two large-scale
commercial applications - a Dynamic Host Configuration Protocol (DHCP) application,
and a call processing application for a mobile environment. Here
are the backup slides that contain useful material but could not be accommodated
in the 90 minute presentation.
Thesis
The topic of my Ph.D. thesis is Hierarchical Error
Detection Protocols in a Software Implemented Fault Tolerance (SIFT) Environment.
I defended my thesis on November 1st, 2000. I will be submitting the thesis
shortly (end-November, early-December). Here is an abstract of the thesis.
And here is the full PostScript version.
I defended the proposal at my Preliminary
examination on 10th June, 1999. Here is the proposal. And here are
the slides from my prelim talk.
The topic of my Master's thesis is Chameleon:
A Software Infrastructure for Adaptive Fault Tolerance in Distributed Systems.
Here is an abstract of the thesis. And here is
the full PostScript version.
My presentations for the Qualifying examination
(September 1997) are here:
- Methodology for Adapting to Patterns of Faults. By Gul Agha and Daniel Sturman.
From Foundations of Ultradependable Computing, 1994.
- Refinement for Fault-Tolerance: An Aircraft Hand-off Protocol. By Keith Marzullo,
Fred Schneider & Jon Dehn. From Foundations of Ultradependable Computing,
1994.
Saurabh Bagchi
Last modified: January 26, 2003