Linux-HA Project Task List

This page describes a set of Linux-HA Phase I activities that need to be done.  There is also a documentation team whose activities I haven't listed here, because I don't know enough about what needs to be done and how it should be done to break it up into distinct tasks.

The purpose of this list is to track all the things we think need doing, and hopefully match them up to people who will do them.  Perhaps you?  If you see a task you are interested in, or want to add tasks to the list or comment on it, contact Alan Robertson.
 

Organization of Activities

Linux-HA Phase I is divided into three basic areas:
  1. Heartbeat and cluster management services
  2. HA resource implementations
  3. Diagnostics

HA Resource Implementations

Anything that can move from one machine to another (processes, IP addresses, MAC addresses, filesystems) is implemented as a resource.  Lots of interesting activities fall under this category: IP address takeover, for example, and filesystem replication and takeover.  This is where we get into properly handling the great diversity of things that people want to put into HA systems.  Some of these resource types will require specialized hardware to test.  If you want to see a particular configuration supported, you may have to create the resource for it.

Resources are basically objects with four member functions:

Activity | Description | DependsOn | WhoDo?
MACaddr | MAC address resource implementation. | - | -
SharedFS | Shared filesystem takeover.  Useful for multi-interface RAID boxes and shared SCSI implementations. | - | -
NBD | Network block device / mirroring resource.  This implements the "poor-man's file synchronization" described on the list. | - | alanr (?)
NBD-NFS | Design a resource sharing filesystem strategy based on NBD and NFS.  Implement the resource type for it. | NBD | -
Intermezzo | Intermezzo file sharing strategy.  This may not have to be a resource (?) | - | -
IPaddr-bcast | Fix IPaddr so that it handles netmasks and broadcast addresses correctly.  Probably involves changing findip.c. | - | alanr DONE 0.4.3.

Related Activities

Things that don't fall into one of the other categories show up here...
 
Activity | Description | DependsOn | WhoDo?
FileSync | Transactional file synchronization between nodes. | - | Lars Marowsky-Bree (?)
GGUI | GNOME-based configuration and status GUI [can KGUI and GGUI share code?] | - | -
KGUI | KDE-based configuration and status GUI [can KGUI and GGUI share code?] | - | -
GFS | Test GFS with Linux-HA...? | - | -
LVS | Figure out how to best integrate Linux-HA with the LVS project. | - | -

Diagnostics Activities

Linux-HA Phase I needs a diagnostics subsystem to notice and handle things like hardware and software failures that aren't complete node failures.  The activities below cover that work.
 
Activity | Description | DependsOn | WhoDo?
DiagFrame | Implement a Diagnostics API framework, or just adopt Mon and/or its API. | - | -
EtherDiag | Implement a dead Ethernet check for serial ports using new code for Ethernet diagnostics. | DiagFrame | -
HBDiags | Implement a disconnect check (RTS, DCD) for serial ports.  This would be triggered on demand from Mon or called directly from heartbeat as needed.  It should probably exist in a library version and an a.out version. | DiagFrame | -
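
The HBDiags disconnect check could look roughly like the following sketch, which is only an illustration under stated assumptions.  It assumes a Linux system where the TIOCMGET ioctl reports the modem control lines, and it treats an asserted DCD line as "peer connected"; whether the peer's RTS shows up on DCD or on CTS depends on how the serial cable is wired.  The names link_state, serial_link_state and check_serial_port are hypothetical and are not part of heartbeat or Mon.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical result codes; not part of heartbeat. */
enum link_state { LINK_ERROR = -1, LINK_DOWN = 0, LINK_UP = 1 };

/* Classify a modem-status bitmask as returned by TIOCMGET.
 * Assumption: carrier detect (DCD) asserted means the peer is connected. */
enum link_state serial_link_state(int modem_bits)
{
    return (modem_bits & TIOCM_CAR) ? LINK_UP : LINK_DOWN;
}

/* Open the device, read the modem control lines, and classify them.
 * O_NONBLOCK keeps open() from waiting for carrier on a dead line. */
enum link_state check_serial_port(const char *device)
{
    int bits = 0;
    int fd = open(device, O_RDONLY | O_NOCTTY | O_NONBLOCK);

    if (fd < 0)
        return LINK_ERROR;
    if (ioctl(fd, TIOCMGET, &bits) < 0) {
        close(fd);
        return LINK_ERROR;
    }
    close(fd);
    return serial_link_state(bits);
}
```

Keeping the bit test separate from the device access means the same code could back both the library version and the standalone version mentioned in the task: a caller such as check_serial_port("/dev/ttyS0") gets LINK_UP, LINK_DOWN or LINK_ERROR back.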

Testing Activities

Lots of things need testing, and Linux-HA needs special attention in the testing department.  This is the beginning of a list of testing tasks.
Activity | Description | DependsOn | WhoDo?
ConfigRegress | Configuration regression testing database containing valid and invalid configurations for testing the input validation below. | - | -
TestPlan | Write a test plan for Phase I of Linux HA delineating specific test configurations and test cases that we really mean to have work. | - | -

Heartbeat and Cluster Manager Activities

These things are at the heart of Linux-HA.  You'll notice that many of them are marked as critical.  Lots of fun stuff to be done here.
Activity | Description | DependsOn | WhoDo?
PartCluster | Detect and perform basic recovery from a partitioned cluster condition.  Of course, this won't unscramble shared SCSI filesystems that might have been scrambled as a result of a partitioned cluster :-) | - | -
CMFrame | Create framework for "real" cluster manager.  This constitutes the APIs and supporting code allowing a cluster manager to be written. | NPhase | -
NPhase | Create an n-phase commit protocol similar to IBM's Phoenix cluster services.  Pages 424-430 in "In Search of Clusters".  See especially pages 428 and 429. | - | -
CM1 | Create the first cluster manager.  A translation of the current methodology into a cluster manager structure.  May be a throwaway. | CMFrame, HBProtocol (before release) | -
CM2 | Create the first real cluster manager.  Must support an arbitrary number of nodes.  Probably a voting/quorum-based cluster manager. | CM1 | -
InputCheck | Verify and validate system configuration rigorously before starting up.  Provide a standalone configuration validation tool or input checking mode for heartbeat. | - | -
SecKeeps | Optionally allow the secondary host to keep the resources it has when the primary comes back online. | - | -
Restart heartbeat processes | Heartbeat should be able to restart its processes that die.  This is intended to allow for the possibility that one day a bug might be found in the code which would cause it to die.  Heavens! Perish the thought! :-)  A little infrastructure work to support this effort is in 0.4.3. | - | -
HBProtocol | Fix the protocol so that lost packets are retransmitted, not just discovered. | - | alanr
syslog-rsc | Make the cluster-wide syslog a cluster resource.  This may require a little thought to make it reliable, and keep messages from getting lost during transitions.  Maybe have each message logged to two hosts? | SysLog | -
buffers | Should inspect code and modify to eliminate the possibility of buffer overrun attacks.  This is especially true of the messaging code. | - | -
patchdoc | Should document my expectations for patch submission.  This should include a little bit about coding style. | - | -
manpage | Write wonderful man pages for heartbeat, heartbeat.cf and haresources. | - | Shawn McKenzie (?)
Reconfig | Should be able to update configurations without shutting down the cluster and restarting it.  This could be accomplished in lots of different ways, from a local kill -1 type approach to a global synchronized cluster restart. | - | alanr (available in 0.4.5)
TrimScripts | Should trim the number of scripts and how much they control.  Scripts are fine, but heartbeat has a few too many...  A few were trimmed for 0.4.3.  In particular, the heartbeat process should instigate resource acquisition on startup and relinquishment on shutdown, and not rely on the scripts that call it to do that.  It's probably OK to use scripts in the process, but starting heartbeat by itself should cause resources to be acquired, and killing it should cause them to be given up. | - | alanr (available in 0.4.5)
SysLog | Change ha_log() functions to use syslog. | - | Guenther Thomsen (available in 0.4.5)
debugoutput | Heartbeat debuginfo doesn't go into debug-log. | - | Guenther Thomsen (available in 0.4.5)
HBSec | Authenticate intracluster packets for heartbeat, etc.  This provides both error checking and security.  Could simply add an auth_packet entry point, and then we can plug our favorite authentication method into it, and even allow the method to be chosen in the configuration file.  This would allow people with secure networks to use crc without encryption, while people in a more hostile environment could use HMAC-SHA1 with a shared secret, etc.  Current thinking is to model the key file and authentication after the methods used by NTP. | - | Mitja Sarp, Neal McBurnett (consulting) (will be available in 0.4.5)
DEBUG | Make SIGUSR1 increment debug level, and SIGUSR2 decrement it. | - | alanr DONE. 0.4.3.
memloss | Memory leaks are a danger since messages are allocated dynamically.  There are a few things to do about it: 1) Track buffer usage and make stats appear in the log from time to time. 2) Document how to properly use and dispose of messages in the code.  Use SIGUSR1 or SIGUSR2 to get message allocation stats. | - | alanr DONE. 0.4.3.
FHS | Make Linux-HA file placement conform to the Linux Filesystem Hierarchy Standard. | - | alanr DONE. 0.4.3.
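
To make the HBSec idea of a pluggable auth_packet entry point more concrete, here is a minimal sketch in C.  Only the name auth_packet comes from the task description; its signature, the auth_method structure, find_auth_method and the method table are all assumptions made for illustration.  The sketch registers only a plain CRC-32 method, which detects corruption but offers no protection against forgery; a keyed method such as HMAC-SHA1 would need a crypto implementation and is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical plug-in interface: each method computes a check value
 * over a packet, optionally using a shared key. */
struct auth_method {
    const char *name;
    uint32_t (*compute)(const unsigned char *pkt, size_t len, const char *key);
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320).  Ignores the key. */
static uint32_t crc32_compute(const unsigned char *pkt, size_t len, const char *key)
{
    uint32_t crc = 0xFFFFFFFFu;

    (void)key;
    for (size_t i = 0; i < len; i++) {
        crc ^= pkt[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static const struct auth_method methods[] = {
    { "crc", crc32_compute },
    /* A keyed method such as HMAC-SHA1 would be registered here. */
};

/* Look up a method by the name given in the configuration file. */
const struct auth_method *find_auth_method(const char *name)
{
    for (size_t i = 0; i < sizeof methods / sizeof methods[0]; i++)
        if (strcmp(methods[i].name, name) == 0)
            return &methods[i];
    return NULL;
}

/* Single entry point: returns 1 if the packet matches the check value
 * it was sent with, 0 otherwise (including an unknown method). */
int auth_packet(const struct auth_method *m, const unsigned char *pkt,
                size_t len, const char *key, uint32_t expected)
{
    return m != NULL && m->compute(pkt, len, key) == expected;
}
```

With this shape, the rest of the code only ever calls auth_packet, and choosing a different method in the configuration file changes which table entry find_auth_method returns.  A real keyed method would likely need a wider check value than 32 bits.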

Activities in bold blue are challenging, critical activities.