The Em* Event System

Em* has been designed to build complex, reactive systems. We chose an event-driven structure to handle concurrency. This page describes some of the details on how to use this event system. Most of the Em* libraries are built to integrate with the event system, enabling easy integration of different libraries.

A Word About Threads

Before getting to the usage details, we should address the main competitor to event-drive structures: threads.  Many systems use threads to manage concurrency.  Threads bring two  advantages:
  1. Threads are truly concurrent; a slow acting thread will not impede other independent threads from operating in a timely way.
  2. Threads enable a more direct programming style, such as a sequence of blocking calls, that in an event model would be a sequence of chained callbacks.
However, there is a hidden cost to threads: synchronization and deadlock avoidance. Because threads run concurrently, they must negotiate use of shared data structures. For simple uses of threads, that don't need to coordinate shared data structures, threads are easier to use, but when inter-thread communication is needed, the complexity skyrockets.

Reactivity, one of the key design requirements of Em*, essentially boils down to inter-thread communication as the common case. For example, when a lengthy operation is undertaken, reactivity means that if arriving data makes that operation irrelevant, it should be interrupted. Implementing this correctly with threads is much more difficult and error-prone than the corresponding event-based implementation.  For a more complete and lucid discussion of the difficulties of threads, see Ousterhout 96 and ESR 03.

The Big Picture of Em* Design

Em* follows the UNIX design philosophy: combat complexity with many small modules connected by clean interfaces. Em* modules are implemented in separate processes with separate process spaces. Clients of Em* modules use thin, lightweight client libraries to connect to other modules. This structure provides fault isolation between clients and servers, and simplifies the client libraries, reducing the possibility that bugs or unforseen interactions in library code cause problems for applications. More information on the big picture can be found in Girod 03 and Elson 03.

Individual modules form a dependency graph that is expressed in the EmRun config file. This dependency graph must be free of cycles. That is, a module can't be a client of some other module that is a client of the original module.  Such a loop in the dependency graph would potentially cause deadlock, because while a process is making a client call, it can't be responsive as a server. FUSD will detect this case and prevent it by refusing the client connection. While this restriction appears at first to be severe, in fact it is always possible to get around this limitation by structuring your system differently. If the dependency graph disallows module A being a client of module B, the transaction can usually be reversed by module A instead providing a service to module B.  Notification and callback can be exchanged for an active call. In practice, we have rarely found this issue to be a problem.

While separate modules run concurrently, the modules themselves are typically implemented using a single-threaded, event-driven structure. An event-driven implementation simplifies the implementation of reactive modules by eliminating the need to worry about synchronization and deadlock: no event handler will ever be interrupted to run another event handler, and no event handlers run "concurrently". The underlying sources of events (e.g. IPC from other modules, events from devices, etc.) are serialized at the kernel interface.  Thus, so long as every event handler leaves shared data structures in a valid state when it terminates, there is never a need to worry about whether a data structure will be valid when it is accessed.  

The main caveat to this structure is that in order to remain responsive, no event handler can run for a long time. Direct system calls that are likely to block should be replaced by a notification-based scheme. For example, in place of a blocking read() call, use an event to trigger when the file is readable and then call read().  In practice, the key phrase above is "likely to block".  If domain knowledge suggests that blocking is unlikely (or would only happen because of a bug), the system call is generally used directly.

The other caveat is processes that are by nature long-lived. For instance, a module might need to perform a lengthy computation for which segmentation into bite-sized chunks is impractical. In these cases, where concurrency really becomes handy, a thread is used. A message queue event can be used to enable the computation thread to communicate with the "main" thread that is running the event system. Even in this reduced case, working out the synchronization can be tricky -- but not nearly as bad as using threads pervasively.

Relationship to GLib

The Em* event system is a wrapper around GLib, that provides a slightly different interface.  However, because it is a wrapper, it is 100% compatible with GLib events, and GLib-based code can be used together with Em* libraries. The purpose of the interface changes is to simplify them and tune it for the common cases of embedded networked systems. GLib was designed to run GTK GUI apps, and the native simplified interface they provide didn't have quite the right API for building network protocols and reactive systems. The other change relative to GLib has to do with memory management. GLib uses a cumbersome reference-counting scheme to track memory usage. Rather than support a full reference counting scheme, we implement a much simpler scheme that handles a subset of the problem that represents the common case for our applications.

Em* Initialization and the Main Event Loop

An Em* main() function takes a routine form, that is typically cut-and-pasted from one module to the next.  The main() function is typically composed of the following steps:
  1. Initialize the module's state variables
  2. Call misc_init().  This will initialize certain Em* libraries and facilities.
  3. Process command-line arguments
  4. Initialize the events to be handled
  5. Call emrun_init().  This will connect to EmRun and enable graceful shutdown and enable EmRun to start other modules that depend on this one.
  6. Call g_main().  This will start up the event loop.  This function never returns; all activity from now on occurs when events are triggered.
An example of this can be found in the skeletons directory.

Event Initialization

Events are created and installed by event constructors.  These constructors typically take an options structure that defines the configuration of the event, and fill in an "event reference", which is a handle on the event.  This handle can be used to reconfigure the event, or perform other operations on it.  The nature of these event references will be discussed in detail in the next section.

Examples of events are:
  1. A timer that fires once per second.
  2. A timer that fires 4.25 seconds from now.
  3. An event that triggers when a particuar file descriptor becomes readable.
  4. An event that encapsulates an interface to a neighbor discovery service, and reports a new list of neighbors whenever a change occurs in the neighborhood.
As you can see, these events encapsulate varying amounts of complexity and domain knowledge; whereas timers are very simple, the neighbor discovery client event encapsulates the knowledge of how to connect to the neighbor discovery service, how to receive notification of new data, and how to read the new data.  This capability to encapsulate arbitrary mechanism within an "event" enables very simple APIs to be constructed that encapsulate complex access protocols.

The other advantage of events is that independent events can be integrated together very easily. Individual Em* events rarely need to know anything about the other events in the system in order to integrate properly. Through the GLib event system, all events from timers to notification events configure hooks in GLib's central select() loop, and then are dispatched accordingly.   

Event References

By convention, all event constructors take the address of an "event reference" as their final argument. This reference is a handle that enables the event to be asynchronously reconfigured, accessed, or destroyed.  For example, the g_timer_add() function creates a timer event. Using the reference, the timer may be asynchronously canceled using the g_event_destroy() function. For another example, the g_status_dev() function creates a status device. A reference to that status device can be used to cause notification on that device, or to destroy it. In the sense that it is passed to "method" functions, the event reference is similar to an "instance" of an object class.

GLib internally uses reference counting to address the variety of conditions that can arise when GLib applications need asynchronous destruction or multiple independent references. However, reference counting can be very tricky to use, leading to subtle memory bugs. Automated forms of reference counting always fall short as well, most notably when confronted with cyclic linked structures.

After gaining some experience with early versions of Em*, we found that we never used the "multiple reference" capability of reference counting, but that we did need to support asynchronous destruction. That is, we found that all of our applications could get by storing only one copy of a pointer to the event, but that our software often needed to destroy an event from several different places in the code. While events are never run concurrently, it's possible that an event might be destroyed down inside a call chain, and then be referenced later by the caller.

Given this insight, we implemented event references to support asynchronous destruction, but not multiple reference. Thus, each event has a single "owner" who holds the pointer to it.  Anyone may obtain that pointer from the owner's structure, and use it, even to destroy the event. If the pointer obtained is NULL, that means the event does not exist. Once obtained, the pointer should only be used while it is known to be valid; if you call out to unknown library code, you should re-acquire the pointer on your return. When the event is destroyed, the pointer in the owner's structure will automatically be zeroed, so that it can't be used again.  

This sounds complicated, but it is actually very simple, as long as a few simple rules are followed.  

Rules for event "owners":
  1. Allocate the event reference in a fixed location (typically in the owner's state structure). 
  2. Initialize the event reference to NULL.
  3. Pass the (fixed) address of the event reference to the event constructor.
  4. Never "move" or reallocate the structure containing the reference.
  5. Before deallocating the structure containing the reference, call the appropriate destroy() function to destroy the event.
Rules for using event references:
  1. Don't store a copy of an event reference, except temporarily. If you call out to code that might destroy the event, acquire a fresh copy before using it again. When possible, use the reference directly from its fixed location.
  2. If the reference is NULL, that means the event has been destroyed. However, you need not worry about this in general, because methods and destructors silently ignore NULL references.  (For example, it is always safe to destroy a destroyed reference.)
These rules place slight restrictions on the memory that holds event references, and disallow multiple independent copies of the event reference (i.e. requires a single "owner"). However, compared with the potential for reference-counting related bugs in a full-blown reference counted framework, this seemed a small price to pay.

NOTE: If you will never need to destroy or access the event, you can pass NULL in to the event reference argument of the constructor. In that case, no reference will be saved.

NOTE: Some data structure libraries, such as STL, may move memory "under the covers". If stored event references are moved this will cause problems. Most Em* modules use dynamically allocated structures kept in linked lists to avoid this problem.

The Main Event Loop

After initialization, all activity in an Em* system originates with the event loop.  This is similar in structure to the scheduler in a multitasking OS.  The implementation of the main loop is a part of GLib; it is documented in the GLib API reference. Applications that need to control the details of the scheduling can use optional arguments provided throughout the Em* API. Event-based programming can take a bit to get used to for programmers accustomed to thread-based programming, but once the control flow is understood, it should become natural.

Installing a Timer Event

The simplest form of event is the timer event. This event has a single callback that is called when the specified time expires. The callback may set the timer to retrigger, or terminate the event, depending on its return value. In addition, a timer can be canceled at any time by calling g_event_destroy() with the timer's event reference.

Setting a Timer

To set a timer, call the g_timer_add() function. For example:

g_event_t *timer_ref = NULL;
int cb_data = 5;
int status =
  g_timer_add(1000,        /* 1 second (1000 ms) */
              cb_func,     /* our callback function */
              &cb_data,    /* a data pointer */
              NULL,        /* some rarely-used options */
              &timer_ref); /* an event reference */


This call will set a one-second timer. If for some reason it cannot be set, it will return -1. If it is not canceled before 1 second, the callback will be called. The following call would cancel the timer:

g_event_destroy(timer_ref);

An example of a callback function is:

int cb_func(void *data, int interval, g_event_t *event)
{
  int *count = (int *)data;
  if (*count == 0) {
    elog(LOG_INFO, "No more timeouts");
    return TIMER_DONE;
  }
  (*count)--;
  elog(LOG_INFO, "%d more timeouts", *count);
  return TIMER_RENEW;
}


Given that cb_data starts out 5, after 1 second, this callback will print '4 more timeouts', then wait one second, print '3 more timeouts', then wait one seconds, etc. until it prints 'No more timeouts', and destroys the event. You can see this behavior in action by running the program in skeletons/tutorial/timer.c (obj.i686-linux/skeletons/timer). You may notice that after the timer finishes, the program hangs. This is because, having destroyed the timer, there are no further events registered, so the event loop is idle, and will stay that way forever.

There are just a few additional details about timers. First, in addition to renewing the timer for the same time interval, a timer callback can also return TIMER_RENEW_MS(x) to reset the timer for a different interval.

Second, if no callback is specified when setting the timer, the timer event will still be created, and will exist for the specified time, and then be destroyed. This can actually be useful, because the function g_timer_isset(timer_ref) will tell you if the timer is still set. Thus, this effectively sets a variable "true" for the specified length of time.

Watching a File Descriptor

The second-simplest event to construct is one that watches for notification on a file descriptor.