Neighbor Discovery: Learning Who's Around

Why do we need neighbor discovery?

In this lesson we'll see how to use the basic neighbor discovery service to learn which other nodes are in the area. Beyond learning how to use the neighbor discovery service, we'll also explore why neighbor discovery is such an important idea.

Neighbor discovery is a critical piece of any truly ad-hoc, dynamic system. If we imagine a huge bag full of nodes being dumped on a field from an airplane, it's probably clear that pre-configuring them with a particular network topology or neighbor list isn't going to work. In addition, dynamics makes the problem much more difficult: nodes can run out of batteries, fall into water, be carried away by wind, or get eaten by a wild boar. These changes are hard to predict in advance.

Even if we could predict exactly where nodes will be geographically, the vagaries of RF propagation would still prevent us from predicting the resulting topology. When writing emstar applications, it's important to keep this in mind.

Wrong mindset: I need to design a network topology makes solving the problem easy. Then, I can put a configuration file on each node, telling it what my pre-computed topology is.

Right mindset: The nodes we deploy will end up in some arbitrary arrangement. When a node activates, it needs to figure out who's around, and try to do useful work with whatever resources are available.

(The so-called ``wrong mindset'' can often be the right solution in Internet applications, where it's possible to explicitly set up a particular topology. But in sensor networks we can't assume the ability to engineer such controlled deployments. This is one aspect of design for sensor network software that can make it more challenging and interesting than Internet design.)

Neighbor discovery is an important part of the ``right mindset.'' When a node is activated, it must use neighbor discovery to figure out what other nodes are around. With a neighborlist, additional logical structures can be formed -- for example, routing tables can start to form if nodes transmit their neighborlists to their neighbors.

Soft state vs. Hard State

Equally important is the idea of soft state, a design philosophy that originated from the early days of the Internet. We mentioned soft state earlier, in the design principles document; we now see how it applies to neighbor discovery.

Recall from our earlier discussion what we mean by hard state: once we learn a fact about the state of the network, we assume that it remains true until we are told otherwise. For example, when a neighbor tells us ``I am here!'', we assume the neighbor is there until it explicitly tells us ``I am leaving now.'' In contrast, a soft state approach is one where a neighbor transmits ``I am here!'' messages periodically. If several periods pass without hearing this message, we assume the neighbor is gone.

Hard state is much more efficient, of course: a hypothetical hard-state neighbor discovery uses only a few packets on startup, and then rarely sends packets again. However, it is not robust. What if a node fails without being able to send a signoff message? What if a node moves out of range? What if the signoff message is corrupted? In all these cases, a node will think the neighbor is still there, even though it is not. And what if the receiver node fails -- for example, if it reboots, losing its neighbor list? Neighbors won't necessarily know that a node has rebooted, and therefore won't send new ``I am here'' messages. What if the sign-on message is corrupted? What if the message is valid, but the Node ID is incorrect? Other, more subtle problems can arise as well: what happens if there is a bug in the neighborlist code that causes entries to be lost off a linked list? In these cases, the node will not be aware of neighbors that it has.

Many of these failure modes can be fixed using some scheme or other -- checksums, request-response protocols, explicit reboot-recovery modes, sign-on and sign-off messages, three-phase commits, and so forth. Each of these special cases adds complexity; it adds up to what can be an unmanageable system.

In contrast, soft-state protocols make no distinction between normal operation and failure recovery. A neighbor will periodically transmit ``I am here!'', and receivers update their neighbor lists accordingly. A neighbor on the list who hasn't been heard from in some amount of time is removed from the list. Despite its simplicity, every failure mode we described above is handled gracefully---including the scenario where a bug in the code causes entries on the neighborlist to be deleted. In other words, the protocol can even be robust to bugs in its implementation! This ``self-healing'' property can vastly decrease the system's complexity; it's a benefit that is often (although not always) worth the cost in efficiency.

A side benefit we get from soft state is that it automatically handles dynamics. Imagine that two nodes are on the border of their radio ranges, and initially can't communicate with each other. Then, due to some environmental change, the RF propagation improves and the two nodes are able to communicate. In a soft-state protocol, this is automatically handled the right way---the new neighbor appears, without an explicit mode for handling new neighbors. This ability to handle new configurations as conditions change is critical.

The emstar neighbor discovery service is designed in a soft-state manner. However, it's important that your applications---things that use neighbor discovery---are designed with a similar mindset.

Wrong mindset: When a node boots, it first runs a neighbor discovery algorithm. When neighbor discovery is finished, the node runs (for example) a routing algorithm.

Right mindset:The neighbor discovery algorithm is always running. Every time it updates the neighbor-list, the layers above should update their state (e.g., recompute routes).

One-time configuration leads to brittle applications.

One objection to the ``right mindset'' is that it can be wasteful of CPU time. For example, when a node first boots, it is likely to experience a flurry of neighborlists updates as it learns which other nodes are in the area. This can lead to a large number of, say, spurious recomputations of the routing table that immediately get thrown away in favor of newer data.

A nice solution to this problem is using a holdoff timer, or a delay between receiving new state and acting on it. For example, an application can set a 5-second timer when it receives a new neighborlist. When the holdoff timer expires, the node can do its expensive recomputation on the most recent neighborlist received -- which might be a neighborlist that was subsequently received, after the timer was set. (Neighborlists that are received during holdoff should be saved, but should not reset the holdoff timer, or start a new one.) This way, route recomputation is done at most only once per holdoff period.

Using the EmStar Neighbor List Service

If the design principles above make sense, the interface to the neighbor discovery service will be easy to understand.

An application opens a connection to the neighbor discovery service by calling g_neighbors, passing an options structure that specifies which link we're interested in, and a callback to be used when new neighborlists are available. When the neighborlist changes, the user's callback is activated, and is passed an array of neighbor_t structures, and a count which gives the size of the array. Each entry in the array represents one neighbor reachable on the selected link. The callback is responsible for freeing the neighbor list (using free()) when it is done using it.

Each time the user's callback is activated, it gets the complete list of neighbors---not a ``diff'' to the previous list. There are no ``new neighbor arrived'' or ``neighbor left'' messages. This fits into the soft-state philosophy: by providing the complete list each time, we can be sure that any problems in processing the previous list are corrected by the newer one.

The neighbor_t structure has members that give you information about the neighbor, including:

Running a Real Example Program

In the emstar/link/examples/neighbors directory is a simple example program, neighbor-app, which demonstrates how to get neighbor lists from the discovery service. This program doesn't do anything with the neighbor list other than print it, but it serves as a useful example of how to write and run programs that rely on neighbor discovery.

In the previous examples (for example, when learning how to use ping and pingd), we saw that EmRun's configuration file describes the modules of the system we wish to use, and their interdependencies. neighbortab is a configuration file similar to the one in the basic link example, but which also tells emrun we want to use the neighbor discovery service. It looks like this:


udpd {
        type = daemon;
        cmd = "link/udpd";
        no-sim; # the simulator provides its own link-layer driver
}

neighbord {
        type = daemon;
        waitfor = udpd;
        cmd = "link/neighbord";

        # uncomment the line below if you want to see more detail
        # about what the neighbor discovery app is doing

        # loglevel = LOG_DEBUG(1);
}

neighbor-app {
        type = daemon;
        waitfor = neighbord;
        cmd = "link/examples/neighbor-app";
}
As in the pingd program's emruntab, the first block runs UDPd, which is the link layer driver for running over a ``real'' Ethernet network. The next block runs neighbord---the neighbor discovery service itself. The line "waitfor = udpd", as in the ping example, means that the neighbor discovery service should not be started until udpd is running. Finally, the neighbor-app program is run after the neighbord service is running (due to the "waitfor = neighbord" line).

Now, as with the ping example, we can log into two machines (e.g., dragon and tobiko, or two IPAQs), and run the following command on each computer (from the obj.i686-linux directory):

BLUE:/home/jelson/projects/emstar/obj.i686-linux(5044) emrun/emrun ../link/examples/neighbors/neighbortab
Tue Feb  4 17:32:39.416: emrun: emrun/emrun: starting (node id 1400)
Tue Feb  4 17:32:39.420: emrun: Boot complete... starting normal processes
Tue Feb  4 17:32:39.422: emrun: udpd launched, pid 23832
Tue Feb  4 17:32:39.427: udpd: link/udpd: starting (node id 1400)
Tue Feb  4 17:32:39.430: emrun: neighbord launched, pid 23833
Tue Feb  4 17:32:39.433: udpd: connecting /dev/link/0/data to eth0 port 6950
Tue Feb  4 17:32:39.437: neighbord: link/neighbord: starting (node id 1400)
Tue Feb  4 17:32:39.441: emrun: neighbor-app launched, pid 23834
Tue Feb  4 17:32:39.444: neighbord: running, sending with period 6648 to /dev/link/0/data
Tue Feb  4 17:32:39.448: neighbor-app: link/examples/neighbor-app: starting (node id 1400)
As in previous examples, we can go to another shell and cat the emrun status device (/dev/emrun/status) to check that the daemons are running:
BLUE:/home/jelson/projects/emstar(4946) cat /dev/emrun/status 

Process Name    Type       Current Status            Command
------------    --------   --------------            -------
udpd            daemon     pid 23906                 link/udpd
neighbord       daemon     pid 23907                 link/neighbord
neighbor-app    daemon     pid 23908                 link/examples/neighbor-app
In this example, we have an additional status device available, too---the neighbor discovery service creates a file called /dev/link/0/neighbors that shows the neighbors discovered so far on that link. (Systems with multiple network interfaces will have /dev/link/1/neighbors, /dev/link/2/neighbors, etc.). We can check that device to see who's around:
BLUE:/home/jelson/projects/emstar(4947) cat /dev/link/0/neighbors

Node         Interface       State   Last HB  Last Ack
----  ----------------  ----------   -------  --------
1400           0.0.0.0        self        --        --
1003   131.179.144.103      active      1.97      3.75
In this case, the columns indicate the node ID, the node's interface ID, the state of the neighbor, and the time since various transmissions were received from the neighbor. Neighborlists always have a "self" entry; this is convenient for some applications instead of having the node consider itself to be a special case.

The legal states are:

The "dead" state can be demonstrated by killing one of the emruns (on one computer). Within about 20 seconds, the other neighbor status device should reflect the change:

BLUE:/home/jelson(4990) cat /dev/link/0/neighbors 

Node         Interface       State   Last HB  Last Ack
----  ----------------  ----------   -------  --------
1400           0.0.0.0        self        --        --
1003   131.179.144.103        dead     67.25     67.32
This output shows it has been 67 seconds since we last heard from node 1003, and it is now considered "dead".

Running Neighbor Discovery in the Simulator

Of course, neighbor discovery can also be used in the simulator, just as in the pingd example. As in the previous examples, we need a simulator configuration file that describes how many fake nodes to run and what their locations should be. One way to do this would be to make a copy of the simulator configuration we used in the previous example (link/examples/ping/ping-simconfig), and change the line "emruntab = ../link/examples/ping/pingtab" to "emruntab = ../link/examples/neighbors/neighbortab"). This would work. But, it might become tedious to edit the simulator configuration file every time we want to run a different example program. So, we can instead create a "generic" simulator configuration that uses the command-line substitution feature, allowing us to specify the emrun configuration as an argument when we run the simulator.

An example of such a configuration file is in emrun/examples/sim-generic, which looks like this:

# Number of nodes is arg 2
num-nodes = $2;

# Physical size of the simulated field is 20m^2
field-size = (20, 20);

# We want to simulate having MoteNIC available
sim-component = "sim/sim_mote -m circle";

# The default for all nodes is for them to be at a random position in
# the field, power on
node default {
        position = (random(0, 20), random(0, 20));
        power = on;

        # "emruntab" is the emrun config file that you want the nodes
        #to run.  First argument.
        emruntab = $1;
}
This is just like the simulator configuration we used for ping, but the emruntab filename and the number of nodes in the simulation are now both command-line arguments. So, we can run a 10-node neighbor discovery simulation like this:
GAMA:/home/jelson/projects/emstar/obj.i686-linux(4950) setenv SIM_GROUP 5
GAMA:/home/jelson/projects/emstar/obj.i686-linux(4951) emrun/emsim ../emrun/examples/sim-generic ../link/examples/neighbors/neighbortab 10

Tue Feb 18 21:09:03.919: emrun/emsim: starting (simulator component)
Tue Feb 18 21:09:03.947: main: 10-node simulation starting (loglevel NOTICE)
Tue Feb 18 21:09:03.947: sim-component-0 launched, pid 2534
Tue Feb 18 21:09:03.950: libchannel_config: using circle channel model
Tue Feb 18 21:09:03.952: motesim_new_config: now simulating 10 motes
Tue Feb 18 21:09:04.460: simulator state 0 (t=0) now current
Tue Feb 18 21:09:04.461: node 1 powering ON
Tue Feb 18 21:09:04.461: node 2 powering ON
Tue Feb 18 21:09:04.461: node 3 powering ON
...
As with the ping example, we can look at /dev/sim/group5/config to see which random node positions were selected by the simulator:
GAMA:/home/jelson/projects/emstar/obj.i686-linux(5077) cat /dev/sim/group5/config 

Simulator Configuration State 0
Valid from beginning of simulation to end of simulation  *** ACTIVE NOW ***

Node On/Off   Position (m)      EmRun Configuration   
---- ------ ----------------  ------------------------
  1    On   ( 18.92,  19.14)  ../link/examples/neighbors/neighbortab
  2    On   ( 13.16,   6.31)  ../link/examples/neighbors/neighbortab
  3    On   (  9.11,  19.88)  ../link/examples/neighbors/neighbortab
  4    On   (  8.71,   8.97)  ../link/examples/neighbors/neighbortab
  5    On   (  0.77,  17.89)  ../link/examples/neighbors/neighbortab
  6    On   (  6.05,   3.83)  ../link/examples/neighbors/neighbortab
  7    On   ( 19.16,  15.66)  ../link/examples/neighbors/neighbortab
  8    On   (  1.59,  17.67)  ../link/examples/neighbors/neighbortab
  9    On   (  7.61,  11.44)  ../link/examples/neighbors/neighbortab
 10    On   ( 10.69,   6.64)  ../link/examples/neighbors/neighbortab

****
And, as before, all the device files from simulated nodes are moved from their normal locations into the /dev/sim/groupXX/nodeYY directory, where XX is the simulator group and YY is the Node ID. So, for example, to see the neighbor discovery status of simulated node 4, we can do this:
GAMA:/home/jelson/projects/emstar/obj.i686-linux(5078) cat /dev/sim/group5/node4/link/0/neighbors 
Node         Interface       State        Location  Last HB  Last Ack
----  ----------------  ----------  --------------  -------  --------
   4           0.0.0.0        self  ( 8.71,  8.97)       --        --
   2       142.116.0.2      active  (13.15,  6.31)     2.35      5.04
   9       199.148.0.9      active  ( 7.60, 11.44)     2.52      5.04
  10       191.89.0.10      active  (10.68,  6.64)     2.20      5.04
   6          69.3.0.6      active  ( 6.04,  3.82)     5.31      8.44
Of course, this human-readable output is purely for your own convenience in debugging and understanding the operation of the system. If your application needs a neighbor list, it should use the programmatic interface, as demonstrated in the neighbor-app.c source code -- the g_neighbors() function call we talked about earlier. (Don't parse this text output!)

Finally, it's useful to remember that we can also check on the overall status of simulated node 4, as we did in the ping example:

GAMA:/home/jelson(4934) cat /dev/sim/group5/node4/emrun/status 

Process Name    Type       Current Status            Command
------------    --------   --------------            -------
udpd            dead       not simulated             link/udpd
neighbord       daemon     pid 26533                 link/neighbord
neighbor-app    daemon     pid 26543                 link/examples/neighbor-app
Simply press Control-C to terminate the simulation.

Last modified by jelson, 18 February 2003