Chapter 17 Proprietary Fault Tolerance

Some of the low-level infrastructure defined by CORBA makes it possible for a CORBA product to have fault tolerance capabilities. However, for a long time, the CORBA specification did not specify the details of how this fault tolerance should be provided or administered. The result was that many CORBA products provided simple-to-use (but limited) fault tolerance capabilities in a proprietary manner. More recently, the OMG has defined the (optional) CORBA Fault Tolerance specification, which provides a more standardized and more powerful (but also more complex) fault tolerance infrastructure.

This chapter discusses the proprietary fault tolerance mechanisms provided by some CORBA implementations. Section 17.1 introduces some basic issues that need to be addressed by fault tolerance mechanisms. Then Section 17.2 gives overviews of the proprietary fault tolerance mechanisms in several CORBA products. Finally, Section 17.3 discusses some miscellaneous issues that need to be considered when using proprietary fault tolerance mechanisms. A discussion of the newer CORBA Fault Tolerance specification is deferred until the next chapter.

17.1 Basic Issues in Fault Tolerance

17.1.1 Replication Granularity

One important aspect of fault tolerance is the need for replication. Because CORBA is an object-oriented middleware technology, you might think that fault tolerance in CORBA is about the need to replicate objects. Although this is true, CORBA objects live in (POAs within) server processes, so, at a practical level, replicating CORBA objects involves replicating server processes. Indeed, most CORBA products that support replication do so at the granularity of entire server processes rather than at the finer granularity of individual objects.¹ This is because replication at the coarser granularity of an entire process dramatically reduces the volume of information that the CORBA runtime system needs to maintain about replicated entities, and so increases the scalability of applications that utilize replication.

17.1.2 Contact Details for a Replicated Object

If a client fails to communicate with one replica of an object then the client should attempt to communicate with another replica of the object. In order to do this, the client needs to have access to the “contact details” for all the replicas of an object. The flexibility of both interoperable object references (IORs) and implementation repositories (IMRs) provide the necessary infrastructure for this, as I now discuss.

Replicated Server deployed without an IMR.: As discussed in Chapter 10, an IOR can contain several sets of contact details. This makes it possible for a single IOR to contain a separate set of contact details for each replica of an object. A client attempts to communicate with a replicated object by using one of the sets of contact details in the IOR. If this fails then the client switches over to use another set of contact details in the IOR. CORBA does not specify which contact details should be tried first, but they are usually tried in the order they appear in the IOR.
Replicated Server deployed with an IMR.: As discussed in Section 7.2.3, some CORBA products allow a replicated server to be registered with the IMR. If this is done then the IOR of a replicated object contains the host and port of the IMR; when a client sends its first request to the IMR, the IMR redirects the client to one of the replicas of the desired object. The client sends future requests to that same replica. If the client’s communication with the replica ever fails then the CORBA runtime system in the client switches back to using the original host and port, which is for the IMR, and the IMR then redirects the client to another replica of the desired object.
Replicated IMR.: In order to prevent the IMR from becoming a single point of failure, the IMR should itself be replicated. An IOR for an object in a server that is deployed through a replicated IMR contains the host and port details for all the IMR replicas. Because of this, the client’s first request is sent to one of the IMR replicas. If the client cannot communicate with that IMR replica then the CORBA runtime system in the client switches over to use another replica of the IMR. The IMR replica then redirects the client to the desired (and possibly replicated) server.

17.1.3 Use of `PERSISTENT` POAs

The meaning of the PERSISTENT and TRANSIENT POA policies are explained in Section 6.1.3. To recap briefly, a POA (Section 5.5) is a container for servants (the host programming language objects that represent CORBA objects). The policies that are used to create a POA are applied to all the servants within a POA. If a POA has the PERSISTENT policy then the IORs for all servants/objects in that POA are valid even if the server process dies and is restarted. Conversely, if a POA has the TRANSIENT policy then the IORs for all servants/objects in that POA are valid only for the duration of the server process: once the server process dies, the TRANSIENT IORs automatically become invalid.

At a philosophical level, it is not sensible for TRANSIENT, that is, temporary, objects to be fault tolerant. At a practical level, a TRANSIENT IOR is automatically invalidated when a server terminates,² so this makes it difficult, if not impossible, for a fault tolerance infrastructure to work correctly when server processes that contain TRANSIENT objects die and are restarted.

If you want to implement a server that will be deployed with fault tolerance then ensure that you use PERSISTENT POAs in the server.

17.2 Example Products

17.2.1 OmniORB

OmniORB does not provide an implementation repository so there is no need to discuss how to replicate an omniORB server that is deployed through an IMR. Instead, this section focuses on how to deploy a replicated omniORB server without an IMR.

OmniORB has two configuration variables that, when combined, can be used to set up a replicated server.

endPoint = giop:tcp:<host>:<port>
The giop:tcp: portion of the value specifies that the IIOP communication protocol is to be used (omniORB supports other communication protocols too); after that the server’s host and listening port are specified. Some readers might think it is redundant to have to explicitly specify the host on which the server is running. However, doing this can be useful if the server runs on a multi-homed machine, that is, if the server’s machine is known by several different names or has several different IP addresses. The server listens on the specified host:port and also embeds that host:port information in IORs for objects within the server.
endPointNoListen = giop:tcp:<host>:<port>
The value of this variable is in the same format as that of endPoint. The server does not listen on the specified host:port, but it does embed that host:port information in IORs for objects within the server.

Let us assume that you want to have three replicas of a server: one replica will run on host1.foo.com and listen on port 5000, another replica will run on host2.foo.com and listen on port 6000, and the final replica will run on host3.foo.com and listen on port 7000. The omniORB configuration variables should be set for the first server replica as shown below:


# Extract from omniORB configuration file for replica 1
endPoint = giop:tcp:host1.foo.com:5000
endPointNoListen = giop:tcp:host2.foo.com:6000
	                 = giop:tcp:host3.foo.com:7000

The result is that the server listens on host1:5000, but exported IORs indicate that objects within the server can be contacted at any of: host1:5000, host2:6000 or host3:7000 (all within the foo.com domain). The corresponding information in the configuration file for the second replica is shown below:


# Extract from omniORB configuration file for replica 2
endPoint = giop:tcp:host2.foo.com:6000
endPointNoListen = giop:tcp:host1.foo.com:5000
	                 = giop:tcp:host3.foo.com:7000

Likewise, the information in the configuration file for the third replica is shown below:


# Extract from omniORB configuration file for replica 3
endPoint = giop:tcp:host3.foo.com:7000
endPointNoListen = giop:tcp:host1.foo.com:5000
	                 = giop:tcp:host2.foo.com:6000

The overall effect is that each server listens on its own host:port, but exported IORs contain the host:port details for all the replicas.

Note that an omniORB application finds the omniORB configuration file by the value of the OMNIORB_CONFIG environment variable. If you want to run several server replicas on the same machine then you will need to have several configuration files (one for each server replica) and set the OMNIORB_CONFIG environment variable appropriately when starting each replica. If you prefer to not have multiple configuration files then you can specify the endPoint and endPointNoListen configuration information as command-line options when starting each server replica. The servers then pass the command-line options as a parameter to ORB_init(), as discussed in Section 3.2.3.

17.2.2 Orbix

Orbix allows a replicated server to be deployed with or without an implementation repository. I discuss each of these forms of deployment in turn.

17.2.2.1 Deploying a Replicated Server with an IMR

Orbix administration is performed through sub-commands of the itadmin utility. Each sub-command performs a small amount of work so you typically need to execute several itadmin commands to complete a useful unit of work, such as registering an Orbix server with the IMR. However, itadmin has a built-in interpreter for an open-source scripting language called Tcl (pronounced Tickle). This makes it possible to write a Tcl script that performs the entire sequence of itadmin commands required to carry out a task. The Orbix Administration Made Simple chapter of the CORBA Utilities package [McH] discusses several useful task-based itadmin Tcl scripts. One of those scripts, orbix_srv_admin, simplifies the work involved in registering a server so that it can be deployed through the IMR.

When registering a server with orbix_srv_admin, you can specify either a single host or a list of hosts on which the server is to be run.³ If you specify a list of hosts then orbix_srv_admin registers the server as a replicated server.

As discussed in Section 17.1.2, the IOR of a object in a replicated server contains the host and port of the IMR; when a client sends its first request to the IMR, the IMR redirects the client to one of the replicas of the desired object. The client sends future requests to that same replica. If the client’s communication with the replica ever fails then the CORBA runtime system in the client switches back to using the original host and port, which is for the IMR, and the IMR can then redirect the client to another replica of the desired object.

The itconfigure utility is used to set up the initial configuration for an IMR. A option in this utility makes it easy to create a replicated IMR (so that the IMR is not a single point of failure). If you do this then, as discussed in Section 17.1.2, an IOR for an object in a server that is deployed through a replicated IMR contains the host and port details for all the IMR replicas. Because of this, the client’s first request goes to one of the IMR replicas. If the client cannot communicate with that IMR replica then the CORBA runtime system in the client switches over to use another replica of the IMR. The IMR replica then redirects the client to the desired (and possibly replicated) server.

The Orbix IMR keeps track of which server replicas are currently running, and it can optionally be used to automatically restart servers that have died. When a client sends its first request to the IMR, the IMR redirects the client to one of the currently running server replicas. The IMR can use either a round-robin or random policy in choosing to which server replica a client is redirected. By doing this, the fault tolerance mechanism also provides a per-client load balancing strategy.

17.2.2.2 Deploying a Replicated Server without an IMR

By default, an Orbix server with persistent POAs (Section 6.1.3) is deployed through the Orbix IMR. If you want to deploy a persistent-POA server without the Orbix IMR then your server has to invoke some Orbix-proprietary APIs.⁴ However, the PoaUtility class (discussed in the Creation of POA Hierarchies Made Simple chapter of the CORBA Utilities package [McH]) encapsulates the use of these proprietary APIs and makes it easy for different deployment options to be chosen through, say, a command-line option.

When deploying a persistent-POA server without an IMR then the server should listen on fixed ports. The fixed ports can be specified in the Orbix runtime configuration file, as illustrated in Figure 17.1. The following points should be noted about this configuration:

Orbix configuration is arranged into (potentially nested) scopes. For example, BankSrv is a scope, and BankSrv.replica_1 is one of three scopes nested within it. When an Orbix application is started, the -ORBname<scope> command-line option can be used to specify the scope from which the application should obtain its configuration.
Orbix allows different POAs within a single server to listen on the same or different ports. Proprietary APIs are required to make use of this flexibility, but the PoaUtility class [McH, Ch. 5] encapsulates the use of these APIs, and allows each POA manager (Section 5.7) to use a different port. The example in Figure 17.1 assumes that the servers use the PoaUtility class and have two POA managers: one identified by the label "core" and the other identified by the label "admin".
Variables with names of the form <label>:iiop:addr_list specify a list of host:port strings, some of which may have an optional "+" prefix. The server listens on any host:port entries that are not preceded with "+". However, all the host:port pairs are embedded in IORs for objects (within POAs controlled by the <label> POA manager) within the server.

If a server replica is started with the -ORBname BankSrv.replica_1 command-line argument then that server listens on both host1:5000 and host1:5001 (one port per POA manager), but the server also embeds the other listed host:port details in exported IORs. By starting a second replica with -ORBname BankSrv.replica_2 and starting a third replica with -ORBname BankSrv.replica_3, the overall effect is that each server listens on its ports but exported IORs contain the host:port details for all the replicas.


BankSrv {
    replica_1 {
        core:iiop:addr_list  = [ "host1:5000",
                                "+host2:6000",
                                "+host3:7000"];

        admin:iiop:addr_list = [ "host1:5001",
                                "+host2:6001",
                                "+host3:7001"];
    };
    replica_2 {
        core:iiop:addr_list  = ["+host1:5000",
                                 "host2:6000",
                                "+host3:7000"];

        admin:iiop:addr_list = ["+host1:5001",
                                 "host2:6001",
                                "+host3:7001"];
    };
    replica_3 {
        core:iiop:addr_list  = ["+host1:5000",
                                "+host2:6000",
                                 "host3:7000"];

        admin:iiop:addr_list = ["+host1:5001",
                                "+host2:6001",
                                 "host3:7001"];
    };
};

Figure 17.1: Orbix fault tolerance configuration for a replicated server

This Orbix mechanism is broadly similar to that offered by omniORB (Section 17.2.1), with the following minor differences:

All the objects in an omniORB server are accessible through the same port, while Orbix allows the possibility for some objects in a server to be accessible through one port and other objects in the same server to be accessible through a different port. The main apparent benefit of this additional flexibility in Orbix is that one port might be accessible across a firewall while access to the other port (and hence the objects accessible via it) might be restricted to clients inside the firewall. In effect, this provides a simple security mechanism. However, this security mechanism is likely to be too basic for the needs of most organizations. The more flexible approach offered by the CORBA Security Service (Chapter 23) has more widespread appeal.
If several omniORB applications run on the same computer but require different configuration then this is achieved by having a separate configuration file for each omniORB application. In contrast, the use of a separate scope for each application in an Orbix configuration file allows several applications to share the same configuration file. Minimizing the number of configuration files can simplify administration. Orbix allows configuration information to be stored in either a textual file or in a configuration repository, which is a database that is accessed via a CORBA server “wrapper”. The use of a configuration repository allows configuration information to be maintained in a centralized location while being accessible from Orbix applications that run on the same or different computers. This centralized maintenance of configuration information can further simplify administration.

The above differences between omniORB and Orbix are relatively minor when deploying replicated servers without an IMR. As far as fault tolerance is concerned, the biggest difference between omniORB and Orbix is that Orbix also supports fault tolerance for servers that are deployed with an IMR (as discussed in Section 17.2.2.1).

17.2.3 Orbacus

Orbacus does not provide any fault tolerance capabilities within its runtime system. Instead, it provides a command-line utility called iormerge. An example of its usage is as follows:


iormerge -f replica1.ior replica2.ior replica3.ior >new.ior

By default, iormerge interprets its command-line arguments as stringified IORs. However if, as shown above, the -f option is given then iormerge interprets its command-line arguments as the names of files that contain stringified IORs. The iormerge utility reads the “contact details” from these IORs and writes to standard output a new stringified IOR that contains all the contact details. If this newly created IOR is made available to client applications then the clients can communicate with any of the replicas.

Use of iormerge is sufficient for replicated servers that contain only a singleton object. However, it is not appropriate for servers that have, say, a factory object (Section 1.4.2.1) that can create new objects on demand. The reason for this is that the IOR of a newly created object contains only one set of contact details (for the server replica in which the object is created) rather than the contact details for all the replicas.

17.2.4 Server-side support for `corbaloc` URLs

As discussed in Section 12.4.2, CORBA has not standardized the server-side support for corbaloc URLs. Instead, most CORBA products provide proprietary APIs that can be used to make an object accessible via a corbaloc URL. Let us assume you use these proprietary APIs so that a server makes an object available under the name "foo". If you deploy three replicas of a server that listen on host1:5000, host2:6000 and host3:7000 then the following corbaloc URL can be used by clients to communicate with the replicated "foo" object:


corbaloc::host1:5000,:host2:6000,:host3:7000/foo

Use of such a corbaloc URL is sufficient for replicated servers that contain only a singleton object. However, it is not appropriate for servers that have, say, a factory object (Section 1.4.2.1) that can create new objects on demand. The reason for this is that the IOR of a newly created object contains only one set of contact details (for the server replica in which the object is created) rather than the contact details for all the replicas.

17.2.5 Critique

As you might expect, the proprietary fault tolerance mechanisms of different CORBA products differ from one other, both in their “look and feel” and in their capabilities. However, they have some characteristics in common:

With the exception of corbaloc URLs, none of the proprietary mechanisms require server-side coding. Rather, they are based on runtime configuration or the use of administrative utilities.
The proprietary fault tolerance mechanisms are concerned solely with getting several sets of “contact details” into the IORs that are exported from replicated servers. The mechanisms neither help you nor hinder you in maintaining state consistency across server replicas.
Although the mechanisms are proprietary, they do not hinder interoperability with client applications that are implemented using other CORBA products. This is because the proprietary mechanisms are concerned solely with embedding several sets of “contact details” into the IORs that are exported from servers. Since having several sets of contact details in one IOR is CORBA-compliant, a client implemented with any CORBA product can communicate with a server that uses a proprietary fault tolerance mechanism.

17.3 Miscellaneous Issues

17.3.1 Fault Tolerance is not Load Balancing

It should be noted that although fault tolerance and load balancing both rely on the use of replicas, a fault tolerance infrastructure does not necessarily imply load balancing. Most of the fault tolerance mechanisms discussed in this chapter do not provide load balancing. Instead, it is likely (though not guaranteed) that most/all clients will communicate with just one server replica; it is only when that server replica dies that clients will seek another replica with which to communicate. One exception to this concerns replicated servers that are deployed through the Orbix IMR (Section 17.2.2.1). In this case, the IMR performs per-client load balancing, that is, the IMR redirects some clients to one server replica, some other clients to another replica and so on.

Although the load balancing mechanism provided by the Orbix IMR can be useful, it is possible for the load from clients to be spread over server replicas in an unequal manner. For example, let us assume that 10 clients connect, via the Orbix IMR, to two replicas of a server, and that there are 5 clients connected to each replica. If each client sends a similar number of requests to servers then the load is balanced over the two servers, at least initially. However, let us assume that after a few minutes, 3 of the clients connected to one of the replicas terminate. We are now left with just 2 clients communicating with one server replicas while there are 5 clients communicating with the other replica. The Orbix IMR does not have any way of re-balancing the number of clients across the server replicas. Likewise, if a third server replica is now started then the IMR does not have any way of re-balancing the number of clients across the now-increased number of server replicas.

The term adaptive load balancing is often used to refer to load balancing mechanisms that periodically try to rebalance load from clients over servers. CORBA has not standardized on a load balancing mechanism and further discussion of load balancing is outside the scope of this book. Several CORBA products provide proprietary load balancing mechanisms. Interested readers are advised to consult documentation of CORBA products for details and to do a search with an Internet search engine for, say, “CORBA adaptive load balancing”.

An alternative approach to load balancing is to invest in a hardware load-balancing switch. The hardware switch acts as a delegation server. It receives requests from client machines and delegates them to server machines specified in the switch’s configuration. Some hardware switches utilize mechanisms that attempt to keep client load balanced across server machines.

17.3.2 Timeout Values in a Fault Tolerant System

A practical issue to keep in mind is that you may need to adjust client-side timeout values when setting up a fault tolerant system. In particular, if a server process is not running then a client’s attempt to connect to the server will fail quickly (typically within a few milliseconds) and so the CORBA runtime system in the client can quickly fail-over to use another set of contact details in an IOR. However, if the server’s computer is turned off or is physically disconnected from the network then the client’s attempt to connect to the server may take a relatively long time (perhaps tens of seconds) to fail. Such a long delay before a fail-over occurs is often unacceptable to users. You should consult your CORBA vendor’s documentation to find out how you can shorten the client’s timeout for establishing connections. If your clients and servers all run on a fast local area network then you can usually shorten the connection timeout to a tiny fraction of a second. Doing this will result in a fast fail-over when a server machine dies.

1: At least one CORBA product, Orbix, allows replication at the mid-level granularity of individual POAs. However, in practice, very few Orbix applications take advantage of this finer level of replication granularity. Instead, most deployed Orbix applications that use replication choose to replicate all the POAs within a server process, so the effect is to replicate the entire server process.
2: The automatic invalidating of a TRANSIENT IOR is usually achieved by having the CORBA runtime system in a server embed a timestamp in the object key (Section 5.6.1). When a server dies and is restarted, the timestamp information in a previously-exported TRANSIENT IOR is out of date; this ensures that a TRANSIENT IOR is not valid across restarts of a server.
3: Technically, you specify a list of node daemons, rather than a list of hosts. However, there is normally one node daemon per host so the distinction between node daemons and hosts is not relevant to the discussion at hand.
4: The need to use proprietary APIs to choose between different deployment models for Orbix servers is slowly disappearing. Orbix 6 allows this decision to be made through runtime configuration values for C++ applications but, at the time of writing, Java applications still require use of proprietary APIs.