IMPLEMENTATION OF A FAULT TOLERANT NETWORK MANAGEMENT
SYSTEM USING VOTING
High Performance Computing and Simulation Research Lab
Department of Electrical and Computer Engineering
University of Florida
Dr. Alan D. George and Edwin Hernandez
Abstract
Fault tolerance (FT) can be achieved through replication, and N-Module Redundancy (NMR) systems are
widely applied in software and hardware architectures. An NMR software system can use static or
dynamic voting mechanisms: static voters decrease the complexity of the control algorithms but add
performance bottlenecks, while dynamic voters have increased control complexity along with
bottlenecks of their own. A Network Management System (NMS) requires replication to
increase reliability and availability. The software implementation of a Fault Tolerant Network
Management System (FT-NMS) shown here uses the NMR approach. The contributions of this paper concern
the implementation issues of the system and measurements of its performance at the protocol and
application levels. The resulting measurements lead to the conclusion that the system can mitigate the
bottlenecks and preserve simplicity by applying appropriate synchronization and buffer management techniques.
Keywords: Network Management, Fault-Tolerant Distributed Applications, SNMP, NMR.
1. INTRODUCTION
Several local-area and wide-area networks rely on network management information for decision making
and monitoring. Therefore, a Network Management System (NMS) has to maintain high availability and
reliability in order to complete those tasks. Consequently, the implementation of an NMS has to be fault
tolerant and include software replication. Moreover, fault masking and voting are techniques added to the
system as a consequence of replication [Prad97]. In addition, the most prevalent solutions in
distributed systems for creating FT server models use replicated servers, redundant servers and event-based
servers [Lan98]. If replicated servers are chosen, the system selects between static and dynamic coordination.
Static coordination does not require a leader election mechanism; indeed, fault masking using voting is
easier to achieve. Dynamic coordination introduces protocol overhead, and leader election and voting
coordination algorithms have to be included in every transaction. There are several replica control
protocols involving leader election. They depend on the quorum size, availability and system distribution;
examples include Coterie, Quorum Consensus, Dynamic Voting, Maekawa's, Grid, Tree and
Hierarchical protocols, found in [WU93] and [Beed95]. The algorithms mentioned above introduce
overhead and can degrade the performance of the system in terms of communication or processing
[SAHA93]. Other replica control protocols, defined in [Paris94] and [Sing94], minimize the
placement of the replicas of a database or of the replicas in a network. However,
the protocol complexity is not avoided and remains present. Triple Module Redundancy (TMR) and
masking fault tolerance can be found in [Aro95] and [Brasi95]. A TMR voting protocol requires neither a
leader nor leader election protocols, but it does require broadcasting messages from one node to all nodes in
the group; its complexity therefore lies in message ordering and coordination. Experimental
measurements of the TMR nodes yielded a total processing time between 50 and 200 ms using 100 messages
of 64 bytes each. Static voting architectures require high reliability of the network node where the
voter resides, because the voter becomes a single point of failure (SPF). However, protocol overhead is light
and fault recovery can be achieved almost instantaneously. These advantages support the use of a voter in
high-performance networks and the implementation of static TMR systems such as the one presented in this
paper.
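The static TMR voting adopted here can be illustrated with a minimal majority function. This sketch is ours, not taken from the cited systems; `tmr_vote` is a hypothetical name and replica readings are simplified to integers.

```cpp
#include <array>
#include <optional>

// Majority vote over three replica readings (static TMR).
// Returns the value reported by at least two replicas, or
// std::nullopt when all three disagree (vote inconclusive).
std::optional<long> tmr_vote(const std::array<long, 3>& r) {
    if (r[0] == r[1] || r[0] == r[2]) return r[0];
    if (r[1] == r[2]) return r[1];
    return std::nullopt;  // no majority: the sample is flagged, not masked
}
```

Because the voter is static, this is the entire decision logic; no leader election or message-ordering protocol is needed around it.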
In addition to the requirements of group communication, the detection of a failure is generally done by
combining a heartbeat with either a predefined [Maffe96], [Prad96], [Landi98] or a self-adapted
timeout period [Agui97]. A predefined timeout was used in the FT-NMS implemented here. Nevertheless,
there are several other techniques for handling faults in network managers and agents, such as the
hierarchical adaptive distributed system-level diagnosis (HADSD) defined in [Duar96].
Moreover, monitoring is one of the main tasks of any NMS. For this purpose, a traditional approach was
used, in which a set of replica managers, organized in a tree structure, runs on a set of network nodes.
The managers monitor a set of agents using Simple Network Management Protocol (SNMP)
requests as defined in [Rose90] and [Rose94]. Decentralization and database monitoring [Scho97],
[Wolf91] were not implemented in the application.
This paper is organized in the following manner. Section 2 presents the assumptions used for the
experiments and for the design of the system's architecture. Section 3 describes the
algorithms and system design. Finally, Section 4 presents performance measurements run on different
testbeds, such as Myrinet and an ATM LAN.
2. Assumptions
The FT-NMS application relies on a simple heartbeat error-detection mechanism and sender-based message
logging to generate the monitored information (the replicas, not the voter, are responsible for initiating
the communication). Failures are detected when a failing system stops replying to heartbeats, at which
point it is considered to have a transient fault. If a transient fault lasts longer than the timeout period,
the faulty unit is considered down and will no longer provide useful information to the system. Basically,
a machine fails by crashing or by fail-stop behavior.
A TCP/IP environment with naming services has to be available for the replica units. In addition, it
is assumed that SNMP daemons are already installed and working correctly on all the agents to be
monitored. The data types that can be retrieved from an agent's Management Information Base (MIB) are
INTEGER, INTEGER32 and COUNTER, as defined in the ASN.1 standard [Feit97], [OSI87]. Each
node should be in the same sub-network, thereby avoiding long communication delays between
managers; otherwise, the timeouts have to be modified. Finally, the heartbeat interval used is one second.
3. System’s Model
The system's model is composed of two sub-systems: the managers (Section 3.1) and the voter-gateway
(Section 3.2). Managers depend on the voter, which also works as a coordinator. All the applications
mentioned here are multithreaded and client/server. Moreover, the manager makes use of the Carnegie
Mellon University SNMP Application Programming Interface (CMU-SNMP API). The voter runs on
a separate, highly reliable computation node, while the replica managers run on different
network nodes. The different manager modules are shown in Figure 1.
Figure 1. SNMP Manager Application, using the CMU-API to handle SNMP packets for the agents
3.1. Manager
The manager application uses the CMU-SNMP API to send snmpget packets to the agents. It has access
to a local MIB database (Figure 1), which is handled by the HCS_SNMP object (Figure 2); this object
is used to support all the different MIBs found in the network management agents. In addition, a
simple UDP echo server runs concurrently to handle the heartbeat service provided by the voter.
class HCS_SNMP {
private:
    struct snmp_session session, *ss;
    struct snmp_pdu *pdu, *response1;
    struct variable_list *vars;
    char *gateway, *community;
    int Port;
    oid name[MAX_NAME_LEN];
    int name_length;
    int HCS_SNMPCommunication(snmp_pdu* response, char** varname, char** value);
public:
    HCS_SNMP(char* gateway, char* community);
    ~HCS_SNMP();
    int HCS_SNMPGet(char** namesobjid, int number, char** varname, char** value);
};
Figure 2. Class description of the SNMP Object
The main goal of the FT-NMS is a distributed method for reliable monitoring of the MIBs handled at different
agents. The system is designed to keep a heavyweight process running for each agent (Figure 1). Each
heavyweight process is able to handle 64 simultaneous Object Identifiers (OIDs) from the MIB of any
agent. The SNMP API provides all the libraries and services to convert from Abstract Syntax Notation
One (ASN.1) to the different data types used in C++. Each manager reads a table of the OIDs
that it has to monitor from the agents using polling. In addition, the manager defines a polling
strategy and writes all the responses to a file. Furthermore, each manager creates and sends a TCP packet
with the format shown in Figure 6 to the main voter application, reporting the results gathered from the
agents being monitored.
Also, the sampling time and the number of samples to monitor are defined for every OID. In order to
achieve accurate measurements, synchronization is required to make each poll concurrent; otherwise,
the value sampled from agent i at T0 plus any ∆T will be misaligned
with the values sampled from agent i+1 by another replica. This behavior may not affect large sampling
intervals, in which the sampling time is greater by several orders of magnitude than the misalignment; but
since in high-performance networks the sampling time must be very small, synchronization
has to be treated as a priority. Therefore, a Two-Phase Commit protocol is used to achieve
synchronization between samples and to coordinate groups of replica monitoring applications (Section 3.1.1).
3.1.1. Sample synchronization through a Two Phase Commit Protocol (2PC).
As previously stated, the main weakness of a distributed network management system is time
synchronization. Although the Network Time Protocol (NTP) could supply some useful information, its
network traffic and lack of precision make it unsuitable for high-performance networks (HPNs). In fact, a
2PC protocol is easier to implement and provides the required synchronization. In addition, the
implementation of the heartbeat was merged with the 2PC, avoiding any possibility of deadlock (Figure
3b).
Monitoring at the manager follows the pseudo-code in Figure 4. The manager waits to execute the
sampling action (commit) on the agent until the voter delivers the commit packet to all non-faulty managers.
The manager then delivers the SNMP Get packet to the agent, polling the required information.
Finally, the manager transmits the sampled information to the gateway-voter element. Observe that a
correction_factor is applied to the waiting time between samples. This modification keeps the sampling
time accurate and reduces the error introduced by the time spent on synchronization and on the
round-trip communication to the agent.
As can also be drawn from Figure 4, the minimum sampling time at the manager is

    Tmin = T2pc + Tsnmpget + TmsgResponse    (eq. 1)
(a) (b)
Figure 3. Two Phase Commit Protocol to achieve synchronization
while (n_samples > i) {
    i++;
    Start = gethrtime();
    HCS_MAN->TwoPC();                          // TCP socket connection
    HCS_MAN->SNMPGet(OIDs, agent, &Response);  // UDP socket connection
    HCS_MAN->SendResponse(Response, Gateway);  // TCP socket connection
    Correction_factor = gethrtime() - Start;
    Wait(sampling_time - Correction_factor);
}
Figure 4. Pseudo-code executed at each replica manager
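The correction_factor idea of Figure 4 can be sketched with standard C++ timing in place of Solaris gethrtime(); the names corrected_wait_ms, sample_once and do_sample are illustrative assumptions, not part of the original implementation.

```cpp
#include <chrono>
#include <thread>

// Corrected inter-sample wait, as in Figure 4: the time already spent
// on the 2PC, the SNMP get, and reporting to the voter is subtracted
// from the nominal sampling period, so the effective period stays close
// to sampling_time. A result of 0 means the work overran the period.
long corrected_wait_ms(long sampling_time_ms, long elapsed_ms) {
    long wait = sampling_time_ms - elapsed_ms;
    return wait > 0 ? wait : 0;
}

// One loop iteration, timed with std::chrono instead of gethrtime();
// do_sample stands in for the TwoPC + SNMPGet + SendResponse sequence.
template <class F>
void sample_once(long sampling_time_ms, F do_sample) {
    auto start = std::chrono::steady_clock::now();
    do_sample();
    long elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::this_thread::sleep_for(std::chrono::milliseconds(
        corrected_wait_ms(sampling_time_ms, elapsed)));
}
```

The clamp to zero reflects eq. 1: when the 2PC, the get and the response already exceed the period, the next sample starts immediately.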
3.2. The voter–gateway (GW).
With different replicas monitoring the same agent, the voter collects all the non-faulty measurements. In
order to achieve congruent results and generate a voted output, an instance of the voter-gateway has to be
running on a highly reliable network node. The gateway, or voter, is shown in Figure 5. As mentioned above,
the use of a voter avoids the implementation of complex leader-election schemes in replica management.
Fault masking is easily achieved in transitions from N to N-1 replicas, and processing delays are almost
non-existent. In other words, whenever a failure is injected into the system, any manager application can fail
with graceful degradation of the overall system.
However, the voter becomes a performance bottleneck in the sense that all the traffic from the managers is
directed to the voter; therefore, several performance issues have to be identified and ways to improve them
found.
(Figure 3 shows the 2PC message exchange between the voter and managers 1-3: the 2PC round, the
SNMP GET, and the message packet back to the voter, with Tpc + Tmsg + Tsnmpget + Interval determining
the timing; the manager only waits Interval seconds before taking the next sample.)
// Thread content: multithreaded 2PC commit protocol for synchronization
void* commit(void* sock_desc) {
    read(sock_desc, commit_request);
    P(mutex); n_commits++; V(mutex);
    wait_until(n_commits == n_available_managers);
    write(sock_desc, "GO");
    P(mutex); n_commits--; V(mutex);
}

Each manager's request-to-commit is accepted by a bounded commit thread, while an LWP monitors the
timeout for the commit threads: if one of the available managers fails to COMMIT, the whole operation
fails by a timeout and a CANCEL is sent. The goal is to use the 2PC to achieve synchronicity between the
managers and the voting application.
Figure 5. Gateway Application Architecture.
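The commit handling above can be sketched as a barrier built on standard C++ threads. This is a single-round simplification of the paper's LWP-monitored design, with the timeout folded into a condition-variable wait; CommitBarrier and the demo_* helpers are hypothetical names.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Single-round commit barrier for the voter's 2PC: each manager
// connection thread calls arrive(); all are released ("GO") once every
// available manager has committed, or the round fails ("CANCEL") when
// the timeout expires first.
class CommitBarrier {
    std::mutex m;
    std::condition_variable cv;
    int n_commits = 0;
    const int n_available;
public:
    explicit CommitBarrier(int available) : n_available(available) {}

    // Returns true for GO, false for a timed-out round.
    bool arrive(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(m);
        if (++n_commits >= n_available) {
            cv.notify_all();  // the last committer releases the round
            return true;
        }
        return cv.wait_for(lk, timeout,
                           [this] { return n_commits >= n_available; });
    }
};

// Three managers commit within the timeout: every arrive() returns GO.
bool demo_commit() {
    CommitBarrier b(3);
    bool ok1 = false, ok2 = false;
    std::thread t1([&] { ok1 = b.arrive(std::chrono::seconds(1)); });
    std::thread t2([&] { ok2 = b.arrive(std::chrono::seconds(1)); });
    bool ok3 = b.arrive(std::chrono::seconds(1));
    t1.join();
    t2.join();
    return ok1 && ok2 && ok3;
}

// Only one of two expected managers commits: the round is cancelled.
bool demo_timeout() {
    CommitBarrier b(2);
    return b.arrive(std::chrono::milliseconds(50));
}
```

Folding the timeout into wait_for removes the separate monitoring LWP at the cost of making each round one-shot, which matches the paper's per-sample use of the protocol.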
As shown in Figure 5, several objects work together to achieve total monitoring, i.e., N replica managers
monitoring M agents concurrently. The voter-gateway was tested using two approaches:
asynchronous messages from the managers, and total synchronization using the 2PC protocol [Chow97].
The communication frame between the managers and the voter is presented in Figure 6.
Total Length: 560 Octets/bytes
Figure 6. SNMP information frame Manager-to-Gateway.
As shown there, the voter holds local objects initialized with information about all the replicas, such as the
OIDs monitored by each manager, and about each agent (its name and the parameters to be measured
from it). The overall scenario is drawn in Figure 7.
Structure of the Manager-to-Voter message (after the TCP header): Manager Name (16 bytes),
Time Stamp (16 bytes), Agent Name (16 bytes), OID in octet format (256 bytes), Value Measured (256 bytes).
(The gateway application comprises: a heartbeat thread with a one-second period over UDP sockets;
the local objects; a reception thread that spawns one thread per TCP connection; a shared buffer of
received data; one voting thread per monitored agent, each writing to a voted database; and a general
log file, on the server and client sides of the main application.)
Figure 7. Distributed Fault Tolerant Network Managers and the Gateway accessing a Network Agent
First, the voter creates, via RPC calls, NxM instances of managers (hcs_snmp_man objects) across the
network nodes, where M is the number of agents to be monitored by the voter and N is the number of
replicas. There are K servers performing as network nodes, or peers, for the remote execution of the
NxM instances.
Then, once all the instances are executing, each replica manager polls its corresponding agent. The
replica gathers the responses for all the OIDs defined for monitoring and generates the manager-to-voter
frame with the information gathered (Figure 6).
The tasks performed at the voter are summarized as follows:
a) The voter reads the configuration files describing the agents to monitor and the OIDs to be read by the
managers, and generates the local objects of the environment.
b) Remote shells are executed on each node of the system.
c) The voter activates the threads for heartbeat processing and failure detection, and the TCP port
listener for SNMP results and commits.
d) Concurrently, the replica managers communicate with the agents, collecting network management
information.
e) The replica managers send the queried information to the voter, using the frame in Figure 6.
f) The voter handles all the arriving information using one of two structures: a linear buffer or a hash table.
Simultaneously, local threads are created per agent; each thread digs into the data collected and
generates the voted result for the agent being monitored.
Each message that arrives at the voter is converted to the format of the class Msgformat shown in Figure 8.
(Figure 7: the replica managers, each with a database of local agents, issue SNMPGET requests to a
network agent's MIB over the ATM network; the voter/NM proxy collects the responses, votes per OID
into the voted database, and handles the heartbeat, the remote shells and the collection of votes.)
class Msgformat {
public:
    int marked;              // message marked to be deleted
    char* manager;           // manager name
    char* agent;             // agent owning the information
    char* timestamp_client;  // timestamps at client and server (manager and GW)
    char* timestamp_server;
    char* OID;               // Object Identifier according to the MIB
    char* SNMPResponse;      // response from the agent
    hrtime_t start;          // for performance measurements
    Msgformat();
    ~Msgformat();
};
Figure 8. Msgformat class used at the voter to collect the information from the replica managers.
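A hedged sketch of how the 560-byte manager-to-voter frame (Figure 6) can be split into Msgformat-like fields by its fixed offsets; ParsedMsg, field and put_field are illustrative names, and std::string replaces the raw char* members of the original class.

```cpp
#include <cstddef>
#include <string>

// Parsed manager-to-voter frame (Figure 6): fixed-width fields of
// 16, 16, 16, 256 and 256 bytes, 560 bytes in total after the TCP header.
struct ParsedMsg {
    std::string manager;           // bytes   0..15
    std::string timestamp_client;  // bytes  16..31
    std::string agent;             // bytes  32..47
    std::string oid;               // bytes  48..303
    std::string snmp_response;     // bytes 304..559
};

// Copy one fixed-width field, dropping trailing NUL padding.
std::string field(const char* buf, std::size_t off, std::size_t len) {
    std::string s(buf + off, len);
    return s.substr(0, s.find('\0'));
}

// Write a value into a fixed-width field (used here for illustration).
void put_field(char* buf, std::size_t off, const std::string& s) {
    s.copy(buf + off, s.size());
}

ParsedMsg parse_frame(const char (&buf)[560]) {
    return { field(buf, 0, 16),  field(buf, 16, 16), field(buf, 32, 16),
             field(buf, 48, 256), field(buf, 304, 256) };
}
```

The offsets mirror the strncpy calls in the fillbuffer() listing of Figure 9a (0, 16, 32, 48 and 304).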
The Msgformat instance is stored either in the buffer or in the hash table. A thread
(fillbuffer in Figure 9a) is created per message received from the replica managers; therefore, if all the
replica managers sample concurrently, the number of threads created will be N_Agents*N_Managers.
The access policy to the shared structure (buffer or hash table) used by the threads is round-robin. In
addition to these threads from the replicas, there are voter threads created for each agent. Each
voter thread is in charge of pulling from the buffer and generating the voted output. This process
can be viewed as a join between two tables: the first composed of the agents and OIDs to monitor, and the
second the "table" of messages, which is the buffer. The JOIN operation executed is
msg->OID==agents[thread_id].OID && msg->agent==agents[thread_id].agentname (Voter in Figure 9b).
void* fillbuffer(void* sock_desc) {
    while (message[1] != 'q') {
        if (read(sock_desc, &message, SIZE_MSG) == SIZE_MSG) {
            msg = new Msgformat;
            msg->start = gethrtime();
            strncpy(msg->manager, message, 16);
            strncpy(msg->timestamp_client, message + 16, 16);
            strncpy(msg->agent, message + 32, 16);
            strncpy(msg->OID, message + 48, 256);
            strncpy(msg->SNMPResponse, message + 304, 256);
            gettimeStamp(msg->timestamp_server);
            msg->marked = 0;
            P(db);
            buffer->append((void*) msg);
            V(db);
        }
    }
    close(sock_desc);
    thr_exit((void*) 0);
}
void* Voter(int agent_id) {
    while (NOT(cancel)) {
        for each agent[agent_id].OID do {
            P(db);
            while (buffer->length() >= k) {
                k++;
                msg = buffer->pop();
                if (msg->agent == agents[agent_id].agentname
                    && msg->OID == agents[agent_id].OID[j]) {
                    T_buffer->append(msg);
                }
            }
            if (T_buffer->length() == getAvailableManagers()) {
                agent[agent_id].file << Tstamp << Average_Values(T_buffer);
                delete_elements_in_buffer();
            }
            delete T_buffer;
            V(db);
        }
    }
}
(a) (b)
Figure 9. Threads for Filling to and removing elements from the linear buffer. (messages from replica
managers to the voter)
The pseudo-code in Figure 9a shows the process of filling up the buffer, and Figure 9b that of the
voter's threads. As seen here, the voting function depends upon the number of available managers,
getAvailableManagers(), which is used to determine whether the number of messages in the buffer is valid.
In case of a failure among the replica managers, the number of messages will be greater or less than the
number of available managers; the sample will then simply be lost and not processed. Not until the next
sequence of messages arrives in the queue will the process continue normally. As
mentioned before, the access method for all the voter's threads is a round-robin sequence; they also share
the context with all the fillbuffer() threads generated upon the arrival of the manager-to-voter SNMP
packets (Figure 6).
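The "join" of Figure 9b can be sketched as a linear scan that averages a sample only when exactly the available number of managers have reported, discarding it otherwise, as the text describes. BufMsg and vote_linear are illustrative names, and the decoded SNMP value is simplified to a long.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// One entry of the voter's shared linear buffer (cf. Figure 9).
struct BufMsg {
    std::string agent, oid;
    long value;  // decoded CNTR32/INTEGER sample
};

// Linear-search "join" for one (agent, OID) pair: scan the whole
// buffer, and only when exactly n_available matching messages are
// present emit their average and drop them; otherwise (a replica
// failed mid-sample) the sample is left to be discarded.
std::optional<long> vote_linear(std::vector<BufMsg>& buffer,
                                const std::string& agent,
                                const std::string& oid,
                                int n_available) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < buffer.size(); ++i)  // O(n) scan
        if (buffer[i].agent == agent && buffer[i].oid == oid)
            hits.push_back(i);
    if ((int)hits.size() != n_available) return std::nullopt;
    long sum = 0;
    for (std::size_t i : hits) sum += buffer[i].value;
    for (std::size_t k = hits.size(); k-- > 0; )  // erase back to front
        buffer.erase(buffer.begin() + hits[k]);
    return sum / n_available;
}
```

The full scan per (agent, OID) pair is the O(n) cost that the hashed structure of Section 4.2.2 is introduced to remove.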
Instead of the linear buffer (Figures 9a and 9b), the structure can be substituted by a doubly hashed
array, with the OID and the agent name as hashing keys (Figures 10a and 10b).
void* VOTER_h(void* agent_id) {
    while (NOT(cancel)) {
        j = 0;
        for each agent[agent_id].OID do {
            P(db);
            j++;
            if (buffer[agent_id][j]->length() >= getAvailableManagers()) {
                agent[agent_id].file << Tstamp << Average_Values(buffer[agent_id][j]);
                delete_elements_in_buffer(agent_id, j);
            }
            V(db);
        }
    }
}

void* fillbuffer(void* sock_desc) {
    Msgformat* msg;
    char message[600];
    int k, m;
    while (message[1] != 'q') {
        if (read((int) sock_desc, &message, SIZE_MSG) == SIZE_MSG) {
            msg = new Msgformat;
            msg->start = gethrtime();
            strncpy(msg->manager, message, 16);
            strncpy(msg->timestamp_client, message + 16, 16);
            strncpy(msg->agent, message + 32, 16);
            strncpy(msg->OID, message + 48, 256);
            strncpy(msg->SNMPResponse, message + 304, 256);
            gettimeStamp(msg->timestamp_server);
            P(db);
            m = getAgentIndex(msg->agent);   // hashing functions
            k = getOIDIndex(msg->OID, m);
            if ((m > 0) && (k > 0)) {
                Hashed_buf[m][k]->append((void*) msg);
            } else {
                ERROR();
                delete msg;
            }
            V(db);
        }
    }
    close((int) sock_desc);
    thr_exit((void*) 0);
}
(a) (b)
Figure 10. Threads for Filling to and Reading from the double Hashed Table.
As shown here, the number of iterations is reduced from O(n) for the linear buffer to approximately O(log
n) for the hashed array. Dynamic voting is achieved in a similar way to the linear buffer; the
difference is that instead of losing the sample, the hash structure may hold more than the allowed number
of elements, getAvailableManagers(). Therefore, in the iteration following a failure a one-sample error
will be introduced into the measurement, but all subsequent measurements will proceed normally.
The failure-detection system and the definition of getAvailableManagers() are presented in Section 3.3.
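A sketch of the keyed-buffer variant, assuming std::map as a stand-in for the paper's double-hashed array; a bucket that overfills after a failure is cleared on the next vote, mirroring the recovery behavior described above. HashedBuffer is a hypothetical name.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Buffer keyed by (agent, OID), replacing the linear scan of Figure 9
// with direct keyed lookup (cf. Figure 10).
class HashedBuffer {
    std::map<std::pair<std::string, std::string>, std::vector<long>> buckets;
public:
    void append(const std::string& agent, const std::string& oid, long v) {
        buckets[{agent, oid}].push_back(v);
    }

    // Vote when at least n_available samples are present: average the
    // most recent n_available and clear the bucket, so an overfilled
    // bucket recovers on the round after a failure.
    std::optional<long> vote(const std::string& agent,
                             const std::string& oid, int n_available) {
        std::vector<long>& b = buckets[{agent, oid}];
        if ((int)b.size() < n_available) return std::nullopt;
        long sum = 0;
        for (std::size_t i = b.size() - n_available; i < b.size(); ++i)
            sum += b[i];
        b.clear();
        return sum / n_available;
    }
};
```

With std::map the keyed access is O(log n), matching the complexity the text attributes to the hashed array; an unordered container would bring it closer to O(1).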
3.3. Heartbeat and Status of the Nodes, Managers and Agents.
As stated in the assumptions, heartbeats are issued to the managers every second (or at an interval
defined at compile time). A simple echo server runs per node, and a timeout mechanism is used to switch
a manager from the NORMAL state into the FAULTY state. If a second timeout is reached and the manager
or node has not returned to the NORMAL state, it is erased from the group and declared DOWN. The number
of available managers is decreased by one as soon as a FAULTY state is detected.
A recovery action to keep the number of available managers above a threshold can easily be achieved
by finding the next available network node and starting the monitoring application on it through a remote
shell command. Agents are likewise considered NORMAL, FAULTY or DOWN. An agent is declared
FAULTY after a timeout during which the manager receives no SNMP responses, or only null ones. The
agent switches from the FAULTY state into the DOWN state after a second timeout; the corresponding
manager then kills itself and the agent is no longer monitored.
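The NORMAL/FAULTY/DOWN transitions described in this section can be sketched as a small state machine; the function names are illustrative.

```cpp
// NORMAL -> FAULTY -> DOWN state machine driven by heartbeat timeouts
// (Section 3.3): one missed timeout marks a unit FAULTY, a second
// consecutive one marks it DOWN; a heartbeat reply restores NORMAL
// unless the unit is already DOWN, which is final.
enum class State { NORMAL, FAULTY, DOWN };

State on_timeout(State s) {
    switch (s) {
        case State::NORMAL: return State::FAULTY;
        case State::FAULTY: return State::DOWN;  // erased from the group
        default:            return State::DOWN;
    }
}

State on_heartbeat_reply(State s) {
    return s == State::DOWN ? State::DOWN : State::NORMAL;
}
```

The same two functions apply to managers and agents; only the event source differs (echo replies for managers, SNMP responses for agents).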
4. Experiments, Fault Injection and Performance Measurements.
Performance experiments were executed in the HCS lab over the ATM LAN and the Myrinet SAN, using
the following workstation architectures as nodes:
• Ultra-Station 30/300, 128 MB of RAM (managers)
• Ultra-Station 2/200, 256 MB of RAM (gateway station, managers and agents)
• Ultra-Station 1/170, 128 MB of RAM (managers and agents)
• Sparc-Station 20/85, 64 MB of RAM (managers and agents)
• An ATM Fore hub and a Fore ATM switch
All measurements were made in terms of the latency added by the voting and monitoring processes, on
both sides: the manager and the voter-gateway.
4.1. Performance of SNMP at the managers.
Testbed measurements were made to determine the latency of different OIDs using the CMU-SNMP
protocol over the Myrinet and ATM LANs.
Figure 11. Latency Measurements using CNTR32 and OCTET STRING data types in a Myrinet and ATM
testbeds.
As shown in Figure 11, the latency of different combinations of SNMPGET commands using the CMU-
SNMP API grows steadily as the number of OIDs is increased. (The four panels plot round-trip latency,
and its distribution among encoding, decoding, application, and protocol/agent time, in milliseconds
against the number of OIDs for the Myrinet and ATM testbeds.) Panels (a) and (c) were obtained using the
OCTET STRING data type, and panels (b) and (d) using the CNTR32 data type. On average, CNTR32 and
INTEGER requests using the SNMPGET frame cover more than 85% of all requests. For this reason, the
performance experiments run at the agents included only CNTR32 data types.
The distribution of the time spent processing each request to the agent, among ASN.1
encoding/decoding, the application, and the protocol and agent, is shown in Table 1. The
protocol/agent overhead covers from 43.8% to 94.9% of the total; this turns out to be the
first performance bottleneck found in the whole process. It is important to remember that the ASN.1
encoding/decoding is also executed at the agent, which in addition must access its control/status
registers and create the frame. The experiments were run on the Myrinet link between two Ultra-2/200s
with 256 MB and, for ATM, on one Ultra-2/200 polling information out of the ATM Fore switch.
Table 1. Processing overhead distribution at the manager.

Distribution          1 CNTR32 (ATM)  64 CNTR32 (ATM)  1 CNTR32 (Myrinet)  64 CNTR32 (Myrinet)
encoding                   1.06%           2.41%             1.58%               1.32%
decoding                   1.78%           4.74%             1.97%               2.33%
application               40.98%           3.04%            52.65%               1.44%
Protocol and Agent        56.18%          89.80%            43.80%              94.90%

No. of samples: 500; sampling time: 1 second.
Figure 12. Average latency at all replica managers having different levels of replication and different
number of monitored agents.
In Figure 12, panels (a) and (b) both correspond to measurements made at the manager, but here the
item labeled SNMPGET corresponds to the whole process described in Figure 11. For these
measurements the number of OIDs was fixed at eight; the eight OIDs selected are those presented in
Table 2. The main constraint for all the experiments is that the sampling time is fixed and the number of
samples is exactly the same for all the agents being monitored. Consequently, the TCPconnect and
latency-to-voter values correspond to the transmission of the eight OIDs from the replica manager.
Table 2. OIDs selected for the performance experiments

OID                                                                          ASN.1 Data Type
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.3 CNTR32
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.4 CNTR32
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.5 CNTR32
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.6 CNTR32
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.3 CNTR32
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.4 CNTR32
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.5 CNTR32
.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.6 CNTR32
(Figure 12: latency at the manager with one replica (a) and five replicas (b), broken into 2PC,
SNMPGET, TCPconnect, and latency-to-voter components, in ms against the number of agents.)
Those entries correspond to the number of octets in/out at the different devices monitored by the snmpd on
each workstation being polled by the manager. According to the results in Figure 12, the Two-Phase
Commit (2PC) protocol and the SNMPGET account for more than 95% of the overhead. Table 3 also shows
that the SNMPGET covers more than 65% of the process, compared with the 2PC, which occupies less
than 30%. The algorithm is shown in Figure 4.
Table 3. Comparison of 2PC and SNMPGET overhead at the manager for different numbers of agents
and replicas.

                     2 agents                16 agents
Replicas         2PC       SNMPGET       2PC       SNMPGET
1              23.80%      69.50%      32.45%      62.82%
2              23.10%      68.10%      24.50%      47.10%
3              21.80%      72.07%      25.90%      66.45%
4              27.83%      76.87%      28.22%      63.45%
5              27.14%      60.60%      29.80%      60.60%
Therefore, the overhead introduced by the manager has to be compensated for and taken into account when
defining the minimum sampling time stated in Section 3.1, equation 1.
4.2. Testbed experiments using the 2PC.
By adding the 2PC protocol to each manager, the overhead introduced into the application represents
about 25% of the total processing time at the manager. It is important to point out that the "total
processing time" here is related to the voter system and includes the inter-arrival time between SNMP
manager-voter messages, the search time in the shared buffer, and the corresponding I/O to disk. In
addition, the total processing time is per thread; in other words, N_Agents threads are able to process the
incoming OIDs concurrently at the average time mentioned above.
(a) (b)
Figure 13. Inter-arrival time and total processing time of eight concurrent OIDs at the voter, for one to
five managers, versus the number of agents.
In both cases the dominant factor is the inter-arrival time of messages to the queue. This time is the
difference between the arrival in the shared buffer of a message identifying an OID and the moment at
which the last message coming from a different manager arrives in the shared structure.
It is important to remember that the system shares access between the buffer-filling threads,
fillbuffer(), and the voters (one per agent). Thus, the inter-arrival rate is affected by the
processing time of the voters, i.e., all the "join" processes mentioned in Section 3.2, Figure 9.
The number of messages received per sample is defined by N_Managers*N_Agents*N_OIDs. In other
words, a system with 16 agents, 5 managers and 8 OIDs will send out to the voter 640 messages, each
message with a fixed size of 560 bytes (figure 6), which represents 358,400 bytes received by the voter in this particular iteration. For instance, in an iteration with two replicas and 16 agents, the average processing time is 500 ms; with 8 OIDs per agent, the minimum sampling time is therefore 4 seconds. Any shorter sampling interval will lead to erroneous monitoring, and the results will not be accurate for that sampling time. To avoid these problems, the sampling time for the 5 managers was set to 30 seconds and the number of samples to 20.
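The arithmetic above can be captured in two small helpers; voter_load and min_sampling_time_s are illustrative names, not part of the system:

```python
def voter_load(n_managers, n_agents, n_oids, msg_bytes=560):
    """Messages and bytes delivered to the voter in one sampling iteration,
    following N_Managers * N_Agents * N_OIDs with a fixed message size."""
    msgs = n_managers * n_agents * n_oids
    return msgs, msgs * msg_bytes

def min_sampling_time_s(n_oids, avg_tpt_ms):
    """Lower bound on the sampling interval: each of the n_oids concurrent
    requests takes roughly avg_tpt_ms of total processing time at the voter."""
    return n_oids * avg_tpt_ms / 1000.0
```

With 5 managers, 16 agents and 8 OIDs this gives 640 messages and 358,400 bytes per iteration; with 8 OIDs at 500 ms each, the 4-second lower bound on the sampling interval follows.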
4.2.2. Testbed experiments with the hash table.
In order to reduce the search time in the shared buffer structure, and to reflect that reduction in the performance of the whole application, the linear search was replaced by a hash table (figure 10). The results of using the hash table are shown in Figure 14.
Reducing the search time leaves more time for the other threads to execute and shrinks the sections of the buffer that require semaphore-protected access. The hash functions are very simple and the mapping is one-to-one, since it is the voter that defines which managers and agents are monitored.
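A minimal sketch of the idea, assuming a Python dict stands in for the voter's hash table and that slots are keyed by (manager, agent, OID) triples; the class and method names are hypothetical:

```python
# Since the voter itself assigns the managers and agents it monitors, the
# whole key space is known up front and the mapping is one-to-one, so a
# hash table gives O(1) lookup where the linear scan of the shared buffer
# was O(n) in the number of outstanding messages.
class HashedBuffer:
    def __init__(self, managers, agents, oids):
        # Pre-allocate one slot per (manager, agent, oid) combination.
        self.slots = {(m, a, o): None
                      for m in managers for a in agents for o in oids}

    def store(self, manager, agent, oid, value):
        self.slots[(manager, agent, oid)] = value   # O(1), no scan

    def lookup(self, manager, agent, oid):
        return self.slots[(manager, agent, oid)]    # O(1)
```

Because each store and lookup touches exactly one slot, the critical sections protected by semaphores also shrink, which is what frees time for the other threads.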
In comparison with the linear search over the previous structure, an improvement of 50% was achieved with three or more replicas, while the results remain unchanged for one or two replicas. An interesting behavior appears with 4 or 5 replicas, where the variation between processing times is no more than hundreds of milliseconds, which is expected given the nature of the search time in the hash table.
Figure 14. (a) Inter-arrival time and (b) total processing time at the voter using a hash table.
4.2.3 Testbed experiments with an asynchronous system
If the 2PC protocol is not included, each manager becomes fully independent and the overhead involved is reduced by more than 20%. However, the absence of a replica control protocol causes the system performance to degrade drastically. The total processing time measured at the voter ranges from a minimum of 200 ms to a maximum of 2.2 seconds (with 4 replicas and 16 agents). Therefore, in this particular system, the loss of synchronization drastically degrades performance. Observe from Table 4 and Figure 15 that this degradation is on the order of 60% with respect to the hash-table method using 2PC. Preliminary measurements showed that the degradation is even worse if the hash table is replaced by the linear search over the shared buffer structure.
Table 4. Total processing time without synchronization using the hash table

                                Total Processing Time (ms)
                                   Number of Replicas
No. of Agents   No. of OIDs      1       2       3       4
 1              8              220     436     483     461
 2              8              297     556     580     926
 4              8              345     778     890    1207
 8              8              322     800    1137    1659
16              8              344     975    1325    2210
As shown in the previous two sections, the total processing time grows with the number of agents; in figure 15 the same behavior is obtained, but in greater proportion.
Figure 15. Average total processing time (TPT) introduced at the voter in the asynchronous system.
4.3 Comparison between “non-voted” and “voted” outputs.
One of the main goals of a fault-tolerant system is to achieve transparency for every measurement by reducing the noise added as a product of the replication. In order to determine whether the
FT system reduces or gracefully degrades the accuracy of the measurements, figure 16 shows a sample of voted and non-voted measurements. The experimental conditions for Figure 16 are: three replicated managers monitoring a FORE ATM router/hub (hcs-gw).
Figure 16. Fault-free voted and non-voted measurements at the Gb-Ethernet port at hcs-gw (router) using .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.3
As shown in Figure 16, the manager at dagger is taken as the point of reference. The collected data showed that the error between the voted and non-voted measurements is no greater than 0.03%. This behavior is also apparent from a qualitative view of the figure.
4.4. Performance Degradation after Fault Injections
One of the main assumptions of the FT system is the fail-stop failure model. After the failure of one of the managers, monitoring should continue by fault masking. The experiments run on the testbed consisted of a set of five managers and one agent, using 8 OIDs per request. Faults are injected by killing the remote shell, after which the system depends on the heartbeat service for failure detection.
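A heartbeat-based fail-stop detector of this kind can be sketched as follows; the class name, timeout value, and API are assumptions for illustration, not the system's actual heartbeat service:

```python
import time

class HeartbeatDetector:
    """Fail-stop detection: a manager that misses `timeout` seconds of
    heartbeats is declared dead and masked out of the vote."""

    def __init__(self, managers, timeout=10.0):
        self.timeout = timeout
        now = time.monotonic()
        self.last_beat = {m: now for m in managers}

    def beat(self, manager):
        """Record a heartbeat message from `manager`."""
        self.last_beat[manager] = time.monotonic()

    def alive(self):
        """Managers whose last heartbeat is within the timeout window."""
        now = time.monotonic()
        return [m for m, t in self.last_beat.items()
                if now - t <= self.timeout]
```

When a killed manager stops beating, it simply drops out of alive(), and the voter continues with the remaining replicas, which is the fault-masking behavior measured below.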
Figure 17 shows how the values of the voter are modified after a failure in one of the managers. The measurement starts with five managers, and every minute the number of managers decreases by one. As seen
here, the voted and the real measurements vary only slightly. In fact, after a failure the value at the time of the measurement is lost, and an interpolation is required to determine the sample in the gap between the moments when the system had N and N-1 replicas.
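The gap-filling step can be done with simple linear interpolation between the last sample taken with N replicas and the first taken with N-1; the helper below is a sketch of that idea, not the paper's code:

```python
def interpolate_gap(t_prev, v_prev, t_next, v_next, t):
    """Estimate the counter value lost at time t, given the last sample
    (t_prev, v_prev) taken with N replicas and the first sample
    (t_next, v_next) taken with N-1 replicas."""
    frac = (t - t_prev) / (t_next - t_prev)   # position inside the gap
    return v_prev + frac * (v_next - v_prev)
```

Linear interpolation is a reasonable choice here because ifInOctets/ifOutOctets are monotonically increasing counters, so the value inside a short gap is well approximated by the line between its neighbors.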
Figure 17. Comparison of voted and non-voted results of ifInOctets after fail-stop faults at the managers.
The behavior of the throughput is presented in Figures 18a and 18b; in both cases the reference is the manager at the node dagger, and the graph generated by “dagger” is followed by the graph generated by the voter. Observe that graceful degradation is achieved by avoiding gaps between two or more samples. Moreover, when only one manager, “dagger”, is left, both graphs are exactly the same, as shown in the figure.
Figure 18. (a) Input and (b) output throughput measured from one of the router ports with fault injection.
5. Future Work
Group communication is very important to maintain concurrency at the application level, and several improvements to the system are possible. The use of lightweight agents, replacing UDP sockets by XTP [ ] or SCALE messages [George98a], will decrease the overhead at the protocol layer in every manager. In addition, the message interchange between the voter and the managers can be done using multicast lightweight communication. Reducing the thread context switching and speeding up shared-buffer access at the voter can also be achieved.
As an NM application, the voter should be able to re-run or re-schedule one of the replicas after detecting a dead node. A combination of leader-election replica management and voting can also reduce the load on every management node. A lightweight CORBA framework [George98b] could be added to communicate with agents instead of using SNMP alone. Finally, one of the major improvements for a FT application is the use of faster data structures and an SQL engine [Wolf91] to relate parameters from the replicas and the original and work as a failure detector, with built-in checkpointing and error recovery.
6. Conclusions
• FT distributed applications require a well-defined, efficient replica communication protocol. The combination of fault detection and the 2PC protocol provides an easy methodology to achieve synchronization.
• Timing and latency overheads at the voter and manager have to be taken into consideration, especially when defining small sampling intervals in high-performance networks.
• The use of a voting system provides an efficient way to monitor and gracefully degrade measurements in a network management application.
7. Acknowledgements
The authors thank the members of the HCS Lab for their comments and reviews.
8. References
[Agui97] M. Aguilera, W. Chen, S. Toueg. “Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication”, Cornell University, July 1997.
[Alvi96] L. Alvisi, K. Marzullo. “Message Logging: Pessimistic, Optimistic and Causal”, IEEE Int. Symp. on Distributed Computing, pp. 229-235, 1995.
[Aro95] A. Arora, S. Kulkarni. “Designing Masking Fault Tolerance via Non-Masking Fault Tolerance”, IEEE Symposium on Reliable Distributed Systems, 1995, pp. 174-185.
[Beed95] G. Beedubahil, A. Karmarkar, U. Pooch. “Fault Tolerant Object Replication Algorithm”, TR-95-042, Dept. of Computer Science, Texas A&M, October 1995.
[Begu97] A. Beguelin, E. Seligman, P. Stephan. “Application Level Fault Tolerance in Heterogeneous Networks of Workstations”, Journal of Parallel and Distributed Computing, Vol. 43, pp. 147-155, 1997.
[Bras95] F. Brasileiro, P. Ezhilchelvan. “TMR Processing without Explicit Clock Synchronisation”, IEEE Symposium on Reliable Distributed Systems, 1995, pp. 186-195.
[Doer90] W. Doeringer, D. Dykeman, et al. “A Survey of Light-Weight Transport Protocols for High-Speed Networks”, IEEE Trans. on Communications, Vol. 38, No. 11, pp. 2025-2035.
[Dol97] S. Dolev, A. Israeli, S. Moran. “Uniform Dynamic Self-Stabilizing Leader Election”, IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 4, April 1997, pp. 424-440.
[Duar96] E. Duarte, T. Nanya. “Hierarchical Adaptive Distributed System-Level Diagnosis Applied for SNMP-based Network Fault Management”, IEEE Symposium on Reliable Distributed Systems, 1996, pp. 98-107.
[Feit97] S. Feit. “SNMP”, McGraw-Hill, NY, 1997.
[George98a] Paper being fixed by Dave and Tim.
[George98b] Luis's thesis.
[John97] D. Johnson. “Sender-Based Message Logging”, IEEE Fault Tolerant Computing, pp. 14-19, 1987.
[Landi98] S. Landis, R. Stento. “CORBA with Fault Tolerance”, Object Magazine, March 1998.
[Maffe96] S. Maffeis. “Fault Tolerant Name Server”, IEEE Symposium on Reliable Distributed Systems, 1996, pp. 188-197.
[OSI87] OSI. Information Processing Systems – Specification of the Abstract Syntax Notation One (ASN.1), ISO 8824, December 1987.
[Paris94] J.-F. Paris. “A Highly Available Replication Control Protocol Using Volatile Witnesses”, IEEE Intl. Conference on Distributed Computing Systems, 1994, pp. 536-543.
[Prad96] D. Pradhan. “Fault Tolerant Computer System Design”, Prentice Hall, NJ, 1995.
[Rose90] M. Rose and K. McCloghrie. “Structure and Identification of Management Information for TCP/IP-based Internets”, RFC 1155, 1990.
[Rose94] M. Rose. “The Simple Book – An Introduction to Internet Management”, 2nd edition, Prentice Hall, Englewood Cliffs, NJ, 1994.
[Saha93] D. Saha, S. Rangarajan, S. Tripathi. “Average Message Overhead of Replica Control Protocols”, 23rd IEEE Intl. Conf. on Distributed Computing Systems, pp. 474-481, 1993.
[Scho97] J. Schowalder. “Network Management by Delegation”, Computer Networks and ISDN Systems, No. 29, 1997, pp. 1843-1852.
[Sing94] G. Singh, M. Bommareddy. “Replica Placement in a Dynamic Network”, IEEE Intl. Conference on Distributed Computing Systems, pp. 528-535, 1994.
[Wolf91] O. Wolfson, S. Sengupta, Y. Yemini. “Managing Communication Networks by Monitoring Databases”, IEEE Transactions on Software Engineering, Vol. 17, No. 9, Sep. 1991, pp. 944-953.
[Wu97] C. Wu. “Replica Control Protocols that Guarantee High Availability and Low Access Cost”, Ph.D. Thesis, University of Illinois at Urbana-Champaign, 1993.
[Xu96] J. Xu, B. Randell, et al. “Fault Tolerance in Concurrent Object-Oriented Software through Coordinated Error Recovery”, FTCS-25 Submission, University of Newcastle, 1996.