EVMS CLUSTERED ENGINE (ECE) API version 1.0

1.0 Introduction

ECE is a plugin that acts as a conduit between a cluster manager and EVMS.
An application loads and initializes the ECE plugin to access cluster
features. In the EVMS context, an application is the EVMS engine or the
proposed EVMS Cluster Daemons (ECD).

ECE modules reside in '$prefix/lib/evms/', where $prefix is set by the
Engine's configure script. An ECE plugin exists for each supported cluster
manager.

Each ECE plugin provides its description in a plugin_record_t. The plugin
exports the variable evms_plugin_records, which is a pointer to a
NULL-terminated array of pointers to plugin records. The ECE plugin type is
EVMS_CLUSTER_MANAGER_INTERFACE_MODULE. However, the union 'functions' within
plugin_record_t is augmented with the cluster_functions_s structure. This
structure defines the cluster functions provided by the ECE plugin.
cluster_functions_s is defined below:

    typedef struct cluster_functions_s {
        /* initialize the ECE */
        int (*setup_evms_plugin)(ece_mode_t, engine_functions_t);
        /* terminate services */
        int (*cleanup_evms_plugin)(void);
        /* register */
        int (*register_callback)(ece_callback_type_t type, ece_cb_t cb);
        /* unregister */
        int (*unregister_callback)(ece_cb_t cb);
        /* send a message */
        int (*send_msg)(ece_msg_t *msg);
        /* get my nodeid */
        int (*get_my_nodeid)(ece_nodeid_t *nodeid);
        /* get my clusterid */
        int (*get_clusterid)(ece_clusterid_t *clusterid);
        /* get potential number of nodes configured */
        int (*get_num_config_nodes)(int *num_nodes);
        /* get all the configured nodes */
        int (*get_all_nodes)(int *num_entries, ece_nodeid_t *nodes);
        /* get the current membership */
        int (*get_membership)(ece_event_t *event);
        /* return ASCII representation of nodeid */
        int (*nodeid_to_string)(ece_nodeid_t *, char *);
        /* return the nodeid from its ASCII representation */
        int (*string_to_nodeid)(char *, ece_nodeid_t *);
        /* provide extended plugin identification */
        int (*get_plugin_info)(char *info_name, extended_info_array_t **info);
    } cluster_functions_t;

2.0 Initialization

The ECE does its initialization on the call to setup_evms_plugin().

    int setup_evms_plugin(ece_mode_t mode, engine_functions_t functions);

    typedef enum {
        SLAVE,
        MASTER
    } ece_mode_t;

NOTE: A correct application is expected to load and initialize only one ECE
plugin at any given point in time.

The setup_evms_plugin() call sets up a communication channel with the
corresponding cluster manager installed on the machine and initializes its
internal data structures, including spawning internal threads to handle
cluster messages and events.

'mode' specifies whether the ECE runs in MASTER mode or in SLAVE mode. In
MASTER mode the ECE connects to the cluster manager and routes cluster
manager services to the slave ECE. In SLAVE mode the ECE connects to the
master ECE for cluster services.

NOTE: the 'mode' argument is architecturally not clean. This shall be
cleaned up in due course.

The return value is 0 on success. ENOSERVICE is returned if the
corresponding cluster manager service is not available. EFAIL is returned if
the corresponding cluster manager is running but the connection to that
service failed.

3.0 Registration

The application registers with ECE and configures the following parameters
through a call to

    int register_callback(ece_callback_type_t type, ece_cb_t cb);

'type' is the type of membership events, i.e., incremental membership or
full membership. 'cb' is the callback function to deliver membership and
messaging events.

The return value is 0 on success, else an error code on failure.

The 'type' argument indicates the style of membership information requested.

    typedef enum {
        DELTAS,
        FULL_MEMBERSHIP
    } ece_callback_type_t;

Details of each type are explained in section 4.1.1.

'cb' is a callback function to deliver events and messages.
Its prototype is defined below:

    typedef void (*ece_cb_t)(const ece_class_t class,
                             const size_t size,
                             const void *data);

The semantics of this callback are defined in section 4.0.

4.0 Callback description

The callback is called to deliver membership events as well as messaging
events to the application. It is an asynchronous delivery mechanism. The
callback is called in the context of the thread spawned by the ECE.

    typedef void (*ece_cb_t)(const ece_class_t class,
                             const size_t size,
                             const void *data);

The ece_class_t argument distinguishes membership class events from
messaging class events.

    typedef enum ece_class_s {
        CALLBACK_MEMBERSHIP,
        CALLBACK_MESSAGE,
    } ece_class_t;

The size_t parameter indicates the size of the class-specific data in bytes.

The interpretation of 'data' depends on the event class indicated by the
ece_class_t parameter. The semantics for each class are described in
sections 4.1 and 4.2.

4.1 Membership Event semantics

The membership events are delivered through the callback. For a callback
associated with membership events, the value of class is
CALLBACK_MEMBERSHIP. 'data' contains a pointer to an ece_event_t structure
as shown below.

    typedef enum {
        DELTA_JOIN,
        DELTA_LEAVE,
        MEMBERSHIP
    } ece_event_type_t;

    typedef struct ece_event_s {
        ece_event_type_t type;
        uint transid;
        uint quorum_flag;
        uint num_entries;
        ece_nodeid_t node[1];
    } ece_event_t;

1. The 'type' field indicates the type of event. It is either DELTA_JOIN,
   DELTA_LEAVE or MEMBERSHIP. The semantics of each of these types are
   explained in section 4.1.1.
2. 'transid' is the membership identifier for the membership event.
3. 'quorum_flag' indicates whether the current transition has quorum.
4. 'num_entries' is the number of nodes present in the node[] array.
5. node[] is the array that holds either all the new nodes, all the lost
   nodes, or all the nodes in the current membership, depending on 'type'.

4.1.1 ece_event_type_t and ece_callback_type_t semantics

The application specifies in the register_callback() call the type of events
it is interested in. The 'type' is one of the following:

DELTAS: For every transition two callbacks are delivered, one containing the
    lost nodes (DELTA_LEAVE) and the other containing the new nodes
    (DELTA_JOIN). The new and the lost nodes are with respect to the
    membership in the last transition.

    NOTE: DELTA_JOIN and DELTA_LEAVE events are generated even when the
    transition has only joined nodes and no lost nodes, or vice versa. The
    number of entries in the event is 0 if no nodes of that kind exist.

FULL_MEMBERSHIP: Return the entire membership. The events generated are of
    type MEMBERSHIP.

4.2 Messaging Event semantics

The messages are delivered through the same callback as the membership
events. For callbacks associated with messaging events, the value of class
is CALLBACK_MESSAGE and 'data' is a pointer to an ece_msg_t structure as
shown below.

4.2.1 Semantics of ece_msg_t

    typedef struct ece_nodeid_s {
        unsigned char bytes[128];
    } ece_nodeid_t;

    typedef struct ece_msg_s {
        ece_nodeid_t node;
        u_int32_t correlator;
        u_int32_t evmsdata;   /* place to tag in any
                               * application specific data */
        size_t size;
        void *msg;
    } ece_msg_t;

'node' is the nodeid the message is sent to (when sending) or was received
from (when delivered).
'correlator' is explained in section 4.2.1.1.
'evmsdata' is any application-specific metadata to help interpret the 'msg'.
'size' is the size of the message in bytes.
'msg' is the actual message in network byte order.

4.2.1.1 Correlator semantics

ECE provides the application with a mechanism to relate sent messages to the
responses to those messages. When an application provides a non-zero
correlator, ECE simply delivers the correlator to the recipient.
However, when an application provides a zero correlator, ECE generates a
correlator, delivers it to the recipient, and returns it to the sender.

4.2.2 Send API

The following routine sends messages:

    int send_msg(ece_msg_t *msg);

Sends or broadcasts the message contained in msg->msg. This call returns 0
when the message is successfully delivered to the recipient. However, it
does not guarantee that the message has been successfully processed by the
recipient.

If ECE_ALL_NODES is specified in the ece_nodeid_t field, the message is
multicast to all current members of the cluster.

    #define ECE_ALL_NODES { \
        bytes:{0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff, \
        0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff}};

If the correlator is set to 0, ECE replaces it with a generated correlator.
The sender can read the new correlator from the ece_msg_t upon successful
completion of send_msg(). The correlator is delivered to the recipient along
with the message.

evmsdata is not interpreted by ECE. It is treated as an opaque object and
delivered to the recipient unchanged.

4.3 Miscellaneous functions

ECE provides the following helper functions.

4.3.1 My nodeid

Get the nodeid of the node running the calling application. In simple terms,
get the nodeid of my node.

    int get_my_nodeid(ece_nodeid_t *nodeid);

Returns the nodeid in the 'nodeid' argument.
On failure a non-zero value is returned and the contents of 'nodeid' are
undefined.

4.3.2 My clusterid

Get the clusterid of the cluster.

    typedef struct ece_clusterid_s {
        unsigned char bytes[128];
    } ece_clusterid_t;

    int get_clusterid(ece_clusterid_t *clusterid);

Returns the clusterid in the 'clusterid' argument. On failure a non-zero
value is returned and the contents of 'clusterid' are undefined.

4.3.3 Convert nodeid to human readable string

    int nodeid_to_string(ece_nodeid_t *nodeid, char *str);

Converts 'nodeid' to a human readable string. If the call fails, it returns
a non-zero value.

4.3.4 Convert the human readable string to nodeid

    int string_to_nodeid(char *str, ece_nodeid_t *nodeid);

Converts the human readable string 'str' to 'nodeid'. If the call fails, it
returns a non-zero value.

4.3.5 Get plugin identification

    int get_plugin_info(char *info_name, extended_info_array_t **info);

Returns plugin identification. extended_info_array_t is defined in the
engine external API document.

4.3.6 Number of Cluster configured nodes

Get the number of nodes configured in the cluster.

    int get_num_config_nodes(int *num_nodes);

The number of configured nodes in the cluster is returned through the
'num_nodes' argument. On failure a non-zero value is returned and the
contents of 'num_nodes' are undefined.

4.3.7 Cluster configured nodes

Get all the configured nodes in the cluster.

    int get_all_nodes(int *num_entries, ece_nodeid_t *nodes);

The memory for 'nodes' and 'num_entries' is allocated by the caller. The
caller specifies in '*num_entries' the space allocated in the nodes array,
as a number of ece_nodeid_t entries. If '*num_entries' is less than the
number of configured cluster nodes, the function updates '*num_entries' with
the number of entries required, does not fill in any entry in the nodes
array, and returns ENOSPC. If the entries are sufficient, the function
updates '*num_entries' with the actual number of configured nodes and fills
in the nodes array with the nodeid of every configured node in the cluster.
If 'num_entries' is NULL, get_all_nodes() does not update the number of
configured nodes. If the call fails for any other reason, it returns a
non-zero value.

4.3.8 Current Membership

Get the current membership.

    int get_membership(ece_event_t *event);

Returns the current membership. The caller fills in event->num_entries with
the number of node entries that fit in the space provided in the event->node
array. If the entries provided are not sufficient, the call sets
event->num_entries to the number of nodes currently in the membership and
returns ENOSPC. If the space provided is sufficient, it fills the
event->node[] array with the nodeid of every member, fills in all the other
fields of the event data structure, and returns 0.

The call returns 0 on success, else an error code on failure.

In cases where new nodes progressively join the membership, the required
number of entries returned by a previous call to get_membership() can be
less than what is required in the next call to get_membership(). To be on
the safe side, allocate a number of entries equal to the configured number
of cluster nodes when calling get_membership().

5.0 Unregistration

The application can cancel its registration for callbacks by calling
unregister_callback().

    int unregister_callback(ece_cb_t cb);

The callback address, cb, must be the same address used on the call to
register_callback(). The ECE removes the callback from its table of callback
addresses and no longer calls the application's callback. The ECE returns
EINVAL if the callback address is not currently registered.
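
The buffer-sizing protocol described for get_membership() under "Current
Membership" can be wrapped in a small retry helper. The following is only a
sketch under stated assumptions, not code from any ECE plugin: the type
definitions are copied from this document (with 'uint' spelled out as
unsigned int), while alloc_event(), fetch_membership() and the stand-in
fake_get_membership() are hypothetical names introduced for illustration. A
real caller would pass the get_membership pointer from the plugin's function
table instead of the stand-in.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Types as given in this document. */
typedef struct ece_nodeid_s { unsigned char bytes[128]; } ece_nodeid_t;
typedef enum { DELTA_JOIN, DELTA_LEAVE, MEMBERSHIP } ece_event_type_t;
typedef struct ece_event_s {
    ece_event_type_t type;
    unsigned int transid;
    unsigned int quorum_flag;
    unsigned int num_entries;
    ece_nodeid_t node[1];
} ece_event_t;

/* Hypothetical stand-in for a plugin's get_membership(): it pretends that
 * three nodes are currently members and follows the ENOSPC convention. */
static int fake_get_membership(ece_event_t *event)
{
    const unsigned int members = 3;
    unsigned int i;

    if (event->num_entries < members) {
        event->num_entries = members;   /* report the required size */
        return ENOSPC;
    }
    event->type = MEMBERSHIP;
    event->transid = 1;
    event->quorum_flag = 1;
    event->num_entries = members;
    for (i = 0; i < members; i++) {
        memset(&event->node[i], 0, sizeof(ece_nodeid_t));
        event->node[i].bytes[0] = (unsigned char)(i + 1);
    }
    return 0;
}

/* Allocate an event with room for n node entries. The structure already
 * contains one entry, so only n - 1 extra entries are added. */
static ece_event_t *alloc_event(unsigned int n)
{
    ece_event_t *ev;

    assert(n > 0);
    ev = calloc(1, sizeof(ece_event_t) + (n - 1) * sizeof(ece_nodeid_t));
    if (ev != NULL)
        ev->num_entries = n;            /* tell the ECE how much space we have */
    return ev;
}

/* Fetch the membership, growing the buffer while the call reports ENOSPC.
 * Returns a malloc'ed event the caller must free(), or NULL on failure. */
static ece_event_t *fetch_membership(int (*get_membership)(ece_event_t *),
                                     unsigned int guess)
{
    unsigned int n = guess ? guess : 1;

    for (;;) {
        ece_event_t *ev = alloc_event(n);
        unsigned int needed;
        int rc;

        if (ev == NULL)
            return NULL;
        rc = get_membership(ev);
        if (rc == 0)
            return ev;
        needed = ev->num_entries;       /* size requested by the ECE */
        free(ev);
        if (rc != ENOSPC || needed <= n)
            return NULL;                /* other error, or no progress */
        n = needed;
    }
}
```

Passing the value from get_num_config_nodes() as 'guess' follows the advice
above and normally avoids the retry entirely; the loop only matters when
nodes join between calls.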

6.0 Termination

The application can terminate the cluster services by calling
cleanup_evms_plugin().

    void cleanup_evms_plugin(void);

This function terminates the ECE and returns after cleaning up all its
internal data structures, including termination of all its internal threads.
The ECE plugin, however, continues to reside in the application's memory.

7.0 Example

    #include <stdlib.h>

    static cluster_functions_t *table;  /* ECE function pointer table */

    static void fn(const ece_class_t class, const size_t size,
                   const void *data)
    {
        const ece_event_t *memdata;
        const ece_msg_t *msgdata;
        ece_msg_t reply;
        void *newmsg = NULL;        /* reply payload built by the application */
        size_t newmsg_size = 0;     /* size of the reply payload in bytes */

        switch (class) {
        case CALLBACK_MEMBERSHIP:
            memdata = (const ece_event_t *)data;
            switch (memdata->type) {
            case DELTA_JOIN:
                /* process joined nodes */
                break;
            case DELTA_LEAVE:
                /* process lost nodes */
                break;
            default:
                break;
            }
            break;
        case CALLBACK_MESSAGE:
            msgdata = (const ece_msg_t *)data;
            /* handle the message in msgdata->msg */
            /* generate a new message and respond to the received message */
            reply = *msgdata;       /* keeps the sender's node and correlator */
            reply.msg = newmsg;
            reply.size = newmsg_size;
            table->send_msg(&reply);
            break;
        }
    }

    int main(void)
    {
        ece_nodeid_t my_nodeid;
        ece_nodeid_t *nodeids;
        int nconfig;

        /* load the ECE module and its corresponding plugin */
        /* get its function pointer table in 'table' */

        /* engine_functions is the application's engine_functions_t */
        if (table->setup_evms_plugin(MASTER, engine_functions) != 0)
            return 1;

        table->get_my_nodeid(&my_nodeid);
        table->get_num_config_nodes(&nconfig);

        /* allocate memory to hold nconfig nodes */
        nodeids = malloc(nconfig * sizeof(ece_nodeid_t));
        if (nodeids != NULL)
            table->get_all_nodes(&nconfig, nodeids);

        table->register_callback(DELTAS, fn);

        /* do internal processing */

        /* terminate */
        free(nodeids);
        table->cleanup_evms_plugin();
        return 0;
    }

8.0 Revision history

0.1 initial revision only for comments from steved and Ben
0.2 after incorporating steved's comment.
0.3 after incorporating luciano's comment.
0.4 after review from the group.
0.5 after second round of group review.
0.6 after steved and stevep's corrections.
1.0 official document for community review.