
GSF Tutorial: Clustering

Clustering allows a network of machines, or nodes, to share the workload and run jobs in parallel. Each node contains a server instance which includes a Job Manager. These Job Managers are responsible for pulling jobs from a queue. A cluster is formed when more than one node pulls from the same job queue.

Setting up a basic cluster requires very few changes to the default configuration. The only required change is to configure each node's Job Manager to connect to the same Redis server as the other machines in the cluster. See Basic Cluster Setup for more details.

While the basic setup is quick and easy, there are additional considerations when setting up a cluster.

Basic Cluster Setup

The Kue Job Manager is the only module that needs to be configured to connect a node to the cluster; it needs only a shared Redis server to distribute jobs across the cluster.

For clustering to work, all nodes need to be able to work on the same list of jobs. Kue is a Node.js library for managing a priority job queue, and it uses a Redis server to store the job information. As long as all nodes are configured to use the same Redis server, they are part of the cluster. Each node in the cluster pulls jobs from the Redis queue when they are available. This allows for a dynamically-scalable cluster in which nodes can be added or removed without shutting down the whole cluster.

For a Redis server to be used in a cluster, it must be reachable from all of the nodes in the cluster. The default Redis configuration prevents connections from other machines.

For Microsoft Windows users, the default Redis installation is located at C:\Program Files\Redis and the configuration file for the Redis service is named redis.windows-service.conf.

For Linux users, the default Redis service installation is located at /etc/redis/ and the configuration file for the Redis service is named portnumber.conf (e.g. 6379.conf).

To run Redis in a cluster, update the Redis configuration file to listen on the correct interface. The easiest option is to comment out the bind line so that Redis listens on all interfaces. The more secure option is to bind to a cluster-specific interface.

# bind 127.0.0.1
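
Alternatively, for the more secure option, bind Redis to the interface that the cluster nodes use to reach each other. As a sketch, if that interface on the Redis host had the hypothetical address 10.0.0.5, the bind line would be:

bind 10.0.0.5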

When listening on all interfaces, Redis also prevents unauthenticated clients from connecting. The most secure option is to require a password. However, if your cluster runs on a secure network, you can simply disable Redis's protected mode.

protected-mode no
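
To require a password instead, set the requirepass directive in the same Redis configuration file. The value below is a placeholder; choose a strong password of your own:

requirepass myRedisPass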

If you use a password in Redis, be sure to add it to the Job Manager's configuration.

"jobManager": {
    "type": "gsf-kue-job-manager",
    "redisPort": 6379,
    "redisHost": "myRedisHost",
    "redisOpts": {
      "auth": "myRedisPass"
    }
  }

The Kue Job Manager on each node can now be configured to use the same Redis server on the network. The Redis connection is set in each node's config.json file.

To configure the Redis server for a particular node from a command line, start a command prompt in the GSFxx directory and execute the following command (replacing MyRedisServer and MyRedisPort with the actual address and port of your Redis server):

node updateConfig.js config.json --set jobManager.redisHost=MyRedisServer
node updateConfig.js config.json --set jobManager.redisPort=MyRedisPort
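
For example, if the shared Redis server is reachable from the cluster at the hypothetical address 10.0.0.5 on the default port, the commands would be:

node updateConfig.js config.json --set jobManager.redisHost=10.0.0.5
node updateConfig.js config.json --set jobManager.redisPort=6379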

The updateConfig.js script will automatically back up the original config.json file for you.

You may also configure the Redis connection manually by editing the config.json file and updating the redisPort and redisHost properties of the jobManager module as shown below. It is recommended that you back up your config file before making any changes manually.

"jobManager": {
  "type": "gsf-kue-job-manager",
  "redisPort": "MyRedisPort",
  "redisHost": "MyRedisServer"
}

Restart the server any time this file changes so that the new configuration takes effect.

Configuring Client Access

When clustering machines, it is important to understand the network topology and how client access to files and information differs from cluster access.

Clusters of machines often have an internal network so that the nodes can communicate, but a client may not have access to all the nodes in the cluster. A client may only have access to a specific node or to an HTTP server performing load balancing and distributing requests to nodes within the cluster.

To handle both needs, each node's server has an external configuration (client accessible) and an internal configuration (the address used to reach a node from within the cluster).

External

Adding an external address and port to the node's server configuration allows the modules to generate URLs that are client accessible. Add these values to all nodes in the cluster so that every node produces URLs usable by clients. These values are made available to all modules in the constructor. The example below updates the server configuration for an HTTPS proxy in front of the cluster.

To configure the external address and port information from a command line, start a command prompt in the GSFxx directory and execute the following command (inserting the correct information for your public address and port):

node updateConfig.js config.json --set jobManager.externalAddress=MyPublicAddressIP
node updateConfig.js config.json --set jobManager.externalPort=MyPublicAddressPort
node updateConfig.js config.json --set jobManager.externalScheme=https
node updateConfig.js config.json --set jobManager.externalPath=/ESECluster

You may also manually update the external address and port information by editing the config.json as shown below. It is recommended that you back up your config file before making any changes manually.

"externalAddress":"publicly accessible address or IP",
"externalPort":443,
"externalScheme":"https",
"externalPath":"/ESECluster"

Restart the server any time this file changes so that the new configuration takes effect.

The envi-data-parameter-mapper would use these values to reverse-translate output files into a URL that looks like this:

"outputFile":{
    "url": "https://publicAddress:443/ESECluster/ese/jobs/1/output.file",
    "factory": "URLRaster"
}

Internal

The internal address is used by the cluster for internal communication. All nodes in the cluster should have an address that is reachable from all the other nodes so that data can be transferred as needed. As each node pulls a job off the queue, it registers its internal address with the job. This helps track which nodes performed which jobs and allows the basic workspace to find the output data later.

To configure the internal address from a command line, start a command prompt in the GSFxx directory and execute the following command (replacing myNodeAddress with the address of that node):

node updateConfig.js config.json --set nodeAddress=myNodeAddress

You may also manually update the internal address by editing the config.json as shown below. It is recommended that you back up your config file before making any changes manually.

{
  //  Other configuration settings.
  "nodeAddress": "myNodeAddress"
}

For convenience, use the setting "nodeAddress":"getFromHostname" to use the system's hostname as the nodeAddress. This is the default setting.
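
In config.json, this default setting looks like the following (the comment stands in for the rest of the configuration):

{
  //  Other configuration settings.
  "nodeAddress": "getFromHostname"
}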

The normal server port is expected to be accessible from within the cluster, so there is not a "nodePort" option.

Configuring Job Output

When clustering machines, you must consider where job output is written. There are two basic approaches to handling output in a cluster: shared and distributed. From the client perspective there is no difference between a shared and a distributed workspace, but the choice does affect performance and system resource usage.

By default, the server functions as a distributed workspace using the gsf-basic-workspace-manager.

Distributed Workspace

A distributed workspace is useful when running the system on a collection of commodity hardware. It requires the least amount of configuration and network setup. With a distributed workspace, job data is only accessible while the node that ran the job is connected to the cluster. Nodes can still be added and removed at any time; however, the data for jobs that ran on a given node becomes inaccessible if that node is removed.

The gsf-basic-workspace-manager is configured by default to act as a distributed workspace. Each node in the cluster will create a folder on the local file system where jobs should write output. This folder exists on each node independently and the other nodes cannot access that data directly. To allow other nodes to access that data, the gsf-cluster-request-handler provides REST endpoints that the workspace manager can use to transfer data. This happens on-demand when downloading job data to a client or when copying data from one node to another for further processing.

By default, gsf-cluster-request-handler and gsf-basic-workspace-manager are enabled and distribute the workspace data.

  "requestHandlers": [
    ...,
    {
      "type": "gsf-cluster-request-handler"
    }
  ],
  "workspaceManager": {
    "type": "gsf-basic-workspace-manager"
  },

Shared Workspace

A shared workspace is a central data storage location that is accessible from all nodes in the cluster. There is no need to copy data from one node to another as they all have access to the data directly. This setup has the benefit of maintaining data availability even after a node leaves the cluster.

The gsf-amazon-s3-workspace-manager is a good example of a shared workspace. All of the job data is stored in S3, and all of the nodes can access that data. This means that nodes can be added to and removed from the cluster with minimal effect on the health of the cluster. For more information on setting up the Amazon S3 workspace manager, see the Amazon S3 Workspace tutorial.

The gsf-basic-workspace-manager can also act as a shared workspace. If the root folder is set to a shared file system (such as NFS), then there is no need to copy data across nodes. In this scenario, set "isSharedWorkspace":true in the config file to prevent the gsf-basic-workspace-manager from trying to copy data from the original node.

To enable the shared workspace from a command line, start a command prompt in the GSFxx directory and execute the following command:

node updateConfig.js config.json --set isSharedWorkspace=true

You may also enable the shared workspace manually by editing the config.json file as shown below:

  "isSharedWorkspace":true


