# Scalaris on SLURM

Scripts to run Scalaris on SLURM. The idea is to script every task performed on
the cumulus cluster. On a frontend machine (e.g. cumulus.zib.de) a script
sets the desired parameters and enqueues the tasks via `sbatch`. For an example
see `bench.sh`.

The queued tasks execute another script on the nodes of the cumulus cluster
(cumu01-00 to cumu01-15 and cumu02-00 to cumu02-15), which performs the actual
tasks. For examples see `example-job-script.slurm` and `increment-bench.slurm`.

The remaining scripts handle standard tasks such as starting and stopping
Scalaris.

# Scripts

## bench.sh
* runs a whole series of benchmarks
* to start, execute `bench.sh`, e.g. on `cumulus.zib.de`

        $ ./bench.sh
* runs in the context of cumulus.zib.de
* adjust the number of nodes, VMs per node, DHT nodes per VM, and
  repetitions (iterations) in the script
* calls `sbatch` with `increment-bench.slurm` as script
* the output of `increment-bench.slurm` is written to a file in the directory
    from which `bench.sh` was called
* the name of the output file is defined in `bench.sh` in the `sbatch` call with
    the `-o` parameter and has the default form

        slurm-<ErlangVersion>-<Nodes>-<VMs/Node>-<DHTNodes/VM>-<jobid>.out
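
A parameter sweep of this kind could look roughly like the following sketch. It is not the actual content of `bench.sh`: the function names, the `DRY_RUN` switch, the `DHT_NODES_PER_VM` variable and the Erlang version string are assumptions made for illustration. By default it only prints the `sbatch` commands it would run.

```shell
#!/bin/bash
# Sketch of a parameter sweep in the style of bench.sh (hypothetical names).

# 1 = only print the sbatch command, 0 = really submit the job.
DRY_RUN="${DRY_RUN:-1}"

# Build the output file name; sbatch replaces %j with the job id.
output_name() {
    local erlang_version="$1" nodes="$2" vms_per_node="$3" dht_nodes_per_vm="$4"
    echo "slurm-${erlang_version}-${nodes}-${vms_per_node}-${dht_nodes_per_vm}-%j.out"
}

# Submit one benchmark job for one set of parameters.
submit() {
    local erlang_version="$1" nodes="$2" vms_per_node="$3" dht_nodes_per_vm="$4"
    local out
    out="$(output_name "$@")"
    if [ "$DRY_RUN" = "1" ]; then
        echo "VMS_PER_NODE=$vms_per_node DHT_NODES_PER_VM=$dht_nodes_per_vm sbatch -N $nodes -o $out increment-bench.slurm"
    else
        # sbatch passes the submitting environment on to the job by default.
        VMS_PER_NODE="$vms_per_node" DHT_NODES_PER_VM="$dht_nodes_per_vm" \
            sbatch -N "$nodes" -o "$out" increment-bench.slurm
    fi
}

# Sweep over node counts and VMs per node; "R16B" is a placeholder version.
for nodes in 1 2; do
    for vms in 1 4; do
        submit "R16B" "$nodes" "$vms" 1
    done
done
```

Running it with `DRY_RUN=0` on the frontend would submit the four jobs instead of printing them.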


## increment-bench.slurm
* runs in the context of exactly one node of the cumulus cluster
* sets up a Scalaris ring by calling `start-scalaris.sh` with the number of
    nodes, VMs and DHT nodes specified in `bench.sh`
* executes the set of commands specified in the designated area with one set
    of parameters from `bench.sh`
* shuts down the Scalaris ring via `stop-scalaris.sh`
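
The three steps above could be arranged as in the following sketch. It is not the real `increment-bench.slurm`: the function name `run_benchmark_job`, the `START_CMD`, `STOP_CMD` and `BENCH_CMD` variables, and calling the scripts without arguments are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical skeleton of a job script; not the real increment-bench.slurm.

# Commands are held in variables so the flow can be tried without a cluster.
START_CMD="${START_CMD:-./start-scalaris.sh}"
STOP_CMD="${STOP_CMD:-./stop-scalaris.sh}"
BENCH_CMD="${BENCH_CMD:-echo benchmark commands go here}"

run_benchmark_job() {
    # 1. Set up the Scalaris ring; give up if that fails.
    $START_CMD || return 1

    # 2. Run the benchmark commands and remember their exit status.
    $BENCH_CMD
    local status=$?

    # 3. Always shut the ring down, even if the benchmark failed.
    $STOP_CMD

    return "$status"
}
```

In a job script the function would be called once at the end. Stopping the ring regardless of the benchmark result avoids leaving a ring behind after a failed run.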

## start-scalaris.sh
* starts a Scalaris ring

## stop-scalaris.sh
* stops a Scalaris ring

## env.sh
* provides default values for environment variables controlling Scalaris and Slurm

# Web Interface

The web interface of a running Scalaris node can be accessed from every PC that
has access to the cumulus cluster through

    http://cumu01-<Node-Nr>.zib.de:8000/index.yaws

e.g. [http://cumu01-03.zib.de:8000/index.yaws](http://cumu01-03.zib.de:8000/index.yaws).

# Access nodes running Scalaris
* `ssh`-ing into a node is not possible with SLURM
* start the command or script with `--share` instead of `--exclusive` (default)
    * Beware: anyone can access this node now
* an interactive shell can now be opened on the respective node with

        $ srun -p CUMU -A csr --nodelist=cumu01-00 --pty bash

* attach to the screen session running the Scalaris node
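
As a small sketch of these steps, the helper below prints the `srun` command for a given node. The partition (`CUMU`) and account (`csr`) are taken from the example above; the function name `node_shell_cmd` is hypothetical, and the name of the screen session depends on how Scalaris was started.

```shell
#!/bin/bash
# Sketch: print the command that opens an interactive shell on one node.

node_shell_cmd() {
    local node="$1"
    echo "srun -p CUMU -A csr --nodelist=${node} --pty bash"
}

# Step 1: run the printed command on the frontend to get a shell on the node.
node_shell_cmd cumu01-00

# Step 2 (inside that shell): list the screen sessions, then reattach to one.
#   screen -ls
#   screen -r <session-name>
```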

# Canceling jobs

A user can cancel a job with `scancel <job-id>`. SLURM first sends SIGTERM and
later SIGKILL to the command or script of the canceled job.

The command or script may have started processes that are not terminated when
the job is terminated. For Scalaris this applies, for example, to all Erlang VMs
and to epmd. These have to be stopped manually on every node after a
cancellation.

A script (`cleanup.sh`) is provided for this purpose. It needs to be executed on
all nodes used by the canceled job, e.g. run

        $ srun -p CUMU -A csr --nodelist="cumu01-03,cumu01-04" cleanup.sh

to clean up the nodes cumu01-[03,04]. The list of used nodes can be found in the
log file of the canceled job.

The `start-scalaris.sh` script also stops any Scalaris nodes that are still
running, so running the cleanup script is not necessary if Scalaris is started
on the same ring immediately afterwards.
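
Canceling and cleaning up can be combined in a small wrapper, sketched below. The names `cancel_and_cleanup`, `run` and `DRY_RUN` are hypothetical, the job id is a placeholder, and the `srun` options are copied from the example above. By default the sketch only prints the commands.

```shell
#!/bin/bash
# Sketch: cancel a job and clean up the nodes it used (hypothetical helper).

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print or execute a command, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

cancel_and_cleanup() {
    local job_id="$1"; shift
    # Join the remaining arguments into a comma-separated node list.
    local nodelist
    nodelist="$(IFS=,; echo "$*")"

    run scancel "$job_id"
    run srun -p CUMU -A csr --nodelist="$nodelist" cleanup.sh
}

# Placeholder job id and the two nodes from the example above.
cancel_and_cleanup 12345 cumu01-03 cumu01-04
```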

# Quick and Dirty

Run a script on a two-node setup:

    $ sbatch -N2 example-job-script.slurm

Most of the parameters will be set through default values in `env.sh`.

Individual parameters can be set on a per-call basis, e.g. the number of VMs per
node (this would lead to 16 Scalaris nodes: 2 machines with 8 Erlang VMs each):

    $ VMS_PER_NODE="8" sbatch -N2 example-job-script.slurm
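
A per-call override like this works when defaults are only applied to variables that are not already set. The sketch below shows that pattern. It is not the real `env.sh`: the default values, the `DHT_NODES_PER_VM` variable and the `total_scalaris_nodes` helper are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of the default-value pattern (hypothetical defaults and names).

# Keep a value passed in from the caller's environment, otherwise use the default.
VMS_PER_NODE="${VMS_PER_NODE:-1}"
DHT_NODES_PER_VM="${DHT_NODES_PER_VM:-1}"
export VMS_PER_NODE DHT_NODES_PER_VM

# Total number of Scalaris nodes for a given number of machines.
total_scalaris_nodes() {
    local machines="$1"
    echo $(( machines * VMS_PER_NODE * DHT_NODES_PER_VM ))
}
```

With `VMS_PER_NODE=8`, one DHT node per VM and two machines, `total_scalaris_nodes 2` gives the 16 Scalaris nodes mentioned above.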
