# Choosing computational resources for simulations with FLEUR
## Objectives
- Understand how to choose different parallelization levels
- Use GPUs with FLEUR
- Specify resources within AiiDA workflows
## Introduction
FLEUR is able to use different parallelization levels and strategies. If you run larger calculations and/or want to use larger computers, you should understand the basic concepts to be able to use a proper parallelization setup.
On the hardware level, we usually talk about:

1. Cores of a processor: Modern CPUs can execute multiple chains of instructions in parallel by providing multiple corresponding hardware units. Such a unit is usually called a "core", and each of these cores is capable of executing independently. Please be aware that these might be "logical cores" which oversubscribe some hardware units. In general, one should try to utilize all of these "cores" to run most efficiently. There might be exceptions, e.g. if you have "cores" of different kinds (performance/efficiency cores), in which case the use of all cores can lead to load-balancing issues and degrade performance.
2. Nodes of a computer: Modern supercomputers/clusters usually consist of several independent machines connected by a (more or less) fast network. These nodes are then used together, i.e. a single run of FLEUR can utilize multiple nodes.
3. GPUs: Attached to a node might be one or more GPUs usable for computation. These are very fast specialized processors that can be used to speed up parts of the code. Currently, FLEUR only supports NVIDIA GPUs.
When parallelizing a FLEUR run, we correspondingly talk about different programming paradigms:

1. FLEUR can use multiple "threads". These run on a single "node" and are distributed over the "cores". We use the OpenMP framework for this kind of parallelism and therefore also talk about OMP threads.
2. FLEUR can utilize the MPI framework to run multiple independent instances (MPI tasks) that communicate with each other. These MPI tasks can be distributed over multiple "cores" on a single node and/or over several "nodes".
3. FLEUR can use the GPUs. The usage of multiple GPUs is tied to the usage of MPI.
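As an illustration of how these paradigms are combined, a hybrid MPI/OpenMP run is typically launched as sketched below (assuming a Slurm system and the `fleur_MPI` executable; the launcher and node layout depend on your machine):

```bash
# 2 nodes, 8 MPI tasks per node, 16 OpenMP threads per task
# (i.e. 2 x 8 x 16 = 256 cores in total).
export OMP_NUM_THREADS=16
srun --nodes=2 --ntasks-per-node=8 --cpus-per-task=16 fleur_MPI
```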
!!! info
    To find a suitable parallelization strategy:

    1. You should know your hardware.
    2. You should choose a proper parallelization scheme for MPI/GPUs.
    3. You should use multiple threads when available.
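To get a first overview of your hardware, standard command-line tools suffice (`nvidia-smi` is only present if an NVIDIA driver is installed):

```bash
nproc                                  # number of (logical) cores
lscpu | grep -E 'Socket|Core|Thread'   # physical vs. logical core layout
nvidia-smi -L                          # list the attached NVIDIA GPUs, if any
```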
## FLEUR parallelization
### Decide on the MPI parallelization
As a first step you should determine the MPI parallelization to use. Here you should consider:
- How many k-points do you have? Usually, you should try to choose your number of MPI tasks such that the k-points can be distributed evenly, i.e. your number of k-points should be a multiple of the number of MPI tasks. The reasoning here is the nearly optimal scaling of the MPI-distributed calculation of the k-point loop (see the sketch after the info box below).
- Is the system small enough that a single k-point can be calculated with a single MPI task? If your system size is larger, you might want to use several nodes for a single eigenvalue problem.
!!! info
    Using multiple MPI tasks to calculate a single k-point is usually not a good idea on a single node. As the distributed solution of the eigenvalue problem uses a different solver (the simplest being the one from the ScaLAPACK library), it is usually preferable to use the OpenMP parallelism instead if you stay on a single node.
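For example, with 16 k-points on a single node, one could match the MPI tasks to the k-points as sketched here (the `mpirun` launcher and the `fleur_MPI` name depend on your installation):

```bash
# 16 k-points distributed over 8 MPI tasks: 2 k-points per task.
mpirun -np 8 fleur_MPI
```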
### Decide on the OpenMP parallelization
After you have decided on the number of MPI tasks to put on a node, you should then adjust the number of OpenMP threads to fully utilize the cores of the node. For example, if you have 128 cores and use 8 MPI tasks per node, you should use 16 OpenMP threads per task.
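The thread count is communicated to FLEUR via the standard `OMP_NUM_THREADS` environment variable; a small sketch for the 128-core example above (again assuming a Slurm system):

```bash
CORES_PER_NODE=128
TASKS_PER_NODE=8
export OMP_NUM_THREADS=$(( CORES_PER_NODE / TASKS_PER_NODE ))  # = 16
srun --ntasks-per-node=$TASKS_PER_NODE --cpus-per-task=$OMP_NUM_THREADS fleur_MPI
```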
### FLEURist
As all these considerations can be quite difficult, we provide a tool to suggest a suitable parallelization scheme. It is called `FLEURist` and can be found in the build directory together with the `fleur` and `inpgen` executables. In this tutorial it is also installed.
```bash
FLEURist
```
Before the FLEURist tool can be used, it has to be configured such that it knows about your computer hardware. This is done with the command `FLEURist computer add`. Some (few) machines are predefined, but there are also two generic options:
- localhost: This is the computer you are currently running on. The tool will examine the machine and choose some settings automatically.
- Slurm cluster: This is a machine that uses the Slurm queuing system and consists of several nodes. Here you will have to enter some more details about the machine.
!!! warning
    As this configuration is performed interactively, you should use a terminal session to call `FLEURist computer add`.
The basic questions to answer are:

- the type: should be `0` here to use the container
- the name/path of FLEUR: can simply be `fleur_MPI`, as we use this command to run FLEUR
- the parallel eigensolver: this is available in the executable, so you can choose `yes`
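For reference, the configuration is started as follows; run it in an interactive terminal session and answer the prompts as described above (the exact prompts may differ between versions):

```bash
FLEURist computer add
```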
Once the computer is configured, you can ask for a suggestion:
```bash
cd SiBulk
# The following only works after the configuration step was performed in a terminal.
FLEURist suggest
```
Here you can see the kind of suggestion the tool produces. The exact results of this call will depend on your environment, i.e. on the number of cores available in the container you run the tutorial in.
You can also immediately submit the job. That means instead of using the command `fleur_MPI` to start FLEUR on your input, you can use `FLEURist submit`, which will take the suggestion and run FLEUR accordingly.
```bash
FLEURist submit
```
## Using GPUs
If you compiled a GPU version of FLEUR and have a computer with supported GPUs (currently only NVIDIA GPUs are supported), you can use them. Usually, the GPUs will be used automatically in this case, but you should consider the following guidelines:
- The GPU version will only perform well if the workload on the GPU is sufficiently high. Hence, you have to tune your job such that you assign enough work per GPU, i.e. the eigenvalue parallelism, if needed, should be used such that the chunks of the matrix fill the GPU memory. Otherwise, you might need to use multiple k-points per GPU.
- If you have more than a single GPU per node, you should use multiple MPI tasks per node. This means the number of MPI tasks per node should be a multiple of the number of GPUs. Do not restrict the GPUs visible to the tasks by setting the environment variable `CUDA_VISIBLE_DEVICES`; if it is set automatically, make sure you unset it or set it to include all GPUs. FLEUR will take care of the distribution of jobs itself.
- If you use multiple k-points per GPU (i.e. you have more than a single MPI task per GPU), you should use the NVIDIA Multi-Process Service (MPS); see the NVIDIA documentation and the sketch below.
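As a sketch, a run on a node with 4 GPUs and 2 MPI tasks (k-points) per GPU could be prepared like this (assuming a Slurm system; `nvidia-cuda-mps-control -d` is the standard way to start the MPS control daemon):

```bash
# Keep all GPUs visible; FLEUR distributes the work itself.
unset CUDA_VISIBLE_DEVICES
# Start the NVIDIA Multi-Process Service, needed when several
# MPI tasks share one GPU.
nvidia-cuda-mps-control -d
# 8 MPI tasks on a node with 4 GPUs: 2 tasks per GPU.
srun --nodes=1 --ntasks-per-node=8 fleur_MPI
```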
## Specification of parallelization in an AiiDA workflow
In AiiDA, the computational resources used by a workflow are specified by a Python dictionary. Using the `aiida-fleur` command line interface, this is done with the `-opt` option, analogous to the `-wf` option we use for workflow options.
Again, a template creation mechanism exists:
```bash
aiida-fleur launch scf -opt template.json
cat wf_options.json
```
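What this file contains depends on your setup; as a sketch, an options dictionary typically holds entries such as the following (the values are illustrative, and the keys follow the usual AiiDA resource conventions):

```bash
# Illustrative options file (adapt the values to your machine):
cat > wf_options.json <<'EOF'
{
    "resources": {
        "num_machines": 1,
        "num_mpiprocs_per_machine": 8,
        "num_cores_per_mpiproc": 16
    },
    "max_wallclock_seconds": 3600,
    "withmpi": true
}
EOF
```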
By modifying these settings, one can adjust the parallelization and then use them in a launch of `aiida-fleur`:
```bash
aiida-fleur launch scf -opt wf_options.json
```
## Further learning
There are some more details available on:

- https://www.flapw.de/MaX-7.0/documentation/parallelizationSchemes/

You might also want to check the `-h` option of FLEUR, study its output during startup, and consult the documentation of your computer.