III - Tutorials
Tutorial I: Configuring and Installing MATE
Step I) Download and extract MATE
- Download the latest version of MATE here: Source Code
- Extract MATE into any folder.
Step II) Configure MATE
- Installation setup is performed by executing the ./configure command included in the source code. Recommended: read the MATE Configure Manual Page for a full list of options.
- Here is a quick recap of the configure options you can use:
- --with-mpi=MPI_PATH (Mandatory)
  Where MPI_PATH is the path to the MPI library to use.
- --prefix=INSTALL_DIR (Optional)
  Where INSTALL_DIR is the path where MATE will be installed.
  Default Value: $PWD/bin
- --build-dir=BUILD_DIR (Optional)
  Where BUILD_DIR is the path where MATE will be built.
  Default Value: $PWD/build
- --build-jobs=JOBS_NUMBER (Optional)
  Where JOBS_NUMBER is the number of parallel jobs used to build MATE and its dependencies.
  Recommended Value: 16 (fast). Default Value: 1 (slow).
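- For example, a typical invocation could look like the following (both paths are placeholders; substitute your own MPI installation and desired install directory):
./configure --with-mpi=/path/to/mpi --prefix=$HOME/mate --build-jobs=16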
Step III) Compile, Test, and Install MATE
- You can now compile MATE and its dependencies by executing the make command. This may take a while!
- After compilation finishes, you can execute make check to verify that MATE is working correctly.
- To install MATE, execute make install
make install
- Finally, add the installation folder to your $PATH environment variable.
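- For example, assuming the default installation prefix ($PWD/bin); adjust the path if you used --prefix:
export PATH=$PWD/bin:$PATH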
Tutorial II: Annotating your code with MATE #pragmas
Step I) Warmup
- During this tutorial, we will be using MATE #pragma annotations. The syntax and rules for their usage can be found here: MATE Syntax
- We recommend that you familiarize yourself with the syntax and keep it at hand in another tab/window while you follow these steps.
Step II: Identify MPI Calls
- MATE requires that only point-to-point communication (i.e., sends and receives, as well as their immediate-mode variants) be annotated. So before inserting annotations into the code, we need to identify where these calls are located.
- Control and collective functions such as MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize are not annotated. Therefore, we can ignore them and focus on the communication calls.
- We are now left with MPI_Isend, MPI_Irecv, and MPI_Waitall. All of these involve point-to-point MPI communication and need to be annotated before we can translate them with MATE.
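- For reference, the kind of loop this tutorial works on looks roughly like the sketch below. It is only a simplified one-dimensional halo exchange (the actual example is a 7-point stencil, whose sources are not reproduced here), and all function and variable names are illustrative:

#include <mpi.h>
#include <vector>

void stencil_iterations(std::vector<double>& grid, int left, int right,
                        int halo, int nIters, MPI_Comm comm)
{
  const int n = (int)grid.size();
  std::vector<double> sendL(halo), sendR(halo), recvL(halo), recvR(halo);
  std::vector<double> next(grid);
  MPI_Request reqs[4];

  for (int it = 0; it < nIters; ++it) {
    // Pack: copy the boundary interior cells into the send buffers
    for (int i = 0; i < halo; ++i) {
      sendL[i] = grid[halo + i];
      sendR[i] = grid[n - 2 * halo + i];
    }

    // Receive: post non-blocking receives for the incoming halos
    MPI_Irecv(recvL.data(), halo, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(recvR.data(), halo, MPI_DOUBLE, right, 1, comm, &reqs[1]);

    // Send: post non-blocking sends of the boundary cells
    MPI_Isend(sendL.data(), halo, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(sendR.data(), halo, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    // Wait for all requests, then unpack the received halos into the grid
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    for (int i = 0; i < halo; ++i) {
      grid[i] = recvL[i];
      grid[n - halo + i] = recvR[i];
    }

    // Compute: apply the stencil to the interior points
    for (int i = halo; i < n - halo; ++i)
      next[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0;
    grid.swap(next);
  }
}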
Step III: Identify opportunities for overlap
- In our example, the region involving communication and computation is well-defined as the body of the for loop. We then proceed to define it using #pragma directives.
- In MATE, all point-to-point communication must be enclosed within a #pragma mate graph block. This means that all communication performed within this block will be translated into data-dependency-driven code, scheduled by MATE.
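- As a sketch, assuming for illustration that the graph block wraps the entire loop (see the MATE Syntax page for the exact placement rules):

#pragma mate graph
{
  for (int it = 0; it < nIters; ++it) {
    // pack, receive, send, unpack, and compute (regions are defined in the next steps)
  }
}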
Step IV: Identify regions
- We need to identify what sub-regions of our graph block contain contiguous, data-independent code.
- In our example, we can identify 5 such regions. We shall name them: pack, receive, send, unpack, and compute.
- We add a #pragma mate region directive for each of them, and enclose these sub-regions in new { } scopes:
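- For example, using the sub-regions of our sketch (the way the region name is attached to the pragma is an assumption here; refer to the MATE Syntax page for the exact form):

for (int it = 0; it < nIters; ++it) {
  #pragma mate region(pack)
  { /* copy boundary cells into the send buffers */ }

  #pragma mate region(receive)
  { /* MPI_Irecv calls for the halos */ }

  #pragma mate region(send)
  { /* MPI_Isend calls for the boundary cells */ }

  #pragma mate region(unpack)
  { /* copy the received halos into the grid */ }

  #pragma mate region(compute)
  { /* apply the stencil to the interior points */ }
}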
Step V: Identify Data Dependencies
- We need to identify the dependencies between our newly defined regions.
- These dependencies can be either compute or data dependencies. (See: I.2 Dependencies)
- We can see that the pack region can only be executed once the compute region from the previous iteration has finished (compute dependency), producing new data to be sent. We illustrate this by adding depends(compute*), where * indicates that the dependency is on the instance of compute from the previous iteration (if iteration > 0). The pack region also requires that the send buffers have been cleared (data dependency), so we also add a dependency on send*.
- The receive region can only start requesting data after the receive buffers have been read (compute dependency). This means that receive depends on unpack from the previous iteration. Therefore, we add depends(unpack*).
- The send region can only start MPI send requests after the buffers have been filled (compute dependency). These buffers are filled during the same iteration by the pack region. Therefore, we add depends(pack).
- The unpack region can only execute after the requests produced in the receive region have finished (data dependency), but it also needs to wait for the compute region of the previous iteration to finish, otherwise it would overwrite its input data (compute dependency). Therefore, we add depends(receive, compute*).
- Lastly, the compute region only needs to wait on unpack from the current iteration. We add depends(unpack).
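- Putting it all together, the annotated loop could look like the sketch below (the region-naming and depends() syntax follows the assumptions above; consult the MATE Syntax page for the exact form):

for (int it = 0; it < nIters; ++it) {
  #pragma mate region(pack) depends(compute*, send*)
  { /* copy boundary cells into the send buffers */ }

  #pragma mate region(receive) depends(unpack*)
  { /* MPI_Irecv calls for the halos */ }

  #pragma mate region(send) depends(pack)
  { /* MPI_Isend calls for the boundary cells */ }

  #pragma mate region(unpack) depends(receive, compute*)
  { /* copy the received halos into the grid */ }

  #pragma mate region(compute) depends(unpack)
  { /* apply the stencil to the interior points */ }
}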
- This was a simple example of a program with a single overlap region. Your code can have multiple overlap regions over multiple source code files. Also, more complex dependency graphs can be designed where data-dependencies can be discriminated by neighbor (e.g. receive_left, receive_right).
Tutorial III: Translating and compiling your program
Step I) Warmup
- Important: all of your program files containing MPI functions must be properly annotated before translating them. See Tutorial II for more information.
- During this tutorial, we will be using the mate-translate and mate-cxx commands.
- We recommend that you familiarize yourself with their manual pages and keep them at hand in another tab/window while you follow these steps.
Step II) Translate your annotated code
- MATE requires that all your program files are translated at once, and that one of them contains the main() function.
- Note: In MATE, functions containing MPI calls are inlined. This means that calls pointing to them are replaced by the actual contents of the function. This is applied recursively until all functions containing MPI code are inlined into the main() function.
- To translate your code using MATE, you can execute the mate-translate command.
- You can add the --mate-verbose flag to follow the progress of MATE as it translates your code.
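- For example, a program split across three source files could be translated as follows (the file names are only illustrative, inferred from the object files used in the next step; see the mate-translate manual page for the exact argument syntax):
mate-translate --mate-verbose 7_point_stencil.cpp residual.cpp getTime.cpp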
Step III) Compile and Link
- After translation, MATE will generate a single .cpp file. The name of the resulting file will be the same as the source file containing the main() function plus a "mate_" prefix.
- Although it is possible to compile and link in the same step, it is recommended that all object files are built first and then linked.
- To compile and link your MATE translated code, use mate-cxx
mate-cxx -c -O2 residual.cpp
mate-cxx -c -O2 mate_7_point_stencil.cpp
mate-cxx -o mate_7_point_stencil getTime.o mate_7_point_stencil.o residual.o
Alternatively, you can compile and link in the same step:
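For example, reusing the same (inferred) source file names and passing the translated file in place of the original main file:
mate-cxx -O2 -o mate_7_point_stencil mate_7_point_stencil.cpp residual.cpp getTime.cpp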
Tutorial IV: Running your program
Step I) MATE execution model
- A MATE program runs as a set of processes, each comprising multiple worker threads.
- Each worker thread runs one or more tasks, each corresponding to a virtual MPI process.
- Each worker thread is mapped to a kernel thread, which runs on one core.
- A task runs to completion, although it may yield to another task assigned to the same worker thread.
Step II) Define Worker Threads
- The --mate-threads argument indicates how many worker threads (cores) will be used to run your program.
- The simplest value for this argument is the exact number of cores in your machine (no hyperthreading).
Step III) Define MATE Task Count
- The --mate-tasks argument indicates how many MATE tasks (equivalent to MPI tasks) will be used to run your program.
- The simplest value for this argument is the same number you would pass to the -n argument when running the pure MPI version of your program with mpirun. However, more advanced virtualized execution configurations are shown in Tutorial V.
Step IV) Run your program
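A possible launch line is shown below. We assume here that the translated binary is started through mpirun like any MPI program and that --mate-tasks gives the total task count; check the MATE manual pages for the exact runtime syntax.
mpirun -n 1 ./mate_7_point_stencil --mate-threads 16 --mate-tasks 16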
The command above will execute the program with 16 MATE tasks on 16 cores of a single node.
You can also add the --mate-verbose flag to obtain information about the execution.
Tutorial V: Virtualization, NUMA aware, and Large Scale Execution
Step I) Warmup
- We recommend that you familiarize yourself with the relevant manual pages and keep them at hand in another tab/window while you follow these steps.
- For the purpose of the examples used in this tutorial, we assume the following execution cluster:
- 256 Computing Nodes.
- 4 Physical Processors per Computing Node.
- 4 Cores per Physical Processor (no Hyperthreading).
Step II: Virtualization
- MATE tasks (the equivalent to MPI tasks) run virtualized. This means that they are not attached to an individual process or thread. Any task in MATE can execute in any worker thread (although scheduling in MATE is not random, as it is designed for performance).
- Furthermore, the user can run a program with more tasks than worker threads, which is not possible in pure MPI. Running more tasks than worker threads (typically twice as many) increases MATE's capability for overlapping communication, as work granularity decreases and worker threads can choose from a bigger pool of tasks while others are still communicating.
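- As a reminder, this is the (assumed) single-node launch line from Tutorial IV:
mpirun -n 1 ./mate_7_point_stencil --mate-threads 16 --mate-tasks 16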
- The above example specifies a virtualization level of 1, that is, one task per worker thread.
- If we want to achieve a virtualization level of 2, that is, two tasks per worker thread, we need to double the task count:
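- Keeping 16 worker threads and doubling the task count to 32, the (assumed) launch line becomes:
mpirun -n 1 ./mate_7_point_stencil --mate-threads 16 --mate-tasks 32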
Step III: NUMA aware execution
- Some computing nodes contain two or more processor sockets per node. Although the operating system reports the total number of cores available for computing, communication and memory-access latency among them is non-uniform. This processor/memory design pattern is called Non-Uniform Memory Access (NUMA).
- The following image illustrates how CPU cores and memory are distributed in our example NUMA setup:
- Again, this is how we launched our example in Tutorial IV:
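- (Launch syntax assumed, as in the previous examples.)
mpirun -n 1 ./mate_7_point_stencil --mate-threads 16 --mate-tasks 16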
- This is, however, a NUMA-unaware execution of MATE since shared-memory synchronization is unrestricted across all cores.
- A NUMA-aware configuration of MATE will reduce synchronization and communication across the bus interconnecting different processors, thus increasing performance.
- To achieve this, we need to execute one MATE process per physical processor, using mpirun. The following command runs 4 MATE processes with 4 worker threads (cores) each:
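- The launch line below follows the same assumed syntax; --mate-tasks is taken to be the total task count across all processes:
mpirun -n 4 ./mate_7_point_stencil --mate-threads 4 --mate-tasks 16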
Step IV: Large Scale Executions
- To execute a MATE application across multiple nodes, it is necessary to indicate a larger number of MATE processes to the SLURM launcher.
- For example, if we wanted to execute our application across all 256 computing nodes (using 4096 tasks), we would use the following command:
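- The launch line below follows the same assumed syntax, with one MATE process per node, 16 worker threads each, and 4096 tasks in total:
mpirun -n 256 ./mate_7_point_stencil --mate-threads 16 --mate-tasks 4096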
- Note that only one MATE process per node needs to be launched. It is necessary to configure the MPI launcher to allocate only one MPI task per node, instead of allocating tasks on contiguous cores across the minimal number of nodes.
- Most supercomputers have a SLURM job submission system that allows the user to specify how many MPI tasks are created per node. This is how we would configure it in our case:
#SBATCH -N 256 # total number of nodes requested
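# A plausible completion (the exact options may vary by site): place exactly one MPI task (MATE process) on each node
#SBATCH --ntasks-per-node=1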
Final Step: Putting it all together
- Our goal is to execute a NUMA-aware MATE application with virtualization factor=2, across all our 256 nodes.
- To achieve a NUMA-aware execution, we will execute 4 MPI processes per node with 4 worker threads each.
- To achieve a virtualization factor = 2, we will execute 32 tasks per node.
- In total we will need: 256x4 = 1024 MPI tasks, and 256x32 = 8192 MATE tasks.
- This can be achieved by executing the following command:
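- The launch line below follows the same assumed syntax as before: 1024 MATE processes in total (4 per node), 4 worker threads each, and 8192 tasks overall:
mpirun -n 1024 ./mate_7_point_stencil --mate-threads 4 --mate-tasks 8192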
We would then need the following configuration for our SLURM job specification:
#SBATCH -N 256 # total number of nodes requested
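# A plausible completion (the exact options may vary by site): four MATE processes per node, one per physical processor
#SBATCH --ntasks-per-node=4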