III - Tutorials


Tutorial I: Configuring and Installing MATE

Step I) Download and extract MATE

Step II) Configure MATE

Here is an example of how to configure MATE:
./configure --with-mpi=/apps/mpich2 --prefix=/usr/bin/mate --build-dir=/largefs/buildmate --build-jobs=16

Step III) Compile, test, and install MATE

Here is an example of how to install MATE:
make
make install
export PATH=/usr/bin/mate:$PATH


Tutorial II: Annotating your code with MATE #pragmas

Step I) Warmup

Throughout this tutorial we will use the following code sample:


Step II) Identify MPI Calls



Step III) Identify opportunities for overlap



Step IV) Identify regions

Step V) Identify Data Dependencies



Tutorial III: Translating and compiling your program

Step I) Warmup

Step II) Translate your annotated code

Here is an example (taken from examples/7_point_stencil) of how to translate a multi-file program using MATE:
mate-translate getTime.cpp 7_point_stencil.cpp residual.cpp

Step III) Compile and Link

Following our example, this is how to compile and link the MATE-translated code together with the other source files:
mate-cxx -c -O2 getTime.cpp
mate-cxx -c -O2 residual.cpp
mate-cxx -c -O2 mate_7_point_stencil.cpp
mate-cxx -o mate_7_point_stencil getTime.o mate_7_point_stencil.o residual.o

Alternatively, you can compile and link in the same step:
mate-cxx getTime.cpp mate_7_point_stencil.cpp residual.cpp -o mate_7_point_stencil


Tutorial IV: Running your program

Step I) MATE execution model

Step II) Define Worker Threads

Step III) Define MATE Task Count

Step IV) Run your program

In this example, we assume a dual-socket node in which each socket contains 8 physical cores. Here is an example of how to run your MATE-translated program (mate_example) on a single node using the SLURM job manager (srun):
srun -n 2 ./mate_example --mate-workers 8 --mate-tasks 16 program_arguments

The command above executes the program as 16 MATE tasks on the 16 cores of a single node.
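
The core count in this example follows from the launch parameters. As a minimal sketch, assuming (as the example suggests, though it is not stated outright) that --mate-workers is counted per process, the arithmetic can be checked with a small shell helper. The function name mate_total_workers is hypothetical and is not part of MATE:

```shell
#!/bin/sh
# Hypothetical helper (not part of MATE): total worker threads of a launch.
# Assumption: --mate-workers is a per-process count.
# Usage: mate_total_workers PROCESSES WORKERS_PER_PROCESS
mate_total_workers() {
    echo $(( $1 * $2 ))
}

# srun -n 2 ... --mate-workers 8  ->  2 processes x 8 workers each
mate_total_workers 2 8    # prints 16
```

With 2 processes and 8 workers each, the helper reports 16 worker threads, one per physical core of the dual-socket node assumed above.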

You can also add the --mate-verbose flag to obtain information about the execution:
srun -n 2 ./mate_example --mate-workers 8 --mate-tasks 16 --mate-verbose program_arguments


Tutorial V: Virtualization, NUMA-Aware, and Large-Scale Execution

Step I) Warmup

Step II) Virtualization

In the last step of Tutorial IV we saw how to execute our MATE-translated example using all of the cores of one node:
srun -n 2 ./mate_example --mate-workers 8 --mate-tasks 16 program_arguments

The same run with 32 MATE tasks on the same 16 cores:
srun -n 2 ./mate_example --mate-workers 8 --mate-tasks 32 program_arguments
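
The two commands in this step differ only in the number of MATE tasks. As a rough sketch, assuming --mate-tasks is the total task count across all processes and --mate-workers is per process (an interpretation of the single-node example, not a documented fact), the number of tasks each worker thread has to serve can be computed as follows. The helper name mate_tasks_per_worker is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper (not part of MATE): MATE tasks per worker thread.
# Assumptions: --mate-tasks is a global total, --mate-workers is per process.
# Uses integer division, so choose task counts that divide evenly.
# Usage: mate_tasks_per_worker PROCESSES WORKERS_PER_PROCESS TOTAL_TASKS
mate_tasks_per_worker() {
    echo $(( $3 / ($1 * $2) ))
}

mate_tasks_per_worker 2 8 16    # --mate-tasks 16: prints 1
mate_tasks_per_worker 2 8 32    # --mate-tasks 32: prints 2
```

Under these assumptions, the 32-task run places two MATE tasks on every worker thread.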

Step III) NUMA-aware execution


Image source: Lawrence Livermore National Laboratory
srun -n 2 ./mate_example --mate-workers 8 --mate-tasks 16 program_arguments
srun -n 4 ./mate_example --mate-workers 4 --mate-tasks 16 program_arguments
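
Both commands in this step use the same 16 MATE tasks; only the split between processes and workers per process changes (2 x 8 versus 4 x 4). As a sketch, assuming the goal is to keep every core of a 16-core node busy and that --mate-workers is a per-process count, the workers value for a given number of processes per node can be derived like this. The helper name mate_workers_per_process is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper (not part of MATE): worker threads per process such
# that processes x workers equals the number of cores in a node.
# Usage: mate_workers_per_process CORES_PER_NODE PROCESSES_PER_NODE
mate_workers_per_process() {
    if [ $(( $1 % $2 )) -ne 0 ]; then
        echo "cores per node must be divisible by processes per node" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

mate_workers_per_process 16 2    # prints 8, as in the first command
mate_workers_per_process 16 4    # prints 4, as in the second command
```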

Step IV) Large-Scale Executions

srun -n 256 ./mate_example --mate-workers 16 --mate-tasks 16 program_arguments
#SBATCH -n 256 # total number of MATE processes requested
#SBATCH -N 256 # total number of nodes requested

Final Step: Putting it all together

srun -n 1024 ./mate_example --mate-workers 4 --mate-tasks 8192 program_arguments

We would then need the following configuration for our SLURM job specification:
#SBATCH -n 1024 # total number of MATE processes requested
#SBATCH -N 256 # total number of nodes requested
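
The numbers in this final example can be reproduced from four inputs: 256 nodes, 4 processes per node, 4 workers per process, and 2 MATE tasks per worker. The sketch below assumes that --mate-tasks is a global total and --mate-workers a per-process count, which is consistent with the commands above but is an interpretation rather than documented behavior. The helper name mate_job_params is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper (not part of MATE): print the SLURM settings and the
# srun line for a given job shape.
# Assumptions: --mate-tasks is a global total, --mate-workers is per process.
# Usage: mate_job_params NODES PROCS_PER_NODE WORKERS_PER_PROC TASKS_PER_WORKER
mate_job_params() {
    nodes=$1; ppn=$2; workers=$3; tpw=$4
    procs=$(( nodes * ppn ))              # total MATE processes (-n)
    tasks=$(( procs * workers * tpw ))    # total MATE tasks
    echo "#SBATCH -n $procs"
    echo "#SBATCH -N $nodes"
    echo "srun -n $procs ./mate_example --mate-workers $workers --mate-tasks $tasks program_arguments"
}

mate_job_params 256 4 4 2
```

Called with these inputs, the helper prints the same -n 1024, -N 256, and --mate-tasks 8192 values used in the example.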