A soft-core crash course on slurm and mpi

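slurm is arguably the most approachable cross-system job scheduler, and plenty of supercomputers happily run it. Its name originally stood for "Simple Linux Utility for Resource Management". Linux! Who doesn't love that?

It provides three main functions:

Allocating exclusive or non-exclusive access to compute resources to users
A framework for starting, executing, and monitoring jobs on the allocated resources
Managing a queue of pending jobs when resources run short

In short, it lets you easily submit a multi-process job to a bunch of compute nodes. It has few features, so it is easy to learn.

A slurm system has a central management node (sjtu hpc calls it the login node) and a bunch of compute nodes. Every compute node belongs to at least one partition (sjtu hpc calls it a queue; one partition contains a bunch of nodes), and when we run a job we can (and should) specify the partition.

Most operations are covered by "Slurm 作业调度系统 | 上海交大超算平台用户手册" and "提交作业 | 未名". This post focuses on writing the batch file; once it is written, sbatch BATCHFILE && watch -n0.5 squeue -l is all you need.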

Hello, world

main.slurm

#!/bin/bash

#SBATCH --job-name=hello-world # job name shown by squeue/sacct
#SBATCH --partition=arm128c256g # partition to use
#SBATCH --ntasks=4 # number of parallel tasks for srun
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=stdout

echo hello, world

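Lines starting with #SBATCH are options passed to sbatch, which saves typing them by hand.

Run

sbatch main.slurm && watch -n0.5 squeue -l
cat stdout

You will see a hello, world, which means the job ran successfully!

What happened during this process?

First, sbatch reads the lines starting with #SBATCH as options. In this example it sets

job-name: the job name
partition: the partition the job runs in
ntasks: the maximum number of parallel processes available to srun/mpirun
nodes: the number of nodes requested
output: the file that stdout is written to
cpus-per-task: the number of CPUs per process; useful for multi-process jobs
ntasks-per-node: the number of processes per node; useful for multi-process jobs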
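One way to check which of these settings actually reached the job is to read the environment variables that sbatch exports into the batch script. The sketch below is only an illustration: env_or and print_slurm_settings are hypothetical helper names, and some of the variables (SLURM_CPUS_PER_TASK, for instance) are only set when the matching option was given.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return the value of an environment variable, or `fallback` if it is unset.
 * Hypothetical helper for this sketch, not part of Slurm. */
const char *env_or(const char *name, const char *fallback)
{
    const char *value = getenv(name);
    return value != NULL ? value : fallback;
}

/* Print the Slurm settings visible to the current process.
 * Outside a Slurm job every line shows "(unset)". */
void print_slurm_settings(FILE *out)
{
    /* Names follow the sbatch documentation; some are only exported when
     * the corresponding option was passed (e.g. SLURM_CPUS_PER_TASK). */
    static const char *names[] = {
        "SLURM_JOB_NAME",      "SLURM_JOB_PARTITION", "SLURM_NTASKS",
        "SLURM_JOB_NUM_NODES", "SLURM_CPUS_PER_TASK", "SLURM_NTASKS_PER_NODE",
    };
    size_t count = sizeof(names) / sizeof(names[0]);

    for (size_t i = 0; i < count; i++) {
        fprintf(out, "%s=%s\n", names[i], env_or(names[i], "(unset)"));
    }
}
```

Calling print_slurm_settings(stdout) from a small main and running that program inside the batch script should show the values next to the echo output in the stdout file.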

Some rules of thumb for requesting resources (src)

you use mpi and do not care about where those cores are distributed: --ntasks=16

you want to launch 16 independent processes (no communication): --ntasks=16

you want those cores to spread across distinct nodes: --ntasks=16 and --ntasks-per-node=1 or --ntasks=16 and --nodes=16

you want those cores to spread across distinct nodes and no interference from other jobs: --ntasks=16 --nodes=16 --exclusive

you want 16 processes to spread across 8 nodes to have two processes per node: --ntasks=16 --ntasks-per-node=2

you want 16 processes to stay on the same node: --ntasks=16 --ntasks-per-node=16

you want one process that can use 16 cores for multithreading: --ntasks=1 --cpus-per-task=16

you want 4 processes that can use 4 cores each for multithreading: --ntasks=4 --cpus-per-task=4

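Then it requests one node and 4 cores, and has the first allocated node run the whole script in the current directory¹. That is where the stdout file we saw comes from.

slurm assumes by default that all compute nodes share the same storage, mounted at the same location. This has many benefits:

Compile once, use on every node
Larger data can be shared through files
Configuration files are shared

Multiple processes

To run a single executable as multiple processes, use srun or mpirun. They read the slurm allocation automatically, spread across the available nodes, and create the requested number of processes.

Below is an mpi example run as multiple processes.

Code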

mpihello.c

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_HOSTNAME_LENGTH 256

int main(int argc, char *argv[])
{
    int pid;
    char hostname[MAX_HOSTNAME_LENGTH];

    int numprocs;
    int rank;

    int rc;

    /* Initialize MPI. Pass reference to the command line to
     * allow MPI to take any arguments it needs
     */
    rc = MPI_Init(&argc, &argv);

    /* It's always good to check the return values on MPI calls */
    if (rc != MPI_SUCCESS)
    {
        fprintf(stderr, "MPI_Init failed\n");
        return 1;
    }

    /* Get the number of processes and the rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* let's see who we are to the "outside world" - what host and what PID */
    gethostname(hostname, MAX_HOSTNAME_LENGTH);
    pid = getpid();

    /* say who we are */
    printf("Rank %d of %d has pid %5d on %s\n", rank, numprocs, pid, hostname);
    fflush(stdout);

    /* allow MPI to clean up after itself */
    MPI_Finalize();
    return 0;
}

mpi.slurm

#!/bin/bash

#SBATCH --job-name=mpihello # job name shown by squeue/sacct
#SBATCH --partition=arm128c256g # partition to use
#SBATCH --ntasks=4 # number of parallel tasks for srun
#SBATCH --nodes=1 # number of nodes
#SBATCH --output=mpi.out

# module load gcc openmpi # not needed here, they are already loaded by default

# compile if needed
if [ ! -f mpihello ]; then
    mpicc mpihello.c -o mpihello
fi

mpirun ./mpihello

Run sbatch mpi.slurm && watch -n0.5 squeue -l and wait for it to finish.

Output

Rank 1 of 4 has pid 1636356 on kp029.pi.sjtu.edu.cn
Rank 2 of 4 has pid 1636357 on kp029.pi.sjtu.edu.cn
Rank 3 of 4 has pid 1636358 on kp029.pi.sjtu.edu.cn
Rank 0 of 4 has pid 1636355 on kp029.pi.sjtu.edu.cn

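Of course, since I have no quota right now, I can't try multiple nodes. To be continued! mpirun should be able to read the configuration automatically².

Interactive debugging

You can request a temporary debugging environment, which counts against your quota:

salloc -p arm128c256g -N1 -n2 --time=1:00:00
ssh NODE # the previous command prints the allocated node
# run commands on the compute node
exit # leave the compute node; the salloc session is still active
ssh NODE1 # you can log in to other allocated nodes, if any
exit # leave the compute node
exit # leave salloc; the resources are released automatically

This requests a temporary single-node, two-core environment. It is handy for debugging work such as checking which modules are available or compiling. Don't abuse it!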

MPI

MPI is an interface standard:

The Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on parallel computing architectures. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several open-source MPI implementations, which fostered the development of a parallel software industry, and encouraged development of portable and scalable large-scale parallel applications.

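The standard requires the following features¹:

Point-to-point communication
Collective operations (broadcast, reduce, and the like)
Process groups
Communication contexts (isolated message spaces, not network VLANs)
Process topologies (virtual layouts such as Cartesian grids or graphs)
Environmental management and inquiry
Profiling interface
Bindings for C and Fortran

It does not provide

Explicit shared-memory operations
Advanced features that need operating system support (such as interrupt-driven receives, remote execution)
Task management and scheduling
…

Implementations such as openmpi or hypermpi may provide some extra features; that is up to the specific implementation.

Of course, as a "soft-core" post, we won't dig into how things are implemented and will focus on usage instead.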

You don’t need to know where the ranks end up in most cases. The MPI standard doesn’t want you to worry about it. The mechanics inside an MPI program are generally independent of the arrangement, though the library itself has to keep track to send messages to the right place behind the scenes. The standard is designed so that you don’t have to worry about it for correctness, though there may be some performance differences.²

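MPI basics

Key concepts³⁴

Process group: an ordered subset of all the processes of an MPI program. Each process in a group is assigned a rank that is unique within the group and identifies the process there. Ranks range over [0, number of processes - 1].
Communicator: a process group combined with a communication context. Processes can only communicate with each other through a communicator they both belong to. MPI provides a default communicator, MPI_COMM_WORLD.
Rank: identifies a process within a process group or communicator. The same process can have different ranks in different groups or communicators, so mind the context. The rank MPI_PROC_NULL is a black-hole (dummy) process.
Message: consists of data and an envelope. The envelope is made up of the source/destination rank, the message tag, and the communicator. The data holds what the user wants to transfer.
Opaque objects: objects used internally by MPI; the user interacts with them through handles.
General function form: in the C binding, buffers and OUT/INOUT arguments are passed by pointer; the return value is an int, MPI_SUCCESS on success.
Arguments therefore fall into three kinds:
- IN: the call only reads the argument and does not write it
- OUT: the call writes the argument and does not read its initial value
- INOUT: the call may both read and write it

Data types:

Basic types: int, char, float…
Byte type MPI_BYTE: as the name says, a single untyped byte with no interpretation.
Packed type MPI_PACKED: the type of data that has been packed into a buffer with MPI_Pack. User-defined (derived) types are a separate mechanism, built with the MPI_Type_* constructors.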

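Basic program structure: initialize the MPI environment => MPI main program => finalize the MPI environment

Basic functions:

int MPI_Init(int *argc(IN), char ***argv(IN)): initialize the environment
int MPI_Finalize(): finalize the environment
int MPI_Abort(MPI_Comm comm(IN), int errorcode(IN)): abnormally terminate the processes of the communicator
int MPI_Comm_size(MPI_Comm comm(IN), int *size(OUT)): get the total number of processes in the communicator
int MPI_Comm_rank(MPI_Comm comm(IN), int *rank(OUT)): get the rank of the current process in the communicator

MPI communication

Communication is the core feature of MPI. Whether it happens within one operating system or across a network, it is handled by the specific mpi implementation, and the user need not care about the details. Through MPI's uniform communication interface, multiple processes can distribute and schedule work, synchronize, and transfer data.

The most basic blocking communication functions are: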

int MPI_Send(const void *buf(IN), int count(IN), MPI_Datatype datatype(IN), int dest(IN), int tag(IN), MPI_Comm comm(IN))
int MPI_Recv(void *buf(OUT), int count(IN), MPI_Datatype datatype(IN), int source(IN), int tag(IN), MPI_Comm comm(IN), MPI_Status *status(OUT))

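where

buf: the buffer
count: number of elements. The actual amount read or written is count*sizeof(datatype). For a receive this is the buffer capacity, so it must be at least the count that was sent
datatype: element type. With a void* buffer it mainly determines the element size, but it also lets MPI describe derived layouts and convert between heterogeneous machines
dest/source: rank of the peer, relative to the communicator; source may be MPI_ANY_SOURCE
tag: message tag, which allows messages to be received/sent out of order; on receive, tag may be MPI_ANY_TAG
comm: the communicator
status: receive status; it holds the actual source and tag, and MPI_Get_count reads the received length from it

Usage follows from the parameters.

Code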

main.c

#include <mpi.h>
#include <stdio.h>

int numprocs=-1, rank=-1;
typedef int mpierr_t; // MPI_SUCCESS for succ and other for err

#define CHECK(expr) {mpierr_t rc = expr; if (rc != MPI_SUCCESS) {printf("[Node %d] MPI failure %d\n", rank, rc); return rc;}}

mpierr_t main0() {
    CHECK(MPI_Send("Hello, world!", 14, MPI_BYTE, 1, 0, MPI_COMM_WORLD));
    printf("[Node 0] MPI Message sent\n");
    return MPI_SUCCESS;
}

mpierr_t main1() {
    char buf[16];
    MPI_Status status;
    CHECK(MPI_Recv((void*)buf, 14, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status));
    printf("[Node 1] MPI Message received: %s\n", buf);
    return MPI_SUCCESS;
}

int main(int argc, char *argv[]) {
    CHECK(MPI_Init(&argc, &argv));

    CHECK(MPI_Comm_rank(MPI_COMM_WORLD, &rank));

    switch (rank) {
    case 0: main0(); break;
    case 1: main1(); break;
    default: printf("[Node %d] Unknown ranking\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Running with 4 tasks again, the output is

[Node 3] Unknown ranking
[Node 0] MPI Message sent
[Node 1] MPI Message received: Hello, world!
[Node 2] Unknown ranking

Non-blocking/asynchronous communication

Functions whose names start with I after the MPI_ prefix are asynchronous.

int MPI_Isend(const void *buf(IN), int count(IN), MPI_Datatype datatype(IN), int dest(IN), int tag(IN), MPI_Comm comm(IN), MPI_Request *request(OUT))
int MPI_Irecv(void *buf(OUT), int count(IN), MPI_Datatype datatype(IN), int source(IN), int tag(IN), MPI_Comm comm(IN), MPI_Request *request(OUT))
int MPI_Wait(MPI_Request *request(INOUT), MPI_Status *status(OUT))
int MPI_Waitany(int count(IN), MPI_Request array_of_requests[](INOUT), int *index(OUT), MPI_Status *status(OUT))
int MPI_Waitall(int count(IN), MPI_Request array_of_requests[](INOUT), MPI_Status array_of_statuses[](OUT))

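Used together, they let you send and receive without blocking. For example, you can compute the next round right after starting a send and only wait for that send to complete once the round is done, making full use of CPU time; or you can start many sends and receives and then wait for all of them at once.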

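Communication modes

See other blogs for these. The standard mode is usually enough; use MPI_Rsend only if you can guarantee the send/receive ordering (which in practice you basically can't).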

wsmの随记