Amazon AWS Official Blog

Deploying the DeepSeek Fire-Flyer File System (3FS) on SoftRoCE

From February 24 to 28, 2025, DeepSeek ran its five-day "Open Source Week," and on the final day it open-sourced the Fire-Flyer File System (3FS), a parallel file system. 3FS shows how to squeeze the most out of SSDs and Remote Direct Memory Access (RDMA) networks, and fully supports large-model training and inference workloads such as KVCache offloading and vector databases, delivering low-cost, high-throughput data access for AI training and inference.

Key Features

  • High-performance data access: aggregates the throughput of thousands of SSDs and the network bandwidth of hundreds of storage nodes to deliver up to 6.6 TiB/s of read throughput. Supports high-throughput parallel reads and writes in large clusters and improves data-loading efficiency for AI training and inference.
  • General-purpose file interface: provides a stateless metadata service backed by a transactional key-value store (such as FoundationDB), so users do not need to learn a new storage API.
  • Optimized for AI workloads:
    • Data preparation: efficiently manages large volumes of intermediate outputs and supports hierarchical directory structures.
    • Data loading: supports random access across compute nodes without prefetching or dataset shuffling.
    • Checkpointing: provides high-throughput parallel checkpointing for large-scale training.
    • KVCache: offers a high-throughput, large-capacity cache alternative for inference tasks, improving inference efficiency.
  • High scalability and flexibility: supports large-scale cluster deployments, from a single node up to thousands of nodes.

Technical Characteristics

  1. Disaggregated architecture: compute and storage are separated, storage resources are pooled and managed centrally, and a high-speed network (such as RDMA) provides efficient data transfer. Applications access storage in a location-independent way, which simplifies resource management.
  2. Strong consistency: 3FS builds on CRAQ (Chain Replication with Apportioned Queries). Chain replication keeps data consistent across replicas, while apportioned queries improve read performance and reduce latency.
  3. File interfaces: a stateless metadata service backed by a transactional key-value store (such as FoundationDB) exposes a standard file interface, so no new storage API has to be learned.
  4. Direct I/O and RDMA optimization: Direct I/O accesses SSDs directly and bypasses the file cache, reducing CPU and memory overhead, while RDMA provides efficient data transfer for further performance gains.
  5. KVCache: during inference, 3FS caches key intermediate results (the KV cache) to avoid recomputation, significantly improving inference efficiency. Combining high throughput with large capacity, it is a low-cost alternative to DRAM caching.
  6. Data locality optimization: optimized data layout and access patterns reduce transfer latency and bandwidth consumption, which is particularly valuable in large-scale distributed training and inference.

For this test we use c6a.8xlarge instances and build a cluster of 5 storage nodes plus 1 client to walk through the official 3FS deployment guide and validate the corresponding technical characteristics.

Environment Preparation

Deployment Architecture

Installation Environment

| No. | IP             | Instance type | Role                                       | OS           | Disks                                        | Network       |
|-----|----------------|---------------|--------------------------------------------|--------------|----------------------------------------------|---------------|
| 1   | 192.168.19.240 | c6a.8xlarge   | meta / monitor / management / FUSE client  | Ubuntu 22.04 | 1 × 300 GB disk (gp3, 800 MB/s, 3000 IOPS)   | 12.5 Gbps ENA |
| 2   | 192.168.15.58  | c6a.8xlarge   | storage1                                   | Ubuntu 22.04 | 3 × 500 GB disk (gp3, 800 MB/s, 3000 IOPS)   | 12.5 Gbps ENA |
| 3   | 192.168.7.82   | c6a.8xlarge   | storage2                                   | Ubuntu 22.04 | 3 × 500 GB disk (gp3, 800 MB/s, 3000 IOPS)   | 12.5 Gbps ENA |
| 4   | 192.168.25.80  | c6a.8xlarge   | storage3                                   | Ubuntu 22.04 | 3 × 500 GB disk (gp3, 800 MB/s, 3000 IOPS)   | 12.5 Gbps ENA |
| 5   | 192.168.8.218  | c6a.8xlarge   | storage4                                   | Ubuntu 22.04 | 3 × 500 GB disk (gp3, 800 MB/s, 3000 IOPS)   | 12.5 Gbps ENA |
| 6   | 192.168.7.118  | c6a.8xlarge   | storage5                                   | Ubuntu 22.04 | 3 × 500 GB disk (gp3, 800 MB/s, 3000 IOPS)   | 12.5 Gbps ENA |

Note: this deployment uses SoftRoCE over the 12.5 Gbps ENA interface, so an EFA-capable EC2 instance type is not required.

The services and configuration files on each node follow the official recommendations, as shown below:

| Service   | Binary                  | Config files                                                                  | Node ID     | Node                      |
|-----------|-------------------------|-------------------------------------------------------------------------------|-------------|---------------------------|
| monitor   | monitor_collector_main  | monitor_collector_main.toml                                                   | –           | meta                      |
| admin_cli | admin_cli               | admin_cli.toml, fdb.cluster                                                   | –           | meta, storage1–storage5   |
| mgmtd     | mgmtd_main              | mgmtd_main_launcher.toml, mgmtd_main.toml, mgmtd_main_app.toml, fdb.cluster   | 1           | meta                      |
| meta      | meta_main               | meta_main_launcher.toml, meta_main.toml, meta_main_app.toml, fdb.cluster      | 100         | meta                      |
| storage   | storage_main            | storage_main_launcher.toml, storage_main.toml, storage_main_app.toml          | 10001–10005 | storage1–storage5         |
| client    | hf3fs_fuse_main         | hf3fs_fuse_main_launcher.toml, hf3fs_fuse_main.toml                           | –           | meta                      |

Install SoftRoCE

IPv6 must be enabled on all nodes.

Enable IPv6 in the VPC

Enable IPv6 on the ENA interface
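The console screenshots for the two IPv6 steps above are not reproduced here. As an illustrative sketch only (the VPC, subnet, and ENI IDs below are placeholders; adjust them to your environment), the equivalent AWS CLI calls look like this:

# Add an Amazon-provided IPv6 CIDR block to the VPC and subnet (placeholder IDs)
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --amazon-provided-ipv6-cidr-block
aws ec2 associate-subnet-cidr-block --subnet-id subnet-0123456789abcdef0 --ipv6-cidr-block 2600:1f13:abcd:1200::/64

# Assign an IPv6 address to each instance's ENA interface (placeholder ENI ID)
aws ec2 assign-ipv6-addresses --network-interface-id eni-0123456789abcdef0 --ipv6-address-count 1

# On each instance, confirm the ENA interface now has an IPv6 address
ip -6 addr show dev ens5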

Install and enable SoftRoCE

The default AWS Ubuntu 22.04 kernel does not ship the rdma_rxe module. You can verify this with the following command:

[ec2-user@ip-192-168-19-240 ~]$ modprobe rdma_rxe
modprobe: FATAL: Module rdma_rxe not found in directory /lib/modules/6.8.0-1021-aws

If you see the error above, install the rdma_rxe module and the related dependencies:

apt install linux-modules-extra-6.8.0-1021-aws

apt-get install libibverbs1 ibverbs-utils librdmacm1 libibumad3 ibverbs-providers rdma-core
apt install libibverbs-dev
apt install librdmacm-dev
apt install librdmacm-utils
apt install iproute2

Add an RDMA link

# add an rdma link (using the existing Ethernet interface)
rdma link add rxe_eth0 type rxe netdev ens5

Verify the RDMA link status

# rdma link should show the SoftRoCE device
ubuntu@ip-192-168-19-240:~$ sudo -i rdma link
link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev ens5
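Optionally, you can double-check the new device with the ibverbs utilities installed earlier, and run a loopback rping test between two nodes. These checks are not part of the original guide:

ibv_devices                  # should list rxe_eth0
ibv_devinfo -d rxe_eth0      # port state should be PORT_ACTIVE
# optional end-to-end check between two nodes (librdmacm-utils):
#   on node A: rping -s -v
#   on node B: rping -c -a <node-A-IP> -C 5 -v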

Install and configure a SoftRoCE service (optional)

Create the rdma-setup service as follows:

vi /etc/systemd/system/rdma-setup.service

and add the following content:

[Unit]
Description=Setup RDMA RXE Link
After=network.target

[Service]
Type=oneshot
ExecStart=/sbin/modprobe rdma_rxe
ExecStart=/usr/bin/rdma link add rxe_eth0 type rxe netdev ens5
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Start the service and check its status:

root@ip-192-168-19-240:~# systemctl start rdma-setup.service
root@ip-192-168-19-240:~# systemctl status rdma-setup.service
● rdma-setup.service - Setup RDMA RXE Link
     Loaded: loaded (/etc/systemd/system/rdma-setup.service; disabled; vendor preset: enabled)
     Active: active (exited) since Thu 2025-03-14 01:33:00 CST; 3s ago
    Process: 2410 ExecStart=/sbin/modprobe rdma_rxe (code=exited, status=0/SUCCESS)
    Process: 2414 ExecStart=/usr/bin/rdma link add rxe_eth0 type rxe netdev ens5 (code=exited, status=0/SUCCESS)
   Main PID: 2414 (code=exited, status=0/SUCCESS)
        CPU: 17ms

Apr 03 01:33:00 ip-192-168-8-218 systemd[1]: Starting Setup RDMA RXE Link...
Apr 03 01:33:00 ip-192-168-8-218 systemd[1]: Finished Setup RDMA RXE Link.
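To have the RXE link recreated automatically after a reboot, it is also worth enabling the unit (not shown in the original steps):

systemctl enable rdma-setup.service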

Emulating ibdev2netdev

3FS calls the Mellanox ibdev2netdev utility when executing its commands, so in a SoftRoCE environment we need to provide a stand-in that produces equivalent output. Add a script as follows:

vim /usr/sbin/ibdev2netdev
Add the following content:
 
#!/bin/bash
echo "rdma_0 port 1 ==> rxe_eth0 (Up)"

Then make the script executable:

chmod +x /usr/sbin/ibdev2netdev

Install and Build 3FS

The steps below need to be performed on every node in the cluster. Alternatively, you can complete them on a single node and create an AMI from it to use as the base image for the meta/storage/management nodes.

Configure the Ubuntu package repositories

Open a terminal and run the following command to inspect the sources.list file:

sudo nano /etc/apt/sources.list

Make sure the file contains valid repository entries. For Ubuntu 22.04 (jammy), the official entries look like this:

deb http://archive.ubuntu.com/ubuntu/ jammy main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ jammy-security main restricted universe multiverse

Run the update

sudo apt-get update

Install the required dependencies

apt install cmake libuv1-dev liblz4-dev liblzma-dev libdouble-conversion-dev \
    libprocps-dev libdwarf-dev libunwind-dev libaio-dev libgflags-dev \
    libgoogle-glog-dev libgtest-dev libgmock-dev clang-format-14 clang-14 clang-tidy-14 \
    lld-14 libgoogle-perftools-dev google-perftools libssl-dev ccache gcc-12 g++-12 \
    libboost-all-dev

Install FoundationDB

wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-clients_7.3.63-1_amd64.deb
wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-server_7.3.63-1_amd64.deb

dpkg -i foundationdb-clients_7.3.63-1_amd64.deb
dpkg -i foundationdb-server_7.3.63-1_amd64.deb
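Before moving on, it is worth confirming that the single-node FoundationDB instance installed by the server package is healthy; this check is not part of the original steps:

fdbcli --exec status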

Install libfuse

wget https://github.com/libfuse/libfuse/releases/download/fuse-3.16.1/fuse-3.16.1.tar.gz
tar vzxf fuse-3.16.1.tar.gz
cd fuse-3.16.1/
mkdir build && cd build
apt install -y meson
meson setup ..
ninja && ninja install
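Depending on the install prefix meson chooses, you may also need to refresh the dynamic linker cache and confirm the fuse3 tools are visible; this is an extra check, not part of the original guide:

ldconfig
fusermount3 --version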

Install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Enter "1" when prompted:

1) Proceed with standard installation (default - just press enter)
2) Customize installation
3) Cancel installation
>1

Wait for the installation to complete:

Rust is installed now. Great!

To get started you may need to restart your current shell.
This would reload your PATH environment variable to include Cargo's bin directory($HOME/.cargo/bin).
To configure your current shell, you need to source
the Corresponding env file under $HOME/.cargo.
This is usually done by running one of the following (note the leading DOT):
. "$HOME/.cargo/env" # For sh/bash/zsh/ash/dash/pdksh
source "$HOME/.cargo/env.fish" # for fish
root@meta-240:~/fuse-3.16.1/build#
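Before building 3FS, make cargo available in the current shell (as the installer output above suggests), and confirm the toolchain works:

. "$HOME/.cargo/env"
cargo --version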

Build and install 3FS

git clone https://github.com/deepseek-ai/3fs
cd 3fs
git submodule update --init --recursive
./patches/apply.sh
cmake -S . -B build -DCMAKE_CXX_COMPILER=clang++-14 -DCMAKE_C_COMPILER=clang-14 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake --build build -j 32
 
# Optional tweak for SoftRoCE environments: lower max_sge in all config files
cd ~/3fs/configs
sed -i 's/max_sge = 16/max_sge = 1/g' `grep -rl max_sge`
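A quick way to confirm the build succeeded is to list the binaries that the rest of this guide relies on:

ls ~/3fs/build/bin/
# expected to include: admin_cli  mgmtd_main  meta_main  monitor_collector_main  storage_main  hf3fs_fuse_main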

Install and configure ClickHouse (meta node)

Install ClickHouse

sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg

ARCH=$(dpkg --print-architecture)
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg arch=${ARCH}] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

Install the ClickHouse server and client. During installation you will be prompted to set a password for the default user (e.g. Amazon123!!):

sudo apt-get install -y clickhouse-server clickhouse-client

Start ClickHouse

clickhouse start

Configure ClickHouse

clickhouse-client --password 'Amazon123!!'

Create the metrics tables

clickhouse-client --password 'Amazon123!!' -n < ~/3fs/deploy/sql/3fs-monitor.sql
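To confirm the metrics tables were created, you can list them with clickhouse-client. This assumes the SQL script creates a database named 3fs, which matches the db = '3fs' setting used in monitor_collector_main.toml below:

clickhouse-client --password 'Amazon123!!' -q "SHOW TABLES FROM 3fs"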

Configure the Admin Client (all nodes)

  1. Copy the files to each node
    mkdir -p /opt/3fs/{bin,etc}
    rsync -avz meta:~/3fs/build/bin/admin_cli /opt/3fs/bin
    rsync -avz meta:~/3fs/configs/admin_cli.toml /opt/3fs/etc
    rsync -avz meta:/etc/foundationdb/fdb.cluster /opt/3fs/etc
    
  2. Edit admin_cli.toml
    vim /opt/3fs/etc/admin_cli.toml
    cluster_id = "stage"
    [fdb]
    clusterFile = '/opt/3fs/etc/fdb.cluster'
    
  3. Verify that admin_cli works
    root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml help
    bench                          Usage: bench [--rank VAR] [--timeout VAR] [--coroutines VAR] [--seconds VAR] [--remove] path
    cd                             Usage: cd [-L] [--inode] path
    checksum                       Usage: checksum [--list] [--batch VAR] [--md5] [--fillZero] [--output VAR] path
    create                         Usage: create [--perm VAR] [--chain-table-id VAR] [--chain-table-ver VAR] [--chain-list VAR] [--chunk-size VAR] [--stripe-size VAR] path
    create-range                   Usage: create-range [--concurrency VAR] prefix inclusive_start exclusive_end
    create-target                  Usage: create-target --node-id VAR --disk-index VAR --target-id VAR --chain-id VAR [--add-chunk-size] [--chunk-size VAR...] [--use-new-chunk-engine]
    create-targets                 Usage: create-targets --node-id VAR [--disk-index VAR...] [--allow-existing-target] [--add-chunk-size] [--use-new-chunk-engine]
    current-user                   Usage: current-user
    

Install and configure the monitor service (meta node)

  1. Copy the files
    mkdir -p /opt/3fs/{bin,etc}
    mkdir -p /var/log/3fs
    cp ~/3fs/build/bin/monitor_collector_main /opt/3fs/bin
    cp ~/3fs/configs/monitor_collector_main.toml /opt/3fs/etc
    
  2. Edit monitor_collector_main.toml
    vim /opt/3fs/etc/monitor_collector_main.toml
    # change the following settings:
    [server.monitor_collector.reporter.clickhouse]
    db = '3fs' # default is 3fs
    host = '127.0.0.1' # node where ClickHouse runs
    passwd = 'Amazon123!!' # password set during ClickHouse installation
    port = '9000' # ClickHouse listens on port 9000 by default
    user = 'default' # default user
    
  3. Start the monitor service
    cp ~/3fs/deploy/systemd/monitor_collector_main.service /usr/lib/systemd/system
    systemctl start monitor_collector_main
    
  4. Check the monitor service status
    root@ip-192-168-8-218:~# systemctl status monitor_collector_main
    ● monitor_collector_main.service - monitor_collector_main Server
         Loaded: loaded (/lib/systemd/system/monitor_collector_main.service; enabled; vendor preset: enabled)
         Active: active (running) since Thu 2025-03-14 11:14:56 CST; 3s ago
       Main PID: 2954 (monitor_collect)
          Tasks: 58 (limit: 113376)
         Memory: 271.4M
            CPU: 211ms
         CGroup: /system.slice/monitor_collector_main.service
                 └─2954 /opt/3fs/bin/monitor_collector_main --cfg /opt/3fs/etc/monitor_collector_main.toml
    
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]     "fatal": {
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]       "type": "stream",
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]       "options": {
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]         "level": "FATAL",
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]         "stream": "stderr"
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]       }
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]     }
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO]   }
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] }
    Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336700818+08:00 monitor_collect: 2954 OnePhaseApplication.h:87 INFO] LogConfig: {"categories":{".":{"level":"INFO","in>
    

Install and configure the management service (meta node)

  1. Copy the files
    cp ~/3fs/build/bin/mgmtd_main /opt/3fs/bin
    cp ~/3fs/configs/{mgmtd_main.toml,mgmtd_main_launcher.toml,mgmtd_main_app.toml} \
    /opt/3fs/etc
    
  2. Edit mgmtd_main_app.toml
    vim /opt/3fs/etc/mgmtd_main_app.toml
    node_id = 1 # set the ID to 1
    
    vim /opt/3fs/etc/mgmtd_main_launcher.toml
    cluster_id = "stage" # cluster ID, globally unique
    
    [fdb]
    clusterFile = '/opt/3fs/etc/fdb.cluster'
    
    vim /opt/3fs/etc/mgmtd_main.toml
    [common.monitor.reporters.monitor_collector]
    remote_ip = "192.168.19.240:10000" # change the IP to the monitor node's IP
    
  3. Initialize the cluster
    /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
    "init-cluster --mgmtd /opt/3fs/etc/mgmtd_main.toml 1 1048576 16"
    # Here, parameter 1 is the chain table ID, 1048576 is the chunk size (1 MB), and 16 is the file stripe size. Then start the service and verify it.
    
  4. Start the service
    cp ~/3fs/deploy/systemd/mgmtd_main.service /usr/lib/systemd/system
    systemctl start mgmtd_main
    
  5. Check the service status
    systemctl status mgmtd_main
  6. Verify the service nodes
    root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
     --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \ 
     "list-nodes"
    /root/.profile: line 10: /.cargo/env: No such file or directory
    Id     Type     Status               Hostname           Pid   Tags  LastHeartbeatTime    ConfigVersion  ReleaseVersion
    1      MGMTD    PRIMARY_MGMTD        ip-192-168-19-240  2729  []    N/A                  1(UPTODATE)    250228-dev-1-999999-3b273a6d

Install and configure the meta service (meta node)

  1. Copy the files
    cp ~/3fs/build/bin/meta_main /opt/3fs/bin
    cp ~/3fs/configs/{meta_main_launcher.toml,meta_main.toml,meta_main_app.toml} \
     /opt/3fs/etc
    
  2. Edit meta_main_app.toml
    vi /opt/3fs/etc/meta_main_app.toml
    node_id = 100
    
    vi /opt/3fs/etc/meta_main_launcher.toml
    cluster_id = "stage"
    [mgmtd_client]
    mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
    
    vi /opt/3fs/etc/meta_main.toml
    [server.mgmtd_client]
    mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
    [common.monitor.reporters.monitor_collector]
    remote_ip = "192.168.19.240:10000"
    [server.fdb]
    clusterFile = '/opt/3fs/etc/fdb.cluster'
    
  3. Push the configuration to the management node
    /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
    --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
    "set-config --type META --file /opt/3fs/etc/meta_main.toml"
    
  4. Configure and start the meta service
    cp ~/3fs/deploy/systemd/meta_main.service /usr/lib/systemd/system
    systemctl start meta_main
    
  5. Check the meta service status
    root@ip-192-168-19-240:~# systemctl status meta_main
    ● meta_main.service - meta_main Server
         Loaded: loaded (/lib/systemd/system/meta_main.service; enabled; vendor preset: enabled)
         Active: active (running) since Thu 2025-03-14 11:15:52 CST; 3s ago
       Main PID: 3054 (meta_main)
          Tasks: 64 (limit: 113376)
         Memory: 408.8M
            CPU: 386ms
         CGroup: /system.slice/meta_main.service
                 └─3054 /opt/3fs/bin/meta_main --launcher_cfg /opt/3fs/etc/meta_main_launcher.toml --app-cfg /opt/3fs/etc/meta_main_app.toml
    
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO]         "rotate": "true",
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO]         "max_files": "10",
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO]         "max_file_size": "104857600",
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO]         "rotate_on_open": "false"
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO]       }
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO]     }
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO]   }
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] }
    Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968767842+08:00 meta_main: 3054 Utils.cc:161 INFO] LogConfig: {"categories":{".":{"level":"INFO","inherit":true,"propagate":"NONE",>
    Apr 03 02:15:53 ip-192-168-8-218 meta_main[3054]: Memory profiling already disabled
    
  6. Verify
    root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
     --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \ 
     "list-nodes"
    /root/.profile: line 10: /.cargo/env: No such file or directory
    Id     Type     Status               Hostname           Pid   Tags  LastHeartbeatTime    ConfigVersion  ReleaseVersion
    1      MGMTD    PRIMARY_MGMTD        ip-192-168-19-240  2729  []    N/A                  1(UPTODATE)    250228-dev-1-999999-3b273a6d
    100    META     HEARTBEAT_CONNECTED  ip-192-168-19-240  2873  []    2025-03-14 22:41:34  1(UPTODATE)    250228-dev-1-999999-3b273a6d

Install and configure the storage service (storage nodes)

  1. Format and mount the disks (adjust for the number of disks in your nodes; there are 3 disks here, hence {1..3})
    mkdir -p /storage/data{1..3}
    
    mkdir -p /var/log/3fs
    for i in {1..3};do mkfs.xfs -f -L data${i} /dev/nvme${i}n1;mount -o noatime,nodiratime -L data${i} /storage/data${i};done
    mkdir -p /storage/data{1..3}/3fs
    
  2. Update sysctl.conf
    echo "fs.aio-max-nr=67108864" >> /etc/sysctl.conf
    sysctl -p
    
  3. On the meta node, edit the original configuration files to set the node ID and mgmtd address
    vi ~/3fs/configs/storage_main_launcher.toml
    cluster_id = "stage"
    [mgmtd_client]
    mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
    
    vi ~/3fs/configs/storage_main.toml
    [server.mgmtd]
    mgmtd_server_address = ["RDMA://192.168.19.240:8000"]
    
    [common.monitor.reporters.monitor_collector]
    remote_ip = "192.168.19.240:10000"
    
    [server.targets]
    target_paths = ["/storage/data1/3fs","/storage/data2/3fs","/storage/data3/3fs"]
    
  4. On every storage node, copy the configuration files from the meta node
    rsync -avz meta:~/3fs/build/bin/storage_main /opt/3fs/bin
    rsync -avz meta:~/3fs/configs/{storage_main_launcher.toml,storage_main.toml,storage_main_app.toml} \
    /opt/3fs/etc
    
  5. Set the storage node ID; each ID must be globally unique. Here the storage nodes use 10001–10005
    vi /opt/3fs/etc/storage_main_app.toml
    node_id = 10001 # set 10001-10005 for the 5 storage nodes
    
  6. Then push the configuration on each storage node
    /opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml \
     --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]'  \
     "set-config --type STORAGE --file /opt/3fs/etc/storage_main.toml"
    
  7. Configure and start the service on each storage node
    rsync -avz meta:~/3fs/deploy/systemd/storage_main.service /usr/lib/systemd/system
    systemctl start storage_main
    
  8. Check the service status on each storage node
    systemctl status storage_main
  9. Check the cluster status
    root@ip-192-168-15-78:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
     --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \ 
     "list-nodes"
    /root/.profile: line 10: /.cargo/env: No such file or directory
    Id     Type     Status               Hostname           Pid   Tags  LastHeartbeatTime    ConfigVersion  ReleaseVersion
    1      MGMTD    PRIMARY_MGMTD        ip-192-168-19-240  2729  []    N/A                  1(UPTODATE)    250228-dev-1-999999-3b273a6d
    100    META     HEARTBEAT_CONNECTED  ip-192-168-19-240  2873  []    2025-03-14 22:41:34  1(UPTODATE)    250228-dev-1-999999-3b273a6d
    10001  STORAGE  HEARTBEAT_CONNECTED  ip-192-168-15-78   1447  []    2025-03-14 22:41:39  8(UPTODATE)    250228-dev-1-999999-3b273a6d
    10002  STORAGE  HEARTBEAT_CONNECTED  ip-192-168-7-82    1399  []    2025-03-14 22:41:39  8(UPTODATE)    250228-dev-1-999999-3b273a6d
    10003  STORAGE  HEARTBEAT_CONNECTED  ip-192-168-25-80   1487  []    2025-03-14 22:41:39  8(UPTODATE)    250228-dev-1-999999-3b273a6d
    10004  STORAGE  HEARTBEAT_CONNECTED  ip-192-168-8-218   1799  []    2025-03-14 22:41:39  8(UPTODATE)    250228-dev-1-999999-3b273a6d
    10005  STORAGE  HEARTBEAT_CONNECTED  ip-192-168-7-118   1641  []    2025-03-14 22:41:39  8(UPTODATE)    250228-dev-1-999999-3b273a6d
    

Configure 3FS metadata (management node)

  1. Create a user
    /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
    --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
    "user-add --root --admin 0 root"
    Uid 0
    Name root
    Token AACDso6V8QA42xI222222222(Expired at N/A)
    IsRootUser true
    IsAdmin true
    Gid 0
    SupplementaryGids
    
  2. Save the token generated in the previous step to /opt/3fs/etc/token.txt
    root@ip-192-168-19-240:~# echo "AACDso6V8QA42xI222222222" > /opt/3fs/etc/token.txt
  3. Create the CRAQ chain tables
    pip3 install -r ~/3fs/deploy/data_placement/requirements.txt
    
    # Notes for this step:
    # replication_factor is the number of replicas
    # min_targets_per_disk is the number of targets per disk
    python3 ~/3fs/deploy/data_placement/src/model/data_placement.py \
    -ql -relax -type CR --num_nodes 5 --replication_factor 2 --min_targets_per_disk 6
     
    # Notes for this step:
    # 1. set num_disks_per_node to the actual number of disks
    # 2. set node_id_begin and node_id_end to the actual IDs
    # (5 nodes x 3 disks x 6 targets per disk = 90 targets; with replication factor 2 this yields 45 chains)
    python3 ~/3fs/deploy/data_placement/src/setup/gen_chain_table.py \
    --chain_table_type CR --node_id_begin 10001 --node_id_end 10005 --num_disks_per_node \
     3 --num_targets_per_disk 6 --target_id_prefix 1 --chain_id_prefix 5 \
     --incidence_matrix_path output/DataPlacementModel-v_5-b_15-r_6-k_2-λ_2-lb_1-ub_0/incidence_matrix.pickle
  4. Create the storage targets
    /opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml \
    --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
    --config.user_info.token $(<"/opt/3fs/etc/token.txt") < output/create_target_cmd.txt
    
  5. Upload the chains and chain table to the mgmtd service
    /opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml \
    --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
    --config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chains output/generated_chains.csv"
    
    /opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml \
    --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
    --config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chain-table --desc stage 1 output/generated_chain_table.csv"
    
  6. Verify the upload succeeded

Check the chain status

root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \ 
"list-chains"
/root/.profile: line 10: /.cargo/env: No such file or directory
ChainId ReferencedBy ChainVersion Status PreferredOrder Target Target
500100001 1 10 SERVING [] 101000000101(SERVING-UPTODATE) 101000100101(SERVING-UPTODATE)
500100002 1 10 SERVING [] 101000000102(SERVING-UPTODATE) 101000100102(SERVING-UPTODATE)
500100003 1 1 SERVING [] 101000000103(SERVING-UPTODATE) 101000100103(SERVING-UPTODATE)
500100004 1 1 SERVING [] 101000000104(SERVING-UPTODATE) 101000100104(SERVING-UPTODATE)
500100005 1 1 SERVING [] 101000000105(SERVING-UPTODATE) 101000100105(SERVING-UPTODATE)
500100006 1 1 SERVING [] 101000000106(SERVING-UPTODATE) 101000100106(SERVING-UPTODATE)
500200001 1 10 SERVING [] 101000000201(SERVING-UPTODATE) 101000100201(SERVING-UPTODATE)
500200002 1 10 SERVING [] 101000000202(SERVING-UPTODATE) 101000100202(SERVING-UPTODATE)
500200003 1 1 SERVING [] 101000000203(SERVING-UPTODATE) 101000100203(SERVING-UPTODATE)
500200004 1 1 SERVING [] 101000000204(SERVING-UPTODATE) 101000100204(SERVING-UPTODATE)
500200005 1 1 SERVING [] 101000000205(SERVING-UPTODATE) 101000100205(SERVING-UPTODATE)
500200006 1 1 SERVING [] 101000000206(SERVING-UPTODATE) 101000100206(SERVING-UPTODATE)
500300001 1 10 SERVING [] 101000000301(SERVING-UPTODATE) 101000100301(SERVING-UPTODATE)
500300002 1 10 SERVING [] 101000000302(SERVING-UPTODATE) 101000100302(SERVING-UPTODATE)
500300003 1 1 SERVING [] 101000000303(SERVING-UPTODATE) 101000100303(SERVING-UPTODATE)
500300004 1 1 SERVING [] 101000000304(SERVING-UPTODATE) 101000100304(SERVING-UPTODATE)
500300005 1 1 SERVING [] 101000000305(SERVING-UPTODATE) 101000100305(SERVING-UPTODATE)
500300006 1 1 SERVING [] 101000000306(SERVING-UPTODATE) 101000100306(SERVING-UPTODATE)

Check the chain table status

root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \ 
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \ 
"list-chain-tables"
/root/.profile: line 10: /.cargo/env: No such file or directory
ChainTableId  ChainTableVersion  ChainCount  ReplicaCount  Desc
1             1                  6           2             stage
1             2                  45          2             stage

Configure the FUSE client (reusing the meta node)

  1. Copy the configuration files
    cp ~/3fs/build/bin/hf3fs_fuse_main /opt/3fs/bin
    cp ~/3fs/configs/{hf3fs_fuse_main_launcher.toml,hf3fs_fuse_main.toml,hf3fs_fuse_main_app.toml} \
    /opt/3fs/etc
    
  2. Create the mount point
    mkdir -p /3fs/stage
  3. Edit hf3fs_fuse_main_launcher.toml
    vi /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
    cluster_id = "stage"
    mountpoint = '/3fs/stage'
    token_file = '/opt/3fs/etc/token.txt'
    [mgmtd_client]
    mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
    
    vi /opt/3fs/etc/hf3fs_fuse_main.toml
    [mgmtd]
    mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
    [common.monitor.reporters.monitor_collector]
    remote_ip = "192.168.19.240:10000"
    
  4. Push the FUSE client configuration to the mgmtd service
    /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
    --config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
    "set-config --type FUSE --file /opt/3fs/etc/hf3fs_fuse_main.toml"
    
  5. Start the FUSE client
    cp ~/3fs/deploy/systemd/hf3fs_fuse_main.service /usr/lib/systemd/system
    systemctl start hf3fs_fuse_main
    
  6. Check the service status
    ubuntu@ip-192-168-19-240:~$ systemctl status hf3fs_fuse_main
    ● hf3fs_fuse_main.service - fuse_main Server
         Loaded: loaded (/lib/systemd/system/hf3fs_fuse_main.service; enabled; vendor preset: enabled)
         Active: active (running) since Fri 2025-03-14 21:59:46 CST; 45min ago
       Main PID: 5064 (hf3fs_fuse_main)
          Tasks: 64 (limit: 75552)
         Memory: 505.4M
            CPU: 14min 8.476s
         CGroup: /system.slice/hf3fs_fuse_main.service
                 ├─5064 /opt/3fs/bin/hf3fs_fuse_main --launcher_cfg /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
                 └─5134 fusermount3 --auto-unmount -- /3fs/stage
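Before running FIO, a quick smoke test of the mount point confirms that reads and writes go through the FUSE client (the file name below is arbitrary):

dd if=/dev/zero of=/3fs/stage/write_test bs=1M count=100
ls -lh /3fs/stage/
rm /3fs/stage/write_test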

Performance test (reusing the meta node)

Install FIO

apt install -y fio

Run the FIO test

fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 \
-rw=read -bs=4M --group_reporting -size=100M -time_based -runtime=3000 \ 
-name=2depth_128file_4M_direct_read_bw -directory=/3fs/stage

On a single c6a.8xlarge node this test achieves about 1.43 GiB/s (≈1.54 GB/s) of read bandwidth, effectively saturating the instance's 12.5 Gbps network interface.
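The job above measures read bandwidth only. If you also want a rough write number, the same job can be rerun with -rw=write (this variant was not part of the original test, so results will differ):

fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 \
-rw=write -bs=4M --group_reporting -size=100M -time_based -runtime=3000 \
-name=2depth_128file_4M_direct_write_bw -directory=/3fs/stage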

Summary

This post shows how to build a 3FS cluster on AWS with SoftRoCE on entry-level instances (such as c6a.8xlarge, with no EFA required) and validates its performance with FIO. The setup saturates the c6a.8xlarge's network bandwidth, allowing users without an InfiniBand (IB) network to use the high-performance 3FS file system for large-model training and evaluation in development and test environments, significantly improving data access efficiency.


About the Author

朱坦

Solutions Architect at Amazon Web Services, responsible for data and storage architecture design, currently focused on storage design and optimization for artificial intelligence and machine learning, data analytics, and high-performance computing workloads such as EDA, autonomous driving, and genomics analysis.