From February 24 to 28, 2025, DeepSeek ran a five-day "Open Source Week", and on the final day it open-sourced the Fire-Flyer File System (3FS), a parallel file system. The release shows how 3FS squeezes the full performance out of SSDs and RDMA (Remote Direct Memory Access) networks, and how it supports large-model training and inference end to end, including KVCache offloading and vector databases, delivering low-cost, high-throughput data access for AI training and inference workloads.
Key Features
- High-performance data access: aggregates the throughput of thousands of SSDs and the network bandwidth of hundreds of storage nodes to deliver read throughput of up to 6.6 TiB/s. It supports high-throughput parallel reads and writes in large clusters and optimizes data-loading efficiency for AI training and inference tasks.
- Standard file interface: provides a stateless metadata service backed by a transactional key-value store (e.g. FoundationDB), so users do not need to learn a new storage API.
- Optimized for AI workloads:
  - Data preparation: efficiently manages large volumes of intermediate outputs and supports hierarchical directory structures.
  - Data loading: supports random access across compute nodes, with no need for prefetching or dataset shuffling.
  - Checkpointing: provides high-throughput parallel checkpointing for large-scale training.
  - KVCache: offers a high-throughput, large-capacity cache alternative for inference tasks, improving inference efficiency.
- High scalability and flexibility: supports large-scale cluster deployments, from a single node to thousands of nodes, across diverse application scenarios.
Technical Characteristics
- Disaggregated architecture: compute and storage are separated, storage resources are pooled under central management, and a high-speed network (e.g. RDMA) moves data efficiently. Applications access storage in a location-independent way, which simplifies resource management.
- Strong consistency: 3FS builds on CRAQ (Chain Replication with Apportioned Queries). Chain replication keeps data consistent across multiple replicas, while apportioned queries optimize read performance and reduce latency.
- File interfaces: a stateless metadata service backed by a transactional key-value store (e.g. FoundationDB) exposes a standard file interface, so there is no new storage API to learn.
- Direct I/O and RDMA optimizations: Direct I/O accesses SSDs directly and bypasses the file cache, reducing CPU and memory overhead, while RDMA provides efficient data transfer for further performance gains.
- KVCache: for inference workloads, KVCache stores key intermediate results to avoid recomputation, significantly improving inference efficiency. Combining high throughput with large capacity, it is a low-cost alternative to DRAM-only caching.
- Data-locality optimization: optimized data layout and access patterns reduce transfer latency and bandwidth consumption, which is especially valuable in large-scale distributed training and inference.
For this test we used c6a.8xlarge instances and built a cluster of 5 storage nodes plus 1 client node to walk through the official 3FS deployment guide and validate the corresponding technical characteristics.
Environment Preparation
Deployment Architecture
Installation Environment
| NO | IP | Spec | Role | OS | Hard disk | Remark |
|---|---|---|---|---|---|---|
| 1 | 192.168.19.240 | c6a.8xlarge | meta/monitor/management/fuse client | Ubuntu 22.04 | 1*300G Disk (GP3 800MB/s 3000IOPS) | 12.5G ENA |
| 2 | 192.168.15.58 | c6a.8xlarge | storage1 | Ubuntu 22.04 | 3*500G Disk (GP3 800MB/s 3000IOPS) | 12.5G ENA |
| 3 | 192.168.7.82 | c6a.8xlarge | storage2 | Ubuntu 22.04 | 3*500G Disk (GP3 800MB/s 3000IOPS) | 12.5G ENA |
| 4 | 192.168.25.80 | c6a.8xlarge | storage3 | Ubuntu 22.04 | 3*500G Disk (GP3 800MB/s 3000IOPS) | 12.5G ENA |
| 5 | 192.168.8.218 | c6a.8xlarge | storage4 | Ubuntu 22.04 | 3*500G Disk (GP3 800MB/s 3000IOPS) | 12.5G ENA |
| 6 | 192.168.7.118 | c6a.8xlarge | storage5 | Ubuntu 22.04 | 3*500G Disk (GP3 800MB/s 3000IOPS) | 12.5G ENA |
Note: this deployment uses SoftRoCE over the 12.5 Gbps ENA, so EC2 instance types with EFA support are not required.
The services on each node and their corresponding configuration files follow the official recommendations, as shown below:
| Service | Binary | Config files | Node ID | Node |
|---|---|---|---|---|
| monitor | monitor_collector_main | monitor_collector_main.toml | | meta |
| admin_cli | admin_cli | admin_cli.toml, fdb.cluster | | meta, storage1, storage2, storage3, storage4, storage5 |
| mgmtd | mgmtd_main | mgmtd_main_launcher.toml, mgmtd_main.toml, mgmtd_main_app.toml, fdb.cluster | 1 | meta |
| meta | meta_main | meta_main_launcher.toml, meta_main.toml, meta_main_app.toml, fdb.cluster | 100 | meta |
| storage | storage_main | storage_main_launcher.toml, storage_main.toml, storage_main_app.toml | 10001~10005 | storage1, storage2, storage3, storage4, storage5 |
| client | hf3fs_fuse_main | hf3fs_fuse_main_launcher.toml, hf3fs_fuse_main.toml | | meta |
Install SoftRoCE
IPv6 must be enabled on all nodes:
Enable IPv6 on the VPC
Enable IPv6 on each instance's ENA interface
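The original steps do not list the exact commands; a minimal AWS CLI sketch is shown below, assuming placeholder VPC/subnet/ENI IDs that you must replace with your own (the same changes can be made in the console):
# 1) Attach an Amazon-provided IPv6 CIDR block to the VPC
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --amazon-provided-ipv6-cidr-block
# 2) Add an IPv6 CIDR to the subnet and enable IPv6 auto-assignment
aws ec2 associate-subnet-cidr-block --subnet-id subnet-0123456789abcdef0 --ipv6-cidr-block 2600:1f14:aaaa:bb00::/64
aws ec2 modify-subnet-attribute --subnet-id subnet-0123456789abcdef0 --assign-ipv6-address-on-creation
# 3) Assign an IPv6 address to each instance's ENA interface
aws ec2 assign-ipv6-addresses --network-interface-id eni-0123456789abcdef0 --ipv6-address-count 1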
Install and start SoftRoCE
The default Ubuntu 22.04 kernel on AWS does not load the rdma-rxe module, which can be verified as follows:
[ec2-user@ip-192-168-19-240 ~]$ modprobe rdma_rxe
modprobe: FATAL: Module rdma_rxe not found in directory /lib/modules/6.8.0-1021-aws
If the error above appears, install the rdma_rxe module and its related dependencies:
apt install linux-modules-extra-6.8.0-1021-aws
apt-get install libibverbs1 ibverbs-utils librdmacm1 libibumad3 ibverbs-providers rdma-core
apt install libibverbs-dev
apt install librdmacm-dev
apt install librdmacm-utils
apt install iproute2
Add the RDMA link
# add rdma link (using the existing ethernet device)
rdma link add rxe_eth0 type rxe netdev ens5
Verify the RDMA link status
# rdma link should now show the Soft-RoCE device
ubuntu@ip-192-168-19-240:~$ sudo -i rdma link
link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev ens5
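Optionally, the Soft-RoCE link can be sanity-checked with the ibverbs/rdmacm utilities installed above; a minimal example (using this cluster's meta node IP, adjust as needed):
# Show details of the emulated RDMA device
ibv_devinfo -d rxe_eth0
# RDMA ping-pong: run the server on one node ...
rping -s -a 192.168.19.240 -v -C 5
# ... and the client on another node
rping -c -a 192.168.19.240 -v -C 5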
Install and configure the SoftRoCE systemd service (optional)
Create the rdma-setup service:
vi /etc/systemd/system/rdma-setup.service
Add the following content:
[Unit]
Description=Setup RDMA RXE Link
After=network.target
[Service]
Type=oneshot
ExecStart=/sbin/modprobe rdma_rxe
ExecStart=/usr/bin/rdma link add rxe_eth0 type rxe netdev ens5
RemainAfterExit=true
[Install]
WantedBy=multi-user.target
Start the service and check its status:
root@ip-192-168-19-240:~# systemctl start rdma-setup.service
root@ip-192-168-19-240:~# systemctl status rdma-setup.service
● rdma-setup.service - Setup RDMA RXE Link
Loaded: loaded (/etc/systemd/system/rdma-setup.service; disabled; vendor preset: enabled)
Active: active (exited) since Thu 2025-03-14 01:33:00 CST; 3s ago
Process: 2410 ExecStart=/sbin/modprobe rdma_rxe (code=exited, status=0/SUCCESS)
Process: 2414 ExecStart=/usr/bin/rdma link add rxe_eth0 type rxe netdev ens5 (code=exited, status=0/SUCCESS)
Main PID: 2414 (code=exited, status=0/SUCCESS)
CPU: 17ms
Apr 03 01:33:00 ip-192-168-8-218 systemd[1]: Starting Setup RDMA RXE Link...
Apr 03 01:33:00 ip-192-168-8-218 systemd[1]: Finished Setup RDMA RXE Link.
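If the rxe link should also be recreated automatically after a reboot, enable the unit as well (not shown in the original output):
systemctl enable rdma-setup.service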
Emulate ibdev2netdev
3FS calls ibdev2netdev, a utility that ships with Mellanox NICs, when executing some of its commands, so in a SoftRoCE environment we need to fake its output. Add a script as follows:
vim /usr/sbin/ibdev2netdev
Add the following content:
#!/bin/bash
echo "rdma_0 port 1 ==> rxe_eth0 (Up)"
Install and Build 3FS
The following steps must be performed on every node in the cluster. Alternatively, you can complete them on one node and bake the result into an AMI to use as the base image for the subsequent meta/storage/management nodes.
Configure the Ubuntu package repositories
Open a terminal and inspect the sources.list file:
sudo nano /etc/apt/sources.list
Make sure the file contains valid repository entries. For Ubuntu 22.04 (jammy) the official sources should include entries such as:
deb http://archive.ubuntu.com/ubuntu/ jammy main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ jammy-security main restricted universe multiverse
Update the package index
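The update command itself is not shown in the original steps; with the default apt tooling it is simply:
apt update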
Install the required dependencies
apt install cmake libuv1-dev liblz4-dev liblzma-dev libdouble-conversion-dev \
 libprocps-dev libdwarf-dev libunwind-dev libaio-dev libgflags-dev \
 libgoogle-glog-dev libgtest-dev libgmock-dev clang-format-14 clang-14 clang-tidy-14 \
 lld-14 libgoogle-perftools-dev google-perftools libssl-dev ccache gcc-12 g++-12 \
 libboost-all-dev
Install FoundationDB
wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-clients_7.3.63-1_amd64.deb
wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-server_7.3.63-1_amd64.deb
dpkg -i foundationdb-clients_7.3.63-1_amd64.deb
dpkg -i foundationdb-server_7.3.63-1_amd64.deb
Install libfuse
wget https://github.com/libfuse/libfuse/releases/download/fuse-3.16.1/fuse-3.16.1.tar.gz
tar vzxf fuse-3.16.1.tar.gz
cd fuse-3.16.1/
mkdir build && cd build
apt install -y meson
meson setup ..
ninja && ninja install
Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
When prompted, enter "1" to proceed with the standard installation:
1) Proceed with standard installation (default - just press enter)
2) Customize installation
3) Cancel installation
>1
Wait for the installation to complete:
Rust is installed now. Great!
To get started you may need to restart your current shell.
This would reload your PATH environment variable to include Cargo's bin directory ($HOME/.cargo/bin).
To configure your current shell, you need to source
the corresponding env file under $HOME/.cargo.
This is usually done by running one of the following (note the leading DOT):
. "$HOME/.cargo/env" # For sh/bash/zsh/ash/dash/pdksh
source "$HOME/.cargo/env.fish" # for fish
root@meta-240:~/fuse-3.16.1/build#
Install 3FS
git clone https://github.com/deepseek-ai/3fs
cd 3fs
git submodule update --init --recursive
./patches/apply.sh
cmake -S . -B build -DCMAKE_CXX_COMPILER=clang++-14 -DCMAKE_C_COMPILER=clang-14 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake --build build -j 32
# Optional config (applied in this SoftRoCE setup): reduce max_sge from 16 to 1 in all config files
cd ~/3fs/configs
sed -i 's/max_sge = 16/max_sge = 1/g' `grep -rl max_sge`
Install and Configure ClickHouse (Meta node)
Install ClickHouse
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
ARCH=$(dpkg --print-architecture)
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg arch=${ARCH}] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client
During installation you will be prompted to set a password for the default user; this walkthrough uses Amazon123!!.
Start ClickHouse
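The start command is not listed in the original steps; with the packaged systemd unit it is typically:
systemctl start clickhouse-server
systemctl status clickhouse-server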
Configure ClickHouse
clickhouse-client --password 'Amazon123!!'
Create the metrics table:
clickhouse-client --password 'Amazon123!!' -n < ~/3fs/deploy/sql/3fs-monitor.sql
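Optionally, confirm that the monitoring tables were created, assuming the default 3fs database name referenced by the monitor configuration below:
clickhouse-client --password 'Amazon123!!' --query "SHOW TABLES FROM 3fs"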
Configure the Admin Client (all nodes)
- Copy the files to each node
mkdir -p /opt/3fs/{bin,etc}
rsync -avz meta:~/3fs/build/bin/admin_cli /opt/3fs/bin
rsync -avz meta:~/3fs/configs/admin_cli.toml /opt/3fs/etc
rsync -avz meta:/etc/foundationdb/fdb.cluster /opt/3fs/etc
- Edit admin_cli.toml
vim /opt/3fs/etc/admin_cli.toml
cluster_id = "stage"
[fdb]
clusterFile = '/opt/3fs/etc/fdb.cluster'
- Verify that admin_cli works
root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml help
bench Usage: bench [--rank VAR] [--timeout VAR] [--coroutines VAR] [--seconds VAR] [--remove] path
cd Usage: cd [-L] [--inode] path
checksum Usage: checksum [--list] [--batch VAR] [--md5] [--fillZero] [--output VAR] path
create Usage: create [--perm VAR] [--chain-table-id VAR] [--chain-table-ver VAR] [--chain-list VAR] [--chunk-size VAR] [--stripe-size VAR] path
create-range Usage: create-range [--concurrency VAR] prefix inclusive_start exclusive_end
create-target Usage: create-target --node-id VAR --disk-index VAR --target-id VAR --chain-id VAR [--add-chunk-size] [--chunk-size VAR...] [--use-new-chunk-engine]
create-targets Usage: create-targets --node-id VAR [--disk-index VAR...] [--allow-existing-target] [--add-chunk-size] [--use-new-chunk-engine]
current-user Usage: current-user
Install and Configure the Monitor Service (Meta node)
- Copy the files
mkdir -p /opt/3fs/{bin,etc}
mkdir -p /var/log/3fs
cp ~/3fs/build/bin/monitor_collector_main /opt/3fs/bin
cp ~/3fs/configs/monitor_collector_main.toml /opt/3fs/etc
- Edit monitor_collector_main.toml
vim /opt/3fs/etc/monitor_collector_main.toml
# Update the following settings:
[server.monitor_collector.reporter.clickhouse]
db = '3fs'             # defaults to 3fs
host = '127.0.0.1'     # node where ClickHouse is running
passwd = 'Amazon123!!' # the password set during ClickHouse installation
port = '9000'          # ClickHouse listens on port 9000 by default
user = 'default'       # defaults to the default user
- Start the monitor service
cp ~/3fs/deploy/systemd/monitor_collector_main.service /usr/lib/systemd/system
systemctl start monitor_collector_main
- Check the monitor service status
root@ip-192-168-8-218:~# systemctl status monitor_collector_main
● monitor_collector_main.service - monitor_collector_main Server
Loaded: loaded (/lib/systemd/system/monitor_collector_main.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2025-03-14 11:14:56 CST; 3s ago
Main PID: 2954 (monitor_collect)
Tasks: 58 (limit: 113376)
Memory: 271.4M
CPU: 211ms
CGroup: /system.slice/monitor_collector_main.service
└─2954 /opt/3fs/bin/monitor_collector_main --cfg /opt/3fs/etc/monitor_collector_main.toml
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] "fatal": {
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] "type": "stream",
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] "options": {
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] "level": "FATAL",
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] "stream": "stderr"
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] }
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] }
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] }
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336680550+08:00 monitor_collect: 2954 LogConfig.cc:96 INFO] }
Apr 03 02:14:56 ip-192-168-8-218 monitor_collector_main[2954]: [2025-04-03T02:14:56.336700818+08:00 monitor_collect: 2954 OnePhaseApplication.h:87 INFO] LogConfig: {"categories":{".":{"level":"INFO","in>
Install and Configure the Management Service (Meta node)
- Copy the files
cp ~/3fs/build/bin/mgmtd_main /opt/3fs/bin
cp ~/3fs/configs/{mgmtd_main.toml,mgmtd_main_launcher.toml,mgmtd_main_app.toml} \
/opt/3fs/etc
- Edit the mgmtd configuration files
vim /opt/3fs/etc/mgmtd_main_app.toml
node_id = 1 # set the node ID to 1
vim /opt/3fs/etc/mgmtd_main_launcher.toml
cluster_id = "stage" # cluster ID, must be globally unique
[fdb]
clusterFile = '/opt/3fs/etc/fdb.cluster'
vim /opt/3fs/etc/mgmtd_main.toml
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.19.240:10000" # 这里将IP改为mon节点的IP
- Initialize the cluster
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
"init-cluster --mgmtd /opt/3fs/etc/mgmtd_main.toml 1 1048576 16"
# Parameter 1 is the chain table ID, 1048576 is the chunk size (1 MB), and 16 is the file stripe size.
- Start the service
cp ~/3fs/deploy/systemd/mgmtd_main.service /usr/lib/systemd/system
systemctl start mgmtd_main
- Check the service status
systemctl status mgmtd_main
- Verify the service node
root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"list-nodes"
/root/.profile: line 10: /.cargo/env: No such file or directory
Id Type Status Hostname Pid Tags LastHeartbeatTime ConfigVersion ReleaseVersion
1 MGMTD PRIMARY_MGMTD ip-192-168-19-240 2729 [] N/A 1(UPTODATE) 250228-dev-1-999999-3b273a6d
Install and Configure the Meta Service (Meta node)
- Copy the files
cp ~/3fs/build/bin/meta_main /opt/3fs/bin
cp ~/3fs/configs/{meta_main_launcher.toml,meta_main.toml,meta_main_app.toml} \
/opt/3fs/etc
- Edit the meta configuration files
vi /opt/3fs/etc/meta_main_app.toml
node_id = 100
vi /opt/3fs/etc/meta_main_launcher.toml
cluster_id = "stage"
[mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
vi /opt/3fs/etc/meta_main.toml
[server.mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.19.240:10000"
[server.fdb]
clusterFile = '/opt/3fs/etc/fdb.cluster'
- Push the meta configuration to the mgmtd service
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"set-config --type META --file /opt/3fs/etc/meta_main.toml"
- Configure and start the meta service
cp ~/3fs/deploy/systemd/meta_main.service /usr/lib/systemd/system
systemctl start meta_main
- Check the meta service status
root@ip-192-168-19-240:~# systemctl status meta_main
● meta_main.service - meta_main Server
Loaded: loaded (/lib/systemd/system/meta_main.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2025-03-14 11:15:52 CST; 3s ago
Main PID: 3054 (meta_main)
Tasks: 64 (limit: 113376)
Memory: 408.8M
CPU: 386ms
CGroup: /system.slice/meta_main.service
└─3054 /opt/3fs/bin/meta_main --launcher_cfg /opt/3fs/etc/meta_main_launcher.toml --app-cfg /opt/3fs/etc/meta_main_app.toml
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] "rotate": "true",
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] "max_files": "10",
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] "max_file_size": "104857600",
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] "rotate_on_open": "false"
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] }
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] }
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] }
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968741322+08:00 meta_main: 3054 LogConfig.cc:96 INFO] }
Apr 03 02:15:52 ip-192-168-8-218 meta_main[3054]: [2025-04-03T02:15:52.968767842+08:00 meta_main: 3054 Utils.cc:161 INFO] LogConfig: {"categories":{".":{"level":"INFO","inherit":true,"propagate":"NONE",>
Apr 03 02:15:53 ip-192-168-8-218 meta_main[3054]: Memory profiling already disabled
- Verify
root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"list-nodes"
/root/.profile: line 10: /.cargo/env: No such file or directory
Id Type Status Hostname Pid Tags LastHeartbeatTime ConfigVersion ReleaseVersion
1 MGMTD PRIMARY_MGMTD ip-192-168-19-240 2729 [] N/A 1(UPTODATE) 250228-dev-1-999999-3b273a6d
100 META HEARTBEAT_CONNECTED ip-192-168-19-240 2873 [] 2025-03-14 22:41:34 1(UPTODATE) 250228-dev-1-999999-3b273a6d
Install and Configure the Storage Service (storage nodes)
- Format and mount the disks (adjust for the number of disks on your nodes; each storage node here has 3 data disks, hence {1..3}; see the fstab note after these commands)
mkdir -p /storage/data{1..3}
mkdir -p /var/log/3fs
for i in {1..3};do mkfs.xfs -f -L data${i} /dev/nvme${i}n1;mount -o noatime,nodiratime -L data${i} /storage/data${i};done
mkdir -p /storage/data{1..3}/3fs
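Note that the mounts above do not persist across reboots; a hedged sketch for adding label-based entries to /etc/fstab (verify it matches your disk layout before relying on it):
for i in {1..3}; do
  echo "LABEL=data${i} /storage/data${i} xfs noatime,nodiratime 0 0" >> /etc/fstab
done
mount -a   # re-reads fstab; should report no errors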
- Update sysctl.conf
echo "fs.aio-max-nr=67108864" >> /etc/sysctl.conf
sysctl -p
- On the meta node, edit the template config files, setting the cluster ID, mgmtd addresses, monitor IP, and storage target paths (the per-node node_id is set later on each storage node)
vi ~/3fs/configs/storage_main_launcher.toml
cluster_id = "stage"
[mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
vi ~/3fs/configs/storage_main.toml
[server.mgmtd]
mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.19.240:10000"
[server.targets]
target_paths = ["/storage/data1/3fs","/storage/data2/3fs","/storage/data3/3fs"]
- On every storage node, copy the binary and config files from the meta node
rsync -avz meta:~/3fs/build/bin/storage_main /opt/3fs/bin
rsync -avz meta:~/3fs/configs/{storage_main_launcher.toml,storage_main.toml,storage_main_app.toml} \
/opt/3fs/etc
- Set the storage node ID; every ID must be globally unique. The storage nodes here use 10001~10005
vi /opt/3fs/etc/storage_main_app.toml
node_id = 10001 # set to 10001-10005 for the 5 storage nodes respectively
- Then, on each storage node, update the configuration
/opt/3fs/bin/admin_cli --cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"set-config --type STORAGE --file /opt/3fs/etc/storage_main.toml"
- Configure and start the service on each storage node
rsync -avz meta:~/3fs/deploy/systemd/storage_main.service /usr/lib/systemd/system
systemctl start storage_main
- Check the service status on each storage node
systemctl status storage_main
- Check the cluster status
root@ip-192-168-15-78:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"list-nodes"
/root/.profile: line 10: /.cargo/env: No such file or directory
Id Type Status Hostname Pid Tags LastHeartbeatTime ConfigVersion ReleaseVersion
1 MGMTD PRIMARY_MGMTD ip-192-168-19-240 2729 [] N/A 1(UPTODATE) 250228-dev-1-999999-3b273a6d
100 META HEARTBEAT_CONNECTED ip-192-168-19-240 2873 [] 2025-03-14 22:41:34 1(UPTODATE) 250228-dev-1-999999-3b273a6d
10001 STORAGE HEARTBEAT_CONNECTED ip-192-168-15-78 1447 [] 2025-03-14 22:41:39 8(UPTODATE) 250228-dev-1-999999-3b273a6d
10002 STORAGE HEARTBEAT_CONNECTED ip-192-168-7-82 1399 [] 2025-03-14 22:41:39 8(UPTODATE) 250228-dev-1-999999-3b273a6d
10003 STORAGE HEARTBEAT_CONNECTED ip-192-168-25-80 1487 [] 2025-03-14 22:41:39 8(UPTODATE) 250228-dev-1-999999-3b273a6d
10004 STORAGE HEARTBEAT_CONNECTED ip-192-168-8-218 1799 [] 2025-03-14 22:41:39 8(UPTODATE) 250228-dev-1-999999-3b273a6d
10005 STORAGE HEARTBEAT_CONNECTED ip-192-168-7-118 1641 [] 2025-03-14 22:41:39 8(UPTODATE) 250228-dev-1-999999-3b273a6d
Configure 3FS Metadata (management node)
- Create the admin user
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"user-add --root --admin 0 root"
Uid 0
Name root
Token AACDso6V8QA42xI222222222(Expired at N/A)
IsRootUser true
IsAdmin true
Gid 0
SupplementaryGids
- Save the token generated in the previous step to /opt/3fs/etc/token.txt
root@ip-192-168-19-240:~# echo "AACDso6V8QA42xI222222222" > /opt/3fs/etc/token.txt
- Generate the CRAQ chain tables
pip3 install -r ~/3fs/deploy/data_placement/requirements.txt
# Notes for this step:
# replication_factor is the number of replicas
# min_targets_per_disk is the number of targets per disk
python3 ~/3fs/deploy/data_placement/src/model/data_placement.py \
 -ql -relax -type CR --num_nodes 5 --replication_factor 2 --min_targets_per_disk 6
# Notes for this step:
# 1. Set num_disks_per_node to the actual number of disks per node
# 2. Set node_id_begin and node_id_end to the actual storage node IDs
python3 ~/3fs/deploy/data_placement/src/setup/gen_chain_table.py \
 --chain_table_type CR --node_id_begin 10001 --node_id_end 10005 \
 --num_disks_per_node 3 --num_targets_per_disk 6 --target_id_prefix 1 --chain_id_prefix 5 \
 --incidence_matrix_path output/DataPlacementModel-v_5-b_15-r_6-k_2-λ_2-lb_1-ub_0/incidence_matrix.pickle
- Create the storage targets
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
--config.user_info.token $(<"/opt/3fs/etc/token.txt") < output/create_target_cmd.txt
- Upload the chains and the chain table to the mgmtd service
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
--config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chains output/generated_chains.csv"
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
--config.user_info.token $(<"/opt/3fs/etc/token.txt") "upload-chain-table --desc stage 1 output/generated_chain_table.csv"
- Verify that the upload succeeded
Check the chain status:
root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"list-chains"
/root/.profile: line 10: /.cargo/env: No such file or directory
ChainId ReferencedBy ChainVersion Status PreferredOrder Target Target
500100001 1 10 SERVING [] 101000000101(SERVING-UPTODATE) 101000100101(SERVING-UPTODATE)
500100002 1 10 SERVING [] 101000000102(SERVING-UPTODATE) 101000100102(SERVING-UPTODATE)
500100003 1 1 SERVING [] 101000000103(SERVING-UPTODATE) 101000100103(SERVING-UPTODATE)
500100004 1 1 SERVING [] 101000000104(SERVING-UPTODATE) 101000100104(SERVING-UPTODATE)
500100005 1 1 SERVING [] 101000000105(SERVING-UPTODATE) 101000100105(SERVING-UPTODATE)
500100006 1 1 SERVING [] 101000000106(SERVING-UPTODATE) 101000100106(SERVING-UPTODATE)
500200001 1 10 SERVING [] 101000000201(SERVING-UPTODATE) 101000100201(SERVING-UPTODATE)
500200002 1 10 SERVING [] 101000000202(SERVING-UPTODATE) 101000100202(SERVING-UPTODATE)
500200003 1 1 SERVING [] 101000000203(SERVING-UPTODATE) 101000100203(SERVING-UPTODATE)
500200004 1 1 SERVING [] 101000000204(SERVING-UPTODATE) 101000100204(SERVING-UPTODATE)
500200005 1 1 SERVING [] 101000000205(SERVING-UPTODATE) 101000100205(SERVING-UPTODATE)
500200006 1 1 SERVING [] 101000000206(SERVING-UPTODATE) 101000100206(SERVING-UPTODATE)
500300001 1 10 SERVING [] 101000000301(SERVING-UPTODATE) 101000100301(SERVING-UPTODATE)
500300002 1 10 SERVING [] 101000000302(SERVING-UPTODATE) 101000100302(SERVING-UPTODATE)
500300003 1 1 SERVING [] 101000000303(SERVING-UPTODATE) 101000100303(SERVING-UPTODATE)
500300004 1 1 SERVING [] 101000000304(SERVING-UPTODATE) 101000100304(SERVING-UPTODATE)
500300005 1 1 SERVING [] 101000000305(SERVING-UPTODATE) 101000100305(SERVING-UPTODATE)
500300006 1 1 SERVING [] 101000000306(SERVING-UPTODATE) 101000100306(SERVING-UPTODATE)
Check the chain table status:
root@ip-192-168-19-240:~# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"list-chain-tables"
/root/.profile: line 10: /.cargo/env: No such file or directory
ChainTableId ChainTableVersion ChainCount ReplicaCount Desc
1 1 6 2 stage
1 2 45 2 stage
Configure the FUSE Client (reusing the Meta node)
- Copy the binary and config files
cp ~/3fs/build/bin/hf3fs_fuse_main /opt/3fs/bin
cp ~/3fs/configs/{hf3fs_fuse_main_launcher.toml,hf3fs_fuse_main.toml,hf3fs_fuse_main_app.toml} \
/opt/3fs/etc
- Create the mount point (see the command below)
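The mount point referenced by the launcher config below must exist; the original steps omit the command, which is simply:
mkdir -p /3fs/stage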
- Edit hf3fs_fuse_main_launcher.toml and hf3fs_fuse_main.toml
vi /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
cluster_id = "stage"
mountpoint = '/3fs/stage'
token_file = '/opt/3fs/etc/token.txt'
[mgmtd_client]
mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
vi /opt/3fs/etc/hf3fs_fuse_main.toml
[mgmtd]
mgmtd_server_addresses = ["RDMA://192.168.19.240:8000"]
[common.monitor.reporters.monitor_collector]
remote_ip = "192.168.19.240:10000"
- Push the FUSE client configuration to the mgmtd service
/opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml \
--config.mgmtd_client.mgmtd_server_addresses '["RDMA://192.168.19.240:8000"]' \
"set-config —type FUSE —file /opt/3fs/etc/hf3fs_fuse_main.toml"
- Start the FUSE client
cp ~/3fs/deploy/systemd/hf3fs_fuse_main.service /usr/lib/systemd/system
systemctl start hf3fs_fuse_main
- Check the service status
ubuntu@ip-192-168-19-240:~$ systemctl status hf3fs_fuse_main
● hf3fs_fuse_main.service - fuse_main Server
Loaded: loaded (/lib/systemd/system/hf3fs_fuse_main.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2025-03-14 21:59:46 CST; 45min ago
Main PID: 5064 (hf3fs_fuse_main)
Tasks: 64 (limit: 75552)
Memory: 505.4M
CPU: 14min 8.476s
CGroup: /system.slice/hf3fs_fuse_main.service
├─5064 /opt/3fs/bin/hf3fs_fuse_main --launcher_cfg /opt/3fs/etc/hf3fs_fuse_main_launcher.toml
└─5134 fusermount3 --auto-unmount -- /3fs/stage
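Optionally, confirm that the file system is mounted and writable before benchmarking:
mount | grep /3fs/stage
df -h /3fs/stage
touch /3fs/stage/.write_test && rm /3fs/stage/.write_test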
Performance Testing (reusing the Meta node)
Install FIO:
apt install -y fio
Run the FIO test:
fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 \
-rw=read -bs=4M --group_reporting -size=100M -time_based -runtime=3000 \
-name=2depth_128file_4M_direct_read_bw -directory=/3fs/stage
In this test, a single c6a.8xlarge client delivered roughly 1.434 GB/s of read bandwidth, essentially saturating the instance's 12.5 Gbps network interface.
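For reference, a write-bandwidth variant of the same test can be run with the command below; it was not part of the original results, and numbers will differ since each write is replicated to two chain targets:
fio -numjobs=128 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 \
 -rw=write -bs=4M --group_reporting -size=100M -time_based -runtime=300 \
 -name=2depth_128file_4M_direct_write_bw -directory=/3fs/stage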
Summary
This post demonstrated building a 3FS cluster on AWS using SoftRoCE on entry-level instances (c6a.8xlarge, no EFA required) and validated it with FIO. The setup saturates the c6a.8xlarge network bandwidth, letting users without an InfiniBand (IB) network run the high-performance 3FS file system in development and test environments for large-model training and evaluation, with significantly improved data-access efficiency.