diff --git a/OneFlow/Classification/ConvNets/resnet50v1.5/reports/resnet50_oneflow_v0.4.0_report_0606.md b/OneFlow/Classification/ConvNets/resnet50v1.5/reports/resnet50_oneflow_v0.4.0_report_0606.md
new file mode 100644
index 00000000..35cd3206
--- /dev/null
+++ b/OneFlow/Classification/ConvNets/resnet50v1.5/reports/resnet50_oneflow_v0.4.0_report_0606.md
@@ -0,0 +1,66 @@
+# OneFlow ResNet50-V1.5 Benchmark Test Report
+
+This report summarizes the benchmark results for ResNet50-V1.5 on OneFlow v0.4.0 with mixed precision and dynamic loss scale enabled.
+
+## Test Environment
+
+All tests were run on 4 servers, each equipped with 8 V100-SXM2-16GB GPUs. The main hardware and software configuration is as follows:
+
+- Tesla V100-SXM2-16GB x 8
+- InfiniBand 100 Gb/sec (4X EDR), Mellanox Technologies MT27700 Family
+- Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
+- Memory 384G
+- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-206-generic x86_64)
+- CUDA Version: 10.2, Driver Version: 460.67
+- OneFlow: [v0.4.0@325160b](https://github.com/Oneflow-Inc/oneflow/tree/325160bcfb786b166b063e669aea345fadee2da7)
+- OneFlow-Benchmark: [c9a9342](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/c9a9342a40ff42c55da928a081b6d9c84a489594)
+- `nvidia-smi topo -m`
+
+```
+        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  CPU Affinity
+GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     NODE    0-11,24-35
+GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     NODE    0-11,24-35
+GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     PIX     0-11,24-35
+GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     PIX     0-11,24-35
+GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     SYS     12-23,36-47
+GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     SYS     12-23,36-47
+GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     SYS     12-23,36-47
+GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      SYS     12-23,36-47
+mlx5_0  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS      X
+
+Legend:
+
+  X    = Self
+  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
+  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
+  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
+  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
+  PIX  = Connection traversing at most a single PCIe bridge
+  NV#  = Connection traversing a bonded set of # NVLinks
+
+```
+
+## Test Descriptions
+
+- OneFlow version: [v0.4.0@325160b](https://github.com/Oneflow-Inc/oneflow/tree/325160bcfb786b166b063e669aea345fadee2da7)
+- OneFlow-Benchmark repository version: [c9a9342](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/c9a9342a40ff42c55da928a081b6d9c84a489594)
+- Dynamic Loss Scale: enabled
+- XLA: not used
+- There are four test groups: 1 node with 1 GPU, 1 node with 8 GPUs, 2 nodes with 16 GPUs, and 4 nodes with 32 GPUs. Each group is run 7 times, and the median of the 7 runs is reported as the final result.
+- In cnns/ofrecord_util.py, set [num_workers=3](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns/ofrecord_util.py#L88)
+
+## Final Results
+
+- ### FP16
+
+| num_nodes | gpu_num_per_node | batch_size_per_device | throughput | speedup |
+|-----------|------------------|-----------------------|------------|---------|
+| 1         | 1                | 256                   | 1452.84    | 1.00    |
+| 1         | 8                | 256                   | 9876.88    | 6.80    |
+| 2         | 8                | 256                   | 17256.82   | 11.88   |
+| 4         | 8                | 256                   | 28406.60   | 19.55   |
+
+- With 1 node and 8 GPUs (1n8g), setting [num_workers=5](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns/ofrecord_util.py#L88) in cnns/ofrecord_util.py yields a throughput of 10859.
+
+
+All logs are available at [rn50_dls_fp16_256_logs.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/oneflow_test_log/oneflow_0.4.0/rn50_dls_fp16_256_logs.zip).
diff --git a/OneFlow/Classification/ConvNets/resnet50v1.5/scripts/cp_logs.sh b/OneFlow/Classification/ConvNets/resnet50v1.5/scripts/cp_logs.sh
index 76b1e4d4..315747e1 100755
--- a/OneFlow/Classification/ConvNets/resnet50v1.5/scripts/cp_logs.sh
+++ b/OneFlow/Classification/ConvNets/resnet50v1.5/scripts/cp_logs.sh
@@ -6,8 +6,8 @@ REPEAT_ID=$4
 
 log_root=logs/oneflow
 log_dir=$log_root/${NUM_NODES}n${GPU_NUM_PER_NODE}g
-log_file=bert_base_b${BSZ}_fp32_${REPEAT_ID}.log
-summary_file=bert_base_b${BSZ}_fp32_${REPEAT_ID}.csv
+log_file=rn50_b${BSZ}_fp16_${REPEAT_ID}.log
+summary_file=rn50_b${BSZ}_fp16_${REPEAT_ID}.csv
 
 [ ! -d "${log_dir}" ] && mkdir -p ${log_dir}
 
diff --git a/OneFlow/LanguageModeling/BERT/reports/bert_base_oneflow_v0.4.0_report_0608.md b/OneFlow/LanguageModeling/BERT/reports/bert_base_oneflow_v0.4.0_report_0608.md
new file mode 100644
index 00000000..d5b017f8
--- /dev/null
+++ b/OneFlow/LanguageModeling/BERT/reports/bert_base_oneflow_v0.4.0_report_0608.md
@@ -0,0 +1,84 @@
+# BERT base Benchmark Test Report
+
+This report summarizes the benchmark results for BERT base on OneFlow v0.4.0 with mixed precision and dynamic loss scale enabled.
+
+## Test Environment
+
+All tests were run on 4 servers, each equipped with 8 V100-SXM2-16GB GPUs. The main hardware and software configuration is as follows:
+
+- Tesla V100-SXM2-16GB x 8
+- InfiniBand 100 Gb/sec (4X EDR), Mellanox Technologies MT27700 Family
+- Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
+- Memory 384G
+- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-206-generic x86_64)
+- CUDA Version: 10.2, Driver Version: 460.67
+- OneFlow: [v0.4.0@325160b](https://github.com/Oneflow-Inc/oneflow/tree/325160bcfb786b166b063e669aea345fadee2da7)
+- OneFlow-Benchmark: [c9a9342](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/c9a9342a40ff42c55da928a081b6d9c84a489594)
+- `nvidia-smi topo -m`
+
+  ```
+          GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  CPU Affinity    NUMA Affinity
+  GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     NODE    0-11,24-35      0
+  GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     NODE    0-11,24-35      0
+  GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     PIX     0-11,24-35      0
+  GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     PIX     0-11,24-35      0
+  GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     SYS     12-23,36-47     1
+  GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     SYS     12-23,36-47     1
+  GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     SYS     12-23,36-47     1
+  GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      SYS     12-23,36-47     1
+  mlx5_0  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS      X
+
+  Legend:
+
+    X    = Self
+    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
+    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
+    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
+    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
+    PIX  = Connection traversing at most a single PCIe bridge
+    NV#  = Connection traversing a bonded set of # NVLinks
+  ```
+
+## Test Descriptions
+
+- OneFlow version: [v0.4.0@325160b](https://github.com/Oneflow-Inc/oneflow/tree/325160bcfb786b166b063e669aea345fadee2da7)
+- OneFlow-Benchmark repository version: [c9a9342](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/c9a9342a40ff42c55da928a081b6d9c84a489594)
+- Dynamic Loss Scale: enabled
+- XLA: not used
+- There are four test groups: 1 node with 1 GPU, 1 node with 8 GPUs, 2 nodes with 16 GPUs, and 4 nodes with 32 GPUs. Each group is run 7 times, and the median of the 7 runs is reported as the final result.
+
+
+
+## Test Results
+
+### FP16 with clip
+
+- ### batch size = 160
+
+| num_nodes | gpu_num_per_node | batch_size_per_device | throughput | speedup |
+|-----------|------------------|-----------------------|------------|---------|
+| 1         | 1                | 160                   | 625.63     | 1.00    |
+| 1         | 8                | 160                   | 4573.62    | 7.31    |
+| 2         | 8                | 160                   | 8548.44    | 13.66   |
+| 4         | 8                | 160                   | 15955.70   | 25.50   |
+
+- ### batch size = 128
+
+| num_nodes | gpu_num_per_node | batch_size_per_device | throughput | speedup |
+|-----------|------------------|-----------------------|------------|---------|
+| 1         | 1                | 128                   | 616.25     | 1.00    |
+| 1         | 8                | 128                   | 4440.89    | 7.21    |
+| 2         | 8                | 128                   | 8233.78    | 13.36   |
+| 4         | 8                | 128                   | 14160.75   | 22.98   |
+
+- ### batch size = 64
+
+| num_nodes | gpu_num_per_node | batch_size_per_device | throughput | speedup |
+|-----------|------------------|-----------------------|------------|---------|
+| 1         | 1                | 64                    | 562.91     | 1.00    |
+| 1         | 8                | 64                    | 3750.87    | 6.66    |
+| 2         | 8                | 64                    | 6301.08    | 11.19   |
+| 4         | 8                | 64                    | 9445.96    | 16.78   |
+
+All logs are available at [bert_dls_fp16_logs.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/oneflow_test_log/oneflow_0.4.0/bert_dls_fp16_logs.zip).
+
diff --git a/OneFlow/LanguageModeling/BERT/scripts/bert_base_pretrain.sh b/OneFlow/LanguageModeling/BERT/scripts/bert_base_pretrain.sh
index c02bbbb1..7375a858 100755
--- a/OneFlow/LanguageModeling/BERT/scripts/bert_base_pretrain.sh
+++ b/OneFlow/LanguageModeling/BERT/scripts/bert_base_pretrain.sh
@@ -39,6 +39,7 @@ python3 ./$BENCH_ROOT/run_pretraining.py \
   --log_dir=./log \
   --model_save_every_n_iter=10000 \
   --save_last_snapshot=False \
+  --use_fp16 \
   --model_save_dir=./snapshots
 
 #--node_ips='10.11.0.2','10.11.0.3','10.11.0.4','10.11.0.5' \