Proof / Evidence Hub (Examples, Benchmarks, Research)¶
The key links are collected here so that visitors arriving from ads or referrals can go straight to the evidence.
Acceptance checklist (3 items)¶
Fixing what counts as "OK to deploy" up front reduces indecision and speeds up adoption.
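- Performance: confirm the minimum bar numerically (e.g., step/s, ns/day)
→ Benchmark: /benchmarks/lammps-allegro-llzo-bulk-50k/
- Reproducibility: the same input yields the same result (it does not break across environments)
→ A foundation you can restore even after breaking it (Baseline + Runbook): /lp/rescue-ready-baseline/
- Operability: it keeps running day to day after acceptance (Slurm, logs, and maintenance all work)
→ Start by sharing your requirements in one minute: /lp/h200-nvl-1gpu-runbook/#contact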
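Fastest if you have input files
Paste pw.x (QE) or in.lammps inputs, sharing only what you are able to. If you cannot paste anything, three points are enough to proceed: goal, deadline, and current status.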
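Real logs (excerpts): Performance / Reproducibility / Operability¶
These are the actual logs that correspond to the Evidence Hub acceptance checklist (no sensitive information is included).
1) Performance¶
- Target log: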
~/qe_bench/llzo_demo_qe/log.300K.lammps
386000 651.10626 -471925.79 -471990.34 64.552323 5928.7809
388000 631.44725 -471921.21 -471983.81 62.603279 10206.577
390000 944.71392 -471883.58 -471977.25 93.66133 4906.1555
392000 360.11325 -471952.59 -471988.3 35.702539 8098.5491
394000 634.54382 -471899.04 -471961.95 62.910281 -1505.0129
396000 542.438 -471924.62 -471978.4 53.778676 13390.788
398000 537.233 -471915.97 -471969.23 53.262639 9760.8953
400000 768.84021 -471866.12 -471942.34 76.224765 -1216.0175
Loop time of 2759.68 on 1 procs for 400000 steps with 768 atoms
Performance: 6.262 ns/day, 3.833 hours/ns, 144.945 timesteps/s, 111.317 katom-step/s
105.3% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 2618.6 | 2618.6 | 2618.6 | 0.0 | 94.89
Neigh | 101.68 | 101.68 | 101.68 | 0.0 | 3.68
Comm | 18.388 | 18.388 | 18.388 | 0.0 | 0.67
Output | 0.064166 | 0.064166 | 0.064166 | 0.0 | 0.00
Modify | 15.671 | 15.671 | 15.671 | 0.0 | 0.57
Other | | 5.248 | | | 0.19
Nlocal: 768 ave 768 max 768 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 3495 ave 3495 max 3495 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 136138 ave 136138 max 136138 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 136138
Ave neighs/atom = 177.26302
Neighbor list builds = 10367
Dangerous builds = 0
write_data final_${Tlab}.data
write_data final_300K.data
System init for write_data ...
Generated 0 of 6 mixed pair_coeff terms from geometric mixing rule
Total wall time: 0:46:14
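The performance item of the checklist can be checked mechanically against a log like the one above. The following is a minimal sketch, not the site's actual tooling: the function names and the threshold passed to `meets_floor` are hypothetical, and the 0.5 fs timestep used in the cross-check is an assumption inferred from the reported numbers, not something the log excerpt states. The regular expression assumes the `ns/day` form of the `Performance:` line that LAMMPS prints for `units metal`.

```python
import math
import re

# Matches the LAMMPS summary line, e.g.
# "Performance: 6.262 ns/day, 3.833 hours/ns, 144.945 timesteps/s, ..."
PERF_RE = re.compile(
    r"Performance:\s*([\d.]+)\s*ns/day,\s*([\d.]+)\s*hours/ns,\s*([\d.]+)\s*timesteps/s"
)


def parse_performance(log_text):
    """Return ns/day and timesteps/s from the last Performance line, or None."""
    matches = PERF_RE.findall(log_text)
    if not matches:
        return None
    ns_day, _hours_ns, steps_s = matches[-1]
    return {"ns_per_day": float(ns_day), "timesteps_per_s": float(steps_s)}


def meets_floor(perf, min_ns_per_day):
    """True when a Performance line was found and it reaches the agreed minimum."""
    return perf is not None and perf["ns_per_day"] >= min_ns_per_day


def ns_per_day_from_rate(timesteps_per_s, timestep_fs):
    """Convert a step rate to ns/day: steps/s * 86400 s/day * fs/step * 1e-6 ns/fs."""
    return timesteps_per_s * 86400 * timestep_fs * 1e-6


def rate_consistent(perf, timestep_fs, rel_tol=1e-3):
    """Cross-check that ns/day and timesteps/s agree for an assumed timestep."""
    expected = ns_per_day_from_rate(perf["timesteps_per_s"], timestep_fs)
    return math.isclose(expected, perf["ns_per_day"], rel_tol=rel_tol)
```

Applied to the `Performance:` line in the excerpt, `parse_performance` yields 6.262 ns/day and 144.945 timesteps/s, and `rate_consistent(perf, 0.5)` holds, which is what suggests the 0.5 fs timestep. The acceptance floor itself is something to agree on beforehand; the sketch only compares against it.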
2) Reproducibility¶
- Target log:
~/qe_bench/llzo_demo_qe/log.lammps
LAMMPS (29 Aug 2024)
using 1 OpenMP thread(s) per MPI task
Loaded 1 plugins from /opt/deepmd-kit/lib/deepmd_lmp
units metal
atom_style atomic
boundary p p p
# Read the 96-atom base structure (LLZO) → replicate 4x4x4 (= 96 × 64 = 6144 atoms)
read_data data.llzo
Reading data file ...
orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (8.8460671 9.1068882 8.4665605)
2 by 1 by 2 MPI processor grid
reading atoms ...
96 atoms
read_data CPU = 0.007 seconds
replicate 4 4 4
Replication is creating a 4x4x4 = 64 times larger system...
orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (50.724903 36.427556 49.85449)
2 by 1 by 2 MPI processor grid
6144 atoms
replicate CPU = 0.002 seconds
# Masses (these may override values in the data file) 1:Li, 2:La, 3:Zr, 4:O
mass 1 6.94
mass 2 138.905
mass 3 91.224
mass 4 15.999
# DeePMD model
pair_style deepmd graph_3T.pb
Summary of lammps deepmd module ...
>>> Info of deepmd-kit:
installed to: /opt/deepmd-kit
source:
source branch: HEAD
source commit: 8b3dc08
source commit at: 2025-06-11 13:00:46 +0200
support model ver.: 1.1
build variant: cuda
build with tf inc: /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/include;/opt/deepmd-kit/include
build with tf lib: /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10.so;CUDA::nvrtc;torch::nvtoolsext;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10_cuda.so
set tf intra_op_parallelism_threads: 1
set tf inter_op_parallelism_threads: 1
>>> Info of lammps module:
use deepmd-kit at: /opt/deepmd-kit
pair_coeff * * Li La Zr O
# Neighbor list
neighbor 2.0 bin
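One way to make "the same input yields the same result" checkable is to extract an environment fingerprint from each log header and compare runs before comparing results. The sketch below is illustrative only: the function names and the choice of fields are assumptions, and the patterns are written against the header lines visible in the excerpts on this page (LAMMPS version, deepmd-kit source commit, build variant, and the TensorFlow intra-op thread setting).

```python
import re

# Header fields taken from the log excerpts; which fields to track is a choice.
PATTERNS = {
    "lammps_version": re.compile(r"^\s*LAMMPS \((.+?)\)", re.MULTILINE),
    # "source commit:" only; the "source commit at:" line does not match.
    "deepmd_commit": re.compile(r"source commit:\s*(\S+)"),
    "build_variant": re.compile(r"build variant:\s*(\S+)"),
    "tf_intra_op_threads": re.compile(r"intra_op_parallelism_threads:\s*(\d+)"),
}


def fingerprint(log_text):
    """Collect the tracked header fields; a missing field is recorded as None."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        result[key] = match.group(1) if match else None
    return result


def diff_fingerprints(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {key: (a[key], b[key]) for key in PATTERNS if a[key] != b[key]}
```

For the two excerpts on this page, such a comparison would report the same LAMMPS version and deepmd-kit commit but a different `intra_op_parallelism_threads` value (1 here, 2 in the Slurm output below). Whether that difference affects results is not something the logs settle; the point of the sketch is only to surface it.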
3) Operability / Slurm¶
- Slurm err:
~/qe_bench/llzo_demo_qe/slurm.llzo_700K_4gpu.2925.err
sh: 1: srun: not found
/var/spool/slurmd/job02925/slurm_script: line 33: 99446 Segmentation fault (core dumped) apptainer exec --nv "$SIF" bash -lc '
export OMP_NUM_THREADS=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export PATH=/opt/deepmd-kit/bin:$PATH
export LD_LIBRARY_PATH=/opt/deepmd-kit/lib:$LD_LIBRARY_PATH
export OMPI_MCA_btl="^openib" # Exclude openib when InfiniBand is not used (a harmless safeguard)
# Assign a local GPU to each rank: local_rank(0..3) → CUDA_VISIBLE_DEVICES
/opt/deepmd-kit/bin/mpirun -np 4 \
--map-by ppr:4:node --bind-to none \
bash -lc "export CUDA_VISIBLE_DEVICES=\${OMPI_COMM_WORLD_LOCAL_RANK}; \
echo [rank:\$OMPI_COMM_WORLD_RANK] gpu=\$CUDA_VISIBLE_DEVICES; \
exec /opt/deepmd-kit/bin/lmp -in '"$IN"'"
'
- Slurm out:
~/qe_bench/llzo_demo_qe/slurm.llzo_700K_4gpu.2927.out
[rank:0] GPU=0
[rank:1] GPU=0
[rank:3] GPU=0
[rank:2] GPU=0
LAMMPS (29 Aug 2024)
using 1 OpenMP thread(s) per MPI task
Loaded 1 plugins from /opt/deepmd-kit/lib/deepmd_lmp
Reading data file ...
orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (8.8460671 9.1068882 8.4665605)
2 by 1 by 2 MPI processor grid
reading atoms ...
96 atoms
read_data CPU = 0.002 seconds
Replication is creating a 3x3x3 = 27 times larger system...
orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (36.765291 27.320666 36.058513)
2 by 1 by 2 MPI processor grid
2592 atoms
replicate CPU = 0.001 seconds
Summary of lammps deepmd module ...
>>> Info of deepmd-kit:
installed to: /opt/deepmd-kit
source:
source branch: HEAD
source commit: 8b3dc08
source commit at: 2025-06-11 13:00:46 +0200
support model ver.: 1.1
build variant: cuda
build with tf inc: /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/include;/opt/deepmd-kit/include
build with tf lib: /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10.so;CUDA::nvrtc;torch::nvtoolsext;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10_cuda.so
set tf intra_op_parallelism_threads: 2
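For day-to-day operation it helps to check the rank-to-GPU lines at the top of a Slurm output automatically. The following sketch parses `[rank:N] GPU=M` lines like those in the excerpt above and lists any GPU index reported by more than one rank. The function names are hypothetical. A flagged index is a prompt to look closer, not proof of misconfiguration: the value is whatever the job script echoed, and under per-task device isolation each rank can legitimately see its own device as index 0.

```python
import re
from collections import defaultdict

# Accepts both "GPU=" and "gpu=", since the two job scripts in the excerpts differ.
RANK_GPU_RE = re.compile(r"\[rank:(\d+)\]\s*gpu=(\S+)", re.IGNORECASE)


def rank_gpu_map(out_text):
    """Map each MPI rank to the GPU index it reported."""
    return {int(rank): gpu for rank, gpu in RANK_GPU_RE.findall(out_text)}


def shared_gpus(mapping):
    """Return {gpu_index: [ranks]} for every index reported by more than one rank."""
    by_gpu = defaultdict(list)
    for rank, gpu in mapping.items():
        by_gpu[gpu].append(rank)
    return {gpu: sorted(ranks) for gpu, ranks in by_gpu.items() if len(ranks) > 1}
```

On the four lines in the excerpt, `shared_gpus` returns `{"0": [0, 1, 2, 3]}`, so all four ranks report index 0 and the mapping deserves a second look before a 4-GPU run is signed off.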
Simulation Proof Packs (templates)¶
SIM-CFD-001: DoMINO (CFD surrogate), measured 4-GPU parallel throughput¶
- Target: DoMINO-Automotive-Aero NIM (single instance / all-gpus), endpoint /v1/infer/surface
- Metric: throughput_req_per_min_by_wall (wall-clock)
A100x4
- conc=1: 7.50 req/min → conc=4: 31.79 req/min (4.24×)
H100 NVL 4GPU
- conc=1: 9.16 req/min → conc=4: 36.92 req/min (4.03×)
- conc=8: 55.49 req/min (+50% going from conc=4 to conc=8)
H100 advantage
- conc=1: +22% / conc=4: +16%
→ Detailed article: Trying "external aerodynamics (CFD)" on the order of seconds with DoMINO (for beginners, with figures)
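The ratios quoted above follow directly from the measured req/min values. This short sketch recomputes them so the arithmetic can be checked; the dictionary and function names are illustrative, and the numbers are the ones listed in this section.

```python
# Measured throughput in requests per minute, keyed by concurrency (from this section).
A100_X4 = {1: 7.50, 4: 31.79}
H100_NVL_X4 = {1: 9.16, 4: 36.92, 8: 55.49}


def scaling(base, scaled):
    """Throughput ratio, e.g. conc=4 relative to conc=1."""
    return scaled / base


def pct_gain(base, other):
    """Percentage increase of `other` over `base`."""
    return (other / base - 1) * 100


if __name__ == "__main__":
    print(f"A100 conc 1->4: {scaling(A100_X4[1], A100_X4[4]):.2f}x")
    print(f"H100 conc 1->4: {scaling(H100_NVL_X4[1], H100_NVL_X4[4]):.2f}x")
    print(f"H100 conc 4->8: +{pct_gain(H100_NVL_X4[4], H100_NVL_X4[8]):.0f}%")
    print(f"H100 vs A100 at conc=1: +{pct_gain(A100_X4[1], H100_NVL_X4[1]):.0f}%")
    print(f"H100 vs A100 at conc=4: +{pct_gain(A100_X4[4], H100_NVL_X4[4]):.0f}%")
```

The recomputed values round to 4.24×, 4.03×, +50%, +22%, and +16%, matching the figures listed above.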
Benchmarks (by the numbers)¶
- LAMMPS + Allegro (LLZO bulk 50k): /benchmarks/lammps-allegro-llzo-bulk-50k/
Research Notes (technical background and reproduction)¶
- Quantum ESPRESSO GPU: /research/qe-gpu/
- LAMMPS: /research/lammps/
- ML-IAP Training: /research/training/
Posts (articles and case studies)¶
- QE→Allegro→LAMMPS (LLZO Li-ion path 3D): /posts/ep2r-qe-gpu-allegro-training-h200-nvl/
- vLLM serving A100 vs H100: /posts/vllm-serving-a100-vs-h100/
Free Tools / Lead Magnets (freely available)¶
- GPU automation Runbook: /lp/rescue-ready-baseline/
- HW BOM / Inventory (sg-hw-inventory): /lp/rescue-ready-baseline/
- oneAPI + Intel MPI + Slurm Runbook: /lp/rescue-ready-baseline/
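Typical cases that call for consultation (a paid engagement is faster)¶
- Production where failure is not an option, or formal acceptance is required
- Stuck on Secure Boot / DKMS / kernel differences
- Slurm needs to be settled as well (multi-node, operational design)
- Performance is needed under FP64 (including bottleneck analysis)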